notes on software development: 2016

Wednesday, November 30, 2016

Check your lotto chances: what do the numbers say?

This post is a part 2 of using probability to determine our chances to win the lottery. In the previous post we checked the probability of drawing the numbers.

Oh the lottery win, could it be? We have determined that our chances are very slim, but we also know that the lotto people have to use a machine or a bunch of balls in a rolling drum to get the numbers. Maybe there could be an issue with the balls or the machine? or the environment its in? Maybe its not really random? Maybe there will be some patterns to how they get drawn.

Lets get all the numbers from previous draws and see what comes out of that. In Canada the lottery published the list of winning numbers, so we are going to get the files, and find out if there is (just maybe) a way to predict the winning numbers. Maybe one number is more likely to be drawn then the others.

To find out, lets load up the numbers from a CSV file and find out whats there:

import os, csv
from collections import Counter
#'Atlantic649.csv'#'Lotto649.csv'#
class LottoStats(object):
    def __init__(self, file_name='Lotto649.csv'):
        '''
        We want to get the numbers out of a spreadsheet, its colums are like so:
        Draw Date,No1,No2,No3,No4,No5,No6,Bonus,
        '''
        lotto_file = open(os.path.abspath(file_name), 'rb')
        reader = csv.reader(lotto_file, dialect = csv.excel)
        self.lotto_numbers = [] 
        self.draws = 0
        for row in reader:  
            self.draws += 1 
            for i, cell in enumerate(row):
                if (i != 0 and i != 7 and cell):
                    self.lotto_numbers.append(cell)
        '''
        Python has the handy Counter object that

        will return the elements and their counts
        '''
        self.counts = Counter(self.lotto_numbers)
 
    def get_numbers(self):
        return self.lotto_numbers

    def get_counts(self):
        return self.counts.items()

    def count_most_common(self, n=9):
        return self.counts.most_common(n)

    def count_least_common(self, n=9):
        return self.counts.most_common(len(self.lotto_numbers))[:-n-1:-1]

    def get_draws(self):
        return self.draws

stats = LottoStats('Lotto649.csv') 
# lets validate the expected counts with a test
assert len(stats.get_counts()) == 49

print stats.get_draws() # 2950
print stats.count_most_common(6) #[('34', 406), ('31', 405), ('45', 397), ('43', 394), ('40', 393), ('47', 386)]
print stats.count_least_common(6) #[('28', 331), ('14', 333), ('15', 339), ('16', 340), ('22', 341), ('13', 342)]

counts = stats.get_counts()
print(counts)
occurances = [occ[1] for occ in counts]
print(str(occurances))

From our observation by just checking the common and least common, we see that some numbers seem to be drawn more than others. Is that a big deal?

From the occurrences, we have the number 34 appearing 406 times, and we have 28 in there 331 times. That seems like it could be a big spread, but remember that we had 2950 draws.

We need to determine the chances that the number can be drawn at all. Out of 2950 draws, it's a 406/2950 or .1376 or 13% chance of happening and 28 has .1122.

Whats a good tool to add some insight here? What does this data tell us? There are two really valuable statistical tools to give us some understanding.

One is sample standard deviation which just indicates how far apart the values are. This isn't really useful on its own, but because our data has the properties it does it can be combined with the average to find.the coefficient of variance, which tells us how close all the values are to the average.

A set like [10,10,10] has a cv of 0; there is no difference. What does the set of our probabilities tell us?

from stats import standard_deviation

num_chances = [(num / float(2950)) for num in occurances]
print (num_chances)
sd = standard_deviation(num_chances, False) #The standard deviation is 0.00614129698134.
mean = sum(num_chances) / len(num_chances)
print("Mean "+str(mean))
variance = sd / float(mean) 
print("coefficient of variance: "+str(variance)) #0.0501539253476

The standard deviation is 0.00614129698134, and the Mean (or average) is 0.122448979592, so the coefficient of variance: 0.0501539253476

The 0.05 is extremely low, getting close to 0 and that tells us the values of the numbers are all very, very close together. This means the actual drawing of the numbers seems to have no meaningful effect on the odds. It pretty much a random draw, so we can conclude that the spinning drum with the balls flying around do a pretty good job.

Predictions?

So we have described the situation, so lets try some Predictive or Prescriptive analytics on this issue.

Can we predict the lottery? With the long odds, and the measured fairness of the process, it's just not accurately predictable enough to pursue specific numbers. The prescriptive then follows that you just shouldn't buy a lottery ticket expecting to make money doing it. But what are the odds of winning over 10 or 20 years, even just once? They are the same each week, they don't change.

Statistics tell a story about data. There has to be some understanding about the data itself, and
the types of questions you have understood to apply the methods and tools of stats properly.

Aside

There is a way to make money from the lottery. Let's calculate the cost if played over 10 years, every week playing 100 draws a year. I think its $2. Lets say I bought an investment that gave 5% every year.

year_cost = 2 * 52 * 2 # $208 per year
total_gain = 0
for y in range(10):
    total_gain += year_cost * 1.05
print(total_gain)

If you kept that under your mattress, you would have $2080 or $2184 if you bought the investment and the best part is that you have 100% chance of getting that money, instead of the 0.00000000715112% chance of the lotto.

Maybe think of it this way: what are the chances that you will need a spare $2000 in the future?

Saturday, November 26, 2016

Checking my lottery chances with Probability

Check your lottery chances with Statistics and Probability, maybe we can find a way to get some lucky numbers!

In this post we show how methods of statistics and probability can be used to
understand the chances of actually winning the lottery. The various methods will be worked out in python.

Analytics

This is really 'Descriptive' analytics, we are just telling the story of what's there.
After, we can get into predictive and prescriptive analytics but before that the foundation needs some building.

First lets get an understanding of the odds of winning the lottery, and how we find out what those odds are.

There are actually 2 interpretations of probability; 2 different approaches to solve a how likely an event will happen. One is called "classic" and has evolved a bit to be termed "frequentist". The other is usually called "subjective" or "bayesian". Lets see how these approaches compare with our lotto chances problem.

Counting

First, the classic example. We have a set of 49 numbers and we want to find out what the chances are of finding a set of 6 unique numbers. This we can check as a combination, it's 49 'choose' 6:

Combinations are expressed as:

$$ \dfrac{n!}{(n-r!)*(r!)} $$

We use combinations instead of permutations because the actual order of the result doesn't matter. If we draw a 1, 20, 33, 41, 5, 6, or 1, 33, 41, 20, 6, 5. It doesn't matter, we just need 6 unique numbers.

A solution in python could be something like:

n = 49
r = 6
import math
denominator = math.factorial(n-r) * math.factorial(r)
num_combos = math.factorial(n) / denominator
print(num_combos) #13983816

So that's about 14 million, so you have a 1 in 14 million chances to get those numbers.

Bayesian

Now let's use a Bayesian probability approach to find out how good our chances are. In Bayesian probability only the current likelihood is known and that prior knowledge is taken into account to determine the probability of the next event. Remember that the entire set of data doesn't need to be known, you are just venturing forward with what you have for each new case.

Do the numbers drawn previously matter? They do: we are looking for unique numbers, so we can't draw the same number twice. These are dependent events. Two events are dependent if the outcome or occurrence of the first affects the outcome or occurrence of the second so that the probability is changed.

When two events, A and B, are dependent, the probability of both occurring is expressed by:

P(A and B) = P(A)*P(B|A)

Its the probability of the first event, combined with second, and that result it combined with the third, and so on. The order of the result still doesn't matter.

In python it could be expressed like so:

select = 49
choose = 6
total_prob = choose / float(select) 
for x in range(5):
    select = select - 1
    choose = choose - 1
    prob =  choose / float(select)
    total_prob = total_prob * prob

odds = 1 / float(total_prob)
print str(odds) # 13983816.0

Just under 14 million again, but when you look closer you see that the result is exactly the same (13983816).
It's the same result, just a different way of going about it.

What is interesting about the Bayesian probability interpretation solution is that we loop until we get to the end; but don't need to know how big the initial set is. That way interpreting probability lends itself nicely to computation models and is why Bayesian probability methods are a big part of the emerging field of Data Science. This isn't so obvious in the descriptive part of analytics, but the Bayesian methods have really shined when data isn't complete, or when the data set is extremely large.

Using frequentest methods with large data sets you need to take samples and understand what constitutes a sample set that is significant enough.

Is one better than the other? I don't think it's a better or worse, but just different. They both have advantages considering the problem and conditions around it, but it's interesting to see how different methods can have the same results.

To get really into it, read this very good book:

http://xcelab.net/rm/statistical-rethinking/

Its also a good idea just to read posts from those that do this all the time:

http://www.fharrell.com/2017/02/my-journey-from-frequentist-to-bayesian.html

Save the theory for another day, we have to get back to our Lottery problem!

We now know that the chances are long, no matter how they are measured. How about the actual drawing of the numbers? In the next post, we are going to get some historical data on the numbers that have been drawn, and see if we can find some patterns to better our odds.

Tuesday, November 15, 2016

Algorithms intro: Profiling python code

When developing software and implementing algorithms its always important to be as efficient as possible, to have the most elegant solution that is concise, understandable to others, but most importantly uses the resources it has in the most efficient way.

This post extends the previous http://jseller.blogspot.ca/2016/11/algorithms-intro-implementing-math-proof.html and profiles the algorithms we created to implement Euclids GCD method

Profiling

With python and most every language has some profiling tools to check various computer resources; most frequently this is the memory used and CPU cycles consumed. For our algorithm post we are just concerned with the time any of our algorithms take to execute.

import cProfile, pstats, io

#you can run any code as a string to be executed:
cProfile.run('fill_rectangle_with_squares(Rectangle(24000,19453))')

#but the profile object can also be enabled and disabled to run many lines.
pr = cProfile.Profile()
pr.enable()
fill_rectangle_with_squares(Rectangle(2400000,194530))
pr.disable()
pr.print_stats()

With

If there are many lines, this can get a little difficult keeping the first two and last two in the same spot, or accidentally deleted the cleanup part. Python has a 'with' statement to make this nice and clean. http://effbot.org/zone/python-with-statement.htm

import cProfile, pstats, io
class profile_block:
    def __enter__(self):
        self.profile = cProfile.Profile()
        self.profile.enable()
        #set things up
        #return thing
    def __exit__(self, type, value, traceback):
        #tear things down
        self.profile.disable()
        self.profile.print_stats()

Try running a few versions and look at the output. It will probably take some larger numbers until you can see a change. The times depends on the machine, but by using the 'with' block we know the disable and printing will always happen, and the code is nicer.

with profile_block():
    fill_rectangle_with_squares(Rectangle(24000000,19453003))

with profile_block():
    fill_rectangle_with_squares_iterative(24000000,19453003)

Running the different approaches with the large example show a 3x difference with the optimized version.

with profile_block():
    fill_rectangle_with_squares_iterative(2400000223334219423423424234240,1945303434230234234242322)

with profile_block():
    greatest_common_divisor(2400000223334219423423424234240,1945303434230234234242322)

The output. This is a big improvement between our two versions.

fill 2400000223334219423423424234240, 1945303434230234234242322 with squares 
smallest square found 2, 2 
         6 function calls in 0.989 seconds

   Ordered by: standard name

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.000    0.000    0.000    0.000 euclid.py:104(__exit__)
        1    0.000    0.000    0.000    0.000 euclid.py:22(__init__)
        2    0.000    0.000    0.000    0.000 euclid.py:25(__str__)
        1    0.989    0.989    0.989    0.989 euclid.py:55(fill_rectangle_with_squares_iterative)
        1    0.000    0.000    0.000    0.000 {method 'disable' of '_lsprof.Profiler' objects}

greatest common divisor 2
         3 function calls in 0.386 seconds

   Ordered by: standard name

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.000    0.000    0.000    0.000 euclid.py:104(__exit__)
        1    0.386    0.386    0.386    0.386 euclid.py:77(greatest_common_divisor)
        1    0.000    0.000    0.000    0.000 {method 'disable' of '_lsprof.Profiler' objects}

Wrap it up

From the first part of this post we solved the problem with computation and in this case it was a classic, Euclid Greatest Common Denominator method. His proof in Elements uses only geometry, but we also used geometry as a handy tool to visualize and understand the proof.

What we wanted to find out though, was how to validate this method with computation, and understand which approach works best.

As recursive implementations can more closely model the real world scenario, we have to keep in mind that this is being solved with computation, so our iterative approach is able to take advantage of the computing power at hand.

Finding out how well that actually performed is the job of profiling, and a well measured system, or just a small part of it, can go a long way to understanding how to write efficient code.

Algorithms intro: implementing a math proof

What's a nice way of showing a visual example to better understand how a math proof can be implemented as an algorithm in code? Here we are showing an example of implementing an algorithm, comparing iterative and recursive algorithms for solving a method; and validating the results with profiling.

In problem solving, a great methodology to use is: Understand, Plan, Execute and Review.
We will use a bit of geometry to understand the problem, and then plan how to solve it; then we will implement code to execute that plan and review how things worked out.

Understand

The Euclidean algorithm calculates the greatest common divisor (GCD) of two natural numbers a and b. (or x and y).

We can use some geometry to visualize this, check out this video of it in action. From that, it looks like a matter of:

Fill with squares along the shortest side of the rectangle
For the remaining rectangle, fill with squares again, until smallest square is found.

Plan

Lets write down the steps in more detail:

fill rectangle with squares

find shortest sid
divide long side by square side, get remainder
if remainder is 0, smallest square is found

for remainder, fill rectangle with squares

Execute

To model the visualization, lets make a Rectangle object, and use that with a function. Lets start with what we know, the Rectangle:

class Rectangle(object):
    def __init__(self, x, y):
        self.x = x
        self.y = y
    def __str__(self):
        return "%s, %s" % (self.x, self.y)

Now lets plan how to fill the rectangle just like the visualization happens, the next rectangle is filled, and so on until the end. This style of implementation is recursive.

fill_rectangle_with_squares(2,4)

fill_rectangle_with_squares(2,2)

2 is smallest square

We keep calling the fill_rectangle_with_squares function, until the x and y sides are equal. This is the stopping condition.

def fill_rectangle_with_squares(rectangle):
    print("Fill " + str(rectangle))
    # check the sides, if they aren't equal make a new rectangle with the remaining
    if rectangle.x == rectangle.y:
        print('smallest square found %s ' % str(rectangle))
        return rectangle.x
    elif rectangle.x > rectangle.y:
        return fill_rectangle_with_squares(Rectangle(rectangle.x - rectangle.y, rectangle.y))
    elif rectangle.y > rectangle.x:
        return fill_rectangle_with_squares(Rectangle(rectangle.x, rectangle.y - rectangle.x))

Now we call the function with an assert to test the result

To test to see if this function returns what we want, we use assert to check. We want to assert that a 2 by 4 rectangle is anything other than 2. Lets test some others as well

assert fill_rectangle_with_squares(Rectangle(2,2)) == 2
assert fill_rectangle_with_squares(Rectangle(140,68)) == 4

assert fill_rectangle_with_squares(Rectangle(24,12)) == 12

Review

Lets keep reviewing our solution with other sizes, like: Rectangle(143,62) and Rectangle(270,192). These work fine, but what happens when larger values are involved?

Rectangle(2400000223334219423423424234240,1945303434230234234242322)

When this would fail would vary from machine to machine, but on my home laptop I get a:


RuntimeError: maximum recursion depth exceeded

This means the call stack limit is reached, and this limit is bound by the design of the python language itself. Some languages like LISP don't have this limitation, it's just ends up using what resources it has. The limit is imposed because the amount of callable memory is a finite space on the machine. If you are using a large amount of recursion depth, and not getting the results you need then there is way to only bound by time, and the running of the CPU.

How can we make this better? Lets implement the algorithm using iteration.

def fill_rectangle_with_squares_iterative(x,y):
    rectangle = Rectangle(x,y)
    print('fill %s with squares ' % str(rectangle))
    found = False
    while not found:
        found = rectangle.x == rectangle.y
        if rectangle.x > rectangle.y:
            rectangle.x = rectangle.x - rectangle.y
        elif rectangle.y > rectangle.x:
            rectangle.y = rectangle.y - rectangle.x
    print('smallest square found %s ' % str(rectangle))

So far, so good, and the tests work with fill_rectangle_with_squares_iterative(24000000,19453003)

Can the iterative version take the massive difficult rectangle that blew the call stack with the recursive version?

fill_rectangle_with_squares_iterative(2400000223334219423423424234240,1945303434230234234242322)

It does, but takes a couple seconds.

Optimize

I know it took a couple seconds because the console lets me know how long any code execution takes. What's really happening here? To find out more I need to profile the code which will will in the second part of this post.

Before we do though, lets take a crack at optimizing the code even farther.

When you look at the iterative function, see that we are just modifying the Rectangle objects x and y values. We can just keep track of the x and y and not bother making the Rectangle object at all. We could maybe make one object and pass it instead of making a new object for each call, but lets go the step further. It's only 2 variables that makes it concise.

We used the Rectangle as a way to visualize the problem, but nice terse notation is possible and ends up being very clear, so lets boil down the description to the original:

Given two (natural) numbers not prime to one another, to find their greatest common measure.
(The Elements: Book VII: Proposition 2)

def greatest_common_divisor(x,y):
    while not x == y:
        if x > y:
            x = x - y
        elif y > x:
            y = y - x
    return x
    
print('greatest common divisor found %s' % (str(greatest_common_divisor(24,10))))

That's much more straightforward, and can probably be more concise, but I'd like to stop here to keep it readable and easier to understand.

Wrap it up

Lets review what we did in the post. we understood the problems and used some geometry to show a solution. Since it was Euclids proof, it seems like the right thing to do. We then planned a solution using recursion as it matched our real world model.
This worked to a point, but we reviewed our options and implemented an iterative method that could handle much larger rectangles. Once we had that solved we optimized for the computational abilities of the language and the machine we are using.

Our last implementation works for all the tests above, and certainly seems to run faster. How much faster?
We will profile all these functions and find out in the next post: http://jseller.blogspot.ca/2016/11/algorithms-intro-profiling-python-code.html

Saturday, January 16, 2016

LAMP development on windows

The goal for this is to run a LAMP stack on windows. This way, I can do my same development at home, as work. If I was going to choose, and was going to be on an island with only one setup, I would probably take trusty ol' linux; it has everything I need. I use linux at work, and the mac, but at home I have a windows laptop. This is by choice, as I can test out my software with the various machines and not get too far away from one type of operating system.

The issue for me was: How do I do some development at home, and run a similar setup to what I have at work? For this, virtual machines have come along to the point where you can 'box' up a particular environment and run it on any host machine you like.

Pre-conditions: install cygwin tools to allow unix commandline and tools on a windows box. This way you can use the same command line tools on your windows shell as any linux box. While I do like using all types of operating systems, the linux command line is my favourite, and the novelty of knowing DOS commands wore off pretty quickly for me.

Good times with PHP

LAMP can mean a few things, but the traditional setup is Apache, Mysql and PHP. As I have worked along my development path I have found that I am now favoring Python, Nginx and PostreSQL, but for this post I am going to work with a stack provided from the Laravel PHP project. This has nginx, mysql and its php version. Laravel has been on an aggressive development path in it's short history, and now has some very cool development features: built in patterns, unit testing and other good stuff. Laravel provides a tool called Homestead, that uses Vagrant and VirtualBox to make a very nice virtual environment for running Linux on any system, including my cheap windows laptop.

I followed the steps explained here: https://laravel.com/docs/5.2/homestead with the commentary on the steps and modifications noted below.

SSH

From the tutorial, there is a need for a ssh key to be able to remotely login to the Linux OS once it is running in the virtual machine. On windows the path for the ssh keys may not be there, so you will have to generate this key.
If you have been using git, there is probably a .ssh folder in your C:\Users\username directory. You will see 'jseller' below, but it's whatever your username is.
C:\Users\jseller\.ssh>ssh-keygen -t rsa -C "useremail@gmail.com"

Don't use an empty passphrase, you will want to ssh to the virtual box when its up. When it asks for a path to the key file: Enter file in which to save the key (//.ssh/id_rsa): c:/users/jseller/.ssh/id_rsa

Paths

This is a key point when using paths, I have found lowercase and unix slashes work consistently.
So, when you feel the need to type: C:\Myfolder\SomeDirectory, use: c:/myfolder/somedirectory. Like the interwebs, lowercase and forward slashes win.

Running Vagrant

I got an error when doing 'vagrant up' on windows,
The host path of the shared folder is missing: ~/Code
This was fixed for me by just creating the Code directory in my root folder (in cygwin\windows the ~ directory ends up being c:\users\username)

Then 'vagrant up' started the virtualbox with the output:

 Bringing machine 'default' up with 'virtualbox' provider...
==> default: Importing base box 'laravel/homestead'...
==> default: Matching MAC address for NAT networking...
==> default: Checking if box 'laravel/homestead' is up to date...
==> default: Setting the name of the VM: homestead-7
==> default: Clearing any previously set network interfaces...
==> default: Preparing network interfaces based on configuration...
    default: Adapter 1: nat
    default: Adapter 2: hostonly
==> default: Forwarding ports...
    default: 80 (guest) => 8000 (host) (adapter 1)
    default: 443 (guest) => 44300 (host) (adapter 1)
    default: 3306 (guest) => 33060 (host) (adapter 1)
    default: 5432 (guest) => 54320 (host) (adapter 1)
    default: 22 (guest) => 2222 (host) (adapter 1)
==> default: Running 'pre-boot' VM customizations...
==> default: Booting VM...
==> default: Waiting for machine to boot. This may take a few minutes...
    default: SSH address: 127.0.0.1:2222
    default: SSH username: vagrant
    default: SSH auth method: private key

There was some output about my private key being insecure, as the key-gen described above is just a quick way to make a key locally.

Then I seemed to have an issue with mounting drives:

Failed to mount folders in Linux guest. This is usually because
the "vboxsf" file system is not available. Please verify that
the guest additions are properly installed in the guest and
can work properly. The command attempted was:

mount -t vboxsf -o uid=`id -u vagrant`,gid=`getent group vagrant | cut -d: -f3`, C:\Users\jseller\vagrant\Code C:UsersjsellervagrantCode
mount -t vboxsf -o uid=`id -u vagrant`,gid=`id -g vagrant`, C:\Users\jseller\vagrant\Code C:UsersjsellervagrantCode

The error output from the last command was:
/sbin/mount.vboxsf: mounting failed with the error: Protocol error

The issue here was with the paths to my folders, and the backslash escaping the string (see the C:UsersjsellervagrantCode line). I changed that in my homestead.yml to use forward slashes: /vagrant/Code/Laravel/public and ran 'vagrant reload' which reloads the configuration.

Success! It started up my virtual machine and configured nginx with the FastCGI for php and mysql database.

You should be able to now ssh to the virtual machine:

C:\Users\jseller\.ssh>ssh vagrant@127.0.0.1 -p 2222

vagrant@127.0.0.1's password:
Welcome to Ubuntu 14.04.3 LTS (GNU/Linux 3.19.0-25-generic x86_64)

The nginx will need some configuration help If you try '192.168.10.10' and get "No input file specified" (see http://stackoverflow.com/questions/24274387/using-laravel-homestead-no-input-file-specified )

Vagrant commands used:

To start, use 'vagrant up'
To reload config changes in homestead.yml use 'vagrant reload'
To shutdown, use 'vagrant halt'

Wrap up

Ok, so now I have a virtual machine running linux on my windows laptop, and laravel/homestead has come pre-configured with Nginx, mysql, and Laravel PHP. It's really handy and I don't have to do any crazy magic with environment variables on windows or other setup, I just run vagrant and I'm self contained in familiar linux.