Monday, June 10, 2013

Bell Curve Simulation Done in Excel

I don't know what possessed me to do this in the wee hours this morning when I couldn't sleep, but I thought to construct a bell curve from first principles.  I had seen something like this as a Java Applet in 1990s.  In that applet, a ball was dropped onto a grid of pegs.  When it hit a peg, it could go either left or right, with each possibility equally likely.  Perhaps it did this at a height of 20, meaning the ball would hit 20 pegs before coming to rest.  Now do this ball after ball and look at the distribution of where the balls end up.  Most will be near the center.  A few will be well to the left and a few will be well to the right.  The distribution should look like a bell curve.

I did this here in Excel but I changed things a little.  When I did this exactly as described above, I discovered to my chagrin that there were no observations at odd number positions.  So I could have done the above with half steps instead of whole ones, but instead I did the following. At each juncture the ball could go left or right as before, but it also could stay where it is, a third possibility.  In this simulation, left is a value of -1, right is a value of +1, and staying where it is gives a value of 0.  To get these outcomes equally likely (assuming that RAND() produces a uniform random variable on the interval [0,1] and different realizations are independent of one another) I generated a realization of RAND() for each ball and each peg it hit. 

On a per ball basis this is done 20 times (with the results in columns AC to AV.  Each ball is a row.  I then translated these realizations into left, right or stay put. The translations are in columns C to V.  (If you zoom out to 50% view, you can see both realizations and translations in one view.)  For example, in cell F44 there is the formula:

=IF(AF44<1/3,-1,IF(AF44<2/3,0,1))

There are 4000 rows of data.   The first such row begins in row 44.  There are thus 80,000 realizations of RAND() generated (4000 rows times 20 entries per row). 

In rows 2 through 42, I record the positions (in column A) and the number of balls at the position (in column B).  To get the number at a position I use the COUNTIF function.  As a little check I add up all the balls at each position.  This is in cell D24.  Sure enough, it gives a value of 4000, as it should.

I then graphed the positions and number of balls with a graph that connects the points.  (Tufte might not like that but it's good enough for government work.)  You can also just eyeball the results. 

Finally, I put in a Spin Button.  Each time you change the value with the Spin Button, it should produce a new simulation.  By doing that several times, it will give you a sense of what is idiosyncratic to that particular simulation and what seems robust from one simulation to the next.  This is the benefit of doing the exercise in Excel.   

It does seem to produce a near Bell Curve.  That is good.  It also produces something different each time, which is also good.  There is variation in the sample sum and sample mean from one sample to the next. 

When I 'learned' probability and statistics, back in the 1970s, it was all from textbooks and all about theory.  There was nothing about data.  It seems to me simulations such as this would serve as a bridge between the two.  I believe that nowadays most statistics textbooks come with simulation software of some sort.  That's a plus.  But it would be better still if students could build the simulations themselves.  They can with this Excel simulation.

No comments: