Computing normally distributed variables

sharpnova

So I've written some code to generate random variables. (using the polar form of the Box-Muller method)

My issue is that although the values look right for small sample counts, once tens or hundreds of millions of values are generated, the maximum never gets beyond 5.1 standard deviations. It seems to be capped at 5.1 standard deviations.

And I'm beginning to fear this is an issue with the limited precision VBA uses for storing the results of calculations.

Here is my code:

Code:
Function gauss()
    Dim numSamples As Long, i As Long
    Dim x1 As Double, x2 As Double, w As Double, max As Double

    numSamples = 1000000
    max = -1

    For i = 0 To (numSamples / 2 - 1)
        Do
            x1 = 2 * Rnd - 1    'uniformly distributed on [-1, 1]
            x2 = 2 * Rnd - 1    'same
            w = x1 * x1 + x2 * x2
        Loop While w >= 1 Or w = 0   'reject points outside the unit circle (and w = 0, which would make Log fail)

        w = Sqr((-2 * Log(w)) / w)
        If x1 * w > max Then max = x1 * w
        If x2 * w > max Then max = x2 * w
    Next i
    MsgBox "Max: " & max & " s.d."
End Function

I believe this function generates a normal distribution centered at 0 with s.d. = 1, so max is the largest number of standard deviations above the mean among all the generated values.

Notice that for 50 samples, this generates stuff roughly in the 2 standard deviation range. (correct)

For 1000 samples, it generates stuff roughly in the 3 standard deviation range. (correct)

It continues to be correct up to and including about 5 standard deviations. Then it hits about 5.1 standard deviations and doesn't increase at all anymore, despite going up to 100,000,000 samples.
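As a sanity check on those observations: extreme-value theory says the expected maximum of n independent standard-normal samples grows roughly like sqrt(2 ln n). A quick sketch of that rule of thumb (in Python rather than VBA, purely for illustration):

```python
import math

# Rough extreme-value approximation: the expected maximum of n
# independent standard-normal samples is about sqrt(2 * ln(n)).
def expected_max(n):
    return math.sqrt(2.0 * math.log(n))

for n in (50, 1_000, 1_000_000, 100_000_000):
    print(n, round(expected_max(n), 2))
# 50 -> ~2.8, 1_000 -> ~3.7, 1_000_000 -> ~5.3, 100_000_000 -> ~6.1
```

So with 100,000,000 samples the maximum should be sitting around 6.1 s.d.; a hard cap at 5.1 really does point to something other than ordinary sampling variance.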

Can anyone point out why this may be occurring?
 
If you don't seed the random number generator using the 'Randomize' statement before calling 'Rnd', you will simply generate the same sequence. To test this just print the first ten or so random numbers, and then print them again...
Code:
Sub Test()
    Dim i As Integer, j As Integer
    For j = 0 To 1
        For i = 0 To 9
            Debug.Print "i = " & i & ": " & Rnd
        Next i
    Next j
End Sub
Also, as far as precision goes, you can use a Decimal datatype in VBA that has 28 significant digits. From VBA Help...


Decimal Data Type

Decimal variables are stored as 96-bit (12-byte) signed integers scaled by a variable power of 10. The power of 10 scaling factor specifies the number of digits to the right of the decimal point, and ranges from 0 to 28. With a scale of 0 (no decimal places), the largest possible value is +/-79,228,162,514,264,337,593,543,950,335. With 28 decimal places, the largest value is +/-7.9228162514264337593543950335 and the smallest non-zero value is +/-0.0000000000000000000000000001.


Note
At this time the Decimal data type can only be used within a Variant, that is, you cannot declare a variable to be of type Decimal. You can, however, create a Variant whose subtype is Decimal using the CDec function.
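As an aside, the same idea of 28 significant digits can be demonstrated with Python's decimal module (a rough analogue of VBA's Decimal subtype, not a VBA example):

```python
from decimal import Decimal, getcontext

# 28 significant digits -- Python's default context precision,
# which happens to match VBA's Decimal subtype.
getcontext().prec = 28
print(Decimal(1) / Decimal(7))   # 0.1428571428571428571428571429

# Compare with ordinary double precision (about 16 digits):
print(1 / 7)                      # 0.14285714285714285
```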
 
It's not an issue of seeding the Rnd function. I get different results every time. That isn't the issue.

But to be sure, I tested your code. And I got different results every time.



I'm still trying to figure out the reason for this particular problem. Any ideas? If I manually set x1 and x2 to be something tiny like .00000001, then I get values several standard deviations above the mean.

It's just that the code I'm using isn't generating them on its own.

See if you can't get my code to generate values above 5.1 s.d. when using more than 10,000,000 or 100,000,000 samples.
 
Yeah, my guess then is that it's the precision of the random number. I ran a million random numbers and got 60 zeros and the next smallest was
0.00000005960464. So if in your formula you got...
Code:
  Do
    x = 0.00000005960464
    y = 0
    r = x * x + y * y
  Loop While r = 0 Or r >= 1
...which would be your best case without zeroing out, the Max would be 8.15733594099359.
But you multiply everything by 2 and then subtract 1, so your best precision would be ...
Code:
    x = 0.00000011920928
    y = 0
... yielding a Max of 7.9855833182045
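Those figures check out; here is a quick Python verification of the same arithmetic (the function name is my own, not from the thread -- the smallest non-zero Rnd value 0.00000005960464 is 2^-24, and after the 2 * Rnd - 1 mapping the grid spacing doubles to 2^-23):

```python
import math

def max_sd(x):
    """Largest |z| the polar Box-Muller transform can produce when
    the smallest non-zero sample is x and the other sample is 0."""
    w = x * x                                   # w = x1^2 + x2^2 with x2 = 0
    return x * math.sqrt(-2.0 * math.log(w) / w)

print(max_sd(2.0 ** -24))   # ~8.157, the best case before 2*Rnd - 1
print(max_sd(2.0 ** -23))   # ~7.986, the best case after 2*Rnd - 1
```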
 
Yep. And in the end I found a way around this issue:

polar form.

I'm utilizing the Box-Muller method for generating random variables (Monte Carlo style), but by converting to the polar form I get rid of a trig calculation. One less precision-killing operation yielded values roughly one standard deviation higher.

Now the bottleneck is runtime not precision. Which is exactly where I wanted to be.
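For anyone comparing the two variants, here is a rough Python sketch of the basic (trig) form next to the polar form -- an illustration of the idea, not the poster's actual code:

```python
import math
import random

def box_muller_basic():
    """Basic Box-Muller: two uniforms -> one standard normal, via cos."""
    u1 = random.random()
    while u1 == 0.0:                 # avoid log(0)
        u1 = random.random()
    u2 = random.random()
    return math.sqrt(-2.0 * math.log(u1)) * math.cos(2.0 * math.pi * u2)

def box_muller_polar():
    """Polar (Marsaglia) form: rejection sampling, no trig calls.
    Returns two independent standard normals."""
    while True:
        x1 = 2.0 * random.random() - 1.0
        x2 = 2.0 * random.random() - 1.0
        w = x1 * x1 + x2 * x2
        if 0.0 < w < 1.0:            # keep points inside the unit circle
            break
    w = math.sqrt(-2.0 * math.log(w) / w)
    return x1 * w, x2 * w
```

The polar form rejects about 21% of candidate pairs (1 - pi/4), but trades the cos/sin calls for one extra division, which is usually a win for both speed and precision.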

Just posting the solution I found here in case anyone else runs into a similar need.
 
just a thought - i am not statistically proficient enough to know exactly what you are doing, but you may be stuck between a rock and a hard place

if you use long numbers, then you have a maximum value of 2^31 (approx 2 billion), which would put an absolute limit on the sampled values

if you use doubles, you will have a precision error, which may also affect your measurements (ie - doubles can only represent a subset (and clearly a small subset) of all the possible values), and therefore the "random" numbers may not be "random" enough for your purposes.
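That "subset" point is easy to see directly; doubles form a discrete grid of representable values, as this Python snippet shows (Python floats are the same IEEE 754 binary64 doubles):

```python
import math

# Spacing between adjacent representable doubles near 1.0 is 2^-52:
print(math.ulp(1.0))        # 2.220446049250313e-16

# Most decimal fractions fall between grid points:
print(0.1 + 0.2 == 0.3)    # False
```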


The other question, of course, is how many s.d.'s you would expect. There must be some empirical limit to the maximum deviation from the mean/median in most populations?
 