<?php
$limit = 1000000;
$counts = [
'1' => 0,
'2' => 0,
'3' => 0,
'4' => 0,
'5' => 0,
'6' => 0,
'7' => 0,
'8' => 0,
'9' => 0
];
for ($current = 0; $current < $limit; $current++) {
//$number = (string) rand(1, getrandmax());
$number = (string) random_int(1, PHP_INT_MAX);
$counts[$number[0]]++;
}
foreach ($counts as $number => $count) {
echo $number . ': ' . ($count/$limit)*100 . '%' . PHP_EOL;
}
I ran this locally and got similar results:
PHP 7.3.11 (cli) (built: Nov 11 2019 23:27:48) ( NTS )
Copyright (c) 1997-2018 The PHP Group
Zend Engine v3.3.11, Copyright (c) 1998-2018 Zend Technologies
with Zend OPcache v7.3.11, Copyright (c) 1999-2018, by Zend Technologies
[omerida@grouch ~]$ php random.php
1: 12.0346%
2: 12.0273%
3: 12.0347%
4: 12.0503%
5: 12.0481%
6: 12.0497%
7: 12.0254%
8: 12.1209%
9: 3.609%
Weird stuff happens depending on the max number you generate.
random_int(1, 10):
1: 20.0621%
2: 9.969%
3: 9.9416%
4: 9.9852%
5: 10.0163%
6: 10.0116%
7: 9.9868%
8: 10.0089%
9: 10.0185%
And if you set the max to 50:
1: 22.0307%
2: 22.0503%
3: 21.9458%
4: 21.9983%
5: 3.9774%
6: 1.9844%
7: 1.9977%
8: 1.9949%
9: 2.0205%
And you can start to see what's going on here - you can't bin them by the starting digit and expect to see a uniform distribution.
So, for example, if you're generating numbers between 1 and 10, you're counting 2-9 in their own buckets (each gets the roughly 10% you'd expect from chance). But the outcomes for 1 and 10 are combined in the "1" bucket, which is why it ends up with about 20%.
Similarly between 1 and 50. There are a lot of numbers in that range that start with the digits 1, 2, 3, and 4 (I think 11 of them for each). But for 5, 6, 7, 8, and 9 there are fewer: 5 is just 5 and 50, 6 is only 6, etc.
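To make the bucket sizes concrete, here's a minimal sketch that counts first digits exhaustively instead of sampling. `leadingDigitCounts()` is a made-up helper name for illustration, not part of the original gist, and it assumes `$min` is at least 1.

```php
<?php
// Hypothetical helper (not from the original gist): count how many integers
// in [$min, $max] start with each digit 1-9. Assumes $min >= 1.
function leadingDigitCounts(int $min, int $max): array
{
    $counts = array_fill(1, 9, 0); // keys 1..9, all starting at zero
    for ($n = $min; $n <= $max; $n++) {
        $digits = (string) $n;
        $counts[(int) $digits[0]]++;
    }
    return $counts;
}

// Print the exact bucket sizes for the two ranges discussed above.
foreach ([10, 50] as $max) {
    echo "1..$max" . PHP_EOL;
    foreach (leadingDigitCounts(1, $max) as $digit => $count) {
        echo $digit . ': ' . $count . ' of ' . $max . PHP_EOL;
    }
}
```

For 1..50 this reports 11 numbers for each of the digits 1 to 4, 2 for digit 5, and 1 each for 6 to 9, which lines up with the roughly 22%, 4%, and 2% shares in the sampled output.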
Yup, that's the issue: https://twitter.com/tjlytle/status/1202990988240850947
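The max int size cuts out most of the set of numbers that start with 9 (near the max). On a 64-bit build PHP_INT_MAX is 9223372036854775807, so of the 19-digit numbers starting with 9, only 9000000000000000000 through 9223372036854775807 can come up.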
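You can also work out the expected shares exactly rather than sampling. This is only a sketch: `countWithLeadingDigit()` is a hypothetical helper, not something from the gist, and the comments assume a 64-bit build where `PHP_INT_MAX` is 9223372036854775807.

```php
<?php
// Hypothetical helper: exact number of integers in [1, $max] whose first
// digit is $digit, computed block by block (1-9, 10-99, 100-999, ...).
function countWithLeadingDigit(int $digit, int $max): int
{
    $total = 0;
    $power = 1; // 1, 10, 100, ...: one block of numbers per digit length
    while (true) {
        // Only count this block if $digit * $power does not exceed $max.
        if ($digit <= intdiv($max, $power)) {
            $low = $digit * $power;                 // smallest number in the block
            $total += min($power, $max - $low + 1); // block may be cut off by $max
        }
        if ($power > intdiv($max, 10)) {
            break; // the next power of ten would exceed $max (and could overflow)
        }
        $power *= 10;
    }
    return $total;
}

// Expected share of each leading digit for random_int(1, PHP_INT_MAX).
for ($digit = 1; $digit <= 9; $digit++) {
    $share = countWithLeadingDigit($digit, PHP_INT_MAX) / PHP_INT_MAX * 100;
    echo $digit . ': ' . round($share, 2) . '%' . PHP_EOL;
}
```

On a 64-bit build the shares come out at roughly 12% for digits 1 through 8 and roughly 3.6% for 9, in line with the sampled output earlier in the thread.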
If you set the min and max to 11-99, you get an even (uniform) distribution:
1: 11.1096%
2: 11.1252%
3: 11.058%
4: 11.1275%
5: 11.0732%
6: 11.1191%
7: 11.1153%
8: 11.1355%
9: 11.1366%
Based on what I read, it looks like this implementation is wrong (it prints the percentage of occurrences); it should look like this instead:
<?php
$limit = 1000000;
$counts = [
'1' => 0,
'2' => 0,
'3' => 0,
'4' => 0,
'5' => 0,
'6' => 0,
'7' => 0,
'8' => 0,
'9' => 0
];
for ($current = 0; $current < $limit; $current++) {
//$number = (string) rand(1, getrandmax());
$number = (string) random_int(1, PHP_INT_MAX);
$counts[$number[0]]++;
}
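foreach ($counts as $number => $count) {
// Print the share Benford's law predicts for this leading digit.
echo $number . ': ' . benfordLaw($number) . PHP_EOL;
}
// Benford's law: P(d) = log10(1 + 1/d) for leading digit d.
function benfordLaw($leading) {
return (log(1+(1/$leading))/log(10));
}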
Output
1: 0.30102999566398
2: 0.17609125905568
3: 0.1249387366083
4: 0.096910013008056
5: 0.079181246047625
6: 0.066946789630613
7: 0.057991946977687
8: 0.051152522447381
9: 0.045757490560675
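Note that the output above only prints what Benford's law predicts; the sampled counts aren't used. A rough sketch of a side-by-side comparison might look like the following. `benfordExpected()` and `leadingDigitShares()` are names I made up for illustration, and the sample size of 100000 is arbitrary.

```php
<?php
// Share Benford's law predicts for leading digit $digit: log10(1 + 1/d).
function benfordExpected(int $digit): float
{
    return log10(1 + 1 / $digit);
}

// Hypothetical helper: observed share of each leading digit (1-9)
// among the given positive integers.
function leadingDigitShares(iterable $numbers): array
{
    $counts = array_fill(1, 9, 0);
    $total = 0;
    foreach ($numbers as $n) {
        $digits = (string) $n;
        $counts[(int) $digits[0]]++;
        $total++;
    }
    $shares = [];
    foreach ($counts as $digit => $count) {
        $shares[$digit] = $count / $total;
    }
    return $shares;
}

// Sample uniformly distributed integers, as the original script does.
$sample = [];
for ($i = 0; $i < 100000; $i++) {
    $sample[] = random_int(1, PHP_INT_MAX);
}

foreach (leadingDigitShares($sample) as $digit => $observed) {
    echo $digit . ': observed ' . round($observed, 4)
        . ', Benford ' . round(benfordExpected($digit), 4) . PHP_EOL;
}
```

The observed column should stay close to the flat shares seen earlier in the thread rather than the Benford column, since `random_int()` draws uniformly from the range.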
See implementation and expected outputs in other languages: https://rosettacode.org/wiki/Benford%27s_law#JavaScript
For those wondering, php version:
And output: