Use of
“Random” / Stratified Sampling
To Closely Approximate
Counts of Prime Numbers and Twin Primes
And be sure to read Statistical Primes Part 2 When you are done here!
I don’t want to spend too much time on this because I’m getting tired and the hurricane is coming tomorrow night. Anyway,
While working on and writing the last article about Twin Primes, it started to become painfully obvious that, as the numbers got exponentially larger, the runtimes of the TwinPrimes program also got exponentially larger (duh!). For example:
It required 35 minutes on my PC
To compute the number of Primes and TwinPrimes
For N = 1 to 1 Billion
Now that doesn’t seem so bad until we decide we want to do that for even larger/more numbers; or other calculations or measurements related to Primes for larger numbers. For example, if I wanted to count the Primes and TwinPrimes for N = 1 to 1 Trillion. On my PC that would take about 27 days! That is simply not reasonable. That led me to investigate and experiment using “statistical” methods to very quickly and very closely approximate results 💡 . Just to jump ahead a little in order to demonstrate what I’m talking about…On my PC using my TwinPrimes program
It required 35 minutes to identify and count all of the Primes and TwinPrimes For N = 1 to 1 Billion
Then I enhanced the TwinPrimes program by adding the ability to “closely” approximate the results very quickly by using the statistical technique called “Stratified Sampling.” In the enhanced TwinPrimes program we can now optionally specify:
How many strata to use and
How many elements (samples) within each stratum to process.
Using the enhanced program, the 35 minute runtime was reduced to a mere 20 seconds! Let me repeat that… The cpu/runtime was reduced from 35 minutes to a mere 20 seconds! However, the results (the counts of Primes and Twin Primes) are now estimates based on the samples processed. Now the obvious question is, how good/accurate are the results (estimates)? Here are some results:
For N = 1 –> 1 BILLION (1 to 10^{9})



 The estimated count of Primes is less that ¼ of 1% different (greater) than the actual count of Primes for N = 1 to 1 Billion.
 And, the estimated count of Twin Primes is less than 1/3 of 1% different (greater) than the actual count for the same range (1 to 1 Billion).


Here are a couple more interesting examples using the Stratified Sampling capabilities of the enhanced TwinPrimes program:
For N = 1 –> 1 TRILLION (1 to 10^{12})
Without using sampling techniques the required runtime would exceed 26 days! That’s clearly unacceptable.
If we use the sampling techniques the runtime is reduced to 78 seconds! Yup… you read that right… from 26 days down to a ridiculous 78 seconds!!!
But there is a price for trimming that runtime from almost a month down to 78 seconds. And that price is that now our results are estimates instead of actual counts. However, the estimates are pretty good.
.1. The estimated count of Primes differs from the actual count by just .07 of 1%. That is, the estimated Prime count was about 1.00068 times the actual Prime count.
.2. The estimated count of Twin Primes differs from the actual count by .9 of 1% (that is, 9tenths of 1%).
For N = 1 –> 1 QUADRILLION (1 to 10^{15})
Without using sampling techniques the required runtime would exceed 72 years! That’s way too long!.
If we use the sampling techniques the runtime is reduced to just 15 minutes! Yup… you read that right… from 72 years down to just 15 minutes!!!
But again we do pay a price for trimming that runtime from many decades down to 15 minutes.
.1. The estimated Prime count exceeds the actual Prime count by just .28 of 1%. That is, the estimated Prime count was about 1.0028 times the actual Prime count.
.2. The estimated Twin Primes count exceeded the actual Twin Primes count by .99 of 1% (let’s just round it to 1%).
Based on the above, using statistical sampling techniques to estimate the counts of Primes and Twin Primes seems pretty effective (depending on your requirements or goals).
It turns out there are multiple statistical techniques available to be used but I’ve tried just the one that I dreamed up and which I later learned is a simple variation on “Stratified Sampling.” In the case of using the enhanced TwinPrimes program, when using the sampling/estimating option, we can specify how many strata to evenly divide 1–>N into and we also specify how many numbers to process in each stratum (processed sequentially starting with the first number in the stratum). This technique seemed reasonable and is predicated on the conjecture that the distribution of primes, although not “random” in the “pure” sense, is:
 Pretty much random but the “thinning” effect of the Prime Number Theorem must always be taken into account.
 Random enough for many purposes
 Random enough to make effective use of common statistical techniques for “closely” estimating things like counts of Primes and Twin Primes.
Again, there are other statistical/sampling techniques that could be used (effectively?) other than the variation on Stratified Sampling that I use. I leave it to the reader to pursue that.
An important thing to think about here…
… The Prime Number Theorem (PNT) describes (quantifies) the “thinning” of the Primes as we move further out on the number line. This thinning effect is one reason I chose to use Stratified Sampling (although I didn’t know it was called that at the time). The thinking was… The more stratums (layers) we have the better we approximate (account for) the thinning effect of the Prime Number Theorem. Of course, if we go too far with this we lose the program efficiency! The obvious ultimate extreme is to have so many stratums that we end up processing EVERY number because we have a stratum for every number! All that said, with the enhanced TwinPrimes program we can specify how many stratums we want (and how many numbers to process within each stratum/layer).
We also have to specify how many elements (numbers) to process within each stratum. But we need to choose a size that can reflect the “essense” of each stratum (layer). For example, if we don’t process enough elements (numbers) then we can easily be misled. For example, if we process just 1 number in a stratum, and that number is composite, the we might be led to believe that all numbers in that stratum are NOT Prime! On the other hand, if we choose to process too many numbers/elements of a stratum then we do not realize any savings/trimming of processing time!
Epilogue
So that’s it. When I started the Twin Primes project my goal was to try to definitively convince myself whether (or not) there’s an infinity of twin primes and hopefully how they might be distributed. I did NOT think the project would lead to investigating statistical sampling visavis counting Prime related phenomena and how the results further and strongly reinforce the notion of Primes being randomly distributed but within the constraints of the Prime Number Theorem.
I think this turned out to be a pretty good project. How ’bout you?
Below are a couple of interesting things you may want to examine:
 A screen print of a run of Primes.exe using the Sampling/Estimating option for 1–>N = 1 BILLION along with 10,000 stratums (layers) and 1000 numbers for each stratum.
 A section of a spreadsheet where I recorded example results.
Click on this Screen Print for a better view. Note: “est.” means “estimates”
Click on the spreadsheet to get a clear view.
The End