-----------------------------------------
Sample Sizing Big Data to Reduce Processing
-----------------------------------------
When grinding massive data sets, it'd be nice to have a
statistically supported sample size you could run instead,
cutting down the time and resources needed to get results.
IE: instead of spending tons of time and resources grinding
a whole data set, you could figure out a sample size, do
simple random sampling of items from the data set, and run
the samples instead, knowing they represent the overall
data set within certain statistical parameters (confidence
level & margin of error).
I wrote a LinkedIn article talking more about this...
https://www.linkedin.com/pulse/sample-sizing-big-data-reduce-processing-craig-critchfield-mis/
(The article and code go hand-in-hand)
But, the bottom line is that while other statistical
sampling formulas need you to already know the population's
mean or standard deviation, Cochran's Formula just needs a
confidence level, a margin of error, and the total number
of items in your population (population size). Plug those
in, and it spits out a sample size that has statistical
support to be representative of your overall population.
Cochran's Formula caps out on sample sizes when you
near 20,000 population size (for 5% margin of error)
and 200,000 population size (for 1% margin of error).
So, if you have massive "big data" data sets (2M, 2B,
2T, etc), you can pull small sample sizes that are far
quicker to process knowing the end results will be about
the same as if you processed the entire data set.
EG: say you had tons of tweets and you were grinding them
for sentiment analysis. You could come up with a sample
size and pull some simple random samples to grind instead,
knowing the sentiment results you get from the sample will
be statistically representative of the population to the
degree set by the confidence level and margin of error
you decide to use.
This can help you greatly reduce the processing time
and resources needed to see patterns or analysis of
large data.
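Here's the shape of the call, just to make it concrete (a
minimal sketch using the samplesize() function from the files
below; "dataset" is a stand-in for whatever you're grinding):

    from samplesize import samplesize

    n = samplesize( len( dataset ), 0.95, 0.05 ) # 95% conf, +/- 5% error
    # ...then pull n simple random samples from dataset
    # and process those instead of the whole thing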
-----------------------------------------
Files
-----------------------------------------
-- samplesize.py --
main file that has the "samplesize()" function in it.
samplesize() takes three required arguments plus an
optional debug flag...
N = population size (eg: the len(dataset) size of your dataset)
C = confidence level (eg: .95 confidence level)
E = margin of error (eg: within +/- .05)
debug = have it spit out results to the debug console for testing if you like
-- samplesize_zscore.py --
This file needs to go with the samplesize.py file:
it's a support function (zscore) that converts
confidence levels into z-scores for the Cochran's
Formula in samplesize(). You could merge the
code into samplesize.py if you want. I just offloaded
it to a separate file to keep things modular in
case I need to re-use it.
-- samplesize_test_sizes.py --
This runs tests on various confidence level,
margin of error and population sizes to see
what sample sizes the Cochran Formula samplesize()
function comes up with. Charts them out, so
you can see.
-- samplesize_test_stats.py --
This creates a large dataset (2M random integers)
and uses it as a population to run avg & std dev on.
Then it uses samplesize() to generate a sample size,
draws simple random samples from the population, and
runs avg & std dev on the sample subset to see how
close it comes to the population metrics. IE: how
accurate the sampling is at representing the population.
It also runs some resampling / multi-sampling, averaging
in the results of 2 more sample runs to see whether
3 sample runs bring the metrics closer to the
population metrics.
####################################################
"""
------------------------------
Estimating Sample Size Using
Cochran's Formula with Default
50% Proportion
------------------------------
We don't know the population mean or standard deviation,
but we want to calculate a quick-n-dirty sample size to
limit the amount of processing we do while still having
some confidence that the processing we do will give us
a decent picture of what all the data would tell us if
we processed it all.
So, we can revert to a proportion-based sample size
calculation that just needs a confidence level and a
margin of error... Cochran's Formula.
This site talks about Cochran's Formula...
https://www.statisticshowto.datasciencecentral.com/probability-and-statistics/find-sample-size/
These two sites demonstrate it in web calculators
http://www.raosoft.com/samplesize.html
https://www.surveysystem.com/sscalc.htm
What it boils down to is we can come up with a quick-n-
dirty sample size as long as we have 3 things...
* Population Size
* Confidence Level
* Margin of Error
Let's use archery as an example...
Population would be the total number of shots we could take. But,
due to limited time or resources, we want to just do a limited
amount of sample shots to get a feel for how the overall number
of shots might turn out.
Margin of error is how wide your target is.
Confidence Level is how confident you will be that the shot
you take will be within that target (MOE) area.
When using proportion for sample size estimation, you would
usually provide a 4th value: the percentage of the population
you think would respond a certain way (the proportion).
EG: if you were doing a survey, you might feel 75% of the
population would respond True to something, and 25% respond
False.
If you don't know the response rate, though, you can
default to 50%. 50% gives the largest sample size, so
when the proportion is unknown it's used as the default
to err on the side of more samples.
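A quick worked check of that (my numbers):
P = 0.50 ... P(1-P) = 0.50 * 0.50 = 0.2500
P = 0.75 ... P(1-P) = 0.75 * 0.25 = 0.1875
P = 0.90 ... P(1-P) = 0.90 * 0.10 = 0.0900
so any proportion other than 50% shrinks the numerator
of the formula below, and with it the sample size.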
So, we have statistical things like this...
N = population size we're sampling
n = sample size we're calculating
C = Confidence Level
Z = z-score for confidence level
E = Margin of Error
P = Proportion (Percentage)
Cochran's Formula is this...
sample size = ( Zscore^2 ( Proportion ( 1 - Proportion ) ) ) / Margin of Error^2
...or, in statistical notation...
n = ( Z^2 ( P ( 1 - P ) ) ) / E^2
If you really wanted to get statistically fancy,
then you'll notice that the P(1-P) is just the
PQ formula in statistics...
P = Proportion
Q = 1 - Proportion
EG: if we had a 75% P chance to do something,
then we'd have a 25% (1-P) Q chance to not do it.
Anyways, since we know we're using a default P of 50%,
we can go ahead and plug that in...
n = ( Z^2 ( 0.5 ( 1 - 0.5 ) ) ) / E^2
...and, We can pre-calc to reduce the formula down to this ...
n = ( Z^2 * 0.25 ) / E^2
...in laymen's terms...
sample size = ( Zscore^2 * 0.25 ) / Margin of Error^2
(Note: X^2 is MS Excel's notation for raising to a power;
I was modeling things out in Excel first.)
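A worked example of the reduced formula (my numbers,
matching the test output further below): at 95% confidence
the two-tail z-score is 1.96, so with a 5% margin of error...
n = ( 1.96^2 * 0.25 ) / 0.05^2
  = 0.9604 / 0.0025
  = 384.16
which is where the 384-sample ceiling in the test runs
comes from.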
The problem with Cochran's Formula is that it was designed
for really large population sizes. If you get into really
small ones (eg: 100 or less), it calculates a sample size
larger than the population size. And, for large populations,
it ramps the sample size up radically before capping out.
So, his formula finishes with a correction (the finite
population correction) that scales sample sizes for small
populations back under the pop size, and ramps the sample
size for large pops up smoothly instead of radically.
So, the sample size calculated above gets run through
the following at the end...
sample size = sample size / ( 1 + ( ( sample size - 1 ) / population ) )
...or, in statistical terms...
n = n / ( 1 + ( ( n - 1 ) / N ) )
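Worked example (my numbers): feeding the 384.16 from a
95% confidence / 5% error run through the correction for
a population of 2,000...
n = 384.16 / ( 1 + ( 383.16 / 2000 ) )
  = 384.16 / 1.19158
  = 322.4
which truncates to the 322 the test output shows for
N = 2000.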
The interesting thing about this is the sample size
doesn't increase much after population size hits
200,000 or higher.
EG: For confidence interval of 95%,
and margin-of-error of 5%...
population = 20,000
sample size = 376 (--)
population = 200,000
sample size = 383 (+7)
population = 2,000,000
sample size = 384 (+1)
At higher confidence levels, the sample
size increases more, but not by much. But,
altering the Margin of Error increases
sample sizes dramatically.
EG: MOE of 1% starts to get into 10,000+ samples
for large population sets.
To get around this, you could simply resample
several times using a larger margin of error
(5% or such), because the central limit theorem
says that if you sample enough, you eventually
get closer to the average, std dev, etc. of the
actual population (although you'd never be 100%
accurate unless you actually did process the
whole population).
EG: instead of doing 1 sample run at 1% MOE,
you could do 3 sample runs at 5% MOE, then average
the results. (In stats, there are standard error
functions you can use to determine how close
the sampling is coming to the population metrics,
but if you want to be quick-n-dirty about it...)
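(For reference, the standard error of a sample mean is
SE = s / sqrt(n), where s is the sample's std dev; it's
the quick check you'd reach for to see how tightly your
sample mean pins down the population mean before deciding
whether more resampling is worth it.)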
"""
####################################################
#------------------------------
# Imports
#------------------------------
from samplesize_zscore import zscore
#------------------------------
# Globals / Constants
#------------------------------
"""
I created two versions of the function
* one uses laymen's terms
* one uses statistical notation
While it's easy to follow along when looking
at the code below, I provided two versions
so you can use whichever you like when
you're coding on an IDE that has intellisense.
When typing out the function call, the
intellisense will show the function parameters.
If you're comfortable with statistical notation,
then use that function. If you like things
spelled out, then use the laymen's function
which will make it clear what parameter is what.
The constant below lets you set which
version of the function you want to use.
"""
# true = use statistical notation for var names
# false = spell things out in laymen's terms for var names
STATISTICAL_NOTATION = False
#---------------------------
# Cochran's Formula for Sample Size Estimation
#---------------------------
"""
For normal (binomial) distribution sampling...
* population needs to be 30+
* we need to assume population follows a normal distrubtion
"""
if STATISTICAL_NOTATION:

    #---------------------------
    # use the one with statistical notation
    #---------------------------

    """
    N = population size
    C = confidence level
    E = margin of error
    n = sample size
    P = proportion ( 1-P = Q, so P(1-P) = PQ )
    debug = print debug outputs for testing (defaulting to false)
    """
    def samplesize( N, C, E, debug = False ):
        try:
            # do some pre-emptive error-checking
            assert ( C < 1.0 ), "C must be less than 100%"
            assert ( C > 0.0 ), "C must be greater than 0%"
            assert ( E < 1.0 ), "E must be less than 100%"
            assert ( E > 0.0 ), "E must be greater than 0%"
            assert ( N > 0.0 ), "N must be greater than 0"

            # need two-tail z-score
            Z = zscore( C, 2 )

            # sample size proportion
            # since we're using 50% default, we can
            # pre-calc P(1-P) ... IE: PQ
            # n = ( Z**2 * P ( 1 - P ) ) / E**2
            n = ( Z**2 * 0.25 ) / E**2

            # corrective function that scales
            # sample size to pop size better
            # to smooth out sample sizing and
            # handle very small populations
            n = n / ( 1 + ( ( n - 1 ) / N ) )

            # sample size is a float / decimal;
            # recast as a whole number integer
            # so if we use it in range() it
            # won't throw an error
            n = int( n )

            if debug:
                print("Sample Size " + str( n ) + ", Population " + str( N ))

            return n

        except AssertionError as error:
            print( "Error: " + str( error ) )
else:

    #---------------------------
    # use the one that spells things out
    #---------------------------

    """
    inputs... self-explanatory, except for...
    debug = print debug outputs for testing (defaulting to false)
    """
    def samplesize( population, confidence, margin_of_error, debug = False ):
        try:
            # do some pre-emptive error-checking
            assert ( confidence < 1.0 ), "Confidence must be less than 100%"
            assert ( confidence > 0.0 ), "Confidence must be greater than 0%"
            assert ( margin_of_error < 1.0 ), "Margin of Error must be less than 100%"
            assert ( margin_of_error > 0.0 ), "Margin of Error must be greater than 0%"
            assert ( population > 0.0 ), "Population must be greater than 0"

            # need two-tail z-score
            z = zscore( confidence, 2 )

            # sample size proportion
            # since we're using 50% default, we can
            # pre-calc proportion ( 1 - proportion )
            # sample = ( zscore**2 * proportion ( 1 - proportion ) ) / margin_of_error**2
            sample = ( z**2 * 0.25 ) / margin_of_error**2

            # corrective function that scales
            # sample size to pop size better
            # to smooth out sample sizing and
            # handle very small populations
            sample = sample / ( 1 + ( ( sample - 1 ) / population ) )

            # sample size is a float / decimal;
            # recast as a whole number integer
            # so if we use it in range() it
            # won't throw an error
            sample = int( sample )

            if debug:
                print("Sample Size " + str( sample ) + ", Population " + str( population ))

            return sample

        except AssertionError as error:
            print( "Error: " + str( error ) )
######################################
# set up some debug / testing code
# that will only run if this file
# is run as "main"
######################################
if __name__ == "__main__":

    #---------------------------
    # test sample size function
    #---------------------------
    N = 20000
    C = 0.95
    E = 0.05

    print("-------------------------")
    samplesize( N, C, E, True )
    print("-------------------------")
##########################################################
#
# Sample Size Ceiling Check (with Charts)
#
##########################################################
"""
The Cochran Formula sample size caps out quickly as we near
200,000 population. So, let's demonstrate by plowing
through some pop sizes at various confidence levels
and margins of error.
Code below uses following statistical notations...
C = Confidence Level
E = Margin of Error
N = Population Size
n = sample size
"""
##########################################################
#------------------------------
# Imports
#------------------------------
import matplotlib.pyplot as plt
from samplesize import samplesize
#------------------------------
# Globals / Constants
#------------------------------
"""
dashed border to make message titles stand-out
and break apart different debug output runs
"""
BORDER_DASH = "-" * 50
"""
we're going to exponentially increase pop
sizes to generate samples. We don't want our
chart going crazy with a massive expanse of
pop sizes, so instead we convert the pop sizes
into string labels applied to each tick on
the x axis ... we're doing 10 runs, but
we're starting the plot ticks on 0 to
have a 0,0 starting point, and then the
pop sizes are on ticks 1 thru 10
"""
PLOT_TICKS = [0,1,2,3,4,5,6,7,8,9,10]
"""
pyplot formats can use string short cuts
- = use a line chart
b,r,g,y = blue, red, green, yellow colors
o = use dots as markers
so -bo = use a blue line & put a dot marker on it (same color).
we're only doing 4 plot runs, so hard-coding
the formats here to iterate through for a
multi-line run chart later on.
"""
PLOT_FORMATS = ['-bo', '-ro', '-go', '-yo']
"""
we'll use some lists to keep track
of what conf levels, margin of error,
pop sizes & sample sizes we do over
several runs, so we can chart them
later on.
population & sample size are
going to be lists of lists,
because we're gonna run
10 population sizes to generate
10 sample sizes per conf/moe
combination run we do.
"""
marginoferr = []
confidences = []
populations = []
samplesizes = []
#---------------------------
# Generate Sample Runs
#---------------------------
# function iterates through pop sizes
# exponentially to generate sample sizes
def samplerun( C, E, debug = False ):

    if debug:
        print( BORDER_DASH )
        print( "conf " + str(C) + "\n" +
               "error " + str(E) )
        print( BORDER_DASH )

    # we'll rack up the pops & samples
    # in temp lists. they start with
    # 0 in each so we can have a 0,0
    # x,y starting point in our charts
    pops = [0]
    smps = [0]

    # exponentially ramp up pop size
    # and generate a sample size for it
    for x in range(10):
        N = 200 * 10**x
        n = samplesize( N, C, E, debug )
        pops.append(N)
        smps.append(n)

    # figure out what slot we're on
    # in our global lists, and load
    # our respective values
    i = len(populations)
    populations.insert( i, pops )
    samplesizes.insert( i, smps )
    confidences.insert( i, C )
    marginoferr.insert( i, E )
#---------------------------
debug_output = True
samplerun( 0.95, 0.05, debug_output ) # conf 95%, moe 5%
samplerun( 0.99, 0.05, debug_output ) # conf 99%, moe 5%
samplerun( 0.95, 0.01, debug_output ) # conf 95%, moe 1%
samplerun( 0.99, 0.01, debug_output ) # conf 99%, moe 1%
# debug ... see what's in our lists
if 0:
    print(populations)
    print(samplesizes)
    print(confidences)
    print(marginoferr)
#---------------------------
# generate charts for our runs
#---------------------------
# func spits out a single-line chart
# so we can see each individual run.
# This lets us see the sample size
# scale better for each separate run.
# Note: the plots show up fine in Spyder's
# plot window, but in VSCode the population
# labels get truncated. So, you have
# to click each chart's "settings" when
# they run, and adjust the "bottom" to
# bring the labels up into view
def singlechart( i ):

    # make some shortcut vars for the run we're doing
    C = confidences[i]
    E = marginoferr[i]
    N = populations[i]
    n = samplesizes[i]

    # chart title
    title = "conf " + str(C) + "\nerror " + str(E)

    # plot sample sizes for run using 1st line format style
    plt.plot( n, PLOT_FORMATS[0], label = title )

    # the population sizes are exponential
    # so we just want to show them as text
    # labels for the 10 pops we sampled
    plt.xticks( PLOT_TICKS, N, rotation = 90 )

    # force 0,0 to be the exact bottom-left corner w/o padding
    plt.xlim( left = 0.0 )
    plt.ylim( bottom = 0.0 ) # , top = 20000

    plt.suptitle( title )
    plt.ylabel( 'sample size' )
    plt.xlabel( 'population size' )
    plt.grid( True, 'major', 'both' )
    # plt.legend()
    plt.show()
#---------------------------

# funcs below spit out one chart with
# 4 lines, giving a comparison of
# sample sizes for each conf &
# error run
def addline( plt, i ):
    # plt = chart we're working on
    # i = sample run we're working with
    C = confidences[i]
    E = marginoferr[i]
    n = samplesizes[i]
    f = PLOT_FORMATS[i]
    L = "conf " + str(C) + ", err " + str(E)
    plt.plot( n, f, label = L )

def multichart():

    # plot sample sizes for each run
    for i in range(4):
        addline( plt, i )

    # the population sizes are exponential
    # so we just want to show them as text
    # labels for the 10 we ran. Since we
    # ran the same pops each time, we can
    # just use the first pop list for this.
    plt.xticks( PLOT_TICKS, populations[0], rotation = 90 )

    # force 0,0 to be the exact bottom-left corner w/o padding
    # also max out y's ceiling so we can show the
    # difference between run chart lines better
    plt.xlim( left = 0.0 )
    plt.ylim( bottom = 0.0, top = 20000 )

    plt.suptitle( "Sample Size Ceiling Comparison" )
    plt.ylabel( 'sample size' )
    plt.xlabel( 'population size' )
    plt.grid( True, 'major', 'both' )

    # this legend style looks fine in Spyder's plot output
    # plt.legend( loc = 'upper left', bbox_to_anchor = ( 1.05, 1 ) )
    # this one embeds the legend in the chart for VSCode editor
    plt.legend()

    plt.show()
#---------------------------

# generate a single-run chart per
# conf + err run to see how the
# sample sizes range for each
for pop_run in range( len( populations ) ):
    singlechart( pop_run )

# generate a multi-line run chart
# to see how they compare to each
# other (showing how margin of
# error greatly impacts sample size)
multichart()
#---------------------------
"""
debug console output
--------------------------------------------------
conf 0.95
error 0.05
--------------------------------------------------
Sample Size 131, Population 200
Sample Size 322, Population 2000
Sample Size 376, Population 20000
Sample Size 383, Population 200000
Sample Size 384, Population 2000000
Sample Size 384, Population 20000000
Sample Size 384, Population 200000000
Sample Size 384, Population 2000000000
Sample Size 384, Population 20000000000
Sample Size 384, Population 200000000000
--------------------------------------------------
conf 0.99
error 0.05
--------------------------------------------------
Sample Size 153, Population 200
Sample Size 498, Population 2000
Sample Size 642, Population 20000
Sample Size 661, Population 200000
Sample Size 663, Population 2000000
Sample Size 663, Population 20000000
Sample Size 663, Population 200000000
Sample Size 663, Population 2000000000
Sample Size 663, Population 20000000000
Sample Size 663, Population 200000000000
--------------------------------------------------
conf 0.95
error 0.01
--------------------------------------------------
Sample Size 195, Population 200
Sample Size 1655, Population 2000
Sample Size 6488, Population 20000
Sample Size 9163, Population 200000
Sample Size 9557, Population 2000000
Sample Size 9599, Population 20000000
Sample Size 9603, Population 200000000
Sample Size 9603, Population 2000000000
Sample Size 9603, Population 20000000000
Sample Size 9603, Population 200000000000
--------------------------------------------------
conf 0.99
error 0.01
--------------------------------------------------
Sample Size 197, Population 200
Sample Size 1784, Population 2000
Sample Size 9067, Population 20000
Sample Size 15316, Population 200000
Sample Size 16450, Population 2000000
Sample Size 16573, Population 20000000
Sample Size 16585, Population 200000000
Sample Size 16587, Population 2000000000
Sample Size 16587, Population 20000000000
Sample Size 16587, Population 200000000000
as you can see, the margin of error impacts
sample size greatly.
"""
##########################################################
#
# Sample Size Metrics vs. Population Metrics Check
#
##########################################################
"""
Since the Cochran Formula pulls small sample sizes from
potentially massive data sets, it's good to get an idea
of how far off the sample's stats might be vs. the
population's (eg: std dev, average, etc.)
So, let's...
1) create a massive population data set
2) generate a sample size for it
3) pull simple random samples from population set
4) compare pop & sample statistics
Code below uses following statistical notations...
C = Confidence Level
E = Margin of Error
N = Population Size
n = sample size
"""
##########################################################
#------------------------------
# Imports
#------------------------------
from numpy import mean, std
from numpy.random import randint
from samplesize import samplesize
#------------------------------
# Globals / Constants
#------------------------------
BORDER = "-" * 50
#---------------------------
# test how far off numbers are
#---------------------------
print(BORDER)
N = 2000000
C = 0.95
E = 0.05
n = samplesize( N, C, E, True )
pop = [] # population dataset
smp = [] # sample dataset randomly pulled from pop
#---------------------------
# fill population list with
# random integers (0 to 99 ...
# randint's upper bound is exclusive)

print("filling population dataset...")

# numpy's randint lets us do it all in one
# shot using the "size" parameter, which we
# just set to return N random ints
pop = randint( 0, 100, size = N )

"""
for i in range( N ):
    randomvalue = randint(0, 100) # random 0 to 99
    pop.insert( i, randomvalue )
"""
#---------------------------
# pull random samples from population data

print("pulling random samples...")

for i in range( n ):
    randomsample = randint( 0, N )  # random index 0 to N-1 (high is exclusive)
    smp.insert( i, pop[randomsample] )
"""
Note that we're doing sampling
where we can potentially include
the same sample multiple times.
The larger the population, the
less likely this will happen.
If we want to really get elaborate,
we can code the sample selection
algorithm to go pick another random
sample if we've already sampled
the one it chooses.
But, for now, we're just demonstrating
some code, so ..
"""
#---------------------------
# calculate std dev, mean & coefficient of variation
# for pop and sample to compare how far off we are
print("calculating metrics...")
popMean = mean( pop )
smpMean = mean( smp )
popStdv = std( pop )
smpStdv = std( smp )
popCoef = popStdv / popMean
smpCoef = smpStdv / smpMean
print(BORDER)
print("Pop std.dev = " + str( popStdv ) )
print("Smp std.dev = " + str( smpStdv ) )
print("Difference = " + str( abs(popStdv - smpStdv) ) )
print(BORDER)
print("Pop average = " + str( popMean ) )
print("Smp average = " + str( smpMean ) )
print("Difference = " + str( abs(popMean - smpMean) ) )
print(BORDER)
print("Pop coeff.v = " + str( popCoef ) )
print("Smp coeff.v = " + str( smpCoef ) )
print("Difference = " + str( abs(popCoef - smpCoef) ) )
#---------------------------
# resampling to try to increase accuracy
#---------------------------
"""
Central Limit Theory says that if you sample
a population enough times, then the metrics
you create by merging your samples metrics
(eg: averaging your sample averages) will come
really close to the population metrics.. but
never 100% close, because the only way to
have 100% accuracy is to do the metrics
on the entire population.
So, we can do multiple sampling passes,
then average their metrics together to
see if we come closer.
Eventually you hit a resampling where you
might as well just increase the conf level
and / or margin of error and run a single
sample run instead of multiple runs,
though.
For very small populations, this point is
hit very fast. EG: for populations of 200,
if you did several resamples with 95% conf
and 5% error, you'll pretty much have hit
the sample size for a 99% conf and 1% error
rate. So, might as well just ramp those up
instead and do a single sample pass.
But, when you start to get into pop sizes
of, eg: 2000 ... a conf of 99% and error of
1% gives a sample size of 1655. A conf of
95% and error of 5% gives a sample size of
322.
1655 / 322 = about 5 sample runs of 322
you could do before hitting 1655 samples.
So, there's a decision.. should you resample
using the same conf & error rate you already
generated a sample size for? Or, should you
just increase the conf level and error rate
a bit and run another sample size number?
You'll need to decide how you want to do
it.
But, ultimately, the more samples you do,
the greater the accuracy you can generate.
Whether that's through resampling via a
small sample size multiple times, or ramping
up the conf & error to generate a larger
single sample size.. up to you.
I'm sure there's some statistics to help
decide which is better. But, for now, let's
just do a multi-sample run to see how it
looks.
"""
#---------------------------
print(BORDER)
#---------------------------
# pull a 2nd round of random samples from population
# using the sample size we already generated

print("pulling random samples (2nd sampling)...")

# purge old samples from samples list
smp.clear()

for i in range( n ):
    randomsample = randint( 0, N )  # random index 0 to N-1 (high is exclusive)
    smp.insert( i, pop[randomsample] )

# generate metrics for 2nd sampling
smpMean2 = mean( smp )
smpStdv2 = std( smp )
smpCoef2 = smpStdv2 / smpMean2
#---------------------------
# pull a 3rd round of random samples from population
# using the sample size we already generated

print("pulling random samples (3rd sampling)...")

# purge old samples from samples list
smp.clear()

for i in range( n ):
    randomsample = randint( 0, N )  # random index 0 to N-1 (high is exclusive)
    smp.insert( i, pop[randomsample] )

# generate metrics for 3rd sampling
smpMean3 = mean( smp )
smpStdv3 = std( smp )
smpCoef3 = smpStdv3 / smpMean3
#---------------------------
avgSmpMean = ( smpMean + smpMean2 + smpMean3 ) / 3
avgSmpStdv = ( smpStdv + smpStdv2 + smpStdv3 ) / 3
avgSmpCoef = ( smpCoef + smpCoef2 + smpCoef3 ) / 3
print(BORDER)
print("Metrics compare using avg of 3 sample runs")
print("IE: ran sampling 3 times, and took avg of ")
print(" avg, std dev & cv to see if they get closer")
print(" to pop metrics via more sampling.")
print(BORDER)
print("Pop std.dev = " + str( popStdv ) )
print("Smp std.dev = " + str( avgSmpStdv ) )
print("Difference = " + str( abs(popStdv - avgSmpStdv) ) )
print(BORDER)
print("Pop average = " + str( popMean ) )
print("Smp average = " + str( avgSmpMean ) )
print("Difference = " + str( abs(popMean - avgSmpMean) ) )
print(BORDER)
print("Pop coeff.v = " + str( popCoef ) )
print("Smp coeff.v = " + str( avgSmpCoef ) )
print("Difference = " + str( abs(popCoef - avgSmpCoef) ) )
# calculate differences for
# 1 sample run vs. 3 sample runs
samplediff1 = [
    abs(popStdv - smpStdv),
    abs(popMean - smpMean),
    abs(popCoef - smpCoef)
]
samplediff3 = [
    abs(popStdv - avgSmpStdv),
    abs(popMean - avgSmpMean),
    abs(popCoef - avgSmpCoef)
]
print(BORDER)
print("Compare differences of 1 sample run vs. 3 sample runs")
print("(lower is better)")
print(BORDER)
print("Pop Std Dev - Sample Std Dev")
print("1 sample run difference: " + str(samplediff1[0]))
print("3 sample run difference: " + str(samplediff3[0]))
print(BORDER)
print("Pop Avg - Sample Avg")
print("1 sample run difference: " + str(samplediff1[1]))
print("3 sample run difference: " + str(samplediff3[1]))
print(BORDER)
print("Pop Coeff Var - Sample Coeff Var")
print("1 sample run difference: " + str(samplediff1[2]))
print("3 sample run difference: " + str(samplediff3[2]))
print(BORDER)
"""
if you run the code several times, sometimes
the multi-sampling does a better job, and sometimes
it doesn't.
"""
##########################################################
"""
Cochran's Formula needs to convert Confidence Level
Percents into two-tail z-scores. So, we need to use
a converter to generate them.
Farming this off to a seperate file for import to
help streamline code in other files.
"""
##########################################################
#------------------------------
# Imports
#------------------------------
"""
Python 3.8 stat pkg started including
a normal distribution object to do
stuff with, like zscores, but I was using
Python 3.7.x, so using scipy instead
"""
#from statistics import NormalDist as nd
from scipy.stats import norm as nd
#------------------------------
# Globals / Constants
#------------------------------
#------------------------------
# Z-Score for Confidence Level
#------------------------------
"""
scipy.stats.norm.ppf can convert confidence level to z-score
https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.norm.html#scipy.stats.norm
this stackoverflow helped figure it out...
https://stackoverflow.com/questions/20864847/probability-to-z-score-and-vice-versa-in-python
basically, scipy.stats.norm.ppf converts a confidence level %
to a left-tail z-score by default. So, 0.95 would get a 1.64
z-score. Cochran's Formula needs the two-sided z-score, IE:
0.95 should get 1.96. So, we have to convert the confidence
level to its two-sided version first before taking the ppf.
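Worked example (my numbers): for a 0.95 confidence level
and two tails...
converted = 1 - ( ( 1 - 0.95 ) / 2 ) = 0.975
nd.ppf( 0.975 ) = 1.95996... ~ 1.96
the familiar two-tail z-score for 95% confidence.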
"""
# confidence = confidence level % as decimal (eg: 0.95 confidence)
# tails = 1 for one-sided tail, 2 for two-sided tail
# debug = true to print debug output for testing
def zscore( confidence, tails, debug = False ):
    try:
        assert ( 0 < tails < 3 ), "Tails has to be 1 or 2"
        assert ( confidence < 1 ), "Confidence must be less than 1.0 (100%)"
        assert ( confidence > 0 ), "Confidence must be greater than 0.0 (0%)"

        # ppf returns a one-tail z by default,
        # so convert confidence if we need two-tail
        if tails == 2:
            confidence = 1 - ( ( 1 - confidence ) / 2 )

        z = nd.ppf( confidence )

        if debug:
            print("zscore for " + str( confidence ) + " = " + str( z ))

        return z

    except AssertionError as error:
        print( "Error: " + str( error ) )
######################################
# set up some debug / testing code
# that will only run if this file
# is run as "main"
######################################
if __name__ == "__main__":

    # debug ... test zscore function
    if 1:
        print( "----------------------------" )
        print( "conf \t one \t two" )
        print( "----------------------------" )

        # for confidence levels 90% to 99%,
        # generate one- and two-tail scores
        for i in range(10):
            C = 0.9 + 0.01 * i   # confidence level
            Z1 = zscore( C, 1 )  # one-tail zscore
            Z2 = zscore( C, 2 )  # two-tail zscore

            C = round( C, 2 )    # round everything to 2 decimal places
            Z1 = round( Z1, 2 )
            Z2 = round( Z2, 2 )

            print( str(C) + "\t" + str(Z1) + "\t" + str(Z2) )

        print( "----------------------------" )