-----------------------------------------
Sample Sizing Big Data to Reduce Processing
-----------------------------------------
When grinding massive data sets, it'd be nice to have a
statistically supported sample size you could run instead,
cutting down the time and resources needed to get results.
IE: instead of spending tons of time and resources grinding
a whole data set, you could figure out a sample size, do
simple random sampling of items from the data set, and run
the samples instead, knowing they represent the overall
data set within certain statistical parameters (confidence
level & margin of error).
I wrote a LinkedIn article talking more about this...
https://www.linkedin.com/pulse/sample-sizing-big-data-reduce-processing-craig-critchfield-mis/
(The article and code go hand-in-hand)
But, the bottom line is that while other statistical
sampling formulas need you to already know the population's
mean or standard deviation, Cochran's Formula just needs a
confidence level, a margin of error, and the total number
of items in your population (population size). Plug those
in, and it spits out a sample size that has statistical
support to be representative of your overall population.
Cochran's Formula caps out on sample sizes when you
near 20,000 population size (for 5% margin of error)
and 200,000 population size (for 1% margin of error).
So, if you have massive "big data" data sets (2M, 2B,
2T, etc), you can pull small sample sizes that are far
quicker to process knowing the end results will be about
the same as if you processed the entire data set.
EG: say you had tons of tweets and you were grinding them
for sentiment analysis. You could come up with a sample
size and pull some simple random samples to grind instead,
knowing the sentiment results you get from the sample will
be statistically representative of the population to the
degree set by the confidence level and margin of error
you decide to use.
This can help you greatly reduce the processing time
and resources needed to see patterns or analysis of
large data.
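Here's the shape of the call, just to make it concrete (a
minimal sketch using the samplesize() function from the files
below; "dataset" is a stand-in for whatever you're grinding):

    from samplesize import samplesize

    n = samplesize( len( dataset ), 0.95, 0.05 ) # 95% conf, +/- 5% error
    # ...then pull n simple random samples from dataset
    # and process those instead of the whole thing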
-----------------------------------------
Files
-----------------------------------------
-- samplesize.py --
main file that has the "samplesize()" function in it.
samplesize() takes three required arguments plus an
optional debug flag...
N = population size (eg: the len(dataset) size of your dataset)
C = confidence level (eg: .95 confidence level)
E = margin of error (eg: within +/- .05)
debug = have it spit out results to the debug console for testing if you like
-- samplesize_zscore.py --
This file needs to go with the samplesize.py file:
it's a support function (zscore) that converts
confidence levels into z-scores for the Cochran's
Formula in samplesize(). You could merge the
code into samplesize.py if you want. I just offloaded
it to a separate file to keep things modular in
case I need to re-use it.
-- samplesize_test_sizes.py --
This runs tests on various confidence level,
margin of error and population sizes to see
what sample sizes the Cochran Formula samplesize()
function comes up with. Charts them out, so
you can see.
-- samplesize_test_stats.py --
This creates a large dataset (2M random integers)
and uses it as a population to run avg & std dev on.
Then it uses samplesize() to generate a sample size,
draws simple random samples from the population, and
runs avg & std dev on the sample subset to see how
close it comes to the population metrics. IE: how
accurate the sampling is at representing the population.
It also runs some resampling / multi-sampling, averaging
in the results of 2 more sample runs to see whether
3 sample runs bring the metrics closer to the
population metrics.
####################################################
"""
------------------------------
Estimating Sample Size Using
Cochran's Formula with Default
50% Proportion
------------------------------
We don't know the population mean or standard deviation,
but we want to calculate a quick-n-dirty sample size to
limit the amount of processing we do while still having
some confidence that the processing we do will give us
a decent picture of what all the data would tell us if
we processed it all.
So, we can revert to a proportion-based sample size
calculation that just needs a confidence level and a
margin of error... Cochran's Formula.
This site talks about Cochran's Formula...
https://www.statisticshowto.datasciencecentral.com/probability-and-statistics/find-sample-size/
These two sites demonstrate it in web calculators
http://www.raosoft.com/samplesize.html
https://www.surveysystem.com/sscalc.htm
What it boils down to is we can come up with a quick-n-
dirty sample size as long as we have 3 things...
* Population Size
* Confidence Level
* Margin of Error
Let's use archery as an example...
Population would be the total number of shots we could take. But,
due to limited time or resources, we want to just do a limited
amount of sample shots to get a feel for how the overall number
of shots might turn out.
Margin of error is how wide your target is.
Confidence Level is how confident you will be that the shot
you take will be within that target (MOE) area.
When using proportion for sample size estimation, you would
usually provide a 4th value: the percentage of the population
you think would respond a certain way (the proportion).
EG: if you were doing a survey, you might feel 75% of the
population would respond True to something, and 25% respond
False.
If you don't know the response rate, though, you can
default to 50%. 50% gives the largest sample size, so
when the proportion is unknown it's used as the default
to err on the side of more samples.
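A quick worked check of that (my numbers):
P = 0.50 ... P(1-P) = 0.50 * 0.50 = 0.2500
P = 0.75 ... P(1-P) = 0.75 * 0.25 = 0.1875
P = 0.90 ... P(1-P) = 0.90 * 0.10 = 0.0900
so any proportion other than 50% shrinks the numerator
of the formula below, and with it the sample size.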
So, we have statistical things like this...
N = population size we're sampling
n = sample size we're calculating
C = Confidence Level
Z = z-score for confidence level
E = Margin of Error
P = Proportion (Percentage)
Cochran's Formula is this...
sample size = ( Zscore^2 ( Proportion ( 1 - Proportion ) ) ) / Margin of Error^2
...or, in statistical notation...
n = ( Z^2 ( P ( 1 - P ) ) ) / E^2
If you really wanted to get statistically fancy,
then you'll notice that the P(1-P) is just the
PQ formula in statistics...
P = Proportion
Q = 1 - Proportion
EG: if we had a 75% P chance to do something,
then we'd have a 25% (1-P) Q chance to not do it.
Anyways, since we know we're using a default P of 50%,
we can go ahead and plug that in...
n = ( Z^2 ( 0.5 ( 1 - 0.5 ) ) ) / E^2
...and, We can pre-calc to reduce the formula down to this ...
n = ( Z^2 * 0.25 ) / E^2
...in laymen's terms...
sample size = ( Zscore^2 * 0.25 ) / Margin of Error^2
(Note: X^2 is MS Excel's notation for raising to a power;
I was modeling things out in Excel first.)
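A worked example of the reduced formula (my numbers,
matching the test output further below): at 95% confidence
the two-tail z-score is 1.96, so with a 5% margin of error...
n = ( 1.96^2 * 0.25 ) / 0.05^2
  = 0.9604 / 0.0025
  = 384.16
which is where the 384-sample ceiling in the test runs
comes from.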
The problem with Cochran's Formula is that it was designed
for really large population sizes. If you get into really
small ones (eg: 100 or less), it calculates a sample size
larger than the population size. And, for large populations,
it ramps the sample size up radically before capping out.
So, his formula finishes with a correction (the finite
population correction) that scales sample sizes for small
populations back under the pop size, and ramps the sample
size for large pops up smoothly instead of radically.
So, the sample size calculated above gets run through
the following at the end...
sample size = sample size / ( 1 + ( ( sample size - 1 ) / population ) )
...or, in statistical terms...
n = n / ( 1 + ( ( n - 1 ) / N ) )
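Worked example (my numbers): feeding the 384.16 from a
95% confidence / 5% error run through the correction for
a population of 2,000...
n = 384.16 / ( 1 + ( 383.16 / 2000 ) )
  = 384.16 / 1.19158
  = 322.4
which truncates to the 322 the test output shows for
N = 2000.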
The interesting thing about this is the sample size
doesn't increase much after population size hits
200,000 or higher.
EG: For confidence interval of 95%,
and margin-of-error of 5%...
population = 20,000
sample size = 376 (--)
population = 200,000
sample size = 383 (+7)
population = 2,000,000
sample size = 384 (+1)
At higher confidence levels, the sample
size increases more, but not by much. But,
altering the Margin of Error increases
sample sizes dramatically.
EG: MOE of 1% starts to get into 10,000+ samples
for large population sets.
To get around this, you could simply resample
several times using a larger margin of error
(5% or such), because the central limit theorem
says that if you sample enough, you eventually
get closer to the average, std dev, etc. of the
actual population (although you'd never be 100%
accurate unless you actually did process the
whole population).
EG: instead of doing 1 sample run at 1% MOE,
you could do 3 sample runs at 5% MOE, then average
the results. (In stats, there are standard error
functions you can use to determine how close
the sampling is coming to the population metrics,
but if you want to be quick-n-dirty about it...)
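(For reference, the standard error of a sample mean is
SE = s / sqrt(n), where s is the sample's std dev; it's
the quick check you'd reach for to see how tightly your
sample mean pins down the population mean before deciding
whether more resampling is worth it.)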
"""
####################################################
#------------------------------
# Imports
#------------------------------
from samplesize_zscore import zscore
#------------------------------
# Globals / Constants
#------------------------------
"""
I created two versions of the function
* one uses laymen's terms
* one uses statistical notation
While it's easy to follow along when looking
at the code below, I provided two versions
so you can use whichever you like when
you're coding on an IDE that has intellisense.
When typing out the function call, the
intellisense will show the function parameters.
If you're comfortable with statistical notation,
then use that function. If you like things
spelled out, then use the laymen's function
which will make it clear what parameter is what.
The constant below lets you set which
version of the function you want to use.
"""
# true = use statistical notation for var names
# false = spell things out in laymen's terms for var names
STATISTICAL_NOTATION = False
#---------------------------
# Cochran's Formula for Sample Size Estimation
#---------------------------
"""
For normal (binomial) distribution sampling...
* population needs to be 30+
* we need to assume population follows a normal distrubtion
"""
if STATISTICAL_NOTATION:

    #---------------------------
    # use the one with statistical notation
    #---------------------------

    """
    N = population size
    C = confidence level
    E = margin of error
    n = sample size
    P = proportion ( 1-P = Q, so P(1-P) = PQ )
    debug = print debug outputs for testing (defaulting to false)
    """
    def samplesize( N, C, E, debug = False ):
        try:
            # do some pre-emptive error-checking
            assert ( C < 1.0 ), "C must be less than 100%"
            assert ( C > 0.0 ), "C must be greater than 0%"
            assert ( E < 1.0 ), "E must be less than 100%"
            assert ( E > 0.0 ), "E must be greater than 0%"
            assert ( N > 0.0 ), "N must be greater than 0"

            # need two-tail z-score
            Z = zscore( C, 2 )

            # sample size proportion
            # since we're using 50% default, we can
            # pre-calc P(1-P) ... IE: PQ
            # n = ( Z**2 * P ( 1 - P ) ) / E**2
            n = ( Z**2 * 0.25 ) / E**2

            # corrective function that scales
            # sample size to pop size better
            # to smooth out sample sizing and
            # handle very small populations
            n = n / ( 1 + ( ( n - 1 ) / N ) )

            # sample size is a float / decimal;
            # recast as a whole number integer
            # so if we use it in range() it
            # won't throw an error
            n = int( n )

            if debug:
                print("Sample Size " + str( n ) + ", Population " + str( N ))

            return n

        except AssertionError as error:
            print( "Error: " + str( error ) )
else:

    #---------------------------
    # use the one that spells things out
    #---------------------------

    """
    inputs... self-explanatory, except for...
    debug = print debug outputs for testing (defaulting to false)
    """
    def samplesize( population, confidence, margin_of_error, debug = False ):
        try:
            # do some pre-emptive error-checking
            assert ( confidence < 1.0 ), "Confidence must be less than 100%"
            assert ( confidence > 0.0 ), "Confidence must be greater than 0%"
            assert ( margin_of_error < 1.0 ), "Margin of Error must be less than 100%"
            assert ( margin_of_error > 0.0 ), "Margin of Error must be greater than 0%"
            assert ( population > 0.0 ), "Population must be greater than 0"

            # need two-tail z-score
            z = zscore( confidence, 2 )

            # sample size proportion
            # since we're using 50% default, we can
            # pre-calc proportion ( 1 - proportion )
            # sample = ( zscore**2 * proportion ( 1 - proportion ) ) / margin_of_error**2
            sample = ( z**2 * 0.25 ) / margin_of_error**2

            # corrective function that scales
            # sample size to pop size better
            # to smooth out sample sizing and
            # handle very small populations
            sample = sample / ( 1 + ( ( sample - 1 ) / population ) )

            # sample size is a float / decimal;
            # recast as a whole number integer
            # so if we use it in range() it
            # won't throw an error
            sample = int( sample )

            if debug:
                print("Sample Size " + str( sample ) + ", Population " + str( population ))

            return sample

        except AssertionError as error:
            print( "Error: " + str( error ) )
######################################
# set up some debug / testing code
# that will only run if this file
# is run as "main"
######################################
if __name__ == "__main__":

    #---------------------------
    # test sample size function
    #---------------------------
    N = 20000
    C = 0.95
    E = 0.05

    print("-------------------------")
    samplesize( N, C, E, True )
    print("-------------------------")
##########################################################
#
# Sample Size Ceiling Check (with Charts)
#
##########################################################
"""
The Cochran Formula sample size caps out quickly as we near
200,000 population. So, let's demonstrate by plowing
through some pop sizes at various confidence levels
and margins of error.
Code below uses following statistical notations...
C = Confidence Level
E = Margin of Error
N = Population Size
n = sample size
"""
##########################################################
#------------------------------
# Imports
#------------------------------
import matplotlib.pyplot as plt
from samplesize import samplesize
#------------------------------
# Globals / Constants
#------------------------------
"""
dashed border to make message titles stand-out
and break apart different debug output runs
"""
BORDER_DASH = "-" * 50
"""
we're going to exponentially increase pop
sizes to generate samples. We don't want our
chart going crazy with a massive expanse of
pop sizes, so instead we convert the pop sizes
into string labels applied to each tick on
the x axis ... we're doing 10 runs, but
we're starting the plot ticks on 0 to
have a 0,0 starting point, and then the
pop sizes are on ticks 1 thru 10
"""
PLOT_TICKS = [0,1,2,3,4,5,6,7,8,9,10]
"""
pyplot formats can use string short cuts
- = use a line chart
b,r,g,y = blue, red, green, yellow colors
o = use dots as markers
so -bo = use a blue line & put a dot marker on it (same color).
we're only doing 4 plot runs, so hard-coding
the formats here to iterate through for a
multi-line run chart later on.
"""
PLOT_FORMATS = ['-bo', '-ro', '-go', '-yo']
"""
we'll use some lists to keep track
of what conf levels, margin of error,
pop sizes & sample sizes we do over
several runs, so we can chart them
later on.
population & sample size are
going to be lists of lists,
because we're gonna run
10 population sizes to generate
10 sample sizes per conf/moe
combination run we do.
"""
marginoferr = []
confidences = []
populations = []
samplesizes = []
#---------------------------
# Generate Sample Runs
#---------------------------
# function iterates through pop sizes
# exponentially to generate sample sizes
def samplerun( C, E, debug = False ):

    if debug:
        print( BORDER_DASH )
        print( "conf " + str(C) + "\n" +
               "error " + str(E) )
        print( BORDER_DASH )

    # we'll rack up the pops & samples
    # in temp lists. they start with
    # 0 in each so we can have a 0,0
    # x,y starting point in our charts
    pops = [0]
    smps = [0]

    # exponentially ramp up pop size
    # and generate a sample size for it
    for x in range(10):
        N = 200 * 10**x
        n = samplesize( N, C, E, debug )
        pops.append(N)
        smps.append(n)

    # figure out what slot we're on
    # in our global lists, and load
    # our respective values
    i = len(populations)
    populations.insert( i, pops )
    samplesizes.insert( i, smps )
    confidences.insert( i, C )
    marginoferr.insert( i, E )
#---------------------------
debug_output = True
samplerun( 0.95, 0.05, debug_output ) # conf 95%, moe 5%
samplerun( 0.99, 0.05, debug_output ) # conf 99%, moe 5%
samplerun( 0.95, 0.01, debug_output ) # conf 95%, moe 1%
samplerun( 0.99, 0.01, debug_output ) # conf 99%, moe 1%
# debug ... see what's in our lists
if 0:
    print(populations)
    print(samplesizes)
    print(confidences)
    print(marginoferr)
#---------------------------
# generate charts for our runs
#---------------------------
# func spits out a single-line chart
# so we can see each individual run.
# This lets us see the sample size
# scale better for each separate run.
# Note: the plots show up fine in Spyder's
# plot window, but in VSCode the population
# labels get truncated. So, you have
# to click each chart's "settings" when
# they run, and adjust the "bottom" to
# bring the labels up into view
def singlechart( i ):

    # make some shortcut vars for the run we're doing
    C = confidences[i]
    E = marginoferr[i]
    N = populations[i]
    n = samplesizes[i]

    # chart title
    title = "conf " + str(C) + "\nerror " + str(E)

    # plot sample sizes for run using 1st line format style
    plt.plot( n, PLOT_FORMATS[0], label = title )

    # the population sizes are exponential
    # so we just want to show them as text
    # labels for the 10 pops we sampled
    plt.xticks( PLOT_TICKS, N, rotation = 90 )

    # force 0,0 to be the exact bottom-left corner w/o padding
    plt.xlim( left = 0.0 )
    plt.ylim( bottom = 0.0 ) # , top = 20000

    plt.suptitle( title )
    plt.ylabel( 'sample size' )
    plt.xlabel( 'population size' )
    plt.grid( True, 'major', 'both' )
    # plt.legend()
    plt.show()
#---------------------------

# funcs below spit out one chart with
# 4 lines, giving a comparison of
# sample sizes for each conf &
# error run
def addline( plt, i ):
    # plt = chart we're working on
    # i = sample run we're working with
    C = confidences[i]
    E = marginoferr[i]
    n = samplesizes[i]
    f = PLOT_FORMATS[i]
    L = "conf " + str(C) + ", err " + str(E)
    plt.plot( n, f, label = L )

def multichart():

    # plot sample sizes for each run
    for i in range(4):
        addline( plt, i )

    # the population sizes are exponential
    # so we just want to show them as text
    # labels for the 10 we ran. Since we
    # ran the same pops each time, we can
    # just use the first pop list for this.
    plt.xticks( PLOT_TICKS, populations[0], rotation = 90 )

    # force 0,0 to be the exact bottom-left corner w/o padding
    # also max out y's ceiling so we can show the
    # difference between run chart lines better
    plt.xlim( left = 0.0 )
    plt.ylim( bottom = 0.0, top = 20000 )

    plt.suptitle( "Sample Size Ceiling Comparison" )
    plt.ylabel( 'sample size' )
    plt.xlabel( 'population size' )
    plt.grid( True, 'major', 'both' )

    # this legend style looks fine in Spyder's plot output
    # plt.legend( loc = 'upper left', bbox_to_anchor = ( 1.05, 1 ) )
    # this one embeds the legend in the chart for VSCode editor
    plt.legend()

    plt.show()
#---------------------------

# generate a single-run chart per
# conf + err run to see how the
# sample sizes range for each
for pop_run in range( len( populations ) ):
    singlechart( pop_run )

# generate a multi-line run chart
# to see how they compare to each
# other (showing how margin of
# error greatly impacts sample size)
multichart()
#---------------------------
"""
debug console output
--------------------------------------------------
conf 0.95
error 0.05
--------------------------------------------------
Sample Size 131, Population 200
Sample Size 322, Population 2000
Sample Size 376, Population 20000
Sample Size 383, Population 200000
Sample Size 384, Population 2000000
Sample Size 384, Population 20000000
Sample Size 384, Population 200000000
Sample Size 384, Population 2000000000
Sample Size 384, Population 20000000000
Sample Size 384, Population 200000000000
--------------------------------------------------
conf 0.99
error 0.05
--------------------------------------------------
Sample Size 153, Population 200
Sample Size 498, Population 2000
Sample Size 642, Population 20000
Sample Size 661, Population 200000
Sample Size 663, Population 2000000
Sample Size 663, Population 20000000
Sample Size 663, Population 200000000
Sample Size 663, Population 2000000000
Sample Size 663, Population 20000000000
Sample Size 663, Population 200000000000
--------------------------------------------------
conf 0.95
error 0.01
--------------------------------------------------
Sample Size 195, Population 200
Sample Size 1655, Population 2000
Sample Size 6488, Population 20000
Sample Size 9163, Population 200000
Sample Size 9557, Population 2000000
Sample Size 9599, Population 20000000
Sample Size 9603, Population 200000000
Sample Size 9603, Population 2000000000
Sample Size 9603, Population 20000000000
Sample Size 9603, Population 200000000000
--------------------------------------------------
conf 0.99
error 0.01
--------------------------------------------------
Sample Size 197, Population 200
Sample Size 1784, Population 2000
Sample Size 9067, Population 20000
Sample Size 15316, Population 200000
Sample Size 16450, Population 2000000
Sample Size 16573, Population 20000000
Sample Size 16585, Population 200000000
Sample Size 16587, Population 2000000000
Sample Size 16587, Population 20000000000
Sample Size 16587, Population 200000000000
as you can see, the margin of error impacts
sample size greatly.
"""
##########################################################
#
# Sample Size Metrics vs. Population Metrics Check
#
##########################################################
"""
Since the Cochran Formula pulls small sample sizes from
potentially massive data sets, it's good to get an idea
of how far off the sample's stats might be vs. the
population's (eg: std dev, average, etc.)
So, let's...
1) create a massive population data set
2) generate a sample size for it
3) pull simple random samples from population set
4) compare pop & sample statistics
Code below uses following statistical notations...
C = Confidence Level
E = Margin of Error
N = Population Size
n = sample size
"""
##########################################################
#------------------------------
# Imports
#------------------------------
from numpy import mean, std
from numpy.random import randint
from samplesize import samplesize
#------------------------------
# Globals / Constants
#------------------------------
BORDER = "-" * 50
#---------------------------
# test how far off numbers are
#---------------------------
print(BORDER)
N = 2000000
C = 0.95
E = 0.05
n = samplesize( N, C, E, True )
pop = [] # population dataset
smp = [] # sample dataset randomly pulled from pop
#---------------------------
# fill population list with
# random integers (0 to 99 ...
# randint's upper bound is exclusive)

print("filling population dataset...")

# numpy's randint lets us do it all in one
# shot using the "size" parameter, which we
# just set to return N random ints
pop = randint( 0, 100, size = N )

"""
for i in range( N ):
    randomvalue = randint(0, 100) # random 0 to 99
    pop.insert( i, randomvalue )
"""
#---------------------------
# pull random samples from population data

print("pulling random samples...")

for i in range( n ):
    randomsample = randint( 0, N )  # random index 0 to N-1 (high is exclusive)
    smp.insert( i, pop[randomsample] )
"""
Note that we're doing sampling
where we can potentially include
the same sample multiple times.
The larger the population, the
less likely this will happen.
If we want to really get elaborate,
we can code the sample selection
algorithm to go pick another random
sample if we've already sampled
the one it chooses.
But, for now, we're just demonstrating
some code, so ..
"""
#---------------------------
# calculate std dev, mean & coefficient of variation
# for pop and sample to compare how far off we are
print("calculating metrics...")
popMean = mean( pop )
smpMean = mean( smp )
popStdv = std( pop )
smpStdv = std( smp )
popCoef = popStdv / popMean
smpCoef = smpStdv / smpMean
print(BORDER)
print("Pop std.dev = " + str( popStdv ) )
print("Smp std.dev = " + str( smpStdv ) )
print("Difference = " + str( abs(popStdv - smpStdv) ) )
print(BORDER)
print("Pop average = " + str( popMean ) )
print("Smp average = " + str( smpMean ) )
print("Difference = " + str( abs(popMean - smpMean) ) )
print(BORDER)
print("Pop coeff.v = " + str( popCoef ) )
print("Smp coeff.v = " + str( smpCoef ) )
print("Difference = " + str( abs(popCoef - smpCoef) ) )
#---------------------------
# resampling to try to increase accuracy
#---------------------------
"""
Central Limit Theory says that if you sample
a population enough times, then the metrics
you create by merging your samples metrics
(eg: averaging your sample averages) will come
really close to the population metrics.. but
never 100% close, because the only way to
have 100% accuracy is to do the metrics
on the entire population.
So, we can do multiple sampling passes,
then average their metrics together to
see if we come closer.
Eventually you hit a resampling where you
might as well just increase the conf level
and / or margin of error and run a single
sample run instead of multiple runs,
though.
For very small populations, this point is
hit very fast. EG: for populations of 200,
if you did several resamples with 95% conf
and 5% error, you'll pretty much have hit
the sample size for a 99% conf and 1% error
rate. So, might as well just ramp those up
instead and do a single sample pass.
But, when you start to get into pop sizes
of, eg: 2000 ... a conf of 99% and error of
1% gives a sample size of 1655. A conf of
95% and error of 5% gives a sample size of
322.
1655 / 322 = about 5 sample runs of 322
you could do before hitting 1655 samples.
So, there's a decision.. should you resample
using the same conf & error rate you already
generated a sample size for? Or, should you
just increase the conf level and error rate
a bit and run another sample size number?
You'll need to decide how you want to do
it.
But, ultimately, the more samples you do,
the greater the accuracy you can generate.
Whether that's through resampling via a
small sample size multiple times, or ramping
up the conf & error to generate a larger
single sample size.. up to you.
I'm sure there's some statistics to help
decide which is better. But, for now, let's
just do a multi-sample run to see how it
looks.
"""
#---------------------------
print(BORDER)
#---------------------------
# pull a 2nd round of random samples from population
# using the sample size we already generated

print("pulling random samples (2nd sampling)...")

# purge old samples from samples list
smp.clear()

for i in range( n ):
    randomsample = randint( 0, N )  # random index 0 to N-1 (high is exclusive)
    smp.insert( i, pop[randomsample] )

# generate metrics for 2nd sampling
smpMean2 = mean( smp )
smpStdv2 = std( smp )
smpCoef2 = smpStdv2 / smpMean2
#---------------------------
# pull a 3rd round of random samples from population
# using the sample size we already generated

print("pulling random samples (3rd sampling)...")

# purge old samples from samples list
smp.clear()

for i in range( n ):
    randomsample = randint( 0, N )  # random index 0 to N-1 (high is exclusive)
    smp.insert( i, pop[randomsample] )

# generate metrics for 3rd sampling
smpMean3 = mean( smp )
smpStdv3 = std( smp )
smpCoef3 = smpStdv3 / smpMean3
#---------------------------
avgSmpMean = ( smpMean + smpMean2 + smpMean3 ) / 3
avgSmpStdv = ( smpStdv + smpStdv2 + smpStdv3 ) / 3
avgSmpCoef = ( smpCoef + smpCoef2 + smpCoef3 ) / 3
print(BORDER)
print("Metrics compare using avg of 3 sample runs")
print("IE: ran sampling 3 times, and took avg of ")
print(" avg, std dev & cv to see if they get closer")
print(" to pop metrics via more sampling.")
print(BORDER)
print("Pop std.dev = " + str( popStdv ) )
print("Smp std.dev = " + str( avgSmpStdv ) )
print("Difference = " + str( abs(popStdv - avgSmpStdv) ) )
print(BORDER)
print("Pop average = " + str( popMean ) )
print("Smp average = " + str( avgSmpMean ) )
print("Difference = " + str( abs(popMean - avgSmpMean) ) )
print(BORDER)
print("Pop coeff.v = " + str( popCoef ) )
print("Smp coeff.v = " + str( avgSmpCoef ) )
print("Difference = " + str( abs(popCoef - avgSmpCoef) ) )
# calculate differences for
# 1 sample run vs. 3 sample runs
samplediff1 = [
    abs(popStdv - smpStdv),
    abs(popMean - smpMean),
    abs(popCoef - smpCoef)
]
samplediff3 = [
    abs(popStdv - avgSmpStdv),
    abs(popMean - avgSmpMean),
    abs(popCoef - avgSmpCoef)
]
print(BORDER)
print("Compare differences of 1 sample run vs. 3 sample runs")
print("(lower is better)")
print(BORDER)
print("Pop Std Dev - Sample Std Dev")
print("1 sample run difference: " + str(samplediff1[0]))
print("3 sample run difference: " + str(samplediff3[0]))
print(BORDER)
print("Pop Avg - Sample Avg")
print("1 sample run difference: " + str(samplediff1[1]))
print("3 sample run difference: " + str(samplediff3[1]))
print(BORDER)
print("Pop Coeff Var - Sample Coeff Var")
print("1 sample run difference: " + str(samplediff1[2]))
print("3 sample run difference: " + str(samplediff3[2]))
print(BORDER)
"""
if you run the code several times, sometimes
the multi-sampling does a better job, and sometimes
it doesn't.
"""
##########################################################
"""
Cochran's Formula needs to convert Confidence Level
Percents into two-tail z-scores. So, we need to use
a converter to generate them.
Farming this off to a seperate file for import to
help streamline code in other files.
"""
##########################################################
#------------------------------
# Imports
#------------------------------
"""
Python 3.8 stat pkg started including
a normal distribution object to do
stuff with, like zscores, but I was using
Python 3.7.x, so using scipy instead
"""
#from statistics import NormalDist as nd
from scipy.stats import norm as nd
#------------------------------
# Globals / Constants
#------------------------------
#------------------------------
# Z-Score for Confidence Level
#------------------------------
"""
scipy.stats.norm.ppf can convert confidence level to z-score
https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.norm.html#scipy.stats.norm
this stackoverflow helped figure it out...
https://stackoverflow.com/questions/20864847/probability-to-z-score-and-vice-versa-in-python
basically, scipy.stats.norm.ppf converts a confidence level %
to a left-tail z-score by default. So, 0.95 would get a 1.64
z-score. Cochran's Formula needs the two-sided z-score, IE:
0.95 should get 1.96. So, we have to convert the confidence
level to its two-sided version first before taking the ppf.
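Worked example (my numbers): for a 0.95 confidence level
and two tails...
converted = 1 - ( ( 1 - 0.95 ) / 2 ) = 0.975
nd.ppf( 0.975 ) = 1.95996... ~ 1.96
the familiar two-tail z-score for 95% confidence.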
"""
# confidence = confidence level % as decimal (eg: 0.95 confidence)
# tails = 1 for one-sided tail, 2 for two-sided tail
# debug = true to print debug output for testing
def zscore( confidence, tails, debug = False ):
    try:
        assert ( 0 < tails < 3 ), "Tails has to be 1 or 2"
        assert ( confidence < 1 ), "Confidence must be less than 1.0 (100%)"
        assert ( confidence > 0 ), "Confidence must be greater than 0.0 (0%)"

        # ppf returns a one-tail z by default,
        # so convert confidence if we need two-tail
        if tails == 2:
            confidence = 1 - ( ( 1 - confidence ) / 2 )

        z = nd.ppf( confidence )

        if debug:
            print("zscore for " + str( confidence ) + " = " + str( z ))

        return z

    except AssertionError as error:
        print( "Error: " + str( error ) )
######################################
# set up some debug / testing code
# that will only run if this file
# is run as "main"
######################################
if __name__ == "__main__":

    # debug ... test zscore function
    if 1:
        print( "----------------------------" )
        print( "conf \t one \t two" )
        print( "----------------------------" )

        # for confidence levels 90% to 99%,
        # generate one- and two-tail scores
        for i in range(10):
            C = 0.9 + 0.01 * i   # confidence level
            Z1 = zscore( C, 1 )  # one-tail zscore
            Z2 = zscore( C, 2 )  # two-tail zscore

            C = round( C, 2 )    # round everything to 2 decimal places
            Z1 = round( Z1, 2 )
            Z2 = round( Z2, 2 )

            print( str(C) + "\t" + str(Z1) + "\t" + str(Z2) )

        print( "----------------------------" )