count-unique-in-array-performance

Given a non-negative integers array a with different len(a) (scale) and different max(a) (amplitude), what is the most efficient way in Python to count the unique elements in a?

In this experiment, we fix the scale len(a) to be 20000, and focus specifically on the amplitude max(a) of the array.

Here is the code:

def get_runtime_stat(AMPLITUDE:int):
    def _test1():
        counts = np.zeros((AMPLITUDE + 1, ), dtype=np.int32)
        # np.int16 is already enough for sample size 20000 < 32767
        for ind in a: counts[ind] += 1
        counts = counts[counts != 0] # this is because some value in range may be missing
        return counts
    
    def _test2():
        uniq_vals, uniq_cnts = np.unique(a, return_counts=True)
        return uniq_cnts # uniq_vals are sorted

    def _test3():
        counts = np.bincount(a, minlength=AMPLITUDE+1)
        counts = counts[counts != 0]
        return counts

    TIMES = [[], [], []]
    for repeat in range(10):
        SCALE = 20000
        a = np.random.randint(0, AMPLITUDE, (SCALE,))

        tic = time.time(); ct1 = _test1(); tac = time.time() # or use tracemalloc.get_traced_memory() to get memory peak
        time1 = tac - tic
        tic = time.time(); ct2 = _test2(); tac = time.time()
        time2 = tac - tic
        tic = time.time(); ct3 = _test3(); tac = time.time()
        time3 = tac - tic
        assert np.all(ct1 == ct2) and np.all(ct1 == ct3)
        del ct1, ct2, ct3
        TIMES[0].append(time1)
        TIMES[1].append(time2)
        TIMES[2].append(time3)

    return TIMES

For AMPLITUDE in [5, 10, 50, 100, ..., 1e10], here is the results of time usage and memory usage:

So in my case, when AMPLITUDE is small (<1e5) I should use np.bincount, and otherwise np.unique.

MarkDana/count-unique-in-array-performance.md