Given a non-negative integers array a
with different len(a)
(scale) and different max(a)
(amplitude), what is the most efficient way in Python to count the unique elements in a
?
In this experiment, we fix the scale len(a)
to be 20000, and focus specifically on the amplitude max(a)
of the array.
Here is the code:
def get_runtime_stat(AMPLITUDE:int):
def _test1():
counts = np.zeros((AMPLITUDE + 1, ), dtype=np.int32)
# np.int16 is already enough for sample size 20000 < 32767
for ind in a: counts[ind] += 1
counts = counts[counts != 0] # this is because some value in range may be missing
return counts
def _test2():
uniq_vals, uniq_cnts = np.unique(a, return_counts=True)
return uniq_cnts # uniq_vals are sorted
def _test3():
counts = np.bincount(a, minlength=AMPLITUDE+1)
counts = counts[counts != 0]
return counts
TIMES = [[], [], []]
for repeat in range(10):
SCALE = 20000
a = np.random.randint(0, AMPLITUDE, (SCALE,))
tic = time.time(); ct1 = _test1(); tac = time.time() # or use tracemalloc.get_traced_memory() to get memory peak
time1 = tac - tic
tic = time.time(); ct2 = _test2(); tac = time.time()
time2 = tac - tic
tic = time.time(); ct3 = _test3(); tac = time.time()
time3 = tac - tic
assert np.all(ct1 == ct2) and np.all(ct1 == ct3)
del ct1, ct2, ct3
TIMES[0].append(time1)
TIMES[1].append(time2)
TIMES[2].append(time3)
return TIMES
For AMPLITUDE
in [5, 10, 50, 100, ..., 1e10]
, here is the results of time usage and memory usage:
So in my case, when AMPLITUDE
is small (<1e5) I should use np.bincount
, and otherwise np.unique
.