We'd like to better understand if and when using libcudf's `type_dispatcher` adds overhead in both host and device code. To test this, we'd like to create a set of micro-benchmarks that profile the performance of operating on a set of `n` `gdf_column` objects. To keep the benchmark simple, the work of each kernel will simply be to apply some in-place transformation functor to every element of every column. For example:
```cpp
template <template <typename> class UnaryFunctor>
void benchmark(cudf::table input) {
  // Dispatch on each column's type to deduce T
  // UnaryFunctor<T> f;
  // For every element in every column, apply `f(element)`
}
```
There are 3 micro-benchmarks we'd like to implement:
- **Host Dispatching** - In this implementation, there will be `n` kernels for `n` columns. The kernel will be a template whose type will be dispatched and invoked on the host for each column in the table.
- **Device Dispatching** - In this implementation, there will be a single kernel for all `n` columns. Threads will iterate through the columns, dispatching on the type of each column in device code.
- **No Dispatching** - This will be a baseline implementation, where there will be a single kernel for all `n` columns. This kernel will be a variadic template that specifies the type of each column at compile time, eliminating the need for any type dispatching in either host or device code.
  - The purpose of this benchmark is to establish an upper bound for the Device Dispatching benchmark, as it performs the same operation, but without any type dispatching.
For each of the 3 benchmarks above, we'd like to test at least 2 different functors:
- **Bandwidth Bound** - This functor will do little work on each element, meaning the kernel will be bandwidth bound. Something like adding `1` to each element.
- **Compute Bound** - This functor will do a large amount of compute for each element, meaning the kernel will be compute bound. I suggest something like 50 fused multiply-adds or similar.
The desired output from this exploration will be graph(s) that show the variation in performance across the 3 described benchmarks, with the 2 different functors, varying the number of columns from 1 to `n` (start with `n == 5`) and the size of the columns.
Ideally, the micro-benchmarks should be written and executed through Google Benchmark.