We'd like to better understand if and when using libcudf's `type_dispatcher` adds overhead in both host and device code. To test this, we'd like to create a set of micro-benchmarks that profile the performance of operating on a set of `n` `gdf_column` objects. To keep the benchmark simple, the work of each kernel will simply be to apply some in-place transformation functor to every element of every column. For example:
```cpp
template <template <typename> class UnaryFunctor>
void benchmark(cudf::table input) {
  // Dispatch on each column's type to deduce T
  // UnaryFunctor<T> f;
  // For every element in every column, apply `f(element)`
}
```
There are 3 micro-benchmarks we'd like to implement:
- **Host Dispatching** - In this implementation, there will be `n` kernels for `n` columns. The kernel will be a template whose type will be dispatched and invoked on the host for each column in the table.
- **Device Dispatching** - In this implementation, there will be a single kernel for all `n` columns. Threads will iterate through the columns, dispatching on the type of each column in device code.
- **No Dispatching** - This will be a baseline implementation, where there will be a single kernel for all `n` columns. This kernel will be a variadic template that specifies the type of each column at compile time, eliminating the need for any type dispatching in either host or device code.
  - The purpose of this benchmark is to establish an upper bound for the Device Dispatching benchmark, as it performs the same operation, but without any type dispatching.
For each of the 3 benchmarks above, we'd like to test at least 2 different functors:
- **Bandwidth Bound** - This functor will do little work on each element, meaning the kernel will be bandwidth bound. Something like adding `1` to each element.
- **Compute Bound** - This functor will do a large amount of compute for each element, meaning the kernel will be compute bound. I suggest something like 50 fused multiply-adds or similar.
The desired output from this exploration will be graph(s) that show the variation in performance across the 3 described benchmarks, with the 2 different functors, varying the number of columns from 1 to `n` (start with `n == 5`) and the size of the columns.
Ideally, the micro-benchmarks should be written and executed through Google Benchmark.