Description of type_dispatcher micro-benchmarks

We'd like to better understand if and when libcudf's type_dispatcher adds overhead in both host and device code.

In order to test this, we'd like to create a set of micro-benchmarks to profile the performance of operating on a set of n gdf_column objects.

To keep the benchmark simple, each kernel will simply apply an in-place transformation functor to every element of every column. For example:

template <template <typename> class UnaryFunctor>
void benchmark(cudf::table input){

   // Dispatch on each column's runtime type to obtain `T`
   UnaryFunctor<T> f;

   // For every element in every column, apply `f(element)`
}
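
To make the "dispatch type" step concrete, here is a minimal sketch of the host-dispatching variant (benchmark 1 below). It assumes the legacy gdf_column/cudf::table API of the time and a cudf::type_dispatcher that invokes the functor's operator()<T> with the type matching a column's gdf_dtype; the helper names (transform_kernel, host_dispatch_functor, host_dispatch_benchmark) are hypothetical, not libcudf API:

// One templated kernel per column; `T` is chosen on the host via type dispatch
template <typename T, template <typename> class UnaryFunctor>
__global__ void transform_kernel(T* data, gdf_size_type size) {
  UnaryFunctor<T> f;
  // Grid-stride loop: apply `f` in place to every element of the column
  for (gdf_size_type i = blockIdx.x * blockDim.x + threadIdx.x; i < size;
       i += gridDim.x * blockDim.x) {
    data[i] = f(data[i]);
  }
}

// Functor invoked by cudf::type_dispatcher with the column's static type `T`
template <template <typename> class UnaryFunctor>
struct host_dispatch_functor {
  template <typename T>
  void operator()(gdf_column* col) {
    constexpr int block_size = 256;
    int const grid_size = (col->size + block_size - 1) / block_size;
    transform_kernel<T, UnaryFunctor>
        <<<grid_size, block_size>>>(static_cast<T*>(col->data), col->size);
  }
};

template <template <typename> class UnaryFunctor>
void host_dispatch_benchmark(cudf::table& input) {
  // One dispatch + one kernel launch per column
  for (gdf_index_type i = 0; i < input.num_columns(); ++i) {
    gdf_column* col = input.get_column(i);
    cudf::type_dispatcher(col->dtype, host_dispatch_functor<UnaryFunctor>{}, col);
  }
}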

There are 3 micro-benchmarks we'd like to implement:

  1. Host Dispatching

    • In this implementation, there will be n kernel launches for n columns. The kernel will be a template that is instantiated via host-side type dispatch and launched once for each column in the table (the sketch above follows this pattern).
  2. Device Dispatching

    • In this implementation, there will be a single kernel for all n columns. Threads will iterate through the columns, dispatching on the type of each column in device code (see the first sketch after this list).
  3. No Dispatching

    • This will be a baseline implementation, where there will be a single kernel for all n columns. This kernel will be a variadic template that specifies the type of each column at compile time, eliminating the need for any type dispatching in either host or device code (see the second sketch after this list).
    • The purpose of this benchmark is to establish a performance upper bound for benchmark 2, as it performs the same operation but without any type dispatching.
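
Under the same assumptions as above, a sketch of benchmark 2 could look as follows. It additionally assumes cudf::type_dispatcher is callable in device code and that the array of gdf_column structs has been copied to device-accessible memory:

// Benchmark 2 sketch: one kernel; each thread dispatches on every column's type
template <template <typename> class UnaryFunctor>
struct device_dispatch_functor {
  template <typename T>
  __device__ void operator()(void* data, gdf_size_type size,
                             gdf_size_type tid, gdf_size_type stride) {
    UnaryFunctor<T> f;
    T* typed = static_cast<T*>(data);
    for (gdf_size_type i = tid; i < size; i += stride) {
      typed[i] = f(typed[i]);
    }
  }
};

template <template <typename> class UnaryFunctor>
__global__ void device_dispatch_kernel(gdf_column* columns,  // device-accessible
                                       gdf_size_type num_columns) {
  gdf_size_type const tid = blockIdx.x * blockDim.x + threadIdx.x;
  gdf_size_type const stride = gridDim.x * blockDim.x;
  for (gdf_size_type c = 0; c < num_columns; ++c) {
    cudf::type_dispatcher(columns[c].dtype, device_dispatch_functor<UnaryFunctor>{},
                          columns[c].data, columns[c].size, tid, stride);
  }
}

For benchmark 3, the column types travel in a template parameter pack, so every instantiation is resolved at compile time. A sketch, assuming for simplicity that all columns have the same length:

// Benchmark 3 sketch: base case, no columns left to transform
template <template <typename> class UnaryFunctor>
__device__ void transform_columns(void**, gdf_size_type, gdf_size_type,
                                  gdf_size_type) {}

// Transform the current column with its compile-time type `T`, then recurse
template <template <typename> class UnaryFunctor, typename T, typename... Rest>
__device__ void transform_columns(void** data, gdf_size_type size,
                                  gdf_size_type tid, gdf_size_type stride) {
  UnaryFunctor<T> f;
  T* typed = static_cast<T*>(*data);
  for (gdf_size_type i = tid; i < size; i += stride) {
    typed[i] = f(typed[i]);
  }
  transform_columns<UnaryFunctor, Rest...>(data + 1, size, tid, stride);
}

template <template <typename> class UnaryFunctor, typename... Ts>
__global__ void no_dispatch_kernel(void** column_data, gdf_size_type size) {
  gdf_size_type const tid = blockIdx.x * blockDim.x + threadIdx.x;
  gdf_size_type const stride = gridDim.x * blockDim.x;
  transform_columns<UnaryFunctor, Ts...>(column_data, size, tid, stride);
}

An instantiation such as no_dispatch_kernel<add_one, int32_t, float, double> fixes both the functor and the per-column types when the kernel is compiled, so no dispatch happens anywhere.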

For each of the 3 benchmarks above, we'd like to test at least 2 different functors:

  1. Bandwidth Bound

    • This functor will do little work on each element, meaning the kernel will be bandwidth bound. Something like adding 1 to each element.
  2. Compute Bound

    • This functor will do a large amount of compute for each element, meaning the kernel will be compute bound. I suggest something like 50 fused multiply-adds or similar (see the functor sketches after this list).
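
As a sketch, the two functors could look like this (the constants and the count of 50 are arbitrary; whether the loop actually compiles to fused multiply-add instructions is worth verifying in the generated SASS):

// Bandwidth bound: minimal work per element
template <typename T>
struct add_one {
  __host__ __device__ T operator()(T x) const { return x + T{1}; }
};

// Compute bound: ~50 dependent multiply-adds per element
// (for floating-point T these should contract to FMA instructions)
template <typename T>
struct fma_heavy {
  __host__ __device__ T operator()(T x) const {
    T acc = x;
    #pragma unroll
    for (int i = 0; i < 50; ++i) {
      acc = acc * T{2} + T{1};
    }
    return acc;
  }
};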

The desired output from this exploration will be graph(s) showing how performance varies across the 3 benchmarks described above and the 2 functors, while varying both the number of columns from 1 to n (start with n == 5) and the size of the columns.

Ideally, the micro-benchmarks should be written and run with Google Benchmark.
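
A minimal sketch of how one configuration could be registered, assuming a hypothetical make_table helper that allocates and fills an n-column table, plus the host_dispatch_benchmark and add_one sketches above:

#include <benchmark/benchmark.h>
#include <cuda_runtime.h>

static void BM_host_dispatch(benchmark::State& state) {
  auto const num_columns = state.range(0);
  auto const column_size = state.range(1);
  cudf::table input = make_table(num_columns, column_size);  // hypothetical helper
  for (auto _ : state) {
    host_dispatch_benchmark<add_one>(input);
    cudaDeviceSynchronize();  // kernel launches are asynchronous; time the full run
  }
}
// Sweep column count (1 to 5) and column size (1 KiB to 64 Mi elements)
BENCHMARK(BM_host_dispatch)
    ->RangeMultiplier(2)
    ->Ranges({{1, 5}, {1 << 10, 1 << 26}});

BENCHMARK_MAIN();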
