Cudf and Arrow
CUDF Memory Deficits (Primitive Types only)
Endianness indicator
8 Byte padding
8 byte alignment (actually should be fine since cuda aligns)
no support for list types
64-byte padding on valid bitmap (they probably use 512-bit instructions)
null bitmap - An array with nulls must have a contiguous memory buffer, known as the null (or validity) bitmap, whose length is a multiple of 64 bytes (as discussed above) and large enough to have at least 1 bit for each array slot. (this may have been solved)
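For reference, the size math behind that padding rule (plain arithmetic, no Arrow APIs, the function name is just for illustration):

#include <cstdint>

// Bytes needed to hold one validity bit per slot, rounded up to a multiple
// of 64 bytes per the padding requirement discussed above.
int64_t PaddedValidityBytes(int64_t num_slots) {
  int64_t bytes = (num_slots + 7) / 8;  // at least 1 bit per slot
  return ((bytes + 63) / 64) * 64;      // pad to a 64-byte boundary
}
// e.g. PaddedValidityBytes(100) == 64, PaddedValidityBytes(1000) == 128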
Arrow GPU Functionality
CudaBuffer
https://github.com/apache/arrow/blob/master/cpp/src/arrow/gpu/cuda_memory.cc
CudaBuffer(uint8_t* data, int64_t size, const std::shared_ptr<CudaContext>& context,
bool own_data = false, bool is_ipc = false);
Here they are doing a few things we should emulate, or we should consider making this buffer
one of the ways we can feed a cudf::column.
is_ipc - this is something we need. Blazing has a wrapper class that basically contains this and other metadata so we
know when to free things.
own_data - could be potentially interesting, but I can't imagine too many safe uses for this. If you don't own it and
someone else frees it, then why pass it around this way?
context - This is a big one. It tells you what device you are on and has something that's shared, which I am guessing is for p2p.
It handles allocations. Unfortunately the allocator is not templated, so we would need to add that so that we could
use something like rmm.
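For reference, a rough sketch of the usual construction path, assuming the Status-returning arrow::cuda API in the file linked above (namespaces and exact signatures have shifted across Arrow releases):

#include <arrow/gpu/cuda_api.h>
#include <memory>

arrow::Status AllocateDeviceBuffer(int64_t size,
                                   std::shared_ptr<arrow::cuda::CudaBuffer>* out) {
  // Grab the singleton device manager and a context for GPU 0.
  arrow::cuda::CudaDeviceManager* manager;
  ARROW_RETURN_NOT_OK(arrow::cuda::CudaDeviceManager::GetInstance(&manager));
  std::shared_ptr<arrow::cuda::CudaContext> context;
  ARROW_RETURN_NOT_OK(manager->GetContext(/*device_number=*/0, &context));
  // The context performs the allocation; the returned CudaBuffer owns it and
  // frees it on destruction, which corresponds to the own_data case above.
  return context->Allocate(size, out);
}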
Architecture decisions. Buffers are buffers, and dictionaries, indices, and values are all encoded as buffers. Instead of making a
distinction between a buffer for valids and a buffer for data like we do, they use one buffer type for all of them. Probably advisable.
CudaContext
https://github.com/apache/arrow/blob/master/cpp/src/arrow/gpu/cuda_context.cc
We definitely need something like this to manage multiple GPUs and to properly manage resources that get allocated and
deallocated by processes using cudf. It handles things like copying buffers, allocating and freeing buffers both pinned host-side
and device-side, generating IPC handles, and synchronizing.
It is not pluggable. Most of the things are hard coded. For example, we can't put in an allocator, tell it to copy on a stream,
or use a transport mechanism that isn't one of a few CUDA APIs. What if you want to use UCX, for example, without having to
write that logic in everywhere? It doesn't have support for streams and supports only synchronous APIs that block the entire device.
For example, synchronize calls cuCtxSynchronize(), so if a process has spawned in the same context a job that returns in 10ms
and another that takes 500ms, we will greatly increase the latency of the first job if we use cuCtxSynchronize() to complete it.
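What we would want instead is something pluggable; here is a hypothetical sketch (none of these names exist in Arrow, purely illustrative of the allocator and per-stream synchronization points above):

#include <cstddef>
#include <cuda_runtime_api.h>

// Hypothetical allocator interface so something like rmm could be slotted in.
class DeviceAllocator {
 public:
  virtual ~DeviceAllocator() = default;
  virtual void* Allocate(std::size_t bytes, cudaStream_t stream) = 0;
  virtual void Deallocate(void* ptr, std::size_t bytes, cudaStream_t stream) = 0;
};

// Hypothetical stream-aware context: synchronizing one stream does not hold
// the 10ms job hostage to the 500ms job the way cuCtxSynchronize() does.
class PluggableCudaContext {
 public:
  PluggableCudaContext(int device, DeviceAllocator* allocator)
      : device_(device), allocator_(allocator) {}

  void Synchronize(cudaStream_t stream) { cudaStreamSynchronize(stream); }

 private:
  int device_;
  DeviceAllocator* allocator_;
};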
CudaIpcMemHandle
https://github.com/apache/arrow/blob/master/cpp/src/arrow/gpu/cuda_memory.h
This is a wrapper class for IPC handles. Good idea. We have a very narrow use case for IPC in Blazing that we manage manually,
but it would be nice for these things to be constructed and destroyed gracefully by using shared_ptrs, and I am certain this
could solve headaches when implementing code that leverages IPC.
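A rough sketch of the export/import flow, again assuming the Status-returning API from the header linked above (newer releases return Result<> instead):

#include <arrow/gpu/cuda_api.h>
#include <memory>

// Exporting process: turn a device buffer into an IPC handle that another
// process can open.
arrow::Status ExportBuffer(const std::shared_ptr<arrow::cuda::CudaBuffer>& buffer,
                           std::shared_ptr<arrow::cuda::CudaIpcMemHandle>* handle) {
  return buffer->ExportForIpc(handle);
}

// Importing process: open the handle through its own CudaContext. The returned
// shared_ptr closes the IPC mapping when the last reference goes away, which is
// the graceful construction/destruction mentioned above.
arrow::Status ImportBuffer(const std::shared_ptr<arrow::cuda::CudaContext>& context,
                           const arrow::cuda::CudaIpcMemHandle& handle,
                           std::shared_ptr<arrow::cuda::CudaBuffer>* out) {
  return context->OpenIpcBuffer(handle, out);
}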
CudaBufferReader / CudaBufferWriter
File interface for reading and writing to buffers. This could be kind of interesting when we make cudf-io its own thing, but
in all honesty, we usually have the buffers in GPU memory, and I am not sure we would use these kinds of APIs to access a buffer.
This does make it compatible with the Arrow IO APIs. So one benefit is that in theory we could use this to easily plug into stuff they are
doing. The reason this last point is less interesting is because so far we find that it is faster by orders of magnitude to write parquet readers
and csv readers from scratch, so why would we want to use this interface?
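Still, for completeness, a sketch of a round trip through the file interface, assuming the same Status-returning API (buffered writes and Result-based variants exist in other releases):

#include <arrow/gpu/cuda_api.h>
#include <cstdint>

// Write a small host payload into a CudaBuffer and read it back through the
// file-like reader/writer interfaces.
arrow::Status RoundTrip(const std::shared_ptr<arrow::cuda::CudaBuffer>& device_buffer) {
  const char payload[] = "hello from host";

  arrow::cuda::CudaBufferWriter writer(device_buffer);
  ARROW_RETURN_NOT_OK(writer.Write(payload, sizeof(payload)));
  ARROW_RETURN_NOT_OK(writer.Close());

  arrow::cuda::CudaBufferReader reader(device_buffer);
  char readback[sizeof(payload)];
  int64_t bytes_read = 0;
  ARROW_RETURN_NOT_OK(reader.Read(sizeof(payload), &bytes_read, readback));
  return arrow::Status::OK();
}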