CUDA provides a small, fast memory region for its threads called shared memory. As the name suggests, this memory is visible to all threads within a block. We want to use this property to have the threads of a block load data from global memory into shared memory, work on it together, and afterwards write the result back to global memory, avoiding repeated accesses to global memory. Nevertheless, there are some rules one needs to respect for high performance.
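A minimal sketch of this pattern is a block-wide sum reduction: each block reads a tile of the input into shared memory with a single coalesced load, the threads reduce the tile cooperatively, and only one value per block is written back to global memory. Kernel and variable names here are illustrative, not from any particular codebase.

```cuda
#include <cstdio>

// Block-wide sum reduction over a tile held in shared memory.
// Assumes the kernel is launched with 256 threads per block.
__global__ void blockSum(const float *in, float *partial, int n)
{
    __shared__ float tile[256];          // one element per thread

    int tid = threadIdx.x;
    int i   = blockIdx.x * blockDim.x + tid;

    // One coalesced read from global memory into shared memory.
    tile[tid] = (i < n) ? in[i] : 0.0f;
    __syncthreads();                     // wait until the tile is fully populated

    // Tree reduction entirely in shared memory.
    for (int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
        if (tid < stride)
            tile[tid] += tile[tid + stride];
        __syncthreads();                 // all adds of this round must finish first
    }

    // A single write back to global memory per block.
    if (tid == 0)
        partial[blockIdx.x] = tile[0];
}
```

Note the `__syncthreads()` barriers: without them, some threads could read tile entries that other threads have not yet written.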
At launch time we specify our grid dimensions. During execution, the threads of a block are executed in fixed-size groups of 32 called warps. Warps are never formed across block boundaries, and the threads of a warp always have consecutive threadIdx values. These threads all execute together in SIMD style as long as there is no branching. For example, an if statement whose condition differs between threads of the same warp leads to serialized execution: the warp runs each branch path in turn, with the threads not on that path masked off. This effect is known as warp divergence.
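The difference between a divergent and a warp-uniform branch can be sketched as follows; both kernels produce the same output, but only the first one forces the hardware to serialize the two paths (kernel names are placeholders for illustration).

```cuda
// Divergent branch: even and odd lanes of the SAME warp take different
// paths, so the warp executes both sides one after the other.
__global__ void divergentBranch(float *out)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (threadIdx.x % 2 == 0)          // differs between lanes of a warp
        out[i] = 1.0f;
    else
        out[i] = 2.0f;
}

// Warp-uniform branch: the condition is the same for all 32 lanes of a
// warp, so each warp takes exactly one path and nothing is serialized.
__global__ void uniformBranch(float *out)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if ((threadIdx.x / 32) % 2 == 0)   // constant within each warp
        out[i] = 1.0f;
    else
        out[i] = 2.0f;
}
```

When branching is unavoidable, arranging the data or indexing so that the condition is uniform per warp, as in the second kernel, avoids the divergence penalty.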