A CPU is a latency-oriented device designed to solve complex problems quickly using a few powerful cores with sophisticated control logic. A GPU is a throughput-oriented device with many simpler cores, designed to work on large datasets in parallel by dividing them into multiple chunks.
A GPU can exploit data parallelism very well, which is the main reason behind the speed-up. However, data transfer between RAM and VRAM is slow, so if the overall problem has a small dataset, the transfer time can outweigh the gains from parallelization.
CUDA C provides a convenient framework for writing functions, called kernels, that run directly on the GPU.
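As a minimal sketch (the kernel name, dataset size, and launch configuration below are illustrative, not from the original text), a kernel is marked with `__global__`, data is moved between RAM and VRAM with `cudaMemcpy`, and the kernel is launched with the `<<<blocks, threads>>>` syntax:

```cuda
#include <cuda_runtime.h>
#include <stdio.h>

// Kernel: each thread handles one pair of elements (data parallelism).
__global__ void vecAdd(const float *a, const float *b, float *c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) c[i] = a[i] + b[i];
}

int main(void) {
    const int n = 1 << 20;
    size_t bytes = n * sizeof(float);

    float *h_a = (float *)malloc(bytes);
    float *h_b = (float *)malloc(bytes);
    float *h_c = (float *)malloc(bytes);
    for (int i = 0; i < n; ++i) { h_a[i] = 1.0f; h_b[i] = 2.0f; }

    float *d_a, *d_b, *d_c;
    cudaMalloc(&d_a, bytes); cudaMalloc(&d_b, bytes); cudaMalloc(&d_c, bytes);

    // RAM -> VRAM transfers: this is the overhead that can dominate for small datasets.
    cudaMemcpy(d_a, h_a, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, h_b, bytes, cudaMemcpyHostToDevice);

    int threadsPerBlock = 256;
    int blocks = (n + threadsPerBlock - 1) / threadsPerBlock;
    vecAdd<<<blocks, threadsPerBlock>>>(d_a, d_b, d_c, n);

    // VRAM -> RAM transfer of the result.
    cudaMemcpy(h_c, d_c, bytes, cudaMemcpyDeviceToHost);
    printf("c[0] = %f\n", h_c[0]);

    cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);
    free(h_a); free(h_b); free(h_c);
    return 0;
}
```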
A modern GPU has three main components: streaming processors (CUDA cores), memory, and control logic. CUDA cores are grouped into Streaming Multiprocessors (SMs), and memory is divided into registers, shared memory, and global memory.
There are five types of memory in a CUDA device: Global Memory, Local Memory, Constant Memory, Shared Memory, and Registers.
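A rough sketch of how these memory spaces appear in CUDA C (the kernel and variable names are illustrative): automatic variables normally live in registers (spilling to local memory), `__shared__` declares shared memory visible to one block, `__constant__` declares constant memory, and pointers obtained from `cudaMalloc` refer to global memory:

```cuda
#include <cuda_runtime.h>

// Constant memory: read-only in kernels, set from the host with cudaMemcpyToSymbol.
__constant__ float scale;

__global__ void memorySpaces(const float *in, float *out, int n) {
    // Shared memory: visible to all threads in the same block.
    __shared__ float tile[256];

    // Automatic variables live in registers (or local memory if they spill).
    int i = blockIdx.x * blockDim.x + threadIdx.x;

    if (i < n) {
        tile[threadIdx.x] = in[i];            // read from global memory
        __syncthreads();
        out[i] = tile[threadIdx.x] * scale;   // write to global memory
    }
}
```

On the host, `scale` would be initialized with `cudaMemcpyToSymbol`, and `in`/`out` would be allocated with `cudaMalloc`.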
When a kernel is launched, all threads in a block are assigned to the same SM at once. Once a block is assigned to an SM, it is divided into 32-thread units called warps. There are usually more threads assigned to an SM than it has cores; this is done so that the GPU can tolerate long-latency operations (such as global memory accesses) by switching to other warps that are ready to execute.
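A small sketch of how a launch configuration maps onto warps (the numbers and kernel name are illustrative): a block of 256 threads is split by the SM into 256 / 32 = 8 warps, and inside a kernel the warp and lane of the current thread can be derived from `threadIdx` and the built-in `warpSize`:

```cuda
#include <cuda_runtime.h>
#include <stdio.h>

__global__ void warpInfo(void) {
    // warpSize is a built-in device variable (32 on current GPUs).
    int warpId = threadIdx.x / warpSize;   // which warp within the block
    int laneId = threadIdx.x % warpSize;   // position within that warp
    if (laneId == 0)
        printf("block %d, warp %d starts at thread %d\n",
               blockIdx.x, warpId, threadIdx.x);
}

int main(void) {
    // One block of 256 threads is scheduled on an SM as 8 warps.
    warpInfo<<<1, 256>>>();
    cudaDeviceSynchronize();  // wait so the device-side printf output is flushed
    return 0;
}
```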
SIMD execution on a warp means that, for optimum performance, all threads in a warp should follow the same execution path (control flow), i.e., there should be no control divergence among the threads of a warp.
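As an illustrative sketch (the kernels below are not from the original text), branching on a condition that differs between threads of the same warp causes divergence, while branching on a condition that is uniform across each warp does not:

```cuda
__global__ void divergent(float *data) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;

    // Divergent: threads within the same warp take different branches,
    // so the warp must execute both paths one after the other.
    if (threadIdx.x % 2 == 0)
        data[i] *= 2.0f;
    else
        data[i] += 1.0f;
}

__global__ void uniform(float *data) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;

    // No divergence: the condition is the same for every thread in a warp
    // (the branch granularity is whole warps), so each warp follows one path.
    if ((threadIdx.x / warpSize) % 2 == 0)
        data[i] *= 2.0f;
    else
        data[i] += 1.0f;
}
```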
Several predefined CUDA API functions can be used to query the resources available on a GPU.
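For example, `cudaGetDeviceCount` and `cudaGetDeviceProperties` report the number of devices and the resources of each one; the sketch below prints a small subset of the fields in `cudaDeviceProp`:

```cuda
#include <cuda_runtime.h>
#include <stdio.h>

int main(void) {
    int deviceCount = 0;
    cudaGetDeviceCount(&deviceCount);

    for (int d = 0; d < deviceCount; ++d) {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, d);

        printf("Device %d: %s\n", d, prop.name);
        printf("  SMs:                     %d\n", prop.multiProcessorCount);
        printf("  Warp size:               %d\n", prop.warpSize);
        printf("  Max threads per block:   %d\n", prop.maxThreadsPerBlock);
        printf("  Shared memory per block: %zu bytes\n", prop.sharedMemPerBlock);
        printf("  Global memory:           %zu bytes\n", prop.totalGlobalMem);
    }
    return 0;
}
```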