In a matter of just a few years, the programmable graphics processing unit (GPU) has evolved into a computing workhorse, delivering over 400 billion operations per second (roughly 400 times the throughput of a CPU) and memory bandwidth of over 80 GB/s (about 15 times that of CPU memory). The main reason behind this evolution is that the GPU is specialized for compute-intensive, highly parallel computation, and is therefore designed so that more transistors are devoted to data processing rather than to data caching and flow control.
More specifically, the GPU is especially well suited to problems that can be expressed as data-parallel computations, in which the same program is executed on many data elements in parallel, with high arithmetic intensity, that is, a high ratio of arithmetic operations to memory operations. Because the same program is executed for each data element, there is a lower requirement for sophisticated flow control; and because it is executed on many data elements with high arithmetic intensity, memory access latency can be hidden with computation instead of large data caches.
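(As a minimal sketch of this data-parallel model, not drawn from the talk itself: the CUDA program below launches one thread per data element, and every thread runs the same program, here a SAXPY update y = a*x + y. The kernel name, problem size, and launch parameters are illustrative.)

#include <cuda_runtime.h>
#include <stdio.h>

// Each thread applies the same program to one data element:
// the essence of the data-parallel model described above.
__global__ void saxpy(int n, float a, const float *x, float *y)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        y[i] = a * x[i] + y[i];  // one multiply-add per element
}

int main(void)
{
    const int n = 1 << 20;             // one million elements (illustrative)
    size_t bytes = n * sizeof(float);

    float *x, *y;
    cudaMallocManaged(&x, bytes);      // unified memory, for brevity
    cudaMallocManaged(&y, bytes);
    for (int i = 0; i < n; ++i) { x[i] = 1.0f; y[i] = 2.0f; }

    // Launch one thread per element, 256 threads per block.
    saxpy<<<(n + 255) / 256, 256>>>(n, 2.0f, x, y);
    cudaDeviceSynchronize();

    printf("y[0] = %f\n", y[0]);       // expect 4.0
    cudaFree(x);
    cudaFree(y);
    return 0;
}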
With multiple cores driven by very high memory bandwidth, today's GPUs offer formidable resources for both graphics and non-graphics processing. In particular, GPUs have been used as high-performance co-processors, executing the computationally intensive parts of classical high-performance applications in fluid dynamics, bioinformatics, high-energy physics, and other fields. These applications have been shown to achieve orders-of-magnitude speedups over their performance on high-end multi-core CPUs.
In this talk I will describe the basics of GPU hardware and the GPU programming model, and will show some real-life examples based on the NVIDIA CUDA model.