GPU > CPU. FOR HIGH PERFORMANCE COMPUTING. PRESENTATION BY: SADIQ PASHA, CHETHANA DILIP
INTRODUCTION With the exponential increase in the computational power of today's hardware, the complexity of the problems we are trying to solve has also increased, from the design and simulation of complex aerodynamics to the simulation of public response during a crisis. The computational power required is indeed phenomenal.
How about using conventional CPUs? 1. It is logical to suggest that we could use multiple CPUs to increase calculation throughput. After all, CPUs have been tried and tested since the dawn of computing. 2. Using n CPUs to meet our requirements does sound like a legitimate solution. 3. CPUs have much better memory capabilities and are more efficient at scheduling and managing the tasks performed by the computer. 4. They are also capable of very quick and efficient decision making. But is that enough to qualify CPUs for high performance computing?
Meet the contender! 1. The GPU (Graphics Processing Unit) seems to be the solution to all our computationally intensive requirements. 2. The GPU will soon become a highly efficient PROCESSING FARM, with multiple GPUs performing the computationally heavy functions and returning the processed data to the CPU. 3. CPU cores will still be required to act as managers, scheduling and coordinating the intensive work being carried out by the GPUs. 4. The CPU becomes the brain of the system and the GPU the sheer muscle power, leaving the CPU to do what it does best.
Is it REALLY possible? 1. Currently, the GPU in a computer sits on a PCIe slot, surrounded by a few GB of very fast DDR3 or GDDR5 memory. 2. It does seem simple enough (and more efficient) to ditch the PCIe slot and put the complete hardware in a tightly coupled arrangement with the CPU. This tight CPU/GPU coupling is AMD's current plan for high performance supercomputing. 3. This technology, aptly called CPU-assisted general computing on a GPU, is a fused architecture that allows the CPU and the GPU to collaborate through a FUSED L3 cache. Additionally, the CPU and the GPU use the same shared off-chip memory. 4. This approach increases the computational power of the GPU while taking advantage of the CPU's ability to handle complex tasks and data handling.
What makes a GPU so good? 1. GPUs are very good at handling a large number of parallel processes, especially where the same operation has to be applied to a large amount of data. 2. The long pipelines of GPUs favor sequential streaming reads, where the number of operations to be performed is far greater than the number of memory accesses required. 3. The GPU relies on the CPU's faster memory access to feed it data. 4. This implies that the GPU only has to access the shared L3 cache, reducing the latency caused by GPU memory access.
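The "same operation over a large amount of data" pattern described above can be sketched in plain Python. This is an illustrative model only, not real GPU code: the `saxpy_kernel` and `launch` names are hypothetical, and the loop stands in for a grid of hardware threads that a real GPU would run in parallel.

```python
# Illustrative sketch of the GPU data-parallel pattern: one logical
# "thread" per array element, all applying the same kernel function.
# (Plain Python model; a real GPU would run these in hardware parallel.)

def saxpy_kernel(i, a, x, y):
    # The same arithmetic is applied at every index i.
    return a * x[i] + y[i]

def launch(kernel, n, *args):
    # Model of a grid launch: one kernel invocation per element.
    return [kernel(i, *args) for i in range(n)]

x = [1.0, 2.0, 3.0, 4.0]
y = [10.0, 20.0, 30.0, 40.0]
result = launch(saxpy_kernel, len(x), 2.0, x, y)
print(result)  # [12.0, 24.0, 36.0, 48.0]
```

Because each index is computed independently, the work partitions cleanly across as many cores as are available, which is exactly the property GPUs exploit.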
Supercomputing The TITAN supercomputer uses AMD OPTERON cores and the NVIDIA TESLA series of GPUs. 1. The TESLA is based on the new Kepler architecture, the most recent update of the Fermi architecture. 2. Kepler improves on Fermi in efficiency, programmability and performance. 3. The AMD OPTERON cores use the AMD Bulldozer architecture. There are many changes and enhancements over the Intel Xeon architecture that make the Opteron more desirable. 4. The Opteron has an Integrated Memory Controller that governs CPU access to both the L3 cache and main memory, as opposed to the Xeon, which uses two separate buses for memory and processor traffic.
Right architecture for supercomputing A "module" has 213 million transistors in an area of 30.9 mm² (including the 2 MB shared L2 cache). Each "module" has the following independent hardware resources:
2 MB of L2 cache per module.
Two dedicated integer clusters.
Two symmetrical 128-bit FMAC floating point pipelines per module, which can be unified into one large 256-bit-wide unit if one of the integer cores dispatches an AVX instruction.
All modules share the L3 cache as well as an Advanced Dual-Channel Memory Sub-System (IMC, Integrated Memory Controller).
Process technology: 11-metal-layer 32 nm SOI process.
Cache and memory interface: up to 8 MB of L3 shared among all cores on the same silicon die, divided into four sub-caches of 2 MB each, capable of operating at 2.2 GHz.
Pictorial representation of GPU architecture.
Advantages over the Intel Xeon. 1. The Intel Xeon is the core CPU used in the Tianhe-1A supercomputer. There are major differences in the way a GPU works that give it an advantage over the Xeon architecture. 2. In any conventional CPU, including the Xeon, main memory can be accessed by each individual CPU, while the main memory itself is isolated. 3. The GPU architecture, however, has NON-UNIFORM MEMORY ACCESS (NUMA): instead of a unified main memory, each core has its own memory. Cores can access the memory of sister cells if needed, and this transaction is transparent to the user. 4. Another critical advantage GPU cores have over a conventional CPU is the use of a Switched Fabric rather than a shared bus. In a Xeon system, competition for the shared bus causes efficiency to drop.
Switched Fabric? Figure one shows the conventional shared data bus. It is immediately obvious what problems this architecture faces: only one transaction can use the bus at a given time. In the world of supercomputing and HiPer applications this can be a serious bottleneck. For applications that are not very computationally intensive, the shared data bus is a practical and easy-to-implement solution. But for high performance, the other powerful, albeit difficult, solution is a switched fabric. Here each node is connected to a central fabric board, so no node is dependent on any other node for its read/write operations.
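The bottleneck contrast above can be made concrete with a toy timing model. The numbers and function names are illustrative assumptions, not measurements: a shared bus serializes every transfer, while a switched fabric lets independent transfers proceed concurrently.

```python
# Toy timing model contrasting a shared bus with a switched fabric.
# All figures are illustrative assumptions, not measured values.

def shared_bus_time(transfers, t_per_transfer):
    # Only one node may use the bus at a time, so transfers serialize.
    return len(transfers) * t_per_transfer

def switched_fabric_time(transfers, t_per_transfer):
    # Each node has its own link into the fabric; independent transfers
    # overlap, so total time is just one transfer time.
    return t_per_transfer if transfers else 0

transfers = ["node%d" % i for i in range(8)]
print(shared_bus_time(transfers, 10))       # 80 time units, serialized
print(switched_fabric_time(transfers, 10))  # 10 time units, concurrent
```

Under this model the shared bus cost grows linearly with node count, which is why contention becomes the limiting factor at supercomputing scale.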
Lots of theory, but are there any practical implementations? Let us consider a typical network analysis problem for supercomputing. The challenge is to keep up with the increased traffic of today's large networks (all of it real-time data). Network monitoring applications typically depend on: standard x86 processors, or custom-built ASICs. But is that enough? CPUs do not have the sheer compute power required to keep track of large networks, and as a result end up dropping packets. ASICs can be designed with sufficient power and memory for the job, but the custom architecture is difficult and expensive to program, and so is their ability to work in parallel.
What happens when we replace the CPU with a GPU? This is where all the architectural changes of the GPU really shine through. GPUs have high memory bandwidth and easy programmability. The task of monitoring a network means that all data packets have to be read as they cross the network, which means that data parallelism is the key requirement. As the name implies, GPUs were originally meant to render graphics on a computer. Their architecture, which consists of many cores running in parallel and working in tandem, is perfect for use as coprocessors in tasks that can be made inherently parallel. In the ranking of the top 500 supercomputers at www.top500.org, 38 of the top 50 machines use NVIDIA GPUs to boost their performance.
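The packet-monitoring workload above is data-parallel in exactly the sense the slide describes: the same inspection function runs independently on every packet. A minimal sketch, assuming a hypothetical `inspect` check and using Python threads to stand in for GPU cores:

```python
# Sketch of data-parallel packet monitoring: the same inspection
# function is applied independently to every packet, so packets can
# be partitioned across parallel workers (GPU cores, in the text).
from concurrent.futures import ThreadPoolExecutor

def inspect(packet):
    # Hypothetical per-packet check: flag traffic from a watched subnet.
    return packet["src"].startswith("10.0.")

packets = [
    {"src": "10.0.1.5"},
    {"src": "192.168.0.2"},
    {"src": "10.0.9.7"},
]

# Each packet is inspected independently; no worker waits on another.
with ThreadPoolExecutor(max_workers=4) as pool:
    flags = list(pool.map(inspect, packets))

print(flags)  # [True, False, True]
```

Because no packet's result depends on any other's, the work scales out to however many cores are available, which is the property that lets GPUs keep up where CPUs drop packets.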
How about at a more commercial level? We have considered the advantages of using a GPU for high performance applications, but what about at a consumer level? Do Intel, NVIDIA and AMD make hybrids between CPUs and GPUs? Let us consider the best of both genres. For the GPU we shall take the NVIDIA GTX 780 Ti, which is loosely based on the same architecture as the Titan supercomputer. The CPUs are represented by the Intel i7 4th-generation processor with the Haswell architecture. The GPU costs around $650; the CPU, around $350. Creating a machine that integrates both the GPU and the CPU would cost around $3,500. This is phenomenal, considering that a suitable HiPer machine should not cost more than $1,500.
A couple of statistics (Intel Core i7-4771; stock figures, without overclocking):
Processor Number: i7-4771
# of Cores: 4
# of Threads: 8
Clock Speed: 3.5 GHz
Max Turbo Frequency: 3.9 GHz
Cache: 8 MB
Instruction Set: 64-bit
Max Memory Size (dependent on memory type): 32 GB
Memory Types: DDR3-1333/1600
Max Memory Bandwidth: 25.6 GB/s
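The quoted 25.6 GB/s peak bandwidth can be sanity-checked from the memory type itself: DDR3-1600 performs 1600 million transfers per second over a 64-bit (8-byte) channel, and the platform runs two channels.

```python
# Sanity check of the quoted 25.6 GB/s peak bandwidth for DDR3-1600:
# transfers/s * bytes per transfer (64-bit channel) * channel count.
transfer_rate = 1600e6   # DDR3-1600: 1600 MT/s
bytes_per_transfer = 8   # one 64-bit-wide channel
channels = 2             # dual-channel configuration

bandwidth_gb_s = transfer_rate * bytes_per_transfer * channels / 1e9
print(bandwidth_gb_s)  # 25.6
```

The arithmetic reproduces the spec-sheet figure exactly, confirming the table's number is the theoretical dual-channel peak rather than a measured value.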
Problems faced by GPUs There are three fundamental problems when using GPUs. 1. Power consumption. This is the biggest concern when integrating GPUs with a CPU. GPUs are immense power sinks: running so many cores has a disastrous effect on power efficiency. An i7 4th-gen processor needs 84 W of power; in contrast, the GTX 780 Ti needs a MINIMUM of 250 W and a recommended power supply of 600 W. Naturally, the power-hungry GPU also poses a serious temperature concern when running over prolonged periods of time. 2. Error detection and correction. Mass-produced GPUs are usually intended for gaming, and it is pointless to engineer them to detect and identify hardware problems; that task is usually performed by a more optimized CPU. However, GPUs with this hardware are being developed for HiPer applications.
Problems (contd.) 3. The major GPU manufacturers right now are NVIDIA and ATI. It is a monumental task for them to take over the market from the established CPU manufacturers Intel and AMD. The current feature size of GPUs is nowhere near as small as Haswell's 22 nm. Unfortunately, GPU manufacturers and designers are FABLESS industries that specialize in the design of their products; the actual fabrication is done by third-party companies. Decreasing the feature size under this business model is unrealistic, because the smallest feature they can make is dictated by the fabricators' manufacturing process. Intel and AMD, on the other hand, are full-fledged IDMs with their own fabrication facilities and the capital to absorb an ambitious but doomed project.
QUESTIONS??