GPU > CPU: FOR HIGH PERFORMANCE COMPUTING. PRESENTATION BY SADIQ PASHA, CHETHANA DILIP

INTRODUCTION: With the exponential increase in the computational power of today's hardware, the complexity of the problems we are trying to solve has also increased, from the design and simulation of complex aerodynamics to the simulation of public response during a crisis. The computational power required is indeed phenomenal.

How about using conventional CPUs? 1. It is logical to suggest that we could use multiple CPUs to increase calculation throughput; after all, CPUs have been tried and tested since the dawn of computing. 2. Using n CPUs to meet our requirements does sound like a legitimate solution. 3. CPUs have much better memory capabilities and are more efficient at scheduling and managing the tasks performed by the computer. 4. They are also capable of very quick and efficient decision making. But is that enough to qualify CPUs for high performance computing?
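As a sketch of point 1, distributing independent work across n workers looks like this in Python (an illustration of the pattern, not from the slides; real CPU scaling would use processes rather than Python threads, which share one interpreter):

```python
# Minimal sketch of the "use n CPUs" idea: split independent work
# items across a pool of n workers and collect the results.
from concurrent.futures import ThreadPoolExecutor

def heavy_task(x):
    # Stand-in for an expensive, independent computation.
    return sum(i * i for i in range(x))

# "n = 4 CPUs": four workers each take a share of the eight tasks.
with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(heavy_task, [1000] * 8))

print(len(results))   # 8
```

The pattern only pays off because each task is independent; the moment tasks must coordinate, scheduling overhead eats into the gain, which is exactly the management role the slides assign to the CPU.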

Meet the contender! 1. The GPU (graphics processing unit) seems to be the solution to all our computationally intensive requirements. 2. The GPU will soon become a highly efficient PROCESSING FARM, with multiple GPUs performing the computationally heavy functions and returning the processed data to the CPU. 3. CPU cores will still be required to act as managers and to control the majority of the intensive work being carried out by the GPU. 4. The CPU becomes the brain of the system and the GPU the sheer muscle power, leaving the CPU to do what it does best.

Is it REALLY possible? 1. Currently the GPU in a computer sits in a PCIe slot, surrounded by a few GB of very fast DDR3/GDDR5 memory. 2. It does seem simple enough (and more efficient) to ditch the PCIe slot and put the complete hardware in a tightly coupled arrangement with the CPU. This tight CPU/GPU coupling is AMD's current plan for high performance supercomputing. 3. This technology, aptly called CPU-assisted general computing on a GPU, is a fused architecture that allows the CPU and the GPU to collaborate through a FUSED L3 cache. Additionally, the CPU and the GPU use the same shared off-chip memory. 4. This approach increases the computational power of the GPU while taking advantage of the CPU's ability to handle complex tasks and data handling.

What makes a GPU so good? 1. GPUs are very good at handling large numbers of parallel processes, especially where the same operation has to be applied to a large amount of data. 2. The long pipelines of GPUs favor sequential streaming reads, where the number of operations to be performed is far greater than the number of memory accesses required. 3. The GPU relies on the CPU's faster memory access to feed it data. 4. This implies that the GPU only has to access the shared L3 cache, reducing the latency caused by GPU memory access.
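The data-parallel pattern of point 1 can be sketched with a SAXPY loop (my own illustration, not from the slides; on a GPU, each iteration below would run as an independent hardware thread of a kernel):

```python
# The same operation applied to every element of a large array:
# the workload shape GPUs are built for. Each index i is independent,
# so a GPU can assign one thread per i.
def saxpy(a, x, y):
    """Compute y[i] = a * x[i] + y[i] for every i."""
    return [a * xi + yi for xi, yi in zip(x, y)]

x = [1.0, 2.0, 3.0, 4.0]
y = [10.0, 20.0, 30.0, 40.0]
print(saxpy(2.0, x, y))   # [12.0, 24.0, 36.0, 48.0]
```

Because no element depends on any other, the arithmetic-to-memory-access ratio, not coordination, becomes the limiting factor, which is the point the slide makes about streaming reads.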

Supercomputing. The TITAN supercomputer uses AMD Opteron cores and the NVIDIA Tesla series of GPUs. 1. The Tesla is based on the new Kepler architecture, the most recent update of the Fermi architecture. 2. Kepler improves on Fermi in efficiency, programmability and performance. 3. The AMD Opteron cores, however, use the AMD Bulldozer architecture. There are many changes and enhancements over the Intel Xeon architecture that make the Opteron more desirable. 4. The Opteron has an integrated memory controller that governs CPU access to both the L3 cache and main memory, as opposed to the Xeon, which uses two separate buses for memory-to-memory and memory-to-processor traffic.

Right architecture for supercomputing. A "module" has 213 million transistors in an area of 30.9 mm² (including the 2 MB shared L2 cache). Each "module" has the following independent hardware resources: 2 MB of L2 cache per module; two dedicated integer clusters; and two symmetrical 128-bit FMAC floating-point pipelines per module, which can be unified into one large 256-bit-wide unit if one of the integer cores dispatches an AVX instruction. All modules share the L3 cache as well as an advanced dual-channel memory subsystem (IMC, integrated memory controller). Process technology: 11-metal-layer 32 nm SOI process. Cache and memory interface: up to 8 MB of L3 shared among all cores on the same silicon die, divided into four sub-caches of 2 MB each, capable of operating at 2.2 GHz.

Pictorial representation of GPU architecture. [figure not reproduced in this transcript]

Advantages over the Intel Xeon. 1. The Intel Xeon is the core CPU used in the Tianhe-1A supercomputer. There are major differences in the way a GPU works that give it an advantage over the Xeon architecture. 2. In any conventional CPU, including the Xeon, the main memory can be accessed by each individual CPU; the main memory itself is isolated. 3. The GPU architecture, however, has NON-UNIFORM MEMORY ACCESS (NUMA): instead of a unified main memory, each core has its own memory. The cores can access the memory of sister cells if needed, and this transaction is transparent to the user. 4. Another critical advantage the GPU cores have over a conventional CPU is the use of a switched fabric rather than a shared bus. In a Xeon system, competition for the shared bus causes efficiency to drop.

Switched fabric? Figure one shows the conventional shared data bus. The problem faced by this architecture is immediately obvious: only one instruction can access the bus at a given time. In the world of supercomputing and HPC applications this can be a serious bottleneck. For applications that are not very computationally intensive, the shared data bus is a practical and easy-to-implement solution. But for high performance, the other, more powerful albeit more difficult, solution is a switched fabric. Here each node is connected to a central fabric board, so no node depends on any other node for its read/write operations.
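The difference can be seen with a toy timing model (a hypothetical illustration, not from the slides): on a shared bus, transfers serialize, while a switched fabric lets disjoint node pairs transfer at the same time.

```python
# Toy model: four transfers between four disjoint node pairs, with
# hypothetical durations in arbitrary time units.
transfer_times = [4, 3, 5, 2]

# Shared bus: only one transfer may use the bus at a time, so the
# total time is the sum of all transfers.
shared_bus_time = sum(transfer_times)   # 14 time units

# Switched fabric: disjoint pairs proceed concurrently, so the total
# time is just the longest single transfer.
fabric_time = max(transfer_times)       # 5 time units

print(shared_bus_time, fabric_time)     # 14 5
```

The model ignores switching overhead and contention for the same endpoint, but it captures why bus competition caps throughput as node counts grow.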

Lots of theory, but are there any practical implementations? Let us consider a typical network-analysis problem for supercomputing. The challenge is to keep up with the increased traffic of today's large networks (all of which deal with real-time data). Network monitoring applications typically depend on: standard x86 processors, or custom-built ASICs. But is that enough? CPUs do not have the sheer compute power required to keep track of large networks, and as a result they end up dropping packets. ASICs can be designed with sufficient power and memory for the job, but their custom architecture is difficult and expensive to program, and the same goes for making them work in parallel.

What happens when we replace the CPU with a GPU? This is where all the architectural changes of the GPU really shine through. GPUs have high memory bandwidth and easy programmability. Monitoring a network means that all data packets have to be read as they cross the network, which makes data parallelism the key requirement. As the name implies, GPUs were originally meant to render graphics on a computer. Their architecture, which consists of many cores running in parallel and working in tandem, is perfect for use as coprocessors in tasks that can be made inherently parallel. In the ranking of the top 500 supercomputers at www.top500.org, 38 of the top 50 machines use NVIDIA GPUs to boost their performance.
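A minimal sketch of why packet inspection parallelizes so well (the port rule and packet format here are invented for illustration): every packet gets the same independent check, so the stream can be split across any number of cores or GPU threads.

```python
# Each packet is inspected by the same rule, independently of every
# other packet -- the same data-parallel shape as a GPU kernel.
from concurrent.futures import ThreadPoolExecutor

SUSPICIOUS_PORT = 31337   # hypothetical example rule

def inspect(packet):
    # Same test applied to every packet, with no shared state.
    return packet["dst_port"] == SUSPICIOUS_PORT

packets = [{"dst_port": p} for p in (80, 443, 31337, 22, 31337)]
with ThreadPoolExecutor(max_workers=4) as pool:
    flags = list(pool.map(inspect, packets))

print(flags)   # [False, False, True, False, True]
```

Because no inspection depends on another, adding workers scales throughput until memory bandwidth, not compute, becomes the limit, which is where the GPU's bandwidth advantage matters.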

How about at a more commercial level? We have considered the advantages of using a GPU for high-performance applications, but what about at a consumer level? Do Intel, NVIDIA and AMD make hybrids between CPUs and GPUs? Let us consider the flagships of both genres. For the GPU we shall consider the NVIDIA GTX 780 Ti, which is loosely based on the same architecture as the Titan supercomputer. The CPU is represented by the Intel i7 4th-generation processor with the Haswell architecture. The GPU costs around $650; the CPU, around $350. Creating a machine that integrates both the GPU and the CPU would cost around $3500, which is phenomenal considering that a suitable HPC machine should not cost more than $1500.

Couple of statistics. Intel Core i7-4771 (stock figures, without overclocking):
Processor number: i7-4771
Cores: 4
Threads: 8
Clock speed: 3.5 GHz
Max turbo frequency: 3.9 GHz
Cache: 8 MB
Instruction set: 64-bit
Max memory size (dependent on memory type): 32 GB
Memory types: DDR3-1333/1600
Max memory bandwidth: 25.6 GB/s
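The 25.6 GB/s bandwidth figure can be cross-checked from the memory spec: dual-channel DDR3-1600 performs 1600 million transfers per second on each 64-bit (8-byte) channel.

```python
# Sanity check of the quoted 25.6 GB/s maximum memory bandwidth.
transfers_per_sec = 1600e6   # DDR3-1600: 1600 mega-transfers/s
bytes_per_transfer = 8       # each channel is 64 bits wide
channels = 2                 # dual-channel configuration

bandwidth_gb_s = transfers_per_sec * bytes_per_transfer * channels / 1e9
print(bandwidth_gb_s)   # 25.6
```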

Problems faced by GPUs. There are three fundamental problems when using GPUs. 1. Power consumption. This is the biggest concern when integrating GPUs with a CPU. GPUs are immense power sinks: running so many cores has a disastrous effect on power efficiency. An i7 4th-generation processor needs 84 W of power; in contrast, the GTX 780 Ti needs a MINIMUM of 250 W and a recommended power supply of 600 W. Naturally, the power-hungry GPU also poses a huge temperature concern when running over prolonged periods of time. 2. Error detection and correction. Mass-produced GPUs are usually intended for gaming, and it is pointless to engineer them to detect and identify hardware problems; that task is usually performed by a more optimized CPU. However, GPUs with this hardware are being developed for HPC applications.

Problems (contd.) 3. Manufacturing. The major GPU manufacturers right now are NVIDIA and ATI. It is a monumental task for them to take over the market from the established CPU manufacturers Intel and AMD. The current feature size of GPUs is nowhere near as small as Haswell's, which is at 22 nm. Unfortunately, GPU manufacturers and designers are FABLESS companies that specialize in the design of their products; the actual fabrication is done by third-party companies. Decreasing the feature size with this business model is unrealistic, because the smallest feature they can make is dictated by the manufacturing processes of the fabricators. Intel and AMD, on the other hand, are full-fledged IDMs with their own fabrication facilities and the capital to cover an ambitious but doomed project.

QUESTIONS??