
Review: Big Iron
- Today's topics: vector processors and supercomputers
  » vector processors came first; they live on today as GPGPUs (figure source: text Appendix F)
- Supercomputers: lots of microprocessors with a fancy interconnect
  » a look at the top500
- Datacenter/cloud computing: lots of blades with a fancy interconnect AND fancy storage systems (this is not DRAM!)

Roadblocks to Parallelism
- Wide issue & deep pipelines
  » dynamic OOO issue: a huge number of instructions in flight; quadratic circuit complexity to keep track of everything (forwarding, ROB size, number of registers); power density kills you; and performance is still limited by the ILP in the program
  » VLIW: the compiler does most of the scheduling work, but there is still a huge number of instructions in flight; power density is still a problem (this will continue to be a common theme); performance is also limited by ILP
- Enhancing parallelism: multiple threads, cores, and sockets
  » the main game today
  » might be easier to build than to program

1st Supercomputers
- Vector machines are often attributed to Seymour Cray, but he says:
  "I'm certainly not inventing vector processors. There are three kinds that I know of existing today. They are represented by the Illiac-IV, the (CDC) Star processor, and the TI (ASC) processor. Those three were all pioneering processors.... One of the problems of being a pioneer is you always make mistakes and I never, never want to be a pioneer. It's always best to come second when you can look at the mistakes the pioneers made."
  » from a talk at LLNL in 1976 on the introduction of the CRAY-1

Replace Loops w/ Vector Instructions
- An alternative programming model with two data types
  » scalar and vector
- Not wildly dissimilar to map-reduce (the Google reinvention)
  » map sub-problems onto some set of resources
  » reduce/combine the sub-problem results into the final answer
- APL: Iverson's 1962 book
  » +/(1, 2, 3) = 6

Vector-Vector Add
- Conventional model
  » 2 pointers to the heads of the two vectors
  » offset with a loop variable: A[i] + B[i] for all i
- Vector model (sketched in C below)
  » VADD A, B: one instruction does a lot of work
  » no loop or instruction-decode overhead
  » hazard checking is only required between vector instructions
- Issues
  » each vector has to be contiguous
  » the machine has a native vector length (64 was common); pad if the actual vector length does not come in chunks of 64
  » scientific programmers embraced the vector model, but how do you write a web browser?
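To make the scalar/vector contrast concrete, here is a minimal C sketch (not from the slides) of the conventional loop next to a routine that plays the role of a single VADD. vadd64 is a hypothetical stand-in: on a real vector machine the whole 64-element add would be one instruction, with no per-element loop, branch, or decode overhead, and hazard checks done once per vector instruction.

    #include <stddef.h>

    #define VLEN 64   /* native vector length; 64 was a common choice */

    /* Conventional model: two pointers and a loop variable, paying
     * loop/branch/decode overhead and hazard checks on every element. */
    void scalar_add(double *c, const double *a, const double *b, size_t n) {
        for (size_t i = 0; i < n; i++)
            c[i] = a[i] + b[i];
    }

    /* Vector model: hypothetical stand-in for one VADD over contiguous
     * 64-element vectors; in hardware this loop is a single instruction. */
    void vadd64(double c[VLEN], const double a[VLEN], const double b[VLEN]) {
        for (int i = 0; i < VLEN; i++)
            c[i] = a[i] + b[i];
    }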

2001: A Vector Odyssey
- Vector machines were out of fashion
- 2002: Japan's Earth Simulator announced
  » a "virtual planet" to predict the impact of environmental change on world climate
  » it leads the top500 list
- Widespread US panic at the government level
  » strategic leadership lost? oh woe is us (or U.S.)
  » spurs supercomputer development, including new vector machines from Cray
- Now
  » wide SIMD is alive and well in GPGPUs
  » short SIMD is alive and well in CPUs
  » SIMD = short vector; the same issues apply

Basic Vector Architecture
- 2 parts
  » scalar unit: similar to a normal CPU (OOO: NEC SX/5; VLIW: Fujitsu VPP5000)
  » vector unit: multiple FUs (both int & float), deeply pipelined for high clock frequencies (particularly true of the FPUs); the primary focus for the scientific computing folks
- 2 basic architecture types
  » memory-memory vector processors: the early CDC machines
  » vector-register processors ("vector RISC"): essentially everything since about 1980
    CRAY 1, 2, X-MP, Y-MP, C90, T90, SV1, X1
    NEC SX/2 through SX/8; Fujitsu VP200 through VPP5000; Hitachi S820 and S8300
    Convex C-1 through C-4

Top-Level Vector-Register VMIPS
- (figure: top-level VMIPS organization)
- 64-element vector registers with 2 read ports and 1 write port: is that enough?

Snippet of Real Machines
- (figure: table of representative real vector machines)

VMIPS ISA Snippet 1
- (figure: VMIPS vector instruction listing, part 1)

VMIPS ISA Snippet 2
- (figure: VMIPS vector instruction listing, part 2)

DAXPY: MIPS vs. VMIPS
- Instruction count: 6 for VMIPS vs. almost 600 for MIPS (see the C sketch below)

Performance
- Vector execution time = f(vector length, structural hazards, data hazards)
  » initiation rate: the number of operands consumed or produced per cycle
  » multi-lane architecture: each vector lane can carry n values per cycle, often 2 or more; number of vector lanes * lane width = initiation rate
  » also dependent on pipeline fill and spill
- Convoy (a made-up term): a set of independent vector instructions
  » similar to an EPIC VLIW bundle
- Chime: the time it takes to execute 1 convoy
- Start-up time: the time it takes to load the vector registers and fill the pipe
- All of these contribute to execution time
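To ground the instruction-count claim, here is DAXPY in C with the vector version sketched in the comment. The VMIPS mnemonics follow the textbook's VMIPS ISA (the snippets above); the convoy grouping is an illustrative reading, not a measurement.

    #include <stddef.h>

    /* DAXPY: Y = a*X + Y. A scalar MIPS loop retires several instructions
     * per element, so n = 64 costs almost 600 instructions. The same work
     * is roughly 6 VMIPS instructions:
     *
     *     L.D      F0,a       ; load scalar a
     *     LV       V1,Rx      ; load vector X
     *     MULVS.D  V2,V1,F0   ; vector-scalar multiply
     *     LV       V3,Ry      ; load vector Y
     *     ADDVV.D  V4,V2,V3   ; vector-vector add
     *     SV       V4,Ry      ; store the result
     *
     * Grouped into convoys of independent vector instructions, that is
     * about 3 convoys = 3 chimes, i.e., roughly 3 * 64 cycles for 64
     * elements, plus start-up time to fill the pipes. */
    void daxpy(size_t n, double a, const double *x, double *y) {
        for (size_t i = 0; i < n; i++)
            y[i] = a * x[i] + y[i];
    }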

Vector Memory Systems
- Lots of bandwidth is required to feed lots of XUs
  » very wide data bus
  » banked memory: each bank is independently addressed, not interleaved; multiple loads and stores issue per cycle, and each bank serves a particular load or store assuming no bank conflict (the compiler tries hard to avoid conflicts)
  » latency can be high for DRAM-based memory, but bandwidth can be quite good
  » early CRAY machines used SSRAMs: too expensive today
  » addressing? where are the bank-select bits?

Vector Length Control
- vec.reg.length != operand.vec.size (OVS); MVL = vec.reg.length
- Enter VLR
  » specifies the operand vector size for a vector instruction; the actual vector size is often not known until run time and may even change based on a call parameter (APL: rho(V) gives the length, or the structure for a vector of vectors of ...)
  » controls the XUs and the vector load/store unit
  » VLR value <= vector.reg.length, hence not known until run time; if it is statically known, the compiler can issue an LVLRI immediate
- The compiler generates strip-mine code based on OVS (see the C sketch after this section)
  » rem(OVS, MVL) = the odd-size piece; do the odd-size piece first or last
  » OVS/MVL (truncating divide) = the number of loops/unrolls over MVL-sized chunks

Vector Stride
- What happens when vector elements are not at contiguous addresses?
  » consider MatMul: the same problem shows up in column-major (FORTRASH) or row-major (C) order allocation
- Stride = the address distance between logically adjacent elements
- The solution = a stride operand to the vector load/store unit
  » on reads = gather; on writes = scatter
  » a scatter-gather load/store unit was a key innovation for vector processors
  » the LVWS instruction: load vector with stride (see the MatMul sketch after this section)
  » note: banking conflicts may occur, and start-up time increases
- The CDC Star (an APL-like machine language) is the classic example

Enhancing Vector Performance 1
- Chaining: originally developed in the CRAY-1
  » now fairly ubiquitous
  » the same basic idea as forwarding
- Consider a RAW hazard between vector instructions
  » if you drive the entire vector to the forwarding destination, you essentially chain the XUs together
  » problems? (continued on the next slide)
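The strip-mining pattern above can be written down directly. This is a minimal C sketch, not compiler output: process_chunk is a hypothetical stand-in for the vector instruction sequence issued with VLR set to len; the odd-size rem(OVS, MVL) piece runs first, then full MVL-sized chunks.

    #include <stddef.h>

    #define MVL 64   /* the machine's maximum vector length */

    /* Hypothetical stand-in for a vector operation issued with VLR = len;
     * here it just does a[i] += b[i] over the chunk. */
    static void process_chunk(double *a, const double *b, size_t len) {
        for (size_t i = 0; i < len; i++)
            a[i] += b[i];
    }

    /* Strip-mine a vector of length ovs: rem(ovs, MVL) is the odd-size
     * piece (done first), then ovs/MVL full chunks run with VLR = MVL. */
    void strip_mine(double *a, const double *b, size_t ovs) {
        size_t low = 0;
        size_t vlr = ovs % MVL;          /* odd-size piece; may be 0 */
        while (low < ovs) {
            if (vlr > 0)
                process_chunk(a + low, b + low, vlr);
            low += vlr;
            vlr = MVL;                   /* every later chunk is full */
        }
    }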

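For a concrete non-unit-stride case, here is the MatMul inner product in C (an illustration; N is an assumed dimension): C's row-major layout makes the walk down a column of b touch elements N doubles apart, which is exactly the stride an LVWS-style load-with-stride would be handed to gather the column into a vector register.

    #define N 64   /* illustrative matrix dimension */

    /* a is walked along a row (stride 1); b is walked down a column
     * (stride N elements in row-major C). A scatter-gather load/store
     * unit turns the b accesses into one strided vector load. */
    double dot_row_col(const double a[N][N], const double b[N][N],
                       int i, int j) {
        double sum = 0.0;
        for (int k = 0; k < N; k++)
            sum += a[i][k] * b[k][j];
        return sum;
    }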
Enhancing Vector Performance 1, cont.
- Problems with chaining?
  » the usual one: exceptions, the same barrier to performance as ever
  » solution: chained and unchained modes; debug in unchained mode, then do the real run in chained mode

Enhancing Vector Performance 2
- Conditionals are also a performance barrier
- The solution is similar to the normal CPU case: employ conditional instructions (see the mask sketch below)
  » predication, as in EPIC
  » but borrow the mask idea from early SIMD machines like the ILLIAC IV
  » execute the predicate on the vector to create a mask vector of the same length
  » then do the real operation in masked fashion

Enhancing Vector Performance 3
- Sparse matrix computation: the matrices are too big to store directly, hence use indirection
  » scatter-gather memory access now takes a bit longer; the same idea, however
  » or use a bit vector to indicate the valid entries, and use it as a mask
  » new instructions (though set-up time increases): load and store vector indexed, LVI and SVI (sketched after this section)

Enhancing Vector Performance 4
- Increase the number of lanes: expensive, but it throws more XUs and datapath at the problem
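Here is a minimal C sketch of the mask idea for a conditional loop of the form "if (a[i] != 0) a[i] -= b[i]" (the loop itself is an assumed example). Pass one executes the predicate over the vector and produces a mask of the same length; pass two does the real operation in masked fashion. On real hardware each pass would be a single vector instruction under a vector-mask register, with no branches.

    #define VLEN 64

    /* Pass 1: evaluate the predicate across the vector, building a mask
     * vector of the same length (1 where the condition holds). */
    void build_mask(char mask[VLEN], const double a[VLEN]) {
        for (int i = 0; i < VLEN; i++)
            mask[i] = (a[i] != 0.0);
    }

    /* Pass 2: the real operation runs masked; elements whose mask bit is
     * clear are left untouched. */
    void masked_sub(double a[VLEN], const double b[VLEN],
                    const char mask[VLEN]) {
        for (int i = 0; i < VLEN; i++)
            if (mask[i])
                a[i] = a[i] - b[i];
    }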

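And here is a hedged C sketch of what LVI/SVI-style indexed access buys for the sparse case: an index vector k lists the positions of the valid entries, gather pulls them into a dense vector, and scatter writes results back. The loops stand in for single vector instructions; the names are illustrative.

    #include <stddef.h>

    /* Gather (LVI-style): dense[i] = mem[k[i]]; one vector instruction
     * on real hardware, at some extra set-up cost. */
    void gather(double *dense, const double *mem,
                const size_t *k, size_t n) {
        for (size_t i = 0; i < n; i++)
            dense[i] = mem[k[i]];
    }

    /* Scatter (SVI-style): mem[k[i]] = dense[i]; the write-side twin. */
    void scatter(double *mem, const double *dense,
                 const size_t *k, size_t n) {
        for (size_t i = 0; i < n; i++)
            mem[k[i]] = dense[i];
    }

Usage follows the slide's recipe: gather the valid entries through the index vector, run dense vector operations on the packed register, then scatter the results back through the same indices.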
Vectorizing Compilers
- Take advantage of data-level parallelism
- Not as easy as it seems, but lots of success after years of effort (D. Kuck at UIUC, KAI, the Portland Group)

Cray X1 (2002)
- Lots of processor cores with streaming caches
- Unified global memory
- Fully connected interconnect
- SSP: single-streaming processor; MSP: multi-stream processor

Today's Supercomputers (www.top500.org, Nov 2009)
- Cray Jaguar XT5-HE (Oak Ridge NL, 1.7 petaFLOPS)
  » 6-core Opteron CPUs at 2.6 GHz
  » Black Widow interconnect (YARC switches)
- IBM Roadrunner, BladeCenter QS22/LS21 cluster (LANL)
  » PowerXCell 8i at 3.2 GHz plus dual-core Opterons at 1.8 GHz
  » Voltaire InfiniBand interconnect
- Kraken XT5 (Cray XT5-HE), Univ. of Tennessee (NICS)
- Jugene (IBM BlueGene/P), Research Center Juelich, Germany
- Tianhe (Milky Way), China's National University of Defense Technology
- SGI Pleiades
  » quad-core Nehalems at 3 GHz
- Bottom line: high-end microprocessors, lots of memory, and a fancy interconnect; lots of watts and dollars

Cray Jaguar
- (photo; source: www.pnl.gov)

Cray Jaguar
- Comprises 45,000 quad-core Opteron CPUs (180K cores)
- 362 TB of main memory
  » 578 GB/sec of main-memory bandwidth
  » 284 GB/sec of I/O bandwidth
- 10 PB of disk
- Fat-tree interconnect, moving to a 3D torus
  » radix-64 YARC switches
- Cost: a $100M government contract
  » no way to know if they made money
- 7 megawatts of power, plus another 7 MW to run the chillers
  » typical cooling energy = machine energy
  » at current Tennessee industrial rates of 6.45 cents/kWh, that 14 MW comes to 14,000 kW * 8,760 hours * $0.0645/kWh = $7,910,280 per annum in power bills
  » your taxpayer dollars at work

IBM RoadRunner
- (photo; source: news.cnet.com)
- Comprises 13,824 Opteron cores
  » 6,912 dual-core Opteron sockets: 6,480 for compute, 432 for management
  » plus 12,960 PowerXCell 8i sockets (2 per Opteron), each with 1 Power PPE core and 8 SPE cores: 12,960 * 9 = 116,640 cores
- Packaged in tri-blades
  » (figure: tri-blade packaging; source: Wikipedia)

RoadRunner in the Large
- CU: 60 BladeCenter H modules in a cabinet
- Other: 103.6 TB of RAM; 216 x3755 I/O nodes; 26 radix-288 ISR2012 InfiniBand 4x DDR switches
- 296 racks
- 2.35 MW of operating power

Concluding Remarks
- For supercomputers, what matters most?
  » blade configuration
  » rack configuration
  » the interconnect: on the blade, in the rack, and between racks
  » how memory is partitioned: remote vs. local access latencies and bandwidths
  » memory capacity and organization
  » the cores' power and performance on big benchmarks
- If you choose one for your final HW, there's a lot of advertising copy; try to dig past it