Revisiting Parallelism
- Kristina Miller
- 5 years ago
Transcription
1 Revisiting Parallelism
Sudhakar Yalamanchili, Georgia Institute of Technology

Where Are We Headed?
[Chart: processor performance (MIPS) over time, spanning three eras]
- Era of Pipelined Architecture (special-purpose HW)
- Era of Instruction Level Parallelism (speculative, OOO & superscalar)
- Era of Thread & Processor Level Parallelism (multi-threaded, multi-core)
Source: Shekhar Borkar, Intel Corp.
ECE 4100/6100 (2)
2 Beyond ILP
- Performance is limited by the serial fraction (the parallelizable part scales across 1, 2, 3, 4 CPUs; the serial part does not)
- Coarse-grain parallelism in the post-ILP era: thread, process, and data parallelism
- Learn from the lessons of the parallel processing community
- Revisit the classifications and architectural techniques

Flynn's Classification*
- Single instruction stream, single data stream (SISD): the conventional, word-sequential architecture, including pipelined computers
- Single instruction stream, multiple data stream (SIMD): the multiple-ALU architectures (e.g., array processors); Data Level Parallelism (DLP)
- Multiple instruction stream, single data stream (MISD): not very common
- Multiple instruction stream, multiple data stream (MIMD): the traditional multiprocessor system; Thread Level Parallelism (TLP)

*M.J. Flynn, "Very high speed computing systems," Proc. IEEE, vol. 54(12).
3 ILP Challenges
- As machine ILP capabilities increase (ILP width and depth), so do the challenges
- OOO execution cores: key data structure sizes increase (ROB, ILP window, etc.); dependency tracking logic grows quadratically
- VLIW/EPIC: hardware interlocks, ports, and recovery logic (speculation) grow quadratically
- Circuit complexity increases with the number of in-flight instructions
- The way forward: data parallelism

Example: Itanium 2
- Note the percentage of the die devoted to control
- And this is a statically scheduled processor!
4 Data Parallel Alternatives
- Single instruction stream, multiple data stream cores
- Co-processors exposed through the ISA
- Co-processors exposed as a distinct processor
- Vector processing: over 5 decades of development

The SIMD Model
- Single instruction stream broadcast to all processors; processors execute in lock step on local data
- Efficient use of silicon area: fewer resources are devoted to control
- Distributed memory model vs. shared memory model
- Distributed memory: each processor has local memory; a data routing network operates under centralized control; processor masking handles data-dependent operations
- Shared memory: access to memory modules goes through an alignment network
- Instruction classes: computation, routing, masking
5 Two Issues
- Conditional execution
- Data alignment

Vector Cores
Sudhakar Yalamanchili, Georgia Institute of Technology
6 Classes of Vector Processors
- Vector machines: register machines and memory-to-memory machines
- Memory-to-memory architectures have seen a resurgence on chip

VMIPS
- Load/store architecture
- Multiported registers
- Deeply pipelined functional units
- Separate scalar registers
7 Cray Family Architecture
- Stream oriented: recall data skewing and concurrent memory accesses!
- The first load/store ISA design: Cray 1 (1976)

Features of Vector Processors
- Significantly less dependency checking logic: checks are of the same order of complexity as scalar comparisons, but significantly fewer are needed
- Vector data sets allow hazard-free operation on deep pipelines
- Conciseness of representation leads to a low instruction issue rate
- Reduction in normal control hazards: one vector operation vs. a sequence of scalar operations
- Concurrency in operation, memory access, and address generation, often statically known
8 Some Examples

Basic Performance Concepts
- Consider the vector operation Z = A*X + Y
- Execution time: t_ex = t_startup + n * t_cycle
- Metrics: R_infinity (the rate on an infinitely long vector), R_half (the vector length needed to reach half of R_infinity), R_v (the vector length at which vector mode becomes faster than scalar mode)
9 Optimizations for Vector Machines
- Chaining:
      MULT.V V1, V2, V3
      ADD.V  V4, V1, V5
  Fine-grained forwarding of the elements of a vector; needs additional ports on the vector register file; effectively creates a deeper pipeline
- Conditional operations and vector masks
- Scatter/gather operations
- Vector lanes: each lane is coupled to a portion of the vector register file; lanes are transparent to the code, like caches in the family-of-machines concept

The IBM Cell Processor
Sudhakar Yalamanchili, Georgia Institute of Technology
10 Cell Overview
[Block diagram: one PPU, eight SPUs, memory interface controller (MIC), bus interface controller (BIC), and RRAC I/O, connected by the internal bus]
- IBM/Toshiba/Sony joint project: multi-year effort, 400 designers
- 234 million transistors, 4+ GHz
- 256 Gflops single precision (billions of floating point operations per second); 26 Gflops double precision
- Area: 221 mm^2; technology: 90 nm SOI

Cell Overview (cont.)
- One 64-bit PowerPC processor: 4+ GHz, dual issue, two threads, 512 KB of second-level cache
- Eight Synergistic Processor Elements (or Streaming Processor Elements): co-processors, each with a dedicated 256 KB of memory (not a cache)
- EIB data ring for internal communication: four 16-byte data rings supporting multiple transfers; 96 B/cycle peak bandwidth; over 100 outstanding requests
- Dual Rambus XDR memory controllers (on chip): 25.6 GB/sec of memory bandwidth
- 76.8 GB/s chip-to-chip bandwidth (to an off-chip GPU)
11 Cell Features
- Security: an SPE is dynamically reconfigurable as a secure co-processor
- Networking: SPEs might off-load networking overheads (TCP/IP)
- Virtualization: run multiple OSs at the same time; Linux is the primary development OS for Cell
- Broadband: the SPE is a RISC architecture with a SIMD organization and a local store; 128+ concurrent transactions to memory per processor

PPE Block Diagram
- The PPE handles operating system and control tasks
- 64-bit Power Architecture(TM) with VMX
- In-order, 2-way hardware multi-threading
- Coherent load/store with 32 KB I & D L1 caches and a 512 KB L2
12 PPE Pipeline
[Diagram: PPE pipeline]

SPE Organization and Pipeline
[Diagrams: IBM Cell SPE organization; IBM Cell SPE pipeline]
13 Cell Temperature Graph
- Power and heat are key constraints
- Cell dissipates ~80 watts at 4+ GHz
- Cell has 10 temperature sensors
Source: IEEE ISSCC, 2005

SPE
- User-mode architecture: no translation/protection within the SPU; DMA uses the full Power Architecture protection/translation
- Direct programmer control: DMA/DMA-list, branch hints
- VMX-like SIMD dataflow: broad set of operations, graphics SP float, IEEE DP float (BlueGene-like)
- Unified register file: 128 entries x 128 bits
- 256 KB local store: combined I & D; 16 B/cycle load/store bandwidth; 128 B/cycle DMA bandwidth
14 Cell I/O
- XDR is new high-speed memory from Rambus
- Dual XDR(TM) controllers (3.2 Gbps per pin)
- Two configurable interfaces; flexible bandwidth between the interfaces; allows for multiple system configurations
- Pros: fast (the dual controllers give 25.6 GB/s; the contemporary AMD Opteron manages only 6.4 GB/s); small pin count, so only a few chips are needed for high bandwidth
- Cons: expensive (high cost per bit)

Multiple system support
- Game console systems
- Workstations (CPBW)
- HDTV
- Home media servers
- Supercomputers
15 Programming Cell
- 10 virtual processors: 2 threads on the PowerPC plus 8 co-processor SPEs
- Communicating with SPEs: the 256 KB local storage is NOT a cache; data must be explicitly moved in and out of the local store using the DMA engine (supports scatter/gather)

Programming Cell (cont.)
- Multiple-ISA hand-tuned programs, explicit SIMD coding, SIMD alignment directives: highest performance, using the local memories with help from programmers
- Shared memory, single program abstraction, automatic tuning for each ISA, automatic SIMDization, automatic or explicit parallelization: highest productivity, with fully automatic compiler technology
16 Execution Model
- SPE executables are embedded as read-only data in the PPE executable
- Use the memory flow controller (MFC) for DMA operations
- The "shopping list" view of memory accesses
Source: IBM

Programming Model
SPE program (spe_foo.c, a C program compiled into an executable called "spe_foo"):

    int main(unsigned long long speid, addr64 argp, addr64 envp)
    {
        int i;
        /* func_foo would be the real code */
        i = func_foo(argp);
        return i;
    }

PPE program (spe_runner.c, a C program linked with spe_foo and run on the PPE):

    extern spe_program_handle_t spe_foo;

    int main()
    {
        int rc, status = 0;
        speid_t spe_id;
        spe_id = spe_create_thread(0, &spe_foo, 0, NULL, -1, 0);
        rc = spe_wait(spe_id, &status, 0);   /* blocking call */
        return status;
    }

Source: IBM
17 SPE Programming
- Dual issue, with issue constraints
- Predication and branch hints; no branch prediction hardware
- Alignment instructions
Source: IBM

Programming Idioms: Pipeline
[Diagram: pipelined decomposition of work across SPEs]
18 Programming Idioms: Work Queue Model
- SPEs pull data off of a shared work queue
- Self scheduled

SPMD & MIMD Accelerators
- Accelerators executing the same (SPMD) or different (MPMD) programs
19 Cell Processor Application Areas
- Digital content creation (games and movies)
- Game playing and game serving
- Distribution of (dynamic, media-rich) content
- Imaging and image processing
- Image analysis (e.g., video surveillance)
- Next-generation physics-based visualization
- Video conferencing (3D)
- Streaming applications (codecs, etc.)
- Physical simulation & science
20 IRAM Cores
Sudhakar Yalamanchili, Georgia Institute of Technology

Data Parallelism and the Processor-Memory Gap
[Chart: performance vs. time; µproc performance grows 60%/yr (Moore's Law), DRAM performance grows 7%/yr, so the processor-memory performance gap grows 50%/yr]
- How can we close this gap?
21 The Effects of the Processor-Memory Gap
- Tolerating the gap with deeper cache hierarchies increases worst-case access time
- System-level impact (Alpha): I & D cache access: 2 clocks; L2 cache: 6 clocks; L3 cache: 8 clocks; memory: 76 clocks (DRAM component access: 18 clocks)
- How much time is spent in the memory hierarchy? SpecInt92: 22%; Specfp92: 32%; database: 77%; sparse matrix: 73%

Where do the Transistors Go?
  Processor         % Area (~cost)   % Transistors (~power)
  Alpha                              77%
  StrongArm SA110   61%              94%
  Pentium Pro       64%              88%
- Caches have no inherent value; they simply recover bandwidth?
22 Impact of DRAM Capacity
- Increasing capacity creates a quandary: the continual four-fold increase in density increases the minimum memory increment for a given width
- How do we match the memory bus width?
- Cost/bit issues for wider DRAM chips: die size, testing, package costs
- The number of DRAM chips decreases, and with it the available concurrency

Merge Logic and DRAM!
- Bring the processors to the memory
- Tremendous on-chip bandwidth for predictable application reference patterns
- Enough memory to hold complete programs and data becomes feasible
- More applications are limited by memory speed
- Better memory latency for applications with irregular access patterns
- Synchronous DRAMs are compatible with integration alongside higher-speed logic
23 Potential: IRAM for Lower Latency
- DRAM latency: the dominant delay is the RC of the word lines
- Keep wire lengths short & block sizes small
- RAS/CAS latency for 64b-256b IRAM accesses?

Potential for IRAM Bandwidth
- Mbit modules (1 Gb total), each 256 b wide, with a 20 ns RAS/CAS = 320 GBytes/sec
- Even if a crossbar switch delivers only 1/3 to 2/3 of the bandwidth of 20% of the modules, the result is still measured in GBytes/sec
- FYI: AlphaServer 8400 = 1.2 GBytes/sec (75 MHz, 256-bit memory bus, 4 banks)
24 IRAM Applications
- PDAs, cameras, gameboys, cell phones, pagers
- Database systems?
[Chart: database demand grows 2X / 9 months (Greg's Law), µproc speed 2X / 18 months (Moore's Law), DRAM speed 2X / 120 months; both the database-processor and processor-memory performance gaps keep widening]

Estimating IRAM Performance
- Direct application produces modest performance improvements
- Existing architectures were designed to overcome the memory bottleneck, not to use tremendous memory bandwidth
- Need to rethink the design: tailor the architecture to utilize the high bandwidth
25 Emerging Embedded Applications and Characteristics
- The fastest growing application domain: video processing, speech recognition, 3D graphics; set-top boxes, game consoles, PDAs
- Data parallel, with typically low temporal locality
- Size, weight, and power constraints: the highest-speed processor is not necessarily the best processor
- What about the role of ILP processors here?
- Real-time constraints: the right data at the right time

SIMD/Vector Architectures
- VIRAM (Vector IRAM)
- Logic is slow in a DRAM process
- So put a vector unit in a DRAM and provide a port between a traditional processor and the vector IRAM, instead of putting a whole processor in DRAM
Source: Berkeley Vector IRAM
26 ISA
- A load/store vector ISA defined as a co-processor to the MIPS-64 ISA
- Vector register file with 32 entries; each register can be configured to hold 64b, 32b, or 16b elements, integer or FP
- Two scalar register files: one for memory and exception handling (base addresses and stride information), one for scalar operands
- Flag registers
- Special limited-scope instructions to permute the contents of vector registers
- Integer instructions for saturated arithmetic

MIMD Machines
[Diagram: processor + cache (P+C) nodes, each with a directory (Dir) and memory, joined by an interconnection network]
- Parallel processing has catalyzed the development of several generations of parallel processing machines
- Unique features include the interconnection network, support for system-wide synchronization, and programming languages/compilers
27 Basic Models for Parallel Programs
- Shared memory: coherency/consistency are the driving concerns; the programming model is simplified at the expense of system complexity
- Message passing: typically implemented on distributed memory machines; system complexity is simplified at the expense of increased effort by the programmer

Shared Memory vs. Message Passing
- Shared memory simplifies software development but increases hardware complexity and power: directories, coherency enforcement logic, and more recently transactional memory
- Message passing doesn't need a centralized bus: it simplifies the hardware and gives scalable memory and interconnect bandwidth, but increases the complexity of software development and the burden on the developer
28 Two Emerging Challenges
- Programming models and compilers? (Source: Intel Corp.)
- Interconnection networks (Source: IBM)
More informationNext Generation Technology from Intel Intel Pentium 4 Processor
Next Generation Technology from Intel Intel Pentium 4 Processor 1 The Intel Pentium 4 Processor Platform Intel s highest performance processor for desktop PCs Targeted at consumer enthusiasts and business
More informationOptimizing Data Sharing and Address Translation for the Cell BE Heterogeneous CMP
Optimizing Data Sharing and Address Translation for the Cell BE Heterogeneous CMP Michael Gschwind IBM T.J. Watson Research Center Cell Design Goals Provide the platform for the future of computing 10
More informationMultiprocessors & Thread Level Parallelism
Multiprocessors & Thread Level Parallelism COE 403 Computer Architecture Prof. Muhamed Mudawar Computer Engineering Department King Fahd University of Petroleum and Minerals Presentation Outline Introduction
More informationHow to Write Fast Code , spring th Lecture, Mar. 31 st
How to Write Fast Code 18-645, spring 2008 20 th Lecture, Mar. 31 st Instructor: Markus Püschel TAs: Srinivas Chellappa (Vas) and Frédéric de Mesmay (Fred) Introduction Parallelism: definition Carrying
More informationThis Unit: Putting It All Together. CIS 371 Computer Organization and Design. What is Computer Architecture? Sources
This Unit: Putting It All Together CIS 371 Computer Organization and Design Unit 15: Putting It All Together: Anatomy of the XBox 360 Game Console Application OS Compiler Firmware CPU I/O Memory Digital
More informationThe Processor: Instruction-Level Parallelism
The Processor: Instruction-Level Parallelism Computer Organization Architectures for Embedded Computing Tuesday 21 October 14 Many slides adapted from: Computer Organization and Design, Patterson & Hennessy
More informationCray XE6 Performance Workshop
Cray XE6 erformance Workshop odern HC Architectures David Henty d.henty@epcc.ed.ac.uk ECC, University of Edinburgh Overview Components History Flynn s Taxonomy SID ID Classification via emory Distributed
More informationComputer System Components
Computer System Components CPU Core 1 GHz - 3.2 GHz 4-way Superscaler RISC or RISC-core (x86): Deep Instruction Pipelines Dynamic scheduling Multiple FP, integer FUs Dynamic branch prediction Hardware
More informationCS 654 Computer Architecture Summary. Peter Kemper
CS 654 Computer Architecture Summary Peter Kemper Chapters in Hennessy & Patterson Ch 1: Fundamentals Ch 2: Instruction Level Parallelism Ch 3: Limits on ILP Ch 4: Multiprocessors & TLP Ap A: Pipelining
More informationUnit 11: Putting it All Together: Anatomy of the XBox 360 Game Console
Computer Architecture Unit 11: Putting it All Together: Anatomy of the XBox 360 Game Console Slides originally developed by Milo Martin & Amir Roth at University of Pennsylvania! Computer Architecture
More informationLecture 7: Parallel Processing
Lecture 7: Parallel Processing Introduction and motivation Architecture classification Performance evaluation Interconnection network Zebo Peng, IDA, LiTH 1 Performance Improvement Reduction of instruction
More informationVector Architectures Vs. Superscalar and VLIW for Embedded Media Benchmarks
Vector Architectures Vs. Superscalar and VLIW for Embedded Media Benchmarks Christos Kozyrakis Stanford University David Patterson U.C. Berkeley http://csl.stanford.edu/~christos Motivation Ideal processor
More informationComputer Architecture: SIMD and GPUs (Part I) Prof. Onur Mutlu Carnegie Mellon University
Computer Architecture: SIMD and GPUs (Part I) Prof. Onur Mutlu Carnegie Mellon University A Note on This Lecture These slides are partly from 18-447 Spring 2013, Computer Architecture, Lecture 15: Dataflow
More informationComputing architectures Part 2 TMA4280 Introduction to Supercomputing
Computing architectures Part 2 TMA4280 Introduction to Supercomputing NTNU, IMF January 16. 2017 1 Supercomputing What is the motivation for Supercomputing? Solve complex problems fast and accurately:
More informationParallel Computing. Hwansoo Han (SKKU)
Parallel Computing Hwansoo Han (SKKU) Unicore Limitations Performance scaling stopped due to Power consumption Wire delay DRAM latency Limitation in ILP 10000 SPEC CINT2000 2 cores/chip Xeon 3.0GHz Core2duo
More informationCSCI 402: Computer Architectures. Parallel Processors (2) Fengguang Song Department of Computer & Information Science IUPUI.
CSCI 402: Computer Architectures Parallel Processors (2) Fengguang Song Department of Computer & Information Science IUPUI 6.6 - End Today s Contents GPU Cluster and its network topology The Roofline performance
More informationIntroduction to Multicore architecture. Tao Zhang Oct. 21, 2010
Introduction to Multicore architecture Tao Zhang Oct. 21, 2010 Overview Part1: General multicore architecture Part2: GPU architecture Part1: General Multicore architecture Uniprocessor Performance (ECint)
More informationLecture 7: Parallel Processing
Lecture 7: Parallel Processing Introduction and motivation Architecture classification Performance evaluation Interconnection network Zebo Peng, IDA, LiTH 1 Performance Improvement Reduction of instruction
More informationParallel Architecture. Hwansoo Han
Parallel Architecture Hwansoo Han Performance Curve 2 Unicore Limitations Performance scaling stopped due to: Power Wire delay DRAM latency Limitation in ILP 3 Power Consumption (watts) 4 Wire Delay Range
More informationChapter 04. Authors: John Hennessy & David Patterson. Copyright 2011, Elsevier Inc. All rights Reserved. 1
Chapter 04 Authors: John Hennessy & David Patterson Copyright 2011, Elsevier Inc. All rights Reserved. 1 Figure 4.1 Potential speedup via parallelism from MIMD, SIMD, and both MIMD and SIMD over time for
More informationLecture 9: MIMD Architecture
Lecture 9: MIMD Architecture Introduction and classification Symmetric multiprocessors NUMA architecture Cluster machines Zebo Peng, IDA, LiTH 1 Introduction MIMD: a set of general purpose processors is
More informationParallel Computer Architecture Spring Shared Memory Multiprocessors Memory Coherence
Parallel Computer Architecture Spring 2018 Shared Memory Multiprocessors Memory Coherence Nikos Bellas Computer and Communications Engineering Department University of Thessaly Parallel Computer Architecture
More informationPortland State University ECE 588/688. Cray-1 and Cray T3E
Portland State University ECE 588/688 Cray-1 and Cray T3E Copyright by Alaa Alameldeen 2018 Cray-1 A successful Vector processor from the 1970s Vector instructions are examples of SIMD Contains vector
More informationM7: Next Generation SPARC. Hotchips 26 August 12, Stephen Phillips Senior Director, SPARC Architecture Oracle
M7: Next Generation SPARC Hotchips 26 August 12, 2014 Stephen Phillips Senior Director, SPARC Architecture Oracle Safe Harbor Statement The following is intended to outline our general product direction.
More informationanced computer architecture CONTENTS AND THE TASK OF THE COMPUTER DESIGNER The Task of the Computer Designer
Contents advanced anced computer architecture i FOR m.tech (jntu - hyderabad & kakinada) i year i semester (COMMON TO ECE, DECE, DECS, VLSI & EMBEDDED SYSTEMS) CONTENTS UNIT - I [CH. H. - 1] ] [FUNDAMENTALS
More informationExploring different level of parallelism Instruction-level parallelism (ILP): how many of the operations/instructions in a computer program can be performed simultaneously 1. e = a + b 2. f = c + d 3.
More informationEC 513 Computer Architecture
EC 513 Computer Architecture Cache Organization Prof. Michel A. Kinsy The course has 4 modules Module 1 Instruction Set Architecture (ISA) Simple Pipelining and Hazards Module 2 Superscalar Architectures
More informationChap. 4 Multiprocessors and Thread-Level Parallelism
Chap. 4 Multiprocessors and Thread-Level Parallelism Uniprocessor performance Performance (vs. VAX-11/780) 10000 1000 100 10 From Hennessy and Patterson, Computer Architecture: A Quantitative Approach,
More informationIntroduction: Modern computer architecture. The stored program computer and its inherent bottlenecks Multi- and manycore chips and nodes
Introduction: Modern computer architecture The stored program computer and its inherent bottlenecks Multi- and manycore chips and nodes Motivation: Multi-Cores where and why Introduction: Moore s law Intel
More informationDesign of Digital Circuits Lecture 21: GPUs. Prof. Onur Mutlu ETH Zurich Spring May 2017
Design of Digital Circuits Lecture 21: GPUs Prof. Onur Mutlu ETH Zurich Spring 2017 12 May 2017 Agenda for Today & Next Few Lectures Single-cycle Microarchitectures Multi-cycle and Microprogrammed Microarchitectures
More informationParallel Computing Platforms
Parallel Computing Platforms Jinkyu Jeong (jinkyu@skku.edu) Computer Systems Laboratory Sungkyunkwan University http://csl.skku.edu SSE3054: Multicore Systems, Spring 2017, Jinkyu Jeong (jinkyu@skku.edu)
More informationMain Memory. EECC551 - Shaaban. Memory latency: Affects cache miss penalty. Measured by:
Main Memory Main memory generally utilizes Dynamic RAM (DRAM), which use a single transistor to store a bit, but require a periodic data refresh by reading every row (~every 8 msec). Static RAM may be
More informationMultiprocessors and Thread-Level Parallelism. Department of Electrical & Electronics Engineering, Amrita School of Engineering
Multiprocessors and Thread-Level Parallelism Multithreading Increasing performance by ILP has the great advantage that it is reasonable transparent to the programmer, ILP can be quite limited or hard to
More information