RAMP Gold Wrap
Krste Asanovic, Berkeley Par Lab
RAMP Wrap, Stanford, CA, August 25, 2010



RAMP Gold Team

- Graduate students: Zhangxi Tan, Andrew Waterman, Rimas Avizienis, Yunsup Lee, Henry Cook, Sarah Bird
- Faculty: Krste Asanovic, David Patterson

Multicore Architecture Simulation Challenge

- Bigger, more complex target systems: many cores, more components, cache coherence, ...
- Need to run an OS scheduler, or a multithreaded runtime
- Sequential software simulator performance is no longer scaling: more cores, not faster cores
- Detailed software simulators don't parallelize well: cycle-by-cycle synchronization kills parallel performance
- Parallel code has non-deterministic performance: need multiple runs to get error bars
- Software is more dynamic (adaptive, autotuned, JIT, ...), so can't use sampling
- ~No legacy parallel software: need to write/port the entire stack

RAMP Blue, July 2007

- 1,008 modified MicroBlaze cores with 64-bit FPUs; RTL directly mapped to FPGA at 90 MHz
- Runs the UPC version of the NAS parallel benchmarks
- Message-passing cluster; no MMU
- Requires lots of hardware: 21 BEE2 boards / 84 FPGAs
- Difficult to modify
- An FPGA computer, not a simulator!

Dimensions in FAME (FPGA Architecture Model Execution)

- Direct: one target cycle executes in one FPGA host cycle
  vs. Decoupled: one target cycle takes one or more FPGA host cycles
- Full RTL: complete RTL of the target machine is modeled
  vs. Abstract RTL: partial/simplified RTL, split functional/timing
- Host single-threaded: one target model per host pipeline
  vs. Host multi-threaded: multiple target models per host pipeline
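The three binary dimensions above are commonly collapsed into a single FAME level number, as in the ISCA 2010 FAME paper, where each dimension contributes one bit (so FAME-7, mentioned later in this talk, means decoupled + abstract + multithreaded). A minimal sketch of that encoding, with the RAMP Blue / RAMP Gold classifications as assumed examples:

```python
def fame_level(decoupled: bool, abstract_rtl: bool, multithreaded: bool) -> int:
    """Encode the three FAME dimensions as a 3-bit level number.
    Bit 0: decoupled timing; bit 1: abstract (split) RTL;
    bit 2: host-multithreaded pipelines."""
    return (decoupled << 0) | (abstract_rtl << 1) | (multithreaded << 2)

# RAMP Blue (direct, full-RTL, single-threaded) sits at FAME-0;
# RAMP Gold (decoupled, abstract, multithreaded) sits at FAME-7.
ramp_blue = fame_level(False, False, False)
ramp_gold = fame_level(True, True, True)
```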

Host Multithreading

[Figure: a target model of four CPUs mapped onto a single multithreaded emulation engine on the FPGA - one hardware pipeline (fetch, I$, decode, GPRs, execute, D$) holding multiple copies of per-CPU state such as the PCs and register files]

- Single hardware pipeline with multiple copies of CPU state
- The multithreaded emulation engine reduces FPGA resource use and improves emulator throughput
- Hides emulation latencies (e.g., communicating across FPGAs)
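A software analogue of the host-multithreaded engine: one "pipeline" loop round-robins among several target CPU contexts, so a single engine advances many target cores. This is a minimal illustrative sketch, not the actual RAMP Gold microarchitecture:

```python
from dataclasses import dataclass, field

@dataclass
class TargetCPU:
    """Architected state for one emulated target core."""
    pc: int = 0
    regs: list = field(default_factory=lambda: [0] * 32)
    retired: int = 0

def run_multithreaded(cpus, host_cycles):
    """One host pipeline interleaves target cores cycle by cycle.
    Each host cycle advances exactly one target core by one step,
    hiding per-core emulation latencies behind the other threads."""
    for cycle in range(host_cycles):
        cpu = cpus[cycle % len(cpus)]   # round-robin thread select
        cpu.pc += 4                     # stand-in for fetch/decode/execute
        cpu.retired += 1

cpus = [TargetCPU() for _ in range(4)]
run_multithreaded(cpus, host_cycles=64)
# With 4 target cores sharing one host pipeline over 64 host cycles,
# each core advances 16 steps.
```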

RAMP Gold, November 2009

- Rapid, accurate simulation of manycore architectural ideas using FPGAs
- Initial version models 64 cores of SPARC v8 with a shared memory system on a $750 board
- Hardware FPU, MMU; boots an OS

                            Cost            Performance (MIPS)   Simulations per day
  Simics software simulator $2,000          0.1-1                1
  RAMP Gold                 $2,000 + $750   50-100               100

RAMP Gold Target Machine

[Figure: 64 SPARC V8 cores, each with private I$ and D$, connected through a shared L2$/interconnect to DRAM]

RAMP Gold Model

[Figure: timing state feeding a timing-model pipeline, architectural state feeding a functional-model pipeline]

- RAMP emulation model for the Par Lab manycore: SPARC v8 ISA, single-socket manycore target
- Split functional/timing model, both in hardware
  - Functional model: executes the ISA
  - Timing model: captures pipeline timing detail (can be cycle accurate)
- Host multithreading of both functional and timing models
- Built for Virtex-5 systems (ML505)
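The split can be mimicked in software: a functional model that only produces architectural effects, and a timing model that decides how many target cycles each instruction costs. A toy sketch under assumed latencies (the opcodes and cycle counts are invented for illustration, not RAMP Gold's actual timing model):

```python
def functional_step(state, instr):
    """Functional model: architectural effect only, no notion of time."""
    op, rd, a, b = instr
    if op == "add":
        state[rd] = state[a] + state[b]
    elif op == "load":
        state[rd] = state["mem"][state[a] + b]
    return state

def timing_cost(instr, cache_hit):
    """Timing model: target cycles per instruction (assumed latencies)."""
    op = instr[0]
    if op == "load":
        return 2 if cache_hit else 20   # hit vs. miss penalty
    return 1                            # simple ALU op

state = {"r1": 3, "r2": 4, "r3": 0, "mem": {7: 99}}
cycles = 0
for instr, hit in [(("add", "r3", "r1", "r2"), True),
                   (("load", "r1", "r3", 0), False)]:
    state = functional_step(state, instr)   # what happened
    cycles += timing_cost(instr, hit)       # when it happened
```

Because only the timing model changes between experiments, the (harder to build) functional model can be reused across many target microarchitectures.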

Case Study: Manycore OS Resource Allocation

- Spatial resource allocation in a manycore system is hard: combinatorial explosion in the number of apps and resources
- Idea: use predictive models of app performance to make it easier on the OS
- HW partitioning for performance isolation (so models still work when apps run together)
- Problem: evaluating the effectiveness of the resulting scheduling decisions requires running hundreds of schedules for billions of cycles each
- Simulation-bound: 8.3 CPU-years for Simics!
- Read the ISCA '10 FAME paper for details
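The quoted 8.3 CPU-years is easy to sanity-check: hundreds of schedules times billions of target cycles per core, across 64 cores, at sub-MIPS detailed-simulation speed adds up to years. The specific numbers below are assumptions chosen to be plausible, not figures from the talk or paper:

```python
# Hypothetical workload parameters; only the ~8 CPU-year total is from the talk.
schedules     = 500      # scheduling decisions to evaluate
target_cycles = 4e9      # target cycles per run, per core
cores         = 64       # RAMP Gold target machine size
simics_rate   = 0.5e6    # aggregate target core-cycles simulated per host second

host_seconds = schedules * target_cycles * cores / simics_rate
cpu_years = host_seconds / (365 * 24 * 3600)
# Comes out near the paper's 8.3 CPU-year figure; at RAMP Gold's
# ~100x higher throughput the same study takes on the order of a month.
```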

Case Study: Manycore OS Resource Allocation

[Figure: normalized runtime (0-4) of the worst, chosen, and best schedules for the Synthetic Only, PARSEC Small, and PARSEC Large workloads]

The technique appears to perform very well for synthetic or reduced-input workloads, but is lackluster in reality!

RAMP Gold Distribution

- http://sites.google.com/site/rampgold/
- BSD/GNU licenses
- Many (~100?) downloads
- Used by the Xilinx tools group as an exemplar SystemVerilog design

RAMP Gold Lessons

- Architects will use it! Actually, they don't want to use software simulators now
- Make everything a run-time parameter: avoid resynthesis + place-and-route; highly tuned FPGA designs are difficult to modify
- The functional/timing split should be at the microarchitectural block level: really build a microfunctional model
- A standard ISA only gets you so far in research: research immediately changes the ISA/ABI, so you lose compatibility anyway
- Target machine design is a vital part of architecture research: you have to understand the target to build a model of it! Need area/cycle-time/energy numbers from VLSI design
- FAME-7 simulators are very complex hardware designs, much more complicated than a processor

Midas Goals

- Richer set of target machines
  - In-order scalar cores, possibly threaded, plus various kinds of vector unit
  - Various hardware-managed plus software-managed memory hierarchies
  - Cross-chip and off-chip interconnect structures
- More modular design: trade some FPGA performance to make modifications easier
- Better/more timing models, especially interconnect and memory hierarchy
- Better I/O for the target system
  - Timed I/O to model real-time I/O accurately
  - Paravirtual-style network/video/graphics/audio/haptic I/O devices
- Better scaling of simulation performance
  - Weak scaling: more FPGA pipelines allow a bigger system to run faster
  - Strong scaling: more FPGA pipelines allow the same system to run faster
- First target machine running apps on the Par Lab stack in 2010

Midas ISA Choice

- RISC-V: a new home-grown RISC ISA
  - V for Five, V for Vector, V for Variants
  - Inspired by MIPS but much cleaner
  - Supports 32-bit or 64-bit address spaces
  - 32-bit instruction format plus optional variable-length (16/32/>32-bit) instructions
  - Designed to support easy extension
  - Completely open (BSD)
- Are we crazy not to use a standard ISA?
  - A standard ISA is not very helpful, but a standard ABI is
  - A standard ABI effectively means running the same OS, and running the same OS means modeling the hardware platform: too much work for a university
  - Also, the only interesting standard ISAs are x86 & ARM (+GPU), both too complex for a university project to implement in full
  - We're going to change the ISA plus OS model anyway
- Midas modularity should make it easy to add other ISAs as well
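As an illustration of the fixed 32-bit format, here is how a register-register add packs into the RISC-V base encoding as it later stabilized. The 2010 draft referenced in this talk predates the frozen spec, so treat the field layout below as the modern one, not necessarily the draft's:

```python
def encode_rtype(opcode, rd, funct3, rs1, rs2, funct7):
    """Pack a RISC-V R-type instruction: funct7|rs2|rs1|funct3|rd|opcode."""
    return (funct7 << 25) | (rs2 << 20) | (rs1 << 15) | \
           (funct3 << 12) | (rd << 7) | opcode

# add x3, x1, x2  (OP major opcode 0b0110011, funct3 = 0, funct7 = 0)
insn = encode_rtype(0b0110011, rd=3, funct3=0, rs1=1, rs2=2, funct7=0)
# insn == 0x002081B3
```

Keeping the register specifiers in fixed positions across formats is part of what makes the ISA cheap to decode and easy to extend.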

Midas Target Core Design Space

- Scalar cores
  - In-order (out-of-order possible later)
  - Different issue widths and functional-unit latencies
  - Decoupled memory system (non-blocking cache)
  - Optional multithreading
  - New translation/protection structures for OS support
  - Performance counters
- Attached data-parallel accelerator options
  - Traditional vector (a la Cray)
  - SIMD (a la SSE/AVX)
  - SIMT (a la NVIDIA GPU)
  - Vector-threading (a la Scale/Maven)
  - All with one or more lanes
- Should be possible to model a mix of cores for asymmetric platforms (same ISA, different microarchitecture)

Midas Uncore Design Space

- Distributed coherent caches
  - Initially using a reverse-map tag directory
  - Flexible placement, replication, migration, and eviction policies
- Virtual local stores: software-managed, with DMA engines
- Multistage cross-chip interconnect: rings/torus
- Partitioning/QoS hardware: cache capacity, on-chip and off-chip bandwidth
- DRAM access schedulers
- Performance counters
- All fully parameterized for latency/bandwidth settings
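"Fully parameterized for latency/bandwidth settings" suggests a configuration record that the timing models consume at run time, so sweeps need no resynthesis. A hypothetical sketch of what such parameters might look like (all names and values invented for illustration):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class UncoreConfig:
    """Run-time uncore timing parameters (hypothetical names/values)."""
    l2_banks: int = 8
    l2_hit_latency: int = 12            # target cycles
    interconnect_hop_latency: int = 4   # target cycles per ring hop
    dram_latency: int = 100             # target cycles
    dram_bandwidth_gbps: float = 12.8

    def l2_access_cycles(self, hops: int) -> int:
        """Round-trip latency of an L2 access 'hops' links away."""
        return self.l2_hit_latency + 2 * hops * self.interconnect_hop_latency

cfg = UncoreConfig()
# A request to a bank 3 hops away: 12 + 2*3*4 = 36 target cycles.
```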

Platforms of Interest

- 1 W handhelds
- 10 W laptop/set-top/TV/games/car
- 100 W servers

Midas Software

- gcc + binutils as efficiency-level compiler tools; already up and running for the draft RISC-V spec
- Par Lab OS + software stack for the rest
- Par Lab applications
- Pattern-specific compilers built with SEJITS/autotuning help get good app performance quickly on a large set of new architectures

Not Just Simulators

- New DoE-funded Project Isis (with John Wawrzynek) on developing a high-level hardware design capability
- Integrated approach spanning VLSI design and simulation
- Building ASIC designs for various scalar+vector cores; yes, we will fab chips
- Mapping RTL to FPGAs: FPGA computers help software developers and help debug the RTL
- Why #1: for designing (not characterizing) simple cores, power-model abstractions don't work (e.g., 2x differences based on data values)

Acknowledgements

Sponsors:
- Par Lab: Intel, Microsoft, UC Discovery, National Instruments, NEC, Nokia, NVIDIA, Samsung, Sun Microsystems
- Xilinx, IBM, SPARC International
- DARPA, NSF, DoE, GSRC, BWRC