RAMP-White / FAST-MP

1 RAMP-White / FAST-MP
Hari Angepat and Derek Chiou
Electrical and Computer Engineering, University of Texas at Austin
Supported in part by DOE, NSF, SRC, Bluespec, Intel, Xilinx, IBM, and Freescale

2 RAMP-White Overview
Use existing FPGA processor implementations to build scalable, flexible, coherent shared-memory platforms that run standard operating systems.
A standard ISA/OS enables more complex applications, such as software emulators (QEMU), when desired.

3 RAMP-White Architecture
Classic shared-memory machine design
[Diagram labels: IO/MEM, IO/MEM]

4 RAMP-White Architecture
[Diagram labels: Proc shim, NIU, Ring, Periph shim, Peripheral bus, IO Device, DRAM]

5 RAMP-White Architecture
Model CMP/SMP targets
Coherent shared-memory platform
Single-image OS
RAMP scalability (1K cores) via spatial and temporal replication
[Diagram repeated from slide 4]

6 RAMP-White Architecture
Ability to use commodity cores:
  Sparc V8: Leon3 soft core
  PowerPC: PPC405 hard core
Configurable coherence protocol and engines
[Diagram repeated from slide 4]

7 RAMP-White Architecture
Configurable modules: NIC, network, coherence engine, intersection unit
Modules connected by Connectors: point-to-point FIFOs that can model target time if required
Shim adapters
[Diagram repeated from slide 4]
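The Connector idea on this slide can be illustrated with a minimal Python sketch. It is not the RAMP-White implementation: the class name `Connector`, the `capacity` and `latency` parameters, and the method names are all hypothetical, chosen only to show a bounded point-to-point FIFO whose entries become visible after a fixed number of target cycles.

```python
from collections import deque


class Connector:
    """Hypothetical point-to-point FIFO; latency is in target cycles (0 = untimed)."""

    def __init__(self, capacity, latency=0):
        self.capacity = capacity
        self.latency = latency
        self._queue = deque()  # entries are (ready_cycle, message)

    def send(self, message, now):
        """Enqueue at target cycle `now`; returning False signals backpressure."""
        if len(self._queue) >= self.capacity:
            return False
        self._queue.append((now + self.latency, message))
        return True

    def receive(self, now):
        """Return the head message once its target-time delay has elapsed."""
        if self._queue and self._queue[0][0] <= now:
            return self._queue.popleft()[1]
        return None


link = Connector(capacity=2, latency=3)
sent = [link.send("req0", now=0), link.send("req1", now=1), link.send("req2", now=1)]
early = link.receive(now=2)    # head is not ready until cycle 3
on_time = link.receive(now=3)  # "req0" becomes visible at cycle 3
```

With `latency=0` the same object behaves as a plain FIFO, which matches the "if required" wording: target time is modeled only when a module asks for it.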

8 RAMP-White Status
Working:
  Multiprocessor Leon
  Soft-FP kernel and userspace as initramfs
  Standard pthread Splash benchmarks
Still debugging:
  Multichip crossing with scalable interrupt components
  Integration with parameterizable FAST cache model
See me during the retreat if interested in the Alpha release

9 Prototype (See at Demo)
Hardware
  Sparc V8 32-bit soft-core processor (Leon3)
  50 MHz core clock, soft FP, 16 KB Icache, Dcache bypassed
  GRLIB components {serial, ethernet, ddr, jtag}
Software
  Linux SMP for Leon3
  Pthread-based Splash2 benchmarks
  RAM-disk rootfs with simple userspace apps
Platform
  BEE2 control FPGA with JTAG-based programming
  Ethernet for kernel loading/debugging

10 FAST-MP

11 FAST-MP: High-Level Goal
Multi-resolution coherent shared-memory target emulation
Predict performance/power for a wide range of microarchitectures at accuracies ranging from cycle-accurate to functional-only
Capable of running real ISAs aided by binary translation (x86, Sparc, PowerPC, etc.), operating systems (unmodified Windows, Linux), compilers, and applications (SQLServer, Apache, etc.)
Extensible/flexible (new instructions, different microarchitectures)

12 Performance Modeling on RAMP-White
The RAMP-White host predicts RAMP-White target performance perfectly
Predicting performance of arbitrary microarchitectures requires additional support
FAST (FPGA-Accelerated Simulation Techniques) uses a timing model to predict performance of an arbitrary microarchitecture
  Special-purpose structure designed to predict time
  Very small (complex model in a fraction of an FPGA)
  Uses the same functional model for any microarchitecture
White as a scalable functional model for FAST-MP

13 FAST (FPGA-Accelerated Simulation)
Speculative functional model (FM) with checkpoint/rollback of the FM when the FM and timing model (TM) paths diverge
  Example: branch mispredict/resolve

14 FAST-MP Approach
Multicore functional model executes as it wishes
Functional instruction stream generated (per core) and sent to the timing model
Rollback when functional model execution differs from the timing model
  Branch mispredictions, address speculation, etc.
Possible for the functional model to access memory in a different order than the target
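The checkpoint/rollback step in this approach can be sketched in a few lines of Python. This is a toy under stated assumptions, not the FAST functional model: `FunctionalCore`, its two-register state, and the per-instruction checkpoint list are hypothetical and only show how a core that ran ahead can be restored to the state before a given instruction and then re-executed.

```python
class FunctionalCore:
    """Toy functional core: a program counter and one accumulator (hypothetical)."""

    def __init__(self):
        self.regs = {"pc": 0, "acc": 0}
        self.checkpoints = []  # checkpoints[i] is the state before instruction i

    def step(self, increment):
        """Execute one instruction and return its trace entry for the timing model."""
        self.checkpoints.append(dict(self.regs))
        index = self.regs["pc"]
        self.regs["acc"] += increment
        self.regs["pc"] += 1
        return (index, increment)

    def rollback(self, inst_index):
        """Restore the state before instruction `inst_index`, dropping later checkpoints."""
        self.regs = dict(self.checkpoints[inst_index])
        del self.checkpoints[inst_index:]


core = FunctionalCore()
for inc in [1, 2, 3, 4]:       # the core runs ahead speculatively
    core.step(inc)
speculative = dict(core.regs)  # state after four instructions

core.rollback(2)               # the timing model reports divergence at instruction 2
restored = dict(core.regs)
for inc in [30, 40]:           # re-execute along the path the timing model dictates
    core.step(inc)
final = dict(core.regs)
```

Saving a full copy of the state per instruction is the simplest possible policy; a real design would presumably checkpoint less often and log only what is needed to undo.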

15 FAST-MP Memory Reordering
All memory references tagged with a version number
FM passes a version number in the trace to the TM; essentially a precondition on the validity of the given trace
If TM version != FM version:
  Freeze timing models (to avoid corrupting the TM)
  Roll back functional models to restore correct memory/architectural state
  Use TM-directed order to re-execute
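The version check can be sketched as follows. The slide does not say how versions are assigned, so this sketch assumes one counter per address that is bumped on every store; the function names `fm_execute` and `tm_check` and the operation identifiers are hypothetical.

```python
from collections import defaultdict


def fm_execute(ops):
    """Run ops in functional order; tag each with the address version it observed."""
    versions = defaultdict(int)  # assumption: one version counter per address
    trace = {}
    for op_id, kind, addr in ops:
        trace[op_id] = (kind, addr, versions[addr])
        if kind == "store":
            versions[addr] += 1
    return trace


def tm_check(trace, tm_order):
    """Replay in timing-model order; return the first op whose precondition fails."""
    versions = defaultdict(int)
    for op_id in tm_order:
        kind, addr, seen = trace[op_id]
        if versions[addr] != seen:
            return op_id  # TM version != FM version: freeze and roll back
        if kind == "store":
            versions[addr] += 1
    return None


fm_order = [("c0.store", "store", 0x100), ("c1.load", "load", 0x100)]
trace = fm_execute(fm_order)
tm_order = ["c1.load", "c0.store"]  # the target performs the load first
bad_op = tm_check(trace, tm_order)  # the load was traced after the store

if bad_op is not None:
    # Roll back, then re-execute the functional model in TM-directed order.
    by_id = {op[0]: op for op in fm_order}
    trace = fm_execute([by_id[op_id] for op_id in tm_order])
recheck = tm_check(trace, tm_order)
```

In this example the load on core 1 observed version 1 in the functional run, while the timing model expects version 0 at that point, so the trace entry is rejected and the re-executed trace passes.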

16 White
[Diagram labels: IO/MEM, IO/MEM]

17 White + Timing Model
PowerPC/Sparc ISA with arbitrary timing model
[Diagram labels: Timing Model, Net model, IO/MEM]

18 White + VM + Timing Model
Sparc ISA with QEMU to emulate any ISA
Requires trace/rollback:
  Hardware
  Software (QEMU), can also be hardware accelerated
[Diagram labels: QEMU x86 VCPU, SMP OS, X86 Timing Model, Timing Model, Net model, IO/MEM]

19 Probability of Reordered Memory Ops
Functionally driven speculation in an MP is costly if timing-ordered memory references conflict
Preliminary study on x86 applications examining atomic operations
  Use the Pin dynamic instrumentation tool to monitor every atomic operation while running a multithreaded app
  Analyze inter-atomic distance for existing shared-memory workloads (Splash2, Parsec)
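One way to compute an inter-atomic distance from such a log is sketched below. The slide does not define the metric precisely, so this assumes it is the number of cycles between an atomic operation and the most recent atomic operation to the same address, counted only when that earlier operation came from a different processor. The trace format `(cycle, cpu, address)` and the function name are hypothetical and do not reflect any Pin API.

```python
def interprocessor_reuse_distances(trace):
    """trace: (cycle, cpu, address) tuples for atomic ops, sorted by cycle."""
    last = {}  # address -> (cycle, cpu) of the most recent atomic op
    distances = []
    for cycle, cpu, addr in trace:
        if addr in last and last[addr][1] != cpu:
            distances.append(cycle - last[addr][0])
        last[addr] = (cycle, cpu)
    return distances


atomic_trace = [
    (100, 0, 0xA0),
    (150, 1, 0xA0),  # 50 cycles after cpu 0 touched 0xA0
    (160, 1, 0xA0),  # same cpu as the previous op, so not counted
    (400, 0, 0xA0),  # 240 cycles after cpu 1 touched 0xA0
    (500, 0, 0xB0),  # first op on this address, nothing to compare against
]
distances = interprocessor_reuse_distances(atomic_trace)
```

Short distances are the cases where a functional model running ahead is most likely to order two conflicting atomics differently from the target.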

20 Interprocessor Atomic Reuse Distance
[Figure: percent of atomic operations (0% to 35%) versus interprocessor reuse distance (cycles) for FFT, LU, Ocean, Radix, BlackScholes, BodyTrack, FaceSim, Ferret, FluidAnimate, FreqMine, and Swaptions]

21 Task Size Scaling on Intel CMPs
[Figure: speedup normalized to serial implementation versus task size (cycles) for Xeon5140 and XeonX3230 with 1, 2, and 4 threads]

22 FAST-MP Can Be Less Than Accurate!
Nearly accurate
  Functional model backpressured by the timing model
    Don't want to overflow buffers
    Each functional core roughly at the correct instruction relative to other cores
  Do not roll back to reorder memory operations
    Still correct, just locks taken in a different order
  Eliminate rollback overheads, probably quite accurate
  Model RAMP-White on FAST-MP to check accuracy
Functional + cache
  Run with just cache simulators
Etc.
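The effect of backpressure can be shown with a small simulation. It is a sketch under assumptions: a single core, a bounded trace buffer of `depth` entries, a functional model that can emit one entry per step, and a timing model that consumes one entry every `tm_period` steps. None of these parameters come from the slides.

```python
from collections import deque


def run(steps, depth, tm_period):
    """Return (produced, consumed, max_lead) for one backpressured core."""
    buffer = deque()
    produced = consumed = max_lead = 0
    for t in range(steps):
        if len(buffer) < depth:  # the functional model stalls when the buffer is full
            buffer.append(produced)
            produced += 1
        if t % tm_period == tm_period - 1 and buffer:
            buffer.popleft()     # the timing model retires one trace entry
            consumed += 1
        max_lead = max(max_lead, produced - consumed)
    return produced, consumed, max_lead


produced, consumed, max_lead = run(steps=100, depth=4, tm_period=2)
```

Because the functional model can never be more than `depth` entries ahead of the timing model, every core stays within a bounded distance of its target-time position, which is what keeps cores roughly aligned without rollback.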

23 QEMU on White-Leon3
QEMU with patches
  Some issues remaining with Dyngen for the V8 ISA with the Leon3 cross compiler
For initial Linux boot:
  x86 instructions: 1
  QEMU uops: ~3.1
  Sparc instructions: ~22.5
High overheads involved in address computation, segmentation checks, software TLB, etc.
Can modify/replace Leon3 to improve efficiency
  Micro-op-based processor
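The expansion ratios above imply a rough emulation rate, worked through in the sketch below. The 50 MHz clock is the Leon3 core clock from the prototype slide; the assumption of one Sparc instruction per cycle is not from the slides and makes the result an optimistic upper bound.

```python
QEMU_UOPS_PER_X86 = 3.1    # from the slide (initial Linux boot)
SPARC_PER_X86 = 22.5       # from the slide (initial Linux boot)
CORE_CLOCK_HZ = 50e6       # Leon3 core clock from the prototype slide
ASSUMED_SPARC_IPC = 1.0    # assumption: one Sparc instruction per cycle

# Host instructions needed to emulate a single QEMU micro-op.
sparc_per_uop = SPARC_PER_X86 / QEMU_UOPS_PER_X86

# Upper bound on emulated x86 instructions per second on one core.
x86_per_second = CORE_CLOCK_HZ * ASSUMED_SPARC_IPC / SPARC_PER_X86

print(f"{sparc_per_uop:.2f} Sparc instructions per QEMU uop")
print(f"{x86_per_second / 1e6:.2f} M x86 instructions per second (upper bound)")
```

Under these assumptions each micro-op costs about 7.26 Sparc instructions and a core emulates at most about 2.22 million x86 instructions per second, which is why the slide suggests modifying or replacing Leon3.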

24 Conclusions
Initial RAMP-White Alpha design is functional
FAST-MP
  Provides various ISAs
  Cycle accuracy to purely functional
  Developing power models
FAST-MP will run on top of RAMP-White as well as a standard multicore system

25 Questions
