Flexible Architecture Research Machine (FARM)

Flexible Architecture Research Machine (FARM)
RAMP Retreat, June 25, 2009
Jared Casper, Tayo Oguntebi, Sungpack Hong, Nathan Bronson, Christos Kozyrakis, Kunle Olukotun

Motivation
- Why CPUs + FPGAs make sense
  - Application acceleration: prototyping new functionality, low-volume production
  - FPGAs are getting computationally denser
  - Simulators/research prototypes: software matters, so experiment with new architectures
- Combine the best of both worlds

Research Challenges
- When is it a good idea to use FPGAs + CPUs?
  - Coarse-grained applications are great: video encoding, DSP, etc.
  - But what about fine-grained communication? Fine-grained in space? Fine-grained in time?
- How?
  - Hardware vs. software balance
  - Mechanisms to reduce/hide overheads

The Stanford FARM
- High performance, yet flexible
  - Commodity CPUs, memory, and I/O for a fast system with rich SW support
  - FPGAs to prototype new accelerators
- FARM in a nutshell
  - A research machine: personalize computing (threads, vectors, reconfigurable, ...), personalize memory (shared memory, transactions, streams, ...), personalize I/O (off-loading engines, coherent I/O, ...)
  - An industrial-strength cluster: state-of-the-art CPUs, GPUs, memory, and I/O; Infiniband or PCIe interconnect; scalable to 10s or 100s of nodes

FARM Node
[Diagram: baseline node with four quad-core CPUs, each with locally attached memory.]

FARM Node
[Diagram: node variant with three quad-core CPUs and an FPGA (with SRAM and I/O) in the fourth socket, each CPU with locally attached memory.]

FARM Node
[Diagram: node variant with two quad-core CPUs, a GPU/stream processor, and an FPGA with SRAM and I/O.]

FARM System View
[Diagram: system view; nodes containing quad-core CPUs and an FPGA (with SRAM and I/O) connect through a scalable Infiniband or PCIe interconnect.]

Procyon System
- Initial platform for FARM, from A&D Technology, Inc.
- Full system board: AMD Opteron (Socket F), two DDR2 DIMMs, USB/eSATA/VGA/GigE, Sun OpenSolaris OS
- Extra CPU board: AMD Opteron (Socket F)
- FPGA board: Altera Stratix II FPGA
- All connected via an HT backplane, which also provides PCIe and PCI

Procyon System Communication Diagram
[Diagram: two AMD Barcelona sockets (four 1.8 GHz cores each, 64 KB L1 and 512 KB L2 per core, 2 MB shared L3) linked by HyperTransport at 32 Gbps with ~60 ns latency; the Altera Stratix II FPGA (~132K logic elements) attaches over a 6.4 Gbps HyperTransport link with ~380 ns latency. On the FPGA, a coherent HT core (PHY and link layers), transfer engine, MMR interface, data stream interface, data cache interface, and a configurable coherent cache manage system communication for the user application. Numbers from the A&D Procyon.]

Overhead on Procyon
Issues to resolve:
- FPGA communication latencies: a round trip costs ~1400 instructions from the closer Opteron and ~1700 from the farther one; access times are also non-uniform across cores
- Frequency discrepancy: 1.8 GHz CPUs vs. 100 MHz FPGA
- Synchronization

A Simple Analytical Model
Goals:
- A high-level model for predicting accelerator speedup
- Intuition into when accelerating makes sense: hardware requirements, application requirements

A Simple Analytical Model

Speedup = G(T_on + T_off) / (G(T_on + a*T_off) + t_ovhd - t_ovlp)

where:
- T_off: time to execute the offloaded work on the processor
- a: acceleration factor for the offloaded work (a doubled rate gives a = 0.5)
- T_on: time to execute the remaining (unaccelerated) work on the processor
- G: fraction of the offloaded work done between each communication with the accelerator
- t_ovlp: time the processor spends doing work in parallel with communication and/or work done on the accelerator
- t_ovhd: communication overhead
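
To build intuition, here is a minimal sketch in C that evaluates the formula as reconstructed above; the parameter values in main are purely hypothetical, not measurements from the talk.

```c
#include <stdio.h>

/* speedup = G*(T_on + T_off) / (G*(T_on + a*T_off) + t_ovhd - t_ovlp) */
static double model_speedup(double t_on, double t_off, double a,
                            double g, double t_ovhd, double t_ovlp)
{
    return g * (t_on + t_off) /
           (g * (t_on + a * t_off) + t_ovhd - t_ovlp);
}

int main(void)
{
    /* Hypothetical numbers: 10 us of offloadable work accelerated 4x
     * (a = 0.25), 2 us of remaining work, full granularity (G = 1),
     * 1.5 us communication overhead, no overlap. */
    printf("speedup = %.2f\n",
           model_speedup(2e-6, 10e-6, 0.25, 1.0, 1.5e-6, 0.0));
    return 0;
}
```

With these numbers the overhead eats into the 4x acceleration and the net speedup comes out to 2.0; shrinking G (more frequent communication) pushes it toward the breakeven point.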

A Simple Analytical Model: Synchronization

Model Verification
- Microbenchmark: essentially a loop that offloads work to the FPGA
- No-ops simulate unaccelerated work on the processor
- Each instance of communication transfers 64 bytes of data
- Used to measure speedup for varying system/application choices
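
A plausible shape for that microbenchmark, sketched in C; fpga_slot, the 64-byte command format, and the loop counts are assumptions, with the no-op count standing in for the granularity knob.

```c
#include <stdint.h>

#define CMD_QWORDS 8   /* each communication transfers 64 bytes */

void microbench(volatile uint64_t *fpga_slot, long iters, long noops)
{
    uint64_t cmd[CMD_QWORDS] = {0};
    for (long i = 0; i < iters; i++) {
        for (int q = 0; q < CMD_QWORDS; q++)   /* offload one 64B command */
            fpga_slot[q] = cmd[q];
        for (long n = 0; n < noops; n++)       /* simulated unaccelerated work */
            __asm__ __volatile__("nop");
    }
}
```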

A Simple Analytical Model: Results
[Chart: speedup vs. granularity (normalized by round-trip latency, log scale from 0.01 to 100) for full-synch, half-synch, and asynchronous communication, each modeled and measured. Annotations mark the theoretical speedup limit (set by the offloaded work) and the breakeven points for the full-synch and half-synch models.]

Initial Application: Transactional Memory
- Accelerate STM without changing the processor
- Use the FPGA in FARM to detect conflicts between transactions
  - Significantly improves expensive read barriers in STM systems
- The FPGA can also atomically perform transaction commit
  - Provides strong isolation from non-transactional accesses
  - Not used in the current version of FARM
- A good application for varying the granularity of communication
  - FPGA communication on all shared-memory accesses: a potential worst case (lots of communication)

FPGA Hardware Overview
[Block diagram: committer, cache, and RSM modules sit behind the HT interface, which connects through the HT core to the HyperTransport link.]

FPGA Utilization
- CPU frequency: 1.8 GHz
- HyperTransport frequency: HT400
- FPGA device: Stratix II EP2S130
- Logic utilization: 62%
- Total registers: 43K
- Combinational LUTs: 51%
- Dedicated logic registers: 41%
- Pin usage: 33%
- Block memory: 10% (depends on cache)
- PLLs: 4/12 (33%)
- Logic frequency: 100 MHz

CPU → FPGA Communication
- Driver modifies system registers to create a DRAM address space mapped to the FPGA
  - Effectively unlimited size (40-bit addresses)
- User application maps the addresses into its virtual address space using mmap
- No kernel changes necessary
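
A minimal sketch of what the user-level mapping step might look like; the /dev/farm device node and the window size are hypothetical, since the talk does not name the driver interface.

```c
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

#define FPGA_WINDOW_BYTES (1UL << 20)   /* assumed 1 MB command window */

/* Map the FPGA's physical address window into this process. */
volatile uint64_t *map_fpga(void)
{
    int fd = open("/dev/farm", O_RDWR | O_SYNC);   /* hypothetical node */
    if (fd < 0) { perror("open"); return NULL; }

    void *p = mmap(NULL, FPGA_WINDOW_BYTES, PROT_READ | PROT_WRITE,
                   MAP_SHARED, fd, 0);
    close(fd);
    if (p == MAP_FAILED) { perror("mmap"); return NULL; }
    return (volatile uint64_t *)p;
}
```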

CPU → FPGA Commands
- Uncached stores: half-synchronous communication; writes are strictly ordered
- Write-combining buffers: asynchronous until the buffer overflows; command offset: configure addresses to maximize merging
- DMA: fully asynchronous; write to cached memory and let the FPGA pull
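
A sketch of the write-combining path, assuming an x86 CPU, a 64-byte command slot, and an explicit sfence to flush the buffer; the talk confirms only the use of write-combining buffers, so the rest is illustrative.

```c
#include <stdint.h>
#include <xmmintrin.h>   /* _mm_sfence */

#define CMD_QWORDS 8     /* one 64-byte command */

/* Post one command through a write-combining mapping: the eight
 * stores merge in the WC buffer and reach the FPGA when the buffer
 * fills or is explicitly flushed. */
void post_command(volatile uint64_t *wc_slot, const uint64_t *cmd)
{
    for (int i = 0; i < CMD_QWORDS; i++)
        wc_slot[i] = cmd[i];   /* stores merge in the WC buffer */
    _mm_sfence();              /* flush the combined 64B write now */
}
```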

FPGA → CPU Communication
- FPGA writes to coherent memory
  - Needs a static physical address (e.g. a pinned page cache) or a coherent TLB on the FPGA
  - Asynchronous but expensive; usually involves stealing a cache line from the CPUs
- CPU reads memory-mapped registers (MMRs) on the FPGA
  - Synchronous, but efficient
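
The two paths, sketched in C; the pointer names are assumptions.

```c
#include <stdint.h>

/* Path 1 (asynchronous): the FPGA writes a flag into coherent memory
 * at a pinned physical address. The CPU's check is a cheap cached
 * load, but each FPGA update steals the cache line back. */
int violation_pending(volatile uint32_t *coherent_flag)
{
    return *coherent_flag != 0;
}

/* Path 2 (synchronous): the CPU reads a memory-mapped register on
 * the FPGA, an uncached load across HyperTransport. */
uint32_t read_mmr(volatile uint32_t *mmr_base, unsigned reg)
{
    return mmr_base[reg];
}
```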

Communication in TM
- CPU → FPGA: use the write-combining buffer; DMA not needed, yet
- FPGA → CPU: violation notification uses coherent writes (free incremental validation); final validation uses an MMR

Tolerating FPGA-CPU Latency
- Challenge: unbounded latency leads to unknown ordering of commands from various processors
- Solution: decouple the timeline of CPU command firing from FPGA reception
  - Embed a global time stamp in commands to the FPGA
  - Software or hardware increments the time stamp when necessary, dividing time into epochs
  - Currently using an atomic increment; looking into Lamport clocks
  - The FPGA uses the time stamp to reason about ordering
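
A sketch of the time-stamping scheme, assuming a 64-byte command layout with an epoch field and a shared software counter bumped by an atomic increment (as the slide says is currently done); the names and layout are illustrative, not FARM's actual encoding.

```c
#include <stdint.h>

struct fpga_cmd {
    uint64_t epoch;        /* global time stamp */
    uint64_t opcode;
    uint64_t payload[6];   /* pad to one 64-byte command */
};

static volatile uint64_t global_epoch;   /* shared by all threads */

/* Open a new epoch when needed (GCC/Clang atomic builtin). */
void new_epoch(void)
{
    __sync_add_and_fetch(&global_epoch, 1);
}

/* Every command carries the current epoch so the FPGA can reason
 * about ordering across CPUs despite unbounded delivery latency. */
void stamp_command(struct fpga_cmd *c)
{
    c->epoch = global_epoch;
}
```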

Global and Local Epochs
[Timeline diagram: commands A, B, and C issued across epochs N-1, N, and N+1, shown under both global and local epoch schemes.]
- Global epochs: finer grain, but require global state; we know A < B,C but nothing about B vs. C
- Local epochs: cheaper, but coarser grain (non-overlapping epochs); we know C < B, but nothing about A vs. B or A vs. C

Example: Use in TM
- Read barrier: send a command with the global timestamp and the read reference to the FPGA
  - FPGA maintains a per-transaction Bloom filter
- Commit: send commands with the global timestamp and each written reference to the FPGA
  - FPGA notifies the CPU of already-known violations
  - Maintains a Bloom filter for this epoch; violates new reads in the same epoch
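
A sketch of the accelerated read barrier, reusing the hypothetical fpga_cmd layout and helpers from the sketches above; OP_TXN_READ is an assumed opcode.

```c
#include <stdint.h>

struct fpga_cmd {
    uint64_t epoch, opcode, payload[6];   /* 64-byte command */
};

void stamp_command(struct fpga_cmd *c);                          /* earlier sketch */
void post_command(volatile uint64_t *slot, const uint64_t *cmd); /* earlier sketch */

#define OP_TXN_READ 1   /* hypothetical opcode */

void tm_read_barrier(volatile uint64_t *cmd_slot, const void *addr)
{
    struct fpga_cmd c = { .opcode = OP_TXN_READ };
    c.payload[0] = (uint64_t)(uintptr_t)addr;  /* the read reference */
    stamp_command(&c);                         /* attach global timestamp */
    post_command(cmd_slot, (const uint64_t *)&c);
    /* The FPGA inserts addr into this transaction's Bloom filter;
     * no software validation work happens on this path. */
}
```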

TM Time Stamp Illustration
[Timeline diagram: CPU 0 reads x while CPU 1 starts a transaction, commits, and locks x; the FPGA then signals a violation on x.]

Synchronization Fence
- Occasionally you need to synchronize, e.g. TM validation before commit
- Decoupling the FPGA and CPU makes this expensive, so it should be rare
- Send a fence command to the FPGA; the FPGA notifies the CPU when done
- Initially used a coherent write: too expensive
- Improved: the CPU reads an MMR
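
A sketch of the improved fence, again reusing the hypothetical command helpers; OP_FENCE and the MMR index are assumptions.

```c
#include <stdint.h>

struct fpga_cmd { uint64_t epoch, opcode, payload[6]; };
void stamp_command(struct fpga_cmd *c);                          /* earlier sketch */
void post_command(volatile uint64_t *slot, const uint64_t *cmd); /* earlier sketch */

#define OP_FENCE       3   /* hypothetical opcode */
#define MMR_FENCE_DONE 0   /* hypothetical register index */

void tm_fence(volatile uint64_t *cmd_slot, volatile uint32_t *mmr_base)
{
    struct fpga_cmd c = { .opcode = OP_FENCE };
    stamp_command(&c);
    post_command(cmd_slot, (const uint64_t *)&c);

    /* Synchronous but efficient: an uncached MMR read beats waiting
     * for the FPGA to steal a cache line with a coherent write. */
    while (mmr_base[MMR_FENCE_DONE] == 0)
        ;  /* spin until the FPGA has drained commands up to the fence */
}
```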

Results
[Chart: single-thread execution breakdown for the STAMP apps.]

Results
[Chart: speedup over sequential execution for the STAMP apps.]

Classic Lessons
- Bandwidth
- CPU vs. simulator: in-order, single-cycle CPUs do not look like modern processors (Opteron)
- Off-chip is hard: CPUs are optimized for caches, not off-chip communication
- Wish list: a truly asynchronous, fire-and-forget method of writing to the FPGA; letting the accelerator write directly into the cache

Possible Directions
- Possibility of building a much bigger system (~28 cores)
- Security: memory watchdog, encryption, etc.
- Traditional hardware accelerators: scheduling, cryptography, video encoding, etc.
- Communication accelerator: a partially coherent cluster with the FPGA connecting coherence domains

Questions