Exploration of Cache Coherent CPU-FPGA Heterogeneous System


Wei Zhang
Department of Electronic and Computer Engineering, Hong Kong University of Science and Technology

Outline
o Introduction to FPGA-based acceleration
o Heterogeneous CPU-FPGA architecture
o A full-system simulator for exploring various CPU-FPGA architectures
o A static and dynamic combined cache optimization strategy for cache-coherent CPU-FPGA systems
o Conclusion

FPGA-based Acceleration (Field Programmable Gate Array)
o FPGA accelerator
  - Tens of times faster than a CPU
  - One to two orders of magnitude more power efficient than CPUs and GPUs
  - Post-fabrication reconfigurable
o FPGA usage in data centers
  - Microsoft, Baidu, Tencent
  - 30% cost reduction

CPU-FPGA System Architecture
o Traditional DMA-based design:
  - Manual effort for DMA control
  - Coherency handled by the user
  - Larger communication overhead (through DRAM)
  - Lower hardware cost
o Cache-coherent design:
  - Automatic coherence guarantee
  - Simple programming model
  - Smaller communication overhead (through the LLC)
  - Good for scattered accesses and large data spaces
  - Examples: Intel HARP and IBM POWER8
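The trade-off above can be illustrated with a back-of-the-envelope latency model. The cycle counts below are illustrative assumptions, not measurements from HARP or POWER8: a DMA design pays a programming cost per transfer and moves data through DRAM, while a cache-coherent design can serve scattered FPGA accesses out of the shared LLC.

```python
# Hypothetical latencies for illustration only (not measured values).
DRAM_LATENCY_CYCLES = 200   # assumed DRAM round-trip
LLC_LATENCY_CYCLES = 40     # assumed shared-LLC hit latency
DMA_SETUP_CYCLES = 1000     # assumed per-transfer DMA programming cost

def dma_cost(n_accesses, transfers):
    """DMA design: every access goes to DRAM, plus a setup cost per transfer."""
    return transfers * DMA_SETUP_CYCLES + n_accesses * DRAM_LATENCY_CYCLES

def coherent_cost(n_accesses, llc_hit_rate):
    """Coherent design: accesses hit the shared LLC with some probability."""
    hits = n_accesses * llc_hit_rate
    misses = n_accesses - hits
    return hits * LLC_LATENCY_CYCLES + misses * DRAM_LATENCY_CYCLES

# Scattered accesses to CPU-produced data: many small transfers hurt DMA,
# while the coherent design benefits from data still resident in the LLC.
print(dma_cost(10_000, transfers=100) > coherent_cost(10_000, llc_hit_rate=0.8))
```

Under these assumed numbers the coherent design wins whenever the data is still warm in the LLC; with large streaming transfers and few DMA setups, the comparison flips, which is the slide's point about DMA's lower hardware cost.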

Gem5-HDL: Heterogeneous System Simulator
o Capabilities of gem5-hdl:
  - Cycle-accurate co-simulation of hardware accelerators and CPUs
  - Based on gem5 and Verilator
  - Accelerators described in Verilog or C/C++
  - FPGA model: functionality, timing, I/O signals
  - Shared coherent caches between CPUs and accelerators on the FPGA
  - Runtime control of accelerators
  - Flexible architecture for the on-chip interconnection network (NoC or bus)
  - Cooperation between separate processes based on inter-process communication (data and control channels)
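The process-cooperation point can be sketched with a toy lock-step co-simulation over a pipe. This is not the actual gem5-hdl protocol, just a minimal illustration: a "CPU simulator" process drives cycles and a separate "FPGA model" process advances one cycle at a time, exchanging control (cycle number) and data over IPC.

```python
from multiprocessing import Process, Pipe

def fpga_model(conn):
    """Toy accelerator process: accumulates its data input each cycle."""
    acc = 0
    while True:
        msg = conn.recv()
        if msg == "stop":
            break
        cycle, data_in = msg        # control + data, mirroring the two channels
        acc += data_in
        conn.send((cycle, acc))     # expose outputs back to the CPU side
    conn.close()

def run_cosim(n_cycles):
    """CPU-simulator side: step the FPGA model in lock-step for n cycles."""
    parent, child = Pipe()
    proc = Process(target=fpga_model, args=(child,))
    proc.start()
    outputs = []
    for cycle in range(n_cycles):
        parent.send((cycle, cycle + 1))
        outputs.append(parent.recv()[1])
    parent.send("stop")
    proc.join()
    return outputs

if __name__ == "__main__":
    print(run_cosim(4))  # accumulator value after each of 4 cycles
```

In the real tool the accelerator side is a Verilated RTL model rather than a Python function, but the lock-step send/receive structure conveys why cycle accuracy survives the process boundary.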

Runtime Control Interface
[Figure: runtime control interface of gem5-hdl]

Memory Management
o Data coherence:
  - Supports both virtual and physical addresses
  - Protocol: MOESI
  - Accelerator Coherent Port: FPGA -> L2, with bus probes into the CPU's L1
  - Shared TLB: the CPU allocates a region of virtual address space for the FPGA
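The shared-TLB arrangement can be sketched as follows. The structure here is an assumption for illustration (the slides do not detail the TLB organization): the CPU maps a virtual region on behalf of the FPGA, and both sides then translate through the same table.

```python
PAGE_SIZE = 4096

class SharedTLB:
    """Toy page-granular TLB shared by CPU and FPGA accesses."""

    def __init__(self):
        self.entries = {}                        # virtual page -> physical page

    def map_region(self, vaddr, paddr, nbytes):
        """CPU-side: allocate a virtual-address region for the FPGA."""
        for off in range(0, nbytes, PAGE_SIZE):
            self.entries[(vaddr + off) // PAGE_SIZE] = (paddr + off) // PAGE_SIZE

    def translate(self, vaddr):
        """Used by both CPU and FPGA accesses; KeyError models a TLB miss."""
        ppn = self.entries[vaddr // PAGE_SIZE]
        return ppn * PAGE_SIZE + vaddr % PAGE_SIZE

tlb = SharedTLB()
tlb.map_region(vaddr=0x10000, paddr=0x80000, nbytes=2 * PAGE_SIZE)
print(hex(tlb.translate(0x10010)))   # an FPGA access through the shared TLB
```

Because the FPGA resolves addresses through the same table the CPU populated, the accelerator can chase CPU-built pointer structures without explicit address patching.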

Exploration of Memory Hierarchy
o Comparison of three memory hierarchies for the FPGA, from shared to private:
  - Hierarchy (A): FPGA - L1 - L2 (shared caches)
  - Hierarchy (B): FPGA - SPM - L2
  - Hierarchy (C): FPGA - SPM - main memory (private)

Experimental Setup
o Parameters for the memory hierarchy:
  - CPU: 2.67 GHz, x86, out-of-order execution
  - FPGA: 150 MHz
  - SPM: 32 KB, private, 4 cycles
  - L1 cache: 32 KB, private, 8-way associative, 4 cycles
  - L2 cache: 128 KB, shared, 8-way associative, 12 cycles
  - Main memory: 512 MB, DDR3-1600
o Data footprint of the PolyBench benchmarks (KB):
  - 2mm: 20.4, floyd: 16, nussinov: 16, jacobi-2d: 64, fdtd-2d: 59.7, heat-3d: 211

Experimental Results (data generated by the CPU)
o Hierarchy (A) = FPGA - L1 - L2; hierarchy (B) = FPGA - SPM - L2; hierarchy (C) = FPGA - SPM - MEM
o L1 cache vs. SPM, with frequent communication between CPU and FPGA:
  - If the amount of data > the size of the SPM, the FPGA with an L1 cache outperforms the FPGA with an SPM
  - The SPM costs less area and energy than an L1 cache

Experimental Results (data pre-stored in main memory)
[Figure: normalized execution time of jacobi-2d (64 kB), fdtd-2d (59.7 kB), heat-3d (311 kB), 2mm (20.4 kB), floyd (16 kB), and nussinov (16 kB) under hierarchy (A) = FPGA - L1 - L2, hierarchy (B) = FPGA - SPM - L2, and hierarchy (C) = FPGA - SPM - MEM]
o Cache-coherent vs. DMA:
  - If the amount of data > the size of the SPM: besides loading/unloading the SPM, there are still many FPGA accesses to the cache or main memory, and the cache helps reduce their average latency -> hierarchy (A) or (B)
  - If the amount of data <= the size of the SPM: apart from loading/unloading the SPM, there is no FPGA access to the cache or main memory, so using DMA avoids the overhead of caches -> hierarchy (C)

FPGA L1 Cache Optimization Strategy
o Static analysis of accelerated functions combined with dynamic cache bypassing control
o Each kernel owns one partition of the cache to alleviate contention
o Dynamic control handles cache bypassing and cache partitioning
o Toolchain flow:
  - Applications and accelerated kernels -> Vivado HLS and the GCC compiler -> kernel HDL module and software executable
  - Static LLVM analysis pass -> K values
  - Whole system runs on gem5-hdl with the runtime control module implemented

Static Analysis and Dynamic Control
o Reuse distance analysis (static): a set of K values obtained from static analysis describes the cache utilization of each kernel
o Partition and bypassing control (dynamic): cache partitions are assigned according to the K values; the controller decides bypassing, updates the partition table, and releases partitions when kernels finish
o Bypassed accesses skip the FPGA cache and go to the L2 cache; partitioned accesses use the replacement control in the FPGA cache and fall through to L2 on a cache miss
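A minimal sketch of reuse-distance analysis follows. The talk does not define the K values precisely, so taking K as "the reuse distance covering 90% of a kernel's reuses" is an assumption for illustration; the real analysis runs as an LLVM pass over the kernel, not over a recorded trace.

```python
import math

def reuse_distances(trace):
    """For each reuse, count distinct addresses touched since the last use."""
    last_seen = {}
    dists = []
    for i, addr in enumerate(trace):
        if addr in last_seen:
            dists.append(len(set(trace[last_seen[addr] + 1 : i])))
        last_seen[addr] = i
    return dists

def k_value(trace, coverage=0.9):
    """Smallest distance d such that >= `coverage` of reuses fit within d."""
    dists = sorted(reuse_distances(trace))
    if not dists:
        return 0
    return dists[math.ceil(coverage * len(dists)) - 1]

# A kernel with a small K value fits in a small cache partition; accesses whose
# reuse distance exceeds any affordable partition are candidates for bypassing.
trace = ["a", "b", "a", "c", "b", "a"]
print(reuse_distances(trace), k_value(trace))
```

Under this assumed definition, the dynamic controller would size each kernel's partition to roughly its K value and bypass accesses whose reuse distance exceeds it, which matches the division of labor the slide describes.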

Experiment Results
o Cache hit rate improves by 10%~20% on the tested benchmarks compared with LRU
o Total performance improves by 1.5%~6.5%
o PolyBench workload mix: floyd_warshall + covariance + gesummv; average improvement: 18.25%
o Resource overhead:
  - Partition table: for a 4-way-associative 16 KB cache, only 165 B of additional storage (about 1%)
o Latency overhead:
  - No additional delay for partitioning and bypassing
  - Negligible compared with the large miss latency

Conclusion
o Developed a cycle-accurate full-system simulator, gem5-hdl, for heterogeneous CPU-accelerator systems
  - Flexible runtime control
  - Supports various memory hierarchies
  - Enables design-space exploration
o Proposed a static and dynamic combined cache optimization strategy for cache-coherent CPU-FPGA systems
  - Static analysis of data reuse in each kernel
  - Dynamically assigns partitions and controls cache bypassing
  - Significantly improves the hit rate with negligible resource and latency overhead