Parallel graph traversal for FPGA

LETTER IEICE Electronics Express, Vol.11, No.7, 1-6

Parallel graph traversal for FPGA

Shice Ni a), Yong Dou, Dan Zou, Rongchun Li, and Qiang Wang
National Laboratory for Parallel and Distributed Processing, National University of Defense Technology, Changsha, China
a) nishice@gmail.com

Abstract: This paper presents a multi-channel memory based architecture for parallel processing of large-scale graph traversal on field-programmable gate arrays (FPGAs). By designing a multi-channel memory subsystem with two DRAM modules and two SRAM chips, and by developing an optimized pipelining structure for the processing elements, we achieve performance superior to that of a state-of-the-art, highly optimized BFS implementation using the same type of FPGA.

Keywords: FPGA, multi-channel memory, parallel graph traversal

Classification: Electron devices, circuits, and systems

References
[1] J. W. Babb, M. Frank and A. Agarwal: SPIE Photonics East (1996) 225.
[2] A. Dandalis, A. Mei and V. Prasanna: PDP (1999) 652.
[3] O. Mencer, Z. Huang and L. Huelsbergen: FPL (2002) 915.
[4] Q. Wang, W. Jiang, Y. Xia and V. Prasanna: ICFPT (2010) 70.
[5] B. Betkaoui, Y. Wang, D. B. Thomas and W. Luk: ASAP (2012) 8.
[6] D. Chakrabarti, Y. Zhan and C. Faloutsos: ICDM (2004) 442.
[7] C. M. Richard and B. Kyle: Cray User's Group (2010).

1 Introduction

Graphs are an important representation of large data sets. Graph traversal algorithms such as breadth-first search (BFS) perform data-driven computations dictated by the structure of the graph; they require fine-grained random memory accesses with poor spatial and temporal locality, so their execution times are dominated by the memory subsystem.

A field-programmable gate array (FPGA) is a reconfigurable integrated circuit. Combining the flexibility of software with the high performance of customized hardware design, FPGAs can offer superior performance for many specific applications. Many studies have attempted to implement graph traversal algorithms on FPGAs. Early methods either built a circuit that resembles the graph or used low-latency on-chip memory resources to store the entire graph [1, 2, 3], but they could not handle real-world graphs, which are too large to fit into the on-chip random-access memory (RAM) of an FPGA. Several recent publications [4, 5] describe strategies for graph traversal on FPGAs that use off-chip DRAM to accommodate large-scale graph instances; Betkaoui's work [5] is the first FPGA-based BFS implementation that can compete with high-performance multi-core systems.

In this paper, we present a multi-channel memory based architecture for a single FPGA chip that aims to achieve high performance on large-scale graph exploration. We use two dynamic random-access memory (DRAM) modules to increase the throughput of the memory subsystem and two static random-access memory (SRAM) chips to improve random read/write performance on small data sets. Moreover, we develop an optimized pipelining structure for the processing elements (PEs), which supports generating memory access requests at all pipeline stages, making it possible to saturate the memory system with fewer PEs at a higher core frequency. The experimental results show that our implementation achieves performance superior to that of a highly optimized BFS implementation using the same type of FPGA.

2 Graph representation and BFS algorithm

Sparse matrices are a general way to represent graphs. For a graph G(V, E), the sets V and E comprise n vertices and m directed edges, respectively. A directed edge from v_i to v_j corresponds to a nonzero element at location (v_i, v_j) in a sparse matrix. The compressed sparse row (CSR) representation is the most common storage format for sparse matrices. It comprises two arrays: a column index array col and a row pointer array ptr. The col array contains the column index of each nonzero element, arranged in raster order starting at the upper-left and continuing from left to right, then top to bottom. The ptr array stores the location within col where each row starts, and is terminated with the total number of nonzero elements.

Given a graph G(V, E), the BFS algorithm traverses the vertices of G in breadth-first order starting from a source vertex v_s to build a BFS tree. Each newly discovered vertex v_i is marked with its level in the BFS tree, which equals its distance from v_s: the source vertex is at level 0, its neighbors are at level 1, and so on. A parallel BFS algorithm is usually implemented in a level-synchronous way, in which each BFS level is processed in parallel while the sequential ordering of levels is preserved. A software sketch of the CSR format and the level-synchronous traversal is given below.
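To make the data layout and the traversal order concrete, the following Python sketch (a software reference model, not the authors' Verilog design) builds the col/ptr arrays described above and runs a level-synchronous BFS that fills the level array L; the helper names are ours.

```python
# Software reference model of the CSR format and level-synchronous BFS
# described in Section 2. Illustrative sketch only; helper names are ours.

def build_csr(n, edges):
    """Build CSR arrays (ptr, col) for a directed graph with n vertices.

    ptr[v] .. ptr[v+1] delimit the neighbors of v inside col;
    ptr is terminated with the total number of nonzero elements.
    """
    ptr = [0] * (n + 1)
    for src, _ in edges:
        ptr[src + 1] += 1
    for v in range(n):
        ptr[v + 1] += ptr[v]           # prefix sum: row start offsets
    col = [0] * len(edges)
    fill = ptr[:]                      # next free slot per row
    for src, dst in edges:
        col[fill[src]] = dst
        fill[src] += 1
    return ptr, col

def bfs_levels(n, ptr, col, v_s):
    """Level-synchronous BFS: process one whole level per iteration."""
    UNVISITED = -1
    L = [UNVISITED] * n                # BFS level array
    L[v_s] = 0
    frontier = [v_s]
    cur_level = 0
    while frontier:                    # one iteration per BFS level
        next_frontier = []
        for v in frontier:             # frontier vertices are independent
            for u in col[ptr[v]:ptr[v + 1]]:
                if L[u] == UNVISITED:  # visitation-status check
                    L[u] = cur_level + 1
                    next_frontier.append(u)
        frontier = next_frontier
        cur_level += 1
    return L

# Example: a 5-vertex graph rooted at vertex 0.
edges = [(0, 1), (0, 2), (1, 3), (2, 3), (3, 4)]
ptr, col = build_csr(5, edges)
print(bfs_levels(5, ptr, col, v_s=0))  # [0, 1, 1, 2, 3]
```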

3 Multi-channel memory based architecture for BFS

We design a multi-channel memory based architecture that accommodates large-scale graphs and fully exploits the parallelism of the FPGA for a high-performance BFS implementation. Our design is implemented on a self-designed reconfigurable algorithm accelerator that resides in a host server. The architecture of the system is shown in Fig. 1. The accelerator consists of an FPGA chip, a multi-channel memory subsystem and a PCI-E interface to the host server. The memory subsystem comprises four independent memory channels: two DRAM channels connect to two external DRAM modules (DRAM0/DRAM1) and two SRAM channels connect to two external SRAM chips (SRAM0/SRAM1). The accelerator receives the graph in CSR format and the root vertex, executes the BFS algorithm, and sends the data structure containing the BFS tree back to the host. The BFS algorithm is executed on the PE array, whose PEs run in parallel in a multithreaded fashion. Each PE has local memory, which is mainly used as data buffers to temporarily hold data moving to and from external memory.

Fig. 1. Multi-channel memory based architecture for BFS

For our multi-channel memory based system, a key decision is the choice of memory type for each data structure of the BFS implementation. Our main selection criteria are the size and access pattern of each data set. There are three main data sets: the graph in CSR format, a BFS level array and a visitation status bitmap. Real-world graphs are usually hundreds of megabytes (MB) in size, so we store the graph in DRAM. The BFS level array L, which stores the structure of the BFS tree, is usually tens of MB in size, so it is also stored in DRAM. The visitation status of all vertices is stored in a bitmap, which is usually several MB in size and requires frequent random memory accesses, so we store it in SRAM. Because a DRAM access takes a varying amount of time before it returns data from off-chip memory, on-chip RAM resources are mainly used as data buffers for data arriving from off-chip DRAM, including the offset addresses for accessing vertex neighbors and the vertex levels that are ready to be updated.

The general framework of our FPGA-based BFS implementation consists of three steps (a sketch of the host-side initialization follows the list):

Initialization of the data structures. The host processor initializes the BFS level array L, the visitation status bitmap array B and the BFS level indicator curLevel. The vertex set V of graph G(V, E) is then evenly partitioned into two subsets V_0 = {0, 1, ..., |V|/2} and V_1 = {|V|/2 + 1, ..., |V| - 1}. Correspondingly, G, L and B are partitioned into G_0/G_1, L_0/L_1 and B_0/B_1. After that, G_0/L_0 and G_1/L_1 are sent to DRAM0 and DRAM1, respectively; B_0 and B_1 are sent to SRAM0 and SRAM1, respectively.

Concurrent computation on the PE array. After the memory subsystem is initialized, the BFS kernel executes asynchronously on each PE for a given BFS level. Each PE explores one vertex subset of V. Once a PE has explored all the vertices in its subset, it sends a termination flag to the PE array controller, indicating whether any vertices were marked for the next BFS level.

Synchronization on the PE array. Each PE waits for all the other PEs to finish their assigned vertices. Termination is reached when no vertices are marked for the next BFS level; at that point the PE array controller issues a termination signal to the host processor to indicate that the execution is complete. The host processor then receives the computation results from DRAM0 and DRAM1.
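A minimal sketch of the initialization step, assuming the even halving of the vertex range described above; send_dram and send_sram are hypothetical stand-ins for the PCI-E transfer routines, which the letter does not specify.

```python
# Hypothetical host-side initialization: even two-way partition of the
# vertex set and placement of each half on its own memory channel.
# send_dram/send_sram stand in for the (unspecified) PCI-E transfers.

def partition_and_place(n, ptr, col, send_dram, send_sram):
    """Split G/L/B in half; place G_k/L_k in DRAM_k and B_k in SRAM_k."""
    UNVISITED = -1
    mid = n // 2
    subsets = [range(0, mid + 1), range(mid + 1, n)]   # V_0 and V_1
    for k, vs in enumerate(subsets):
        # G_k: CSR rows of the vertices in V_k, with rebased row pointers.
        base = ptr[vs.start]
        ptr_k = [ptr[v] - base for v in vs] + [ptr[vs.stop] - base]
        col_k = col[ptr[vs.start]:ptr[vs.stop]]
        L_k = [UNVISITED] * len(vs)            # BFS level array half
        B_k = bytearray((len(vs) + 7) // 8)    # visitation bitmap half
        send_dram(k, graph=(ptr_k, col_k), levels=L_k)
        send_sram(k, bitmap=B_k)
```

The bitmap halves go to SRAM because stage S4 (Section 4) probes them with fine-grained random reads and writes, exactly the pattern DRAM handles poorly.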

4 Processing element design

The execution time of the BFS algorithm is dominated by memory access latency, so a PE would otherwise sit idle most of the time while waiting for data from the memory subsystem. To exploit the capability of the multi-channel memory subsystem, our PE design is based on a pipelined architecture that accesses the off-chip memory subsystem in parallel: multiple concurrent memory requests are issued from the different pipeline stages of each PE and kept in flight at any given time, making efficient use of the high bandwidth and parallelism of the multi-channel memory subsystem.

Fig. 2 presents an overview of the PE design for the BFS kernel. The design consists of four pipeline stages, indicated by dashed lines in the figure.

Fig. 2. Processing element architecture

The processing in each PE proceeds as follows (a behavioral sketch of the four stages follows the list):

Read vertices to be explored (S1). For each vertex v_i of the PE's vertex subset, this stage reads the BFS level L[v_i] from DRAM in batches into the local buffer Buf1, using multiple non-blocking memory requests. This process is repeated until all vertices of the PE's vertex subset have been processed.

Row index gathering (S2). This stage reads L[v_i] from Buf1. If L[v_i] equals the current BFS level, the vertex belongs to this level and its neighbors will be explored in the current iteration; otherwise, the PE checks the BFS level of the next vertex v_{i+1}. For each vertex v_i to be explored, the row index pair ptr[v_i] and ptr[v_i + 1] is retrieved from DRAM into local buffer Buf2. This process is repeated until Buf1 is empty and stage S1 has completed.

Neighbor vertex gathering (S3). This stage reads each row index pair ptr[v_i] and ptr[v_i + 1] from Buf2. The neighbors of vertex v_i, namely col[ptr[v_i] : ptr[v_i + 1]], are retrieved from DRAM in batches into local buffer Buf3. This process is repeated until Buf2 is empty and stage S2 has completed.

Status look-up and update (S4). This stage reads each gathered neighbor from Buf3 and checks its visitation status by reading the status bitmap from SRAM. Unvisited vertices have their level value updated to the current BFS level plus one, and the status bitmap is updated accordingly. This process is repeated until Buf3 is empty and stage S3 has completed.
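The following behavioral sketch models the four stages in software, with Python generators standing in for the stages and their hand-off buffers. It captures the dataflow of S1-S4 but not the concurrency: in hardware each stage keeps multiple non-blocking memory requests in flight. All names are ours, and the bitmap is a plain list here rather than bit-packed SRAM words.

```python
# Behavioral model of one PE's four-stage pipeline (S1-S4).
# Generators stand in for pipeline stages, lists for Buf1-Buf3.

def s1_read_levels(subset, L):
    for v in subset:                      # S1: read L[v] from DRAM
        yield v, L[v]

def s2_gather_row_index(buf1, ptr, cur_level):
    for v, lvl in buf1:                   # S2: keep frontier vertices only
        if lvl == cur_level:
            yield ptr[v], ptr[v + 1]      # row index pair from DRAM

def s3_gather_neighbors(buf2, col):
    for lo, hi in buf2:                   # S3: fetch col[lo:hi] from DRAM
        yield from col[lo:hi]

def s4_lookup_update(buf3, bitmap, L, cur_level):
    marked_any = False
    for u in buf3:                        # S4: visitation check in SRAM
        if not bitmap[u]:
            bitmap[u] = True              # mark visited
            L[u] = cur_level + 1          # level update in DRAM
            marked_any = True
    return marked_any                     # termination flag to controller

def pe_kernel(subset, ptr, col, L, bitmap, cur_level):
    """One BFS level on one PE; returns the PE's termination flag."""
    buf1 = s1_read_levels(subset, L)
    buf2 = s2_gather_row_index(buf1, ptr, cur_level)
    buf3 = s3_gather_neighbors(buf2, col)
    return s4_lookup_update(buf3, bitmap, L, cur_level)
```

Driving pe_kernel once per level on each vertex subset, and stopping when no PE returns True, reproduces the level-synchronous traversal of Section 2.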

5 Implementation and performance evaluation

We evaluated the proposed architecture on a self-designed FPGA-based reconfigurable algorithm accelerator composed of one Xilinx Virtex-5 XC5VLX330 FPGA, two Kingston KVR667D2S5/2G 2 GB DRAM modules, two Samsung K7N323601M-QC20 4 MB SRAM chips (the board carries three SRAM chips, one of which stores the configuration bits of the FPGA) and a PCI-E interface to the host computer, implemented on a small Virtex-5 XC5VLX50T FPGA. Table I lists the configurations of the platforms used in our experiments and in the previous work.

Table I. Specification of platforms used in our performance comparison

Platform        Frequency  Core num.  Memory size  DRAM bandwidth
ASAP12-1F [5]   75 MHz     512        16 GB        80 GB/s
Ours            160 MHz    32         4 GB         12.8 GB/s

The graphs used in this paper are scale-free graphs, which exhibit the common features observed in many large-scale real-world graphs. They are generated with the Kronecker generator [6] of the Graph500 benchmark [7]. As in previous related work, we measure BFS performance as the number of traversed edges per second (TEPS), computed by dividing the actual number of edges in the BFS tree by the execution time.
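As a worked example with our own illustrative numbers (not measurements from the paper): a graph with 2^24 vertices and average arity 16 has about 2^28 edges, so traversing them in 0.4 s yields roughly 0.67 GTEPS, the order of magnitude reported below in Table III.

```python
def gteps(edges_traversed, seconds):
    """Traversed edges per second, in units of 10^9 (GTEPS)."""
    return edges_traversed / seconds / 1e9

# Illustrative numbers only: 2**24 vertices with average arity 16.
print(gteps(edges_traversed=16 * 2**24, seconds=0.4))  # ~0.67
```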

We implemented our design in 8-PE, 16-PE and 32-PE configurations. The BFS hardware design is expressed in RTL using Verilog HDL and was compiled with Xilinx ISE v13.1. The resource consumption and timing results, which include the two DRAM controllers, are reported in Table II. All the implementations reach a clock frequency above 160 MHz, and the frequency does not drop markedly as the array grows from 8 to 32 PEs, showing the good scalability of our design.

Table II. Device utilization summary

PE number  Slice LUTs  Block RAMs  Maximum frequency (MHz)
8          24%         47%         212
16         31%         59%         196
32         47%         86%         172

Setting the vertex number to 16 million with an average arity of 8 to 32, we compare the BFS performance of our platform with that of ASAP12-1F, as shown in Table III. ASAP12-1F, the single-FPGA implementation of [5], is a state-of-the-art, highly optimized BFS implementation for large-scale graphs on FPGA, capable of outperforming multi-core CPU platforms. Using a single Virtex-5 LX330 FPGA with 32 PEs, our design outperforms ASAP12-1F with 128 PEs on the same type of FPGA by a factor of 1.25x to 1.32x.

Table III. Performance comparison on reference platforms

Arity  ASAP12-1F (GTEPS)  Ours (GTEPS)  Speedup
8      0.39               0.49          1.26
16     0.53               0.66          1.25
32     0.72               0.95          1.32

6 Conclusion

In this letter, we proposed a multi-channel memory based architecture for efficient parallel processing of large-scale graph traversal on FPGA. By using two DRAM modules and two SRAM chips to improve the performance of the memory subsystem, and by developing an optimized pipelining structure for the PEs, the implementation provides high bandwidth utilization for graph traversal algorithms. The experimental results show that our implementation achieves performance superior to that of a state-of-the-art, highly optimized BFS implementation using the same type of FPGA.

Acknowledgments

This research is supported by the National Science Foundation of China under Grant No. 61125201 and the Doctoral Fund of the Ministry of Education of China under Grant No. 60911062.