arxiv: v1 [cs.dc] 21 Aug 2017

Size: px

Start display at page:

Download "arxiv: v1 [cs.dc] 21 Aug 2017"

Carmel Barton
6 years ago
Views:

1 GraphR: Accelerating Graph Processing Using ReRAM arxiv:78.648v [cs.dc] Aug 7 Linghao Song Duke University linghao.song@duke.edu ABSTRACT Youwei Zhuo University of Southern California youweizh@usc.edu Hai Li Duke University hai.li@duke.edu Graph processing recently received intensive interests in light of a wide range of needs to understand relationships. The graph applications are memory bound with a large number of random accesses. The memory bandwidth bottleneck is incurred by a large number of random memory accesses during neighborhood traversal. In conventional architecture, the two characteristics incur a significant amount of data movements and energy consumption, thus proposed were hardware solutions for graph processing acceleration. The current graph processing accelerators rely on memory access optimizations or place computation logics close to memory. Distinct from all existing approaches, we leverage an emerging memory technology to directly accelerate graph processing. This paper presents GRAPHR, the first ReRAM-based graph processing accelerator. GRAPHR follows the principle of near-data processing and explores the opportunity of performing massive parallel operations with low hardware and energy cost. The analog computation is suitable for graph processing because: ) The algorithms are iterative and could tolerate the imprecise values by nature; ) Both probability calculation, such as PageRank and Collaborative Filtering, and the algorithms involving integers (e.g., BFS/SSSP) are resilient to errors. GRAPHR is a novel accelerator architecture consisting of two components: memory ReRAM and graph engine (GE). The core graph computations are performed in sparse matrix format in GEs (ReRAM crossbars), which perform efficient matrix-vector multiplications. The vector/matrix-based graph computation is not new, but ReRAM offers the unique opportunity to realize the massive parallelism with unprecedented energy efficiency and low hardware cost. Because of this, the gain of performing parallel operations overshadows the wastes due to sparsity in matrix view within a small subgraph. The experiment results show that GRAPHR achieves a 6.x (up to.67x) speedup and an.8x energy saving on geometric mean compared to a CPU baseline system. The performance of GRAPHR is in the same order of GPUs without considering the data transfer time between CPU and GPU, an overhead GRAPHR does not incur. More importantly, GRAPHR consumes.6x to.68x less energy for computation only (not even counting CPU/GPU data transfer). Yiran Chen Duke University yiran.chen@duke.edu Xuehai Qian University of Southern California xuehai.qian@usc.edu. INTRODUCTION With the explosion of data collected from massive sources, graph processing received intensive interests due to the increasing needs to understand relationships: In cyber security, the graph analytics are needed to detect intrusion on the network [5]. In social media, people rely on the graph analytics to figure out the relationships and influences between people []. And significant real-world applications of graph processing include PageRank citation ranking [4], natural language processing [8, 7, 7], system biology [, 5], recommendation systems [, 48, 54] and machine learning [, 9, 5], etc., which are all fast-growing aspects in computer science and engineering research. There are several ways to perform graph processing. The distributed systems [,,, 59] leverage the ample computing resources to process large graphs. However, they inherently suffer from synchronization and fault tolerance overhead [6, 45, 6] and load imbalance [55]. Moreover, finding efficient graph cuts that minimizes communication between nodes and ensure balanced load is a difficult problem [9]. Alternatively, disk-based single-machine graph processing systems [7, 47, 5, 6] (out-of-core systems) can largely eliminate all the challenges of distributed frameworks. The key principle of such systems is to keep only a small portion of active graph data in memory and spill the remainder to disks. The third approach is the in-memory graph processing. Compared to disk-based systems, accessing data in main memory is orders of magnitude faster. The potential of inmemory data processing has been exemplified in a number of successful projects, including RAMCloud [4], Pregel [6], GraphLab [4], Oracle TimesTen [8], and SAP HANA [] Graph processing applications are memory bound with poor locality. The memory bandwidth bottleneck is incurred by a large number of random memory accesses during neighborhood traversal. The poor locality is caused by the fact that the neighbor vertices are not stored contiguously. In addition, the fact that graph operations only use a small portion of a cache block leads to memory bandwidth waste. Therefore, in conventional architecture, the graph processing incurs significant amount of data movements and energy consumption. To overcome these challenges, several hardware accelerators [4,, 4] are proposed to execute graph processing more efficiently with specialized architecture. In the following, we

2 retrospect the graph computation and the current solutions to motivate our approach. A graph can be naturally represented as an adjacency matrix and all graph algorithms can be eventually boiled down to some form of matrix-vector multiplications. On the other side, the current graph processing systems do not use the familiar matrix-based operations in high-performance computing (HPC) [4]. The reason is that the graph is sparse, and the matrix-based computation on the corresponding sparse matrix would be deemed to be inefficient, there are many multiplications with zero. For the same reason, the graph data are not stored in matrix form either, instead certain compressed sparse matrix representations are used (see more details in Section ). In this context, current graph processing systems perform two high-level actions: ) bringing data for computation from memory based on certain compressed representation; ) performing the computations on the loaded data. Due to the sparsity, the data accesses in ) are random and irregular. In essence, ) performs simple computations that are part of the matrix-vector multiplications but only on non-zero operands. As a result, each computing core experiences alternative long random memory access latency and short computations. This leads to the well-known challenges in graph processing and other issues such as memory bandwidth waste []. The current graph processing accelerators mainly optimize the memory accesses. Specifically, Graphicionado [] reduces memory access latency and improves throughput by replacing random accesses to conventional memory hierarchy with sequential accesses to scratchpad memory optimization and pipelining. Ozdal et al. [4] improves the performance and energy efficiency by latency tolerance and hardware supports for dependence tracking and consistency. TESSERACT [4] applies the principle of near-data processing by placing compute logics (e.g. in-order cores) close to memory to claim the high internal bandwidth of Hybrid Memory Cube (HMC) [44]. However, all architectures do little change on compute unit, the simple computations are performed one at a time. To perform matrix-vector multiplications, two approaches exist that reflect two ends of the spectrum: ) the densematrix-based methods incur regular memory accesses and perform computations with every elements in matrix/vector; ) the sparse-matrix-based methods used in graph processing incur random memory accesses but only perform computations on non-zero operands. In this paper, we adopt an approach that can be considered as the mid-point between these two ends that could potentially achieve better performance and energy efficiency. Specifically, we propose to perform sparse matrix-vector multiplications on data blocks of compressed representation. This idea has obvious pros and cons. The benefit is two-fold. First, the computation and data movement ratio is increased. It means that the cost of bringing data and format conversion could be naturally hidden by the larger amount of computations on a block of data. Second, inside this data block (i.e., sub-matrix/sub-graph), computations could be performed in parallel in SIMD manner. In contrast, the discrepancy of the fast computation and the long random access latency in the current systems based on sparse matrix makes it challenging to optimize performance. The downside is that certain hardware and energy will be wasted in performing useless multiplications with zero. This approach could in principle be applied to the current GPUs or accelerators, but with the same amount of compute resources (e.g. SM in GPUs), it is unclear whether the gain would outweigh the inefficiency caused by the sparsity. Clearly, a key factor is the cost of compute logic, with a low-cost mechanism to implement matrix-vector multiplications, the proposed approach is likely to be beneficial. In this paper, we demonstrate that the non-volatile memory, metal-oxide resistive random access memory (ReRAM) [57] could serve as the essential hardware building block to perform matrix-vector multiplications in graph processing. Recent works [, 49, 5] demonstrate the promising applications of efficiently in-situ matrix-vector multiplication of ReRAM on neural network computing. The analog computation is suitable for graph processing because: ) The iterative algorithms could tolerate the imprecise values by nature; ) Both probability calculation, such as PageRank and Collaborative Filtering, and the algorithms involving integers (e.g., BFS/SSSP) are resilient to errors. Due to the low-cost and energy efficiency, the dense-matrix-based computation in ReRAM would not incur significant hardware waste due to sparsity. In fact, such waste could only occur inside the ReRAM crossbar with moderate size (e.g., 8 8). As a result, the architecture will mostly enjoy the benefits of more parallelism in computation and higher ratio between computation and data movements, which reduces the pressure on memory system. Moreover, performing computation in ReRAM naturally enables near data processing with reduced data movements: the data do not need to go through the memory hierarchy like in the conventional architecture or some accelerators. Applying ReRAM in graph processing poses a few challenges: ) Data representation. In graph processing, the matrix-vector multiplications cannot be directly performed on the graph data with compressed sparse matrix representations; ) Algorithm mapping. Unlike the regular matrix computations in neural network that could almost directly match the computing capability of ReRAM, it is less intuitive to map various graph algorithms to ReRAM with hardware constrains; ) Data Movements. While ReRAM is close to memory, to ensure high throughput and efficient execution, the architecture needs to carefully orchestrate data movements and support pipelining of different operations. This paper proposes, GRAPHR, a novel ReRAM-based accelerator for graph processing. It consists of two key components: memory ReRAM and graph engine (GE), which are both based on ReRAM but with different functionality. The memory ReRAM stores the graph data in compressed sparse representation. GEs (ReRAM crossbars) perform the efficient matrix-vector multiplications on the sparse matrix representation. We propose a set of techniques to map various graph algorithms to ReRAM crossbar. The peripheral components handle the auxiliary operations depending on the algorithms and control logics to support data movements. We propose a novel streaming-apply model particularly suitable for ReRAM to coordinate data block (chunks of vertices/edges) movements. In the evaluation, we compare GRAPHR with a software

3 framework [6] on a high-end CPU-based platform and also GPU implementations. The experiment results show that GRAPHR achieves a 6.x (up to.67x) speedup and an.8x energy saving on geometric mean compared to the CPU baseline. The performance of GRAPHR is in the same order of GPUs without considering the data transfer time between CPU and GPU, an overhead GRAPHR does not incur. More importantly, GRAPHR consumes.6x to.68x less energy for computation only (not even counting CPU/GPU data transfer). This paper is organized as follows. Section introduces the background of graph processing, ReRAM and current hardware graph processing accelerators. Section describes the whole GRAPHR architecture. Section 4 presents the graph engine for various graph algorithms. Section 5 presents the evaluation methodology and experiment results. Section 6 concludes the paper.. BACKGROUND AND MOTIVATION. Graph Processing (a) Vertex and edge accesses in graph processing Vertices: Edges: (random access) (local sequential access) (global random access) (b) Access pattern in vertexcentric program Figure : Graph Processing in Vertex-Centric Program Graph algorithms traverse vertices and edges to discover interesting graph properties based on relationship. A graph could be naturally considered as a adjacency matrix, where the rows correspond to the vertices and the matrix elements represent the edges. The graph algorithms can be always mapped to matrix operations, e.g., matrix-vector multiplications. However, in reality, the graph is sparse, which means that there would be many zeros in the adjacency matrix. This property incurs the waste of both storage and compute resources. Therefore, the current graph processing systems use the format that is suitable for sparse graph data. Based on such data structures, the graph processing can be essentially considered as implementing the matrix operations on the sparse matrix representation. In this case, the common techniques (e.g. vector processing) in HPC to optimize dense matrix operations do not apply. Instead, the individual (and simple) operations in the whole matrix computation are performed by the compute units (e.g., a core in CPU or an SM in GPU) after data is fetched. In the following, we elaborate the challenge of random accesses in various graph processing approaches. More details on sparse graph data representation will be discussed in Section.4. To provide an easy programming interface, the vertexcentric program featuring Think Like a Vertex (TLAV) [6] was proposed as a natural and intuitive way for human brain to think of the graph problems. Figure (a) shows an example, an algorithm could first access and process the red vertex in the center with all its neighbors through the red edges. Then it can move to one of the neighbors, the blue vertex on the top, accessing the vertex and the neighbors through the blue edges. After that, the algorithm can access another vertex (the green one on the left), which is one of the neighbors of the blue vertex. In graph processing, the vertex accesses lead to the random accesses because the neighbor vertices are not stored continuously. For edges, they incur local sequential access because the edges related to a vertex are stored continuously but global random accesses because the algorithm needs to access the edges of different vertices. The concepts are shown in Figure (b). The random accesses lead to long memory latency and, more importantly, the bandwidth waste, because only a small portion of data are accessed in a memory block. In a conventional hierarchical memory system, this leads to the significant data movements. Clearly, reducing random accesses is critical to improve the performance of graph processing, this is particularly crucial for the disk-based single machine graph processing systems (a.k.a out-of-core systems [7, 47, 5, 6]), because the random disk I/O operations are much more detrimental to the performance. In this context, the memory is considered small and fast (therefore can afford random accesses) while disk is considered large and slow (therefore should be only accessed sequentially). The edge-centric model in X-Stream [47] is a notable solution for reducing random accesses. Specifically, the edges of a graph partition are stored and accessed sequentially in disk and the related vertices are stored and accessed randomly in memory. Such setting is feasible because typically the vertex data are much smaller than edge data. During process, X-Stream generates Updates in scatter phase, which incurs sequential writes, and then, applies these Updates to vertices in gather phase, which incurs sequential reads. The concepts are shown in Figure (a). A notable drawback of X-stream is that the amount of updates may be as large as that of edges, GridGraph [6] proposed optimizations to further reduce the storage cost and data movements due to updates. The solution is based on the dual sliding windows (shown in Figure (b)), which partitions edges into blocks and vertices into chunks. On visiting the edge blocks, the source vertex chunk (orange) is accessed and updates are directly applied to the destination vertex chunk (blue). This requires no temporary update storage as in X-Stream. Edge blocks can be accessed in source oriented order (shown in Figure (b)) or destination oriented order. Note that the dual sliding window mechanism is based on edge-centric model.. ReRAM Basics The resistive random access memory (ReRAM) [57] is an emerging non-volatile memory with appealing properties of high density, fast read access and low leakage power. ReRAM has been considered as a promising candidate for future memory architecture. A primary application of ReRAM is to be used as an alternate for main memory [9,, 6]. Figure (a) demonstrates the metal-insulator-metal (MIM) structure of an ReRAM cell. It has a top electrode, a bottom electrode and a metal-oxide layer sandwiched between electrodes. By applying an external voltage across it, an ReRAM cell can be switched between a high resistance state (HRS or OFF-state)

4 . Edge-centric scatter. Edge-centric gather (sequential read) Edges: (sequential read) Updates: Vertices(source): Updates: (random access) (sequential write) Vertices (destination): (a) Edge-centric processing (random access) Figure : (a) Edge-Centric Processing and (b) Dual Sliding Windows (b) Dual sliding windows and a low resistance state (LRS or On-state), which are used to represent the logical "" and "", respectively, as shown in Figure (b). Top Electrode Metal Oxide Current HRS( ) RESET LRS( ) Bottom Electrode b j = f ( a i i w ij ) Voltage i (a) (b) (c) SET Figure : Basics of ReRAM The ReRAM features the capability to perform in-situ matrix-vector multiplication [4, 5] as shown in Figure (c), which utilizes the property of bitline current summation in ReRAM crossbars to enable computing with high performance and low energy cost. While conventional CMOS based system showed success on neural network acceleration [5,, 5], recent works [9,, 49, 5] demonstrated that ReRAM-based architectures offer significant performance and energy benefits for the computation and memory intensive neural network computing.. Graph Processing Accelerators Due to the wide applications of graph processing and its challenges, several hardware accelerators were recently proposed. Ahn et al. [4] proposes TESSERACT, the first PIMbased graph processing architecture. It defines a generic communication interface to map graph processing to HMC. At any time, each core can Put a remote memory access and get interrupted to receive and execute the Put calls from other cores. Mustafa et al. [4] introduces an accelerator for asynchronous graph processing, which features efficient hardware scheduling and dependence tracking. To use the system, programmers have to understand its architecture and modify existing code. Graphicionado [] is a customized graph accelerator designed for high performance and energy efficiency based on off-chip DRAM and on-chip edram. It modifies the graph data structure and data path to optimize graph access patterns and designs specialized memory subsystem for higher bandwidth. These accelerators all optimize the memory accesses, reducing the latency or better tolerating the random access latency..4 Graph Representation As discussed in Section, supporting the limited dense matrix-vector multiplications on data blocks could increase the ratio between computation and data movement and reduce the pressure on memory system. With its matrix-vector multiplication capability, ReRAM could naturally perform the low-cost parallel dense operations on the sparse sub-matrices (sub-graphs), enjoying the benefits without increasing hardware and energy cost. The key insight of GRAPHR is to still store the majority of the graph data in the compressed sparse matrix representation and process the subgraphs in uncompressed sparse matrix representation. In the following, we review several commonly used sparse representations. (row,val) colptr (col,val) rowptr (row,col,val) 8 (,) (,) (,,) 7 (,4) (,8) (,,8) (,) (,,7) (,7) (,7) 4 (,) 4 (,,) 4 (,8) 4 6 (,4) 4 6 (,,4) (,) 5 (,) 5 (,,) (a) (b) CSC (c) CSR (d) COO Figure 4: (a) Sparse Matrix and Its Compressed Representations in: (b) Compressed Sparse Column (CSC), (c) Compressed Sparse Row (CSR), (d) Coordinate List (COO) The three major compressed sparse representations are compressed sparse column (CSC), compressed sparse row (CSR) and coordinate list (COO). They are illustrated in Figure 4. In the CSC representation, non-zeros are stored in column major order as (row index, value) pairs in a list, the number of entries in the list is the number of non-zeros. Another list of column starting pointers indicate the starting index of a row in the (row,val) list. For example, in Figure 4 (a), 4 in the colptr indicates that the 4-th entry in (row,val) list, i.e., (,8) is the starting of column. The number of entries in colptr is the number of columns +. Compressed sparse row (CSR) is similar to CSC, with row and column alternated. For coordinate list (COO), each entry is a tuple of (row index, column index, value) of the nonzeros B- B- B- B- (a) (b) (c) Figure 5: (a) A Directed Graph and Its Representations in (b) Adjacency Matrix and (c) Coordinate List 4

5 In GRAPHR, we assume coordinate list (COO) representation to store a graph. Given a graph in Figure 5 (a), its (sparse) adjacency matrix representation and COO representation (partitioned into four 4 4 subgraphs) are shown in Figure 5(b) and (c), respectively. In this example, the coordinate list saves 6% storing space, compared with the adjacency matrix representation. For real-world graphs with high sparsity, the saving is even high: the coordinate list can only take.% of space for WikiVote [] compared to an adjacency matrix.. GRAPHR ARCHITECTURE. Overall GraphR Architecture DRV S/H CTRL Memory ReRAM Memory ReRAM IO Interface GE GE GE GE GE S/H DRV DRV S/H Memory ReRAM Memory ReRAM S/H DRV RegI ADC S/A salu RegO CTRL S/H S/A DRV GE ADC salu RegI RegO Controller Sample & hold Shift & add unit Driver Graph Engine Analog to digital Simple ALU Input register Output register Figure 6: Architecture of GRAPHR: a ReRAM-Based Graph Processing Accelerator Figure 6 shows an overview of the proposed GRAPHR accelerator architecture. Essentially, GRAPHR is a ReRAMbased memory module that performs efficient near data parallel graph processing. It contains two key components: memory ReRAM and graph engine (GE), both are based on ReRAM but with different functionalities. The memory ReRAM stores the graph data in compressed sparse representation. GEs perform the efficient matrix-vector multiplications on the sparse matrix representation. In a GE, a number of ReRAM crossbars with Drivers (DRV) and Sample and Hold (S/H) components are placed in mesh, which are connected with Analog to Digital Converter (ADC), Shift and Add units (S/A) and simple algorithmic and logic units (salu). The input and output register (RegI/RegO) are used to cache data flow. The functionality and design details of components will be discussed in Section.5. The whole GRAPHR architecture contains a number of GEs that process a sub-graph in parallel on sparse matrix. The small-sized ReRAM crossbars ensure that the wasteful computations are not significant, as they only happen inside the crossbars. As shown in Section.4, each entry in a coordinate list should be a three-element tuple of {Source ID, Destination ID, Edge Weight}. To simplify the example, we use an unweighted graph, so a twoelement tuple of {Source ID, Destination ID} is sufficient to represent one edge. The I/O interface is attached to the ReRAM components so that the graph data and instructions could be loaded into ReRAM memory and controller, respectively. In GRAPHR, the controller is the software/hardware interface that could execute simple instructions to: ) coordinate the graph data movements between memory ReRAM and GEs based on the model in the next section; ) assign the graph edges in coordinate list representation in memory ReRAM to the ReRAM crossbars in GEs (using method in Section.); ) perform convergence check (see Section 4.).. Streaming-Apply Execution Model Src Memory ReRAM B- B- B- B- B- B- B- B- B- B- B- B- B- B- B- B- Dst GE GE GE GE Figure 7: Streaming-Apply Execution Model We propose a streaming-apply execution model so that the GRAPHR architecture can process a complete large graph. Figure 7 shows the model, which is inspired by the dual sliding windows shown in Figure (b). Initially, all data loaded into memory ReRAM through I/O interface. We use subgraph to refer to the graph data that are processed together in all GEs. In Figure 7, we have 6 subgraphs Bi-j, i, j {,,, }. Before processing, these subgraphs are stored in the original sparse representation in memory ReRAM and marked in orange. The vertex data are always stored in a vector and the edges are stored in the coordinate list. There are two vectors for the vertices: src holds the old vertex values for the vertex read and dst holds the new vertex values for the vertex write (update). After all sub-graphs are processed in one iteration, the dst is copied to src so that the next iteration could read the new vertex values. The actual processing of a subgraph happens in GEs, we use block to refer to the region in a sub-graph that is processed by a particular GE. The streaming-apply model concerns two aspects: ) intersubgraph processing, which specifies the order of subgraphs are processed; and ) intra-subgraph processing, which specifies the order of blocks are processed in GEs. Inter-subgraph processing. The subgraphs can be processed in either row-major or column-major with different trade-offs. For row-major order (e.g., B- B- B- B- B-...), a region of dst is updated after each subgraph is processed by GEs. This operation copies data from RegO to dst. It has to be done after each subgraph because the next subgraph will produce the vertex updates to another region and overwrite the original updates in RegO. On the other side, the source vertex data only need to load once to RegI for all subgraphs in the same row. For columnmajor order (e.g., B- B- B- B- B-...), a region of dst is updated after each column of sub- 5

6 graphs (e.g., {B-, B-, B-, B-}) is processed. This is because all subgraphs in the same column will update the same set of vertices (corresponding to the same region in dst). The updates generated in later subgraphs in the same column could simply overwrite the updates generated earlier. However, the vertex values need to be read from src to RegI for each subgraph. Overall, the column-major order incurs less ReRAM writes but more ReRAM reads than row-major order. Considering that the write cost is more than read cost for ReRAM, we use column-major order in GRAPHR. Intra-subgraph processing. In Figure 7, we mark the subgraph being processed by GEs in blue. To process each subgraph, the data in memory ReRAM are first converted from coordinate list to the sparse matrix representation by the controller and then loaded into GEs. For vertices, the Src Chunk (i.e., the region of vertex values in src related to the subgraph) is loaded into RegI. The updates (i.e., Dst Chunk) are stored in RegO. We see that the data movements here are only from memory ReRAM to GEs. Because the edges are not updated, there is no data movements from ReRAM crossbars back to memory ReRAM. With multiple GEs available, different blocks within a subgraph could be assigned to different GEs. Here, the key requirement is that the vertex updates in Dst Chunk should not conflict. Therefore, we can only assign each GE to process a column of blocks. When different GEs are processing different columns, they read from the shared Src Chunk and write to different regions of Dst Chunk without conflict. This intra-subgraph execution model is called Row Parallelized and Column Oriented (RPCO) processing as shown in Figure 7. Assume we have G GEs, in RPCO, they could process G columns of blocks simultaneously. Note that the blocks in the same column can be processed in any order but only sequentially, because otherwise different blocks could generate conflicting updates. In another word, we cannot allow any more parallelism. We assume a simple top to bottom order in each column.. Graph Preprocessing block B- B- GE GE CB CB CB CB Busy GE Idle GE subgraph (a) CB CB CB CB CB CB CB CB B- (c) B- B Sequential 4 Access (b) Figure 8: GraphR Processing Order Based on the streaming-apply execution model, the sparse graph data in coordinate list representation need to be preprocessed to facilitate the loading of edge data into GEs. While the preprocessing takes time, it is only done once for different algorithms, because the execution order is always the same, the difference is how to map the algorithms to GEs, which will be discussed in Section 4. The preprocessing is both determined by the execution model and a few architectural parameters of GRAPHR. We assume the users of GRAPHR is informed with these parameters so that they could perform proper preprocessing. This is nothing different from the traditional architecture-dependent optimizations that are widely used (e.g., tiling optimization based on cache size [4]). Let us assume that we have G GEs, each GE contains C ReRAM Crossbars (CBs), the size of each CB is S. According to Figure 7, all GEs together can process a subgraph with G C S vertices. For a graph with V vertices, the whole graph should be partitioned into (V /(G C S)) subgraphs. Figure 8 shows the example with V = 6,G =,C =,S =, the graph with 6 vertices are first divided into 4 subgraphs (B-, B-, B- and B-). The subgraphs are processed in column-major order. Inside each subgraph, data in a set of rows are loaded and processed in sequence by all GEs. The number of rows processed together is determined by S, the size of CBs. Note that the CBs inside a GE should be logically organized into a vector, because they will be all devoted to processing the same rows. Also, all GEs are also organized into a vector manner to cover all columns in the subgraph. Based on this setting, the blocks inside the subgraph should be organized in a rowmajor manner. In the example in Figure 8 (a), the process of B- starts with the two GEs (GE and GE ) processing the first few rows. This involves reading the coordinate list in the same order, which is marked with numbers in Figure 8 (b). We call the portion inside a block processed by a CB as sub-block. It is important to ensure that the the coordinate list entries for a sub-block are presented consecutively, so that when the mapping of this sub-block from coordinate list to sparse matrix representation in CB will only involve sequential accesses. This is the reason why the preprocessing is related to the architectural parameter S: we need to make sure all the coordinate list entries for sub-block are ordered before the entries for sub-block, which means at most S entries for each row. If a sub-block has any edge, at least one cell in the corresponding CB is non-zero, this CB is busy, otherwise, it is idle. Due to the sparsity of the graph, the sub-blocks without edges spread all over the graph. In Figure 8 (a), the row being processed by the GEs in B- illustrates such situation, where the second CB in GE is idle because there is no edge in this sub-block. Obviously, this causes resource under-utilization, but it is unavoidable using sparse matrix format. The gain of paying this inefficiency cost is to be able to process many blocks in parallel with low cost. As mentioned in Section, ReRAM offers a good justification to explore this trade-off. Note that the empty sub-block does not pose extra complexity, because in the preprocessed coordinate list, the entries for this sub-block are just skipped. When each coordinate list entry is sequentially loaded, the pair of indexes indicate the 6

7 CB it should fall into, CBs are also loaded sequentially so the skipped CB(s) could be also inferred. It is also possible that the whole GE is idle, as shown in the row being processed in B-. Occasionally, if all blocks in a row are empty, the row could be skipped and save the cycle, this happens in the second row in B-. Based on the discussion, the requirement of preprocessing is already clear. The sub-block order in the output of preprocessed coordinate list is indicated in Figure 8 (b). The sub-block without a number is skipped. Figure 8 (c) further shows a segment of the preprocessed coordinate list of B-, we can see that when converting the data in the compressed representation to sparse matrix in CB/GE, the coordinate list entries are read sequentially..4 ReRAM Crossbar Size Trade-offs (a) 8-8 Mapping (b) 4-4 Mapping (c) - Mapping Figure 9: Processing the Adjacency Matrix with (a) 8-by- 8, (b) 4-by-4 and (c) -by- ReRAM Crossbars In general, the size of ReRAM crossbars captures the tradeoff between parallelism/speedup/cost and utilization. Figure 9 (a) shows an 8 8 adjacency matrix for Figure 5 (a), where the shadowed squares represent an edge between two vertices and a blank square means there is no edge between the two vertices. In this case, although the whole graph could be processed in parallel, 6.9% of cells are wasted, and the utilization is 9.% If we consider reducing the ReRAM crossbar size to 4 4 and, as shown in Figure 9 (b) and (c), then the utilization is improved. Specifically, the 4 4 subgraph in the upper-right corner of Figure 9 (b) does not need to be mapped to ReRAM crossbar and processed, the utilization becomes 5.%. Similarly, the utilization in Figure 9 (c) is further increased to 78.%. Let ReRAM crossbar size as S, number of edges in the graph as N E, number of ReRAM crossbars in a partition as N B. We could define a metric cell utilization rate as µ = N E /(S S N B )..5 GraphR Modules This section discussed the structure of the each component in GRAPHR. Memory ReRAM It does not perform actual graph computation but is used to store edges and vertices. Typically, the number of vertices is smaller than the number of edges. Edges are stored as coordinate list while vertices are stored as a vector. Driver (DRV) It is used to ) load new edge data to ReRAM crossbars for processing; and ) input data into ReRAM crossbars for matrix-vector multiplication. We use a weighted spike coding scheme. For a N-bit input value, we need N time slots for potential spikes. Inside the driver, N levels of reference voltage, ranging from V / N to V / are generated. When writing or programing edge data to ReRAM crossbars of a GE, the driver supplies a programing voltage. Sample and Hold (S/H) It is used to sample analog values and hold them before converting to a digital form. Analog to Digital Converter (ADC) It converts analog values to digital format. Because ADCs have relatively higher area and power consumption, ADCs are not connected to every bitlines of ReRAM crossbars in a GE but shared between those bitlines. If the GE cycle is 64ns, we can have one ADC working at.gsps to convert all data from eight 8-bitline crossbars within one GE. salu It is a simple customized algorithmic and logic unit. salu performs operations that cannot be efficiently performed by ReRAM crossbars, such as comparison. Data Format It is not practical for ReRAM cells to support a high resolution. Recent work [5] reported 5-bit resolution on ReRAM programing. To alleviate the pressure of diver, we conservatively assume the 4-bit ReRAM cell. To support higher computing resolution, e.g., 6 bit, the Shift and Add (S/A) unit is employed. A 6-bit fixed point number M can be represented as M = [M,M,M,M ], where each segment M i is a 4-bit number. We can shift and add results from four 4-bit-resolution ReRAM crossbars,i.e. D + D 8 + D 4 + D to get a 6-bit result. When salu and S/A are bypassed, a graph engine could be simply considered as a memory ReRAM mat. A similar scheme of reusing ReRAM crossbars for computing and storing is employed in []. 4. PROCESSING IN GRAPH ENGINES With the overall GRAPHR architecture and the streamingapply execution model, the graph data can flow into the GEs to be processed. This section discusses the mapping of different algorithms to GEs. At high level, the algorithm mapping requires both edge-related and vertex-related operations. The edge-related operations are performed by ReRAM crossbars, and the essential problem is how to map different graph algorithms to the matrix-vector operations. We proposes two processing patterns, parallel and sequential processing, with the illustrations with concrete algorithms. The key issue in vertex-related operations is to perform the operations that could not be performed by ReRAM crossbars in the salu. Different operations should be configured in this component depending on the algorithms. 4. Edge-Related Operations 4.. Parallel Processing The graph algorithms with Parallel processing pattern can be naturally mapped to ReRAM with the best parallelism. PageRank is a representative example for this pattern. PageRank [4] is an algorithm to calculate the website popularity (the probability of staying on a website). It does the following iterative computation: PR t+ = rm PR t +( r) e. PR t is the PageRank at iteration t, M is a probability transfer matrix, r is the probability of random surfing and e is a vector of probabilities of staying in each page. We consider a small sub-graph (i.e., sub-block in Section ) that could be processed by a 5 4 CB (the additional row is to perform addition). It contains at most 6 edges represented by the intersection of 4 rows and 4 columns in the sparse matrix 7

8 Src / ) ) / ) ) / ) ) ) ) ) ) ) ) ) ) ) ) / ) ) ) ) ) ) ) ) ) ) ) ) ) ) ) ) ) ) ) ) ) ) ) ) ) ) ) Dst Src Dst - Dst 4 (c) i(4 i: i: i: 5 j:7 j:6 j:m j:m j:7 j:5 j: j:4 (c) i i i i 4 Src (RegI) ) ) ) M 5 9 M min min min min 7 6 M M M () t= Dst - ) ) ) M M 6 4 min min min min M () t= ) ) ) M M M M min min min min () t= ) ) ) M M M min min min min (RegO) Dst 4 () t=4 i i i i Src Dst (a) Src Dst i: /4 / / / i: /4 / i: /4 / i: /4 j j j j (b) / / Src / / / / Dst (b) /4 /4 /4 /4 9/6 /6 5/6 /6 (b) Figure : Graphs to Illustrate (a) PageRank and (b) SSSP (shown in Figure (a)). Thus, the block is related to 8 vertices (i.e., i i, j j). The following are the parameters for the 4 4 block PageRank shown in Figure (b): M = [,/,,;/,,,/;/,,,/;/,/,,], e = [/4,/4,/4,/4] T, r = 4/5. We define M = rm and e = ( r) e, so that PR t+ = M PRt + e. In CB, as shown in Figure (b, b), the values are already scaled with r. Figure (b) shows the parallel processing pattern of PageRank algorithm in a CB (ReRAM Crossbar). M T is mapped to the CB (blue squares). e T is mapped to the last row (green squares). For the 4 4 block, a 5 4 CB is needed (in the view of Figure 8, S = 4). The initial PageRank value (at time slot t = ) PR = [/4,/4,/4,/4] is connected to the input to the CB at every wordline in parallel. The additional row is used to implement the addition of e. We input a fixed value to the last wordline. With this mapping, the current summation on bitline is the PageRank value at time slot t =. For example, in the first bitline: 9/6 = /4 + /4 /5 + /4 + /. Such output is stored first in RegO in GE and eventually updates the value of the destination vertex. The updated values will be used in the next iteration and flow in the ReRAM crossbar at the wordlines. In summary, the algorithms with this pattern can fully leverage the parallelism of computations in ReRAM s because the computation in the next iteration only depends on the local property (neighbors) of the previous iteration. Therefore, all vertices can be processed in parallel and only one logical cycle (time slot) is needed. 4.. Serial Processing Different from PageRank, another type of iterative algorithms such as Single Source Shortest Path (SSSP) [6] and Breadth First Search (BFS) [6] requires the Serial processing pattern because the computation in the next iteration depends on the intermediate results during the processing of sub-graph in CB. Thus, each vertex can only be processed sequentially. Moreover, this type of algorithms need to maintain whether a vertex is active. This could be implemented by adding a boolean value for each vertex. We explain the serial processing pattern and the related technical issues using SSSP. In SSSP, each vertex v is given a distance label dist(v) that maintains the length of the shortest known path to vertex v from the source. The distance label is initialized to at the source and at all other nodes. Then the SSSP algorithm iteratively applies the relaxation operator, which is defined as follows: if (u,v) is an edge and dist(u)+w(u,v) < dist(v), the value of dist(v) is updated to dist(u) + w(u,v). An active vertex is relaxed by applying this operator to all the edges connected to it. Each relaxation may lower the distance label of some vertex, and when no further lowering can be performed anywhere in the graph, the resulting vertex distance labels indicate the shortest distances from the source to the each vertex, regardless of the order in which the relaxations were performed. Breadth-first numbering of a graph is a special case of SSSP where all edge labels are. Next, we explain the mapping of SSSP algorithm to a CB 8

9 using a small 8-vertex sub-graph corresponding to a 4 4 block in sparse matrix, as shown in Figure (c). It can be represented by adjacency matrix: W = [M,,5,M;M,M,,; M,M,M,M;M,M,,M] where M indicates no edge connected two vertices and M is set to a reserved maximum value for a memory cell in a CB. The values are stored in the CB shown in Figure (c). Given a vertex u and dist(u), the row in the adjacency matrix W for u indicates w(u,v). Therefore, we could perform the relaxation operator for u in parallel. However, we cannot perform the relaxation for all vertices in parallel, because of the data dependence on intermediate results. Specifically, it is possible that the relaxation operator applied to u can update dist(v), which is read by the relaxation operator of v. For this reason, each vertex should be processed in Serial manner. The relaxation operator of u involves reading: ) dist(u): it is computed iteratively from the source, for the source vertex, the initial value is zero; ) The vector of the dist(v) before the relaxation operator: it is a vector indicating the the distance between source and all other vertices and is also computed iteratively from the source. In our example, for the source vertices (i,i,i,i), the initial value is [4,,,]; ) The vector of the w(u,v): it is the distance from u to the destination vertices in the subgraph, and can be obtained by reading a row in adjacency matrix W. Figure (c) shows the process to perform SSSP in a 5 4 CB. The last row (green squares) is set to a fixed value, which is used to add dist(u) (the input to the last wordline) to each w(u, v) in the relaxation operator. The initial value for dist(u) for the destination vertices ( j, j, j, j) are [7,6,M,M]. The vector of w(u,v) is obtained by activating the wordline associated to vertex v. In the time slot (t = ), wordline (for source vertex i) is activated (the red square with input value ) and a value 4 (this is the current value in dist(v) for source i) is input to the last wordline (the green box line). The vector of the dist(v) for the destination vertices (( j, j, j, j)) is set as the value of the output at each bitline, which is [7,6,M,M]. With this mapping, the current summation in bitline in Figure (c-) is [ M + 4, + 4, 5 + 4, M + 4 ] = [M,5,9,M]. It is the dist(u) + w(u,v) computed in parallel, where u is the source vertex. Then the distance of source to each vertex v is compared with the initial dist(v) ([7,6,M,M]) by an array of comparators, and in the final output of bitline, we get [7,5,9,M], which is the updated dist(v) vector after time slot t =. The parallel comparisons are performed by vertex-related operations in Section 4.. Different algorithms may require different functions on vertices. After time slot t =, we move to the next vertex in Figure (c-), where we i) activate the second wordline; ii) change the input to the last wordline to (the distance label for source vertex i); and iii) set the intermediate dist(v) to be the final output of bitline in time slot t =, which is [7,5,9,M]. Similar as Figure (c-), the the current summation in bitline is [M,M,6,4] and it is compared with [7,5,9,M], yielding the final output of bitline for time slot t = as [7,5,6,4]. We indicate the updated distance label using orange squares. The operations in time slot t = and t = 4 can be performed in the similar manner. Initially, before processing the block, the active indicator for each destination vertex is set to be FALSE. After the serial processing in CB, the active indicators for all vertices which have been updated (marked in orange in Figure (c)) are set to be TRUE. This indicates that they are active for next iteration. In our example, j, j, j are marked active. Referring to Figure, this means that the distance labels for these vertices have been updated. Because we are processing the block in CB, the active indicator for destination vertex may be accessed for multiple times, but if it is set be TRUE at least one time, this vertex is active in next iteration. Globally, after all active source vertices and corresponding edges are processed in an iteration, source vertex properties (values and active indicators) that hold the old values are updated by the properties of the same vertices in destination. The new source vertex properties are used in the next iteration. 4. Vertex-related Operations reg(old) reg(new) salu (a) reg old) reg(new) salu 6 4 Figure : Vertex-related Processing: salu Is Configured to Perform (a) add in PageRank and (b) min in SSSP Vertex-related processing handles computations that cannot be performed by ReRAM CB. Specifically, it performs the computation on the destination vertices, which are elementwise operations. This specific operation can be set by configuring salu, different algorithms need different functions. In PageRank, salu is configured to perform add operation to add PageRank values. In Figure (a), the current PageRank values for the same destination vertices ([,7,,], which are stored in RegO)) need to be added with [,,4,5], the new values from the GE/CB. To perform this computation, salu updates values stored in register to [+,7+,+4,+5] = [4,9,6,8]. In SSSP, the old distance vector for the destination vertices is [7,5,6,4], the new distance vector generated during processing is [,,9,4], the salu is configured to perform min operation, which will update the distance vector for the destination vertices as [,,6,4] as shown in Figure (b). 4. Convergence The convergence of the computation is determined by the algorithm. After each iteration, vertices in the memory ReRAM of GRAPHR are checked to determine the convergence. For example, in PageRank, we can accumulate the PageRank values of all vertices and compare it with that of previous iteration, the algorithm is converged if the difference is less than a threshold. For Breadth First Search (BFS)/SSSP, we can check if there are active vertices to determine the convergence. 5. EVALUATION In this section, we evaluate GRAPHR architecture on performance and energy saving. (b) 9

10 Dataset # Vertices #Edges Description WikiVote(WV) [] 7.K K Wikipedia who-votes-on-whom network Slashdot(SD) [] 8K 948K Slashdot social network from February 9 Amazon(AZ) [] 6K.M Amazon product co-purchasing network from March WebGoogle(WG) [].88M 5.M Web graph from Google LiveJournal(LJ) [] 4.8M 69M LiveJournal online social network Orkut(OK) [46].M 6M Orkut social network Netflix(NF) [7] 48K users, 7.8K movies 99M Netflix movie ratings Table : Graph Datasets 5. Graph Datasets and Applications Table shows the datasets used in our evaluation. We use seven real-world graphs. For WikiVote(WV), Slashdot(SD), Amazon(AZ), WebGoogle(WG), LiveJournal(LJ), Orkut(OK) and Netflix(NF). We run pagerank(pr), breadth first search(bfs), single source shortest path (SSSP) and sparse matrix-vector multiplication (SpMV) on the first six datasets. On Netflix(NF), we run collaborative filtering(cf), and the feature length used is. 5. Experiment Setup In our experiments, we compare GRAPHR with a CPU baseline platform and a GPU platform. PR, BFS, SSSP and SpMV running on the CPU platform are based on the software framework GridGraph [6], while collaborative filtering is based on GraphChi [7]. PR, BFS, SSSP and SpMV running on GPU platform are based on Gunrock [56], while CuMF_SGD [58] is the GPU framework for CF. It is important to note that although we use the a disk-based out-of-core system, we are not comparing disk-based system with in-memory ReRAM-based system. It is because the CPU/GPU-based machine we used has 8GB memory, large enough to fit the data of all graphs in memory. Essentially, they are also doing in-memory graph processing. We also exclude the disk I/O time from the execution time of the CPU/GPU-based platform. CPU Intel Xeon E5-6 V, 8 cores,.4 GHz 8 ( + )KB L Cache 8 56KB L Cache MB L Cache Memory 8 GB Storage TB CPUs, a total number of threads. Table : Specifications of the CPU Platform Specifications of the CPU and GPU platforms are shown in Table and Table. The CPU energy consumption is estimated by Intel Product Specifications [] while NVIDIA System Management Interface (nvidia-smi) is used to estimate energy consumption by GPU. The execution times for CPU/GPU platform are measured in the computing frameworks. To evaluate GRAPHR, for the ReRAM part, we use NVSim [8] to estimate time and energy consumption. The HRS/LRS Graphic Card NVIDIA Tesla K4c Architecture Kepler # CUDA Cores 88 Base Clock 745 MHz Graphic Memory GB GDDR5 Memory Bandwidth 88 GB/s CUDA Version 7.5 Table : Specifications of the GPU Platform resistance are 5M/5K Ω, read voltage (Vr) and write voltage (Vw) are.7v and V respectively, and current of LRS/HRS are 4 ua and ua respectively. The read/write latency and read/write energy cost used are 9.ns / 5.88ns,.8 pj /.9nJ respectively from data reported in [4]. The programming of a bipolar ReRAM cell is to change (from High to Low) or inverse. For multi-level cell, the programming is to change the resistance to a middle state between High and Low, and the middle state is determined by the programming voltage pulse length. Actually, the difference between High and Low is the worse case. Note that [6, 5] describe a ReRAM programming prototype, which includes: ) writing circuitry; ) ReRAM cell/array; and ) conversion circuitry. They demonstrated the possibility of % accuracy for multi-level cell. However, in a real production system, only writing circuitry" and ReRAM cell/array" are needed, there is no need to consider the conversion, as we just need to acquiesce" a writing precision. Therefore, this energy cost estimation for 4-bit cell programming is reasonable and more conservative than two recent ReRAM-based accelerators [, 49]. For on-chip registers, we use CACTI 6.5 [] at nm to model. For Analog/Digital converters, we use data from [8]. The system performance is modeled by code instrumentation. The ReRAM crossbar size S, number of ReRAM crossbars per graph engine C and number of graph engines is 8,, 64 respectively. 5. Performance Results Figure compares the performance of GRAPHR and CPU platform. The CPU implementation is used as the baseline and execution times of applications of GRAPHR are normalized to it. Compared to CPU platform, the geometric mean of speedups with GRAPHR architecture on all 5 executions is 6.x. Among all applications on the datasets, the highest speedup achieved by GRAPHR is.67x, and it happens on sparse matrix-vector multiplication on WikiVote dataset. The low-

GraphR: Accelerating Graph Processing Using ReRAM

GraphR: Accelerating Graph Processing Using ReRAM : Accelerating Graph Processing Using ReRAM Linghao Song *, Youwei Zhuo #, Xuehai Qian #, Hai Li * and Yiran Chen * * Duke University, # University of Southern California * {linghao.song, hai.li, yiran.chen}@duke.edu,