arxiv: v1 [cs.dc] 21 Aug 2017

Size: px
Start display at page:

Download "arxiv: v1 [cs.dc] 21 Aug 2017"

Transcription

1 GraphR: Accelerating Graph Processing Using ReRAM arxiv:78.648v [cs.dc] Aug 7 Linghao Song Duke University linghao.song@duke.edu ABSTRACT Youwei Zhuo University of Southern California youweizh@usc.edu Hai Li Duke University hai.li@duke.edu Graph processing recently received intensive interests in light of a wide range of needs to understand relationships. The graph applications are memory bound with a large number of random accesses. The memory bandwidth bottleneck is incurred by a large number of random memory accesses during neighborhood traversal. In conventional architecture, the two characteristics incur a significant amount of data movements and energy consumption, thus proposed were hardware solutions for graph processing acceleration. The current graph processing accelerators rely on memory access optimizations or place computation logics close to memory. Distinct from all existing approaches, we leverage an emerging memory technology to directly accelerate graph processing. This paper presents GRAPHR, the first ReRAM-based graph processing accelerator. GRAPHR follows the principle of near-data processing and explores the opportunity of performing massive parallel operations with low hardware and energy cost. The analog computation is suitable for graph processing because: ) The algorithms are iterative and could tolerate the imprecise values by nature; ) Both probability calculation, such as PageRank and Collaborative Filtering, and the algorithms involving integers (e.g., BFS/SSSP) are resilient to errors. GRAPHR is a novel accelerator architecture consisting of two components: memory ReRAM and graph engine (GE). The core graph computations are performed in sparse matrix format in GEs (ReRAM crossbars), which perform efficient matrix-vector multiplications. The vector/matrix-based graph computation is not new, but ReRAM offers the unique opportunity to realize the massive parallelism with unprecedented energy efficiency and low hardware cost. Because of this, the gain of performing parallel operations overshadows the wastes due to sparsity in matrix view within a small subgraph. The experiment results show that GRAPHR achieves a 6.x (up to.67x) speedup and an.8x energy saving on geometric mean compared to a CPU baseline system. The performance of GRAPHR is in the same order of GPUs without considering the data transfer time between CPU and GPU, an overhead GRAPHR does not incur. More importantly, GRAPHR consumes.6x to.68x less energy for computation only (not even counting CPU/GPU data transfer). Yiran Chen Duke University yiran.chen@duke.edu Xuehai Qian University of Southern California xuehai.qian@usc.edu. INTRODUCTION With the explosion of data collected from massive sources, graph processing received intensive interests due to the increasing needs to understand relationships: In cyber security, the graph analytics are needed to detect intrusion on the network [5]. In social media, people rely on the graph analytics to figure out the relationships and influences between people []. And significant real-world applications of graph processing include PageRank citation ranking [4], natural language processing [8, 7, 7], system biology [, 5], recommendation systems [, 48, 54] and machine learning [, 9, 5], etc., which are all fast-growing aspects in computer science and engineering research. There are several ways to perform graph processing. The distributed systems [,,, 59] leverage the ample computing resources to process large graphs. However, they inherently suffer from synchronization and fault tolerance overhead [6, 45, 6] and load imbalance [55]. Moreover, finding efficient graph cuts that minimizes communication between nodes and ensure balanced load is a difficult problem [9]. Alternatively, disk-based single-machine graph processing systems [7, 47, 5, 6] (out-of-core systems) can largely eliminate all the challenges of distributed frameworks. The key principle of such systems is to keep only a small portion of active graph data in memory and spill the remainder to disks. The third approach is the in-memory graph processing. Compared to disk-based systems, accessing data in main memory is orders of magnitude faster. The potential of inmemory data processing has been exemplified in a number of successful projects, including RAMCloud [4], Pregel [6], GraphLab [4], Oracle TimesTen [8], and SAP HANA [] Graph processing applications are memory bound with poor locality. The memory bandwidth bottleneck is incurred by a large number of random memory accesses during neighborhood traversal. The poor locality is caused by the fact that the neighbor vertices are not stored contiguously. In addition, the fact that graph operations only use a small portion of a cache block leads to memory bandwidth waste. Therefore, in conventional architecture, the graph processing incurs significant amount of data movements and energy consumption. To overcome these challenges, several hardware accelerators [4,, 4] are proposed to execute graph processing more efficiently with specialized architecture. In the following, we

2 retrospect the graph computation and the current solutions to motivate our approach. A graph can be naturally represented as an adjacency matrix and all graph algorithms can be eventually boiled down to some form of matrix-vector multiplications. On the other side, the current graph processing systems do not use the familiar matrix-based operations in high-performance computing (HPC) [4]. The reason is that the graph is sparse, and the matrix-based computation on the corresponding sparse matrix would be deemed to be inefficient, there are many multiplications with zero. For the same reason, the graph data are not stored in matrix form either, instead certain compressed sparse matrix representations are used (see more details in Section ). In this context, current graph processing systems perform two high-level actions: ) bringing data for computation from memory based on certain compressed representation; ) performing the computations on the loaded data. Due to the sparsity, the data accesses in ) are random and irregular. In essence, ) performs simple computations that are part of the matrix-vector multiplications but only on non-zero operands. As a result, each computing core experiences alternative long random memory access latency and short computations. This leads to the well-known challenges in graph processing and other issues such as memory bandwidth waste []. The current graph processing accelerators mainly optimize the memory accesses. Specifically, Graphicionado [] reduces memory access latency and improves throughput by replacing random accesses to conventional memory hierarchy with sequential accesses to scratchpad memory optimization and pipelining. Ozdal et al. [4] improves the performance and energy efficiency by latency tolerance and hardware supports for dependence tracking and consistency. TESSERACT [4] applies the principle of near-data processing by placing compute logics (e.g. in-order cores) close to memory to claim the high internal bandwidth of Hybrid Memory Cube (HMC) [44]. However, all architectures do little change on compute unit, the simple computations are performed one at a time. To perform matrix-vector multiplications, two approaches exist that reflect two ends of the spectrum: ) the densematrix-based methods incur regular memory accesses and perform computations with every elements in matrix/vector; ) the sparse-matrix-based methods used in graph processing incur random memory accesses but only perform computations on non-zero operands. In this paper, we adopt an approach that can be considered as the mid-point between these two ends that could potentially achieve better performance and energy efficiency. Specifically, we propose to perform sparse matrix-vector multiplications on data blocks of compressed representation. This idea has obvious pros and cons. The benefit is two-fold. First, the computation and data movement ratio is increased. It means that the cost of bringing data and format conversion could be naturally hidden by the larger amount of computations on a block of data. Second, inside this data block (i.e., sub-matrix/sub-graph), computations could be performed in parallel in SIMD manner. In contrast, the discrepancy of the fast computation and the long random access latency in the current systems based on sparse matrix makes it challenging to optimize performance. The downside is that certain hardware and energy will be wasted in performing useless multiplications with zero. This approach could in principle be applied to the current GPUs or accelerators, but with the same amount of compute resources (e.g. SM in GPUs), it is unclear whether the gain would outweigh the inefficiency caused by the sparsity. Clearly, a key factor is the cost of compute logic, with a low-cost mechanism to implement matrix-vector multiplications, the proposed approach is likely to be beneficial. In this paper, we demonstrate that the non-volatile memory, metal-oxide resistive random access memory (ReRAM) [57] could serve as the essential hardware building block to perform matrix-vector multiplications in graph processing. Recent works [, 49, 5] demonstrate the promising applications of efficiently in-situ matrix-vector multiplication of ReRAM on neural network computing. The analog computation is suitable for graph processing because: ) The iterative algorithms could tolerate the imprecise values by nature; ) Both probability calculation, such as PageRank and Collaborative Filtering, and the algorithms involving integers (e.g., BFS/SSSP) are resilient to errors. Due to the low-cost and energy efficiency, the dense-matrix-based computation in ReRAM would not incur significant hardware waste due to sparsity. In fact, such waste could only occur inside the ReRAM crossbar with moderate size (e.g., 8 8). As a result, the architecture will mostly enjoy the benefits of more parallelism in computation and higher ratio between computation and data movements, which reduces the pressure on memory system. Moreover, performing computation in ReRAM naturally enables near data processing with reduced data movements: the data do not need to go through the memory hierarchy like in the conventional architecture or some accelerators. Applying ReRAM in graph processing poses a few challenges: ) Data representation. In graph processing, the matrix-vector multiplications cannot be directly performed on the graph data with compressed sparse matrix representations; ) Algorithm mapping. Unlike the regular matrix computations in neural network that could almost directly match the computing capability of ReRAM, it is less intuitive to map various graph algorithms to ReRAM with hardware constrains; ) Data Movements. While ReRAM is close to memory, to ensure high throughput and efficient execution, the architecture needs to carefully orchestrate data movements and support pipelining of different operations. This paper proposes, GRAPHR, a novel ReRAM-based accelerator for graph processing. It consists of two key components: memory ReRAM and graph engine (GE), which are both based on ReRAM but with different functionality. The memory ReRAM stores the graph data in compressed sparse representation. GEs (ReRAM crossbars) perform the efficient matrix-vector multiplications on the sparse matrix representation. We propose a set of techniques to map various graph algorithms to ReRAM crossbar. The peripheral components handle the auxiliary operations depending on the algorithms and control logics to support data movements. We propose a novel streaming-apply model particularly suitable for ReRAM to coordinate data block (chunks of vertices/edges) movements. In the evaluation, we compare GRAPHR with a software

3 framework [6] on a high-end CPU-based platform and also GPU implementations. The experiment results show that GRAPHR achieves a 6.x (up to.67x) speedup and an.8x energy saving on geometric mean compared to the CPU baseline. The performance of GRAPHR is in the same order of GPUs without considering the data transfer time between CPU and GPU, an overhead GRAPHR does not incur. More importantly, GRAPHR consumes.6x to.68x less energy for computation only (not even counting CPU/GPU data transfer). This paper is organized as follows. Section introduces the background of graph processing, ReRAM and current hardware graph processing accelerators. Section describes the whole GRAPHR architecture. Section 4 presents the graph engine for various graph algorithms. Section 5 presents the evaluation methodology and experiment results. Section 6 concludes the paper.. BACKGROUND AND MOTIVATION. Graph Processing (a) Vertex and edge accesses in graph processing Vertices: Edges: (random access) (local sequential access) (global random access) (b) Access pattern in vertexcentric program Figure : Graph Processing in Vertex-Centric Program Graph algorithms traverse vertices and edges to discover interesting graph properties based on relationship. A graph could be naturally considered as a adjacency matrix, where the rows correspond to the vertices and the matrix elements represent the edges. The graph algorithms can be always mapped to matrix operations, e.g., matrix-vector multiplications. However, in reality, the graph is sparse, which means that there would be many zeros in the adjacency matrix. This property incurs the waste of both storage and compute resources. Therefore, the current graph processing systems use the format that is suitable for sparse graph data. Based on such data structures, the graph processing can be essentially considered as implementing the matrix operations on the sparse matrix representation. In this case, the common techniques (e.g. vector processing) in HPC to optimize dense matrix operations do not apply. Instead, the individual (and simple) operations in the whole matrix computation are performed by the compute units (e.g., a core in CPU or an SM in GPU) after data is fetched. In the following, we elaborate the challenge of random accesses in various graph processing approaches. More details on sparse graph data representation will be discussed in Section.4. To provide an easy programming interface, the vertexcentric program featuring Think Like a Vertex (TLAV) [6] was proposed as a natural and intuitive way for human brain to think of the graph problems. Figure (a) shows an example, an algorithm could first access and process the red vertex in the center with all its neighbors through the red edges. Then it can move to one of the neighbors, the blue vertex on the top, accessing the vertex and the neighbors through the blue edges. After that, the algorithm can access another vertex (the green one on the left), which is one of the neighbors of the blue vertex. In graph processing, the vertex accesses lead to the random accesses because the neighbor vertices are not stored continuously. For edges, they incur local sequential access because the edges related to a vertex are stored continuously but global random accesses because the algorithm needs to access the edges of different vertices. The concepts are shown in Figure (b). The random accesses lead to long memory latency and, more importantly, the bandwidth waste, because only a small portion of data are accessed in a memory block. In a conventional hierarchical memory system, this leads to the significant data movements. Clearly, reducing random accesses is critical to improve the performance of graph processing, this is particularly crucial for the disk-based single machine graph processing systems (a.k.a out-of-core systems [7, 47, 5, 6]), because the random disk I/O operations are much more detrimental to the performance. In this context, the memory is considered small and fast (therefore can afford random accesses) while disk is considered large and slow (therefore should be only accessed sequentially). The edge-centric model in X-Stream [47] is a notable solution for reducing random accesses. Specifically, the edges of a graph partition are stored and accessed sequentially in disk and the related vertices are stored and accessed randomly in memory. Such setting is feasible because typically the vertex data are much smaller than edge data. During process, X-Stream generates Updates in scatter phase, which incurs sequential writes, and then, applies these Updates to vertices in gather phase, which incurs sequential reads. The concepts are shown in Figure (a). A notable drawback of X-stream is that the amount of updates may be as large as that of edges, GridGraph [6] proposed optimizations to further reduce the storage cost and data movements due to updates. The solution is based on the dual sliding windows (shown in Figure (b)), which partitions edges into blocks and vertices into chunks. On visiting the edge blocks, the source vertex chunk (orange) is accessed and updates are directly applied to the destination vertex chunk (blue). This requires no temporary update storage as in X-Stream. Edge blocks can be accessed in source oriented order (shown in Figure (b)) or destination oriented order. Note that the dual sliding window mechanism is based on edge-centric model.. ReRAM Basics The resistive random access memory (ReRAM) [57] is an emerging non-volatile memory with appealing properties of high density, fast read access and low leakage power. ReRAM has been considered as a promising candidate for future memory architecture. A primary application of ReRAM is to be used as an alternate for main memory [9,, 6]. Figure (a) demonstrates the metal-insulator-metal (MIM) structure of an ReRAM cell. It has a top electrode, a bottom electrode and a metal-oxide layer sandwiched between electrodes. By applying an external voltage across it, an ReRAM cell can be switched between a high resistance state (HRS or OFF-state)

4 . Edge-centric scatter. Edge-centric gather (sequential read) Edges: (sequential read) Updates: Vertices(source): Updates: (random access) (sequential write) Vertices (destination): (a) Edge-centric processing (random access) Figure : (a) Edge-Centric Processing and (b) Dual Sliding Windows (b) Dual sliding windows and a low resistance state (LRS or On-state), which are used to represent the logical "" and "", respectively, as shown in Figure (b). Top Electrode Metal Oxide Current HRS( ) RESET LRS( ) Bottom Electrode b j = f ( a i i w ij ) Voltage i (a) (b) (c) SET Figure : Basics of ReRAM The ReRAM features the capability to perform in-situ matrix-vector multiplication [4, 5] as shown in Figure (c), which utilizes the property of bitline current summation in ReRAM crossbars to enable computing with high performance and low energy cost. While conventional CMOS based system showed success on neural network acceleration [5,, 5], recent works [9,, 49, 5] demonstrated that ReRAM-based architectures offer significant performance and energy benefits for the computation and memory intensive neural network computing.. Graph Processing Accelerators Due to the wide applications of graph processing and its challenges, several hardware accelerators were recently proposed. Ahn et al. [4] proposes TESSERACT, the first PIMbased graph processing architecture. It defines a generic communication interface to map graph processing to HMC. At any time, each core can Put a remote memory access and get interrupted to receive and execute the Put calls from other cores. Mustafa et al. [4] introduces an accelerator for asynchronous graph processing, which features efficient hardware scheduling and dependence tracking. To use the system, programmers have to understand its architecture and modify existing code. Graphicionado [] is a customized graph accelerator designed for high performance and energy efficiency based on off-chip DRAM and on-chip edram. It modifies the graph data structure and data path to optimize graph access patterns and designs specialized memory subsystem for higher bandwidth. These accelerators all optimize the memory accesses, reducing the latency or better tolerating the random access latency..4 Graph Representation As discussed in Section, supporting the limited dense matrix-vector multiplications on data blocks could increase the ratio between computation and data movement and reduce the pressure on memory system. With its matrix-vector multiplication capability, ReRAM could naturally perform the low-cost parallel dense operations on the sparse sub-matrices (sub-graphs), enjoying the benefits without increasing hardware and energy cost. The key insight of GRAPHR is to still store the majority of the graph data in the compressed sparse matrix representation and process the subgraphs in uncompressed sparse matrix representation. In the following, we review several commonly used sparse representations. (row,val) colptr (col,val) rowptr (row,col,val) 8 (,) (,) (,,) 7 (,4) (,8) (,,8) (,) (,,7) (,7) (,7) 4 (,) 4 (,,) 4 (,8) 4 6 (,4) 4 6 (,,4) (,) 5 (,) 5 (,,) (a) (b) CSC (c) CSR (d) COO Figure 4: (a) Sparse Matrix and Its Compressed Representations in: (b) Compressed Sparse Column (CSC), (c) Compressed Sparse Row (CSR), (d) Coordinate List (COO) The three major compressed sparse representations are compressed sparse column (CSC), compressed sparse row (CSR) and coordinate list (COO). They are illustrated in Figure 4. In the CSC representation, non-zeros are stored in column major order as (row index, value) pairs in a list, the number of entries in the list is the number of non-zeros. Another list of column starting pointers indicate the starting index of a row in the (row,val) list. For example, in Figure 4 (a), 4 in the colptr indicates that the 4-th entry in (row,val) list, i.e., (,8) is the starting of column. The number of entries in colptr is the number of columns +. Compressed sparse row (CSR) is similar to CSC, with row and column alternated. For coordinate list (COO), each entry is a tuple of (row index, column index, value) of the nonzeros B- B- B- B- (a) (b) (c) Figure 5: (a) A Directed Graph and Its Representations in (b) Adjacency Matrix and (c) Coordinate List 4

5 In GRAPHR, we assume coordinate list (COO) representation to store a graph. Given a graph in Figure 5 (a), its (sparse) adjacency matrix representation and COO representation (partitioned into four 4 4 subgraphs) are shown in Figure 5(b) and (c), respectively. In this example, the coordinate list saves 6% storing space, compared with the adjacency matrix representation. For real-world graphs with high sparsity, the saving is even high: the coordinate list can only take.% of space for WikiVote [] compared to an adjacency matrix.. GRAPHR ARCHITECTURE. Overall GraphR Architecture DRV S/H CTRL Memory ReRAM Memory ReRAM IO Interface GE GE GE GE GE S/H DRV DRV S/H Memory ReRAM Memory ReRAM S/H DRV RegI ADC S/A salu RegO CTRL S/H S/A DRV GE ADC salu RegI RegO Controller Sample & hold Shift & add unit Driver Graph Engine Analog to digital Simple ALU Input register Output register Figure 6: Architecture of GRAPHR: a ReRAM-Based Graph Processing Accelerator Figure 6 shows an overview of the proposed GRAPHR accelerator architecture. Essentially, GRAPHR is a ReRAMbased memory module that performs efficient near data parallel graph processing. It contains two key components: memory ReRAM and graph engine (GE), both are based on ReRAM but with different functionalities. The memory ReRAM stores the graph data in compressed sparse representation. GEs perform the efficient matrix-vector multiplications on the sparse matrix representation. In a GE, a number of ReRAM crossbars with Drivers (DRV) and Sample and Hold (S/H) components are placed in mesh, which are connected with Analog to Digital Converter (ADC), Shift and Add units (S/A) and simple algorithmic and logic units (salu). The input and output register (RegI/RegO) are used to cache data flow. The functionality and design details of components will be discussed in Section.5. The whole GRAPHR architecture contains a number of GEs that process a sub-graph in parallel on sparse matrix. The small-sized ReRAM crossbars ensure that the wasteful computations are not significant, as they only happen inside the crossbars. As shown in Section.4, each entry in a coordinate list should be a three-element tuple of {Source ID, Destination ID, Edge Weight}. To simplify the example, we use an unweighted graph, so a twoelement tuple of {Source ID, Destination ID} is sufficient to represent one edge. The I/O interface is attached to the ReRAM components so that the graph data and instructions could be loaded into ReRAM memory and controller, respectively. In GRAPHR, the controller is the software/hardware interface that could execute simple instructions to: ) coordinate the graph data movements between memory ReRAM and GEs based on the model in the next section; ) assign the graph edges in coordinate list representation in memory ReRAM to the ReRAM crossbars in GEs (using method in Section.); ) perform convergence check (see Section 4.).. Streaming-Apply Execution Model Src Memory ReRAM B- B- B- B- B- B- B- B- B- B- B- B- B- B- B- B- Dst GE GE GE GE Figure 7: Streaming-Apply Execution Model We propose a streaming-apply execution model so that the GRAPHR architecture can process a complete large graph. Figure 7 shows the model, which is inspired by the dual sliding windows shown in Figure (b). Initially, all data loaded into memory ReRAM through I/O interface. We use subgraph to refer to the graph data that are processed together in all GEs. In Figure 7, we have 6 subgraphs Bi-j, i, j {,,, }. Before processing, these subgraphs are stored in the original sparse representation in memory ReRAM and marked in orange. The vertex data are always stored in a vector and the edges are stored in the coordinate list. There are two vectors for the vertices: src holds the old vertex values for the vertex read and dst holds the new vertex values for the vertex write (update). After all sub-graphs are processed in one iteration, the dst is copied to src so that the next iteration could read the new vertex values. The actual processing of a subgraph happens in GEs, we use block to refer to the region in a sub-graph that is processed by a particular GE. The streaming-apply model concerns two aspects: ) intersubgraph processing, which specifies the order of subgraphs are processed; and ) intra-subgraph processing, which specifies the order of blocks are processed in GEs. Inter-subgraph processing. The subgraphs can be processed in either row-major or column-major with different trade-offs. For row-major order (e.g., B- B- B- B- B-...), a region of dst is updated after each subgraph is processed by GEs. This operation copies data from RegO to dst. It has to be done after each subgraph because the next subgraph will produce the vertex updates to another region and overwrite the original updates in RegO. On the other side, the source vertex data only need to load once to RegI for all subgraphs in the same row. For columnmajor order (e.g., B- B- B- B- B-...), a region of dst is updated after each column of sub- 5

6 graphs (e.g., {B-, B-, B-, B-}) is processed. This is because all subgraphs in the same column will update the same set of vertices (corresponding to the same region in dst). The updates generated in later subgraphs in the same column could simply overwrite the updates generated earlier. However, the vertex values need to be read from src to RegI for each subgraph. Overall, the column-major order incurs less ReRAM writes but more ReRAM reads than row-major order. Considering that the write cost is more than read cost for ReRAM, we use column-major order in GRAPHR. Intra-subgraph processing. In Figure 7, we mark the subgraph being processed by GEs in blue. To process each subgraph, the data in memory ReRAM are first converted from coordinate list to the sparse matrix representation by the controller and then loaded into GEs. For vertices, the Src Chunk (i.e., the region of vertex values in src related to the subgraph) is loaded into RegI. The updates (i.e., Dst Chunk) are stored in RegO. We see that the data movements here are only from memory ReRAM to GEs. Because the edges are not updated, there is no data movements from ReRAM crossbars back to memory ReRAM. With multiple GEs available, different blocks within a subgraph could be assigned to different GEs. Here, the key requirement is that the vertex updates in Dst Chunk should not conflict. Therefore, we can only assign each GE to process a column of blocks. When different GEs are processing different columns, they read from the shared Src Chunk and write to different regions of Dst Chunk without conflict. This intra-subgraph execution model is called Row Parallelized and Column Oriented (RPCO) processing as shown in Figure 7. Assume we have G GEs, in RPCO, they could process G columns of blocks simultaneously. Note that the blocks in the same column can be processed in any order but only sequentially, because otherwise different blocks could generate conflicting updates. In another word, we cannot allow any more parallelism. We assume a simple top to bottom order in each column.. Graph Preprocessing block B- B- GE GE CB CB CB CB Busy GE Idle GE subgraph (a) CB CB CB CB CB CB CB CB B- (c) B- B Sequential 4 Access (b) Figure 8: GraphR Processing Order Based on the streaming-apply execution model, the sparse graph data in coordinate list representation need to be preprocessed to facilitate the loading of edge data into GEs. While the preprocessing takes time, it is only done once for different algorithms, because the execution order is always the same, the difference is how to map the algorithms to GEs, which will be discussed in Section 4. The preprocessing is both determined by the execution model and a few architectural parameters of GRAPHR. We assume the users of GRAPHR is informed with these parameters so that they could perform proper preprocessing. This is nothing different from the traditional architecture-dependent optimizations that are widely used (e.g., tiling optimization based on cache size [4]). Let us assume that we have G GEs, each GE contains C ReRAM Crossbars (CBs), the size of each CB is S. According to Figure 7, all GEs together can process a subgraph with G C S vertices. For a graph with V vertices, the whole graph should be partitioned into (V /(G C S)) subgraphs. Figure 8 shows the example with V = 6,G =,C =,S =, the graph with 6 vertices are first divided into 4 subgraphs (B-, B-, B- and B-). The subgraphs are processed in column-major order. Inside each subgraph, data in a set of rows are loaded and processed in sequence by all GEs. The number of rows processed together is determined by S, the size of CBs. Note that the CBs inside a GE should be logically organized into a vector, because they will be all devoted to processing the same rows. Also, all GEs are also organized into a vector manner to cover all columns in the subgraph. Based on this setting, the blocks inside the subgraph should be organized in a rowmajor manner. In the example in Figure 8 (a), the process of B- starts with the two GEs (GE and GE ) processing the first few rows. This involves reading the coordinate list in the same order, which is marked with numbers in Figure 8 (b). We call the portion inside a block processed by a CB as sub-block. It is important to ensure that the the coordinate list entries for a sub-block are presented consecutively, so that when the mapping of this sub-block from coordinate list to sparse matrix representation in CB will only involve sequential accesses. This is the reason why the preprocessing is related to the architectural parameter S: we need to make sure all the coordinate list entries for sub-block are ordered before the entries for sub-block, which means at most S entries for each row. If a sub-block has any edge, at least one cell in the corresponding CB is non-zero, this CB is busy, otherwise, it is idle. Due to the sparsity of the graph, the sub-blocks without edges spread all over the graph. In Figure 8 (a), the row being processed by the GEs in B- illustrates such situation, where the second CB in GE is idle because there is no edge in this sub-block. Obviously, this causes resource under-utilization, but it is unavoidable using sparse matrix format. The gain of paying this inefficiency cost is to be able to process many blocks in parallel with low cost. As mentioned in Section, ReRAM offers a good justification to explore this trade-off. Note that the empty sub-block does not pose extra complexity, because in the preprocessed coordinate list, the entries for this sub-block are just skipped. When each coordinate list entry is sequentially loaded, the pair of indexes indicate the 6

7 CB it should fall into, CBs are also loaded sequentially so the skipped CB(s) could be also inferred. It is also possible that the whole GE is idle, as shown in the row being processed in B-. Occasionally, if all blocks in a row are empty, the row could be skipped and save the cycle, this happens in the second row in B-. Based on the discussion, the requirement of preprocessing is already clear. The sub-block order in the output of preprocessed coordinate list is indicated in Figure 8 (b). The sub-block without a number is skipped. Figure 8 (c) further shows a segment of the preprocessed coordinate list of B-, we can see that when converting the data in the compressed representation to sparse matrix in CB/GE, the coordinate list entries are read sequentially..4 ReRAM Crossbar Size Trade-offs (a) 8-8 Mapping (b) 4-4 Mapping (c) - Mapping Figure 9: Processing the Adjacency Matrix with (a) 8-by- 8, (b) 4-by-4 and (c) -by- ReRAM Crossbars In general, the size of ReRAM crossbars captures the tradeoff between parallelism/speedup/cost and utilization. Figure 9 (a) shows an 8 8 adjacency matrix for Figure 5 (a), where the shadowed squares represent an edge between two vertices and a blank square means there is no edge between the two vertices. In this case, although the whole graph could be processed in parallel, 6.9% of cells are wasted, and the utilization is 9.% If we consider reducing the ReRAM crossbar size to 4 4 and, as shown in Figure 9 (b) and (c), then the utilization is improved. Specifically, the 4 4 subgraph in the upper-right corner of Figure 9 (b) does not need to be mapped to ReRAM crossbar and processed, the utilization becomes 5.%. Similarly, the utilization in Figure 9 (c) is further increased to 78.%. Let ReRAM crossbar size as S, number of edges in the graph as N E, number of ReRAM crossbars in a partition as N B. We could define a metric cell utilization rate as µ = N E /(S S N B )..5 GraphR Modules This section discussed the structure of the each component in GRAPHR. Memory ReRAM It does not perform actual graph computation but is used to store edges and vertices. Typically, the number of vertices is smaller than the number of edges. Edges are stored as coordinate list while vertices are stored as a vector. Driver (DRV) It is used to ) load new edge data to ReRAM crossbars for processing; and ) input data into ReRAM crossbars for matrix-vector multiplication. We use a weighted spike coding scheme. For a N-bit input value, we need N time slots for potential spikes. Inside the driver, N levels of reference voltage, ranging from V / N to V / are generated. When writing or programing edge data to ReRAM crossbars of a GE, the driver supplies a programing voltage. Sample and Hold (S/H) It is used to sample analog values and hold them before converting to a digital form. Analog to Digital Converter (ADC) It converts analog values to digital format. Because ADCs have relatively higher area and power consumption, ADCs are not connected to every bitlines of ReRAM crossbars in a GE but shared between those bitlines. If the GE cycle is 64ns, we can have one ADC working at.gsps to convert all data from eight 8-bitline crossbars within one GE. salu It is a simple customized algorithmic and logic unit. salu performs operations that cannot be efficiently performed by ReRAM crossbars, such as comparison. Data Format It is not practical for ReRAM cells to support a high resolution. Recent work [5] reported 5-bit resolution on ReRAM programing. To alleviate the pressure of diver, we conservatively assume the 4-bit ReRAM cell. To support higher computing resolution, e.g., 6 bit, the Shift and Add (S/A) unit is employed. A 6-bit fixed point number M can be represented as M = [M,M,M,M ], where each segment M i is a 4-bit number. We can shift and add results from four 4-bit-resolution ReRAM crossbars,i.e. D + D 8 + D 4 + D to get a 6-bit result. When salu and S/A are bypassed, a graph engine could be simply considered as a memory ReRAM mat. A similar scheme of reusing ReRAM crossbars for computing and storing is employed in []. 4. PROCESSING IN GRAPH ENGINES With the overall GRAPHR architecture and the streamingapply execution model, the graph data can flow into the GEs to be processed. This section discusses the mapping of different algorithms to GEs. At high level, the algorithm mapping requires both edge-related and vertex-related operations. The edge-related operations are performed by ReRAM crossbars, and the essential problem is how to map different graph algorithms to the matrix-vector operations. We proposes two processing patterns, parallel and sequential processing, with the illustrations with concrete algorithms. The key issue in vertex-related operations is to perform the operations that could not be performed by ReRAM crossbars in the salu. Different operations should be configured in this component depending on the algorithms. 4. Edge-Related Operations 4.. Parallel Processing The graph algorithms with Parallel processing pattern can be naturally mapped to ReRAM with the best parallelism. PageRank is a representative example for this pattern. PageRank [4] is an algorithm to calculate the website popularity (the probability of staying on a website). It does the following iterative computation: PR t+ = rm PR t +( r) e. PR t is the PageRank at iteration t, M is a probability transfer matrix, r is the probability of random surfing and e is a vector of probabilities of staying in each page. We consider a small sub-graph (i.e., sub-block in Section ) that could be processed by a 5 4 CB (the additional row is to perform addition). It contains at most 6 edges represented by the intersection of 4 rows and 4 columns in the sparse matrix 7

8 Src / ) ) / ) ) / ) ) ) ) ) ) ) ) ) ) ) ) / ) ) ) ) ) ) ) ) ) ) ) ) ) ) ) ) ) ) ) ) ) ) ) ) ) ) ) Dst Src Dst - Dst 4 (c) i(4 i: i: i: 5 j:7 j:6 j:m j:m j:7 j:5 j: j:4 (c) i i i i 4 Src (RegI) ) ) ) M 5 9 M min min min min 7 6 M M M () t= Dst - ) ) ) M M 6 4 min min min min M () t= ) ) ) M M M M min min min min () t= ) ) ) M M M min min min min (RegO) Dst 4 () t=4 i i i i Src Dst (a) Src Dst i: /4 / / / i: /4 / i: /4 / i: /4 j j j j (b) / / Src / / / / Dst (b) /4 /4 /4 /4 9/6 /6 5/6 /6 (b) Figure : Graphs to Illustrate (a) PageRank and (b) SSSP (shown in Figure (a)). Thus, the block is related to 8 vertices (i.e., i i, j j). The following are the parameters for the 4 4 block PageRank shown in Figure (b): M = [,/,,;/,,,/;/,,,/;/,/,,], e = [/4,/4,/4,/4] T, r = 4/5. We define M = rm and e = ( r) e, so that PR t+ = M PRt + e. In CB, as shown in Figure (b, b), the values are already scaled with r. Figure (b) shows the parallel processing pattern of PageRank algorithm in a CB (ReRAM Crossbar). M T is mapped to the CB (blue squares). e T is mapped to the last row (green squares). For the 4 4 block, a 5 4 CB is needed (in the view of Figure 8, S = 4). The initial PageRank value (at time slot t = ) PR = [/4,/4,/4,/4] is connected to the input to the CB at every wordline in parallel. The additional row is used to implement the addition of e. We input a fixed value to the last wordline. With this mapping, the current summation on bitline is the PageRank value at time slot t =. For example, in the first bitline: 9/6 = /4 + /4 /5 + /4 + /. Such output is stored first in RegO in GE and eventually updates the value of the destination vertex. The updated values will be used in the next iteration and flow in the ReRAM crossbar at the wordlines. In summary, the algorithms with this pattern can fully leverage the parallelism of computations in ReRAM s because the computation in the next iteration only depends on the local property (neighbors) of the previous iteration. Therefore, all vertices can be processed in parallel and only one logical cycle (time slot) is needed. 4.. Serial Processing Different from PageRank, another type of iterative algorithms such as Single Source Shortest Path (SSSP) [6] and Breadth First Search (BFS) [6] requires the Serial processing pattern because the computation in the next iteration depends on the intermediate results during the processing of sub-graph in CB. Thus, each vertex can only be processed sequentially. Moreover, this type of algorithms need to maintain whether a vertex is active. This could be implemented by adding a boolean value for each vertex. We explain the serial processing pattern and the related technical issues using SSSP. In SSSP, each vertex v is given a distance label dist(v) that maintains the length of the shortest known path to vertex v from the source. The distance label is initialized to at the source and at all other nodes. Then the SSSP algorithm iteratively applies the relaxation operator, which is defined as follows: if (u,v) is an edge and dist(u)+w(u,v) < dist(v), the value of dist(v) is updated to dist(u) + w(u,v). An active vertex is relaxed by applying this operator to all the edges connected to it. Each relaxation may lower the distance label of some vertex, and when no further lowering can be performed anywhere in the graph, the resulting vertex distance labels indicate the shortest distances from the source to the each vertex, regardless of the order in which the relaxations were performed. Breadth-first numbering of a graph is a special case of SSSP where all edge labels are. Next, we explain the mapping of SSSP algorithm to a CB 8

9 using a small 8-vertex sub-graph corresponding to a 4 4 block in sparse matrix, as shown in Figure (c). It can be represented by adjacency matrix: W = [M,,5,M;M,M,,; M,M,M,M;M,M,,M] where M indicates no edge connected two vertices and M is set to a reserved maximum value for a memory cell in a CB. The values are stored in the CB shown in Figure (c). Given a vertex u and dist(u), the row in the adjacency matrix W for u indicates w(u,v). Therefore, we could perform the relaxation operator for u in parallel. However, we cannot perform the relaxation for all vertices in parallel, because of the data dependence on intermediate results. Specifically, it is possible that the relaxation operator applied to u can update dist(v), which is read by the relaxation operator of v. For this reason, each vertex should be processed in Serial manner. The relaxation operator of u involves reading: ) dist(u): it is computed iteratively from the source, for the source vertex, the initial value is zero; ) The vector of the dist(v) before the relaxation operator: it is a vector indicating the the distance between source and all other vertices and is also computed iteratively from the source. In our example, for the source vertices (i,i,i,i), the initial value is [4,,,]; ) The vector of the w(u,v): it is the distance from u to the destination vertices in the subgraph, and can be obtained by reading a row in adjacency matrix W. Figure (c) shows the process to perform SSSP in a 5 4 CB. The last row (green squares) is set to a fixed value, which is used to add dist(u) (the input to the last wordline) to each w(u, v) in the relaxation operator. The initial value for dist(u) for the destination vertices ( j, j, j, j) are [7,6,M,M]. The vector of w(u,v) is obtained by activating the wordline associated to vertex v. In the time slot (t = ), wordline (for source vertex i) is activated (the red square with input value ) and a value 4 (this is the current value in dist(v) for source i) is input to the last wordline (the green box line). The vector of the dist(v) for the destination vertices (( j, j, j, j)) is set as the value of the output at each bitline, which is [7,6,M,M]. With this mapping, the current summation in bitline in Figure (c-) is [ M + 4, + 4, 5 + 4, M + 4 ] = [M,5,9,M]. It is the dist(u) + w(u,v) computed in parallel, where u is the source vertex. Then the distance of source to each vertex v is compared with the initial dist(v) ([7,6,M,M]) by an array of comparators, and in the final output of bitline, we get [7,5,9,M], which is the updated dist(v) vector after time slot t =. The parallel comparisons are performed by vertex-related operations in Section 4.. Different algorithms may require different functions on vertices. After time slot t =, we move to the next vertex in Figure (c-), where we i) activate the second wordline; ii) change the input to the last wordline to (the distance label for source vertex i); and iii) set the intermediate dist(v) to be the final output of bitline in time slot t =, which is [7,5,9,M]. Similar as Figure (c-), the the current summation in bitline is [M,M,6,4] and it is compared with [7,5,9,M], yielding the final output of bitline for time slot t = as [7,5,6,4]. We indicate the updated distance label using orange squares. The operations in time slot t = and t = 4 can be performed in the similar manner. Initially, before processing the block, the active indicator for each destination vertex is set to be FALSE. After the serial processing in CB, the active indicators for all vertices which have been updated (marked in orange in Figure (c)) are set to be TRUE. This indicates that they are active for next iteration. In our example, j, j, j are marked active. Referring to Figure, this means that the distance labels for these vertices have been updated. Because we are processing the block in CB, the active indicator for destination vertex may be accessed for multiple times, but if it is set be TRUE at least one time, this vertex is active in next iteration. Globally, after all active source vertices and corresponding edges are processed in an iteration, source vertex properties (values and active indicators) that hold the old values are updated by the properties of the same vertices in destination. The new source vertex properties are used in the next iteration. 4. Vertex-related Operations reg(old) reg(new) salu (a) reg old) reg(new) salu 6 4 Figure : Vertex-related Processing: salu Is Configured to Perform (a) add in PageRank and (b) min in SSSP Vertex-related processing handles computations that cannot be performed by ReRAM CB. Specifically, it performs the computation on the destination vertices, which are elementwise operations. This specific operation can be set by configuring salu, different algorithms need different functions. In PageRank, salu is configured to perform add operation to add PageRank values. In Figure (a), the current PageRank values for the same destination vertices ([,7,,], which are stored in RegO)) need to be added with [,,4,5], the new values from the GE/CB. To perform this computation, salu updates values stored in register to [+,7+,+4,+5] = [4,9,6,8]. In SSSP, the old distance vector for the destination vertices is [7,5,6,4], the new distance vector generated during processing is [,,9,4], the salu is configured to perform min operation, which will update the distance vector for the destination vertices as [,,6,4] as shown in Figure (b). 4. Convergence The convergence of the computation is determined by the algorithm. After each iteration, vertices in the memory ReRAM of GRAPHR are checked to determine the convergence. For example, in PageRank, we can accumulate the PageRank values of all vertices and compare it with that of previous iteration, the algorithm is converged if the difference is less than a threshold. For Breadth First Search (BFS)/SSSP, we can check if there are active vertices to determine the convergence. 5. EVALUATION In this section, we evaluate GRAPHR architecture on performance and energy saving. (b) 9

10 Dataset # Vertices #Edges Description WikiVote(WV) [] 7.K K Wikipedia who-votes-on-whom network Slashdot(SD) [] 8K 948K Slashdot social network from February 9 Amazon(AZ) [] 6K.M Amazon product co-purchasing network from March WebGoogle(WG) [].88M 5.M Web graph from Google LiveJournal(LJ) [] 4.8M 69M LiveJournal online social network Orkut(OK) [46].M 6M Orkut social network Netflix(NF) [7] 48K users, 7.8K movies 99M Netflix movie ratings Table : Graph Datasets 5. Graph Datasets and Applications Table shows the datasets used in our evaluation. We use seven real-world graphs. For WikiVote(WV), Slashdot(SD), Amazon(AZ), WebGoogle(WG), LiveJournal(LJ), Orkut(OK) and Netflix(NF). We run pagerank(pr), breadth first search(bfs), single source shortest path (SSSP) and sparse matrix-vector multiplication (SpMV) on the first six datasets. On Netflix(NF), we run collaborative filtering(cf), and the feature length used is. 5. Experiment Setup In our experiments, we compare GRAPHR with a CPU baseline platform and a GPU platform. PR, BFS, SSSP and SpMV running on the CPU platform are based on the software framework GridGraph [6], while collaborative filtering is based on GraphChi [7]. PR, BFS, SSSP and SpMV running on GPU platform are based on Gunrock [56], while CuMF_SGD [58] is the GPU framework for CF. It is important to note that although we use the a disk-based out-of-core system, we are not comparing disk-based system with in-memory ReRAM-based system. It is because the CPU/GPU-based machine we used has 8GB memory, large enough to fit the data of all graphs in memory. Essentially, they are also doing in-memory graph processing. We also exclude the disk I/O time from the execution time of the CPU/GPU-based platform. CPU Intel Xeon E5-6 V, 8 cores,.4 GHz 8 ( + )KB L Cache 8 56KB L Cache MB L Cache Memory 8 GB Storage TB CPUs, a total number of threads. Table : Specifications of the CPU Platform Specifications of the CPU and GPU platforms are shown in Table and Table. The CPU energy consumption is estimated by Intel Product Specifications [] while NVIDIA System Management Interface (nvidia-smi) is used to estimate energy consumption by GPU. The execution times for CPU/GPU platform are measured in the computing frameworks. To evaluate GRAPHR, for the ReRAM part, we use NVSim [8] to estimate time and energy consumption. The HRS/LRS Graphic Card NVIDIA Tesla K4c Architecture Kepler # CUDA Cores 88 Base Clock 745 MHz Graphic Memory GB GDDR5 Memory Bandwidth 88 GB/s CUDA Version 7.5 Table : Specifications of the GPU Platform resistance are 5M/5K Ω, read voltage (Vr) and write voltage (Vw) are.7v and V respectively, and current of LRS/HRS are 4 ua and ua respectively. The read/write latency and read/write energy cost used are 9.ns / 5.88ns,.8 pj /.9nJ respectively from data reported in [4]. The programming of a bipolar ReRAM cell is to change (from High to Low) or inverse. For multi-level cell, the programming is to change the resistance to a middle state between High and Low, and the middle state is determined by the programming voltage pulse length. Actually, the difference between High and Low is the worse case. Note that [6, 5] describe a ReRAM programming prototype, which includes: ) writing circuitry; ) ReRAM cell/array; and ) conversion circuitry. They demonstrated the possibility of % accuracy for multi-level cell. However, in a real production system, only writing circuitry" and ReRAM cell/array" are needed, there is no need to consider the conversion, as we just need to acquiesce" a writing precision. Therefore, this energy cost estimation for 4-bit cell programming is reasonable and more conservative than two recent ReRAM-based accelerators [, 49]. For on-chip registers, we use CACTI 6.5 [] at nm to model. For Analog/Digital converters, we use data from [8]. The system performance is modeled by code instrumentation. The ReRAM crossbar size S, number of ReRAM crossbars per graph engine C and number of graph engines is 8,, 64 respectively. 5. Performance Results Figure compares the performance of GRAPHR and CPU platform. The CPU implementation is used as the baseline and execution times of applications of GRAPHR are normalized to it. Compared to CPU platform, the geometric mean of speedups with GRAPHR architecture on all 5 executions is 6.x. Among all applications on the datasets, the highest speedup achieved by GRAPHR is.67x, and it happens on sparse matrix-vector multiplication on WikiVote dataset. The low-

GraphR: Accelerating Graph Processing Using ReRAM

GraphR: Accelerating Graph Processing Using ReRAM : Accelerating Graph Processing Using ReRAM Linghao Song *, Youwei Zhuo #, Xuehai Qian #, Hai Li * and Yiran Chen * * Duke University, # University of Southern California * {linghao.song, hai.li, yiran.chen}@duke.edu,

More information

Practical Near-Data Processing for In-Memory Analytics Frameworks

Practical Near-Data Processing for In-Memory Analytics Frameworks Practical Near-Data Processing for In-Memory Analytics Frameworks Mingyu Gao, Grant Ayers, Christos Kozyrakis Stanford University http://mast.stanford.edu PACT Oct 19, 2015 Motivating Trends End of Dennard

More information

GridGraph: Large-Scale Graph Processing on a Single Machine Using 2-Level Hierarchical Partitioning. Xiaowei ZHU Tsinghua University

GridGraph: Large-Scale Graph Processing on a Single Machine Using 2-Level Hierarchical Partitioning. Xiaowei ZHU Tsinghua University GridGraph: Large-Scale Graph Processing on a Single Machine Using -Level Hierarchical Partitioning Xiaowei ZHU Tsinghua University Widely-Used Graph Processing Shared memory Single-node & in-memory Ligra,

More information

Speeding Up Crossbar Resistive Memory by Exploiting In-memory Data Patterns

Speeding Up Crossbar Resistive Memory by Exploiting In-memory Data Patterns March 12, 2018 Speeding Up Crossbar Resistive Memory by Exploiting In-memory Data Patterns Wen Wen Lei Zhao, Youtao Zhang, Jun Yang Executive Summary Problems: performance and reliability of write operations

More information

Communication for PIM-based Graph Processing with Efficient Data Partition. Mingxing Zhang, Youwei Zhuo (equal contribution),

Communication for PIM-based Graph Processing with Efficient Data Partition. Mingxing Zhang, Youwei Zhuo (equal contribution), GraphP: Reducing Communication for PIM-based Graph Processing with Efficient Data Partition Mingxing Zhang, Youwei Zhuo (equal contribution), Chao Wang, Mingyu Gao, Yongwei Wu, Kang Chen, Christos Kozyrakis,

More information

Tanuj Kr Aasawat, Tahsin Reza, Matei Ripeanu Networked Systems Laboratory (NetSysLab) University of British Columbia

Tanuj Kr Aasawat, Tahsin Reza, Matei Ripeanu Networked Systems Laboratory (NetSysLab) University of British Columbia How well do CPU, GPU and Hybrid Graph Processing Frameworks Perform? Tanuj Kr Aasawat, Tahsin Reza, Matei Ripeanu Networked Systems Laboratory (NetSysLab) University of British Columbia Networked Systems

More information

GRAPHS have been widely used to represent relationships

GRAPHS have been widely used to represent relationships IEEE TRANSACTIONS ON COMPUTER, SPECIAL SECTION ON EMERGING NON-VOLATILE MEMORY TECHNOLOGIES HyVE: Hybrid Vertex-Edge Memory Hierarchy for Energy-Efficient Graph Processing Guohao Dai,Tianhao Huang,Yu Wang,

More information

Kartik Lakhotia, Rajgopal Kannan, Viktor Prasanna USENIX ATC 18

Kartik Lakhotia, Rajgopal Kannan, Viktor Prasanna USENIX ATC 18 Accelerating PageRank using Partition-Centric Processing Kartik Lakhotia, Rajgopal Kannan, Viktor Prasanna USENIX ATC 18 Outline Introduction Partition-centric Processing Methodology Analytical Evaluation

More information

Mohsen Imani. University of California San Diego. System Energy Efficiency Lab seelab.ucsd.edu

Mohsen Imani. University of California San Diego. System Energy Efficiency Lab seelab.ucsd.edu Mohsen Imani University of California San Diego Winter 2016 Technology Trend for IoT http://www.flashmemorysummit.com/english/collaterals/proceedi ngs/2014/20140807_304c_hill.pdf 2 Motivation IoT significantly

More information

PRIME: A Novel Processing-in-memory Architecture for Neural Network Computation in ReRAM-based Main Memory

PRIME: A Novel Processing-in-memory Architecture for Neural Network Computation in ReRAM-based Main Memory Scalable and Energy-Efficient Architecture Lab (SEAL) PRIME: A Novel Processing-in-memory Architecture for Neural Network Computation in -based Main Memory Ping Chi *, Shuangchen Li *, Tao Zhang, Cong

More information

Locality-Aware Software Throttling for Sparse Matrix Operation on GPUs

Locality-Aware Software Throttling for Sparse Matrix Operation on GPUs Locality-Aware Software Throttling for Sparse Matrix Operation on GPUs Yanhao Chen 1, Ari B. Hayes 1, Chi Zhang 2, Timothy Salmon 1, Eddy Z. Zhang 1 1. Rutgers University 2. University of Pittsburgh Sparse

More information

Maximizing Face Detection Performance

Maximizing Face Detection Performance Maximizing Face Detection Performance Paulius Micikevicius Developer Technology Engineer, NVIDIA GTC 2015 1 Outline Very brief review of cascaded-classifiers Parallelization choices Reducing the amount

More information

Mosaic: Processing a Trillion-Edge Graph on a Single Machine

Mosaic: Processing a Trillion-Edge Graph on a Single Machine Mosaic: Processing a Trillion-Edge Graph on a Single Machine Steffen Maass, Changwoo Min, Sanidhya Kashyap, Woonhak Kang, Mohan Kumar, Taesoo Kim Georgia Institute of Technology Best Student Paper @ EuroSys

More information

GraphSAR: A Sparsity-Aware Processing-in-Memory Architecture for Large-scale Graph Processing on ReRAMs

GraphSAR: A Sparsity-Aware Processing-in-Memory Architecture for Large-scale Graph Processing on ReRAMs GraphSAR: A Sparsity-Aware Processing-in-Memory Architecture for Large-scale Graph Processing on ReRAMs Guohao Dai*, Tianhao Huang**, Yu Wang*, Huazhong Yang*, John Wawrzynek*** *Dept. of EE, BNRist, Tsinghua

More information

Unleashing the Power of Embedded DRAM

Unleashing the Power of Embedded DRAM Copyright 2005 Design And Reuse S.A. All rights reserved. Unleashing the Power of Embedded DRAM by Peter Gillingham, MOSAID Technologies Incorporated Ottawa, Canada Abstract Embedded DRAM technology offers

More information

CuSha: Vertex-Centric Graph Processing on GPUs

CuSha: Vertex-Centric Graph Processing on GPUs CuSha: Vertex-Centric Graph Processing on GPUs Farzad Khorasani, Keval Vora, Rajiv Gupta, Laxmi N. Bhuyan HPDC Vancouver, Canada June, Motivation Graph processing Real world graphs are large & sparse Power

More information

Parallel Computing: Parallel Architectures Jin, Hai

Parallel Computing: Parallel Architectures Jin, Hai Parallel Computing: Parallel Architectures Jin, Hai School of Computer Science and Technology Huazhong University of Science and Technology Peripherals Computer Central Processing Unit Main Memory Computer

More information

A Software LDPC Decoder Implemented on a Many-Core Array of Programmable Processors

A Software LDPC Decoder Implemented on a Many-Core Array of Programmable Processors A Software LDPC Decoder Implemented on a Many-Core Array of Programmable Processors Brent Bohnenstiehl and Bevan Baas Department of Electrical and Computer Engineering University of California, Davis {bvbohnen,

More information

Coordinating More Than 3 Million CUDA Threads for Social Network Analysis. Adam McLaughlin

Coordinating More Than 3 Million CUDA Threads for Social Network Analysis. Adam McLaughlin Coordinating More Than 3 Million CUDA Threads for Social Network Analysis Adam McLaughlin Applications of interest Computational biology Social network analysis Urban planning Epidemiology Hardware verification

More information

GraFBoost: Using accelerated flash storage for external graph analytics

GraFBoost: Using accelerated flash storage for external graph analytics GraFBoost: Using accelerated flash storage for external graph analytics Sang-Woo Jun, Andy Wright, Sizhuo Zhang, Shuotao Xu and Arvind MIT CSAIL Funded by: 1 Large Graphs are Found Everywhere in Nature

More information

Performance impact of dynamic parallelism on different clustering algorithms

Performance impact of dynamic parallelism on different clustering algorithms Performance impact of dynamic parallelism on different clustering algorithms Jeffrey DiMarco and Michela Taufer Computer and Information Sciences, University of Delaware E-mail: jdimarco@udel.edu, taufer@udel.edu

More information

Challenges in large-scale graph processing on HPC platforms and the Graph500 benchmark. by Nkemdirim Dockery

Challenges in large-scale graph processing on HPC platforms and the Graph500 benchmark. by Nkemdirim Dockery Challenges in large-scale graph processing on HPC platforms and the Graph500 benchmark by Nkemdirim Dockery High Performance Computing Workloads Core-memory sized Floating point intensive Well-structured

More information

Parallel Programming Principle and Practice. Lecture 9 Introduction to GPGPUs and CUDA Programming Model

Parallel Programming Principle and Practice. Lecture 9 Introduction to GPGPUs and CUDA Programming Model Parallel Programming Principle and Practice Lecture 9 Introduction to GPGPUs and CUDA Programming Model Outline Introduction to GPGPUs and Cuda Programming Model The Cuda Thread Hierarchy / Memory Hierarchy

More information

Addressing the Memory Wall

Addressing the Memory Wall Lecture 26: Addressing the Memory Wall Parallel Computer Architecture and Programming CMU 15-418/15-618, Spring 2015 Tunes Cage the Elephant Back Against the Wall (Cage the Elephant) This song is for the

More information

Sort vs. Hash Join Revisited for Near-Memory Execution. Nooshin Mirzadeh, Onur Kocberber, Babak Falsafi, Boris Grot

Sort vs. Hash Join Revisited for Near-Memory Execution. Nooshin Mirzadeh, Onur Kocberber, Babak Falsafi, Boris Grot Sort vs. Hash Join Revisited for Near-Memory Execution Nooshin Mirzadeh, Onur Kocberber, Babak Falsafi, Boris Grot 1 Near-Memory Processing (NMP) Emerging technology Stacked memory: A logic die w/ a stack

More information

Performance potential for simulating spin models on GPU

Performance potential for simulating spin models on GPU Performance potential for simulating spin models on GPU Martin Weigel Institut für Physik, Johannes-Gutenberg-Universität Mainz, Germany 11th International NTZ-Workshop on New Developments in Computational

More information

Challenges and Advances in Parallel Sparse Matrix-Matrix Multiplication

Challenges and Advances in Parallel Sparse Matrix-Matrix Multiplication Challenges and Advances in Parallel Sparse Matrix-Matrix Multiplication Aydin Buluc John R. Gilbert University of California, Santa Barbara ICPP 2008 September 11, 2008 1 Support: DOE Office of Science,

More information

LaPerm: Locality Aware Scheduler for Dynamic Parallelism on GPUs

LaPerm: Locality Aware Scheduler for Dynamic Parallelism on GPUs LaPerm: Locality Aware Scheduler for Dynamic Parallelism on GPUs Jin Wang* Norm Rubin Albert Sidelnik Sudhakar Yalamanchili* *Georgia Institute of Technology NVIDIA Research Email: {jin.wang,sudha}@gatech.edu,

More information

TETRIS: Scalable and Efficient Neural Network Acceleration with 3D Memory

TETRIS: Scalable and Efficient Neural Network Acceleration with 3D Memory TETRIS: Scalable and Efficient Neural Network Acceleration with 3D Memory Mingyu Gao, Jing Pu, Xuan Yang, Mark Horowitz, Christos Kozyrakis Stanford University Platform Lab Review Feb 2017 Deep Neural

More information

Parallel Combinatorial BLAS and Applications in Graph Computations

Parallel Combinatorial BLAS and Applications in Graph Computations Parallel Combinatorial BLAS and Applications in Graph Computations Aydın Buluç John R. Gilbert University of California, Santa Barbara SIAM ANNUAL MEETING 2009 July 8, 2009 1 Primitives for Graph Computations

More information

Algorithm Engineering with PRAM Algorithms

Algorithm Engineering with PRAM Algorithms Algorithm Engineering with PRAM Algorithms Bernard M.E. Moret moret@cs.unm.edu Department of Computer Science University of New Mexico Albuquerque, NM 87131 Rome School on Alg. Eng. p.1/29 Measuring and

More information

Hybrid Implementation of 3D Kirchhoff Migration

Hybrid Implementation of 3D Kirchhoff Migration Hybrid Implementation of 3D Kirchhoff Migration Max Grossman, Mauricio Araya-Polo, Gladys Gonzalez GTC, San Jose March 19, 2013 Agenda 1. Motivation 2. The Problem at Hand 3. Solution Strategy 4. GPU Implementation

More information

GraphP: Reducing Communication for PIM-based Graph Processing with Efficient Data Partition

GraphP: Reducing Communication for PIM-based Graph Processing with Efficient Data Partition GraphP: Reducing Communication for PIM-based Graph Processing with Efficient Data Partition Mingxing Zhang Youwei Zhuo Chao Wang Mingyu Gao Yongwei Wu Kang Chen Christos Kozyrakis Xuehai Qian Tsinghua

More information

High-Performance Graph Primitives on the GPU: Design and Implementation of Gunrock

High-Performance Graph Primitives on the GPU: Design and Implementation of Gunrock High-Performance Graph Primitives on the GPU: Design and Implementation of Gunrock Yangzihao Wang University of California, Davis yzhwang@ucdavis.edu March 24, 2014 Yangzihao Wang (yzhwang@ucdavis.edu)

More information

PREGEL: A SYSTEM FOR LARGE-SCALE GRAPH PROCESSING

PREGEL: A SYSTEM FOR LARGE-SCALE GRAPH PROCESSING PREGEL: A SYSTEM FOR LARGE-SCALE GRAPH PROCESSING Grzegorz Malewicz, Matthew Austern, Aart Bik, James Dehnert, Ilan Horn, Naty Leiser, Grzegorz Czajkowski (Google, Inc.) SIGMOD 2010 Presented by : Xiu

More information

Using Intel Streaming SIMD Extensions for 3D Geometry Processing

Using Intel Streaming SIMD Extensions for 3D Geometry Processing Using Intel Streaming SIMD Extensions for 3D Geometry Processing Wan-Chun Ma, Chia-Lin Yang Dept. of Computer Science and Information Engineering National Taiwan University firebird@cmlab.csie.ntu.edu.tw,

More information

On the Comparative Performance of Parallel Algorithms on Small GPU/CUDA Clusters

On the Comparative Performance of Parallel Algorithms on Small GPU/CUDA Clusters 1 On the Comparative Performance of Parallel Algorithms on Small GPU/CUDA Clusters N. P. Karunadasa & D. N. Ranasinghe University of Colombo School of Computing, Sri Lanka nishantha@opensource.lk, dnr@ucsc.cmb.ac.lk

More information

Fast Iterative Graph Computation: A Path Centric Approach

Fast Iterative Graph Computation: A Path Centric Approach Fast Iterative Graph Computation: A Path Centric Approach Pingpeng Yuan*, Wenya Zhang, Changfeng Xie, Hai Jin* Services Computing Technology and System Lab. Cluster and Grid Computing Lab. School of Computer

More information

NUMA-aware Graph-structured Analytics

NUMA-aware Graph-structured Analytics NUMA-aware Graph-structured Analytics Kaiyuan Zhang, Rong Chen, Haibo Chen Institute of Parallel and Distributed Systems Shanghai Jiao Tong University, China Big Data Everywhere 00 Million Tweets/day 1.11

More information

Exploiting GPU Caches in Sparse Matrix Vector Multiplication. Yusuke Nagasaka Tokyo Institute of Technology

Exploiting GPU Caches in Sparse Matrix Vector Multiplication. Yusuke Nagasaka Tokyo Institute of Technology Exploiting GPU Caches in Sparse Matrix Vector Multiplication Yusuke Nagasaka Tokyo Institute of Technology Sparse Matrix Generated by FEM, being as the graph data Often require solving sparse linear equation

More information

HARNESSING IRREGULAR PARALLELISM: A CASE STUDY ON UNSTRUCTURED MESHES. Cliff Woolley, NVIDIA

HARNESSING IRREGULAR PARALLELISM: A CASE STUDY ON UNSTRUCTURED MESHES. Cliff Woolley, NVIDIA HARNESSING IRREGULAR PARALLELISM: A CASE STUDY ON UNSTRUCTURED MESHES Cliff Woolley, NVIDIA PREFACE This talk presents a case study of extracting parallelism in the UMT2013 benchmark for 3D unstructured-mesh

More information

A TALENTED CPU-TO-GPU MEMORY MAPPING TECHNIQUE

A TALENTED CPU-TO-GPU MEMORY MAPPING TECHNIQUE A TALENTED CPU-TO-GPU MEMORY MAPPING TECHNIQUE Abu Asaduzzaman, Deepthi Gummadi, and Chok M. Yip Department of Electrical Engineering and Computer Science Wichita State University Wichita, Kansas, USA

More information

Memory Systems IRAM. Principle of IRAM

Memory Systems IRAM. Principle of IRAM Memory Systems 165 other devices of the module will be in the Standby state (which is the primary state of all RDRAM devices) or another state with low-power consumption. The RDRAM devices provide several

More information

Speed up a Machine-Learning-based Image Super-Resolution Algorithm on GPGPU

Speed up a Machine-Learning-based Image Super-Resolution Algorithm on GPGPU Speed up a Machine-Learning-based Image Super-Resolution Algorithm on GPGPU Ke Ma 1, and Yao Song 2 1 Department of Computer Sciences 2 Department of Electrical and Computer Engineering University of Wisconsin-Madison

More information

GPGPUs in HPC. VILLE TIMONEN Åbo Akademi University CSC

GPGPUs in HPC. VILLE TIMONEN Åbo Akademi University CSC GPGPUs in HPC VILLE TIMONEN Åbo Akademi University 2.11.2010 @ CSC Content Background How do GPUs pull off higher throughput Typical architecture Current situation & the future GPGPU languages A tale of

More information

Emerging NVM Memory Technologies

Emerging NVM Memory Technologies Emerging NVM Memory Technologies Yuan Xie Associate Professor The Pennsylvania State University Department of Computer Science & Engineering www.cse.psu.edu/~yuanxie yuanxie@cse.psu.edu Position Statement

More information

CUDA OPTIMIZATION WITH NVIDIA NSIGHT ECLIPSE EDITION. Julien Demouth, NVIDIA Cliff Woolley, NVIDIA

CUDA OPTIMIZATION WITH NVIDIA NSIGHT ECLIPSE EDITION. Julien Demouth, NVIDIA Cliff Woolley, NVIDIA CUDA OPTIMIZATION WITH NVIDIA NSIGHT ECLIPSE EDITION Julien Demouth, NVIDIA Cliff Woolley, NVIDIA WHAT WILL YOU LEARN? An iterative method to optimize your GPU code A way to conduct that method with NVIDIA

More information

Optimizing Sparse Matrix Vector Multiplication on Emerging Multicores

Optimizing Sparse Matrix Vector Multiplication on Emerging Multicores Optimizing Sparse Matrix Vector Multiplication on Emerging Multicores Orhan Kislal, Wei Ding, Mahmut Kandemir The Pennsylvania State University University Park, Pennsylvania, USA omk03, wzd09, kandemir@cse.psu.edu

More information

GPU Sparse Graph Traversal. Duane Merrill

GPU Sparse Graph Traversal. Duane Merrill GPU Sparse Graph Traversal Duane Merrill Breadth-first search of graphs (BFS) 1. Pick a source node 2. Rank every vertex by the length of shortest path from source Or label every vertex by its predecessor

More information

Milind Kulkarni Research Statement

Milind Kulkarni Research Statement Milind Kulkarni Research Statement With the increasing ubiquity of multicore processors, interest in parallel programming is again on the upswing. Over the past three decades, languages and compilers researchers

More information

Exploring the Hidden Dimension in Graph Processing

Exploring the Hidden Dimension in Graph Processing Exploring the Hidden Dimension in Graph Processing Mingxing Zhang, Yongwei Wu, Kang Chen, *Xuehai Qian, Xue Li, and Weimin Zheng Tsinghua University *University of Shouthern California Graph is Ubiquitous

More information

CSCI 402: Computer Architectures. Parallel Processors (2) Fengguang Song Department of Computer & Information Science IUPUI.

CSCI 402: Computer Architectures. Parallel Processors (2) Fengguang Song Department of Computer & Information Science IUPUI. CSCI 402: Computer Architectures Parallel Processors (2) Fengguang Song Department of Computer & Information Science IUPUI 6.6 - End Today s Contents GPU Cluster and its network topology The Roofline performance

More information

TR An Overview of NVIDIA Tegra K1 Architecture. Ang Li, Radu Serban, Dan Negrut

TR An Overview of NVIDIA Tegra K1 Architecture. Ang Li, Radu Serban, Dan Negrut TR-2014-17 An Overview of NVIDIA Tegra K1 Architecture Ang Li, Radu Serban, Dan Negrut November 20, 2014 Abstract This paperwork gives an overview of NVIDIA s Jetson TK1 Development Kit and its Tegra K1

More information

Graph Data Management

Graph Data Management Graph Data Management Analysis and Optimization of Graph Data Frameworks presented by Fynn Leitow Overview 1) Introduction a) Motivation b) Application for big data 2) Choice of algorithms 3) Choice of

More information

Optimization solutions for the segmented sum algorithmic function

Optimization solutions for the segmented sum algorithmic function Optimization solutions for the segmented sum algorithmic function ALEXANDRU PÎRJAN Department of Informatics, Statistics and Mathematics Romanian-American University 1B, Expozitiei Blvd., district 1, code

More information

Adapted from David Patterson s slides on graduate computer architecture

Adapted from David Patterson s slides on graduate computer architecture Mei Yang Adapted from David Patterson s slides on graduate computer architecture Introduction Ten Advanced Optimizations of Cache Performance Memory Technology and Optimizations Virtual Memory and Virtual

More information

Google Workloads for Consumer Devices: Mitigating Data Movement Bottlenecks Amirali Boroumand

Google Workloads for Consumer Devices: Mitigating Data Movement Bottlenecks Amirali Boroumand Google Workloads for Consumer Devices: Mitigating Data Movement Bottlenecks Amirali Boroumand Saugata Ghose, Youngsok Kim, Rachata Ausavarungnirun, Eric Shiu, Rahul Thakur, Daehyun Kim, Aki Kuusela, Allan

More information

Data Structures and Algorithms for Counting Problems on Graphs using GPU

Data Structures and Algorithms for Counting Problems on Graphs using GPU International Journal of Networking and Computing www.ijnc.org ISSN 85-839 (print) ISSN 85-847 (online) Volume 3, Number, pages 64 88, July 3 Data Structures and Algorithms for Counting Problems on Graphs

More information

EE/CSCI 451 Midterm 1

EE/CSCI 451 Midterm 1 EE/CSCI 451 Midterm 1 Spring 2018 Instructor: Xuehai Qian Friday: 02/26/2018 Problem # Topic Points Score 1 Definitions 20 2 Memory System Performance 10 3 Cache Performance 10 4 Shared Memory Programming

More information

A Scalable Parallel LSQR Algorithm for Solving Large-Scale Linear System for Seismic Tomography

A Scalable Parallel LSQR Algorithm for Solving Large-Scale Linear System for Seismic Tomography 1 A Scalable Parallel LSQR Algorithm for Solving Large-Scale Linear System for Seismic Tomography He Huang, Liqiang Wang, Po Chen(University of Wyoming) John Dennis (NCAR) 2 LSQR in Seismic Tomography

More information

A Case for Core-Assisted Bottleneck Acceleration in GPUs Enabling Flexible Data Compression with Assist Warps

A Case for Core-Assisted Bottleneck Acceleration in GPUs Enabling Flexible Data Compression with Assist Warps A Case for Core-Assisted Bottleneck Acceleration in GPUs Enabling Flexible Data Compression with Assist Warps Nandita Vijaykumar Gennady Pekhimenko, Adwait Jog, Abhishek Bhowmick, Rachata Ausavarangnirun,

More information

AS we are now in the big data era, the data volume

AS we are now in the big data era, the data volume IEEE TRANSACTIONS ON COMPUTER-AIDED DESIGN OF INTEGRATED CIRCUITS AND SYSTEMS, VOL. XX, NO. X, MONTH YEAR 1 GraphH: A Processing-in-Memory Architecture for Large-scale Graph Processing Guohao Dai, Student

More information

Lecture 22 : Distributed Systems for ML

Lecture 22 : Distributed Systems for ML 10-708: Probabilistic Graphical Models, Spring 2017 Lecture 22 : Distributed Systems for ML Lecturer: Qirong Ho Scribes: Zihang Dai, Fan Yang 1 Introduction Big data has been very popular in recent years.

More information

Order or Shuffle: Empirically Evaluating Vertex Order Impact on Parallel Graph Computations

Order or Shuffle: Empirically Evaluating Vertex Order Impact on Parallel Graph Computations Order or Shuffle: Empirically Evaluating Vertex Order Impact on Parallel Graph Computations George M. Slota 1 Sivasankaran Rajamanickam 2 Kamesh Madduri 3 1 Rensselaer Polytechnic Institute, 2 Sandia National

More information

CS 426 Parallel Computing. Parallel Computing Platforms

CS 426 Parallel Computing. Parallel Computing Platforms CS 426 Parallel Computing Parallel Computing Platforms Ozcan Ozturk http://www.cs.bilkent.edu.tr/~ozturk/cs426/ Slides are adapted from ``Introduction to Parallel Computing'' Topic Overview Implicit Parallelism:

More information

Lecture 13: March 25

Lecture 13: March 25 CISC 879 Software Support for Multicore Architectures Spring 2007 Lecture 13: March 25 Lecturer: John Cavazos Scribe: Ying Yu 13.1. Bryan Youse-Optimization of Sparse Matrix-Vector Multiplication on Emerging

More information

GPU/CPU Heterogeneous Work Partitioning Riya Savla (rds), Ridhi Surana (rsurana), Jordan Tick (jrtick) Computer Architecture, Spring 18

GPU/CPU Heterogeneous Work Partitioning Riya Savla (rds), Ridhi Surana (rsurana), Jordan Tick (jrtick) Computer Architecture, Spring 18 ABSTRACT GPU/CPU Heterogeneous Work Partitioning Riya Savla (rds), Ridhi Surana (rsurana), Jordan Tick (jrtick) 15-740 Computer Architecture, Spring 18 With the death of Moore's Law, programmers can no

More information

Harp-DAAL for High Performance Big Data Computing

Harp-DAAL for High Performance Big Data Computing Harp-DAAL for High Performance Big Data Computing Large-scale data analytics is revolutionizing many business and scientific domains. Easy-touse scalable parallel techniques are necessary to process big

More information

8. Hardware-Aware Numerics. Approaching supercomputing...

8. Hardware-Aware Numerics. Approaching supercomputing... Approaching supercomputing... Numerisches Programmieren, Hans-Joachim Bungartz page 1 of 48 8.1. Hardware-Awareness Introduction Since numerical algorithms are ubiquitous, they have to run on a broad spectrum

More information

Squeezing out All the Value of Loaded Data: An Out-of-core Graph Processing System with Reduced Disk I/O

Squeezing out All the Value of Loaded Data: An Out-of-core Graph Processing System with Reduced Disk I/O Squeezing out All the Value of Loaded Data: An Out-of-core Graph Processing System with Reduced Disk I/O Zhiyuan Ai, Mingxing Zhang, and Yongwei Wu, Department of Computer Science and Technology, Tsinghua

More information

8. Hardware-Aware Numerics. Approaching supercomputing...

8. Hardware-Aware Numerics. Approaching supercomputing... Approaching supercomputing... Numerisches Programmieren, Hans-Joachim Bungartz page 1 of 22 8.1. Hardware-Awareness Introduction Since numerical algorithms are ubiquitous, they have to run on a broad spectrum

More information

Heterogeneous Memory Subsystem for Natural Graph Analytics

Heterogeneous Memory Subsystem for Natural Graph Analytics Heterogeneous Memory Subsystem for Natural Graph Analytics Abraham Addisie, Hiwot Kassa, Opeoluwa Matthews, and Valeria Bertacco Computer Science and Engineering, University of Michigan Email: {abrahad,

More information

GPU-Based Acceleration for CT Image Reconstruction

GPU-Based Acceleration for CT Image Reconstruction GPU-Based Acceleration for CT Image Reconstruction Xiaodong Yu Advisor: Wu-chun Feng Collaborators: Guohua Cao, Hao Gong Outline Introduction and Motivation Background Knowledge Challenges and Proposed

More information

Multi-dimensional Parallel Training of Winograd Layer on Memory-Centric Architecture

Multi-dimensional Parallel Training of Winograd Layer on Memory-Centric Architecture The 51st Annual IEEE/ACM International Symposium on Microarchitecture Multi-dimensional Parallel Training of Winograd Layer on Memory-Centric Architecture Byungchul Hong Yeonju Ro John Kim FuriosaAI Samsung

More information

EE/CSCI 451: Parallel and Distributed Computation

EE/CSCI 451: Parallel and Distributed Computation EE/CSCI 451: Parallel and Distributed Computation Lecture #11 2/21/2017 Xuehai Qian Xuehai.qian@usc.edu http://alchem.usc.edu/portal/xuehaiq.html University of Southern California 1 Outline Midterm 1:

More information

EE382N (20): Computer Architecture - Parallelism and Locality Fall 2011 Lecture 23 Memory Systems

EE382N (20): Computer Architecture - Parallelism and Locality Fall 2011 Lecture 23 Memory Systems EE382 (20): Computer Architecture - Parallelism and Locality Fall 2011 Lecture 23 Memory Systems Mattan Erez The University of Texas at Austin EE382: Principles of Computer Architecture, Fall 2011 -- Lecture

More information

An Energy-Efficient Abstraction for Simultaneous Breadth-First Searches. Adam McLaughlin, Jason Riedy, and David A. Bader

An Energy-Efficient Abstraction for Simultaneous Breadth-First Searches. Adam McLaughlin, Jason Riedy, and David A. Bader An Energy-Efficient Abstraction for Simultaneous Breadth-First Searches Adam McLaughlin, Jason Riedy, and David A. Bader Problem Data is unstructured, heterogeneous, and vast Serious opportunities for

More information

Views of Memory. Real machines have limited amounts of memory. Programmer doesn t want to be bothered. 640KB? A few GB? (This laptop = 2GB)

Views of Memory. Real machines have limited amounts of memory. Programmer doesn t want to be bothered. 640KB? A few GB? (This laptop = 2GB) CS6290 Memory Views of Memory Real machines have limited amounts of memory 640KB? A few GB? (This laptop = 2GB) Programmer doesn t want to be bothered Do you think, oh, this computer only has 128MB so

More information

GPU Sparse Graph Traversal

GPU Sparse Graph Traversal GPU Sparse Graph Traversal Duane Merrill (NVIDIA) Michael Garland (NVIDIA) Andrew Grimshaw (Univ. of Virginia) UNIVERSITY of VIRGINIA Breadth-first search (BFS) 1. Pick a source node 2. Rank every vertex

More information

Transparent Offloading and Mapping (TOM) Enabling Programmer-Transparent Near-Data Processing in GPU Systems Kevin Hsieh

Transparent Offloading and Mapping (TOM) Enabling Programmer-Transparent Near-Data Processing in GPU Systems Kevin Hsieh Transparent Offloading and Mapping () Enabling Programmer-Transparent Near-Data Processing in GPU Systems Kevin Hsieh Eiman Ebrahimi, Gwangsun Kim, Niladrish Chatterjee, Mike O Connor, Nandita Vijaykumar,

More information

Near Memory Key/Value Lookup Acceleration MemSys 2017

Near Memory Key/Value Lookup Acceleration MemSys 2017 Near Key/Value Lookup Acceleration MemSys 2017 October 3, 2017 Scott Lloyd, Maya Gokhale Center for Applied Scientific Computing This work was performed under the auspices of the U.S. Department of Energy

More information

Jordan Boyd-Graber University of Maryland. Thursday, March 3, 2011

Jordan Boyd-Graber University of Maryland. Thursday, March 3, 2011 Data-Intensive Information Processing Applications! Session #5 Graph Algorithms Jordan Boyd-Graber University of Maryland Thursday, March 3, 2011 This work is licensed under a Creative Commons Attribution-Noncommercial-Share

More information

Data Partitioning on Heterogeneous Multicore and Multi-GPU systems Using Functional Performance Models of Data-Parallel Applictions

Data Partitioning on Heterogeneous Multicore and Multi-GPU systems Using Functional Performance Models of Data-Parallel Applictions Data Partitioning on Heterogeneous Multicore and Multi-GPU systems Using Functional Performance Models of Data-Parallel Applictions Ziming Zhong Vladimir Rychkov Alexey Lastovetsky Heterogeneous Computing

More information

Summary of Computer Architecture

Summary of Computer Architecture Summary of Computer Architecture Summary CHAP 1: INTRODUCTION Structure Top Level Peripherals Computer Central Processing Unit Main Memory Computer Systems Interconnection Communication lines Input Output

More information

Massively Parallel Computing on Silicon: SIMD Implementations. V.M.. Brea Univ. of Santiago de Compostela Spain

Massively Parallel Computing on Silicon: SIMD Implementations. V.M.. Brea Univ. of Santiago de Compostela Spain Massively Parallel Computing on Silicon: SIMD Implementations V.M.. Brea Univ. of Santiago de Compostela Spain GOAL Give an overview on the state-of of-the- art of Digital on-chip CMOS SIMD Solutions,

More information

Directed Optimization On Stencil-based Computational Fluid Dynamics Application(s)

Directed Optimization On Stencil-based Computational Fluid Dynamics Application(s) Directed Optimization On Stencil-based Computational Fluid Dynamics Application(s) Islam Harb 08/21/2015 Agenda Motivation Research Challenges Contributions & Approach Results Conclusion Future Work 2

More information

Link Analysis in the Cloud

Link Analysis in the Cloud Cloud Computing Link Analysis in the Cloud Dell Zhang Birkbeck, University of London 2017/18 Graph Problems & Representations What is a Graph? G = (V,E), where V represents the set of vertices (nodes)

More information

Automatic Compiler-Based Optimization of Graph Analytics for the GPU. Sreepathi Pai The University of Texas at Austin. May 8, 2017 NVIDIA GTC

Automatic Compiler-Based Optimization of Graph Analytics for the GPU. Sreepathi Pai The University of Texas at Austin. May 8, 2017 NVIDIA GTC Automatic Compiler-Based Optimization of Graph Analytics for the GPU Sreepathi Pai The University of Texas at Austin May 8, 2017 NVIDIA GTC Parallel Graph Processing is not easy 299ms HD-BFS 84ms USA Road

More information

Complexity and Advanced Algorithms. Introduction to Parallel Algorithms

Complexity and Advanced Algorithms. Introduction to Parallel Algorithms Complexity and Advanced Algorithms Introduction to Parallel Algorithms Why Parallel Computing? Save time, resources, memory,... Who is using it? Academia Industry Government Individuals? Two practical

More information

Lecture 6: Input Compaction and Further Studies

Lecture 6: Input Compaction and Further Studies PASI Summer School Advanced Algorithmic Techniques for GPUs Lecture 6: Input Compaction and Further Studies 1 Objective To learn the key techniques for compacting input data for reduced consumption of

More information

Parallelization of Shortest Path Graph Kernels on Multi-Core CPUs and GPU

Parallelization of Shortest Path Graph Kernels on Multi-Core CPUs and GPU Parallelization of Shortest Path Graph Kernels on Multi-Core CPUs and GPU Lifan Xu Wei Wang Marco A. Alvarez John Cavazos Dongping Zhang Department of Computer and Information Science University of Delaware

More information

Frequency Domain Acceleration of Convolutional Neural Networks on CPU-FPGA Shared Memory System

Frequency Domain Acceleration of Convolutional Neural Networks on CPU-FPGA Shared Memory System Frequency Domain Acceleration of Convolutional Neural Networks on CPU-FPGA Shared Memory System Chi Zhang, Viktor K Prasanna University of Southern California {zhan527, prasanna}@usc.edu fpga.usc.edu ACM

More information

ACCELERATING MATRIX PROCESSING WITH GPUs. Nicholas Malaya, Shuai Che, Joseph Greathouse, Rene van Oostrum, and Michael Schulte AMD Research

ACCELERATING MATRIX PROCESSING WITH GPUs. Nicholas Malaya, Shuai Che, Joseph Greathouse, Rene van Oostrum, and Michael Schulte AMD Research ACCELERATING MATRIX PROCESSING WITH GPUs Nicholas Malaya, Shuai Che, Joseph Greathouse, Rene van Oostrum, and Michael Schulte AMD Research ACCELERATING MATRIX PROCESSING WITH GPUS MOTIVATION Matrix operations

More information

Efficient graph computation on hybrid CPU and GPU systems

Efficient graph computation on hybrid CPU and GPU systems J Supercomput (2015) 71:1563 1586 DOI 10.1007/s11227-015-1378-z Efficient graph computation on hybrid CPU and GPU systems Tao Zhang Jingjie Zhang Wei Shu Min-You Wu Xiaoyao Liang Published online: 21 January

More information

Advanced Database Systems

Advanced Database Systems Lecture IV Query Processing Kyumars Sheykh Esmaili Basic Steps in Query Processing 2 Query Optimization Many equivalent execution plans Choosing the best one Based on Heuristics, Cost Will be discussed

More information

S WHAT THE PROFILER IS TELLING YOU: OPTIMIZING GPU KERNELS. Jakob Progsch, Mathias Wagner GTC 2018

S WHAT THE PROFILER IS TELLING YOU: OPTIMIZING GPU KERNELS. Jakob Progsch, Mathias Wagner GTC 2018 S8630 - WHAT THE PROFILER IS TELLING YOU: OPTIMIZING GPU KERNELS Jakob Progsch, Mathias Wagner GTC 2018 1. Know your hardware BEFORE YOU START What are the target machines, how many nodes? Machine-specific

More information

GPU > CPU. FOR HIGH PERFORMANCE COMPUTING PRESENTATION BY - SADIQ PASHA CHETHANA DILIP

GPU > CPU. FOR HIGH PERFORMANCE COMPUTING PRESENTATION BY - SADIQ PASHA CHETHANA DILIP GPU > CPU. FOR HIGH PERFORMANCE COMPUTING PRESENTATION BY - SADIQ PASHA CHETHANA DILIP INTRODUCTION or With the exponential increase in computational power of todays hardware, the complexity of the problem

More information

Continuum Computer Architecture

Continuum Computer Architecture Plenary Presentation to the Workshop on Frontiers of Extreme Computing: Continuum Computer Architecture Thomas Sterling California Institute of Technology and Louisiana State University October 25, 2005

More information

Motivation for Parallelism. Motivation for Parallelism. ILP Example: Loop Unrolling. Types of Parallelism

Motivation for Parallelism. Motivation for Parallelism. ILP Example: Loop Unrolling. Types of Parallelism Motivation for Parallelism Motivation for Parallelism The speed of an application is determined by more than just processor speed. speed Disk speed Network speed... Multiprocessors typically improve the

More information

PuLP: Scalable Multi-Objective Multi-Constraint Partitioning for Small-World Networks

PuLP: Scalable Multi-Objective Multi-Constraint Partitioning for Small-World Networks PuLP: Scalable Multi-Objective Multi-Constraint Partitioning for Small-World Networks George M. Slota 1,2 Kamesh Madduri 2 Sivasankaran Rajamanickam 1 1 Sandia National Laboratories, 2 The Pennsylvania

More information