Graph Prefetching Using Data Structure Knowledge SAM AINSWORTH, TIMOTHY M. JONES
|
|
- Horatio Hart
- 5 years ago
- Views:
Transcription
1 Graph Prefetching Using Data Structure Knowledge SAM AINSWORTH, TIMOTHY M. JONES
2 Background and Motivation Graph applications are memory latency bound Caches & Prefetching are existing solution for memory latency However, irregular access patterns hinder their usefulness Key insight: accesses seem irregular at individual load/store level, but have predictable structure when we consider the high-level algorithm e.g. Breadth-first search (BFS) 2
3 Background and Motivation Existing SW/HW prefetching is insufficient Speedup Both L2 L DCU L DCU IP Both L All L + L2 (a) Hardware prefetchers Speedup HW SW SW+HW (b) Hardware vs software Figure 3: Hardware and software prefetching on Graph 500 search with scale 2, edge factor 0. 3
4 Background and Motivation - BFS Unvisited A Visited C B Will be visited after visiting C So we can prefetch after visiting B... 4
5 Prefetching with Algorithmic Knowledge Design a hardware prefetcher that relies on access patterns specific to algorithms Target BFS, but can support a wider range of algorithms/access patterns Specific to Compressed Sparse Row (CSR) format Prefetcher snoop reads/writes from L cache Achieve an average of 2.3x speedup 5
6 Review: Compressed Sparse Row (CSR) Sparse representation, with a vertex list indirecting to an edge list Authors add a visited list and work list specifically for BFS (a) Graph 7 4 Work List Vertex List Edge List (b) CSR breadth-first search Visited Figure : A compressed sparse row format graph and breadth-first search on it. False True True True True True False True 6
7 Poor Locality of Accesses in Graphs Stall Rate (%) Edge factor 5 Edge factor 0 Edge factor Scale (a) Stall rate Miss Rate (%) Edge factor 5 Edge factor 0 Edge factor Scale (b) L miss rate (c) Source of misses 7
8 Overview of Approach Prefetch all relevant data of o-distance away from the current worklist entry: visited[edgelist[vertexlist[worklist[n+o]]]] Prefetcher snoops the core-to-l mem. accesses to determine which data to prefetch Observation Load from worklist[n] Prefetch vid = worklist[n] Prefetch from vertexlist[vid] Prefetch vid = edgelist[eid] Vertex-O set Mode Action Prefetch worklist[n+o] Prefetch vertexlist[vid] Prefetch edgelist[vertexlist[vid]] to edgelist[vertexlist[vid+]] (2 lines max) Prefetch visited[vid] 8
9 System Architecture Main Memory L2 Cache Work List Vertex List Edge List Visited List Dcache (a) System overview To / From L2 Cache Snoops Prefetch Reqs Prefetched Data DTLB Core EWMA Calculator Address Generator Request Queue Prefetcher Config Figure 5: A graph prefetcher, configured with in 9
10 Prefetcher Microarchitecture Snoops & Prefetched Data From L Cache Address Filter Address Bounds Registers Work List Start Vertex List Start Edge List Start Visited List Start Work List End Vertex List End Edge List End Visited List End Programmer must specify these bounds To DTLB & L Cache Prefetch Request Queue EWMA Unit Prefetch Address Generator Work List Time EWMA Data Time EWMA Ratio Register (b) Prefetcher microarchitecture detail 0
11 Determining Prefetch Distance Easy Case: Time to process a vertex (work_list_time) is less than time to pre-fetch the next vertex (data_time) o work list time = data time work_list_time and data_time vary wildly => use exponentially weighted moving averages (EWMA) Use a safe bound because EWMA often underestimates data_time: o =+ k data time work list time
12 Determining Prefetch Distance Problem: work_list_time > data_time Pre-fetched data is not used timely, might get kicked out of cache before it is used Happens with high-degree vertices Solution: Large vertex mode Base prefetch on how far along we have processed the high-degree vertex»possible because we know the range of the edge indices Prefetch within edgelist for larger vertex Fetch need vertex in worklist when almost done with current vertex s edges 2
13 Extensions Technique can be extended to other algorithms: Parallel BFS Sequentially scanning vertex and edge data (e.g. PageRank) 3
14 Methodology gem5 simulator Set of algorithms from Graph500 and the Boost Graph Library: BFS-like traversal: Connected components, BFS, betweenness-centrality, ST connectivity Sequential access: PageRank, sequential coloring 4
15 Evaluation - BFS-like traversal Speedup s6e0 s9e5s9e0 Stride Stride-Indirect CC s9e5 s2e0s6e0 (a) Graph 500 Graph Search s9e5 s9e0s9e5 s2e0 5
16 Evaluation - BFS-like traversal Stride Stride-Indirect Graph 3 BFS BC ST 2.5 Speedup 2.5 web road web (b) BGL road web road 6
17 Significantly improved L hit-rate L Cache Read Hit Rate s6e0 s9e5 s9e0 s9e5 s2e0 No Prefetching CC Search BFS BC ST s6e0 s9e5 s9e0 s9e5 s2e0 webroad Graph Prefetching webroad webroad Figure 7: Hit rates in the L cache with and without prefetching. 7
18 Prefetching has low overheads Extra Memory Accesses (%) s6e0 s9e5 s9e CC s9e5 s2e0 Search web road BFS BC ST Figure 8: Percentage of additional memory accesses as a result of using our prefetcher. s6e0 s9e5 s9e0 s9e5 s2e0 web road L Utilisation Rate CC Search BFS BC ST Figure 9: Rates of prefetched cache lines that are used before leaving the L cache. 8
19 Prefetching Analysis Most of the benefit comes from prefetching visited & edge lists -> as expected Speedup Proportion s6e0 s9e5 s9e0 Visited Edge Vertex Work CC Search BFS BC ST s9e5 s2e0 s6e0 s9e5 s9e0 s9e5 s2e0 webroad webroad webroad Figure 0: The proportion of speedup from prefetching each data structure within the breadth first search. 9
20 Prefetching works for other traversal types Example: parallel BFS Speedup Graph PF No PF 2 4 Number of Cores Figure : Speedup relative to core with a parallel implementation of Graph500 search with scale 2, edge factor 0 using OpenMP. 20
21 Prefetching works for other traversal types Stride Graph PR SC Speedup web road web road Figure 2: Speedup for di erent types of prefetching when running PageRank and Sequential Colouring. 2
22 Conclusion Prefetching with knowledge of the graph traversal order significantly improves its performance Works for different traversal types (BFS, sequential scan,...) 22
Graph Prefetching Using Data Structure Knowledge
Graph Prefetching Using Data Structure Knowledge Sam Ainsworth University of Cambridge sam.ainsworth@cl.cam.ac.uk Timothy M. Jones University of Cambridge timothy.jones@cl.cam.ac.uk ABSTRACT Searches on
More informationPerformance Models for Evaluation and Automatic Tuning of Symmetric Sparse Matrix-Vector Multiply
Performance Models for Evaluation and Automatic Tuning of Symmetric Sparse Matrix-Vector Multiply University of California, Berkeley Berkeley Benchmarking and Optimization Group (BeBOP) http://bebop.cs.berkeley.edu
More informationIntroduction. Stream processor: high computation to bandwidth ratio To make legacy hardware more like stream processor: We study the bandwidth problem
Introduction Stream processor: high computation to bandwidth ratio To make legacy hardware more like stream processor: Increase computation power Make the best use of available bandwidth We study the bandwidth
More informationEfficient Data Supply for Hardware Accelerators with Prefetching and Access/ Execute Decoupling
Cornell University Efficient Data Supply for Hardware Accelerators with Prefetching and Access/ Execute Decoupling Tao Chen and G. Edward Suh Computer Systems Laboratory Cornell University Accelerator-Rich
More informationReview: Creating a Parallel Program. Programming for Performance
Review: Creating a Parallel Program Can be done by programmer, compiler, run-time system or OS Steps for creating parallel program Decomposition Assignment of tasks to processes Orchestration Mapping (C)
More informationDECstation 5000 Miss Rates. Cache Performance Measures. Example. Cache Performance Improvements. Types of Cache Misses. Cache Performance Equations
DECstation 5 Miss Rates Cache Performance Measures % 3 5 5 5 KB KB KB 8 KB 6 KB 3 KB KB 8 KB Cache size Direct-mapped cache with 3-byte blocks Percentage of instruction references is 75% Instr. Cache Data
More informationLecture 13: March 25
CISC 879 Software Support for Multicore Architectures Spring 2007 Lecture 13: March 25 Lecturer: John Cavazos Scribe: Ying Yu 13.1. Bryan Youse-Optimization of Sparse Matrix-Vector Multiplication on Emerging
More informationHigh-Performance Graph Primitives on the GPU: Design and Implementation of Gunrock
High-Performance Graph Primitives on the GPU: Design and Implementation of Gunrock Yangzihao Wang University of California, Davis yzhwang@ucdavis.edu March 24, 2014 Yangzihao Wang (yzhwang@ucdavis.edu)
More informationAccelerating Pointer Chasing in 3D-Stacked Memory: Challenges, Mechanisms, Evaluation Kevin Hsieh
Accelerating Pointer Chasing in 3D-Stacked : Challenges, Mechanisms, Evaluation Kevin Hsieh Samira Khan, Nandita Vijaykumar, Kevin K. Chang, Amirali Boroumand, Saugata Ghose, Onur Mutlu Executive Summary
More informationCACHE-GUIDED SCHEDULING
CACHE-GUIDED SCHEDULING EXPLOITING CACHES TO MAXIMIZE LOCALITY IN GRAPH PROCESSING Anurag Mukkara, Nathan Beckmann, Daniel Sanchez 1 st AGP Toronto, Ontario 24 June 2017 Graph processing is memory-bound
More informationEECS 470. Lecture 15. Prefetching. Fall 2018 Jon Beaumont. History Table. Correlating Prediction Table
Lecture 15 History Table Correlating Prediction Table Prefetching Latest A0 A0,A1 A3 11 Fall 2018 Jon Beaumont A1 http://www.eecs.umich.edu/courses/eecs470 Prefetch A3 Slides developed in part by Profs.
More informationBloom Filtering Cache Misses for Accurate Data Speculation and Prefetching
Bloom Filtering Cache Misses for Accurate Data Speculation and Prefetching Jih-Kwon Peir, Shih-Chang Lai, Shih-Lien Lu, Jared Stark, Konrad Lai peir@cise.ufl.edu Computer & Information Science and Engineering
More informationCGAcc: A Compressed Sparse Row Representation-Based BFS Graph Traversal Accelerator on Hybrid Memory Cube
electronics Article CGAcc: A Compressed Sparse Row Representation-Based BFS Graph Traversal Accelerator on Hybrid Memory Cube Cheng Qian 1, Bruce Childers 2, Libo Huang 1, *, Hui Guo 1 and Zhiying Wang
More informationChallenges in large-scale graph processing on HPC platforms and the Graph500 benchmark. by Nkemdirim Dockery
Challenges in large-scale graph processing on HPC platforms and the Graph500 benchmark by Nkemdirim Dockery High Performance Computing Workloads Core-memory sized Floating point intensive Well-structured
More informationTechnical Report. Prefetching for complex memory access patterns. Sam Ainsworth. Number 923. July Computer Laboratory
Technical Report UCAM-CL-TR-923 ISSN 1476-2986 Number 923 Computer Laboratory Prefetching for complex memory access patterns Sam Ainsworth July 2018 15 JJ Thomson Avenue Cambridge CB3 0FD United Kingdom
More informationCS222: Cache Performance Improvement
CS222: Cache Performance Improvement Dr. A. Sahu Dept of Comp. Sc. & Engg. Indian Institute of Technology Guwahati Outline Eleven Advanced Cache Performance Optimization Prev: Reducing hit time & Increasing
More informationCS 426 Parallel Computing. Parallel Computing Platforms
CS 426 Parallel Computing Parallel Computing Platforms Ozcan Ozturk http://www.cs.bilkent.edu.tr/~ozturk/cs426/ Slides are adapted from ``Introduction to Parallel Computing'' Topic Overview Implicit Parallelism:
More informationCache Performance! ! Memory system and processor performance:! ! Improving memory hierarchy performance:! CPU time = IC x CPI x Clock time
Cache Performance!! Memory system and processor performance:! CPU time = IC x CPI x Clock time CPU performance eqn. CPI = CPI ld/st x IC ld/st IC + CPI others x IC others IC CPI ld/st = Pipeline time +
More informationOptimizing Cache Performance for Graph Analytics. Yunming Zhang Presentation
Optimizing Cache Performance for Graph Analytics Yunming Zhang 6.886 Presentation Goals How to optimize in-memory graph applications How to go about performance engineering just about anything In-memory
More informationComputer Architecture Spring 2016
omputer Architecture Spring 2016 Lecture 09: Prefetching Shuai Wang Department of omputer Science and Technology Nanjing University Prefetching(1/3) Fetch block ahead of demand Target compulsory, capacity,
More information/ : Computer Architecture and Design Fall Final Exam December 4, Name: ID #:
16.482 / 16.561: Computer Architecture and Design Fall 2014 Final Exam December 4, 2014 Name: ID #: For this exam, you may use a calculator and two 8.5 x 11 double-sided page of notes. All other electronic
More informationECE/CS 552: Pipelining to Superscalar Prof. Mikko Lipasti
ECE/CS 552: Pipelining to Superscalar Prof. Mikko Lipasti Lecture notes based in part on slides created by Mark Hill, David Wood, Guri Sohi, John Shen and Jim Smith Pipelining to Superscalar Forecast Real
More informationA Comparative Performance Evaluation of Different Application Domains on Server Processor Architectures
A Comparative Performance Evaluation of Different Application Domains on Server Processor Architectures W.M. Roshan Weerasuriya and D.N. Ranasinghe University of Colombo School of Computing A Comparative
More informationPractical Near-Data Processing for In-Memory Analytics Frameworks
Practical Near-Data Processing for In-Memory Analytics Frameworks Mingyu Gao, Grant Ayers, Christos Kozyrakis Stanford University http://mast.stanford.edu PACT Oct 19, 2015 Motivating Trends End of Dennard
More informationCallisto-RTS: Fine-Grain Parallel Loops
Callisto-RTS: Fine-Grain Parallel Loops Tim Harris, Oracle Labs Stefan Kaestle, ETH Zurich 7 July 2015 The following is intended to provide some insight into a line of research in Oracle Labs. It is intended
More informationCache Performance and Memory Management: From Absolute Addresses to Demand Paging. Cache Performance
6.823, L11--1 Cache Performance and Memory Management: From Absolute Addresses to Demand Paging Asanovic Laboratory for Computer Science M.I.T. http://www.csg.lcs.mit.edu/6.823 Cache Performance 6.823,
More informationEITF20: Computer Architecture Part 5.1.1: Virtual Memory
EITF20: Computer Architecture Part 5.1.1: Virtual Memory Liang Liu liang.liu@eit.lth.se 1 Outline Reiteration Cache optimization Virtual memory Case study AMD Opteron Summary 2 Memory hierarchy 3 Cache
More informationA Scheme of Predictor Based Stream Buffers. Bill Hodges, Guoqiang Pan, Lixin Su
A Scheme of Predictor Based Stream Buffers Bill Hodges, Guoqiang Pan, Lixin Su Outline Background and motivation Project hypothesis Our scheme of predictor-based stream buffer Predictors Predictor table
More informationAnnouncements. ! Previous lecture. Caches. Inf3 Computer Architecture
Announcements! Previous lecture Caches Inf3 Computer Architecture - 2016-2017 1 Recap: Memory Hierarchy Issues! Block size: smallest unit that is managed at each level E.g., 64B for cache lines, 4KB for
More informationLRU. Pseudo LRU A B C D E F G H A B C D E F G H H H C. Copyright 2012, Elsevier Inc. All rights reserved.
LRU A list to keep track of the order of access to every block in the set. The least recently used block is replaced (if needed). How many bits we need for that? 27 Pseudo LRU A B C D E F G H A B C D E
More informationProcessors. Young W. Lim. May 12, 2016
Processors Young W. Lim May 12, 2016 Copyright (c) 2016 Young W. Lim. Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version
More informationStatic, multiple-issue (superscaler) pipelines
Static, multiple-issue (superscaler) pipelines Start more than one instruction in the same cycle Instruction Register file EX + MEM + WB PC Instruction Register file EX + MEM + WB 79 A static two-issue
More informationGraph Partitioning for Scalable Distributed Graph Computations
Graph Partitioning for Scalable Distributed Graph Computations Aydın Buluç ABuluc@lbl.gov Kamesh Madduri madduri@cse.psu.edu 10 th DIMACS Implementation Challenge, Graph Partitioning and Graph Clustering
More informationLecture 13: Memory Consistency. + a Course-So-Far Review. Parallel Computer Architecture and Programming CMU , Spring 2013
Lecture 13: Memory Consistency + a Course-So-Far Review Parallel Computer Architecture and Programming Today: what you should know Understand the motivation for relaxed consistency models Understand the
More informationMartin Kruliš, v
Martin Kruliš 1 Optimizations in General Code And Compilation Memory Considerations Parallelism Profiling And Optimization Examples 2 Premature optimization is the root of all evil. -- D. Knuth Our goal
More information1 Introduction to Parallel Computing
1 Introduction to Parallel Computing 1.1 Goals of Parallel vs Distributed Computing Distributed computing, commonly represented by distributed services such as the world wide web, is one form of computational
More informationUsing Intel Streaming SIMD Extensions for 3D Geometry Processing
Using Intel Streaming SIMD Extensions for 3D Geometry Processing Wan-Chun Ma, Chia-Lin Yang Dept. of Computer Science and Information Engineering National Taiwan University firebird@cmlab.csie.ntu.edu.tw,
More informationChapter 4. The Processor
Chapter 4 The Processor Introduction CPU performance factors Instruction count Determined by ISA and compiler CPI and Cycle time Determined by CPU hardware We will examine two MIPS implementations A simplified
More informationWhat SMT can do for You. John Hague, IBM Consultant Oct 06
What SMT can do for ou John Hague, IBM Consultant Oct 06 100.000 European Centre for Medium Range Weather Forecasting (ECMWF): Growth in HPC performance 10.000 teraflops sustained 1.000 0.100 0.010 VPP700
More informationTanuj Kr Aasawat, Tahsin Reza, Matei Ripeanu Networked Systems Laboratory (NetSysLab) University of British Columbia
How well do CPU, GPU and Hybrid Graph Processing Frameworks Perform? Tanuj Kr Aasawat, Tahsin Reza, Matei Ripeanu Networked Systems Laboratory (NetSysLab) University of British Columbia Networked Systems
More informationDual-Core Execution: Building A Highly Scalable Single-Thread Instruction Window
Dual-Core Execution: Building A Highly Scalable Single-Thread Instruction Window Huiyang Zhou School of Computer Science University of Central Florida New Challenges in Billion-Transistor Processor Era
More informationECE/CS 757: Homework 1
ECE/CS 757: Homework 1 Cores and Multithreading 1. A CPU designer has to decide whether or not to add a new micoarchitecture enhancement to improve performance (ignoring power costs) of a block (coarse-grain)
More informationLimitations of Scalar Pipelines
Limitations of Scalar Pipelines Superscalar Organization Modern Processor Design: Fundamentals of Superscalar Processors Scalar upper bound on throughput IPC = 1 Inefficient unified pipeline
More informationh Coherence Controllers
High-Throughput h Coherence Controllers Anthony-Trung Nguyen Microprocessor Research Labs Intel Corporation 9/30/03 Motivations Coherence Controller (CC) throughput is bottleneck of scalable systems. CCs
More informationRethinking Prefetching in GPGPUs: Exploiting Unique Opportunities
Rethinking Prefetching in GPGPUs: Exploiting Unique Opportunities Ahmad Lashgar Electrical and Computer Engineering University of Victoria Victoria, BC, Canada Email: lashgar@uvic.ca Amirali Baniasadi
More informationImproving Cache Performance and Memory Management: From Absolute Addresses to Demand Paging. Highly-Associative Caches
Improving Cache Performance and Memory Management: From Absolute Addresses to Demand Paging 6.823, L8--1 Asanovic Laboratory for Computer Science M.I.T. http://www.csg.lcs.mit.edu/6.823 Highly-Associative
More information1/19/2009. Data Locality. Exploiting Locality: Caches
Spring 2009 Prof. Hyesoon Kim Thanks to Prof. Loh & Prof. Prvulovic Data Locality Temporal: if data item needed now, it is likely to be needed again in near future Spatial: if data item needed now, nearby
More informationPick a time window size w. In time span w, are there, Multiple References, to nearby addresses: Spatial Locality
Pick a time window size w. In time span w, are there, Multiple References, to nearby addresses: Spatial Locality Repeated References, to a set of locations: Temporal Locality Take advantage of behavior
More informationMaximizing Face Detection Performance
Maximizing Face Detection Performance Paulius Micikevicius Developer Technology Engineer, NVIDIA GTC 2015 1 Outline Very brief review of cascaded-classifiers Parallelization choices Reducing the amount
More informationPredict Not Taken. Revisiting Branch Hazard Solutions. Filling the delay slot (e.g., in the compiler) Delayed Branch
branch taken Revisiting Branch Hazard Solutions Stall Predict Not Taken Predict Taken Branch Delay Slot Branch I+1 I+2 I+3 Predict Not Taken branch not taken Branch I+1 IF (bubble) (bubble) (bubble) (bubble)
More informationChapter 5. Topics in Memory Hierachy. Computer Architectures. Tien-Fu Chen. National Chung Cheng Univ.
Computer Architectures Chapter 5 Tien-Fu Chen National Chung Cheng Univ. Chap5-0 Topics in Memory Hierachy! Memory Hierachy Features: temporal & spatial locality Common: Faster -> more expensive -> smaller!
More informationSuggested Readings! Recap: Pipelining improves throughput! Processor comparison! Lecture 17" Short Pipelining Review! ! Readings!
1! 2! Suggested Readings!! Readings!! H&P: Chapter 4.5-4.7!! (Over the next 3-4 lectures)! Lecture 17" Short Pipelining Review! 3! Processor components! Multicore processors and programming! Recap: Pipelining
More informationInf 2B: Graphs, BFS, DFS
Inf 2B: Graphs, BFS, DFS Kyriakos Kalorkoti School of Informatics University of Edinburgh Directed and Undirected Graphs I A graph is a mathematical structure consisting of a set of vertices and a set
More informationCache Performance! ! Memory system and processor performance:! ! Improving memory hierarchy performance:! CPU time = IC x CPI x Clock time
Cache Performance!! Memory system and processor performance:! CPU time = IC x CPI x Clock time CPU performance eqn. CPI = CPI ld/st x IC ld/st IC + CPI others x IC others IC CPI ld/st = Pipeline time +
More informationLecture 16: Checkpointed Processors. Department of Electrical Engineering Stanford University
Lecture 16: Checkpointed Processors Department of Electrical Engineering Stanford University http://eeclass.stanford.edu/ee382a Lecture 18-1 Announcements Reading for today: class notes Your main focus:
More informationTutorial 11. Final Exam Review
Tutorial 11 Final Exam Review Introduction Instruction Set Architecture: contract between programmer and designers (e.g.: IA-32, IA-64, X86-64) Computer organization: describe the functional units, cache
More informationI A graph is a mathematical structure consisting of a set of. I Formally: G =(V, E), where V is a set and E V V.
Directed and Undirected Graphs Inf 2B: Graphs, BFS, DFS Kyriakos Kalorkoti School of Informatics University of Edinburgh I A graph is a mathematical structure consisting of a set of vertices and a set
More informationLecture 16: Memory Hierarchy Misses, 3 Cs and 7 Ways to Reduce Misses. Professor Randy H. Katz Computer Science 252 Fall 1995
Lecture 16: Memory Hierarchy Misses, 3 Cs and 7 Ways to Reduce Misses Professor Randy H. Katz Computer Science 252 Fall 1995 Review: Who Cares About the Memory Hierarchy? Processor Only Thus Far in Course:
More informationSEMANTIC LOCALITY & CONTEXT BASED PREFETCHING. Leeor Peled 1, Shie Mannor 1, Uri Weiser 1, Yoav Etsion 1,2
SEMANTIC LOCALITY & CONTEXT BASED PREFETCHING Leeor Peled 1, Shie Mannor 1, Uri Weiser 1, Yoav Etsion 1,2 1 Electrical Engineering and 2 Computer Science Technion Israel Institute of Technology ICRI-CI
More informationLecture 21. Software Pipelining & Prefetching. I. Software Pipelining II. Software Prefetching (of Arrays) III. Prefetching via Software Pipelining
Lecture 21 Software Pipelining & Prefetching I. Software Pipelining II. Software Prefetching (of Arrays) III. Prefetching via Software Pipelining [ALSU 10.5, 11.11.4] Phillip B. Gibbons 15-745: Software
More informationCS 206 Introduction to Computer Science II
CS 206 Introduction to Computer Science II 04 / 06 / 2018 Instructor: Michael Eckmann Today s Topics Questions? Comments? Graphs Definition Terminology two ways to represent edges in implementation traversals
More informationLecture 16: Memory Hierarchy Misses, 3 Cs and 7 Ways to Reduce Misses Professor Randy H. Katz Computer Science 252 Spring 1996
Lecture 16: Memory Hierarchy Misses, 3 Cs and 7 Ways to Reduce Misses Professor Randy H. Katz Computer Science 252 Spring 1996 RHK.S96 1 Review: Who Cares About the Memory Hierarchy? Processor Only Thus
More informationCuSha: Vertex-Centric Graph Processing on GPUs
CuSha: Vertex-Centric Graph Processing on GPUs Farzad Khorasani, Keval Vora, Rajiv Gupta, Laxmi N. Bhuyan HPDC Vancouver, Canada June, Motivation Graph processing Real world graphs are large & sparse Power
More informationWedge A New Frontier for Pull-based Graph Processing. Samuel Grossman and Christos Kozyrakis Platform Lab Retreat June 8, 2018
Wedge A New Frontier for Pull-based Graph Processing Samuel Grossman and Christos Kozyrakis Platform Lab Retreat June 8, 2018 Graph Processing Problems modelled as objects (vertices) and connections between
More informationScalable Algorithmic Techniques Decompositions & Mapping. Alexandre David
Scalable Algorithmic Techniques Decompositions & Mapping Alexandre David 1.2.05 adavid@cs.aau.dk Introduction Focus on data parallelism, scale with size. Task parallelism limited. Notion of scalability
More informationUnderstanding The Effects of Wrong-path Memory References on Processor Performance
Understanding The Effects of Wrong-path Memory References on Processor Performance Onur Mutlu Hyesoon Kim David N. Armstrong Yale N. Patt The University of Texas at Austin 2 Motivation Processors spend
More informationExploiting Locality in Graph Analytics through Hardware-Accelerated Traversal Scheduling
Appears in the Proceedings of the 51st Annual IEEE/ACM International Symposium on Microarchitecture (MICRO), 218 Exploiting Locality in Graph Analytics through Hardware-Accelerated Traversal Scheduling
More informationExecution-based Prediction Using Speculative Slices
Execution-based Prediction Using Speculative Slices Craig Zilles and Guri Sohi University of Wisconsin - Madison International Symposium on Computer Architecture July, 2001 The Problem Two major barriers
More informationEECS 470. Lecture 14 Advanced Caches. DEC Alpha. Fall Jon Beaumont
Lecture 14 Advanced Caches DEC Alpha Fall 2018 Instruction Cache BIU Jon Beaumont www.eecs.umich.edu/courses/eecs470/ Data Cache Slides developed in part by Profs. Austin, Brehob, Falsafi, Hill, Hoe, Lipasti,
More informationPipelining to Superscalar
Pipelining to Superscalar ECE/CS 752 Fall 207 Prof. Mikko H. Lipasti University of Wisconsin-Madison Pipelining to Superscalar Forecast Limits of pipelining The case for superscalar Instruction-level parallel
More informationc. What are the machine cycle times (in nanoseconds) of the non-pipelined and the pipelined implementations?
Brown University School of Engineering ENGN 164 Design of Computing Systems Professor Sherief Reda Homework 07. 140 points. Due Date: Monday May 12th in B&H 349 1. [30 points] Consider the non-pipelined
More informationEfficient Data Supply for Hardware Accelerators with Prefetching and Access/Execute Decoupling
Efficient Data Supply for Hardware Accelerators with Prefetching and Access/Execute Decoupling Tao Chen and G. Edward Suh Cornell University Ithaca, NY 14850, USA {tc466, gs272}@cornell.edu Abstract This
More informationTechniques for Efficient Processing in Runahead Execution Engines
Techniques for Efficient Processing in Runahead Execution Engines Onur Mutlu Hyesoon Kim Yale N. Patt Depment of Electrical and Computer Engineering University of Texas at Austin {onur,hyesoon,patt}@ece.utexas.edu
More informationComputer Architecture Computer Science & Engineering. Chapter 5. Memory Hierachy BK TP.HCM
Computer Architecture Computer Science & Engineering Chapter 5 Memory Hierachy Memory Technology Static RAM (SRAM) 0.5ns 2.5ns, $2000 $5000 per GB Dynamic RAM (DRAM) 50ns 70ns, $20 $75 per GB Magnetic
More informationPipelining: Overview. CPSC 252 Computer Organization Ellen Walker, Hiram College
Pipelining: Overview CPSC 252 Computer Organization Ellen Walker, Hiram College Pipelining the Wash Divide into 4 steps: Wash, Dry, Fold, Put Away Perform the steps in parallel Wash 1 Wash 2, Dry 1 Wash
More informationChasing Away RAts: Semantics and Evaluation for Relaxed Atomics on Heterogeneous Systems
Chasing Away RAts: Semantics and Evaluation for Relaxed Atomics on Heterogeneous Systems Matthew D. Sinclair *, Johnathan Alsop^, Sarita V. Adve + * University of Wisconsin-Madison ^ AMD Research + University
More informationProfiling: Understand Your Application
Profiling: Understand Your Application Michal Merta michal.merta@vsb.cz 1st of March 2018 Agenda Hardware events based sampling Some fundamental bottlenecks Overview of profiling tools perf tools Intel
More informationImproving Cache Performance. Reducing Misses. How To Reduce Misses? 3Cs Absolute Miss Rate. 1. Reduce the miss rate, Classifying Misses: 3 Cs
Improving Cache Performance 1. Reduce the miss rate, 2. Reduce the miss penalty, or 3. Reduce the time to hit in the. Reducing Misses Classifying Misses: 3 Cs! Compulsory The first access to a block is
More informationCS5460: Operating Systems Lecture 14: Memory Management (Chapter 8)
CS5460: Operating Systems Lecture 14: Memory Management (Chapter 8) Important from last time We re trying to build efficient virtual address spaces Why?? Virtual / physical translation is done by HW and
More informationChasing Away RAts: Semantics and Evaluation for Relaxed Atomics on Heterogeneous Systems
Chasing Away RAts: Semantics and Evaluation for Relaxed Atomics on Heterogeneous Systems Matthew D. Sinclair, Johnathan Alsop, Sarita V. Adve University of Illinois @ Urbana-Champaign hetero@cs.illinois.edu
More informationcsci 3411: Operating Systems
csci 3411: Operating Systems Memory Management II Gabriel Parmer Slides adapted from Silberschatz and West Each Process has its Own Little World Virtual Address Space Picture from The
More information3/12/2014. Single Cycle (Review) CSE 2021: Computer Organization. Single Cycle with Jump. Multi-Cycle Implementation. Why Multi-Cycle?
CSE 2021: Computer Organization Single Cycle (Review) Lecture-10b CPU Design : Pipelining-1 Overview, Datapath and control Shakil M. Khan 2 Single Cycle with Jump Multi-Cycle Implementation Instruction:
More informationLecture 29 Review" CPU time: the best metric" Be sure you understand CC, clock period" Common (and good) performance metrics"
Be sure you understand CC, clock period Lecture 29 Review Suggested reading: Everything Q1: D[8] = D[8] + RF[1] + RF[4] I[15]: Add R2, R1, R4 RF[1] = 4 I[16]: MOV R3, 8 RF[4] = 5 I[17]: Add R2, R2, R3
More informationComputer Architecture Spring 2016
Computer Architecture Spring 2016 Lecture 08: Caches III Shuai Wang Department of Computer Science and Technology Nanjing University Improve Cache Performance Average memory access time (AMAT): AMAT =
More information15-740/ Computer Architecture Lecture 20: Main Memory II. Prof. Onur Mutlu Carnegie Mellon University
15-740/18-740 Computer Architecture Lecture 20: Main Memory II Prof. Onur Mutlu Carnegie Mellon University Today SRAM vs. DRAM Interleaving/Banking DRAM Microarchitecture Memory controller Memory buses
More informationKartik Lakhotia, Rajgopal Kannan, Viktor Prasanna USENIX ATC 18
Accelerating PageRank using Partition-Centric Processing Kartik Lakhotia, Rajgopal Kannan, Viktor Prasanna USENIX ATC 18 Outline Introduction Partition-centric Processing Methodology Analytical Evaluation
More informationA Cache Hierarchy in a Computer System
A Cache Hierarchy in a Computer System Ideally one would desire an indefinitely large memory capacity such that any particular... word would be immediately available... We are... forced to recognize the
More informationReducing Miss Penalty: Read Priority over Write on Miss. Improving Cache Performance. Non-blocking Caches to reduce stalls on misses
Improving Cache Performance 1. Reduce the miss rate, 2. Reduce the miss penalty, or 3. Reduce the time to hit in the. Reducing Miss Penalty: Read Priority over Write on Miss Write buffers may offer RAW
More informationECE 587/687 Final Exam Solution
ECE 587/687 Final Exam Solution Time allowed: 80 minutes Total Points: 60 Points Scored: Name: Problem No. 1 (15 points) Consider a computer system with a two level cache hierarchy, consisting of split
More informationBest-Offset Hardware Prefetching
Best-Offset Hardware Prefetching Pierre Michaud March 2016 2 BOP: yet another data prefetcher Contribution: offset prefetcher with new mechanism for setting the prefetch offset dynamically - Improvement
More informationInstr. execution impl. view
Pipelining Sangyeun Cho Computer Science Department Instr. execution impl. view Single (long) cycle implementation Multi-cycle implementation Pipelined implementation Processing an instruction Fetch instruction
More informationMemory Consistency. Challenges. Program order Memory access order
Memory Consistency Memory Consistency Memory Consistency Reads and writes of the shared memory face consistency problem Need to achieve controlled consistency in memory events Shared memory behavior determined
More informationCaches and Prefetching
Caches and Prefetching www.csa.iisc.ac.in Acknowledgements Several of the slides in the deck are from Luis Ceze (Washington), Nima Horanmand (Stony Brook), Mark Hill, David Wood, Karu Sankaralingam (Wisconsin),
More informationA Case for Core-Assisted Bottleneck Acceleration in GPUs Enabling Flexible Data Compression with Assist Warps
A Case for Core-Assisted Bottleneck Acceleration in GPUs Enabling Flexible Data Compression with Assist Warps Nandita Vijaykumar Gennady Pekhimenko, Adwait Jog, Abhishek Bhowmick, Rachata Ausavarangnirun,
More informationMemory Hierarchy Basics
Computer Architecture A Quantitative Approach, Fifth Edition Chapter 2 Memory Hierarchy Design 1 Memory Hierarchy Basics Six basic cache optimizations: Larger block size Reduces compulsory misses Increases
More informationDonn Morrison Department of Computer Science. TDT4255 Memory hierarchies
TDT4255 Lecture 10: Memory hierarchies Donn Morrison Department of Computer Science 2 Outline Chapter 5 - Memory hierarchies (5.1-5.5) Temporal and spacial locality Hits and misses Direct-mapped, set associative,
More informationMemory Hierarchy Computing Systems & Performance MSc Informatics Eng. Memory Hierarchy (most slides are borrowed)
Computing Systems & Performance Memory Hierarchy MSc Informatics Eng. 2011/12 A.J.Proença Memory Hierarchy (most slides are borrowed) AJProença, Computer Systems & Performance, MEI, UMinho, 2011/12 1 2
More informationHigh Performance SMIPS Processor. Jonathan Eastep May 8, 2005
High Performance SMIPS Processor Jonathan Eastep May 8, 2005 Objectives: Build a baseline implementation: Single-issue, in-order, 6-stage pipeline Full bypassing ICache: blocking, direct mapped, 16KByte,
More informationLecture 16: Core Design. Today: basics of implementing a correct ooo core: register renaming, commit, LSQ, issue queue
Lecture 16: Core Design Today: basics of implementing a correct ooo core: register renaming, commit, LSQ, issue queue 1 The Alpha 21264 Out-of-Order Implementation Reorder Buffer (ROB) Branch prediction
More informationA POWER CHARACTERIZATION AND MANAGEMENT OF GPU GRAPH TRAVERSAL
A POWER CHARACTERIZATION AND MANAGEMENT OF GPU GRAPH TRAVERSAL ADAM MCLAUGHLIN *, INDRANI PAUL, JOSEPH GREATHOUSE, SRILATHA MANNE, AND SUDHKAHAR YALAMANCHILI * * GEORGIA INSTITUTE OF TECHNOLOGY AMD RESEARCH
More information