DIAMOND RINGS: ACKNOWLEDGED EVENT PROPAGATION IN MANY-CORE PROCESSORS
- Tamsyn Taylor
1 th August. DIAMOND RINGS: ACKNOWLEDGED EVENT PROPAGATION IN MANY-CORE PROCESSORS. Stefan Nürnberger, Randolf Rotta, Gabor Drescher, Daniel Danner, Jörg Nolte
2 ACKNOWLEDGED EVENT PROPAGATION. What does it do? Makes events observable in a networked system; makes sure events are globally observable; enforces ordering of events. What is it good for? Memory consistency; coherence protocols; atomic operations. How to implement it? Just use broadcast with acknowledgement... Section: Motivation.
3-13 EXAMPLE: READ FOR OWNERSHIP (animated diagram). Components: memory holding x, a directory ($Dir), and cores C ... Cn, each with a cache ($). Sequence: read (x) requests place copies of x in several caches; a core then issues rfo (x); invalidate (x) is sent for the cached copies of x.
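The read-for-ownership example can be sketched in a few lines of Python. This is only a toy model under stated assumptions: the `Directory` class, its method names, and the rule that ownership is granted only after every invalidation has been acknowledged are illustrative choices, not the protocol of any particular processor.

```python
class Directory:
    """Toy directory for a single cache line x (hypothetical model)."""

    def __init__(self):
        self.sharers = set()  # ids of caches holding a copy of x
        self.owner = None     # id of the cache with exclusive ownership

    def read(self, cache_id):
        """A read hands out a shared copy and ends any exclusive ownership."""
        self.owner = None
        self.sharers.add(cache_id)

    def _invalidate(self, cache_id):
        """Drop one cached copy and return one acknowledgement."""
        self.sharers.discard(cache_id)
        return 1

    def read_for_ownership(self, cache_id):
        """Invalidate all other copies; grant ownership once all acks arrived."""
        targets = sorted(self.sharers - {cache_id})
        acks = sum(self._invalidate(t) for t in targets)
        if acks != len(targets):
            raise RuntimeError("missing acknowledgement")
        self.sharers = {cache_id}
        self.owner = cache_id
        return acks


if __name__ == "__main__":
    directory = Directory()
    for core in range(4):
        directory.read(core)            # four cores read x
    print(directory.read_for_ownership(0), directory.sharers)
```

The point of the sketch is the wait for acknowledgements: the invalidation is an event that must be observed by every sharer before the requesting core may write.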
14 OUTLINE. 1. Throughput & Latency of Broadcast. 2. The Diamond Ring Topology. 3. Evaluation. Section: Throughput & Latency of Broadcast.
15 THROUGHPUT & LATENCY. Latency: time from sending out a message to reception of the acknowledgement; determined by the longest path (#hops + processing at each node); lower is better. Throughput: number of messages processed within a fixed time span; determined by the node with maximum overhead (i.e. the bottleneck); requires pipelining of messages (latency hiding); higher is better.
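The two metrics can be written down as a small cost model. The sketch below assumes uniform per-hop and per-node processing times and perfect pipelining; the function names and the example numbers are hypothetical and only illustrate which quantity each metric depends on.

```python
def broadcast_latency(longest_path_hops, hop_time, processing_time):
    """Time until the acknowledgement arrives, assuming uniform per-hop costs."""
    return longest_path_hops * (hop_time + processing_time)


def broadcast_throughput(node_overheads):
    """Broadcasts per time unit under ideal pipelining.

    node_overheads maps a node name to the time that node spends per
    broadcast; the busiest node (the bottleneck) limits the rate.
    """
    return 1.0 / max(node_overheads.values())


if __name__ == "__main__":
    # Made-up numbers in nanoseconds, for illustration only.
    print(broadcast_latency(longest_path_hops=6, hop_time=100, processing_time=50))
    print(broadcast_throughput({"root": 400, "inner": 300, "leaf": 100}))
```

Shortening the longest path improves only the first function, while reducing the work of the busiest node improves only the second, which is why a topology has to balance both.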
16-25 ACKNOWLEDGED BROADCAST USING BALANCED TREES (animated diagram)
26-34 ACKNOWLEDGED BROADCAST USING SKEWED TREES (animated diagram)
35-46 ACKNOWLEDGED BROADCAST USING RINGS (animated diagram)
47 FORWARD PROCESS ACK. Message forwarding as acknowledgement: possible in ring structures; halves the number of sent messages (less network contention); may increase latency (processing time at each node). Ring structure: 1. Receive message. 2. Process message. 3. Forward message (= ack). Tree structure: 1. Receive message. 2. Forward message (except leaves). 3. Process message. 4. Receive ack (except leaves). 5. Forward ack. The added latency is not an issue if only message reception needs acknowledgement.
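The message counts behind "halves the number of sent messages" can be checked with a short simulation. This is a sketch under assumptions: the ring is unidirectional, the tree is a k-ary tree stored heap-style (children of node i are k*i+1 ... k*i+k), and the function names are hypothetical.

```python
def ring_messages(n):
    """Count sends for one acknowledged broadcast on a unidirectional ring.

    Every node forwards once; the forward that reaches the initiator again
    serves as the acknowledgement.
    """
    sends = 0
    node = 0
    while True:
        node = (node + 1) % n
        sends += 1
        if node == 0:
            return sends


def tree_messages(n, k):
    """Count sends on a k-ary tree with n nodes: one message and one ack per edge."""
    sends = 0

    def visit(i):
        nonlocal sends
        for child in range(k * i + 1, min(k * i + k, n - 1) + 1):
            sends += 1      # forward message to child
            visit(child)
            sends += 1      # child returns its acknowledgement

    visit(0)
    return sends


if __name__ == "__main__":
    print(ring_messages(15), tree_messages(15, 2))
```

For 15 nodes the ring needs 15 sends while the binary tree needs 28, i.e. 2(n - 1), at the price of a path length that grows linearly with n on the ring.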
48 OUTLINE. 1. Throughput & Latency of Broadcast. 2. The Diamond Ring Topology. 3. Evaluation. Section: The Diamond Ring Topology.
49-50 THE DIAMOND RING TOPOLOGY. Combine ring and balanced tree: logarithmic path length for low latency; forwarding is acknowledgement; parallel message propagation; computable topology. Diamond ring: directed graph $D_k^l$, where $k$ is the arity of tree nodes and $l$ is the number of levels of tree scattering. Based on a balanced tree $B_k^l$, mirrored at the leaves and closed to a ring at the root. Node count: $|D_k^l| = \frac{(k+1)k^l - (k+1)}{k-1}$, with the recurrence $|D_k^{l+1}| = |D_k^l| + k^l + k^{l+1}$.
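The construction (balanced tree, mirrored at the leaves, closed to a ring at the root) can be made concrete with a short sketch. The node naming `(layer, position)` and the function names are assumptions made for illustration; the closed-form node count is the one stated on the slide, and the code requires k >= 2 and l >= 1.

```python
def diamond_ring(k, l):
    """Return successor lists of a diamond ring with arity k and l scatter levels.

    Nodes are named (layer, position). Layers 0..l form the scattering tree,
    layers l+1..2l-1 mirror it, and the last gather layer points back to the
    root (0, 0), which closes the ring.
    """
    succ = {}
    for layer in range(2 * l):
        size = k ** (layer if layer <= l else 2 * l - layer)
        for pos in range(size):
            if layer < l:
                # scatter half: fan out to k successors
                succ[(layer, pos)] = [(layer + 1, pos * k + c) for c in range(k)]
            else:
                # gather half: k neighbouring nodes share one successor
                succ[(layer, pos)] = [((layer + 1) % (2 * l), pos // k)]
    return succ


def node_count(k, l):
    """Closed form for the number of nodes (needs k >= 2)."""
    return ((k + 1) * k**l - (k + 1)) // (k - 1)


if __name__ == "__main__":
    graph = diamond_ring(2, 2)
    print(len(graph), node_count(2, 2))
    print(node_count(2, 3) == node_count(2, 2) + 2**2 + 2**3)
```

For k = 2 and l = 2 the layers have 1, 2, 4 and 2 nodes, so both the constructed graph and the closed form give 9 nodes.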
51-52 THE PERFECT DIAMOND RING. D - diamond ring with nodes (diagram; stages in order: root, scatter, center, gather, root)
53-54 THE PERFECT DIAMOND RING. D - diamond ring with nodes + (no bottleneck version) (diagram; stages in order: root, scatter, center, gather)
55-57 SOME MORE EXAMPLES. D - diamond ring with nodes (three further diagrams)
58-66 ACKNOWLEDGED BROADCAST USING DIAMOND RINGS (animated diagram)
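One way to picture an acknowledged broadcast on a diamond ring is a round-based simulation. The sketch below rests on assumptions that the slides do not spell out: a node forwards only after it has heard from all of its predecessors, every hop takes one round, and the broadcast counts as acknowledged when the root has heard from all of its predecessors. All names are hypothetical.

```python
from collections import defaultdict


def build_diamond_ring(k, l):
    """Successor lists for arity k and l scatter levels; nodes are (layer, pos)."""
    succ = {}
    for layer in range(2 * l):
        size = k ** (layer if layer <= l else 2 * l - layer)
        for pos in range(size):
            if layer < l:
                succ[(layer, pos)] = [(layer + 1, pos * k + c) for c in range(k)]
            else:
                succ[(layer, pos)] = [((layer + 1) % (2 * l), pos // k)]
    return succ


def acknowledged_broadcast(k, l):
    """Return (acknowledged, messages_sent, rounds) for one broadcast from the root."""
    succ = build_diamond_ring(k, l)
    indegree = defaultdict(int)
    for targets in succ.values():
        for target in targets:
            indegree[target] += 1

    root = (0, 0)
    received = defaultdict(int)
    messages = rounds = 0
    frontier = [root]
    while frontier:
        rounds += 1
        next_frontier = []
        for node in frontier:
            for target in succ[node]:
                messages += 1
                received[target] += 1
                # forward only after all predecessors have been heard
                if received[target] == indegree[target] and target != root:
                    next_frontier.append(target)
        frontier = next_frontier
    return received[root] == indegree[root], messages, rounds


if __name__ == "__main__":
    print(acknowledged_broadcast(2, 2))
```

Under these assumptions a ring with k = 2 and l = 2 completes in 2l = 4 rounds with 12 messages, and no separate acknowledgement messages are needed because the gather half carries them implicitly.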
67-68 DEALING WITH ODD NODE COUNTS. D - diamond ring with nodes (- nodes) (diagram; stages in order: root, scatter, center, gather, root)
69-70 DEALING WITH ODD NODE COUNTS (continued). D - diamond ring with nodes (+ nodes) (diagram; stages in order: root, scatter, center, gather, root)
71-72 COMPARISON TO BALANCED TREES. Throughput and latency: latency is reduced due to the shorter longest path; throughput is increased since nodes have fewer communication partners. Contention on the network is reduced since fewer messages are sent. Table (columns: Balanced Tree, Diamond Ring, Ring). Longest Path: log k (n), log k (n), n. Max. Overhead: (k + ) k + . Messages sent: (n ) k k+ n+ n.
73 OUTLINE. 1. Throughput & Latency of Broadcast. 2. The Diamond Ring Topology. 3. Evaluation. Section: Evaluation.
74 EVALUATION OF DIAMOND RINGS. Hypothesis: acknowledged broadcasts using diamond rings should have 1. lower latency and 2. higher throughput than balanced trees. Benchmark setup: custom active message framework; messages in shared memory; topologies: Balanced Tree (BT), Diamond Ring (DR), Sequenced Diamond Ring (SDR); three different evaluation platforms.
75 EVALUATION PLATFORMS. EZ-Chip Tilera TILE-Gx: (in-order), Low-Latency Mesh Network (UDN). Intel Xeon E v: Sockets, Cores, (out-of-order), Slotted Rings, QPI between Sockets. Intel Xeon Phi P: Cores, (in-order), Slotted Ring Network.
76 EZ-CHIP TILERA TILE-GX (plots for BT, DR, SDR, one panel per arity: median latency [µs] vs. number of cores; median events per µs vs. number of pipelined broadcasts)
77 INTEL XEON V (plots for BT, DR, SDR, one panel per arity: median latency [µs] vs. number of hardware threads; median events per µs vs. number of pipelined broadcasts)
78 INTEL XEON PHI P (plots for BT, DR, SDR, one panel per arity: median latency [µs] vs. number of hardware threads; median events per µs vs. number of pipelined broadcasts)
79-80 RESULTS OVERVIEW (charts for BT, DR, SDR on TILE Gx, Xeon E v, and XeonPhi P: median latency [µs]; max median throughput [broadcasts per µs])
81 SUMMARY. Acknowledged event propagation is very important in consistency management. Throughput and latency require a trade-off. Diamond Rings offer a better trade-off than balanced trees and are acknowledged broadcast's best friend. Thank you for your attention! Questions? This work was supported by the German Research Foundation (DFG) under grant no. NO /- and SCHR /-
More informationDesigning Optimized MPI Broadcast and Allreduce for Many Integrated Core (MIC) InfiniBand Clusters
Designing Optimized MPI Broadcast and Allreduce for Many Integrated Core (MIC) InfiniBand Clusters K. Kandalla, A. Venkatesh, K. Hamidouche, S. Potluri, D. Bureddy and D. K. Panda Presented by Dr. Xiaoyi
More informationNetwork-on-chip (NOC) Topologies
Network-on-chip (NOC) Topologies 1 Network Topology Static arrangement of channels and nodes in an interconnection network The roads over which packets travel Topology chosen based on cost and performance
More informationSymmetrical Buffered Clock-Tree Synthesis with Supply-Voltage Alignment
Symmetrical Buffered Clock-Tree Synthesis with Supply-Voltage Alignment Xin-Wei Shih, Tzu-Hsuan Hsu, Hsu-Chieh Lee, Yao-Wen Chang, Kai-Yuan Chao 2013.01.24 1 Outline 2 Clock Network Synthesis Clock network
More informationInterconnection Networks
Lecture 17: Interconnection Networks Parallel Computer Architecture and Programming A comment on web site comments It is okay to make a comment on a slide/topic that has already been commented on. In fact
More informationIntel Many Integrated Core (MIC) Matt Kelly & Ryan Rawlins
Intel Many Integrated Core (MIC) Matt Kelly & Ryan Rawlins Outline History & Motivation Architecture Core architecture Network Topology Memory hierarchy Brief comparison to GPU & Tilera Programming Applications
More informationOptimization of Lattice QCD with CG and multi-shift CG on Intel Xeon Phi Coprocessor
Optimization of Lattice QCD with CG and multi-shift CG on Intel Xeon Phi Coprocessor Intel K. K. E-mail: hirokazu.kobayashi@intel.com Yoshifumi Nakamura RIKEN AICS E-mail: nakamura@riken.jp Shinji Takeda
More informationMemory Performance and Cache Coherency Effects on an Intel Nehalem Multiprocessor System
Center for Information ervices and High Performance Computing (ZIH) Memory Performance and Cache Coherency Effects on an Intel Nehalem Multiprocessor ystem Parallel Architectures and Compiler Technologies
More informationTopologies. Maurizio Palesi. Maurizio Palesi 1
Topologies Maurizio Palesi Maurizio Palesi 1 Network Topology Static arrangement of channels and nodes in an interconnection network The roads over which packets travel Topology chosen based on cost and
More informationMeet in the Middle: Leveraging Optical Interconnection Opportunities in Chip Multi Processors
Meet in the Middle: Leveraging Optical Interconnection Opportunities in Chip Multi Processors Sandro Bartolini* Department of Information Engineering, University of Siena, Italy bartolini@dii.unisi.it
More informationInterconnection Networks
Lecture 18: Interconnection Networks Parallel Computer Architecture and Programming CMU 15-418/15-618, Spring 2015 Credit: many of these slides were created by Michael Papamichael This lecture is partially
More informationContents. Preface xvii Acknowledgments. CHAPTER 1 Introduction to Parallel Computing 1. CHAPTER 2 Parallel Programming Platforms 11
Preface xvii Acknowledgments xix CHAPTER 1 Introduction to Parallel Computing 1 1.1 Motivating Parallelism 2 1.1.1 The Computational Power Argument from Transistors to FLOPS 2 1.1.2 The Memory/Disk Speed
More informationDown selecting suitable manycore technologies for the ELT AO RTC. David Barr, Alastair Basden, Nigel Dipper and Noah Schwartz
Down selecting suitable manycore technologies for the ELT AO RTC David Barr, Alastair Basden, Nigel Dipper and Noah Schwartz GFLOPS RTC for AO workshop 27/01/2016 AO RTC Complexity 1.E+05 1.E+04 E-ELT
More informationEvaluating On-Node GPU Interconnects for Deep Learning Workloads
Evaluating On-Node GPU Interconnects for Deep Learning Workloads NATHAN TALLENT, NITIN GAWANDE, CHARLES SIEGEL ABHINAV VISHNU, ADOLFY HOISIE Pacific Northwest National Lab PMBS 217 (@ SC) November 13,
More informationInterconnection Networks: Topology. Prof. Natalie Enright Jerger
Interconnection Networks: Topology Prof. Natalie Enright Jerger Topology Overview Definition: determines arrangement of channels and nodes in network Analogous to road map Often first step in network design
More informationMulticore Hardware and Parallelism
Multicore Hardware and Parallelism Minsoo Ryu Department of Computer Science and Engineering 2 1 Advent of Multicore Hardware 2 Multicore Processors 3 Amdahl s Law 4 Parallelism in Hardware 5 Q & A 2 3
More informationA Combined Semi-Pipelined Query Processing Architecture For Distributed Full-Text Retrieval
A Combined Semi-Pipelined Query Processing Architecture For Distributed Full-Text Retrieval Simon Jonassen and Svein Erik Bratsberg Department of Computer and Information Science Norwegian University of
More informationViper: Communication-Layer Determinism and Scaling in Low-Latency Stream Processing
Viper: Communication-Layer Determinism and Scaling in Low-Latency Stream Processing Ivan Walulya, Yiannis Nikolakopoulos, Vincenzo Gulisano Marina Papatriantafilou and Philippas Tsigas Auto-DaSP 2017 Chalmers
More informationIntroduction: Modern computer architecture. The stored program computer and its inherent bottlenecks Multi- and manycore chips and nodes
Introduction: Modern computer architecture The stored program computer and its inherent bottlenecks Multi- and manycore chips and nodes Motivation: Multi-Cores where and why Introduction: Moore s law Intel
More informationSIGNET: NETWORK-ON-CHIP FILTERING FOR COARSE VECTOR DIRECTORIES. Natalie Enright Jerger University of Toronto
SIGNET: NETWORK-ON-CHIP FILTERING FOR COARSE VECTOR DIRECTORIES University of Toronto Interaction of Coherence and Network 2 Cache coherence protocol drives network-on-chip traffic Scalable coherence protocols
More informationIntroduction to Multiprocessors (Part I) Prof. Cristina Silvano Politecnico di Milano
Introduction to Multiprocessors (Part I) Prof. Cristina Silvano Politecnico di Milano Outline Key issues to design multiprocessors Interconnection network Centralized shared-memory architectures Distributed
More informationFuture of Interconnect Fabric A Contrarian View. Shekhar Borkar June 13, 2010 Intel Corp. 1
Future of Interconnect Fabric A ontrarian View Shekhar Borkar June 13, 2010 Intel orp. 1 Outline Evolution of interconnect fabric On die network challenges Some simple contrarian proposals Evaluation and
More informationSpecial Course on Computer Architecture
Special Course on Computer Architecture #9 Simulation of Multi-Processors Hiroki Matsutani and Hideharu Amano Outline: Simulation of Multi-Processors Background [10min] Recent multi-core and many-core
More informationNon-uniform memory access (NUMA)
Non-uniform memory access (NUMA) Memory access between processor core to main memory is not uniform. Memory resides in separate regions called NUMA domains. For highest performance, cores should only access
More informationTopologies. Maurizio Palesi. Maurizio Palesi 1
Topologies Maurizio Palesi Maurizio Palesi 1 Network Topology Static arrangement of channels and nodes in an interconnection network The roads over which packets travel Topology chosen based on cost and
More informationLecture: Interconnection Networks
Lecture: Interconnection Networks Topics: Router microarchitecture, topologies Final exam next Tuesday: same rules as the first midterm 1 Packets/Flits A message is broken into multiple packets (each packet
More informationComputer Architecture
Jens Teubner Computer Architecture Summer 2016 1 Computer Architecture Jens Teubner, TU Dortmund jens.teubner@cs.tu-dortmund.de Summer 2016 Jens Teubner Computer Architecture Summer 2016 83 Part III Multi-Core
More informationNoC Simulation in Heterogeneous Architectures for PGAS Programming Model
NoC Simulation in Heterogeneous Architectures for PGAS Programming Model Sascha Roloff, Andreas Weichslgartner, Frank Hannig, Jürgen Teich University of Erlangen-Nuremberg, Germany Jan Heißwolf Karlsruhe
More informationOverview. Processor organizations Types of parallel machines. Real machines
Course Outline Introduction in algorithms and applications Parallel machines and architectures Overview of parallel machines, trends in top-500, clusters, DAS Programming methods, languages, and environments
More informationLecture 12: Interconnection Networks. Topics: communication latency, centralized and decentralized switches, routing, deadlocks (Appendix E)
Lecture 12: Interconnection Networks Topics: communication latency, centralized and decentralized switches, routing, deadlocks (Appendix E) 1 Topologies Internet topologies are not very regular they grew
More informationEARLY EVALUATION OF THE CRAY XC40 SYSTEM THETA
EARLY EVALUATION OF THE CRAY XC40 SYSTEM THETA SUDHEER CHUNDURI, SCOTT PARKER, KEVIN HARMS, VITALI MOROZOV, CHRIS KNIGHT, KALYAN KUMARAN Performance Engineering Group Argonne Leadership Computing Facility
More informationCS/COE1541: Intro. to Computer Architecture
CS/COE1541: Intro. to Computer Architecture Multiprocessors Sangyeun Cho Computer Science Department Tilera TILE64 IBM BlueGene/L nvidia GPGPU Intel Core 2 Duo 2 Why multiprocessors? For improved latency
More informationAchieving Lightweight Multicast in Asynchronous Networks-on-Chip Using Local Speculation
Achieving Lightweight Multicast in Asynchronous Networks-on-Chip Using Local Speculation Kshitij Bhardwaj Dept. of Computer Science Columbia University Steven M. Nowick 2016 ACM/IEEE Design Automation
More informationAdvanced Parallel Programming I
Advanced Parallel Programming I Alexander Leutgeb, RISC Software GmbH RISC Software GmbH Johannes Kepler University Linz 2016 22.09.2016 1 Levels of Parallelism RISC Software GmbH Johannes Kepler University
More informationMPI Performance Analysis and Optimization on Tile64/Maestro
MPI Performance Analysis and Optimization on Tile64/Maestro Mikyung Kang, Eunhui Park, Minkyoung Cho, Jinwoo Suh, Dong-In Kang, and Stephen P. Crago USC/ISI-East July 19~23, 2009 Overview Background MPI
More informationPerformance study example ( 5.3) Performance study example
erformance study example ( 5.3) Coherence misses: - True sharing misses - Write to a shared block - ead an invalid block - False sharing misses - ead an unmodified word in an invalidated block CI for commercial
More informationParallel Processing SIMD, Vector and GPU s cont.
Parallel Processing SIMD, Vector and GPU s cont. EECS4201 Fall 2016 York University 1 Multithreading First, we start with multithreading Multithreading is used in GPU s 2 1 Thread Level Parallelism ILP
More informationParallelism in Hardware
Parallelism in Hardware Minsoo Ryu Department of Computer Science and Engineering 2 1 Advent of Multicore Hardware 2 Multicore Processors 3 Amdahl s Law 4 Parallelism in Hardware 5 Q & A 2 3 Moore s Law
More informationTDT Appendix E Interconnection Networks
TDT 4260 Appendix E Interconnection Networks Review Advantages of a snooping coherency protocol? Disadvantages of a snooping coherency protocol? Advantages of a directory coherency protocol? Disadvantages
More informationMultiple Issue and Static Scheduling. Multiple Issue. MSc Informatics Eng. Beyond Instruction-Level Parallelism
Computing Systems & Performance Beyond Instruction-Level Parallelism MSc Informatics Eng. 2012/13 A.J.Proença From ILP to Multithreading and Shared Cache (most slides are borrowed) When exploiting ILP,
More informationCSE502: Computer Architecture CSE 502: Computer Architecture
CSE 502: Computer Architecture Shared-Memory Multi-Processors Shared-Memory Multiprocessors Multiple threads use shared memory (address space) SysV Shared Memory or Threads in software Communication implicit
More informationMultiprocessor Cache Coherence. Chapter 5. Memory System is Coherent If... From ILP to TLP. Enforcing Cache Coherence. Multiprocessor Types
Chapter 5 Multiprocessor Cache Coherence Thread-Level Parallelism 1: read 2: read 3: write??? 1 4 From ILP to TLP Memory System is Coherent If... ILP became inefficient in terms of Power consumption Silicon
More informationLecture 10: Cache Coherence. Parallel Computer Architecture and Programming CMU / 清华 大学, Summer 2017
Lecture 10: Cache Coherence Parallel Computer Architecture and Programming CMU / 清华 大学, Summer 2017 Course schedule (where we are) Week 1: How parallel hardware works: types of parallel execution in modern
More informationOutline 1 Motivation 2 Theory of a non-blocking benchmark 3 The benchmark and results 4 Future work
Using Non-blocking Operations in HPC to Reduce Execution Times David Buettner, Julian Kunkel, Thomas Ludwig Euro PVM/MPI September 8th, 2009 Outline 1 Motivation 2 Theory of a non-blocking benchmark 3
More informationNeural Cache: Bit-Serial In-Cache Acceleration of Deep Neural Networks
Neural Cache: Bit-Serial In-Cache Acceleration of Deep Neural Networks Charles Eckert Xiaowei Wang Jingcheng Wang Arun Subramaniyan Ravi Iyer Dennis Sylvester David Blaauw Reetuparna Das M-Bits Research
More informationLecture 25: Multiprocessors
Lecture 25: Multiprocessors Today s topics: Virtual memory wrap-up Snooping-based cache coherence protocol Directory-based cache coherence protocol Synchronization 1 TLB and Cache Is the cache indexed
More informationVorlesung Kommunikationsnetze Research Topics: QoS in VANETs
Vorlesung Kommunikationsnetze Research Topics: QoS in VANETs Prof. Dr. H. P. Großmann mit B. Wiegel sowie A. Schmeiser und M. Rabel Sommersemester 2009 Institut für Organisation und Management von Informationssystemen
More informationSwizzle Switch: A Self-Arbitrating High-Radix Crossbar for NoC Systems
1 Swizzle Switch: A Self-Arbitrating High-Radix Crossbar for NoC Systems Ronald Dreslinski, Korey Sewell, Thomas Manville, Sudhir Satpathy, Nathaniel Pinckney, Geoff Blake, Michael Cieslak, Reetuparna
More informationBuilding blocks for high performance DWH Computing
Building blocks for high performance DWH Computing Wolfgang Höfer, Nuremberg, 18 st November 2010 Copyright 2010 Fujitsu Technology Solutions Current trends (1) Intel/AMD CPU performance is growing fast
More informationFault-adaptive routing
Fault-adaptive routing Presenter: Zaheer Ahmed Supervisor: Adan Kohler Reviewers: Prof. Dr. M. Radetzki Prof. Dr. H.-J. Wunderlich Date: 30-June-2008 7/2/2009 Agenda Motivation Fundamentals of Routing
More informationThe Impact of Optics on HPC System Interconnects
The Impact of Optics on HPC System Interconnects Mike Parker and Steve Scott Hot Interconnects 2009 Manhattan, NYC Will cost-effective optics fundamentally change the landscape of networking? Yes. Changes
More informationParallel Architectures
Parallel Architectures Part 1: The rise of parallel machines Intel Core i7 4 CPU cores 2 hardware thread per core (8 cores ) Lab Cluster Intel Xeon 4/10/16/18 CPU cores 2 hardware thread per core (8/20/32/36
More informationCapability Models for Manycore Memory Systems: A Case-Study with Xeon Phi KNL
SABELA RAMOS, TORSTEN HOEFLER Capability Models for Manycore Memory Systems: A Case-Study with Xeon Phi KNL spcl.inf.ethz.ch Microarchitectures are becoming more and more complex CPU L1 CPU L1 CPU L1 CPU
More informationDesigning Next-Generation Data- Centers with Advanced Communication Protocols and Systems Services. Presented by: Jitong Chen
Designing Next-Generation Data- Centers with Advanced Communication Protocols and Systems Services Presented by: Jitong Chen Outline Architecture of Web-based Data Center Three-Stage framework to benefit
More informationExploiting InfiniBand and GPUDirect Technology for High Performance Collectives on GPU Clusters
Exploiting InfiniBand and Direct Technology for High Performance Collectives on Clusters Ching-Hsiang Chu chu.368@osu.edu Department of Computer Science and Engineering The Ohio State University OSU Booth
More informationEN164: Design of Computing Systems Lecture 34: Misc Multi-cores and Multi-processors
EN164: Design of Computing Systems Lecture 34: Misc Multi-cores and Multi-processors Professor Sherief Reda http://scale.engin.brown.edu Electrical Sciences and Computer Engineering School of Engineering
More informationDell PowerEdge 11 th Generation Servers: R810, R910, and M910 Memory Guidance
Dell PowerEdge 11 th Generation Servers: R810, R910, and M910 Memory Guidance A Dell Technical White Paper Dell Product Group Armando Acosta and James Pledge THIS WHITE PAPER IS FOR INFORMATIONAL PURPOSES
More informationMesh Networks
Institute of Computer Science Department of Distributed Systems Prof. Dr.-Ing. P. Tran-Gia Decentralized Bandwidth Management in IEEE 802.16 Mesh Networks www3.informatik.uni-wuerzburg.de Motivation IEEE
More informationTile Processor (TILEPro64)
Tile Processor Case Study of Contemporary Multicore Fall 2010 Agarwal 6.173 1 Tile Processor (TILEPro64) Performance # of cores On-chip cache (MB) Cache coherency Operations (16/32-bit BOPS) On chip bandwidth
More informationSystems Infrastructure for Data Science. Web Science Group Uni Freiburg WS 2014/15
Systems Infrastructure for Data Science Web Science Group Uni Freiburg WS 2014/15 Lecture X: Parallel Databases Topics Motivation and Goals Architectures Data placement Query processing Load balancing
More informationCOSC 6385 Computer Architecture - Data Level Parallelism (III) The Intel Larrabee, Intel Xeon Phi and IBM Cell processors
COSC 6385 Computer Architecture - Data Level Parallelism (III) The Intel Larrabee, Intel Xeon Phi and IBM Cell processors Edgar Gabriel Fall 2018 References Intel Larrabee: [1] L. Seiler, D. Carmean, E.
More informationWORKLOAD CHARACTERIZATION OF INTERACTIVE CLOUD SERVICES BIG AND SMALL SERVER PLATFORMS
WORKLOAD CHARACTERIZATION OF INTERACTIVE CLOUD SERVICES ON BIG AND SMALL SERVER PLATFORMS Shuang Chen*, Shay Galon**, Christina Delimitrou*, Srilatha Manne**, and José Martínez* *Cornell University **Cavium
More informationNovel Hardware Architecture for Fast Address Lookups
Novel Hardware Architecture for Fast Address Lookups Pronita Mehrotra Paul D. Franzon Department of Electrical and Computer Engineering North Carolina State University {pmehrot,paulf}@eos.ncsu.edu This
More informationThe Tofu Interconnect 2
The Tofu Interconnect 2 Yuichiro Ajima, Tomohiro Inoue, Shinya Hiramoto, Shun Ando, Masahiro Maeda, Takahide Yoshikawa, Koji Hosoe, and Toshiyuki Shimizu Fujitsu Limited Introduction Tofu interconnect
More informationEXTENDING AN ASYNCHRONOUS MESSAGING LIBRARY USING AN RDMA-ENABLED INTERCONNECT. Konstantinos Alexopoulos ECE NTUA CSLab
EXTENDING AN ASYNCHRONOUS MESSAGING LIBRARY USING AN RDMA-ENABLED INTERCONNECT Konstantinos Alexopoulos ECE NTUA CSLab MOTIVATION HPC, Multi-node & Heterogeneous Systems Communication with low latency
More informationUsing Time Division Multiplexing to support Real-time Networking on Ethernet
Using Time Division Multiplexing to support Real-time Networking on Ethernet Hariprasad Sampathkumar 25 th January 2005 Master s Thesis Defense Committee Dr. Douglas Niehaus, Chair Dr. Jeremiah James,
More informationReducing CPU and network overhead for small I/O requests in network storage protocols over raw Ethernet
Reducing CPU and network overhead for small I/O requests in network storage protocols over raw Ethernet Pilar González-Férez and Angelos Bilas 31 th International Conference on Massive Storage Systems
More informationEITF20: Computer Architecture Part 5.2.1: IO and MultiProcessor
EITF20: Computer Architecture Part 5.2.1: IO and MultiProcessor Liang Liu liang.liu@eit.lth.se 1 Outline Reiteration I/O MultiProcessor Summary 2 Virtual memory benifits Using physical memory efficiently
More informationInterconnection Networks: Flow Control. Prof. Natalie Enright Jerger
Interconnection Networks: Flow Control Prof. Natalie Enright Jerger Switching/Flow Control Overview Topology: determines connectivity of network Routing: determines paths through network Flow Control:
More informationChapter 9 Multiprocessors
ECE200 Computer Organization Chapter 9 Multiprocessors David H. lbonesi and the University of Rochester Henk Corporaal, TU Eindhoven, Netherlands Jari Nurmi, Tampere University of Technology, Finland University
More informationUNIVERSITY OF MASSACHUSETTS Dept. of Electrical & Computer Engineering. Computer Architecture ECE 568
UNIVERSITY OF MASSACHUSETTS Dept. of Electrical & Computer Engineering Computer Architecture ECE 568 Part 6 Input/Output Israel Koren ECE568/Koren Part.6. CPU performance keeps increasing 26 72-core Xeon
More informationMultiprocessors and Thread-Level Parallelism. Department of Electrical & Electronics Engineering, Amrita School of Engineering
Multiprocessors and Thread-Level Parallelism Multithreading Increasing performance by ILP has the great advantage that it is reasonable transparent to the programmer, ILP can be quite limited or hard to
More informationInterconnection Networks
Interconnection Networks Interconnection Networks Introduction How to connect individual devices together into a group of communicating devices? Device: r r r Component within a computer Single computer
More informationComputer Architecture Spring 2016
Computer Architecture Spring 2016 Lecture 19: Multiprocessing Shuai Wang Department of Computer Science and Technology Nanjing University [Slides adapted from CSE 502 Stony Brook University] Getting More
More informationDatabase Workload. from additional misses in this already memory-intensive databases? interference could be a problem) Key question:
Database Workload + Low throughput (0.8 IPC on an 8-wide superscalar. 1/4 of SPEC) + Naturally threaded (and widely used) application - Already high cache miss rates on a single-threaded machine (destructive
More informationEECS 570 Final Exam - SOLUTIONS Winter 2015