NUMA Profiling for Dynamic Dataflow Applications
Manuel Selva, Lionel Morel, Kevin Marquet
CITI - INRIA SOCRATE, Université de Lyon
September 29th, 2015
Outline: Introduction, Motivation, Profiling DF Programs, CPU Profiling, Memory Profiling
CMPs are everywhere
- Intel Nehalem - 4 cores, 2009
- Kalray MPPA - ... cores
- Samsung Exynos - 2 x 4 cores
On the headlines
- David P.: The Trouble with Multicore
- Herb S.: Welcome to the Jungle
- Ed L.: The Problem with Threads
- Timothy R.: Mind the Gap...
- David P.: The Hail Mary of Programming
But programming them is hard...
Dataflow
[Figure: example dataflow graph with actors Parser, Text.Y, Text.U, Text.V, Mot.Y, Mot.U, Mot.V, Merger and Display]
- Actors exchanging data only through FIFO channels
- Different forms of parallelism: task, pipeline, data
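The FIFO-only communication rule can be illustrated with a small bounded channel. This is a minimal sketch, assuming one producer and one consumer running in the same thread (no synchronization); the names `fifo_t`, `fifo_push`, `fifo_pop` and the capacity are hypothetical and not taken from any particular dataflow runtime.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define FIFO_CAPACITY 512 /* hypothetical size, chosen for illustration */

/* Bounded FIFO channel linking one producer actor to one consumer actor. */
typedef struct {
    int data[FIFO_CAPACITY];
    size_t head;  /* index of the oldest token */
    size_t count; /* number of tokens currently stored */
} fifo_t;

/* Append one token; returns false when the channel is full. */
static bool fifo_push(fifo_t *f, int token) {
    assert(f != NULL);
    if (f->count == FIFO_CAPACITY) return false;
    f->data[(f->head + f->count) % FIFO_CAPACITY] = token;
    f->count++;
    return true;
}

/* Remove the oldest token; returns false when the channel is empty. */
static bool fifo_pop(fifo_t *f, int *token) {
    assert(f != NULL && token != NULL);
    if (f->count == 0) return false;
    *token = f->data[f->head];
    f->head = (f->head + 1) % FIFO_CAPACITY;
    f->count--;
    return true;
}
```

An actor would call `fifo_pop` on its input channels and `fifo_push` on its output channels, and would share no other state with its neighbours. A channel crossing threads would additionally need atomic or locked index updates, which this sketch leaves out.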
Dataflow Applications Examples
- Medical image processing [Albers2012]
- Software Defined Radio [Dardaillon2014]
- Video Decoding [Lucarz09]
The setting
The question: Do DF applications scale? If not, why?
Does it scale?
[Figure: speedup vs. single-core against the number of cores. HEVC decoding, different inputs, 200 frames, 33 actors]
What are the reasons for that?
- Are the applications well written? Blame the app designer.
- Are the runtimes well implemented? Blame the runtime designer.
- Is the model of computation really the right one? Is the programmer tricked into some idiosyncrasies? Blame the language designer.
Problem Statement
How to identify and understand performance bottlenecks in dataflow programs?
Contribution: CPU/memory profiling to analyse (and fix) bottlenecks in dataflow programs
Preliminary: Which Software?
RVC-Cal [Yviquel13]
- Dynamic dataflow
- Dedicated to video codec applications
- Many applications available (hevc, h264, gzip, zigbee)
- Active community
Preliminary: Dataflow Execution Model
[Figure: application graph with actors A, B, C and D. The compiler generates code for each actor (A; B; C; D;), and the mapper distributes the actors (A; C; D; B;) over Core 1 and Core 2, which share the RAM]
1 thread per core - actors scheduled within thread
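"Actors scheduled within thread" can be pictured as a loop that each per-core thread runs over the actors mapped to it. This is a minimal sketch assuming a plain round-robin policy and a run that stops once no actor can fire; `actor_t`, `run_core` and `countdown_fire` are hypothetical names, and the actual runtime's scheduling policy and termination condition may differ.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* An actor is a firing function plus private state. The function returns
   true if it executed an action, false if no action could fire. */
typedef struct {
    const char *name;
    bool (*fire)(void *state);
    void *state;
} actor_t;

/* Body of one per-core thread: visit the mapped actors in round-robin order
   until a full pass fires nothing. Returns the total number of firings. */
static size_t run_core(actor_t *actors, size_t n_actors) {
    assert(actors != NULL || n_actors == 0);
    size_t firings = 0;
    bool progress = true;
    while (progress) {
        progress = false;
        for (size_t i = 0; i < n_actors; i++) {
            if (actors[i].fire(actors[i].state)) {
                firings++;
                progress = true;
            }
        }
    }
    return firings;
}

/* Toy actor for illustration: fires until its budget reaches zero. */
static bool countdown_fire(void *state) {
    int *remaining = state;
    if (*remaining == 0) return false;
    (*remaining)--;
    return true;
}
```

With two toy actors holding budgets of 3 and 1, `run_core` returns 4 firings. One such loop would run in each thread, pinned to its core, over the subset of actors chosen by the mapper.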
Preliminary: Which Architecture?
[Figure: NUMA machine with Core 1 ... Core 6 and Core 7 ... Core 12, uncore, QPI links, Memory Bank 1 and Memory Bank 2]
- Commodity HW
- NUMA
- PMU
- Linux-supported
Goal: Identify and Understand Performance Bottlenecks in Dataflow Programs
[Figure: NUMA machine with Core 1 ... Core 6 and Core 7 ... Core 12, uncore, QPI links, Memory Bank 1 and Memory Bank 2]
Correlate HW profiling to the DF graph
CPU Profiling
Goal: Identify and Understand Cores Imbalance
[Figure: NUMA machine with Core 1 ... Core 6 and Core 7 ... Core 12, uncore, QPI links, Memory Bank 1 and Memory Bank 2. Annotation on the cores: exec time of actors]
Cores Balance
[Figure: work distribution by core (%), Core 1 to Core 12, against the number of cores. HEVC, input: Kimono, 200 frames. Annotation: single actor, Inter pred.]
Diagnosis?
The application is not parallel enough:
- Split the Interframe Prediction actor! [Jerbi14]
- Split other actors as well...
- Parallelize the sequential code inside actors?
Total Work Time is Increasing!
[Figure: total work time (cycles, %) against the number of cores. HEVC, input: Kimono, 200 frames]
Total work time = sum of CPU time for all cores used
Question: where does this overhead come from?
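The total work time metric is a plain sum over the cores in use. This is a minimal sketch; `total_work_time` and `work_overhead_percent` are hypothetical helper names, and expressing the overhead relative to the single-core run is an assumption about how the increase would be reported.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Total work time: sum of the CPU time (in cycles) spent on every core used. */
static uint64_t total_work_time(const uint64_t *cycles_per_core, size_t n_cores) {
    assert(cycles_per_core != NULL || n_cores == 0);
    uint64_t total = 0;
    for (size_t i = 0; i < n_cores; i++) total += cycles_per_core[i];
    return total;
}

/* Overhead of a multi-core run relative to the single-core run, in percent.
   Assumption: the single-core total is used as the baseline. */
static double work_overhead_percent(uint64_t total_cycles, uint64_t single_core_cycles) {
    assert(single_core_cycles > 0);
    return 100.0 * ((double)total_cycles - (double)single_core_cycles)
           / (double)single_core_cycles;
}
```

With invented numbers, three cores spending 600, 500 and 400 cycles give a total of 1500 cycles; against a single-core run of 1200 cycles that is a 25% overhead. In an ideal parallelization the total would stay constant as cores are added.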
Memory Profiling
Goal: Identify and Understand Memory Usage
[Figure: NUMA machine with Core 1 ... Core 6 and Core 7 ... Core 12, uncore, QPI links, Memory Bank 1 and Memory Bank 2. Annotation: mem. traffic of FIFOs]
NUMA - Performance Monitoring Unit
[Figure: NUMA machine with a PMU attached to the cores and to the uncore, QPI links, Memory Bank 1 and Memory Bank 2]
- Hardware profiling mechanisms
- Hard to program
A library for NUMA Profiling
[Figure: software stack above the PMU hardware. User-level options: write assembler and run in supervisor mode; Linux Perf, PAPI, numap; Intel PCM. Kernel interfaces: perf_event_open() system call in the Linux kernel; kernel module /dev/cpu/msr]
numap provides:
- Memory bandwidth profiling
- Memory access sampling
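To give an idea of the interface named on this slide, here is a minimal Linux-only sketch of opening one hardware counter through `perf_event_open()`. It uses a generic cache-miss event as a stand-in: the events numap actually programs are not given in the slides, and `init_cache_miss_attr` and `open_counter` are hypothetical helper names.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <assert.h>
#include <linux/perf_event.h>
#include <string.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

/* Describe a user-space-only hardware counter for cache misses
   (usually last-level cache misses). */
static void init_cache_miss_attr(struct perf_event_attr *attr) {
    assert(attr != NULL);
    memset(attr, 0, sizeof(*attr));
    attr->type = PERF_TYPE_HARDWARE;
    attr->size = sizeof(*attr);
    attr->config = PERF_COUNT_HW_CACHE_MISSES;
    attr->disabled = 1;       /* start stopped; enable later with ioctl() */
    attr->exclude_kernel = 1; /* do not count kernel-mode events */
    attr->exclude_hv = 1;     /* do not count hypervisor events */
}

/* glibc has no wrapper for this system call, so it goes through syscall().
   Returns a file descriptor, or -1 with errno set. */
static int open_counter(struct perf_event_attr *attr, pid_t pid, int cpu) {
    return (int)syscall(SYS_perf_event_open, attr, pid, cpu, -1, 0UL);
}
```

Calling `open_counter(&attr, 0, -1)` would measure the calling thread on any CPU. The counter is then started with `ioctl(fd, PERF_EVENT_IOC_ENABLE, 0)` and its 64-bit value retrieved with `read()`. The call can fail with a permission error depending on the system's `perf_event_paranoid` setting.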
Using numap for memory bandwidth usage
[Figure: NUMA machine with a PMU attached to the cores and to the uncore, QPI links, Memory Bank 1 and Memory Bank 2]
Do DF applications saturate memory bandwidth?
Main Memory Bandwidth Usage
[Figure: average bandwidth (GB/s) against the number of cores; read and write curves compared with the read and write max bandwidth. HEVC, input: Kimono, 200 frames]
Do DF applications saturate memory bandwidth? NO!
Do you pay for too many distant accesses?
[Figure: NUMA machine with a PMU attached to the cores and to the uncore, QPI links, Memory Bank 1 and Memory Bank 2]
Associate mem accesses to actors and FIFOs
Communication Cost
[Figure: % of accesses and average memory latency (cycles) against the number of cores, broken down by data source: LFB, RemoteCache, LocalRAM, RemoteRAM. Intel X5650 Westmere, HEVC, input: Kimono, 200 frames]
A small part of the accesses is responsible for a large share of the latency.
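The gap between "share of accesses" and "share of latency" can be computed directly from sampled accesses. This is a minimal sketch, assuming each sample carries a data source and a latency in cycles; the type and function names are hypothetical, and only the four source labels come from the slide.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Data sources listed in the slide's legend. */
typedef enum { SRC_LFB, SRC_REMOTE_CACHE, SRC_LOCAL_RAM, SRC_REMOTE_RAM, SRC_COUNT } mem_source_t;

/* One sampled memory access: where the data came from and how long it took. */
typedef struct {
    mem_source_t source;
    uint32_t latency_cycles;
} mem_sample_t;

/* Per-source share of accesses and of cumulated latency, both in percent. */
typedef struct {
    double access_pct[SRC_COUNT];
    double latency_pct[SRC_COUNT];
} source_breakdown_t;

static source_breakdown_t breakdown_by_source(const mem_sample_t *samples, size_t n) {
    assert(samples != NULL || n == 0);
    uint64_t count[SRC_COUNT] = {0};
    uint64_t cycles[SRC_COUNT] = {0};
    uint64_t total_cycles = 0;
    source_breakdown_t out = {{0}, {0}};
    for (size_t i = 0; i < n; i++) {
        count[samples[i].source]++;
        cycles[samples[i].source] += samples[i].latency_cycles;
        total_cycles += samples[i].latency_cycles;
    }
    for (int s = 0; s < SRC_COUNT; s++) {
        if (n > 0) out.access_pct[s] = 100.0 * (double)count[s] / (double)n;
        if (total_cycles > 0)
            out.latency_pct[s] = 100.0 * (double)cycles[s] / (double)total_cycles;
    }
    return out;
}
```

With invented numbers, three LFB hits at 10 cycles and one RemoteRAM access at 270 cycles, the remote access is 25% of the accesses but 90% of the cumulated latency, which is the kind of imbalance the slide points at.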
Where to Optimize?
[Figure: dataflow graph with some data exchanges labelled "High latency"]
The profiler gives us:
- High-latency data exchanges at the dataflow level
We plan on using this for:
- Feeding this information to the mapping heuristics
Conclusion
Proposition
- Main goal: improve scalability of DF programs
- How: understand performance bottlenecks in DF programs
- Approach: connect HW-level performance monitoring to the DF runtime
Contributions
- numap: memory profiling for NUMA architectures
- Connection to the RVC-Cal runtime
- Memory profiling of video decoders
Perspectives
Short-term
- Continue analysis of memory sampling results
- Build more intelligent (re-)mapping decisions
Mid- and long-term
- Compare resource usage of DF-written decoders with traditional thread-based implementations (e.g. ffmpeg)
- Integrate DF notions (i.e. data dependencies) into the OS kernel
- Adapt runtime strategies to many-core architectures
- Run and adapt multiple DF applications simultaneously
Bibliography
- [Albers2012] A. H. R. Albers and P. H. N. de With. Task complexity analysis and QoS management for mapping dynamic video-processing tasks on a multi-core platform. Journal of Real-Time Image Processing, 7(3), 2012.
- [Dardaillon2014] Mickaël Dardaillon, Kevin Marquet, Tanguy Risset, Jérôme Martin, and Henri-Pierre Charles. A compilation flow for parametric dataflow: Programming model, scheduling, and application to heterogeneous MPSoC. In Proceedings of the 2014 International Conference on Compilers, Architecture and Synthesis for Embedded Systems, CASES '14, pages 8:1-8:10, New York, NY, USA, 2014. ACM.
- [Dashti2013] Mohammad Dashti, Alexandra Fedorova, Justin Funston, Fabien Gaud, Renaud Lachaize, Baptiste Lepers, Vivien Quema, and Mark Roth. Traffic management: A holistic approach to memory placement on NUMA systems. In Proceedings of the Eighteenth International Conference on Architectural Support for Programming Languages and Operating Systems, ASPLOS '13, New York, NY, USA, 2013. ACM.
- [David2013] Tudor David, Rachid Guerraoui, and Vasileios Trigonakis. Everything you always wanted to know about synchronization but were afraid to ask. In Proceedings of the Twenty-Fourth ACM Symposium on Operating Systems Principles, SOSP '13, pages 33-48, New York, NY, USA, 2013. ACM.
- [Jerbi14] Khaled Jerbi, Daniele Renzi, Damien de Saint-Jorre, Hervé Yviquel, Mickaël Raulet, Claudio Alberti, and Marco Mattavelli. Development and optimization of high level dataflow programs: the HEVC decoder design case. In 48th Asilomar Conference on Signals, Systems and Computers, Pacific Grove, United States, November 2014.
- [Lucarz09] I. Amer, C. Lucarz, G. Roquier, M. Mattavelli, M. Raulet, J.-F. Nezan, and O. Deforges. Reconfigurable video coding on multicore. IEEE Signal Processing Magazine, 26(6), November 2009.
- [Molka2009] Daniel Molka, Daniel Hackenberg, Robert Schöne, and Matthias S. Müller. Memory performance and cache coherency effects on an Intel Nehalem multiprocessor system. In Proceedings of the International Conference on Parallel Architectures and Compilation Techniques, PACT '09, Washington, DC, USA, 2009. IEEE Computer Society.
- [Yviquel13] Hervé Yviquel, Antoine Lorence, Khaled Jerbi, Gildas Cocherel, Alexandre Sanchez, and Mickaël Raulet. Orcc: Multimedia development made easy. In Proceedings of the 21st ACM International Conference on Multimedia, MM '13. ACM, 2013.
Communication Overhead On NUMA
[Figure: NUMA machine with Core 1 ... Core 6 and Core 7 ... Core 12, uncore, QPI links, Memory Bank 1 and Memory Bank 2]
- Remote vs local latency: +30% [Molka2009, David2013]
- Cache coherency protocol, QPI overhead: latency x 4 [Molka2009]
- Memory controllers and QPI links contention: latency x 5 [Dashti2013]
Why build a dataflow profiler?
Why not use a regular profiler alone? Because regular profilers are generally too low-level:
- They are too far from the programmer's way of thinking
- They may know about threads, but not actors
- They are not aware of data dependencies between actors
Preliminary: Dataflow Actors Internals
Application graph: A -> B -> C
C code generated for actor B:

    int fifo_ab[512];
    int fifo_bc[512];

    void action1() {
        int in = pop(fifo_ab);
        int out = in * ...;
        push(fifo_bc, out);
    }

    void action2() { ... }

$\text{Work time}(B) = \sum_{a \in \text{actions}(B)} \text{cpu time}(a)$
Sample correlation
Application graph: A -> B -> C
Memory regions: fifo, stack
C code generated for actor B:

    int fifo_ab[512];
    int fifo_bc[512];

    void action1() {
        int in = pop(fifo_ab);
        int out = in * ...;
        push(fifo_bc, out);
    }

Correlation:
- PMU sample: PC = ... = 0xEFC234A, latency = 50 cycles
- Dataflow sample: B:action1, fifo_ab, latency = 50 cycles
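The correlation step can be sketched as two address-range lookups: the sampled program counter is matched against the code ranges of the generated actions, and the sampled data address against the FIFO buffers. This is a minimal illustration, assuming both range tables are available (for instance from symbol information and from the runtime's buffer allocations); all type and function names here are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Half-open address range [start, end) tagged with a dataflow-level name. */
typedef struct {
    uintptr_t start;
    uintptr_t end;
    const char *name;
} range_t;

/* Linear lookup; returns NULL when the address belongs to no known range. */
static const char *lookup(const range_t *ranges, size_t n, uintptr_t addr) {
    assert(ranges != NULL || n == 0);
    for (size_t i = 0; i < n; i++)
        if (addr >= ranges[i].start && addr < ranges[i].end) return ranges[i].name;
    return NULL;
}

/* Raw PMU sample and its dataflow-level counterpart. */
typedef struct { uintptr_t pc; uintptr_t data_addr; uint32_t latency; } pmu_sample_t;
typedef struct { const char *action; const char *fifo; uint32_t latency; } df_sample_t;

/* Translate one hardware sample into actor/action and FIFO names. */
static df_sample_t correlate(pmu_sample_t s,
                             const range_t *code, size_t n_code,
                             const range_t *fifos, size_t n_fifos) {
    df_sample_t out = { lookup(code, n_code, s.pc),
                        lookup(fifos, n_fifos, s.data_addr),
                        s.latency };
    return out;
}
```

A sample whose PC falls inside the code of `action1` and whose data address falls inside the `fifo_ab` buffer would come out as (B:action1, fifo_ab, latency), matching the dataflow sample on the slide. A `NULL` FIFO name would indicate an access outside the channels, for example to the stack.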
More informationDecoupled Software Pipelining in LLVM
Decoupled Software Pipelining in LLVM 15-745 Final Project Fuyao Zhao, Mark Hahnenberg fuyaoz@cs.cmu.edu, mhahnenb@andrew.cmu.edu 1 Introduction 1.1 Problem Decoupled software pipelining [5] presents an
More informationClearSpeed Visual Profiler
ClearSpeed Visual Profiler Copyright 2007 ClearSpeed Technology plc. All rights reserved. 12 November 2007 www.clearspeed.com 1 Profiling Application Code Why use a profiler? Program analysis tools are
More informationArrakis: The Operating System is the Control Plane
Arrakis: The Operating System is the Control Plane Simon Peter, Jialin Li, Irene Zhang, Dan Ports, Doug Woos, Arvind Krishnamurthy, Tom Anderson University of Washington Timothy Roscoe ETH Zurich Building
More informationECE 8823: GPU Architectures. Objectives
ECE 8823: GPU Architectures Introduction 1 Objectives Distinguishing features of GPUs vs. CPUs Major drivers in the evolution of general purpose GPUs (GPGPUs) 2 1 Chapter 1 Chapter 2: 2.2, 2.3 Reading
More informationClassification-Based Optimization of Dynamic Dataflow Programs
Classification-Based Optimization of Dynamic Dataflow Programs Hervé Yviquel, Emmanuel Casseau, Matthieu Wipliez, Jérôme Gorin, Mickaël Raulet To cite this version: Hervé Yviquel, Emmanuel Casseau, Matthieu
More informationConvergence of Parallel Architecture
Parallel Computing Convergence of Parallel Architecture Hwansoo Han History Parallel architectures tied closely to programming models Divergent architectures, with no predictable pattern of growth Uncertainty
More informationImpact of Cache Coherence Protocols on the Processing of Network Traffic
Impact of Cache Coherence Protocols on the Processing of Network Traffic Amit Kumar and Ram Huggahalli Communication Technology Lab Corporate Technology Group Intel Corporation 12/3/2007 Outline Background
More informationComputer Architecture
Computer Architecture Slide Sets WS 2013/2014 Prof. Dr. Uwe Brinkschulte M.Sc. Benjamin Betting Part 10 Thread and Task Level Parallelism Computer Architecture Part 10 page 1 of 36 Prof. Dr. Uwe Brinkschulte,
More informationCode transformations for energy efficiency; a decoupled accessexecute
Code transformations for energy efficiency; a decoupled accessexecute approach Work performed at Uppsala University Konstantinos Koukos November 2016 OVERALL GOAL The big goal Better exploit the potential
More informationMulti-core Architectures. Dr. Yingwu Zhu
Multi-core Architectures Dr. Yingwu Zhu What is parallel computing? Using multiple processors in parallel to solve problems more quickly than with a single processor Examples of parallel computing A cluster
More informationShoal: smart allocation and replication of memory for parallel programs
Shoal: smart allocation and replication of memory for parallel programs Stefan Kaestle, Reto Achermann, Timothy Roscoe, Tim Harris $ ETH Zurich $ Oracle Labs Cambridge, UK Problem Congestion on interconnect
More informationThe SARC Architecture
The SARC Architecture Polo Regionale di Como of the Politecnico di Milano Advanced Computer Architecture Arlinda Imeri arlinda.imeri@mail.polimi.it 19-Jun-12 Advanced Computer Architecture - The SARC Architecture
More informationThe Multikernel A new OS architecture for scalable multicore systems
Systems Group Department of Computer Science ETH Zurich SOSP, 12th October 2009 The Multikernel A new OS architecture for scalable multicore systems Andrew Baumann 1 Paul Barham 2 Pierre-Evariste Dagand
More informationChapter 4: Threads. Chapter 4: Threads. Overview Multicore Programming Multithreading Models Thread Libraries Implicit Threading Threading Issues
Chapter 4: Threads Silberschatz, Galvin and Gagne 2013 Chapter 4: Threads Overview Multicore Programming Multithreading Models Thread Libraries Implicit Threading Threading Issues 4.2 Silberschatz, Galvin
More informationAutomated Design Flow for Coarse-Grained Reconfigurable Platforms: an RVC-CAL Multi-Standard Decoder Use-Case
XIV International Conference on Embedded Computer and Systems: Architectures, MOdeling and Simulation SAMOS XIV - 2014 July 14 th - Samos Island (Greece) Carlo Sau, Luigi Raffo DIEE Università degli Studi
More informationCross-Layer Memory Management to Reduce DRAM Power Consumption
Cross-Layer Memory Management to Reduce DRAM Power Consumption Michael Jantz Assistant Professor University of Tennessee, Knoxville 1 Introduction Assistant Professor at UT since August 2014 Before UT
More informationCPU Architecture Overview. Varun Sampath CIS 565 Spring 2012
CPU Architecture Overview Varun Sampath CIS 565 Spring 2012 Objectives Performance tricks of a modern CPU Pipelining Branch Prediction Superscalar Out-of-Order (OoO) Execution Memory Hierarchy Vector Operations
More informationChapter 4: Multithreaded Programming
Chapter 4: Multithreaded Programming Silberschatz, Galvin and Gagne 2013 Chapter 4: Multithreaded Programming Overview Multicore Programming Multithreading Models Thread Libraries Implicit Threading Threading
More informationPortable Power/Performance Benchmarking and Analysis with WattProf
Portable Power/Performance Benchmarking and Analysis with WattProf Amir Farzad, Boyana Norris University of Oregon Mohammad Rashti RNET Technologies, Inc. Motivation Energy efficiency is becoming increasingly
More informationReconOS: Multithreaded Programming and Execution Models for Reconfigurable Hardware
ReconOS: Multithreaded Programming and Execution Models for Reconfigurable Hardware Enno Lübbers and Marco Platzner Computer Engineering Group University of Paderborn {enno.luebbers, platzner}@upb.de Outline
More informationPerformance Tools for Technical Computing
Christian Terboven terboven@rz.rwth-aachen.de Center for Computing and Communication RWTH Aachen University Intel Software Conference 2010 April 13th, Barcelona, Spain Agenda o Motivation and Methodology
More information