ClearSpeed Visual Profiler
- Marianna Cannon
1 ClearSpeed Visual Profiler Copyright 2007 ClearSpeed Technology plc. All rights reserved. 12 November
2 Profiling Application Code
Why use a profiler?
Program analysis tools are extremely important for understanding program behavior and improving application performance. Software writers need tools to analyze their programs and identify critical pieces of code. Compiler writers often use such tools to find out how well their instruction scheduling or branch prediction algorithms are performing. Computer architects need such tools to evaluate how well programs will perform on new architectures.
3 Heterogeneous Systems
What does this mean for software?
Adding an accelerator to an existing host system changes the system architecture and the view of the hardware seen by applications. Software performance needs to be re-evaluated and balanced to make the best use of all the resources in the new system. When code is optimized in one particular area of such a system, the performance of the system as a whole changes. The tools used in the various parts of a heterogeneous system do not communicate with each other.
4 Heterogeneous Systems
What does this mean for profiling?
Existing tools that focus on optimizing single-CPU application code are no longer sufficient. Application developers require a system view of the operation of their code alongside the CPU-specific optimization tools. Things become a lot more complicated for the software developer, who must deal with multiple threads and multiple different CPU architectures. Different CPU architectures operate at different speeds (e.g., CSX600 vs. x86).
5 The ClearSpeed Visual Profiler
The Visual Profiler GUI uses Java for cross-platform portability. A common file format allows trace data from multiple sources to be displayed. The profiler is used throughout the ClearSpeed software stack for performance analysis. It is designed to provide a common visual profiling environment for all areas of a ClearSpeed-enabled system. It can display any timeline-based data presented in its simple text-based file format. It can be extended through a plug-in mechanism that allows more specific data visualization.
6 System Level Heterogeneous Profiling
[Diagram: host CPU(s) connected to Advance Accelerator Boards, with four annotated profiling areas]
HOST CODE PROFILING: Visually inspect host code executing. Supports multiple threads and processes. Time specific code sections. See overlap of host threads executing. Platform and processor agnostic trace collection.
HOST/BOARD INTERACTION: View host/board interactions. See overlap of host and board code. Get performance information for data transfer operations. Trace cluster node/board interaction. See overlap of host compute and board compute.
CSX600 PIPELINE: View detailed instruction issue information. See overlap of executing instructions. Optimize code at the instruction level. View instruction level performance bottlenecks. Get accurate instruction timing.
CSX600 SYSTEM: View system level trace. Visually inspect the overlap of compute and I/O. See cache utilization. View branch trace of code executing. Find and analyse performance bottlenecks. Get accurate event timing.
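The "overlap of host compute and board compute" that the profiler shows visually can also be expressed as a number by intersecting the busy intervals of two timelines. The following is a minimal Python sketch of that idea; the interval representation, the function names and the sample data are assumptions made for illustration and are not part of the ClearSpeed tools.

```python
from typing import List, Tuple

# Hypothetical representation: (start, end) times in seconds on a shared clock.
Interval = Tuple[float, float]


def merge_intervals(intervals: List[Interval]) -> List[Interval]:
    """Coalesce overlapping or touching intervals into a sorted, disjoint list."""
    merged: List[Interval] = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            # Extends (or is contained in) the previous interval.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def overlap_time(a: List[Interval], b: List[Interval]) -> float:
    """Total time during which both timelines are busy at once."""
    a, b = merge_intervals(a), merge_intervals(b)
    i = j = 0
    total = 0.0
    while i < len(a) and j < len(b):
        lo = max(a[i][0], b[j][0])
        hi = min(a[i][1], b[j][1])
        if lo < hi:
            total += hi - lo
        # Advance whichever interval finishes first.
        if a[i][1] < b[j][1]:
            i += 1
        else:
            j += 1
    return total


# Made-up sample data: two host compute phases and one board compute phase.
host_busy = [(0.0, 2.0), (3.0, 5.0)]
board_busy = [(1.0, 4.0)]
print(overlap_time(host_busy, board_busy))
```

A low overlap relative to the total run time would suggest that the host and the board are taking turns rather than working concurrently.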
7 ClearSpeed Hardware Support For Profiling
A real-time trace port allows non-intrusive capture of CSX600 hardware events. Each event is tagged with the program counter so that it can be related to the original source code. A single-cycle instruction that generates data on the trace port allows user code to be instrumented. The debugging capability of the CSX600 allows tracing to be done around specific code points. Trace data is streamed to external memory on the board, allowing large amounts of data to be collected. The data is then post-processed, allowing reconstruction of code execution.
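Relating a program-counter tag back to source code is typically a lookup in an address-sorted line table. The slides do not describe how the ClearSpeed tools do this, so the following Python sketch is only an illustration of the general technique; the table contents, the file name and the function name are hypothetical.

```python
import bisect
from typing import Optional, Tuple

# Hypothetical line table, sorted by start address:
# (start address, source file, line number).
LINE_TABLE = [
    (0x1000, "kernel.cn", 10),
    (0x1010, "kernel.cn", 11),
    (0x1024, "kernel.cn", 15),
]
_STARTS = [entry[0] for entry in LINE_TABLE]


def source_for_pc(pc: int) -> Optional[Tuple[str, int]]:
    """Return (file, line) for the table entry covering pc, or None if pc is below the table."""
    # Index of the last entry whose start address is <= pc.
    idx = bisect.bisect_right(_STARTS, pc) - 1
    if idx < 0:
        return None
    _, filename, line = LINE_TABLE[idx]
    return filename, line


print(source_for_pc(0x1014))
```

This simplified table has no end address for the last entry, so any address at or above 0x1024 maps to line 15; a real line table would also record where the code ends.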
8 CSX System Profiling
Profiling the CSX hardware
The real-time trace data collection mechanism on the CSX600 gives an accurate view of processor activity. Information from across the chip can be collected and relayed to the user after execution. It provides information about the mono and poly processors and about the on-chip DMA unit. All events relate back to source code, so software developers can take advantage of hardware trace information. The debugger is used to configure and capture the collection of trace data.
9 CSX System Profiling
How does it work?
The user stops the processor in the debugger just before the section of code to be profiled. The relevant options are enabled using the debugger commands for the trace port. Specific sets of events can be enabled for tracing. Execution is continued until the section of interest is complete. The debugger then writes a trace output file in the specified format. The user can view the results in the ClearSpeed Visual Profiler GUI.
10 CSX System Profiling Demo
11 CSX Profiling
Profiling using the cycle accurate simulator
When optimizing low-level assembler code, it is useful to see how the various stages of the CSX600 pipeline are operating. The CSX600 cycle-accurate simulator has a mode that allows the collection of pipeline trace data. This allows the ClearSpeed Visual Profiler to reconstruct a very accurate view of the execution flow of instructions through the processor. The mode was developed to improve the compiler scheduler but was deemed useful to all developers. It offers the potential to schedule code at the instruction level for higher performance.
12 CSX Profiling
How does it work?
The cycle-accurate simulator is started with the option to enable pipeline tracing. The relevant options are enabled using the debugger commands for the simulator pipeline trace. Execution is continued until the section of interest is complete. The simulator outputs a trace, to which the debugger adds symbolic source code information before writing a trace file. The user can then view the results in the ClearSpeed Visual Profiler GUI.
13 CSX Profiling Demo
14 CSAPI Profiling
Board activity as seen from the host
The CSAPI library is the low-level interface between the host processor and the ClearSpeed boards. At this level there is useful information that can be captured for the application developer. Data such as bus bandwidth figures and CSX processor activity can be collected. This lets a user see, at the system level, how well the CSX processors are being used. Data collection scales as more boards are added and works with as many boards as are in the system. Data from cluster nodes containing ClearSpeed hardware can be captured and then merged into a single file.
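Merging traces captured on several cluster nodes into a single file amounts to interleaving already time-ordered event streams. The slides do not say how the ClearSpeed tools perform the merge, so the Python sketch below only illustrates the general idea; the event layout and node names are hypothetical, and it assumes the node clocks have already been brought onto a common time base.

```python
import heapq
from typing import Iterable, Iterator, Tuple

# Hypothetical event layout: (timestamp in seconds, node name, description).
Event = Tuple[float, str, str]


def merge_traces(*traces: Iterable[Event]) -> Iterator[Event]:
    """Interleave per-node traces by timestamp.

    Each input trace must already be sorted by timestamp.
    """
    return heapq.merge(*traces, key=lambda event: event[0])


# Made-up sample traces from two nodes.
node_a = [(0.0, "node-a", "write memory"), (2.0, "node-a", "run")]
node_b = [(1.0, "node-b", "write memory"), (3.0, "node-b", "read memory")]

merged = list(merge_traces(node_a, node_b))
for timestamp, node, description in merged:
    print(f"{timestamp:.1f} {node} {description}")
```

Because `heapq.merge` is lazy, large per-node traces can be streamed from disk without loading them fully into memory.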
15 CSAPI Profiling
How does it work?
An instrumented CSAPI library is provided with the software installation. At runtime, a user can switch the library used to drive the hardware to this instrumented version. This is done by setting an environment variable and then running the application unchanged. A file compatible with the ClearSpeed Visual Profiler is produced for post-process viewing. The GUI displays the parameters and return values of every CSAPI call made during the program run. Additional information, such as host threads and metrics such as the bandwidth for reading and writing memory, is also displayed.
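A bandwidth metric of the kind mentioned above can be derived from the size and duration of each traced transfer. The Python sketch below shows the arithmetic only; the `Transfer` record, its field names and the sample values are assumptions for illustration and do not reflect the actual trace contents of the instrumented CSAPI library.

```python
from dataclasses import dataclass


@dataclass
class Transfer:
    """Hypothetical record of one traced host/board memory transfer."""
    name: str
    start_s: float   # start time in seconds
    end_s: float     # end time in seconds
    num_bytes: int   # payload size in bytes


def bandwidth_mb_per_s(transfer: Transfer) -> float:
    """Achieved bandwidth in MB/s (1 MB = 10**6 bytes)."""
    duration = transfer.end_s - transfer.start_s
    if duration <= 0:
        raise ValueError("transfer must have a positive duration")
    return transfer.num_bytes / duration / 1e6


# Made-up example: 100 MB written in half a second.
write = Transfer("write memory", start_s=0.0, end_s=0.5, num_bytes=100_000_000)
print(bandwidth_mb_per_s(write))
```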
16 CSAPI Profiling Demo
17 Generic Host Profiling
Profiling host code
The mechanism used to instrument the CSAPI library is implemented in a very generic manner. It relies on a simple cross-platform tracing library that can be used in any x86 code. This allows users to instrument any piece of source code they have and to generate timing information and additional metrics that can be viewed with the ClearSpeed Visual Profiler. It was initially developed to fill in the "Unknown" sections in the CSAPI tracing. It can also be used to optimize stand-alone x86 code. A very simple graphing mechanism allows user-specific metrics to be displayed.
18 Generic Host Profiling
How does it work?
A user instruments their code with the API of the visual profiler tracing library. Additional metrics for the GUI can be added using a callback mechanism so as not to interfere with timing. The application generates a file, usable with the visual profiler, containing all of the events traced within the application. This file can then be loaded and viewed using the Visual Profiler GUI. Any additional metrics can be viewed using the host profiling plug-in for the GUI.
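The workflow above (instrument sections of code, register metric callbacks, write a text trace) can be sketched in a few lines. The following Python example is not the ClearSpeed tracing library or its file format, neither of which is described in these slides; the `Tracer` class, its methods and the `EVENT`/`METRIC` line layout are all hypothetical stand-ins that show the shape of such an API.

```python
import threading
import time
from contextlib import contextmanager
from typing import Callable, Dict, Iterator, List, Tuple


class Tracer:
    """Hypothetical minimal tracer: timed sections plus callback-based metrics."""

    def __init__(self) -> None:
        # Each event: (section name, thread id, start time, end time).
        self._events: List[Tuple[str, int, float, float]] = []
        self._metrics: Dict[str, Callable[[], float]] = {}
        self._lock = threading.Lock()

    @contextmanager
    def section(self, name: str) -> Iterator[None]:
        """Time the enclosed block and record it against the calling thread."""
        start = time.perf_counter()
        try:
            yield
        finally:
            end = time.perf_counter()
            with self._lock:
                self._events.append((name, threading.get_ident(), start, end))

    def register_metric(self, name: str, callback: Callable[[], float]) -> None:
        """Register a callback that is only invoked when the trace is written,
        so collecting the metric does not run inside any timed section."""
        self._metrics[name] = callback

    def lines(self) -> List[str]:
        """Render the trace as text lines (made-up layout)."""
        out = [f"EVENT {n} {tid} {s:.9f} {e:.9f}" for n, tid, s, e in self._events]
        out += [f"METRIC {n} {cb()}" for n, cb in self._metrics.items()]
        return out

    def dump(self, path: str) -> None:
        """Write the trace to a text file for later viewing."""
        with open(path, "w") as handle:
            handle.write("\n".join(self.lines()) + "\n")


tracer = Tracer()
tracer.register_metric("items_processed", lambda: 1000)
with tracer.section("compute"):
    total = sum(range(1000))
print("\n".join(tracer.lines()))
```

Deferring the metric callbacks until the trace is written is one simple way to keep metric collection out of the timed regions; a real library might instead sample them on a separate thread.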
19 Generic Host Profiling Demo
20 Conclusion
What have we seen?
Software developers now require additional tools in order to develop applications for heterogeneous systems. The ClearSpeed Visual Profiler attempts to tackle these issues by introducing a high-level view of system activity. The ability to profile both the CSX600 hardware and the simulator allows detailed tuning of application code. ClearSpeed engineering development tools are being productized and made available to all developers. The features demonstrated are enhancements to the current tools and are available in the 3.00 Beta release.
21 Conclusion
Questions?
More informationEECS 452 Lecture 9 TLP Thread-Level Parallelism
EECS 452 Lecture 9 TLP Thread-Level Parallelism Instructor: Gokhan Memik EECS Dept., Northwestern University The lecture is adapted from slides by Iris Bahar (Brown), James Hoe (CMU), and John Shen (CMU
More informationGPUs and GPGPUs. Greg Blanton John T. Lubia
GPUs and GPGPUs Greg Blanton John T. Lubia PROCESSOR ARCHITECTURAL ROADMAP Design CPU Optimized for sequential performance ILP increasingly difficult to extract from instruction stream Control hardware
More informationHardware Design I Chap. 10 Design of microprocessor
Hardware Design I Chap. 0 Design of microprocessor E-mail: shimada@is.naist.jp Outline What is microprocessor? Microprocessor from sequential machine viewpoint Microprocessor and Neumann computer Memory
More informationDistributed Debugging API for ORBs and Services. Request for Proposal, test/ Dale Parson, Distinguished Member of Technical Staff
Lucent CORBA Seminar 1999 Distributed Debugging API for ORBs and Services Request for Proposal, test/99-08-02 September 28, 1999 Dale Parson, Distinguished Member of Technical Staff Bell Labs, Microelectronics
More informationDeveloping, Debugging, and Optimizing GPU Codes for High Performance Computing with Allinea Forge
Developing, Debugging, and Optimizing GPU Codes for High Performance Computing with Allinea Forge Ryan Hulguin Applications Engineer ryan.hulguin@arm.com Agenda Introduction Overview of Allinea Products
More informationRad-Hard Microcontroller For Space Applications
The most important thing we build is trust ADVANCED ELECTRONIC SOLUTIONS AVIATION SERVICES COMMUNICATIONS AND CONNECTIVITY MISSION SYSTEMS Rad-Hard Microcontroller For Space Applications Fredrik Johansson
More informationOverview. Technology Details. D/AVE NX Preliminary Product Brief
Overview D/AVE NX is the latest and most powerful addition to the D/AVE family of rendering cores. It is the first IP to bring full OpenGL ES 2.0/3.1 rendering to the FPGA and SoC world. Targeted for graphics
More informationPipelining, Branch Prediction, Trends
Pipelining, Branch Prediction, Trends 10.1-10.4 Topics 10.1 Quantitative Analyses of Program Execution 10.2 From CISC to RISC 10.3 Pipelining the Datapath Branch Prediction, Delay Slots 10.4 Overlapping
More informationUsing Industry Standards to Exploit the Advantages and Resolve the Challenges of Multicore Technology
Using Industry Standards to Exploit the Advantages and Resolve the Challenges of Multicore Technology September 19, 2007 Markus Levy, EEMBC and Multicore Association Enabling the Multicore Ecosystem Multicore
More informationCS 426 Parallel Computing. Parallel Computing Platforms
CS 426 Parallel Computing Parallel Computing Platforms Ozcan Ozturk http://www.cs.bilkent.edu.tr/~ozturk/cs426/ Slides are adapted from ``Introduction to Parallel Computing'' Topic Overview Implicit Parallelism:
More informationTools and Methodology for Ensuring HPC Programs Correctness and Performance. Beau Paisley
Tools and Methodology for Ensuring HPC Programs Correctness and Performance Beau Paisley bpaisley@allinea.com About Allinea Over 15 years of business focused on parallel programming development tools Strong
More informationVisualizing the out-of-order CPU model. Ryota Shioya Nagoya University
Visualizing the out-of-order CPU model Ryota Shioya Nagoya University Introduction This presentation introduces the visualization of the out-of-order CPU model in gem5 2 Introduction Let's suppose you
More informationOperating Systems CMPSCI 377 Spring Mark Corner University of Massachusetts Amherst
Operating Systems CMPSCI 377 Spring 2017 Mark Corner University of Massachusetts Amherst Last Class: Intro to OS An operating system is the interface between the user and the architecture. User-level Applications
More informationKNL tools. Dr. Fabio Baruffa
KNL tools Dr. Fabio Baruffa fabio.baruffa@lrz.de 2 Which tool do I use? A roadmap to optimization We will focus on tools developed by Intel, available to users of the LRZ systems. Again, we will skip the
More informationAn Overview of the BLITZ System
An Overview of the BLITZ System Harry H. Porter III Department of Computer Science Portland State University Introduction The BLITZ System is a collection of software designed to support a university-level
More informationCourse web site: teaching/courses/car. Piazza discussion forum:
Announcements Course web site: http://www.inf.ed.ac.uk/ teaching/courses/car Lecture slides Tutorial problems Courseworks Piazza discussion forum: http://piazza.com/ed.ac.uk/spring2018/car Tutorials start
More informationProfiling: Understand Your Application
Profiling: Understand Your Application Michal Merta michal.merta@vsb.cz 1st of March 2018 Agenda Hardware events based sampling Some fundamental bottlenecks Overview of profiling tools perf tools Intel
More informationWS_CCESSH-OUT-v1.00.doc Page 1 of 8
Course Name: Course Code: Course Description: System Development with CrossCore Embedded Studio (CCES) and the ADI SHARC Processor WS_CCESSH This is a practical and interactive course that is designed
More informationIntroduction to CUDA Algoritmi e Calcolo Parallelo. Daniele Loiacono
Introduction to CUDA Algoritmi e Calcolo Parallelo References This set of slides is mainly based on: CUDA Technical Training, Dr. Antonino Tumeo, Pacific Northwest National Laboratory Slide of Applied
More informationArquitecturas y Modelos de. Multicore
Arquitecturas y Modelos de rogramacion para Multicore 17 Septiembre 2008 Castellón Eduard Ayguadé Alex Ramírez Opening statements * Some visionaries already predicted multicores 30 years ago And they have
More informationVAMPIR & VAMPIRTRACE INTRODUCTION AND OVERVIEW
VAMPIR & VAMPIRTRACE INTRODUCTION AND OVERVIEW 8th VI-HPS Tuning Workshop at RWTH Aachen September, 2011 Tobias Hilbrich and Joachim Protze Slides by: Andreas Knüpfer, Jens Doleschal, ZIH, Technische Universität
More informationAccelerated Library Framework for Hybrid-x86
Software Development Kit for Multicore Acceleration Version 3.0 Accelerated Library Framework for Hybrid-x86 Programmer s Guide and API Reference Version 1.0 DRAFT SC33-8406-00 Software Development Kit
More informationMEMORY HIERARCHY DESIGN. B649 Parallel Architectures and Programming
MEMORY HIERARCHY DESIGN B649 Parallel Architectures and Programming Basic Optimizations Average memory access time = Hit time + Miss rate Miss penalty Larger block size to reduce miss rate Larger caches
More informationIntel Parallel Studio 2011
THE ULTIMATE ALL-IN-ONE PERFORMANCE TOOLKIT Studio 2011 Product Brief Studio 2011 Accelerate Development of Reliable, High-Performance Serial and Threaded Applications for Multicore Studio 2011 is a comprehensive
More informationBaback Elmieh, Software Lead James Ritts, Profiler Lead Qualcomm Incorporated Advanced Content Group
Introduction ti to Adreno Tools Baback Elmieh, Software Lead James Ritts, Profiler Lead Qualcomm Incorporated Advanced Content Group Qualcomm HW Accelerated 3D: Adreno Moving content-quality forward requires
More informationIBM Data Science Experience White paper. SparkR. Transforming R into a tool for big data analytics
IBM Data Science Experience White paper R Transforming R into a tool for big data analytics 2 R Executive summary This white paper introduces R, a package for the R statistical programming language that
More informationMapReduce on the Cell Broadband Engine Architecture. Marc de Kruijf
MapReduce on the Cell Broadband Engine Architecture Marc de Kruijf Overview Motivation MapReduce Cell BE Architecture Design Performance Analysis Implementation Status Future Work What is MapReduce? A
More information