Various optimization and performance tips for processors
Transcription
1 Various optimization and performance tips for processors Kazushige Goto Texas Advanced Computing Center 2006/12/7 Kazushige Goto (TACC) 1
2 Contents: Introducing myself. Merits and demerits of optimization. How to avoid reducing performance (not how to improve performance). Explanation of each tip. GotoBLAS tutorial (Monday, 10:30 to 11:30).
3 Four years ago: I was a patent examiner at the JPO and got a chance to study abroad. I had to find a research group, but no one except UT responded to my request. I developed DGEMM for the P4, and it was used at Buffalo Univ. in NY.
4 One regret: wrong naming. I never thought it would become one of the major BLAS libraries. Over 4,500 users have now registered. I don't know the number of downloads. No one can pronounce my name.
5 My questions about R: Do you need high-precision floating-point operations? 64bit? Or do you need 128bit? Or are integer operations enough? An 80bit FP BLAS?
6 Standard optimization: 1. Compiler optimization is good enough if the data are in L1 cache. 2. Your job is to manage the data and move them into L1 cache. 3. Please don't expect too much. Performance is neither good nor bad. L1 cache is very small.
7 Advanced optimization: 1. Bandwidth-aware programming. 2. Separate out the important functions; we call such a function a kernel. 3. Write the function in assembler. 4. The compiler's code is not enough. Much better performance. L2 cache is very large.
8 Who determines performance? Everyone asks me how to improve performance. Actually, performance is a kind of demerit point system: 1. Start from 100%. 2. Something gets in the way of the work. 3. Performance is then reduced. We have to get rid of every problem.
9 Can I improve performance? It depends on your bottleneck. I/O: no hope. Main memory: up to 2x. Cache memory: up to 6x. Instruction scheduling: up to 10x. Integer operations in particular can be improved by up to 100x! Of course, it's very, very difficult.
10 Side effects of optimization: It comes to nothing if the algorithm is changed. Performance comparison: optimization can control the order of superiority, so people will misunderstand which is better. Fair optimization and comparison are needed: good algorithm + non-optimized coding versus bad algorithm + optimized coding.
11 A bunch of mines: Operating system. Memory. Instruction scheduling. Floating-point exceptions. Synchronization cost on SMP.
12 Biggest bottleneck: the human. The algorithm is always the first priority! Humans should understand that the computer is far from perfect. The computer loves simple work. The computer hates any exceptions and interrupts. Optimization is the last resort.
13 Operating System (process scheduling): Generally a process can't use 100% of the CPU cycles, because of interrupt handling and process scheduling (when there are many active processes). Timer frequency problem: the Linux default is 1000Hz; you may change it to 100Hz.
14 Operating System (memory management): Very important for performance. Have you ever seen performance variations (slow, fast, slow)? They are caused by physically noncontiguous memory mapping. Average performance is meaningless. The user can't control it.
15 Memory Mapping [Diagram: pages of virtual memory (contiguous) mapped to pages of physical memory (non-contiguous)]
16 Performance variations [Figure: "Performance Variations (PPC970)"; series: Performance, Performance (HugeTLB), L2 Conflicts; axes: MFlops, Iterations, # of conflicts]
17 Operating System (frequency throttling): Recent CPUs can control their frequency to reduce power consumption. The CPU is very slow at the beginning of a benchmark. You can check the sysfs file /sys/devices/system/cpu/cpu?/cpufreq/scaling_min_freq
18 Throttle Performance [Figure: "CPU Frequency Throttling"; series: Normal, Throttle; axes: MFlops vs. Matrix Order]
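The cpufreq file mentioned on slide 17 can be read before a benchmark to see whether frequency scaling may distort the first measurements. The following is a minimal sketch, assuming a Linux system that exposes the cpufreq sysfs interface; the helper name `read_cpufreq_khz` and its `root` parameter are hypothetical and exist only for illustration.

```python
from pathlib import Path


def read_cpufreq_khz(cpu=0, name="scaling_min_freq",
                     root="/sys/devices/system/cpu"):
    """Return the value of a cpufreq sysfs file in kHz, or None if unavailable.

    `root` is a parameter only so the function can be pointed at another
    directory; on a real system the default path is the one from the slide.
    """
    path = Path(root) / f"cpu{cpu}" / "cpufreq" / name
    try:
        return int(path.read_text().strip())
    except (OSError, ValueError):
        # No cpufreq support, no such CPU, or unexpected file contents.
        return None
```

Comparing `scaling_min_freq` with `scaling_max_freq` for the same CPU gives a rough idea of how far the clock can drop while the machine is idle. A warm-up run before timing is one common way to avoid measuring the slow start.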
19 Memory issues: Page fault (low amount of memory). TLB miss. Narrow bandwidth. Cache miss. Large latency. Cache bank conflict. Unaligned trap.
20 Memory Latency [Figure: "Memory Latency on Opteron"; series: HugeTLB, MMAP; axes: Cycles vs. Vector Size (kb)]
21 Memory Bandwidth [Figure: "Memory Bandwidth on Opteron"; series: HugeTLB, MMAP; axes: Doubles/cycle vs. Matrix Order]
22 Instruction cache: Simple sin benchmark. The first access costs too much. More than 3 calls are required to get good performance. [Table columns: Iteration, Itanium, Pentium; values not preserved in this transcription]
23 Instruction scheduling (skip): Decoding bottleneck. Scheduling rules. Complex dependencies. Integer divide and remainder. Each architecture has its own characteristics; it is a deep world.
24 Floating-point exceptions: Subnormal. Overflow. Underflow. +Infinity, -Infinity. NaN (Not a Number). Division by zero.
25 Strange initialization by great users: Inf * 0 is actually NaN, not zero. Some users call SCAL (one of the BLAS functions) with alpha = zero to initialize a matrix. BLAS (only my BLAS?) doesn't take the IEEE 754 special cases into account.
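The point of slide 25 can be reproduced in a few lines. This is a sketch, not GotoBLAS code: `scale_in_place` is a hypothetical stand-in for a SCAL-style kernel that multiplies every element without special-casing alpha = 0, and `zero_fill` is the explicit initialization it is compared with.

```python
import math


def scale_in_place(values, alpha):
    """Multiply every element by alpha, as a plain SCAL-style loop would."""
    for i in range(len(values)):
        values[i] = alpha * values[i]
    return values


def zero_fill(values):
    """Initialize by assignment, which does not depend on the old contents."""
    for i in range(len(values)):
        values[i] = 0.0
    return values


# Pretend this is uninitialized memory that happens to hold Inf and NaN.
uninitialized = [1.5, math.inf, math.nan, -2.0]

scaled = scale_in_place(list(uninitialized), 0.0)   # Inf and NaN survive as NaN
cleared = zero_fill(list(uninitialized))            # every element is 0.0
```

Scaling by zero leaves NaN wherever the old contents were Inf or NaN, so assignment is the safer way to clear a buffer.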
26 Floating-point exception cost [Table, relative value (normal is 1); columns: Architecture, Subnormal, Infinity/NaN/Overflow/Underflow; entries as they survive in this transcription: Pentium Core Opteron 41 1 Itanium POWER5 9 1]
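Because subnormal operands are the expensive case in the table on slide 26, it can help to check whether data contains any. The sketch below assumes IEEE 754 double precision, which is what Python floats use on common platforms; the helper names `is_subnormal` and `count_subnormals` are hypothetical.

```python
import sys


def is_subnormal(x):
    """True if x is nonzero and smaller in magnitude than the smallest normal double."""
    return x != 0.0 and abs(x) < sys.float_info.min


def count_subnormals(values):
    """Count how many entries of a sequence are subnormal."""
    return sum(1 for v in values if is_subnormal(v))
```

For example, `sys.float_info.min / 2` is subnormal, while `sys.float_info.min` itself, zero, infinities and NaN are not.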
27 Calculation order: The problem with the associative law: (A + B) + C != A + (B + C). Optimization requires changing the order of calculation. The order depends on the architecture. It is really difficult to get the correct (same) result across architectures.
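The inequality on slide 27 is easy to observe in double precision. The values below are chosen for illustration and are not from the talk.

```python
import math

a, b, c = 0.1, 0.2, 0.3

left = (a + b) + c    # add a and b first
right = a + (b + c)   # add b and c first

# The two groupings round differently, so the results are not bit-identical,
# although they agree to within normal rounding error.
same_bits = left == right
close = math.isclose(left, right)
```

Here `same_bits` is false and `close` is true, which is why tests that compare against a reference result usually need a tolerance instead of exact equality.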
28 Function call overhead: Spill operations (saving and restoring register values) are a big hidden bottleneck. Try a static inline function if the function is very small and doesn't contain other function calls. Use -fno-inline if you use the profile option.
29 System call overhead: A system call is different from a normal function call. System calls include mmap/munmap, shared memory, writing to or reading from a file, and signaling. malloc is not a system call. Output to stderr is unbuffered!
30 Example: DDOT (double-precision dot product). I won't explain how to optimize it. Please understand: the calculation order in the ddot function; the result may vary; the unrolling type; SSE or SIMD operation; the aligned/unaligned issue.
31 DDOT on R: Testing failed with my BLAS. Actually my BLAS was sane. The problem was: 1. R has its own ddot function. 2. It uses the x87 FP stack (80bit precision). 3. My BLAS uses SSE2 (64bit precision). 4. The results are fairly different.
32 Reason: An intermediate result was close to zero. 80bit FP can hold the small value; 64bit FP can't. BLAS can't avoid this: BLAS changes the calculation order to get better performance.
33 DDOT data on R [Data listing: ten pairs X[0] to X[9] and Y[0] to Y[9]; the numeric values are not preserved in this transcription] In total there are 10! = 3,628,800 patterns for the add operations.
34 How results vary [Table: Min and Max of the result by precision: 32bit, 64bit, 80bit, 64bit with sort(*); the numeric values are not preserved in this transcription] (*) Sort in ascending order of absolute value, then add.
35 Why is the calculation order different? We have to hide instruction latency. Itanium2: 8-way unrolling. POWER5: 16-way unrolling. Pentium4 with SSE2: 8-way unrolling. Pentium4 with x87: 4-way unrolling. SPARC: 4-way unrolling. The calculation order is completely different.
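Unrolling changes the result because each unrolled lane keeps its own partial sum, and the partial sums are combined only at the end. The sketch below models this with a hypothetical `dot_unrolled` helper; it illustrates the effect and is not how any BLAS kernel is written. The input data are made up so that the difference is visible.

```python
def dot_unrolled(x, y, unroll):
    """Dot product with `unroll` independent partial sums, combined at the end."""
    partial = [0.0] * unroll
    for i in range(len(x)):
        # Element i goes to lane i % unroll, as in an unrolled loop.
        partial[i % unroll] += x[i] * y[i]
    total = 0.0
    for p in partial:
        total += p
    return total


x = [1e17, 1.0, -1e17, 1.0]
y = [1.0, 1.0, 1.0, 1.0]

sequential = dot_unrolled(x, y, 1)  # one accumulator: the small terms are absorbed
two_way = dot_unrolled(x, y, 2)     # two accumulators: the large terms cancel first
```

With one accumulator the first 1.0 is lost when it is added to 1e17, so the result is 1.0; with two accumulators the large terms cancel exactly in one lane and the result is 2.0. Both are legitimate double-precision evaluations of the same dot product.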
36 Alignment / unalignment: Alignment is related to the address of the data. The address should be a multiple of the data size. 16bit: 0x082 aligned, 0x83 unaligned. 32bit: 0x084 aligned, 0x86 unaligned. 64bit: 0x088 aligned, 0x8a unaligned. 128bit: 0x090 aligned, 0x98 unaligned.
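The rule on slide 36 reduces to a single remainder test. The helper name `is_aligned` below is hypothetical; the addresses are the ones from the slide.

```python
def is_aligned(address, size_bits):
    """True if `address` is a multiple of the data size given in bits."""
    size_bytes = size_bits // 8
    return address % size_bytes == 0


# (data size in bits, aligned example, unaligned example) from the slide
slide_examples = [
    (16, 0x082, 0x83),
    (32, 0x084, 0x86),
    (64, 0x088, 0x8A),
    (128, 0x090, 0x98),
]

# One (aligned?, unaligned?) pair per row; each pair should be (True, False).
results = [(is_aligned(good, bits), is_aligned(bad, bits))
           for bits, good, bad in slide_examples]
```

Note that 0x98 is 8-byte aligned but not 16-byte aligned, which is why it is fine for a 64bit access and not for a 128bit one.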
37 Aligned data: Some architectures need 128bit alignment to move data efficiently: Intel SSE/SSE2, Intel IA64, IBM VMX (AltiVec). The user's X and Y arguments are not always aligned.
38 Four scenarios: 1. X aligned, Y aligned. 2. X unaligned, Y aligned. 3. X aligned, Y unaligned. 4. X unaligned, Y unaligned. Scenarios 1 and 2 may give the same result. Scenarios 1 and 3 may give different results even though the data are exactly the same!
39 Synchronization cost: Synchronization always reduces efficiency. There are two ways to synchronize. By the kernel: other threads/processes can use the CPU, but response is bad. Busy wait (different from a spin loop): other threads/processes can't use the CPU, but response is pretty good.
40 Threaded operation: It's important to divide the jobs equally. Accessing the queue takes a long time because many threads try to access the same queue at once. Waking up and suspending threads has a cost. If we can get rid of the above costs, what happens?
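One simple way to divide jobs equally, as slide 40 asks, is to give every thread a contiguous range whose length differs from the others by at most one. The function name `split_evenly` is hypothetical, and this static split is only one possible scheme; it also avoids a shared work queue, which is the contention the slide describes.

```python
def split_evenly(n_items, n_threads):
    """Return (start, end) half-open ranges that differ in length by at most one."""
    base, extra = divmod(n_items, n_threads)
    ranges = []
    start = 0
    for t in range(n_threads):
        # The first `extra` threads take one additional item.
        length = base + (1 if t < extra else 0)
        ranges.append((start, start + length))
        start += length
    return ranges
```

For 10 rows and 4 threads this yields ranges of length 3, 3, 2 and 2, so no thread waits long for another to finish.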
41 Pthread overhead (Level 2) [Figure: "Thread Overhead (DGEMV on Itanium2)"; series: Single, MultiThreaded, Busy Wait; axes: MFlops vs. Matrix Order]
42 Imagine how data move! [Diagram: modified data, CPU 0, CPU 1, memory] Cache prevents writing back data. All data have to go through main memory!
43 Pthread overhead (Level 3) [Figure: "Thread Overhead (DGEMM on Itanium2)"; series: Single, Multi Threaded, Busy Wait; axes: MFlops vs. Matrix Order]
44 80bit FP BLAS: 128bit FP is really good, but slow due to software emulation. 80bit FP is less precise than 128bit FP, but more precise than 64bit FP. No penalty except for load/store operations. GCC can handle it as long double.
45 The problem: 80bit FP is not compatible with 128bit FP. Intel x86 / x86_64: bad performance. Intel IA64: good performance (92% of peak). I don't know how useful it is. Is anyone interested?
46 QGEMM on Itanium2 [Figure: "QGEMM Performance on Itanium2"; series: DGEMM, QGEMM; axes: MFlops vs. Matrix Order]
47 QGEMM on Opteron [Figure: "QGEMM performance on Opteron"; series: DGEMM, QGEMM; axes: MFlops vs. Matrix Order]
48 Conclusion: The performance of your application depends on many factors: the operating system, your algorithm, function overhead, data alignment, data types, etc.
49 Please do not: Do easy optimizations that the compiler can do (loop unrolling, easy blocking, simple hand optimization). Stick to cache size; bandwidth is more important. Use excessive threaded operation.
50 Please do: Improve your algorithm. Be aware of limited bandwidth. Avoid using subnormal values. Separate out the important functions. Divide the job equally among threads in threaded operation.
51 Any questions? Then please join the tutorial on Monday!
Computer Architectures Advance CPU Design Tien-Fu Chen National Chung Cheng Univ. Adv CPU-0 MMX technology! Basic concepts " small native data types " compute-intensive operations " a lot of inherent parallelism
More informationCSE 820 Graduate Computer Architecture. week 6 Instruction Level Parallelism. Review from Last Time #1
CSE 820 Graduate Computer Architecture week 6 Instruction Level Parallelism Based on slides by David Patterson Review from Last Time #1 Leverage Implicit Parallelism for Performance: Instruction Level
More informationChapter 8: Main Memory
Chapter 8: Main Memory Silberschatz, Galvin and Gagne 2013 Chapter 8: Memory Management Background Swapping Contiguous Memory Allocation Segmentation Paging Structure of the Page Table Example: The Intel
More informationMemory management. Last modified: Adaptation of Silberschatz, Galvin, Gagne slides for the textbook Applied Operating Systems Concepts
Memory management Last modified: 26.04.2016 1 Contents Background Logical and physical address spaces; address binding Overlaying, swapping Contiguous Memory Allocation Segmentation Paging Structure of
More informationBasics of Performance Engineering
ERLANGEN REGIONAL COMPUTING CENTER Basics of Performance Engineering J. Treibig HiPerCH 3, 23./24.03.2015 Why hardware should not be exposed Such an approach is not portable Hardware issues frequently
More informationCS 179: GPU Computing LECTURE 4: GPU MEMORY SYSTEMS
CS 179: GPU Computing LECTURE 4: GPU MEMORY SYSTEMS 1 Last time Each block is assigned to and executed on a single streaming multiprocessor (SM). Threads execute in groups of 32 called warps. Threads in
More informationENCM 501 Winter 2018 Assignment 2 for the Week of January 22 (with corrections)
page 1 of 5 ENCM 501 Winter 2018 Assignment 2 for the Week of January 22 (with corrections) Steve Norman Department of Electrical & Computer Engineering University of Calgary January 2018 Assignment instructions
More informationPerformance of Multicore LUP Decomposition
Performance of Multicore LUP Decomposition Nathan Beckmann Silas Boyd-Wickizer May 3, 00 ABSTRACT This paper evaluates the performance of four parallel LUP decomposition implementations. The implementations
More informationOptimisation p.1/22. Optimisation
Performance Tuning Optimisation p.1/22 Optimisation Optimisation p.2/22 Constant Elimination do i=1,n a(i) = 2*b*c(i) enddo What is wrong with this loop? Compilers can move simple instances of constant
More informationCS 654 Computer Architecture Summary. Peter Kemper
CS 654 Computer Architecture Summary Peter Kemper Chapters in Hennessy & Patterson Ch 1: Fundamentals Ch 2: Instruction Level Parallelism Ch 3: Limits on ILP Ch 4: Multiprocessors & TLP Ap A: Pipelining
More informationChapter 3 Memory Management: Virtual Memory
Memory Management Where we re going Chapter 3 Memory Management: Virtual Memory Understanding Operating Systems, Fourth Edition Disadvantages of early schemes: Required storing entire program in memory
More informationMemory Management. Disclaimer: some slides are adopted from book authors slides with permission 1
Memory Management Disclaimer: some slides are adopted from book authors slides with permission 1 CPU management Roadmap Process, thread, synchronization, scheduling Memory management Virtual memory Disk
More informationJohn Wawrzynek & Nick Weaver
CS 61C: Great Ideas in Computer Architecture Lecture 23: Virtual Memory John Wawrzynek & Nick Weaver http://inst.eecs.berkeley.edu/~cs61c From Previous Lecture: Operating Systems Input / output (I/O) Memory
More informationEfficient Software Based Fault Isolation. Software Extensibility
Efficient Software Based Fault Isolation Robert Wahbe, Steven Lucco Thomas E. Anderson, Susan L. Graham Software Extensibility Operating Systems Kernel modules Device drivers Unix vnodes Application Software
More informationENCM 501 Winter 2016 Assignment 1 for the Week of January 25
page 1 of 5 ENCM 501 Winter 2016 Assignment 1 for the Week of January 25 Steve Norman Department of Electrical & Computer Engineering University of Calgary January 2016 Assignment instructions and other
More informationCell Programming Tips & Techniques
Cell Programming Tips & Techniques Course Code: L3T2H1-58 Cell Ecosystem Solutions Enablement 1 Class Objectives Things you will learn Key programming techniques to exploit cell hardware organization and
More informationFast access ===> use map to find object. HW == SW ===> map is in HW or SW or combo. Extend range ===> longer, hierarchical names
Fast access ===> use map to find object HW == SW ===> map is in HW or SW or combo Extend range ===> longer, hierarchical names How is map embodied: --- L1? --- Memory? The Environment ---- Long Latency
More informationOptimized Scientific Computing:
Optimized Scientific Computing: Coding Efficiently for Real Computing Architectures Noah Kurinsky SASS Talk, November 11 2015 Introduction Components of a CPU Architecture Design Choices Why Is This Relevant
More informationCISC 662 Graduate Computer Architecture Lecture 7 - Multi-cycles. Interrupts and Exceptions. Device Interrupt (Say, arrival of network message)
CISC 662 Graduate Computer Architecture Lecture 7 - Multi-cycles Michela Taufer Interrupts and Exceptions http://www.cis.udel.edu/~taufer/teaching/cis662f07 Powerpoint Lecture Notes from John Hennessy
More informationBuses. Disks PCI RDRAM RDRAM LAN. Some slides adapted from lecture by David Culler. Pentium 4 Processor. Memory Controller Hub.
es > 100 MB/sec Pentium 4 Processor L1 and L2 caches Some slides adapted from lecture by David Culler 3.2 GB/sec Display Memory Controller Hub RDRAM RDRAM Dual Ultra ATA/100 24 Mbit/sec Disks LAN I/O Controller
More information