LLVM Performance Improvements and Headroom
|
|
- Peregrine Wade
- 5 years ago
- Views:
Transcription
1 LLVM Performance Improvements and Headroom Gerolf Hoflehner Apple LLVM Developers Meeting 2015 San Jose, CA
2 Messages Tuning and focused local optimizations Advancing optimization technology Getting inspired by heroic optimizations Exposing performance opportunities to developers
3 Benchmarks SPEC CINT2006 1) Kernels LLVM Tests benchmarking-only SPEC CFP2006 (7 C/C++ benchmarks) 1) SPEC is a registered trademark of the Standard Performance Evaluation Cooperation
4 Setup Clang-600 (~LLVM 3.5) vs Clang-700 (~LLVM 3.7) -O3 -FLTO -PGO -O3 -FLTO ARM64
5 Some Performance Gains SPEC CINT2006: +6.5% Kernels: up to 70%
6 Acknowledgements Adam Nemet, Arnold Schwaighofer, Chad Rossier, Chandler Carruth, James Molloy, Michael Zolotukhin, Tyler Nowicki, Yi Jiang and many other contributors of the LLVM community
7 SPEC CINT % 22% 18.35% 19.3% 11.7% 8% 5.05% 4.4% 5.3% 6% 6.5% 3.2% 2.5% 2.6% -1.6% 400.perlbench 401.bzip2 0.9% 403.gcc 429.mcf 1.3% 445.gobmk 456.hmmer 458.sjeng 462.libquantum 464.h264ref -1.6% 471.omnetpp 473.astar 483.xalancbmk Geomean
8 Some Reasons For Gains Condition folding: ((c) >= 'A' && (c) <= 'Z') ((c) >= 'a' && (c) <= z ) cmp + br -> tbnz Unrolling of loops with conditional stores Register pressure aware loop rotation Local (narrow) optimizations
9 Hot Loop: 456.hmmer for (k = 1; k <= M; k++) { mc[k] = mpp[k - 1] + t0[k - 1]; ; d[k] = d[k - 1] + t1[k - 1]; if ((sc = mc[k - 1] + t2[k - 1]) > d[k]) d[k] = sc; ;
10 Hot Loop: 456.hmmer Loop Distribution for (k = 1; k <= M; k++) { mc[k] = mpp[k - 1] + t0[k - 1]; ; for (k = 1; k <= M; k++) { // split d[k] = d[k - 1] + t1[k - 1]; if ((sc = mc[k - 1] + t2[k - 1]) > d[k]) d[k] = sc; ; - Improves cache efficiency - Partial vectorization
11 Hot Loop: 456.hmmer Loop Distribution + Store To Load Forwarding for (k = 1; k <= M; k++) { mc[k] = mpp[k - 1] + t0[k - 1]; ; = d[0]; Tk-1 for (k = 1; k <= M; k++) { // split // d[k]= d[k-1] + t1[k-1] d[k] = T k = + t1[k - 1]; Tk-1 if ((sc = mc[k - 1] + t2[k - 1]) > T k ) T k = d[k] = sc; - Critical Path Shortening
12 Reflection Many local narrowly focused optimizations Loop Distribution advances capabilities of Loop Transformation Framework
13 Kernel Performance 70% 70% 52.5% 53% 35% 17.5% 0% -10% -8% -6% -17.5% SHA Sobel Filter Compress BlackScholes Convolution
14 Convolution : Estimate Loop Unrolling Benefits No Unrolling static const int k[] = { 0, 1, 5 ; for (v = 0; v < size; v++) { r += src[v] * k[v];
15 Convolution : Estimate Loop Unrolling Benefits With Unrolling static const int k[] = { 0, 1, 5 ; r += src[0] * k[0]; r += src[1] * k[1]; r += src[2] * k[2];
16 Convolution : Estimate Loop Unrolling Benefits With Unrolling static const int k[] = { 0, 1, 5 ; r += src[0] * k[0]; r += src[1] * k[1]; r += src[2] * k[2];
17 Convolution : Estimate Loop Unrolling Benefits With Unrolling static const int k[] = { 0, 1, 5 ; r += src[0] * 0; r += src[1] * 1;a r += src[2] * 5;
18 Convolution : Estimate Loop Unrolling Benefits With Unrolling static const int k[] = { 0, 1, 5 ; r += 0; r += src[1];a saves mul + add saves mul r += src[2] * 5;
19 Kernel Regressions SHA (-10%) Aggressive load hoisting Tune Scheduler Sobel Filter (-8%) LSR expression normalization Avoid GEP in base address calculation Compress (-6%) No CCMP optimization due to tbnz Generalize CCMP
20 Performance Changes In LLVM Tests 70.00% 50.00% Gains %Gain 30.00% 10.00% % Losses %
21 Headroom
22 CINT2006 Headroom +5%? +15% Inlining+GlobalOpts +LoopFusion libquantum( 10X ) +10% AoS->SoA libquantum (2X) +2% Tuning? CINT2006 2% 10% 15% 5% 0% 10% 20% 30% 40% Tuning Advanced Heroic Unknown
23 CFP2006 Headroom +5%? +25% AoS->SoA milc (3X) Value Profiling povray(10%) +2% Tuning CFP2006 2% 25% 5% 0% 10% 20% 30% 40% Tuning Advanced Heroic Unknown
24 Array of Structs (AoS) to Structs of Array (SoA)
25 Libquantum: AoS->SoA hot_code(, quantum_reg_node *reg) { int i; for (i = 0; i < reg->size; i++) { if (reg->node[i].state & C) { reg->node[i].state ^= T; struct quantum_reg_node { COMPLEX_FLOAT amplitude; int state; ; struct quantum_reg_node { int width; int size; int hashw; quantum_reg_node *node; int *hash; ; Hot loop uses only some fields of a structure
26 struct quantum_reg_node { COMPLEX_FLOAT amplitude; int state; ; quantum_reg_node N[] for(i=0; i<reg->size; i++){ if(reg->n[i].state & C) { reg->n[i].state ^= T);
27 quantum_reg_node N[] N[0] N[1] N[2] N[0].state N[1].state N[2].state 20 Cache: b 128b 192b fictitious cache line size!
28 struct quantum_reg_node { COMPLEX_FLOAT amplitude; int state; ; quantum_reg_node N[] for(i=0; i<reg->size; i++){ if(reg->n[i].state & C) { reg->n[i].state ^= T);
29 COMPLEX_FLOAT *amplitude; int *state; quantum_reg_node N[] for(i=0; i<reg->size; i++){ if(reg->n[i].state & C) { reg->n[i].state ^= T);
30 COMPLEX_FLOAT *amplitude; int *state; COMLEX_FLOAT amplitude[] MAX_UNSIGNED state[] for(i=0; i<reg->size; i++){ if(reg->n[i].state & C) { reg->n[i].state ^= T);
31 COMPLEX_FLOAT *amplitude; int *state; COMLEX_FLOAT amplitude[] MAX_UNSIGNED state[] for(i=0; i<reg->size; i++){ if(reg->state[i] & C) { reg->state[i] ^= T);
32 COMPLEX_FLOAT *amplitude; int *state; COMLEX_FLOAT amplitude[] MAX_UNSIGNED state[] Cache (after AoS):
33 AoS to SoA speeds up benchmarks and applications that are memory-bandwidth bound and/or cache bound
34 AoS->SoA: Challenges Legality Casts to/from struct type, escaped types, address taken of individual fields, parameter and return values, semantic of constants Transformations Data accesses, memory allocation Usability Debugging
35 AoS->SoA: Changing Structure Definition and Accesses struct quantum_reg_node { COMPLEX_FLOAT amplitude; MAX_UNSIGNED state; ; struct quantum_reg_node { int width; int size; int hashw; quantum_reg_node *node; int *hash; ; struct quantum_reg_node { int width; int size; int hashw; COMLEX_FLOAT *amplitude; int *state; int *hash; ; reg->n[i].state to reg->state[i]
36 Constants if (!reg.state) reg.node = calloc(r; eg.size, sizeof(quantum_reg_node)) %call10 = call %conv, i64 16);
37 Parameter Passing j = quantum_get_state(reg1->node) j = quantum_get_state(reg1->amplitude, reg1->state);
38 References Implementing Data Layout Optimizations in LLVM Framework, Prashantha NR et al., 2014 LLVM Developer Meeting G. Chakrabarti, F. Chow, Structure Layout Optimizations in the Open64 Compiler: Design, Implementation and Measurements. Gautam Chakrabarti, Open64 Workshop at CGO, 2008 O. Golovanevsky et al, Workshops/compiler2007/present/data-layout-optimizations-ingcc.pdf, GCC Workshop, 2007 R. Hundt et al, Practical Structure Layout Optimization and Advice, CGO, 2006
39 More Headroom: Heroics
40 Libquantum: Heroics test_sum() {...; cnot(2 * width - 1, width - 1, reg); sigma_x(2 * width - 1, reg);...; 2 similar functions Hot loops Whole program visibility Alias analysis GlobalModRef Cost Model Inline + Fuse
41 Interprocedural Loop Fusion void cnot(int C, int T, q_reg *reg) { int i; int qec; status(&qec, NULL); if (qec) B1; else { if (foo(x)) return; for (i = 0; i < reg->size; i++) { if (reg->state[i] & C) decohere(reg); void sigma(int T, q_reg *reg) { int i; int qec; status(&qec, NULL); if (qec) B2; else { if (foo(y)) return; for (i = 0; i < reg->size; i++) { decohere(reg);
42 Interprocedural Loop Fusion void cnot(int C, int T, q_reg *reg) { int i; int qec; status(&qec, NULL); if (qec) B1; else { if (foo(x)) return; for (i = 0; i < reg->size; i++) { if (reg->state[i] & C) decohere(reg); void sigma(int T, q_reg *reg) { int i; int qec; status(&qec, NULL); if (qec) B2; else { if (foo(y)) return; for (i = 0; i < reg->size; i++) { decohere(reg);
43 Interprocedural Loop Fusion void void status(&qec, status(&qec, B1; decohere(reg); B2 decohere(reg); decohere(reg);
44 Interprocedural Loop Fusion void void status(&qec, void decohere(q_reg) { status(&qec, Global B1; Value Prop if (status) { B2; do_something; return; decohere(reg); decohere(reg);
45 Interprocedural Loop Fusion void void status(&qec, void decohere(q_reg) { status(&qec, Global B1; Value Prop if (0) { B2 do_something; return; decohere(reg); decohere(reg);
46 Interprocedural Loop Fusion void cnot(int C, int T, q_reg *reg) { int i; int qec; status(&qec, NULL); if (qec) B1; else { if (foo(x)) return; for (i = 0; i < reg->size; i++) { if (reg->state[i] & C) decohere(reg); void sigma(int T, q_reg *reg) { int i; int qec; status(&qec, NULL); if (qec) B2; else { if (foo(y)) return; for (i = 0; i < reg->size; i++) { decohere(reg);
47 Interprocedural Loop Fusion void void status(&qec, NULL); if (qec) B1; status(&qec, NULL); if (qec) B2;
48 Interprocedural Loop Fusion void void Inlining status(&qec, B1; status(&qec, NULL); if (qec) status(&qec, B2;
49 Interprocedural Loop Fusion void void Inlining status(&qec, if (globalvar) status(&qec, B1; B2;
50 Interprocedural Loop Fusion void cnot(int C, int T, q_reg *reg) { int i; int qec; status(&qec, NULL); if (globalvar) B1; else { if (foo(x)) return; for (i = 0; i < reg->size; i++) { if (reg->state[i] & C) decohere(reg); void sigma(int T, q_reg *reg) { int i; int qec; status(&qec, NULL); if (globalvar) B2; else { if (foo(y)) return; for (i = 0; i < reg->size; i++) { decohere(reg);
51 Interprocedural Loop Fusion void GlobalModRef void status(&qec, NULL); if (globalvar) B1; else { status(&qec, NULL); if (globalvar) B2; else {
52 Interprocedural Loop Fusion void GlobalModRef void status(&qec, B1; if (globalvar) { B1; B2; else { status(&qec, if (foo(x)) return; for (i B2= 0; i < reg->size; i++) { if (reg->state[i] & C) reg->state[i] ^= T; if (foo(y)) return; for (i = 0; reg->state[i] i < reg->size; ^= T; i++) {
53 Interprocedural Loop Fusion void GlobalModRef void status(&qec, if (globalvar) { B1; B2; B1; else { status(&qec, if (foo(x)) return; for (i B2= 0; i < reg->size; i++) { if (reg->state[i] & C) reg->state[i] ^= T; if (foo(y)) return; for (i = 0; reg->state[i] i < reg->size; ^= T; i++) {
54 Interprocedural Loop Fusion void GlobalModRef void status(&qec, if (globalvar) { B1; B2; B1; else { status(&qec, if (foo(x)) return; B2 if (foo(y)) return; for (i = 0; i < reg->size; i++) { if (reg->state[i] & C) for (i = 0; reg->state[i] i < reg->size; ^= T; i++) {
55 Interprocedural Loop Fusion void GlobalModRef void status(&qec, if (globalvar) { B1; B2; B1; else { Fuse bool V1 = foo(x); status(&qec, bool V2 = foo(y); if (V1 V2) B2 return; for (i = 0; i < reg->size; i++) { if (reg->state[i] & C) for (i = 0; reg->state[i] i < reg->size; ^= T; i++) {
56 Interprocedural Loop Fusion void GlobalModRef void status(&qec, if (globalvar) { B1; B2; B1; else { Fused Loops bool V1 = foo(x); status(&qec, bool V2 = foo(y); if (V1 V2) B2 return; for (i = 0; i < reg->size; i++) { if (reg->state[i] & C)
57 What can we learn? Optimization Scope: Call Chain Concept: Function Similarity Challenge: Hoist statements across loops Techniques: Global Value Prop, Partial Inlining, GlobalModRef, Loop Fusion,
58 Take Aways Techniques needed for heroics can be generalized to advance optimization technology Cost Model?
59 How to expose performance opportunities to developers?
60 builtin_nontemporal_store void scaledcpy(float * restrict a, float * restrict b, float S, int N) { for (int i = 0; i < N; i++) b[i] = S * a[i];
61 builtin_nontemporal_store void scaledcpy(float * restrict a, float * restrict b, float S, int N) { for (int i = 0; i < N; i++) // b[i] = S * a[i]; builtin_nontemporal_store(s * a[i], &b[i]);
62 Vectorizer: Hints and Diagnostics while (good) { for (i = 0; i < N; i++) { DW[i] = A[i - 3] + B[i - 2] + C[i - 1] + D[i]; UW[i] = A[i] + B[i + 1] + C[i + 2] + D[i + 3]; remark: loop not vectorized: Avoid runtime pointer checking when you know the arrays will always be independent by specifying '#pragma clang loop vectorize(assume_safety) before the loop or by specifying 'restrict' on the array arguments. Erroneous results will occur if these options are incorrectly applied!
63 Conclusions The best days for LLVM performance are ahead of us
64 Questions?
A Fast Instruction Set Simulator for RISC-V
A Fast Instruction Set Simulator for RISC-V Maxim.Maslov@esperantotech.com Vadim.Gimpelson@esperantotech.com Nikita.Voronov@esperantotech.com Dave.Ditzel@esperantotech.com Esperanto Technologies, Inc.
More informationCS377P Programming for Performance Single Thread Performance Out-of-order Superscalar Pipelines
CS377P Programming for Performance Single Thread Performance Out-of-order Superscalar Pipelines Sreepathi Pai UTCS September 14, 2015 Outline 1 Introduction 2 Out-of-order Scheduling 3 The Intel Haswell
More informationFunction Merging by Sequence Alignment
Function Merging by Sequence Alignment Rodrigo C. O. Rocha Pavlos Petoumenos Zheng Wang University of Edinburgh, UK University of Edinburgh, UK Lancaster University, UK r.rocha@ed.ac.uk ppetoume@inf.ed.ac.uk
More informationResource-Conscious Scheduling for Energy Efficiency on Multicore Processors
Resource-Conscious Scheduling for Energy Efficiency on Andreas Merkel, Jan Stoess, Frank Bellosa System Architecture Group KIT The cooperation of Forschungszentrum Karlsruhe GmbH and Universität Karlsruhe
More informationKruiser: Semi-synchronized Nonblocking Concurrent Kernel Heap Buffer Overflow Monitoring
NDSS 2012 Kruiser: Semi-synchronized Nonblocking Concurrent Kernel Heap Buffer Overflow Monitoring Donghai Tian 1,2, Qiang Zeng 2, Dinghao Wu 2, Peng Liu 2 and Changzhen Hu 1 1 Beijing Institute of Technology
More informationEnhanced Operating System Security Through Efficient and Fine-grained Address Space Randomization
Enhanced Operating System Security Through Efficient and Fine-grained Address Space Randomization Anton Kuijsten Andrew S. Tanenbaum Vrije Universiteit Amsterdam 21st USENIX Security Symposium Bellevue,
More informationBuilding opensuse with link-time optimizations. Jan Hubička and Martin Liška SUSElabs
Building opensuse with link-time optimizations Jan Hubička and Martin Liška SUSElabs jh@suse.cz, mliska@suse.cz Outlilne What is link-time optimization? Link-time optimization and GCC Benchmarks Can we
More informationINSTRUCTION LEVEL PARALLELISM
INSTRUCTION LEVEL PARALLELISM Slides by: Pedro Tomás Additional reading: Computer Architecture: A Quantitative Approach, 5th edition, Chapter 2 and Appendix H, John L. Hennessy and David A. Patterson,
More informationLightweight Memory Tracing
Lightweight Memory Tracing Mathias Payer*, Enrico Kravina, Thomas Gross Department of Computer Science ETH Zürich, Switzerland * now at UC Berkeley Memory Tracing via Memlets Execute code (memlets) for
More informationCoverage-guided Fuzzing of Individual Functions Without Source Code
Coverage-guided Fuzzing of Individual Functions Without Source Code Alessandro Di Federico Politecnico di Milano October 25, 2018 1 Index Coverage-guided fuzzing An overview of rev.ng Experimental results
More informationLLVM performance optimization for z Systems
LLVM performance optimization for z Systems Dr. Ulrich Weigand Senior Technical Staff Member GNU/Linux Compilers & Toolchain Date: Mar 27, 2017 2017 IBM Corporation Agenda LLVM on z Systems Performance
More informationDynamic Trace Analysis with Zero-Suppressed BDDs
University of Colorado, Boulder CU Scholar Electrical, Computer & Energy Engineering Graduate Theses & Dissertations Electrical, Computer & Energy Engineering Spring 4-1-2011 Dynamic Trace Analysis with
More informationUCB CS61C : Machine Structures
inst.eecs.berkeley.edu/~cs61c UCB CS61C : Machine Structures Lecture 36 Performance 2010-04-23 Lecturer SOE Dan Garcia How fast is your computer? Every 6 months (Nov/June), the fastest supercomputers in
More informationAutotuning. John Cavazos. University of Delaware UNIVERSITY OF DELAWARE COMPUTER & INFORMATION SCIENCES DEPARTMENT
Autotuning John Cavazos University of Delaware What is Autotuning? Searching for the best code parameters, code transformations, system configuration settings, etc. Search can be Quasi-intelligent: genetic
More informationImproving Cache Performance by Exploi7ng Read- Write Disparity. Samira Khan, Alaa R. Alameldeen, Chris Wilkerson, Onur Mutlu, and Daniel A.
Improving Cache Performance by Exploi7ng Read- Write Disparity Samira Khan, Alaa R. Alameldeen, Chris Wilkerson, Onur Mutlu, and Daniel A. Jiménez Summary Read misses are more cri?cal than write misses
More informationSuper-optimizing LLVM IR
Super-optimizing LLVM IR Duncan Sands DeepBlueCapital / CNRS Thanks to Google for sponsorship Super optimization Optimization Improve code Super optimization Optimization Improve code Super-optimization
More informationComputer Architecture. Introduction
to Computer Architecture 1 Computer Architecture What is Computer Architecture From Wikipedia, the free encyclopedia In computer engineering, computer architecture is a set of rules and methods that describe
More informationUnder the Compiler's Hood: Supercharge Your PLAYSTATION 3 (PS3 ) Code. Understanding your compiler is the key to success in the gaming world.
Under the Compiler's Hood: Supercharge Your PLAYSTATION 3 (PS3 ) Code. Understanding your compiler is the key to success in the gaming world. Supercharge your PS3 game code Part 1: Compiler internals.
More informationFDO: Magic Make My Program Faster compilation option?
FDO: Magic Make My Program Faster compilation option? Paweł Moll Embedded Linux Conference Europe, Berlin, October 2016 Agenda FDO Basics Instrumentation based FDO Sample based ( Auto ) FDO Deployments
More informationHigh System-Code Security with Low Overhead
High System-Code Security with Low Overhead Jonas Wagner, Volodymyr Kuznetsov, George Candea, and Johannes Kinder École Polytechnique Fédérale de Lausanne Royal Holloway, University of London High System-Code
More informationArchitecture of Parallel Computer Systems - Performance Benchmarking -
Architecture of Parallel Computer Systems - Performance Benchmarking - SoSe 18 L.079.05810 www.uni-paderborn.de/pc2 J. Simon - Architecture of Parallel Computer Systems SoSe 2018 < 1 > Definition of Benchmark
More informationMicroarchitecture Overview. Performance
Microarchitecture Overview Prof. Scott Rixner Duncan Hall 3028 rixner@rice.edu January 15, 2007 Performance 4 Make operations faster Process improvements Circuit improvements Use more transistors to make
More informationPerformance. Jin-Soo Kim Computer Systems Laboratory Sungkyunkwan University
Performance Jin-Soo Kim (jinsookim@skku.edu) Computer Systems Laboratory Sungkyunkwan University http://csl.skku.edu Defining Performance (1) Which airplane has the best performance? Boeing 777 Boeing
More informationBalancing DRAM Locality and Parallelism in Shared Memory CMP Systems
Balancing DRAM Locality and Parallelism in Shared Memory CMP Systems Min Kyu Jeong, Doe Hyun Yoon^, Dam Sunwoo*, Michael Sullivan, Ikhwan Lee, and Mattan Erez The University of Texas at Austin Hewlett-Packard
More informationEnergy Proportional Datacenter Memory. Brian Neel EE6633 Fall 2012
Energy Proportional Datacenter Memory Brian Neel EE6633 Fall 2012 Outline Background Motivation Related work DRAM properties Designs References Background The Datacenter as a Computer Luiz André Barroso
More informationExploi'ng Compressed Block Size as an Indicator of Future Reuse
Exploi'ng Compressed Block Size as an Indicator of Future Reuse Gennady Pekhimenko, Tyler Huberty, Rui Cai, Onur Mutlu, Todd C. Mowry Phillip B. Gibbons, Michael A. Kozuch Execu've Summary In a compressed
More informationUCB CS61C : Machine Structures
inst.eecs.berkeley.edu/~cs61c UCB CS61C : Machine Structures Lecture 38 Performance 2008-04-30 Lecturer SOE Dan Garcia How fast is your computer? Every 6 months (Nov/June), the fastest supercomputers in
More informationLLVM Auto-Vectorization
LLVM Auto-Vectorization Past Present Future Renato Golin LLVM Auto-Vectorization Plan: What is auto-vectorization? Short-history of the LLVM vectorizer What do we support today, and an overview of how
More informationEvaluation of RISC-V RTL with FPGA-Accelerated Simulation
Evaluation of RISC-V RTL with FPGA-Accelerated Simulation Donggyu Kim, Christopher Celio, David Biancolin, Jonathan Bachrach, Krste Asanovic CARRV 2017 10/14/2017 Evaluation Methodologies For Computer
More informationCall Paths for Pin Tools
, Xu Liu, and John Mellor-Crummey Department of Computer Science Rice University CGO'14, Orlando, FL February 17, 2014 What is a Call Path? main() A() B() Foo() { x = *ptr;} Chain of function calls that
More informationNightWatch: Integrating Transparent Cache Pollution Control into Dynamic Memory Allocation Systems
NightWatch: Integrating Transparent Cache Pollution Control into Dynamic Memory Allocation Systems Rentong Guo 1, Xiaofei Liao 1, Hai Jin 1, Jianhui Yue 2, Guang Tan 3 1 Huazhong University of Science
More informationScheduling the Intel Core i7
Third Year Project Report University of Manchester SCHOOL OF COMPUTER SCIENCE Scheduling the Intel Core i7 Ibrahim Alsuheabani Degree Programme: BSc Software Engineering Supervisor: Prof. Alasdair Rawsthorne
More informationLinux Performance on IBM zenterprise 196
Martin Kammerer martin.kammerer@de.ibm.com 9/27/10 Linux Performance on IBM zenterprise 196 visit us at http://www.ibm.com/developerworks/linux/linux390/perf/index.html Trademarks IBM, the IBM logo, and
More informationSandbox Based Optimal Offset Estimation [DPC2]
Sandbox Based Optimal Offset Estimation [DPC2] Nathan T. Brown and Resit Sendag Department of Electrical, Computer, and Biomedical Engineering Outline Motivation Background/Related Work Sequential Offset
More informationSSE and SSE2. Timothy A. Chagnon 18 September All images from Intel 64 and IA 32 Architectures Software Developer's Manuals
SSE and SSE2 Timothy A. Chagnon 18 September 2007 All images from Intel 64 and IA 32 Architectures Software Developer's Manuals Overview SSE: Streaming SIMD (Single Instruction Multiple Data) Extensions
More informationEnergy Models for DVFS Processors
Energy Models for DVFS Processors Thomas Rauber 1 Gudula Rünger 2 Michael Schwind 2 Haibin Xu 2 Simon Melzner 1 1) Universität Bayreuth 2) TU Chemnitz 9th Scheduling for Large Scale Systems Workshop July
More informationLoop-Oriented Array- and Field-Sensitive Pointer Analysis for Automatic SIMD Vectorization
Loop-Oriented Array- and Field-Sensitive Pointer Analysis for Automatic SIMD Vectorization Yulei Sui, Xiaokang Fan, Hao Zhou and Jingling Xue School of Computer Science and Engineering The University of
More informationWhat run-time services could help scientific programming?
1 What run-time services could help scientific programming? Stephen Kell stephen.kell@cl.cam.ac.uk Computer Laboratory University of Cambridge Contrariwise... 2 Some difficulties of software performance!
More informationEffectiveSan: Type and Memory Error Detection using Dynamically Typed C/C++
EffectiveSan: Type and Memory Error Detection using Dynamically Typed C/C++ Gregory J. Duck and Roland H. C. Yap Department of Computer Science National University of Singapore {gregory, ryap}@comp.nus.edu.sg
More informationField Analysis. Last time Exploit encapsulation to improve memory system performance
Field Analysis Last time Exploit encapsulation to improve memory system performance This time Exploit encapsulation to simplify analysis Two uses of field analysis Escape analysis Object inlining April
More informationFast, precise dynamic checking of types and bounds in C
Fast, precise dynamic checking of types and bounds in C Stephen Kell stephen.kell@cl.cam.ac.uk Computer Laboratory University of Cambridge p.1 Tool wanted if (obj >type == OBJ COMMIT) { if (process commit(walker,
More informationImproving Cache Performance by Exploi7ng Read- Write Disparity. Samira Khan, Alaa R. Alameldeen, Chris Wilkerson, Onur Mutlu, and Daniel A.
Improving Cache Performance by Exploi7ng Read- Write Disparity Samira Khan, Alaa R. Alameldeen, Chris Wilkerson, Onur Mutlu, and Daniel A. Jiménez Summary Read misses are more cri?cal than write misses
More informationGenerating Low-Overhead Dynamic Binary Translators
Generating Low-Overhead Dynamic Binary Translators Mathias Payer ETH Zurich, Switzerland mathias.payer@inf.ethz.ch Thomas R. Gross ETH Zurich, Switzerland trg@inf.ethz.ch Abstract Dynamic (on the fly)
More informationA C++/CUDA DSL for Object-oriented Programming with Structure-of-Arrays Layout
A C++/CUDA DSL for Object-oriented Programming with Structure-of-Arrays Layout Matthias Springer Tokyo Institute of Technology CGO 2018, ACM Student Research Competition AOS vs. SOA AOS: Array of Structures
More informationPerformance Characterization of SPEC CPU Benchmarks on Intel's Core Microarchitecture based processor
Performance Characterization of SPEC CPU Benchmarks on Intel's Core Microarchitecture based processor Sarah Bird ϕ, Aashish Phansalkar ϕ, Lizy K. John ϕ, Alex Mericas α and Rajeev Indukuru α ϕ University
More informationApple LLVM GPU Compiler: Embedded Dragons. Charu Chandrasekaran, Apple Marcello Maggioni, Apple
Apple LLVM GPU Compiler: Embedded Dragons Charu Chandrasekaran, Apple Marcello Maggioni, Apple 1 Agenda How Apple uses LLVM to build a GPU Compiler Factors that affect GPU performance The Apple GPU compiler
More informationSVF: Static Value-Flow Analysis in LLVM
SVF: Static Value-Flow Analysis in LLVM Yulei Sui, Peng Di, Ding Ye, Hua Yan and Jingling Xue School of Computer Science and Engineering The University of New South Wales 2052 Sydney Australia March 18,
More informationCUDA OPTIMIZATION WITH NVIDIA NSIGHT ECLIPSE EDITION
CUDA OPTIMIZATION WITH NVIDIA NSIGHT ECLIPSE EDITION WHAT YOU WILL LEARN An iterative method to optimize your GPU code Some common bottlenecks to look out for Performance diagnostics with NVIDIA Nsight
More information9/5/17. The Design and Implementation of Programming Languages. Compilation. Interpretation. Compilation vs. Interpretation. Hybrid Implementation
Language Implementation Methods The Design and Implementation of Programming Languages Compilation Interpretation Hybrid In Text: Chapter 1 2 Compilation Interpretation Translate high-level programs to
More informationA Framework for Safe Automatic Data Reorganization
Compiler Technology A Framework for Safe Automatic Data Reorganization Shimin Cui (Speaker), Yaoqing Gao, Roch Archambault, Raul Silvera IBM Toronto Software Lab Peng Peers Zhao, Jose Nelson Amaral University
More informationEnergy-centric DVFS Controlling Method for Multi-core Platforms
Energy-centric DVFS Controlling Method for Multi-core Platforms Shin-gyu Kim, Chanho Choi, Hyeonsang Eom, Heon Y. Yeom Seoul National University, Korea MuCoCoS 2012 Salt Lake City, Utah Abstract Goal To
More informationClang - the C, C++ Compiler
Clang - the C, C++ Compiler Contents Clang - the C, C++ Compiler o SYNOPSIS o DESCRIPTION o OPTIONS SYNOPSIS clang [options] filename... DESCRIPTION clang is a C, C++, and Objective-C compiler which encompasses
More informationAutoTuneTMP: Auto-Tuning in C++ With Runtime Template Metaprogramming
AutoTuneTMP: Auto-Tuning in C++ With Runtime Template Metaprogramming David Pfander, Malte Brunn, Dirk Pflüger University of Stuttgart, Germany May 25, 2018 Vancouver, Canada, iwapt18 May 25, 2018 Vancouver,
More informationSecurity-Aware Processor Architecture Design. CS 6501 Fall 2018 Ashish Venkat
Security-Aware Processor Architecture Design CS 6501 Fall 2018 Ashish Venkat Agenda Common Processor Performance Metrics Identifying and Analyzing Bottlenecks Benchmarking and Workload Selection Performance
More informationSimone Campanoni Loop transformations
Simone Campanoni simonec@eecs.northwestern.edu Loop transformations Outline Simple loop transformations Loop invariants Induction variables Complex loop transformations Simple loop transformations Simple
More informationFootprint-based Locality Analysis
Footprint-based Locality Analysis Xiaoya Xiang, Bin Bao, Chen Ding University of Rochester 2011-11-10 Memory Performance On modern computer system, memory performance depends on the active data usage.
More informationgpucc: An Open-Source GPGPU Compiler
gpucc: An Open-Source GPGPU Compiler Jingyue Wu, Artem Belevich, Eli Bendersky, Mark Heffernan, Chris Leary, Jacques Pienaar, Bjarke Roune, Rob Springer, Xuetian Weng, Robert Hundt One-Slide Overview Motivation
More informationgpucc: An Open-Source GPGPU Compiler
gpucc: An Open-Source GPGPU Compiler Jingyue Wu, Artem Belevich, Eli Bendersky, Mark Heffernan, Chris Leary, Jacques Pienaar, Bjarke Roune, Rob Springer, Xuetian Weng, Robert Hundt One-Slide Overview Motivation
More informationAMD S X86 OPEN64 COMPILER. Michael Lai AMD
AMD S X86 OPEN64 COMPILER Michael Lai AMD CONTENTS Brief History AMD and Open64 Compiler Overview Major Components of Compiler Important Optimizations Recent Releases Performance Applications and Libraries
More informationSSA Based Mobile Code: Construction and Empirical Evaluation
SSA Based Mobile Code: Construction and Empirical Evaluation Wolfram Amme Friedrich Schiller University Jena, Germany Michael Franz Universit of California, Irvine, USA Jeffery von Ronne Universtity of
More informationClass Project Report
15-745 Class Project Report Brendan Meeder (bmeeder@cs), Jamie Morgenstern (jaimiemmt@cs), Richard Peng (richard.peng@gmail.com) April 27, 2011 Abstract Selecting an ordering for running optimizations
More informationPolly Polyhedral Optimizations for LLVM
Polly Polyhedral Optimizations for LLVM Tobias Grosser - Hongbin Zheng - Raghesh Aloor Andreas Simbürger - Armin Grösslinger - Louis-Noël Pouchet April 03, 2011 Polly - Polyhedral Optimizations for LLVM
More informationNovember IBM XL C/C++ Compilers Insights on Improving Your Application
November 2010 IBM XL C/C++ Compilers Insights on Improving Your Application Page 1 Table of Contents Purpose of this document...2 Overview...2 Performance...2 Figure 1:...3 Figure 2:...4 Exploiting the
More informationComputer Sciences Department
Computer Sciences Department Compiler Construction of Idempotent Regions Marc de Kruijf Karthikeyan Sankaralingam Somesh Jha Technical Report #1700 November 2011 Compiler Construction of Idempotent Regions
More informationIntroduction to C++ Introduction. Structure of a C++ Program. Structure of a C++ Program. C++ widely-used general-purpose programming language
Introduction C++ widely-used general-purpose programming language procedural and object-oriented support strong support created by Bjarne Stroustrup starting in 1979 based on C Introduction to C++ also
More informationSimBench. A Portable Benchmarking Methodology for Full-System Simulators. Harry Wagstaff Bruno Bodin Tom Spink Björn Franke
SimBench A Portable Benchmarking Methodology for Full-System Simulators Harry Wagstaff Bruno Bodin Tom Spink Björn Franke Institute for Computing Systems Architecture University of Edinburgh ISPASS 2017
More informationPlease refer to the turn-in procedure document on the website for instructions on the turn-in procedure.
1 CSE 131 Winter 2013 Compiler Project #2 -- Code Generation Due Date: Friday, March 15 th 2013 @ 11:59pm Disclaimer This handout is not perfect; corrections may be made. Updates and major clarifications
More informationLecture 20 CIS 341: COMPILERS
Lecture 20 CIS 341: COMPILERS Announcements HW5: OAT v. 2.0 records, function pointers, type checking, array-bounds checks, etc. Due: TOMORROW Wednesday, April 11 th Zdancewic CIS 341: Compilers 2 A high-level
More informationIntroduction to C++ with content from
Introduction to C++ with content from www.cplusplus.com 2 Introduction C++ widely-used general-purpose programming language procedural and object-oriented support strong support created by Bjarne Stroustrup
More informationIA-64 Compiler Technology
IA-64 Compiler Technology David Sehr, Jay Bharadwaj, Jim Pierce, Priti Shrivastav (speaker), Carole Dulong Microcomputer Software Lab Page-1 Introduction IA-32 compiler optimizations Profile Guidance (PGOPTI)
More informationShort Notes of CS201
#includes: Short Notes of CS201 The #include directive instructs the preprocessor to read and include a file into a source code file. The file name is typically enclosed with < and > if the file is a system
More informationEKT 303 WEEK Pearson Education, Inc., Hoboken, NJ. All rights reserved.
+ EKT 303 WEEK 2 2016 Pearson Education, Inc., Hoboken, NJ. All rights reserved. Chapter 2 + Performance Issues + Designing for Performance The cost of computer systems continues to drop dramatically,
More informationNear-Threshold Computing: How Close Should We Get?
Near-Threshold Computing: How Close Should We Get? Alaa R. Alameldeen Intel Labs Workshop on Near-Threshold Computing June 14, 2014 Overview High-level talk summarizing my architectural perspective on
More informationStructure Array Copy Optimization
Structure Array Copy Optimization 1. Objective Modern programming language use structure to gather relative datum, in some of the program, only part of the structure is accessed. If the size of accessed
More informationOpenPrefetch. (in-progress)
OpenPrefetch Let There Be Industry-Competitive Prefetching in RISC-V Processors (in-progress) Bowen Huang, Zihao Yu, Zhigang Liu, Chuanqi Zhang, Sa Wang, Yungang Bao Institute of Computing Technology(ICT),
More informationSEN361 Computer Organization. Prof. Dr. Hasan Hüseyin BALIK (2 nd Week)
+ SEN361 Computer Organization Prof. Dr. Hasan Hüseyin BALIK (2 nd Week) + Outline 1. Overview 1.1 Basic Concepts and Computer Evolution 1.2 Performance Issues + 1.2 Performance Issues + Designing for
More informationTDT4255 Computer Design. Lecture 1. Magnus Jahre
1 TDT4255 Computer Design Lecture 1 Magnus Jahre 2 Outline Practical course information Chapter 1: Computer Abstractions and Technology 3 Practical Course Information 4 TDT4255 Computer Design TDT4255
More informationEfficient Memory Shadowing for 64-bit Architectures
Efficient Memory Shadowing for 64-bit Architectures The MIT Faculty has made this article openly available. Please share how this access benefits you. Your story matters. Citation Qin Zhao, Derek Bruening,
More informationCS201 - Introduction to Programming Glossary By
CS201 - Introduction to Programming Glossary By #include : The #include directive instructs the preprocessor to read and include a file into a source code file. The file name is typically enclosed with
More informationCS 261 Fall C Introduction. Variables, Memory Model, Pointers, and Debugging. Mike Lam, Professor
CS 261 Fall 2017 Mike Lam, Professor C Introduction Variables, Memory Model, Pointers, and Debugging The C Language Systems language originally developed for Unix Imperative, compiled language with static
More informationFinal CSE 131B Winter 2003
Login name Signature Name Student ID Final CSE 131B Winter 2003 Page 1 Page 2 Page 3 Page 4 Page 5 Page 6 Page 7 Page 8 _ (20 points) _ (25 points) _ (21 points) _ (40 points) _ (30 points) _ (25 points)
More informationStructure Layout Optimizations in the Open64 Compiler: Design, Implementation and Measurements
Structure Layout Optimizations in the Open64 Compiler: Design, Implementation and Measurements Gautam Chakrabarti, Fred Chow PathScale, LLC. {gautam, fchow}@pathscale.com Abstract A common performance
More informationCompiling for Performance on hp OpenVMS I64. Doug Gordon Original Presentation by Bill Noyce European Technical Update Days, 2005
Compiling for Performance on hp OpenVMS I64 Doug Gordon Original Presentation by Bill Noyce European Technical Update Days, 2005 Compilers discussed C, Fortran, [COBOL, Pascal, BASIC] Share GEM optimizer
More informationCell Programming Tips & Techniques
Cell Programming Tips & Techniques Course Code: L3T2H1-58 Cell Ecosystem Solutions Enablement 1 Class Objectives Things you will learn Key programming techniques to exploit cell hardware organization and
More informationPractical Data Compression for Modern Memory Hierarchies
Practical Data Compression for Modern Memory Hierarchies Thesis Oral Gennady Pekhimenko Committee: Todd Mowry (Co-chair) Onur Mutlu (Co-chair) Kayvon Fatahalian David Wood, University of Wisconsin-Madison
More informationSpatial Memory Streaming (with rotated patterns)
Spatial Memory Streaming (with rotated patterns) Michael Ferdman, Stephen Somogyi, and Babak Falsafi Computer Architecture Lab at 2006 Stephen Somogyi The Memory Wall Memory latency 100 s clock cycles;
More informationScalable Dynamic Task Scheduling on Adaptive Many-Cores
Introduction: Many- Paradigm [Our Definition] Scalable Dynamic Task Scheduling on Adaptive Many-s Vanchinathan Venkataramani, Anuj Pathania, Muhammad Shafique, Tulika Mitra, Jörg Henkel Bus CES Chair for
More informationAddressing End-to-End Memory Access Latency in NoC-Based Multicores
Addressing End-to-End Memory Access Latency in NoC-Based Multicores Akbar Sharifi, Emre Kultursay, Mahmut Kandemir and Chita R. Das The Pennsylvania State University University Park, PA, 682, USA {akbar,euk39,kandemir,das}@cse.psu.edu
More informationChapter 1. Computer Abstractions and Technology. Adapted by Paulo Lopes, IST
Chapter 1 Computer Abstractions and Technology Adapted by Paulo Lopes, IST The Computer Revolution Progress in computer technology Sustained by Moore s Law Makes novel and old applications feasible Computers
More informationBentley Rules for Optimizing Work
6.172 Performance Engineering of Software Systems SPEED LIMIT PER ORDER OF 6.172 LECTURE 2 Bentley Rules for Optimizing Work Charles E. Leiserson September 11, 2012 2012 Charles E. Leiserson and I-Ting
More informationOptimising with the IBM compilers
Optimising with the IBM Overview Introduction Optimisation techniques compiler flags compiler hints code modifications Optimisation topics locals and globals conditionals data types CSE divides and square
More informationArchitectural Supports to Protect OS Kernels from Code-Injection Attacks
Architectural Supports to Protect OS Kernels from Code-Injection Attacks 2016-06-18 Hyungon Moon, Jinyong Lee, Dongil Hwang, Seonhwa Jung, Jiwon Seo and Yunheung Paek Seoul National University 1 Why to
More informationHexType: Efficient Detection of Type Confusion Errors for C++ Yuseok Jeon Priyam Biswas Scott A. Carr Byoungyoung Lee Mathias Payer
HexType: Efficient Detection of Type Confusion Errors for C++ Yuseok Jeon Priyam Biswas Scott A. Carr Byoungyoung Lee Mathias Payer Motivation C++ is a popular programming language Google Chrome, Firefox,
More informationBaikal-T1 Microprocessor Performance Tests
Baikal-T1 Microprocessor Performance Tests Revision list Revision Date Author Description 1.0 15.03.2017 Initial version 1.1 08.08.2017 Added SPEC CPU2006 Int, iperf results Revision list... 1 1. List
More informationIs dynamic compilation possible for embedded system?
Is dynamic compilation possible for embedded system? Scopes 2015, St Goar Victor Lomüller, Henri-Pierre Charles CEA DACLE / Grenoble www.cea.fr June 2 2015 Introduction : Wake Up Questions Session FAQ
More informationCIS 341 Final Examination 4 May 2017
CIS 341 Final Examination 4 May 2017 1 /14 2 /15 3 /12 4 /14 5 /34 6 /21 7 /10 Total /120 Do not begin the exam until you are told to do so. You have 120 minutes to complete the exam. There are 14 pages
More informationstd::assume_aligned Abstract
std::assume_aligned Timur Doumler (papers@timur.audio) Chandler Carruth (chandlerc@google.com) Document #: P1007R0 Date: 2018-05-04 Project: Programming Language C++ Audience: Library Evolution Working
More informationLLVM-based Communication Optimizations for PGAS Programs
LLVM-based Communication Optimizations for PGAS Programs nd Workshop on the LLVM Compiler Infrastructure in HPC @ SC15 Akihiro Hayashi (Rice University) Jisheng Zhao (Rice University) Michael Ferguson
More informationPostprint. This is the accepted version of a paper presented at CGO 2017, February 4 8, Austin, TX.
http://www.diva-portal.org Postprint This is the accepted version of a paper presented at CGO 2017, February 4 8, Austin, TX. Citation for the original published paper: Tran, K-A., Carlson, T E., Koukos,
More informationLLVM for the future of Supercomputing
LLVM for the future of Supercomputing Hal Finkel hfinkel@anl.gov 2017-03-27 2017 European LLVM Developers' Meeting What is Supercomputing? Computing for large, tightly-coupled problems. Lots of computational
More information