LLVM Performance Improvements and Headroom

1 LLVM Performance Improvements and Headroom Gerolf Hoflehner Apple LLVM Developers Meeting 2015 San Jose, CA

2 Messages
- Tuning and focused local optimizations
- Advancing optimization technology
- Getting inspired by heroic optimizations
- Exposing performance opportunities to developers

3 Benchmarks
- SPEC CINT2006 1)
- Kernels
- LLVM Tests (benchmarking-only)
- SPEC CFP2006 (7 C/C++ benchmarks)
1) SPEC is a registered trademark of the Standard Performance Evaluation Corporation

4 Setup Clang-600 (~LLVM 3.5) vs Clang-700 (~LLVM 3.7) -O3 -FLTO -PGO -O3 -FLTO ARM64

5 Some Performance Gains SPEC CINT2006: +6.5% Kernels: up to 70%

6 Acknowledgements Adam Nemet, Arnold Schwaighofer, Chad Rosier, Chandler Carruth, James Molloy, Michael Zolotukhin, Tyler Nowicki, Yi Jiang and many other contributors of the LLVM community

7 SPEC CINT2006 [Bar chart: per-benchmark performance change for 400.perlbench, 401.bzip2, 403.gcc, 429.mcf, 445.gobmk, 456.hmmer, 458.sjeng, 462.libquantum, 464.h264ref, 471.omnetpp, 473.astar, 483.xalancbmk and the Geomean]

8 Some Reasons For Gains
- Condition folding: ((c) >= 'A' && (c) <= 'Z') ((c) >= 'a' && (c) <= 'z')
- cmp + br -> tbnz
- Unrolling of loops with conditional stores
- Register pressure aware loop rotation
- Local (narrow) optimizations
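
As a rough illustration of condition folding: assuming the two range checks on this slide are joined by `||` (a typical is-letter test) and that the character set is ASCII, they can be collapsed into a single unsigned comparison. The function names are hypothetical, and the folded form is one well-known equivalent, not necessarily the exact code LLVM produces.

```c
#include <stdbool.h>

/* Two range checks as written in source: up to four compares. */
bool is_letter_naive(int c) {
    return (c >= 'A' && c <= 'Z') || (c >= 'a' && c <= 'z');
}

/* Folded form (assumes ASCII): setting bit 5 maps 'A'..'Z' onto 'a'..'z',
 * and the unsigned subtraction turns the range check into one compare. */
bool is_letter_folded(int c) {
    return (((unsigned)c | 0x20u) - (unsigned)'a') < 26u;
}
```

Both functions agree for every `int` input, which is what makes the fold legal.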

9 Hot Loop: 456.hmmer
for (k = 1; k <= M; k++) {
  mc[k] = mpp[k - 1] + t0[k - 1];
  ...;
  d[k] = d[k - 1] + t1[k - 1];
  if ((sc = mc[k - 1] + t2[k - 1]) > d[k]) d[k] = sc;
  ...;
}

10 Hot Loop: 456.hmmer Loop Distribution
for (k = 1; k <= M; k++) {
  mc[k] = mpp[k - 1] + t0[k - 1];
  ...;
}
for (k = 1; k <= M; k++) { // split
  d[k] = d[k - 1] + t1[k - 1];
  if ((sc = mc[k - 1] + t2[k - 1]) > d[k]) d[k] = sc;
  ...;
}
- Improves cache efficiency
- Partial vectorization
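
The split shown on this slide can be sketched as a pair of self-contained C functions. This is a minimal sketch assuming the arrays do not overlap; the function names and the `int` element type are illustrative assumptions, and the statements elided on the slide are left out.

```c
/* Original loop: every iteration touches mc, mpp, t0, d, t1 and t2. */
void hot_loop_fused(int M, int *mc, int *d, const int *mpp,
                    const int *t0, const int *t1, const int *t2) {
    for (int k = 1; k <= M; k++) {
        mc[k] = mpp[k - 1] + t0[k - 1];
        d[k] = d[k - 1] + t1[k - 1];
        int sc = mc[k - 1] + t2[k - 1];
        if (sc > d[k]) d[k] = sc;
    }
}

/* Distributed form (assumes no aliasing between the arrays). */
void hot_loop_distributed(int M, int *mc, int *d, const int *mpp,
                          const int *t0, const int *t1, const int *t2) {
    /* First loop has no loop-carried dependence, so it can be vectorized. */
    for (int k = 1; k <= M; k++)
        mc[k] = mpp[k - 1] + t0[k - 1];
    /* Second loop keeps the recurrence on d and stays scalar. */
    for (int k = 1; k <= M; k++) {
        d[k] = d[k - 1] + t1[k - 1];
        int sc = mc[k - 1] + t2[k - 1];
        if (sc > d[k]) d[k] = sc;
    }
}
```

The second loop reads `mc[k - 1]`, which the fused loop wrote one iteration earlier and the distributed form wrote in the first loop, so both versions see the same value.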

11 Hot Loop: 456.hmmer Loop Distribution + Store To Load Forwarding
for (k = 1; k <= M; k++) {
  mc[k] = mpp[k - 1] + t0[k - 1];
  ...;
}
T_{k-1} = d[0];
for (k = 1; k <= M; k++) { // split
  // d[k] = d[k-1] + t1[k-1]
  d[k] = T_k = T_{k-1} + t1[k - 1];
  if ((sc = mc[k - 1] + t2[k - 1]) > T_k) T_k = d[k] = sc;
}
- Critical Path Shortening
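
A minimal sketch of the forwarding step for the second loop, assuming `d` does not alias the other arrays. The scalar `T` plays the role of the slide's temporary; the function names and `int` element type are assumptions made for illustration.

```c
/* Baseline: d[k - 1] is reloaded from memory right after it was stored. */
void d_loop_reload(int M, int *d, const int *mc, const int *t1, const int *t2) {
    for (int k = 1; k <= M; k++) {
        d[k] = d[k - 1] + t1[k - 1];
        int sc = mc[k - 1] + t2[k - 1];
        if (sc > d[k]) d[k] = sc;
    }
}

/* Forwarded: the value stored to d[k] is kept in T and reused as d[k - 1]
 * in the next iteration, so the recurrence no longer goes through memory. */
void d_loop_forwarded(int M, int *d, const int *mc, const int *t1, const int *t2) {
    int T = d[0];
    for (int k = 1; k <= M; k++) {
        T = T + t1[k - 1];
        int sc = mc[k - 1] + t2[k - 1];
        if (sc > T) T = sc;
        d[k] = T;
    }
}
```

The loop-carried dependence now runs through a register instead of a store followed by a load, which is the critical-path shortening the slide refers to.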

12 Reflection
- Many local narrowly focused optimizations
- Loop Distribution advances capabilities of Loop Transformation Framework

13 Kernel Performance [Bar chart] SHA: -10%; Sobel Filter: -8%; Compress: -6%; BlackScholes: +53%; Convolution: +70%

14 Convolution: Estimate Loop Unrolling Benefits, No Unrolling
static const int k[] = { 0, 1, 5 };
for (v = 0; v < size; v++) {
  r += src[v] * k[v];
}

15 Convolution: Estimate Loop Unrolling Benefits, With Unrolling
static const int k[] = { 0, 1, 5 };
r += src[0] * k[0];
r += src[1] * k[1];
r += src[2] * k[2];

17 Convolution: Estimate Loop Unrolling Benefits, With Unrolling
static const int k[] = { 0, 1, 5 };
r += src[0] * 0;
r += src[1] * 1;
r += src[2] * 5;

18 Convolution: Estimate Loop Unrolling Benefits, With Unrolling
static const int k[] = { 0, 1, 5 };
r += 0;           // saves mul + add
r += src[1];      // saves mul
r += src[2] * 5;
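
The estimate above can be checked with a small C sketch, assuming `size` equals the three kernel weights and `int` data; the function names are hypothetical. The unrolled version is written out the way a compiler could simplify it once the constant weights are visible.

```c
/* Kernel weights from the slide; constant and known at compile time. */
static const int k[] = {0, 1, 5};
enum { KSIZE = 3 };  /* assumed value of the slide's `size` */

/* Rolled loop: one load of k[v], one mul and one add per iteration. */
int convolve_loop(const int *src) {
    int r = 0;
    for (int v = 0; v < KSIZE; v++)
        r += src[v] * k[v];
    return r;
}

/* Fully unrolled with the constants folded in. */
int convolve_unrolled(const int *src) {
    int r = 0;
    /* src[0] * 0 contributes nothing: the mul and the add both disappear. */
    r += src[1];      /* src[1] * 1: the mul disappears */
    r += src[2] * 5;
    return r;
}
```

An unrolling cost model that anticipates these simplifications can justify unrolling loops it would otherwise consider too expensive.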

19 Kernel Regressions
- SHA (-10%): Aggressive load hoisting -> Tune Scheduler
- Sobel Filter (-8%): LSR expression normalization -> Avoid GEP in base address calculation
- Compress (-6%): No CCMP optimization due to tbnz -> Generalize CCMP

20 Performance Changes In LLVM Tests [Chart: % gain per test, showing gains and losses]

21 Headroom

22 CINT2006 Headroom [Stacked bar chart, 0% to 40%]
- Tuning: +2%
- Advanced: +10%, AoS->SoA, libquantum (2X)
- Heroic: +15%, Inlining + GlobalOpts + LoopFusion, libquantum (10X)
- Unknown: +5%?

23 CFP2006 Headroom [Stacked bar chart, 0% to 40%]
- Tuning: +2%
- Advanced: +25%, AoS->SoA milc (3X), Value Profiling povray (10%)
- Unknown: +5%?

24 Array of Structs (AoS) to Struct of Arrays (SoA)

25 Libquantum: AoS->SoA
hot_code(..., quantum_reg *reg) {
  int i;
  for (i = 0; i < reg->size; i++) {
    if (reg->node[i].state & C) {
      reg->node[i].state ^= T;
    }
  }
}
struct quantum_reg_node {
  COMPLEX_FLOAT amplitude;
  int state;
};
struct quantum_reg {
  int width;
  int size;
  int hashw;
  quantum_reg_node *node;
  int *hash;
};
Hot loop uses only some fields of a structure

26 struct quantum_reg_node { COMPLEX_FLOAT amplitude; int state; }; quantum_reg_node N[]
for (i = 0; i < reg->size; i++) {
  if (reg->n[i].state & C) {
    reg->n[i].state ^= T;
  }
}

27 [Diagram: quantum_reg_node N[] with elements N[0], N[1], N[2] and their fields N[0].state, N[1].state, N[2].state mapped onto cache lines (boundaries marked at 128b and 192b; fictitious cache line size!)]

29 COMPLEX_FLOAT *amplitude; int *state; quantum_reg_node N[]
for (i = 0; i < reg->size; i++) {
  if (reg->n[i].state & C) {
    reg->n[i].state ^= T;
  }
}

30 COMPLEX_FLOAT *amplitude; int *state; COMPLEX_FLOAT amplitude[]; MAX_UNSIGNED state[]
for (i = 0; i < reg->size; i++) {
  if (reg->n[i].state & C) {
    reg->n[i].state ^= T;
  }
}

31 COMPLEX_FLOAT *amplitude; int *state; COMPLEX_FLOAT amplitude[]; MAX_UNSIGNED state[]
for (i = 0; i < reg->size; i++) {
  if (reg->state[i] & C) {
    reg->state[i] ^= T;
  }
}

32 COMPLEX_FLOAT *amplitude; int *state; COMPLEX_FLOAT amplitude[]; MAX_UNSIGNED state[] [Diagram: cache contents after AoS->SoA]

33 AoS to SoA speeds up benchmarks and applications that are memory-bandwidth bound and/or cache bound
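
The two layouts can be compared side by side in a small C sketch. The type names below are stand-ins (the real `COMPLEX_FLOAT` and `MAX_UNSIGNED` definitions are not shown on the slides), and the register structs are reduced to the fields the hot loop needs.

```c
typedef struct { float re, im; } complex_float;  /* stand-in for COMPLEX_FLOAT */
typedef unsigned long long max_unsigned;         /* stand-in for MAX_UNSIGNED */

/* AoS: each state is interleaved with an amplitude the loop never reads. */
typedef struct { complex_float amplitude; max_unsigned state; } node_aos;
typedef struct { int size; node_aos *node; } reg_aos;

/* SoA: all states are contiguous, so every fetched cache line is fully used. */
typedef struct { int size; complex_float *amplitude; max_unsigned *state; } reg_soa;

void toggle_aos(reg_aos *reg, max_unsigned C, max_unsigned T) {
    for (int i = 0; i < reg->size; i++)
        if (reg->node[i].state & C) reg->node[i].state ^= T;
}

void toggle_soa(reg_soa *reg, max_unsigned C, max_unsigned T) {
    for (int i = 0; i < reg->size; i++)
        if (reg->state[i] & C) reg->state[i] ^= T;
}
```

Both loops compute the same states; the SoA version simply streams through less memory, which is where the bandwidth and cache benefit comes from.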

34 AoS->SoA: Challenges
- Legality: casts to/from struct type, escaped types, address taken of individual fields, parameter and return values, semantics of constants
- Transformations: data accesses, memory allocation
- Usability: debugging

35 AoS->SoA: Changing Structure Definition and Accesses
Before:
struct quantum_reg_node { COMPLEX_FLOAT amplitude; MAX_UNSIGNED state; };
struct quantum_reg { int width; int size; int hashw; quantum_reg_node *node; int *hash; };
After:
struct quantum_reg { int width; int size; int hashw; COMPLEX_FLOAT *amplitude; int *state; int *hash; };
Accesses: reg->n[i].state becomes reg->state[i]

36 Constants
if (!reg.state)
reg.node = calloc(reg.size, sizeof(quantum_reg_node));
%call10 = call %conv, i64 16);

37 Parameter Passing
j = quantum_get_state(reg1->node)
becomes
j = quantum_get_state(reg1->amplitude, reg1->state);

38 References
- Prashantha NR et al., Implementing Data Layout Optimizations in LLVM Framework, LLVM Developer Meeting, 2014
- G. Chakrabarti, F. Chow, Structure Layout Optimizations in the Open64 Compiler: Design, Implementation and Measurements, Open64 Workshop at CGO, 2008
- O. Golovanevsky et al., Workshops/compiler2007/present/data-layout-optimizations-ingcc.pdf, GCC Workshop, 2007
- R. Hundt et al., Practical Structure Layout Optimization and Advice, CGO, 2006

39 More Headroom: Heroics

40 Libquantum: Heroics
test_sum() { ...; cnot(2 * width - 1, width - 1, reg); sigma_x(2 * width - 1, reg); ...; }
- 2 similar functions
- Hot loops
- Whole program visibility
- Alias analysis
- GlobalModRef
- Cost Model
- Inline + Fuse

41 Interprocedural Loop Fusion
void cnot(int C, int T, q_reg *reg) {
  int i; int qec;
  status(&qec, NULL);
  if (qec) B1;
  else {
    if (foo(x)) return;
    for (i = 0; i < reg->size; i++) {
      if (reg->state[i] & C) reg->state[i] ^= T;
    }
    decohere(reg);
  }
}
void sigma(int T, q_reg *reg) {
  int i; int qec;
  status(&qec, NULL);
  if (qec) B2;
  else {
    if (foo(y)) return;
    for (i = 0; i < reg->size; i++) {
      reg->state[i] ^= T;
    }
    decohere(reg);
  }
}

43 Interprocedural Loop Fusion
[Build slide: both functions reduced to their calls: status(&qec, ...), B1 or B2, decohere(reg)]

44 Interprocedural Loop Fusion: Global Value Prop
void decohere(q_reg) { if (status) { do_something; return;

45 Interprocedural Loop Fusion: Global Value Prop
void decohere(q_reg) { if (0) { do_something; return;

47 Interprocedural Loop Fusion
[Build slide: in cnot] status(&qec, NULL); if (qec) B1;
[Build slide: in sigma] status(&qec, NULL); if (qec) B2;

48 Interprocedural Loop Fusion: Inlining
[Build slide: the call status(&qec, NULL) followed by if (qec) is inlined in both functions]

49 Interprocedural Loop Fusion: Inlining
[Build slide: after inlining] if (globalvar) B1; and if (globalvar) B2;

50 Interprocedural Loop Fusion
void cnot(int C, int T, q_reg *reg) {
  int i; int qec;
  status(&qec, NULL);
  if (globalvar) B1;
  else {
    if (foo(x)) return;
    for (i = 0; i < reg->size; i++) {
      if (reg->state[i] & C) reg->state[i] ^= T;
    }
    decohere(reg);
  }
}
void sigma(int T, q_reg *reg) {
  int i; int qec;
  status(&qec, NULL);
  if (globalvar) B2;
  else {
    if (foo(y)) return;
    for (i = 0; i < reg->size; i++) {
      reg->state[i] ^= T;
    }
    decohere(reg);
  }
}

51 Interprocedural Loop Fusion: GlobalModRef
[Build slide: in cnot] status(&qec, NULL); if (globalvar) B1; else { ... }
[Build slide: in sigma] status(&qec, NULL); if (globalvar) B2; else { ... }

52 Interprocedural Loop Fusion: GlobalModRef
if (globalvar) {
  B1; B2;
} else {
  if (foo(x)) return;
  for (i = 0; i < reg->size; i++) {
    if (reg->state[i] & C) reg->state[i] ^= T;
  }
  if (foo(y)) return;
  for (i = 0; i < reg->size; i++) {
    reg->state[i] ^= T;
  }
}

54 Interprocedural Loop Fusion: GlobalModRef
if (globalvar) {
  B1; B2;
} else {
  if (foo(x)) return;
  if (foo(y)) return;
  for (i = 0; i < reg->size; i++) {
    if (reg->state[i] & C) reg->state[i] ^= T;
  }
  for (i = 0; i < reg->size; i++) {
    reg->state[i] ^= T;
  }
}

55 Interprocedural Loop Fusion: Fuse
if (globalvar) {
  B1; B2;
} else {
  bool V1 = foo(x);
  bool V2 = foo(y);
  if (V1 || V2) return;
  for (i = 0; i < reg->size; i++) {
    if (reg->state[i] & C) reg->state[i] ^= T;
  }
  for (i = 0; i < reg->size; i++) {
    reg->state[i] ^= T;
  }
}

56 Interprocedural Loop Fusion: Fused Loops
if (globalvar) {
  B1; B2;
} else {
  bool V1 = foo(x);
  bool V2 = foo(y);
  if (V1 || V2) return;
  for (i = 0; i < reg->size; i++) {
    if (reg->state[i] & C) ...
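
The final fusion step can be sketched in isolation as follows. This is a minimal sketch that assumes the surrounding calls have already been inlined or proven to be no-ops; the function names and the parameters `T1` and `T2` (standing for the `T` arguments of `cnot` and `sigma`) are hypothetical.

```c
/* Two separate passes over the state array, one per original function. */
void two_loops(unsigned long long *state, int size, unsigned long long C,
               unsigned long long T1, unsigned long long T2) {
    for (int i = 0; i < size; i++)
        if (state[i] & C) state[i] ^= T1;
    for (int i = 0; i < size; i++)
        state[i] ^= T2;
}

/* Fused: a single pass. Legal here because iteration i of each loop
 * reads and writes only state[i]. */
void fused_loop(unsigned long long *state, int size, unsigned long long C,
                unsigned long long T1, unsigned long long T2) {
    for (int i = 0; i < size; i++) {
        unsigned long long s = state[i];
        if (s & C) s ^= T1;
        state[i] = s ^ T2;
    }
}
```

Fusing halves the number of passes over the array, which matters most when the array does not fit in cache.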

57 What can we learn?
- Optimization Scope: Call Chain
- Concept: Function Similarity
- Challenge: Hoist statements across loops
- Techniques: Global Value Prop, Partial Inlining, GlobalModRef, Loop Fusion, ...

58 Take Aways
- Techniques needed for heroics can be generalized to advance optimization technology
- Cost Model?

59 How to expose performance opportunities to developers?

60 __builtin_nontemporal_store
void scaledcpy(float * restrict a, float * restrict b, float S, int N) {
  for (int i = 0; i < N; i++)
    b[i] = S * a[i];
}

61 __builtin_nontemporal_store
void scaledcpy(float * restrict a, float * restrict b, float S, int N) {
  for (int i = 0; i < N; i++)
    // b[i] = S * a[i];
    __builtin_nontemporal_store(S * a[i], &b[i]);
}
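
A portable variant of the slide's function might look like the sketch below. `__builtin_nontemporal_store` is a Clang builtin; the `NT_STORE` macro and its plain-store fallback are assumptions added here so the code also compiles with compilers that lack the builtin.

```c
#ifndef __has_builtin
#define __has_builtin(x) 0  /* compiler has no __has_builtin: use the fallback */
#endif

#if __has_builtin(__builtin_nontemporal_store)
#define NT_STORE(val, ptr) __builtin_nontemporal_store((val), (ptr))
#else
#define NT_STORE(val, ptr) (*(ptr) = (val))  /* ordinary store as fallback */
#endif

/* Writes b[i] = S * a[i], hinting that b will not be read again soon,
 * so the stores need not displace useful data from the cache. */
void scaledcpy(float *restrict a, float *restrict b, float S, int N) {
    for (int i = 0; i < N; i++)
        NT_STORE(S * a[i], &b[i]);
}
```

The result is the same with or without the hint; only the cache behavior differs, so the builtin is worth trying mainly on large, write-once destination buffers.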

62 Vectorizer: Hints and Diagnostics
while (good) {
  for (i = 0; i < N; i++) {
    DW[i] = A[i - 3] + B[i - 2] + C[i - 1] + D[i];
    UW[i] = A[i] + B[i + 1] + C[i + 2] + D[i + 3];
  }
}
remark: loop not vectorized: Avoid runtime pointer checking when you know the arrays will always be independent by specifying '#pragma clang loop vectorize(assume_safety)' before the loop or by specifying 'restrict' on the array arguments. Erroneous results will occur if these options are incorrectly applied!
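
Acting on that remark could look like the following sketch. The function name, the `float` element type and the loop bounds (chosen so every index stays inside the arrays) are assumptions; the pragma is the one named in the diagnostic, guarded so other compilers skip it.

```c
/* Caller must guarantee that DW and UW do not overlap each other or any
 * input array; otherwise the pragma permits an incorrect vectorization. */
void stencil(int N, float *DW, float *UW, const float *A, const float *B,
             const float *C, const float *D) {
#ifdef __clang__
#pragma clang loop vectorize(assume_safety)
#endif
    for (int i = 3; i < N - 3; i++) {
        DW[i] = A[i - 3] + B[i - 2] + C[i - 1] + D[i];
        UW[i] = A[i] + B[i + 1] + C[i + 2] + D[i + 3];
    }
}
```

Declaring the pointer parameters `restrict` is the alternative the remark mentions; it moves the no-overlap promise into the function signature instead of the loop.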

63 Conclusions The best days for LLVM performance are ahead of us

64 Questions?


More information

Computer Sciences Department

Computer Sciences Department Computer Sciences Department Compiler Construction of Idempotent Regions Marc de Kruijf Karthikeyan Sankaralingam Somesh Jha Technical Report #1700 November 2011 Compiler Construction of Idempotent Regions

More information

Introduction to C++ Introduction. Structure of a C++ Program. Structure of a C++ Program. C++ widely-used general-purpose programming language

Introduction to C++ Introduction. Structure of a C++ Program. Structure of a C++ Program. C++ widely-used general-purpose programming language Introduction C++ widely-used general-purpose programming language procedural and object-oriented support strong support created by Bjarne Stroustrup starting in 1979 based on C Introduction to C++ also

More information

SimBench. A Portable Benchmarking Methodology for Full-System Simulators. Harry Wagstaff Bruno Bodin Tom Spink Björn Franke

SimBench. A Portable Benchmarking Methodology for Full-System Simulators. Harry Wagstaff Bruno Bodin Tom Spink Björn Franke SimBench A Portable Benchmarking Methodology for Full-System Simulators Harry Wagstaff Bruno Bodin Tom Spink Björn Franke Institute for Computing Systems Architecture University of Edinburgh ISPASS 2017

More information

Please refer to the turn-in procedure document on the website for instructions on the turn-in procedure.

Please refer to the turn-in procedure document on the website for instructions on the turn-in procedure. 1 CSE 131 Winter 2013 Compiler Project #2 -- Code Generation Due Date: Friday, March 15 th 2013 @ 11:59pm Disclaimer This handout is not perfect; corrections may be made. Updates and major clarifications

More information

Lecture 20 CIS 341: COMPILERS

Lecture 20 CIS 341: COMPILERS Lecture 20 CIS 341: COMPILERS Announcements HW5: OAT v. 2.0 records, function pointers, type checking, array-bounds checks, etc. Due: TOMORROW Wednesday, April 11 th Zdancewic CIS 341: Compilers 2 A high-level

More information

Introduction to C++ with content from

Introduction to C++ with content from Introduction to C++ with content from www.cplusplus.com 2 Introduction C++ widely-used general-purpose programming language procedural and object-oriented support strong support created by Bjarne Stroustrup

More information

IA-64 Compiler Technology

IA-64 Compiler Technology IA-64 Compiler Technology David Sehr, Jay Bharadwaj, Jim Pierce, Priti Shrivastav (speaker), Carole Dulong Microcomputer Software Lab Page-1 Introduction IA-32 compiler optimizations Profile Guidance (PGOPTI)

More information

Short Notes of CS201

Short Notes of CS201 #includes: Short Notes of CS201 The #include directive instructs the preprocessor to read and include a file into a source code file. The file name is typically enclosed with < and > if the file is a system

More information

EKT 303 WEEK Pearson Education, Inc., Hoboken, NJ. All rights reserved.

EKT 303 WEEK Pearson Education, Inc., Hoboken, NJ. All rights reserved. + EKT 303 WEEK 2 2016 Pearson Education, Inc., Hoboken, NJ. All rights reserved. Chapter 2 + Performance Issues + Designing for Performance The cost of computer systems continues to drop dramatically,

More information

Near-Threshold Computing: How Close Should We Get?

Near-Threshold Computing: How Close Should We Get? Near-Threshold Computing: How Close Should We Get? Alaa R. Alameldeen Intel Labs Workshop on Near-Threshold Computing June 14, 2014 Overview High-level talk summarizing my architectural perspective on

More information

Structure Array Copy Optimization

Structure Array Copy Optimization Structure Array Copy Optimization 1. Objective Modern programming language use structure to gather relative datum, in some of the program, only part of the structure is accessed. If the size of accessed

More information

OpenPrefetch. (in-progress)

OpenPrefetch. (in-progress) OpenPrefetch Let There Be Industry-Competitive Prefetching in RISC-V Processors (in-progress) Bowen Huang, Zihao Yu, Zhigang Liu, Chuanqi Zhang, Sa Wang, Yungang Bao Institute of Computing Technology(ICT),

More information

SEN361 Computer Organization. Prof. Dr. Hasan Hüseyin BALIK (2 nd Week)

SEN361 Computer Organization. Prof. Dr. Hasan Hüseyin BALIK (2 nd Week) + SEN361 Computer Organization Prof. Dr. Hasan Hüseyin BALIK (2 nd Week) + Outline 1. Overview 1.1 Basic Concepts and Computer Evolution 1.2 Performance Issues + 1.2 Performance Issues + Designing for

More information

TDT4255 Computer Design. Lecture 1. Magnus Jahre

TDT4255 Computer Design. Lecture 1. Magnus Jahre 1 TDT4255 Computer Design Lecture 1 Magnus Jahre 2 Outline Practical course information Chapter 1: Computer Abstractions and Technology 3 Practical Course Information 4 TDT4255 Computer Design TDT4255

More information

Efficient Memory Shadowing for 64-bit Architectures

Efficient Memory Shadowing for 64-bit Architectures Efficient Memory Shadowing for 64-bit Architectures The MIT Faculty has made this article openly available. Please share how this access benefits you. Your story matters. Citation Qin Zhao, Derek Bruening,

More information

CS201 - Introduction to Programming Glossary By

CS201 - Introduction to Programming Glossary By CS201 - Introduction to Programming Glossary By #include : The #include directive instructs the preprocessor to read and include a file into a source code file. The file name is typically enclosed with

More information

CS 261 Fall C Introduction. Variables, Memory Model, Pointers, and Debugging. Mike Lam, Professor

CS 261 Fall C Introduction. Variables, Memory Model, Pointers, and Debugging. Mike Lam, Professor CS 261 Fall 2017 Mike Lam, Professor C Introduction Variables, Memory Model, Pointers, and Debugging The C Language Systems language originally developed for Unix Imperative, compiled language with static

More information

Final CSE 131B Winter 2003

Final CSE 131B Winter 2003 Login name Signature Name Student ID Final CSE 131B Winter 2003 Page 1 Page 2 Page 3 Page 4 Page 5 Page 6 Page 7 Page 8 _ (20 points) _ (25 points) _ (21 points) _ (40 points) _ (30 points) _ (25 points)

More information

Structure Layout Optimizations in the Open64 Compiler: Design, Implementation and Measurements

Structure Layout Optimizations in the Open64 Compiler: Design, Implementation and Measurements Structure Layout Optimizations in the Open64 Compiler: Design, Implementation and Measurements Gautam Chakrabarti, Fred Chow PathScale, LLC. {gautam, fchow}@pathscale.com Abstract A common performance

More information

Compiling for Performance on hp OpenVMS I64. Doug Gordon Original Presentation by Bill Noyce European Technical Update Days, 2005

Compiling for Performance on hp OpenVMS I64. Doug Gordon Original Presentation by Bill Noyce European Technical Update Days, 2005 Compiling for Performance on hp OpenVMS I64 Doug Gordon Original Presentation by Bill Noyce European Technical Update Days, 2005 Compilers discussed C, Fortran, [COBOL, Pascal, BASIC] Share GEM optimizer

More information

Cell Programming Tips & Techniques

Cell Programming Tips & Techniques Cell Programming Tips & Techniques Course Code: L3T2H1-58 Cell Ecosystem Solutions Enablement 1 Class Objectives Things you will learn Key programming techniques to exploit cell hardware organization and

More information

Practical Data Compression for Modern Memory Hierarchies

Practical Data Compression for Modern Memory Hierarchies Practical Data Compression for Modern Memory Hierarchies Thesis Oral Gennady Pekhimenko Committee: Todd Mowry (Co-chair) Onur Mutlu (Co-chair) Kayvon Fatahalian David Wood, University of Wisconsin-Madison

More information

Spatial Memory Streaming (with rotated patterns)

Spatial Memory Streaming (with rotated patterns) Spatial Memory Streaming (with rotated patterns) Michael Ferdman, Stephen Somogyi, and Babak Falsafi Computer Architecture Lab at 2006 Stephen Somogyi The Memory Wall Memory latency 100 s clock cycles;

More information

Scalable Dynamic Task Scheduling on Adaptive Many-Cores

Scalable Dynamic Task Scheduling on Adaptive Many-Cores Introduction: Many- Paradigm [Our Definition] Scalable Dynamic Task Scheduling on Adaptive Many-s Vanchinathan Venkataramani, Anuj Pathania, Muhammad Shafique, Tulika Mitra, Jörg Henkel Bus CES Chair for

More information

Addressing End-to-End Memory Access Latency in NoC-Based Multicores

Addressing End-to-End Memory Access Latency in NoC-Based Multicores Addressing End-to-End Memory Access Latency in NoC-Based Multicores Akbar Sharifi, Emre Kultursay, Mahmut Kandemir and Chita R. Das The Pennsylvania State University University Park, PA, 682, USA {akbar,euk39,kandemir,das}@cse.psu.edu

More information

Chapter 1. Computer Abstractions and Technology. Adapted by Paulo Lopes, IST

Chapter 1. Computer Abstractions and Technology. Adapted by Paulo Lopes, IST Chapter 1 Computer Abstractions and Technology Adapted by Paulo Lopes, IST The Computer Revolution Progress in computer technology Sustained by Moore s Law Makes novel and old applications feasible Computers

More information

Bentley Rules for Optimizing Work

Bentley Rules for Optimizing Work 6.172 Performance Engineering of Software Systems SPEED LIMIT PER ORDER OF 6.172 LECTURE 2 Bentley Rules for Optimizing Work Charles E. Leiserson September 11, 2012 2012 Charles E. Leiserson and I-Ting

More information

Optimising with the IBM compilers

Optimising with the IBM compilers Optimising with the IBM Overview Introduction Optimisation techniques compiler flags compiler hints code modifications Optimisation topics locals and globals conditionals data types CSE divides and square

More information

Architectural Supports to Protect OS Kernels from Code-Injection Attacks

Architectural Supports to Protect OS Kernels from Code-Injection Attacks Architectural Supports to Protect OS Kernels from Code-Injection Attacks 2016-06-18 Hyungon Moon, Jinyong Lee, Dongil Hwang, Seonhwa Jung, Jiwon Seo and Yunheung Paek Seoul National University 1 Why to

More information

HexType: Efficient Detection of Type Confusion Errors for C++ Yuseok Jeon Priyam Biswas Scott A. Carr Byoungyoung Lee Mathias Payer

HexType: Efficient Detection of Type Confusion Errors for C++ Yuseok Jeon Priyam Biswas Scott A. Carr Byoungyoung Lee Mathias Payer HexType: Efficient Detection of Type Confusion Errors for C++ Yuseok Jeon Priyam Biswas Scott A. Carr Byoungyoung Lee Mathias Payer Motivation C++ is a popular programming language Google Chrome, Firefox,

More information

Baikal-T1 Microprocessor Performance Tests

Baikal-T1 Microprocessor Performance Tests Baikal-T1 Microprocessor Performance Tests Revision list Revision Date Author Description 1.0 15.03.2017 Initial version 1.1 08.08.2017 Added SPEC CPU2006 Int, iperf results Revision list... 1 1. List

More information

Is dynamic compilation possible for embedded system?

Is dynamic compilation possible for embedded system? Is dynamic compilation possible for embedded system? Scopes 2015, St Goar Victor Lomüller, Henri-Pierre Charles CEA DACLE / Grenoble www.cea.fr June 2 2015 Introduction : Wake Up Questions Session FAQ

More information

CIS 341 Final Examination 4 May 2017

CIS 341 Final Examination 4 May 2017 CIS 341 Final Examination 4 May 2017 1 /14 2 /15 3 /12 4 /14 5 /34 6 /21 7 /10 Total /120 Do not begin the exam until you are told to do so. You have 120 minutes to complete the exam. There are 14 pages

More information

std::assume_aligned Abstract

std::assume_aligned Abstract std::assume_aligned Timur Doumler (papers@timur.audio) Chandler Carruth (chandlerc@google.com) Document #: P1007R0 Date: 2018-05-04 Project: Programming Language C++ Audience: Library Evolution Working

More information

LLVM-based Communication Optimizations for PGAS Programs

LLVM-based Communication Optimizations for PGAS Programs LLVM-based Communication Optimizations for PGAS Programs nd Workshop on the LLVM Compiler Infrastructure in HPC @ SC15 Akihiro Hayashi (Rice University) Jisheng Zhao (Rice University) Michael Ferguson

More information

Postprint. This is the accepted version of a paper presented at CGO 2017, February 4 8, Austin, TX.

Postprint.   This is the accepted version of a paper presented at CGO 2017, February 4 8, Austin, TX. http://www.diva-portal.org Postprint This is the accepted version of a paper presented at CGO 2017, February 4 8, Austin, TX. Citation for the original published paper: Tran, K-A., Carlson, T E., Koukos,

More information

LLVM for the future of Supercomputing

LLVM for the future of Supercomputing LLVM for the future of Supercomputing Hal Finkel hfinkel@anl.gov 2017-03-27 2017 European LLVM Developers' Meeting What is Supercomputing? Computing for large, tightly-coupled problems. Lots of computational

More information