Enabling Parallel Computing in Chapel with Clang and LLVM
1 Enabling Parallel Computing in Chapel with Clang and LLVM Michael Ferguson Cray Inc. October 19, 2017
2 Safe Harbor Statement This presentation may contain forward-looking statements that are based on our current expectations. Forward looking statements may include statements about our financial guidance and expected operating results, our opportunities and future potential, our product development and new product introduction plans, our ability to expand and penetrate our addressable markets and other statements that are not historical facts. These statements are only predictions and actual results may materially vary from those projected. Please refer to Cray's documents filed with the SEC from time to time concerning factors that could affect the Company and these forward-looking statements.!2
3 Outline: Introducing Chapel; C interoperability; Combined Code Generation; Communication Optimization in LLVM
4 What is Chapel? Chapel: a productive parallel programming language; portable; open-source; a collaborative effort. Goals: support general parallel programming (any parallel algorithm on any parallel hardware) and make parallel programming at scale far more productive.
5 What does Productivity mean to you? Recent Graduates: "something similar to what I used in school: Python, Matlab, Java, ..." Seasoned HPC Programmers: "that sugary stuff that I don't need because I was born to suffer" / "want full control to ensure performance" Computational Scientists: "something that lets me express my parallel computations without having to wrestle with architecture-specific details" Chapel Team: "something that lets computational scientists express what they want, without taking away the control that HPC programmers want, implemented in a language as attractive as recent graduates want."
6 STREAM Triad: a trivial parallel computation. Given: m-element vectors A, B, C. Compute: for all i in 1..m, A_i = B_i + α·C_i. [diagram: A = B + α·C]
7 STREAM Triad: a trivial parallel computation. Given: m-element vectors A, B, C. Compute: for all i in 1..m, A_i = B_i + α·C_i. In pictures, in parallel (shared memory / multicore): [diagram: the vectors split into chunks, each core computing A = B + α·C on its chunk]
8 STREAM Triad: a trivial parallel computation. Given: m-element vectors A, B, C. Compute: for all i in 1..m, A_i = B_i + α·C_i. In pictures, in parallel (distributed memory): [diagram: the vectors split across nodes, each node computing A = B + α·C on its chunk]
9 STREAM Triad: a trivial parallel computation. Given: m-element vectors A, B, C. Compute: for all i in 1..m, A_i = B_i + α·C_i. In pictures, in parallel (distributed memory multicore): [diagram: the vectors split across nodes and, within each node, across cores]
10 STREAM Triad: MPI

#include <hpcc.h>

static int VectorSize;
static double *a, *b, *c;

int HPCC_StarStream(HPCC_Params *params) {
  int myRank, commSize;
  int rv, errCount;
  MPI_Comm comm = MPI_COMM_WORLD;

  MPI_Comm_size( comm, &commSize );
  MPI_Comm_rank( comm, &myRank );

  rv = HPCC_Stream( params, 0 == myRank );
  MPI_Reduce( &rv, &errCount, 1, MPI_INT, MPI_SUM, 0, comm );

  return errCount;
}

int HPCC_Stream(HPCC_Params *params, int doIO) {
  register int j;
  double scalar;

  VectorSize = HPCC_LocalVectorSize( params, 3, sizeof(double), 0 );

  a = HPCC_XMALLOC( double, VectorSize );
  b = HPCC_XMALLOC( double, VectorSize );
  c = HPCC_XMALLOC( double, VectorSize );

  if (!a || !b || !c) {
    if (c) HPCC_free(c);
    if (b) HPCC_free(b);
    if (a) HPCC_free(a);
    if (doIO) {
      fprintf( outFile, "Failed to allocate memory (%d).\n", VectorSize );
      fclose( outFile );
    }
    return 1;
  }

  for (j=0; j<VectorSize; j++) {
    b[j] = 2.0;
    c[j] = 1.0;
  }

  scalar = 3.0;

  for (j=0; j<VectorSize; j++)
    a[j] = b[j]+scalar*c[j];

  HPCC_free(c);
  HPCC_free(b);
  HPCC_free(a);

  return 0;
}
11 STREAM Triad: MPI+OpenMP

The MPI code from the previous slide, with OpenMP added. At the top:

#include <hpcc.h>
#ifdef _OPENMP
#include <omp.h>
#endif

and, in HPCC_Stream, each of the two loops gains:

#ifdef _OPENMP
#pragma omp parallel for
#endif
for (j=0; j<VectorSize; j++)
  ...
12 STREAM Triad: MPI+OpenMP (plus CUDA)

The MPI+OpenMP code from the previous slide, now shown alongside a CUDA version:

#define N

__global__ void set_array(float *a, float value, int len) {
  int idx = threadIdx.x + blockIdx.x * blockDim.x;
  if (idx < len) a[idx] = value;
}

__global__ void STREAM_Triad(float *a, float *b, float *c,
                             float scalar, int len) {
  int idx = threadIdx.x + blockIdx.x * blockDim.x;
  if (idx < len) c[idx] = a[idx]+scalar*b[idx];
}

int main() {
  float *d_a, *d_b, *d_c;
  float scalar;

  cudaMalloc((void**)&d_a, sizeof(float)*N);
  cudaMalloc((void**)&d_b, sizeof(float)*N);
  cudaMalloc((void**)&d_c, sizeof(float)*N);

  dim3 dimBlock(128);
  dim3 dimGrid(N/dimBlock.x);
  if( N % dimBlock.x != 0 ) dimGrid.x += 1;

  set_array<<<dimGrid,dimBlock>>>(d_b, .5f, N);
  set_array<<<dimGrid,dimBlock>>>(d_c, .5f, N);

  scalar = 3.0f;
  STREAM_Triad<<<dimGrid,dimBlock>>>(d_b, d_c, d_a, scalar, N);
  cudaThreadSynchronize();

  cudaFree(d_a);
  cudaFree(d_b);
  cudaFree(d_c);
}
13 STREAM Triad: MPI+OpenMP (plus CUDA)

The same MPI+OpenMP and CUDA code as on the previous slides, overlaid with the takeaway:

HPC suffers from too many distinct notations for expressing parallelism and locality. This tends to be a result of bottom-up language design.
14 STREAM Triad: Chapel

Shown against the MPI+OpenMP and CUDA versions, the full Chapel program:

use …;

config const m = 1000,
             alpha = 3.0;

const ProblemSpace = {1..m} dmapped …;

var A, B, C: [ProblemSpace] real;

B = 2.0;
C = 1.0;

A = B + alpha * C;

Philosophy: Good, top-down language design can tease system-specific implementation details away from an algorithm, permitting the compiler, runtime, applied scientist, and HPC expert to each focus on their strengths. The special sauce: How should this index set, and any arrays and computations over it, be mapped to the system?
15 Chapel+LLVM History 1 Chapel project starts LLVM 1.0 released first Chapel release public Chapel release clang released --llvm backend --llvm-wide-opt extern { c code --llvm perf competitive --llvm default? !15
16 Chapel+LLVM History Chapel project grew up generating C code 1 Chapel project starts LLVM 1.0 released first Chapel release public Chapel release clang released --llvm backend --llvm-wide-opt extern { c code --llvm perf competitive --llvm default? !16
17 chpl compilation flow

[flow diagram: a.chpl, b.chpl, … → chpl → a.c, b.c, … → CC (with the Chapel runtime .h/.a) → Executable]
18 chpl --llvm compilation flow

[flow diagram: a.chpl, b.chpl, … plus runtime .h headers and extern { } blocks → chpl (with clang) → LLVM module → LLVM Opts → linker (with runtime .a) → Executable]
19 C Interoperability

[flow diagram as before: a.chpl, b.chpl, … plus runtime .h headers and extern { } blocks → chpl (with clang) → LLVM module → LLVM Opts → linker (with runtime .a) → Executable]
20 Goals of C Interoperability. Chapel is a new language, and libraries are important for productivity, so easy use of libraries in another language is important! Chapel supports interoperability with C. We need to be able to use an existing C library's functions, variables, types, and macros. Using C functions needs to be efficient: performance is a goal here!
21 C Interoperability Example

// add1.h
static inline int add1(int x) { return x+1; }

// addone.chpl
extern proc add1(x:c_int):c_int;
writeln(add1(4));

$ chpl addone.chpl add1.h
$ ./addone
5
22 With extern { }

// add1.h
static inline int add1(int x) { return x+1; }

// addone.chpl
extern {
  #include "add1.h"
}
writeln(add1(4));

$ chpl addone.chpl
$ ./addone
5
23 extern block compilation flow

frontend passes: parse → readExternC → …
parse: .chpl source containing extern { int add1(int x); }
readExternC: clang parses the C into a clang AST; the clang AST is converted into Chapel extern decls, e.g. extern proc add1(x:c_int):c_int;
24 Combined Code Generation

[flow diagram as before: a.chpl, b.chpl, … plus runtime .h headers and extern { } blocks → chpl (with clang) → LLVM module → LLVM Opts → linker (with runtime .a) → Executable]
25 Inlining with C code. Some C functions expect to be inlined; if they are not, there is a performance penalty. The runtime is written primarily in C, enabling easy use of libraries like qthreads, and Chapel uses third-party libraries such as GMP. Library authors control what might be inlined. Since Chapel generated C, it became normal to assume: functions can be inlined; C types are available; fields in C structs are available; C macros are available.
26 Example: Accessing a Struct Field

// sockaddr.h
#include <sys/socket.h>
typedef struct my_sockaddr_s {
  struct sockaddr_storage addr;
  size_t len;
} my_sockaddr_t;

// network.chpl
require "sockaddr.h";
extern record my_sockaddr_t {
  var len: size_t;
}
var x: my_sockaddr_t;
x.len = c_sizeof(c_int);
writeln(x.len);

OR

extern {
  #include "sockaddr.h"
}
27 Implementing Combined Code Generation

chpl: Chapel AST → IR codegen; clang: clang AST → IR CodeGen.

Call to C proc? Use of C global? → GetAddrOfGlobal
Use of extern type? → CodeGen::convertTypeForMemory (new to clang 5)
Use of field in extern record? → CodeGen::getLLVMFieldNumber (will be new in clang 6)
28 C Macros

// usecs.h
#define USECS_PER_SEC

// microseconds.chpl
require "usecs.h";
extern const USECS_PER_SEC:c_int;
config const secs = 1;
writeln(secs, " seconds is ", secs*USECS_PER_SEC, " microseconds");

$ chpl microseconds.chpl
$ ./microseconds

How can this work when we generate LLVM IR?
29 'forall' parallelism. The earlier example used A = B + alpha * C, which is equivalent to: forall (a, b, c) in zip(A, B, C) do a = b + alpha * c; Chapel's forall loop represents a data-parallel loop: iterations can run in any order; it typically divides iterations up among some tasks; parallelism is controlled by what is iterated over.
30 'forall' lowering (1)

// user-code.chpl
forall x in MyIter() {
  body(x);
}

+

// library.chpl
iter MyIter(...) {
  coforall i in 1..n do   // creates n tasks
    for j in 1..m do
      yield i*m+j;
}

becomes:

coforall i in 1..n do
  for j in 1..m do
    body(i*m+j);
31 'forall' lowering (2)

coforall i in 1..n do
  for j in 1..m do
    body(i*m+j);

becomes:

count = n
for i in 1..n do
  spawn(taskfn, i)
wait for count == 0

proc taskfn(i) {
  for j in 1..m do
    body(i*m+j);
}

Could be represented in a parallel LLVM IR
32 Vectorizing We'd like to vectorize 'forall' loops Recall, 'forall' means iterations can run in any order Two strategies for vectorization: A. Vectorize in Chapel front-end Chapel front-end creates vectorized LLVM IR Challenges: not sharing vectorizer, might be a deoptimization B. Vectorize in LLVM optimizations (LoopVectorizer) Chapel front-end generates loops with parallel_loop_access Challenges: user-defined reductions, querying vector lane!32
33 Vectorizing in the Front-End. It would be lower level than most front-end operations, and a lot depends on the processor: vector width, supported vector operations, whether or not vectorizing is profitable. Front-end vectorization is reasonable only if these details are simple. Presumably there is a reason that vectorization normally runs late in the LLVM optimization pipeline... And it misses out on a chance to share with the community.
34 Vectorizing as an LLVM optimization. parallel_loop_access is good for most of the loop body, but what about user-defined reductions? Would opaque function calls help? The front-end generates opaque_accumulate function calls; after vectorization, these are replaced with real reduction ops. The vectorizer would see something like this:

define opaque_accumulate(...) readnone

for ... {
  loads/stores with parallel_loop_access
  ...
  %acc = opaque_accumulate(%acc, %value)
}

Would it interfere too much with the vectorizer? Harm cost modelling? Does LLVM need a canonical way to express custom reductions?
35 Communication Optimization in LLVM

[flow diagram as before: a.chpl, b.chpl, … plus runtime .h headers and extern { } blocks → chpl (with clang) → LLVM module → LLVM Opts → linker (with runtime .a) → Executable]
36 Aside: Introducing PGAS Communication
37 Parallelism and Locality: Distinct in Chapel

This is a parallel, but local program:
coforall i in 1..msgs do
  writeln("Hello from task ", i);

This is a distributed, but serial program:
writeln("Hello from locale 0!");
on Locales[1] do writeln("Hello from locale 1!");
on Locales[2] do writeln("Hello from locale 2!");

This is a distributed parallel program:
coforall i in 1..msgs do
  on Locales[i%numLocales] do
    writeln("Hello from task ", i, " running on locale ", here.id);
38 Partitioned Global Address Space (PGAS) Languages (or, more accurately: partitioned global namespace languages). The abstract concept: support a shared namespace on distributed memory; permit parallel tasks to access remote variables by naming them; establish a strong sense of ownership: every variable has a well-defined location, and local variables are cheaper to access than remote ones. Traditional PGAS languages have been SPMD in nature; the best-known examples are Fortran 2008's co-arrays and Unified Parallel C (UPC). [diagram: a partitioned shared name-/address space spanning private spaces 0 through 4]
39 SPMD PGAS Languages (using a pseudo-language, not Chapel)

shared int i(*);   // declare a shared variable i

40 SPMD PGAS Languages (using a pseudo-language, not Chapel)

shared int i(*);   // declare a shared variable i

function main() {
  i = 2*this_image();   // each image initializes its copy
}

41 SPMD PGAS Languages (using a pseudo-language, not Chapel)

shared int i(*);   // declare a shared variable i

function main() {
  i = 2*this_image();   // each image initializes its copy
  private int j;        // declare a private variable j
}

42 SPMD PGAS Languages (using a pseudo-language, not Chapel)

shared int i(*);   // declare a shared variable i

function main() {
  i = 2*this_image();   // each image initializes its copy
  barrier();
  private int j;        // declare a private variable j
  j = i( (this_image()+1) % num_images() );
  // ^^ access our neighbor's copy of i; the compiler and runtime implement the communication
  // Q: How did we know our neighbor had an i?
  // A: Because it's SPMD: we're all running the same program, so if we have an i, so do they.
}

[diagrams on each slide: the value of i (and later j) held by each image]
43 Chapel and PGAS. Chapel is PGAS, but unlike most, it's not inherently SPMD: you never think about the other copies of the program. The global name/address space comes from lexical scoping; as in traditional languages, each declaration yields one variable, and variables are stored on the locale where the task declaring them is executing. [diagram: Locales (think: compute nodes)]
44 Chapel: Scoping and Locality

var i: int;

[diagram: i stored on locale 0; Locales (think: compute nodes)]

45 Chapel: Scoping and Locality

var i: int;
on Locales[1] {

[diagram: i on locale 0; execution moves to locale 1]

46 Chapel: Scoping and Locality

var i: int;
on Locales[1] {
  var j: int;

[diagram: i on locale 0, j on locale 1]

47 Chapel: Scoping and Locality

var i: int;
on Locales[1] {
  var j: int;
  coforall loc in Locales {
    on loc {

[diagram: i on locale 0, j on locale 1; a task on every locale]

48 Chapel: Scoping and Locality

var i: int;
on Locales[1] {
  var j: int;
  coforall loc in Locales {
    on loc {
      var k: int;

[diagram: i on locale 0, j on locale 1, one k per locale]

49 Chapel: Scoping and Locality

      ...
      var k: int;
      k = 2*i + j;   // OK to access i, j, and k wherever they live

50 Chapel: Scoping and Locality

      ...
      var k: int;
      k = 2*i + j;   // here, i and j are remote, so the compiler + runtime will transfer their values

[diagram: the values of i and j being fetched to each locale computing k]

51 Chapel: Locality queries

      ...
      var k: int;
      here       // query the locale on which this task is running
      j.locale   // query the locale on which j is stored
52 Communication Optimization in LLVM

[flow diagram as before: a.chpl, b.chpl, … plus runtime .h headers and extern { } blocks → chpl (with clang) → LLVM module → LLVM Opts → linker (with runtime .a) → Executable]
53 Communication Optimization: Overview. The idea is to use LLVM passes to optimize GET and PUT. Enabled with the --llvm-wide-opt compiler flag. First appeared in Chapel 1.8. Unfortunately, it was not working in the 1.15 and 1.16 releases.
54 Communication Optimization: In a Picture

// x is possibly remote
var sum = 0;
for i in ... {
  %1 = get(x);
  sum += %1;
}

TO GLOBAL MEMORY:
var sum = 0;
for i in ... {
  %1 = load <100> %x
  sum += %1;
}

EXISTING LLVM OPTIMIZATION (LICM):
var sum = 0;
%1 = load <100> %x
for i in ... {
  sum += %1;
}

TO DISTRIBUTED MEMORY:
var sum = 0;
%1 = get(x);
for i in ... {
  sum += %1;
}

where load <100> %x means: load i64 addrspace(100)* %x
55 Communication Optimization: Details. Uses existing LLVM passes to optimize GET and PUT: GET/PUT are represented as load/store with a special pointer type; normal LLVM optimizations run and optimize the load/store as usual; a custom LLVM pass then lowers them back to calls to the Chapel runtime. Optimization gains from this strategy can be significant; see "LLVM-based Communication Optimizations for PGAS Programs". Historically, this needed packed wide pointers as a workaround: a wide pointer is normally stored as a 128-bit struct {node id, address}, but bugs in LLVM 3.3 prevented using 128-bit pointers. Packed wide pointers store the node id in the high bits of a 64-bit address, which led to scalability constraints (a hard maximum on the number of nodes) and sometimes made --llvm-wide-opt code slower than the C backend.
56 Communication Optimization: Recent Work Fixed --llvm-wide-opt Removed reliance on packed wide pointers Now generates LLVM IR with 128-bit pointers revealed (only) a few LLVM bugs (so far) BasicAA - needed to use APInt more (not just int64_t) ValueTracking - error using APInt!56
57 Comm Opt: Impact

[chart: speedup ("times faster") of --llvm and --llvm-wide-opt vs. the C backend on 16 XC nodes, for HPCC PTRANS, miniMD, NPB EP, HPCC HPL, HPCC FFT, NPB MG, Lulesh, and CoMD]
58 Future Work. Chapel: hope to make --llvm the default; migrate some Chapel-specific optimizations to LLVM; continue improving the LLVM IR that Chapel generates; separate compilation & link-time optimization; a Chapel interpreter using the LLVM JIT; using a shared parallel LLVM IR.
59 Thanks. Thanks for your attention, great discussions, patch reviews, and related work! Check us out at
60 Legal Disclaimer Information in this document is provided in connection with Cray Inc. products. No license, express or implied, to any intellectual property rights is granted by this document. Cray Inc. may make changes to specifications and product descriptions at any time, without notice. All products, dates and figures specified are preliminary based on current expectations, and are subject to change without notice. Cray hardware and software products may contain design defects or errors known as errata, which may cause the product to deviate from published specifications. Current characterized errata are available on request. Cray uses codenames internally to identify products that are in development and not yet publically announced for release. Customers and other third parties are not authorized by Cray Inc. to use codenames in advertising, promotion or marketing and any use of Cray Inc. internal codenames is at the sole risk of the user. Performance tests and ratings are measured using specific systems and/or components and reflect the approximate performance of Cray Inc. products as measured by those tests. Any difference in system hardware or software design or configuration may affect actual performance. The following are trademarks of Cray Inc. and are registered in the United States and other countries: CRAY and design, SONEXION, and URIKA. The following are trademarks of Cray Inc.: ACE, APPRENTICE2, CHAPEL, CLUSTER CONNECT, CRAYPAT, CRAYPORT, ECOPHLEX, LIBSCI, NODEKARE, THREADSTORM. The following system family marks, and associated model number marks, are trademarks of Cray Inc.: CS, CX, XC, XE, XK, XMT, and XT. The registered trademark LINUX is used pursuant to a sublicense from LMI, the exclusive licensee of Linus Torvalds, owner of the mark on a worldwide basis. Other trademarks used in this document are the property of their respective owners. Copyright 2015 Cray Inc.
63 Backup Slides!63
64 Chapel Community R&D Efforts (and several others )
65 Contributions to LLVM/clang. Add clang CodeGen support for generating field access (supports 'extern record'). Fix a bug in BasicAA crashing with 128-bit pointers (enables --llvm-wide-opt with {node, address} wide pointers). Fix a bug in ValueTracking crashing with 128-bit pointers (enables --llvm-wide-opt with {node, address} wide pointers).
66 ISx Execution Time: MPI, SHMEM

[chart: ISx weakiso total time (seconds) vs. number of nodes (x 36 cores per node), for MPI and SHMEM; lower is better]

67 ISx Execution Time: MPI, SHMEM, Chapel

[chart: as above, with Chapel 1.15 added]

68 RA Performance: Chapel vs. MPI

[chart: RA performance, Chapel vs. MPI, vs. number of locales (x 36 cores per locale); higher is better]
69 Chapel+LLVM - Google Summer of Code. Przemysław Leśniak contributed many improvements: mark signed integer arithmetic with 'nsw' to improve loop optimization; command-line flags to emit LLVM IR at particular points in compilation; new tests that use the LLVM tool FileCheck to verify emitted LLVM IR; mark order-independent loops with llvm.parallel_loop_access metadata; mark const variables with llvm.invariant.start; enable LLVM floating point optimization when --no-ieee-float is used; add the nonnull attribute to ref arguments to functions; add a header implementing clang built-ins to improve 'complex' performance.
70 Performance Regression: LLVM 3.7 to 4 upgrade. LCALS find_first_min went from 2x faster than C to 3x slower. [chart: time (seconds) for the C and LLVM backends over 2015-2017]
71 Performance Improvements: LLVM 3.7 to 4 upgrade. [chart: benchmark times, C vs. --llvm]
72 Performance Improvements: GSoC nsw. PRK stencil got 3x faster with no-signed-wrap on signed integer addition; the loop induction variable is now identified.
73 Performance Improvement: GSoC fast float. LCALS fir got 3x faster with LLVM floating point optimizations enabled for --no-ieee-float. [chart: time (seconds) for the C and LLVM backends over 2015-2017]
74 Performance Improvements: Built-ins Header. The complex version of Mandelbrot got 3x faster with a header implementing clang built-ins.
75 --llvm is Now Competitive. Benchmark runtimes are now competitive with, or better than, the C backend when using --llvm.
More informationCompiler Improvements Chapel Team, Cray Inc. Chapel version 1.13 April 7, 2016
Compiler Improvements Chapel Team, Cray Inc. Chapel version 1.13 April 7, 2016 Safe Harbor Statement This presentation may contain forward-looking statements that are based on our current expectations.
More informationMemory Leaks Chapel Team, Cray Inc. Chapel version 1.14 October 6, 2016
Memory Leaks Chapel Team, Cray Inc. Chapel version 1.14 October 6, 2016 Safe Harbor Statement This presentation may contain forward-looking statements that are based on our current expectations. Forward
More informationGrab-Bag Topics / Demo COMPUTE STORE ANALYZE
Grab-Bag Topics / Demo Safe Harbor Statement This presentation may contain forward-looking statements that are based on our current expectations. Forward looking statements may include statements about
More informationAdding Lifetime Checking to Chapel Michael Ferguson, Cray Inc. CHIUW 2018 May 25, 2018
Adding Lifetime Checking to Chapel Michael Ferguson, Cray Inc. CHIUW 2018 May 25, 2018 Safe Harbor Statement This presentation may contain forward-looking statements that are based on our current expectations.
More informationArray, Domain, & Domain Map Improvements Chapel Team, Cray Inc. Chapel version 1.17 April 5, 2018
Array, Domain, & Domain Map Improvements Chapel Team, Cray Inc. Chapel version 1.17 April 5, 2018 Safe Harbor Statement This presentation may contain forward-looking statements that are based on our current
More informationData Parallelism COMPUTE STORE ANALYZE
Data Parallelism Safe Harbor Statement This presentation may contain forward-looking statements that are based on our current expectations. Forward looking statements may include statements about our financial
More informationCHAPEL Multiresolution Parallel Programming. Brad Chamberlain Cray Inc. Principal Engineer
CHAPEL Multiresolution Parallel Programming Brad Chamberlain Cray Inc. Principal Engineer 1 GF 1988: Cray Y-MP; 8 Processors Static finite element analysis 1 TF 1998: Cray T3E; 1,024 Processors Modeling
More informationSonexion GridRAID Characteristics
Sonexion GridRAID Characteristics CUG 2014 Mark Swan, Cray Inc. 1 Safe Harbor Statement This presentation may contain forward-looking statements that are based on our current expectations. Forward looking
More informationThe Use and I: Transitivity of Module Uses and its Impact
The Use and I: Transitivity of Module Uses and its Impact Lydia Duncan, Cray Inc. CHIUW 2016 May 27 th, 2016 Safe Harbor Statement This presentation may contain forward-looking statements that are based
More informationThe Chapel Parallel Programming Language of the future
The Chapel Parallel Programming Language of the future Brad Chamberlain, Chapel Team, Cray Inc. Pacific Northwest Numerical Analysis Seminar October 19 th, 2013 The Chapel Parallel Programming Language
More informationQ & A, Project Status, and Wrap-up COMPUTE STORE ANALYZE
Q & A, Project Status, and Wrap-up Safe Harbor Statement This presentation may contain forward-looking statements that are based on our current expectations. Forward looking statements may include statements
More informationBrad Chamberlain, Cray Inc. 40 th SPEEDUP Workshop on HPC February 6 th, 2012
Brad Chamberlain, Cray Inc. 40 th SPEEDUP Workshop on HPC February 6 th, 2012 1 GF 1988: Cray Y-MP; 8 Processors Static finite element analysis 1 TF 1998: Cray T3E; 1,024 Processors Modeling of metallic
More informationCSE 501 Language Issues
CSE 501 Language Issues Languages for High Performance Computing (HPC) and Parallelization Brad Chamberlain, Chapel Team, Cray Inc. UW CSE 501, Spring 2015 May 5 th, 2015 Chapel: The Language for HPC/Parallelization*
More informationBrad Chamberlain, Cray Inc. I2PC Seminar, UIUC April 8 th, 2012
Brad Chamberlain, Cray Inc. I2PC Seminar, UIUC April 8 th, 2012 1 GF 1988: Cray Y-MP; 8 Processors Static finite element analysis 1 TF 1998: Cray T3E; 1,024 Processors Modeling of metallic magnet atoms
More informationTransferring User Defined Types in
Transferring User Defined Types in OpenACC James Beyer, Ph.D () 1 Safe Harbor Statement This presentation may contain forward-looking statements that are based on our current expectations. Forward looking
More informationOpenFOAM Scaling on Cray Supercomputers Dr. Stephen Sachs GOFUN 2017
OpenFOAM Scaling on Cray Supercomputers Dr. Stephen Sachs GOFUN 2017 Safe Harbor Statement This presentation may contain forward-looking statements that are based on our current expectations. Forward looking
More informationHPC Programming Models: Current Practice, Emerging Promise. Brad Chamberlain Chapel Team, Cray Inc.
HPC Programming Models: Current Practice, Emerging Promise Brad Chamberlain Chapel Team, Cray Inc. SIAM Conference on Parallel Processing for Scientific Computing (PP10) February 25, 2010 Why I m Glad
More informationBrad Chamberlain, Cray Inc. INT Exascale Workshop. June 30 th, 2011
Brad Chamberlain, Cray Inc. INT Exascale Workshop June 30 th, 2011 1 GF 1988: Cray Y-MP; 8 Processors Static finite element analysis 1 TF 1998: Cray T3E; 1,024 Processors Modeling of metallic magnet atoms
More informationCaching Puts and Gets in a PGAS Language Runtime
Caching Puts and Gets in a PGAS Language Runtime Michael Ferguson Cray Inc. Daniel Buettner Laboratory for Telecommunication Sciences September 17, 2015 C O M P U T E S T O R E A N A L Y Z E Safe Harbor
More informationMultiresolution Global-View Programming in Chapel
Multiresolution Global-View Programming in Chapel Brad Chamberlain, Chapel Team, Cray Inc. Argonne Training Program on Extreme-Scale Computing August 1 st, 2013 Sustained Performance Milestones 1 GF 1988:
More informationPorting the parallel Nek5000 application to GPU accelerators with OpenMP4.5. Alistair Hart (Cray UK Ltd.)
Porting the parallel Nek5000 application to GPU accelerators with OpenMP4.5 Alistair Hart (Cray UK Ltd.) Safe Harbor Statement This presentation may contain forward-looking statements that are based on
More informationXC System Management Usability BOF Joel Landsteiner & Harold Longley, Cray Inc. Copyright 2017 Cray Inc.
XC System Management Usability BOF Joel Landsteiner & Harold Longley, Cray Inc. 1 BOF Survey https://www.surveymonkey.com/r/kmg657s Aggregate Ideas at scale! Take a moment to fill out quick feedback form
More informationBrad Chamberlain, Cray Inc. Barcelona Multicore Workshop 2010 October 22, 2010
Brad Chamberlain, Cray Inc. Barcelona Multicore Workshop 2010 October 22, 2010 Brad Chamberlain, Cray Inc. Barcelona Multicore Workshop 2010 October 22, 2010 Brad Chamberlain, Cray Inc. Barcelona Multicore
More informationAn Exploration into Object Storage for Exascale Supercomputers. Raghu Chandrasekar
An Exploration into Object Storage for Exascale Supercomputers Raghu Chandrasekar Agenda Introduction Trends and Challenges Design and Implementation of SAROJA Preliminary evaluations Summary and Conclusion
More informationChapel s New Adventures in Data Locality Brad Chamberlain Chapel Team, Cray Inc. August 2, 2017
Chapel s New Adventures in Data Locality Brad Chamberlain Chapel Team, Cray Inc. August 2, 2017 Safe Harbor Statement This presentation may contain forward-looking statements that are based on our current
More informationReveal Heidi Poxon Sr. Principal Engineer Cray Programming Environment
Reveal Heidi Poxon Sr. Principal Engineer Cray Programming Environment Legal Disclaimer Information in this document is provided in connection with Cray Inc. products. No license, express or implied, to
More informationThe resurgence of parallel programming languages
The resurgence of parallel programming languages Jamie Hanlon & Simon McIntosh-Smith University of Bristol Microelectronics Research Group hanlon@cs.bris.ac.uk 1 The Microelectronics Research Group at
More informationReveal. Dr. Stephen Sachs
Reveal Dr. Stephen Sachs Agenda Reveal Overview Loop work estimates Creating program library with CCE Using Reveal to add OpenMP 2 Cray Compiler Optimization Feedback OpenMP Assistance MCDRAM Allocation
More informationChapel Hierarchical Locales
Chapel Hierarchical Locales Greg Titus, Chapel Team, Cray Inc. SC14 Emerging Technologies November 18 th, 2014 Safe Harbor Statement This presentation may contain forward-looking statements that are based
More informationCray XC System Node Diagnosability. Jeffrey J. Schutkoske Platform Services Group (PSG)
Cray XC System Node Diagnosability Jeffrey J. Schutkoske Platform Services Group (PSG) jjs@cray.com Safe Harbor Statement This presentation may contain forward-looking statements that are based on our
More informationProductive Programming in Chapel:
Productive Programming in Chapel: A Computation-Driven Introduction Base Language with n-body Michael Ferguson and Lydia Duncan Cray Inc, SC15 November 15th, 2015 C O M P U T E S T O R E A N A L Y Z E
More informationProject Caribou; Streaming metrics for Sonexion Craig Flaskerud
Project Caribou; Streaming metrics for Sonexion Craig Flaskerud Legal Disclaimer Information in this document is provided in connection with Cray Inc. products. No license, express or implied, to any intellectual
More informationChapel: Productive Parallel Programming from the Pacific Northwest
Chapel: Productive Parallel Programming from the Pacific Northwest Brad Chamberlain, Cray Inc. / UW CS&E Pacific Northwest Prog. Languages and Software Eng. Meeting March 15 th, 2016 Safe Harbor Statement
More informationNew Tools and Tool Improvements Chapel Team, Cray Inc. Chapel version 1.16 October 5, 2017
New Tools and Tool Improvements Chapel Team, Cray Inc. Chapel version 1.16 October 5, 2017 Safe Harbor Statement This presentation may contain forward-looking statements that are based on our current expectations.
More informationLustre Lockahead: Early Experience and Performance using Optimized Locking. Michael Moore
Lustre Lockahead: Early Experience and Performance using Optimized Locking Michael Moore Agenda Purpose Investigate performance of a new Lustre and MPI-IO feature called Lustre Lockahead (LLA) Discuss
More informationVectorization of Chapel Code Elliot Ronaghan, Cray Inc. June 13, 2015
Vectorization of Chapel Code Elliot Ronaghan, Cray Inc. CHIUW @PLDI June 13, 2015 Safe Harbor Statement This presentation may contain forward-looking statements that are based on our current expectations.
More informationHow-to write a xtpmd_plugin for your Cray XC system Steven J. Martin
How-to write a xtpmd_plugin for your Cray XC system Steven J. Martin (stevem@cray.com) Cray XC Telemetry Plugin Introduction Enabling sites to get telemetry data off the Cray Plugin interface enables site
More informationMPI for Cray XE/XK Systems & Recent Enhancements
MPI for Cray XE/XK Systems & Recent Enhancements Heidi Poxon Technical Lead Programming Environment Cray Inc. Legal Disclaimer Information in this document is provided in connection with Cray Inc. products.
More informationRedfish APIs on Next Generation Cray Hardware CUG 2018 Steven J. Martin, Cray Inc.
Redfish APIs on Next Generation Cray Hardware Steven J. Martin, Cray Inc. Modernizing Cray Systems Management Use of Redfish APIs on Next Generation Cray Hardware Steven Martin, David Rush, Kevin Hughes,
More informationAn Example-Based Introduction to Global-view Programming in Chapel. Brad Chamberlain Cray Inc.
An Example-Based Introduction to Global-view Programming in Chapel Brad Chamberlain Cray Inc. User Experience and Advances in Bridging Multicore s Programmability Gap November 16, 2009 SC09 -- Portland
More informationPerformance Optimizations Generated Code Improvements Chapel Team, Cray Inc. Chapel version 1.11 April 2, 2015
Performance Optimizations Generated Code Improvements Chapel Team, Cray Inc. Chapel version 1.11 April 2, 2015 Safe Harbor Statement This presentation may contain forward-looking statements that are based
More informationBenchmarks, Performance Optimizations, and Memory Leaks Chapel Team, Cray Inc. Chapel version 1.13 April 7, 2016
Benchmarks, Performance Optimizations, and Memory Leaks Chapel Team, Cray Inc. Chapel version 1.13 April 7, 2016 Safe Harbor Statement This presentation may contain forward-looking statements that are
More informationCray Performance Tools Enhancements for Next Generation Systems Heidi Poxon
Cray Performance Tools Enhancements for Next Generation Systems Heidi Poxon Agenda Cray Performance Tools Overview Recent Enhancements Support for Cray systems with KNL 2 Cray Performance Analysis Tools
More informationEnhancing scalability of the gyrokinetic code GS2 by using MPI Shared Memory for FFTs
Enhancing scalability of the gyrokinetic code GS2 by using MPI Shared Memory for FFTs Lucian Anton 1, Ferdinand van Wyk 2,4, Edmund Highcock 2, Colin Roach 3, Joseph T. Parker 5 1 Cray UK, 2 University
More informationIntel Xeon PhiTM Knights Landing (KNL) System Software Clark Snyder, Peter Hill, John Sygulla
Intel Xeon PhiTM Knights Landing (KNL) System Software Clark Snyder, Peter Hill, John Sygulla Motivation The Intel Xeon Phi TM Knights Landing (KNL) has 20 different configurations 5 NUMA modes X 4 memory
More informationData Parallelism, By Example COMPUTE STORE ANALYZE
Data Parallelism, By Example Safe Harbor Statement This presentation may contain forward-looking statements that are based on our current expectations. Forward looking statements may include statements
More informationPerformance Optimizations Chapel Team, Cray Inc. Chapel version 1.14 October 6, 2016
Performance Optimizations Chapel Team, Cray Inc. Chapel version 1.14 October 6, 2016 Safe Harbor Statement This presentation may contain forward-looking statements that are based on our current expectations.
More informationHybrid Warm Water Direct Cooling Solution Implementation in CS300-LC
Hybrid Warm Water Direct Cooling Solution Implementation in CS300-LC Roger Smith Mississippi State University Giridhar Chukkapalli Cray, Inc. C O M P U T E S T O R E A N A L Y Z E 1 Safe Harbor Statement
More informationLLVM-based Communication Optimizations for PGAS Programs
LLVM-based Communication Optimizations for PGAS Programs nd Workshop on the LLVM Compiler Infrastructure in HPC @ SC15 Akihiro Hayashi (Rice University) Jisheng Zhao (Rice University) Michael Ferguson
More informationFirst experiences porting a parallel application to a hybrid supercomputer with OpenMP 4.0 device constructs. Alistair Hart (Cray UK Ltd.
First experiences porting a parallel application to a hybrid supercomputer with OpenMP 4.0 device constructs Alistair Hart (Cray UK Ltd.) Safe Harbor Statement This presentation may contain forward-looking
More informationLecture 11: GPU programming
Lecture 11: GPU programming David Bindel 4 Oct 2011 Logistics Matrix multiply results are ready Summary on assignments page My version (and writeup) on CMS HW 2 due Thursday Still working on project 2!
More informationChapel: An Emerging Parallel Programming Language. Brad Chamberlain, Chapel Team, Cray Inc. Emerging Technologies, SC13 November 20 th, 2013
Chapel: An Emerging Parallel Programming Language Brad Chamberlain, Chapel Team, Cray Inc. Emerging Technologies, SC13 November 20 th, 2013 A Call To Arms Q: Why doesn t HPC have languages as enjoyable
More informationChapel: An Emerging Parallel Programming Language. Thomas Van Doren, Chapel Team, Cray Inc. Northwest C++ Users Group April 16 th, 2014
Chapel: An Emerging Parallel Programming Language Thomas Van Doren, Chapel Team, Cray Inc. Northwest C Users Group April 16 th, 2014 My Employer: 2 Parallel Challenges Square-Kilometer Array Photo: www.phy.cam.ac.uk
More informationStandard Library and Interoperability Improvements Chapel Team, Cray Inc. Chapel version 1.11 April 2, 2015
Standard Library and Interoperability Improvements Chapel Team, Cray Inc. Chapel version 1.11 April 2, 2015 Safe Harbor Statement This presentation may contain forward-looking statements that are based
More informationCS691/SC791: Parallel & Distributed Computing
CS691/SC791: Parallel & Distributed Computing Introduction to OpenMP 1 Contents Introduction OpenMP Programming Model and Examples OpenMP programming examples Task parallelism. Explicit thread synchronization.
More information. Programming in Chapel. Kenjiro Taura. University of Tokyo
.. Programming in Chapel Kenjiro Taura University of Tokyo 1 / 44 Contents. 1 Chapel Chapel overview Minimum introduction to syntax Task Parallelism Locales Data parallel constructs Ranges, domains, and
More informationCS4961 Parallel Programming. Lecture 16: Introduction to Message Passing 11/3/11. Administrative. Mary Hall November 3, 2011.
CS4961 Parallel Programming Lecture 16: Introduction to Message Passing Administrative Next programming assignment due on Monday, Nov. 7 at midnight Need to define teams and have initial conversation with
More informationLanguage Improvements Chapel Team, Cray Inc. Chapel version 1.15 April 6, 2017
Language Improvements Chapel Team, Cray Inc. Chapel version 1.15 April 6, 2017 Safe Harbor Statement This presentation may contain forward-looking statements that are based on our current expectations.
More informationCUDA Programming. Week 1. Basic Programming Concepts Materials are copied from the reference list
CUDA Programming Week 1. Basic Programming Concepts Materials are copied from the reference list G80/G92 Device SP: Streaming Processor (Thread Processors) SM: Streaming Multiprocessor 128 SP grouped into
More informationLustre Networking at Cray. Chris Horn
Lustre Networking at Cray Chris Horn hornc@cray.com Agenda Lustre Networking at Cray LNet Basics Flat vs. Fine-Grained Routing Cost Effectiveness - Bandwidth Matching Connection Reliability Dealing with
More informationIntroduction to the Message Passing Interface (MPI)
Introduction to the Message Passing Interface (MPI) CPS343 Parallel and High Performance Computing Spring 2018 CPS343 (Parallel and HPC) Introduction to the Message Passing Interface (MPI) Spring 2018
More informationSTREAM and RA HPC Challenge Benchmarks
STREAM and RA HPC Challenge Benchmarks simple, regular 1D computations results from SC 09 competition AMR Computations hierarchical, regular computation SSCA #2 unstructured graph computation 2 Two classes
More informationChapter 4. Message-passing Model
Chapter 4 Message-Passing Programming Message-passing Model 2 1 Characteristics of Processes Number is specified at start-up time Remains constant throughout the execution of program All execute same program
More informationCS 179: GPU Programming. Lecture 14: Inter-process Communication
CS 179: GPU Programming Lecture 14: Inter-process Communication The Problem What if we want to use GPUs across a distributed system? GPU cluster, CSIRO Distributed System A collection of computers Each
More informationCS 470 Spring Mike Lam, Professor. OpenMP
CS 470 Spring 2017 Mike Lam, Professor OpenMP OpenMP Programming language extension Compiler support required "Open Multi-Processing" (open standard; latest version is 4.5) Automatic thread-level parallelism
More informationBrad Chamberlain Cray Inc. March 2011
Brad Chamberlain Cray Inc. March 2011 Approach the topic of mapping Chapel to a new target platform by reviewing some core Chapel concepts describing how Chapel s downward-facing interfaces implement those
More informationBlue Waters Programming Environment
December 3, 2013 Blue Waters Programming Environment Blue Waters User Workshop December 3, 2013 Science and Engineering Applications Support Documentation on Portal 2 All of this information is Available
More informationCS 470 Spring Mike Lam, Professor. OpenMP
CS 470 Spring 2018 Mike Lam, Professor OpenMP OpenMP Programming language extension Compiler support required "Open Multi-Processing" (open standard; latest version is 4.5) Automatic thread-level parallelism
More informationCS 470 Spring Parallel Languages. Mike Lam, Professor
CS 470 Spring 2017 Mike Lam, Professor Parallel Languages Graphics and content taken from the following: http://dl.acm.org/citation.cfm?id=2716320 http://chapel.cray.com/papers/briefoverviewchapel.pdf
More informationOpenACC Course. Office Hour #2 Q&A
OpenACC Course Office Hour #2 Q&A Q1: How many threads does each GPU core have? A: GPU cores execute arithmetic instructions. Each core can execute one single precision floating point instruction per cycle
More informationNew Programming Paradigms: Partitioned Global Address Space Languages
Raul E. Silvera -- IBM Canada Lab rauls@ca.ibm.com ECMWF Briefing - April 2010 New Programming Paradigms: Partitioned Global Address Space Languages 2009 IBM Corporation Outline Overview of the PGAS programming
More informationExperiences Running and Optimizing the Berkeley Data Analytics Stack on Cray Platforms
Experiences Running and Optimizing the Berkeley Data Analytics Stack on Cray Platforms Kristyn J. Maschhoff and Michael F. Ringenburg Cray Inc. CUG 2015 Copyright 2015 Cray Inc Legal Disclaimer Information
More informationProductive Performance on the Cray XK System Using OpenACC Compilers and Tools
Productive Performance on the Cray XK System Using OpenACC Compilers and Tools Luiz DeRose Sr. Principal Engineer Programming Environments Director Cray Inc. 1 The New Generation of Supercomputers Hybrid
More informationLanguage and Compiler Improvements Chapel Team, Cray Inc. Chapel version 1.14 October 6, 2016
Language and Compiler Improvements Chapel Team, Cray Inc. Chapel version 1.14 October 6, 2016 Safe Harbor Statement This presentation may contain forward-looking statements that are based on our current
More informationUsually, target code is semantically equivalent to source code, but not always!
What is a Compiler? Compiler A program that translates code in one language (source code) to code in another language (target code). Usually, target code is semantically equivalent to source code, but
More informationOverview: Emerging Parallel Programming Models
Overview: Emerging Parallel Programming Models the partitioned global address space paradigm the HPCS initiative; basic idea of PGAS the Chapel language: design principles, task and data parallelism, sum
More informationIntel Array Building Blocks
Intel Array Building Blocks Productivity, Performance, and Portability with Intel Parallel Building Blocks Intel SW Products Workshop 2010 CERN openlab 11/29/2010 1 Agenda Legal Information Vision Call
More informationSINGLE-SIDED PGAS COMMUNICATIONS LIBRARIES. Basic usage of OpenSHMEM
SINGLE-SIDED PGAS COMMUNICATIONS LIBRARIES Basic usage of OpenSHMEM 2 Outline Concept and Motivation Remote Read and Write Synchronisation Implementations OpenSHMEM Summary 3 Philosophy of the talks In
More informationSteve Deitz Chapel project, Cray Inc.
Parallel Programming in Chapel LACSI 2006 October 18 th, 2006 Steve Deitz Chapel project, Cray Inc. Why is Parallel Programming Hard? Partitioning of data across processors Partitioning of tasks across
More informationMPI and OpenMP (Lecture 25, cs262a) Ion Stoica, UC Berkeley November 19, 2016
MPI and OpenMP (Lecture 25, cs262a) Ion Stoica, UC Berkeley November 19, 2016 Message passing vs. Shared memory Client Client Client Client send(msg) recv(msg) send(msg) recv(msg) MSG MSG MSG IPC Shared
More informationOpenMP Introduction. CS 590: High Performance Computing. OpenMP. A standard for shared-memory parallel programming. MP = multiprocessing
CS 590: High Performance Computing OpenMP Introduction Fengguang Song Department of Computer Science IUPUI OpenMP A standard for shared-memory parallel programming. MP = multiprocessing Designed for systems
More informationParallel Programming. Marc Snir U. of Illinois at Urbana-Champaign & Argonne National Lab
Parallel Programming Marc Snir U. of Illinois at Urbana-Champaign & Argonne National Lab Summing n numbers for(i=1; i++; i
More informationIntel Advisor XE Future Release Threading Design & Prototyping Vectorization Assistant
Intel Advisor XE Future Release Threading Design & Prototyping Vectorization Assistant Parallel is the Path Forward Intel Xeon and Intel Xeon Phi Product Families are both going parallel Intel Xeon processor
More informationXC Series Shifter User Guide (CLE 6.0.UP02) S-2571
XC Series Shifter User Guide (CLE 6.0.UP02) S-2571 Contents Contents 1 About the XC Series Shifter User Guide...3 2 Shifter System Introduction...6 3 Download and Convert the Docker Image...7 4 Submit
More information
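The STREAM Triad computation introduced above (∀i ∈ 1..m, A_i = B_i + α·C_i) can be sketched in Chapel as follows. This is a minimal illustrative sketch, not code from the talk; the problem size `m` and the scalar `alpha` are placeholder values chosen for the example.

```chapel
// STREAM Triad sketch: A = B + alpha * C over m-element vectors.
// `config const` lets these be overridden on the command line,
// e.g. ./triad --m=10000000
config const m = 1_000_000;
config const alpha = 3.0;

var A, B, C: [1..m] real;

B = 1.0;
C = 2.0;

// A whole-array statement: scalar operators are promoted over the
// arrays, expressing the triad as an implicitly parallel computation.
A = B + alpha * C;
```

The same computation could also be written with an explicit `forall i in 1..m` loop; the whole-array form shown here is the more concise data-parallel style the slides advertise.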