A Parallelizing Compiler for Multicore Systems

1 A Parallelizing Compiler for Multicore Systems José M. Andión, Manuel Arenaz, Gabriel Rodríguez and Juan Touriño 17th International Workshop on Software and Compilers for Embedded Systems (SCOPES 2014) June 10-11, 2014 Schloss Rheinfels, Sankt Goar, Germany

2 Outline
  Motivation: The Parallel Challenge
  KIR: A dikernel-based IR
  Automatic Partitioning driven by the KIR
  Experimental Evaluation
  Conclusions

4 The Parallel Challenge
[Figure: processor performance relative to the VAX-11/780 (axis: "Performance (vs. VAX-11/780)"), plotted for machines from the VAX-11/780 (5 MHz) up to a 6-core Intel Xeon (3.3 GHz, boost to 3.6 GHz) and annotated with growth rates in %/year.]
Source: David A. Patterson and John L. Hennessy. Computer Organization and Design: The Hardware/Software Interface. Elsevier, 2014.

5 The Parallel Challenge
  libraries
  compiler directives
  programming languages
  parallelizing compilers

8 dikernel: Domain-Independent Computational Kernel
Levels of program representation, from most to least abstract:
  DOMAIN-SPECIFIC CONCEPT LEVEL (problem solving methods and application domain)
  DOMAIN-INDEPENDENT CONCEPT LEVEL (programming practice)
  SEMANTIC LEVEL (control flow and data dependence graphs)
  SYNTACTIC LEVEL (abstract syntax tree)
  TEXT LEVEL (ASCII code)
A dikernel characterizes the computations carried out in a program without being affected by how they are coded, and exposes multiple levels of parallelism.
M. Arenaz et al. XARK: An Extensible Framework for Automatic Recognition of Computational Kernels. ACM Transactions on Programming Languages and Systems, 30(6), 2008.

9 Standard statement-based IR
Source code (referred to as Fig. 1 on later slides):
  1. for (i = 0; i < n; i++) {
  2.   t = 0;
  3.   for (j = 0; j < m; j++) {
  4.     t = t + A[i][j] * x[j];
  5.   }
  6.   y[i] = t;
  7. }
Control flow graph (basic blocks):
  BB0: i = 0;
  BB1: t = 0; j = 0;
  BB2: t = t + A[i][j] * x[j]; j++;
  BB3: if (j < m)
  BB4: y[i] = t; i++;
  BB5: if (i < n), with T and F branches

10 Building the KIR (I)
Annotation on the control flow graph: i = 0 dominates i++; DEF(i, i = 0), USE(i, i++)
Dikernels recognized in each basic block:
  BB0: i = 0;                           < i BB0 >
  BB1: t = 0; j = 0;                    < t BB1 >, < j BB1 >
  BB2: t = t + A[i][j] * x[j]; j++;     < t BB2 >, < j BB2 >
  BB3: if (j < m)
  BB4: y[i] = t; i++;                   < y BB4 >, < i BB4 >
  BB5: if (i < n), with T and F branches

11 Building the KIR (II)
Dikernels: < i BB0 >, < i BB4 >, < j BB1 >, < j BB2 >, < t BB1 >, < t BB2 >, < y BB4 >
Execution scopes: ROOT EXECUTION SCOPE; ES_for i (Fig. 1, lines 1-7); ES_for j (Fig. 1, lines 3-5)
Classification:
  < t BB1 > scalar assignment
  < t BB2 > scalar reduction
  < y BB4 > regular assignment

12 Building the KIR (and III)
Source code (Fig. 1):
  1. for (i = 0; i < n; i++) {
  2.   t = 0;
  3.   for (j = 0; j < m; j++) {
  4.     t = t + A[i][j] * x[j];
  5.   }
  6.   y[i] = t;
  7. }
Resulting KIR:
  ROOT EXECUTION SCOPE
    ES_for i (Fig. 1, lines 1-7)
      < t BB1 > scalar assignment
      ES_for j (Fig. 1, lines 3-5)
        < t BB2 > scalar reduction
      < y BB4 > regular assignment
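One way to picture the KIR on this slide is as a small tree whose inner nodes are execution scopes and whose leaves are dikernels. The sketch below is only an illustration under that assumption: the type and function names (`KirNode`, `kir_count_kind`, `kir_scope_depth`) are hypothetical and are not taken from the authors' GCC-based implementation, and dikernel-level dependences are left out.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical dikernel kinds, named after the labels on the slide. */
typedef enum { SCALAR_ASSIGNMENT, SCALAR_REDUCTION, REGULAR_ASSIGNMENT } DikernelKind;

/* A node is either an execution scope (inner node) or a dikernel (leaf). */
typedef enum { NODE_SCOPE, NODE_DIKERNEL } NodeType;

#define MAX_CHILDREN 4

typedef struct KirNode {
    NodeType type;
    const char *label;                            /* e.g. "ES_for i" or "<t BB1>" */
    DikernelKind kind;                            /* meaningful only for dikernels */
    const struct KirNode *children[MAX_CHILDREN]; /* used only by scopes */
    size_t n_children;                            /* 0 for dikernels */
} KirNode;

/* Count the dikernels of a given kind in the subtree rooted at node. */
size_t kir_count_kind(const KirNode *node, DikernelKind kind) {
    assert(node != NULL);
    if (node->type == NODE_DIKERNEL)
        return node->kind == kind ? 1 : 0;
    size_t total = 0;
    for (size_t c = 0; c < node->n_children; c++)
        total += kir_count_kind(node->children[c], kind);
    return total;
}

/* Number of nested execution scopes on the deepest path below node,
 * counting node itself when it is a scope. */
size_t kir_scope_depth(const KirNode *node) {
    assert(node != NULL);
    if (node->type == NODE_DIKERNEL)
        return 0;
    size_t deepest = 0;
    for (size_t c = 0; c < node->n_children; c++) {
        size_t d = kir_scope_depth(node->children[c]);
        if (d > deepest)
            deepest = d;
    }
    return 1 + deepest;
}

/* The KIR of the matrix-vector loop nest, written out by hand. */
static const KirNode k_t_bb1 = { .type = NODE_DIKERNEL, .label = "<t BB1>", .kind = SCALAR_ASSIGNMENT };
static const KirNode k_t_bb2 = { .type = NODE_DIKERNEL, .label = "<t BB2>", .kind = SCALAR_REDUCTION };
static const KirNode k_y_bb4 = { .type = NODE_DIKERNEL, .label = "<y BB4>", .kind = REGULAR_ASSIGNMENT };
static const KirNode es_for_j = { .type = NODE_SCOPE, .label = "ES_for j",
                                  .children = { &k_t_bb2 }, .n_children = 1 };
static const KirNode es_for_i = { .type = NODE_SCOPE, .label = "ES_for i",
                                  .children = { &k_t_bb1, &es_for_j, &k_y_bb4 }, .n_children = 3 };
static const KirNode kir_root = { .type = NODE_SCOPE, .label = "ROOT EXECUTION SCOPE",
                                  .children = { &es_for_i }, .n_children = 1 };
```

With this layout, `kir_scope_depth(&kir_root)` counts three nested scopes (ROOT, ES_for i, ES_for j), and the single scalar reduction sits inside the innermost one.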

15 Automatic Partitioning driven by the KIR (I)
Annotation: t is a privatizable scalar variable
Source code (Fig. 1):
  1. for (i = 0; i < n; i++) {
  2.   t = 0;
  3.   for (j = 0; j < m; j++) {
  4.     t = t + A[i][j] * x[j];
  5.   }
  6.   y[i] = t;
  7. }
KIR:
  ROOT EXECUTION SCOPE
    ES_for i (Fig. 1, lines 1-7)
      < t BB1 > scalar assignment
      ES_for j (Fig. 1, lines 3-5)
        < t BB2 > scalar reduction
      < y BB4 > regular assignment
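The usual reasoning behind privatization is that a scalar can get one copy per iteration when every iteration writes it before reading it, so no value flows from one iteration to the next. In Fig. 1, t is assigned on line 2 before it is read on lines 4 and 6. The sketch below expresses only that textbook condition; it is not the authors' KIR-based test, and `AccessKind` and `is_privatizable` are hypothetical names.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

typedef enum { ACCESS_READ, ACCESS_WRITE } AccessKind;

/* Hypothetical helper. `accesses` lists the accesses to one scalar in the
 * order they occur within ONE iteration of the candidate loop (assumption:
 * the first access is the same on every iteration). `live_after_loop`
 * says whether the scalar's value is read after the loop finishes. */
bool is_privatizable(const AccessKind *accesses, size_t n, bool live_after_loop) {
    assert(accesses != NULL || n == 0);
    if (n == 0)
        return false;              /* never accessed: nothing to privatize */
    if (live_after_loop)
        return false;              /* would need extra copy-out handling */
    /* Written first: the iteration never reads a value left by another one. */
    return accesses[0] == ACCESS_WRITE;
}
```

For t the per-iteration sequence is write (line 2), read and write (line 4), read (line 6), so the check succeeds provided t is not used after the loop. An accumulator updated as `s = s + y[i]` is read first and fails the check, which is why reductions need different handling.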

16 Automatic Partitioning driven by the KIR (II)
Annotation: spurious dikernel-level dependence
Source code (Fig. 1):
  1. for (i = 0; i < n; i++) {
  2.   t = 0;
  3.   for (j = 0; j < m; j++) {
  4.     t = t + A[i][j] * x[j];
  5.   }
  6.   y[i] = t;
  7. }
KIR:
  ROOT EXECUTION SCOPE
    ES_for i (Fig. 1, lines 1-7)
      < t BB1 > scalar assignment
      ES_for j (Fig. 1, lines 3-5)
        < t BB2 > scalar reduction
      < y BB4 > regular assignment

17 Automatic Partitioning driven by the KIR (III)
Annotation: critical path
Source code (Fig. 1):
  1. for (i = 0; i < n; i++) {
  2.   t = 0;
  3.   for (j = 0; j < m; j++) {
  4.     t = t + A[i][j] * x[j];
  5.   }
  6.   y[i] = t;
  7. }
KIR:
  ROOT EXECUTION SCOPE
    ES_for i (Fig. 1, lines 1-7)
      < t BB1 > scalar assignment
      ES_for j (Fig. 1, lines 3-5)
        < t BB2 > scalar reduction
      < y BB4 > regular assignment

18 Automatic Partitioning driven by the KIR (and IV)
Annotation: critical path
KIR:
  ROOT EXECUTION SCOPE
    ES_for i (Fig. 1, lines 1-7)
      < t BB1 > scalar assignment
      ES_for j (Fig. 1, lines 3-5)
        < t BB2 > scalar reduction
      < y BB4 > regular assignment
Generated OpenMP code:
  #pragma omp parallel for \
      shared(A,x,y) private(t,i,j)
  for (i = 0; i < n; i++) {
    t = 0;
    for (j = 0; j < m; j++) {
      t = t + A[i][j] * x[j];
    }
    y[i] = t;
  }
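To try the parallelized loop from this slide, it can be wrapped in a function and compared against a serial version. The sketch below makes a few assumptions that are not on the slide: the function names are hypothetical, the element type is double, and A is stored as a flat row-major array so that the functions work for any n and m. When compiled without OpenMP support the pragma is ignored and both functions run serially.

```c
#include <assert.h>
#include <stddef.h>

/* Serial reference: y = A * x, with A stored row-major (n rows, m columns). */
void matvec_serial(int n, int m, const double *A, const double *x, double *y) {
    assert(n >= 0 && m >= 0);
    for (int i = 0; i < n; i++) {
        double t = 0.0;
        for (int j = 0; j < m; j++)
            t = t + A[(size_t)i * m + j] * x[j];
        y[i] = t;
    }
}

/* Parallel version following the slide: iterations of the outer loop are
 * distributed among threads, and each thread gets its own t, i and j. */
void matvec_parallel(int n, int m, const double *A, const double *x, double *y) {
    int i, j;
    double t;
    assert(n >= 0 && m >= 0);
    #pragma omp parallel for shared(A, x, y) private(t, i, j)
    for (i = 0; i < n; i++) {
        t = 0.0;                              /* written before any read: safe to privatize */
        for (j = 0; j < m; j++)
            t = t + A[(size_t)i * m + j] * x[j];
        y[i] = t;                             /* each iteration writes a distinct y[i] */
    }
}
```

Because every row sum is accumulated in the same order in both versions, the two functions should produce bitwise-identical results, which makes the serial version a convenient check. With GCC, OpenMP is enabled with the `-fopenmp` flag.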

21 Experimental Evaluation
Built on top of GCC.
Case study: EQUAKE from SPEC CPU2000 on 2 Intel Xeon E5520 quad-core processors.
The Intel compiler is unable to parallelize this case study properly, while our approach reduces the execution time.
[Figure: two charts, execution time (s) and speedup, comparing KIR/ICC with ICC for workloads WL x 1, WL x 2 and WL x 3; legend entries: Remaining, Overhead, Irregular.]
More results in J.M. Andión et al. A Novel Compiler Support for Automatic Parallelization on Multicore Systems. Parallel Computing, 39(9).

24 Conclusions
1. The KIR: a dikernel-based IR
     dikernels
     dikernel-level dependences
     execution scopes
2. Automatic Partitioning Technique
     coarse-grain parallelism
     global OpenMP parallelization strategy

25 Future Work
  Locality exploitation techniques
  Fine-grain parallelism
  Many-core architectures such as GPUs
J.M. Andión et al. Locality-Aware Automatic Parallelization for GPGPU with OpenHMPP Directives. HLPP 2014 & International Journal of Parallel Programming (to appear).
