Lecture 13: March 25

CISC 879 Software Support for Multicore Architectures, Spring 2007
Lecturer: John Cavazos
Scribe: Ying Yu

13.1 Bryan Youse - Optimization of Sparse Matrix-Vector Multiplication on Emerging Multicore Platforms

13.1.1 Basic Idea

Matrices are either sparse or dense; sparsity is typically expressed as the number of nonzero entries per row. Because single-core performance is improving only slowly, the paper examines sparse matrix-vector multiplication (SpMV), one of the most heavily used kernels in scientific computing, across a broad spectrum of multicore designs.

13.1.2 SpMV Overview

Disadvantages:
1) Higher instruction and storage overheads per flop.
2) Indirect and irregular memory access patterns.

Improvement: Select a compact data structure and code transformations that best exploit the properties of both the sparse matrix, which may be known only at run time, and the underlying machine architecture.

Data Structure: The most common data structure used to store a sparse matrix for SpMV-heavy computations is the compressed sparse row (CSR) format:

    // Basic SpMV implementation,
    // y <- y + A*x, where A is in CSR.
    // ptr[i]..ptr[i+1]-1 index the nonzeros of row i;
    // val[k] is a nonzero value and ind[k] its column.
    for (i = 0; i < m; ++i) {
        double y0 = y[i];                 // accumulate row i in a register
        for (k = ptr[i]; k < ptr[i+1]; ++k)
            y0 += val[k] * x[ind[k]];     // indirect access into x
        y[i] = y0;
    }

13.1.3 SpMV Optimizations

Goal: As much autotuning as possible. The optimizations fall into three categories: low-level code optimizations, data structure optimizations, and parallelization optimizations.
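To make the CSR layout from Section 13.1.2 concrete, the following is a minimal sketch that wraps the kernel from the notes in a function and stores a small 3 x 3 matrix in the three CSR arrays. The function name `spmv_csr` and the example arrays `ex_ptr`, `ex_ind`, and `ex_val` are hypothetical names chosen for illustration; only `ptr`, `ind`, `val`, `x`, and `y` come from the notes.

```c
#include <stddef.h>

/* y <- y + A*x for an m-row matrix A in CSR form (same loop as the notes). */
void spmv_csr(int m, const int *ptr, const int *ind, const double *val,
              const double *x, double *y)
{
    for (int i = 0; i < m; ++i) {
        double y0 = y[i];
        for (int k = ptr[i]; k < ptr[i + 1]; ++k)
            y0 += val[k] * x[ind[k]];
        y[i] = y0;
    }
}

/* Illustrative matrix (an assumption, not from the paper):
 *
 *     [ 1 0 2 ]
 *     [ 0 3 0 ]
 *     [ 4 0 5 ]
 *
 * ex_val holds the nonzeros row by row, ex_ind their column numbers,
 * and ex_ptr[i] the position in ex_val where row i starts
 * (ex_ptr[3] is the total number of nonzeros). */
const int    ex_ptr[] = {0, 2, 3, 5};
const int    ex_ind[] = {0, 2, 1, 0, 2};
const double ex_val[] = {1.0, 2.0, 3.0, 4.0, 5.0};
```

Calling `spmv_csr(3, ex_ptr, ex_ind, ex_val, x, y)` with `x` set to all ones and `y` set to zero leaves the row sums of the matrix in `y`. Note that each nonzero costs one value load, one index load, and one indirect load from `x` for only two flops, which is the per-flop overhead listed under the disadvantages.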

Optimizations:

1) Thread Blocking: The first phase in the optimization process is exploiting thread-level parallelism. There are three approaches to partitioning the matrix: by row blocks, by column blocks, and into segments. The paper exploits only row partitioning: the matrix is partitioned into NThreads thread blocks, which may in turn be individually optimized. In both row and column parallelization, the matrix is explicitly blocked to exploit NUMA systems.

2) Cache and Local Store Blocking: For sufficiently large matrices, the number of cache lines available for blocking is quantified first, and each cache block then spans enough columns that the number of source vector cache lines touched equals the number available for cache blocking. This approach lets every cache block touch the same number of cache lines, even though the blocks span vastly different numbers of columns.

3) TLB Blocking: TLB misses can vary by an order of magnitude depending on the blocking strategy, which highlights the importance of TLB blocking.

4) Register Blocking and Format Selection: The next phase of the implementation is to optimize the SpMV data format. Register blocking groups adjacent nonzeros into rectangular tiles, with only one coordinate index per tile. Key point: for memory-bound multicore applications, the authors believe that minimizing the memory footprint is more effective than improving single-thread performance.

5) Index Size Selection: Use 16-bit integer indices to reduce memory traffic.

6) Architecture-Specific Kernels: Selected through auto-tuning.

7) SIMDization: The xlc static timing analyzer provides information on whether cycles are spent in instruction issue, double-precision issue stalls, or stalls for data hazards, which simplifies the process of kernel optimization.

8) Loop Optimizations: CSR data storage means that the column and value arrays are accessed in a streaming (unit-stride) fashion. The code can be explicitly software pipelined to hide any further instruction latency. It can be further optimized using a branchless implementation, which is in effect a segmented scan with a vector length of one.

13.1.4 Test Results

1) Evaluated Sparse Matrices:
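Register blocking (item 4 of the optimization list) stores one index per tile but pads partially filled tiles with explicit zeros, so its benefit depends on how well the nonzeros of a given matrix fill the tiles. As a rough sketch of that trade-off, the C functions below count the aligned r x c tiles touched by a CSR matrix and report the fill ratio (stored entries per true nonzero). The names `count_tiles` and `fill_ratio`, and the choice of tiles aligned to multiples of r and c, are assumptions made for illustration and are not taken from the paper.

```c
#include <stdlib.h>

/* Count the r x c tiles that contain at least one nonzero of an
 * m x n CSR matrix (ptr, ind). Tiles are aligned to multiples of r and c.
 * Returns -1 if the scratch array cannot be allocated. */
long count_tiles(int m, int n, const int *ptr, const int *ind, int r, int c)
{
    int nbc = (n + c - 1) / c;            /* number of block columns */
    int *seen = malloc((size_t)nbc * sizeof *seen);
    long tiles = 0;

    if (seen == NULL)
        return -1;
    for (int j = 0; j < nbc; ++j)
        seen[j] = -1;                     /* no block row has touched it yet */

    for (int i = 0; i < m; ++i) {
        int brow = i / r;                 /* block row containing row i */
        for (int k = ptr[i]; k < ptr[i + 1]; ++k) {
            int bcol = ind[k] / c;        /* block column of this nonzero */
            if (seen[bcol] != brow) {     /* first nonzero in this tile */
                seen[bcol] = brow;
                ++tiles;
            }
        }
    }
    free(seen);
    return tiles;
}

/* Fill ratio: entries stored after blocking (including explicit zeros)
 * divided by the number of true nonzeros. 1.0 means no padding. */
double fill_ratio(int m, int n, const int *ptr, const int *ind, int r, int c)
{
    long tiles = count_tiles(m, n, ptr, ind, r, c);
    long nnz = ptr[m];

    if (tiles < 0 || nnz == 0)
        return 0.0;
    return (double)tiles * r * c / (double)nnz;
}
```

For a dense matrix every tile is full, so the ratio is 1.0 for any tile shape that divides the dimensions, which is consistent with the remark in the results that the dense case adds no zeros. A diagonal matrix with 2 x 2 tiles gives a ratio of 2.0, meaning half of the stored values would be padding.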

The dense matrix provides the performance upper bound, because SpMV is limited by memory throughput. The dense case supports arbitrary register blocks (no added zeros), and its loops are long running, which shifts time toward CPU work relative to memory fetch time.

2) Peak Effective Bandwidth Results: The systems achieve a wide range of fractions of the available memory bandwidth; however, only the full version of the Cell (8 SPEs) comes close to fully saturating the socket bandwidth, utilizing an impressive 96% of the theoretical potential. Outside of this project, we typically expect only 10% of peak performance.

3) Effective SpMV Performance on Multicore Platforms:

Results show that, as expected, single-thread results are extremely poor, achieving only 75 Mflop/s for the median matrix in the naïve case, with about 10% speedup from the suite of optimizations (86 Mflop/s).

4) Comparison of Median Matrix Results: The results compare the optimized performance of the paper's SpMV implementation with that of OSKI, using single-core, fully packed single-socket, and full-system configurations. They clearly indicate that the Cell blade significantly outperforms all other platforms in the study, achieving speedups of 3.3x, 4.1x, and 2.2x over the AMD X2, Clovertown, and Niagara2 respectively, despite its poor double-precision performance and sub-optimal register blocking implementation.

13.2 Dimitrij Krepis - POSH: A TLS Compiler that Exploits Program Structure

13.2.1 Introduction to Thread-Level Speculation (TLS)

How the code is broken into speculative tasks, and when those tasks are spawned, have a crucial impact on the performance of the resulting TLS system.

TLS enables the compiler to create parallel threads despite the existence of ambiguous data dependences. The speedup of TLS comes from two effects: task parallelism and data prefetching. Tasks 1 and 2 (Chart (a)) can benefit from both parallelism and prefetching. TLS benefits from parallelism when the two tasks run concurrently (Chart (b)). TLS benefits from data prefetching when a task suffers a cache miss on datum A and is then squashed, and a later task that will not be squashed obtains A from the cache. Figure 3(c) illustrates this effect when the task that benefits from prefetching is the one that was squashed.
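The "ambiguous data dependence" mentioned in the introduction can be illustrated with a loop whose memory accesses go through index arrays, so that the compiler cannot tell statically whether one iteration reads what an earlier iteration wrote. The sketch below is purely illustrative: the functions `update` and `has_cross_iteration_raw` and the arrays `rd` and `wr` are hypothetical and do not come from the POSH paper. The second function is an offline check that shows what "a dependence actually occurred" means; in a TLS system the hardware detects such violations while the tasks run.

```c
#include <stdbool.h>

/* A loop with ambiguous cross-iteration dependences: iteration i reads
 * a[rd[i]] and writes a[wr[i]]. Whether iterations are independent
 * depends on the run-time contents of rd and wr. */
void update(double *a, const int *rd, const int *wr, int n)
{
    for (int i = 0; i < n; ++i)
        a[wr[i]] = a[rd[i]] + 1.0;
}

/* Returns true if some iteration reads a location written by an earlier
 * iteration (a read-after-write dependence across iterations). If this
 * holds and the iterations ran as speculative tasks, the reading task
 * could consume a stale value and would have to be squashed. */
bool has_cross_iteration_raw(const int *rd, const int *wr, int n)
{
    for (int i = 1; i < n; ++i)
        for (int j = 0; j < i; ++j)
            if (rd[i] == wr[j])
                return true;
    return false;
}
```

When the index arrays never overlap in this way, every iteration could run as a speculative task and commit without a squash; when they do overlap, only the conflicting tasks need to be re-executed, which is the case TLS is designed to handle.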

13.2.2 POSH

(1) POSH: A new, fully automated TLS compiler infrastructure developed by the paper's authors. POSH adds several TLS passes to gcc-3.5. Its design rests on two main decisions. The first is to partition the code into tasks. The second is to add a simple profiling pass that takes into account both the parallelism and the data prefetching effects provided by the speculative tasks.

(2) Framework: The POSH framework is composed of two closely tied parts: a compiler and a profiler.

(3) Hardware Assumptions: shared-memory CMP; no register transfer between tasks; write-through on registers; all live-ins passed via memory; detection of data dependence violations; spawn and commit instructions.

(4) Compiler Phases: Task Selection, Spawn Hoisting, and Task Refinement.

Task Selection: Identifies tasks (subroutines and subroutine continuations; loop iterations and loop continuations). For each task, it identifies the beginning and end and inserts commits before every task.

Value Prediction: Reduces data dependence violations. It is applied to function return values and loop induction variables.

Spawn Hoisting: Inserts SPAWN instructions at the beginning of all tasks (the spawn points), then hoists them as early as possible, which improves parallelism and prefetching. Restrictions: no spawning before the definition of a variable the task uses, except under value prediction; control-flow restrictions; spawning in reverse order.

Task Refinement: Makes the final decisions on which tasks will make it into the final binary. The refinement phase includes the Parallelism, Small Tasks, Register Dependence, and Profiling steps.

(5) Profiler: Runs the train input set in a sequential execution. It models a simple cache to estimate cache misses and takes about 5 minutes of execution on a desktop. It assigns a time to each instruction, rewinding the time to that of the spawn point plus an overhead time. On a load, it looks up the address in a store table: if the load time is earlier than the store time, a dependence violation is recorded.
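The profiler's dependence check described in item (5) can be sketched as a table that remembers the latest store time per address and is consulted on every load. The code below is a simplified illustration under stated assumptions: the fixed-size table, the integer "addresses", and the names `StoreTable`, `task_start_time`, `on_store`, and `on_load` are all hypothetical, and the real profiler's data structures are not described in the notes.

```c
#include <stdbool.h>

#define NADDR 64   /* hypothetical: a tiny address space for the sketch */

/* Latest simulated store time per address; -1 means "never stored". */
typedef struct {
    long last_store[NADDR];
} StoreTable;

void store_table_init(StoreTable *t)
{
    for (int a = 0; a < NADDR; ++a)
        t->last_store[a] = -1;
}

/* A task's clock is rewound to the time of its spawn point plus a
 * fixed spawn overhead, modelling that it would have started early. */
long task_start_time(long spawn_time, long overhead)
{
    return spawn_time + overhead;
}

/* Record a store executed at the given simulated time. */
void on_store(StoreTable *t, int addr, long time)
{
    if (time > t->last_store[addr])
        t->last_store[addr] = time;
}

/* A load violates a dependence if its simulated time is earlier than
 * the time of the store that produced the value it needs. */
bool on_load(const StoreTable *t, int addr, long time)
{
    return time < t->last_store[addr];
}
```

For example, if a predecessor task stores to an address at time 100 and a successor task spawned at time 40 (overhead 10) loads that address 10 time units after it starts, the load is stamped 60, which is earlier than 100, so the profiler would count a violation for that task.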

(6) Experiment: Since there is no hardware platform that supports TLS, POSH targets SESC, a cycle-accurate execution-driven simulator. The simulated architecture is a four-processor CMP with TLS support. Each processor is a 3-issue core and has a private L1 cache that buffers the speculative data. The L1 caches are connected through a crossbar to an on-chip shared L2 cache.

(7) Evaluation: Several issues are examined: task selection, static and dynamic task characteristics, memory behavior, prefetching, and the effectiveness of the profiler and value prediction.

Impact of Task Selection: Three configurations are compared: only subroutines and subroutine continuations (Subr); only loop iterations and loop continuations (Loop); and all such tasks (Subr+Loop). The best speedups are obtained when both types of tasks are selected (Subr+Loop).

Contribution of Prefetching to TLS Speedup: The prefetching effect of squashed tasks contributes to the speedup of TLS execution. The difference between the two bars is the effect of data prefetching induced by TLS. The application that benefits the most is gap.

Effectiveness of the Profiler: The TLS code generated by POSH is compared with and without the profiling pass. The figure shows the speedups of such codes over the sequential execution. Without the profiler, the TLS execution obtains a minor average speedup of 1.04. With the profiling pass, the average speedup is 1.30.

Effectiveness of Value Prediction: Figure 12 shows the speedup of TLS with and without value prediction over the sequential execution. On average, the applications run about 7% slower if POSH does not use value prediction. Consequently, the authors recommend its use.
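The notes say that value prediction targets function return values and loop induction variables but do not say how the predictions are formed. As one plausible illustration for induction variables, the sketch below assumes a simple stride predictor: the next value is guessed as the last value plus the last observed stride, and the guess is checked once the real value is known. The type `StridePredictor` and the functions `predict_next` and `validate_and_update` are hypothetical and should not be read as POSH's actual mechanism.

```c
#include <stdbool.h>

/* Assumed stride predictor for a loop induction variable. */
typedef struct {
    long last;     /* last value actually observed */
    long stride;   /* difference between the last two observed values */
} StridePredictor;

/* Value handed to the speculative task at spawn time. */
long predict_next(const StridePredictor *p)
{
    return p->last + p->stride;
}

/* Called when the real value becomes known. Returns true if the
 * prediction was correct (the task may keep its work); false means the
 * task used a wrong value and would be squashed. The predictor is then
 * trained on the observed value. */
bool validate_and_update(StridePredictor *p, long actual)
{
    bool correct = (actual == p->last + p->stride);
    p->stride = actual - p->last;
    p->last = actual;
    return correct;
}
```

With a regular loop such as `for (i = 0; i < n; i += 4)`, the predictor is right on every iteration after it has seen the stride, so the next iteration can be spawned before the increment executes, which is how prediction removes a dependence that would otherwise serialize the tasks.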
