Performance Impact of Resource Conflicts on Chip Multi-processor Servers


Myungho Lee, Yeonseung Ryu, Sugwon Hong, and Chungki Lee
Department of Computer Software, MyongJi University, Yong-In, Gyung Gi Do, Korea 449-728
{myunghol, ysryu, swhong, cklee}@mju.ac.kr
http://www.mju.ac.kr/myunghol

Abstract. Chip Multi-Processors (CMPs) are becoming mainstream microprocessors for High Performance Computing as well as for commercial business applications. The multiple CPU cores on a CMP allow multiple software threads to execute on the same chip at the same time, so CMPs promise to deliver a higher capacity of computation per chip in a given time interval. However, resource sharing among the threads executing on the same chip can cause conflicts and lead to performance degradation. Thus, in order to obtain high performance and scalability on CMP servers, it is crucial to first understand the performance impact that resource conflicts have on the target applications. In this paper, we evaluate the performance impact of resource conflicts on an example high-end CMP server, the Sun Fire E25K, using a standard OpenMP benchmark suite, SPEC OMPL.

1 Introduction

Recently, microprocessor designers have been considering many design choices for efficiently utilizing the chip area made available by ever increasing transistor density. Instead of employing a complicated processor pipeline on a chip with an emphasis on improving single-thread performance, incorporating multiple processor cores on a single chip (a Chip Multi-Processor) has become the mainstream design trend. A Chip Multi-Processor (CMP) can execute multiple software threads on a single chip at the same time, and thus provides a larger capacity of computation per chip in a given time interval (i.e., higher throughput). Examples include the Dual-Core Intel Xeon [3], the AMD Opteron [1], the UltraSPARC IV, IV+, and T1 microprocessors from Sun Microsystems [12], [14], and the IBM POWER5 [5], among others. Shared-Memory Multiprocessor (SMP) servers based on CMPs have already been introduced in the market, e.g., the Sun Fire E25K [12] from Sun Microsystems, based on dual-core UltraSPARC IV processors. They are being rapidly adopted for High Performance Computing (HPC) applications as well as for commercial business applications.

Although CMP servers promise to deliver higher chip-level throughput than servers based on traditional single-core processors, resources on a CMP, such as the cache(s), the cache/memory buses, and functional units, are shared among the cores on the same processor chip.

Software threads running on the cores of the same processor chip compete for the shared resources, which can cause conflicts and hurt performance. Exploiting the full performance potential of CMP servers is therefore a challenging task.

In this paper, we evaluate the performance impact of resource conflicts among the processor cores of CMPs on a high-end SMP server, the Sun Fire E25K. For our performance evaluation, we use HPC applications parallelized for SMP using the OpenMP standard [9]: the SPEC OMPL benchmark suite [11]. Using the Sun Studio 10 compiler suite [13], we generate highly optimized executables for the SPEC OMPL programs and run them on the E25K server. In order to evaluate the impact of conflicts on the shared resources, namely the level-2 cache bus and the main memory bus, 64-thread (and 32-thread) runs were conducted both using both cores of 32 CMPs (16 CMPs for the 32-thread run) and using only one core of 64 CMPs (32 CMPs for the 32-thread run). The experimental results show 17-18% average slowdowns (geometric mean over the nine benchmark programs) for the runs with resource conflicts relative to the runs without them. Benchmarks which intensively utilize the memory bandwidth or allocate large amounts of memory suffer more from the resource conflicts.

The rest of the paper is organized as follows: Section 2 describes the architecture of an example CMP server, the Sun Fire E25K. Section 3 describes the OpenMP programming model and our test benchmark suite, SPEC OMPL, and how we generate optimized executables for it. Section 4 first shows the settings for utilizing the Solaris 10 Operating System features useful for achieving high performance on SPEC OMPL, and then presents the experimental results on the E25K. Section 5 wraps up the paper with conclusions.

2 Chip Multi-Processor Server

In this section, we describe the architecture of the example high-end CMP server used for our performance experiments. The Sun Fire E25K server is the first-generation throughput computing server from Sun Microsystems, aiming to dramatically increase application throughput by employing dual-core CMPs. The server is based on the dual-core UltraSPARC IV processor and can scale up to 72 processors executing 144 threads (two threads per UltraSPARC IV processor) simultaneously. The system offers up to twice the compute power of high-end systems based on the UltraSPARC III Cu, the predecessor of the UltraSPARC IV.

The UltraSPARC IV contains two enhanced UltraSPARC III Cu cores (or Thread Execution Engines, TEEs), a memory controller, and the necessary cache tags for 8 MB of external L2 cache per core (see Fig. 1). The off-chip L2 cache is 16 MB in size (8 MB per core). The two cores share the Fireplane System Interconnect as well as the L2 cache bus, so these shared paths are potential sources of performance bottlenecks.

Fig. 1. UltraSPARC IV processor

The basic computational component of the Sun Fire E25K server is the UniBoard [12]. Each UniBoard consists of up to four UltraSPARC IV processors, their L2 caches, and associated main memory. The Sun Fire E25K can contain up to 18 UniBoards, and thus at maximum 72 UltraSPARC IV processors. To maintain system-wide cache coherency, a snoopy cache coherency protocol is used within a UniBoard and a directory-based cache coherency protocol is used among different UniBoards. The memory latency, measured using the lat_mem_rd() routine of lmbench, is 240 ns to memory on the same UniBoard and 455 ns to memory on a different UniBoard (remote memory).
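For illustration, the following is a minimal pointer-chasing sketch in the spirit of lmbench's lat_mem_rd(); it is not the lmbench code, and the buffer size, stride, and iteration count are arbitrary placeholder choices. Because every load depends on the previous one, the time per iteration approximates the access latency of wherever the buffer's pages reside (local or remote memory), rather than the bandwidth.

```c
/* Minimal pointer-chasing latency sketch (illustrative, not lmbench). */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define NODES (64 * 1024 * 1024 / sizeof(void *)) /* ~64 MB, larger than L2 */
#define ITERS 10000000L

int main(void)
{
    void **ring = malloc(NODES * sizeof(void *));
    if (!ring) return 1;

    /* Link the nodes into one cycle with a large prime stride so that
       caches and the hardware prefetcher are largely defeated. */
    size_t stride = 4099, cur = 0;
    for (size_t i = 0; i < NODES; i++) {
        size_t next = (cur + stride) % NODES;
        ring[cur] = &ring[next];
        cur = next;
    }

    void **p = &ring[0];
    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (long n = 0; n < ITERS; n++)
        p = (void **)*p;                 /* serialized dependent loads */
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double ns = (t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec);
    printf("%.1f ns/load (%p)\n", ns / ITERS, (void *)p); /* use p so the
                                                loop is not optimized away */
    free(ring);
    return 0;
}
```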

3 SPEC OMPL Benchmarks

SPEC OMPL is a standard benchmark suite for evaluating the performance of OpenMP applications. It consists of application programs written in C and Fortran and parallelized using the OpenMP API [11].

The underlying execution model for OpenMP programs is fork-join (see Fig. 2) [9]. A master thread executes sequentially until a parallel region of code is encountered. At that point, the master thread forks a team of worker threads, and all threads participate in executing the parallel region concurrently. At the end of the parallel region (the join point), the team of worker threads and the master synchronize; the master thread then continues sequential execution alone. OpenMP parallelization incurs overhead costs that do not exist in sequential programs: creating threads, synchronizing threads, accessing shared data, allocating copies of private data, bookkeeping of thread-related information, and so on.

Fig. 2. OpenMP execution model
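A minimal OpenMP example in C of the fork-join model just described (illustrative code, not taken from the benchmarks):

```c
/* Fork-join: the master runs alone, forks a team for the parallel
 * region, and resumes alone after the implicit join. */
#include <stdio.h>
#include <omp.h>

int main(void)
{
    printf("master: sequential part\n");          /* master only */

    #pragma omp parallel                          /* fork: team starts */
    {
        int id = omp_get_thread_num();
        int n  = omp_get_num_threads();
        printf("thread %d of %d in parallel region\n", id, n);
    }                                             /* implicit join + barrier */

    printf("master: sequential part resumes\n");  /* master only again */
    return 0;
}
```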

The SPEC OMPL benchmark suite consists of nine application programs representative of HPC applications from the areas of chemistry, mechanical engineering, climate modeling, and physics. Each benchmark requires up to 6.4 GB of memory when running on a single processor, so the benchmarks target large-scale systems with a 64-bit address space. Table 1 lists the benchmarks and their application areas.

Table 1. SPEC OMPL Benchmarks

Using the Sun Studio 10 compiler suite [13], we have generated executables for the benchmarks in the SPEC OMPL suite. By using combinations of the compiler options provided by Sun Studio 10, a fairly high level of compiler optimization is applied to the benchmarks. The commonly used compiler flags are -fast -openmp -xipo=2 -autopar -xprofile -xarch=v9a; further optimization flags are also applied to individual benchmarks. These options enable many common and advanced optimizations, such as scalar optimizations, loop transformations, data prefetching, memory hierarchy optimizations, interprocedural optimizations, and profile feedback optimizations, among others. (Please see [13] for more details on the compiler options.) The -openmp option processes OpenMP directives and generates parallel code for execution on multiprocessors. The -autopar option adds automatic parallelization by the compiler beyond the user-specified parallelization, which can further improve performance.
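As a hypothetical illustration of that distinction (this code is not from SPEC OMPL): the first loop below is parallelized by an explicit user directive, while the second carries no directive but has no loop-carried dependences, so a compiler invoked with -autopar may parallelize it automatically as well.

```c
/* User-specified vs. automatic parallelization (illustrative sketch). */
void scale_and_shift(int n, double *a, const double *b, const double *c)
{
    /* User-specified parallelization via an OpenMP directive. */
    #pragma omp parallel for
    for (int i = 0; i < n; i++)
        a[i] = 2.0 * b[i];

    /* No directive: iterations are independent, so this loop is a
       candidate for compiler auto-parallelization (-autopar). */
    for (int i = 0; i < n; i++)
        a[i] = a[i] + c[i];
}
```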

4 Performance Results

Using the compiler options described in Section 3, we have generated highly optimized executables for SPEC OMPL. In this section, we first describe the system environment in which the optimized executables are executed. We then show the performance results, i.e., the impact of resource conflicts, on the Sun Fire E25K. We also show one example compiler technique which can reduce the impact of resource conflicts, along with experimental results.

4.1 System Environments

The Solaris 10 Operating System provides features which help improve the performance of OpenMP applications: Memory Placement Optimization (MPO) and Multiple Page Size Support (MPSS). MPO can improve the performance of programs with intensive data accesses to localized regions of memory. With the default MPO policy, called first-touch, memory accesses can be kept on the local board most of the time, whereas without MPO those accesses would be distributed over all the boards (both local and remote), which can become very expensive. MPSS can improve the performance of programs which use a large amount of memory: with large pages (supported by MPSS), the number of TLB entries needed by the program, and hence the number of TLB misses, can be significantly reduced, which can significantly improve performance [10]. We enable both MPO and MPSS for our runs of the SPEC OMPL executables.

OpenMP threads can be bound to processors using the environment variable SUNW_MP_PROCBIND, which is supported by the thread library in Solaris 10. Processor binding, when used along with static scheduling, benefits applications that exhibit a certain data reuse pattern, where the data accessed by a thread in a parallel region is either still in the local cache from a previous invocation of the parallel region or in local memory due to the OS's first-touch memory allocation policy.
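To illustrate how first-touch placement interacts with static scheduling and processor binding, here is a hypothetical sketch (not code from the benchmarks): initializing the data in a parallel loop with the same static schedule as the compute loop makes each thread first touch, and therefore locally place, the pages it will later work on.

```c
/* First-touch placement sketch. Assumes threads are bound to
 * processors (e.g., via SUNW_MP_PROCBIND) and compiled with an
 * OpenMP flag such as -xopenmp (Sun Studio) or -fopenmp. */
#define N (1 << 22)
static double a[N], b[N];

void init_and_compute(void)
{
    /* First touch: each thread touches the pages of its own index
       range, so those pages are allocated in its local board memory. */
    #pragma omp parallel for schedule(static)
    for (int i = 0; i < N; i++) {
        a[i] = 0.0;
        b[i] = 1.0;
    }

    /* Same static schedule: each thread revisits the same index
       range, hence works on locally placed pages. */
    #pragma omp parallel for schedule(static)
    for (int i = 0; i < N; i++)
        a[i] += 2.0 * b[i];
}
```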

4.2 Impact of Resource Conflicts on CMP

As mentioned in Section 2, the two cores on one UltraSPARC IV CMP share the L2 cache bus and the memory bus, which are potential sources of performance bottlenecks. In order to measure the performance impact of these resource conflicts on SPEC OMPL, we have measured the performance of 64-thread (and 32-thread) runs in two ways:

1. Using 64 (32) UltraSPARC IV processors, thus using only one core per processor.
2. Using 32 (16) UltraSPARC IV processors, thus using both cores of each processor. In this case, there are possible resource conflicts between the two cores.

Table 2. 64-thread cases: 64x1 vs. 32x2

Table 3. 32-thread cases: 32x1 vs. 16x2

Tables 2 and 3 show the run times of configurations 1 and 2 using 64 threads and 32 threads, respectively, measured on a Sun Fire E25K with 1050 MHz UltraSPARC IV processors; they also show the speed-ups of 1 over 2. Overall, configuration 1 performs 1.18x (1.17x) better than configuration 2 in the 64-thread run (32-thread run).
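For reference, the average slowdown quoted here is the geometric mean over the nine benchmarks of the per-benchmark run-time ratios (conflicting over conflict-free configuration). The small sketch below shows that calculation with placeholder run times, not the measured values from the tables.

```c
/* Geometric-mean slowdown over nine benchmarks (placeholder data). */
#include <math.h>
#include <stdio.h>

int main(void)
{
    /* 64x1 (one core per chip) and 32x2 (both cores) run times in
       seconds; values are illustrative placeholders only. */
    double t_64x1[9] = { 100, 200, 150, 300, 250, 180, 220, 170, 130 };
    double t_32x2[9] = { 118, 236, 177, 354, 295, 212, 260, 201, 153 };

    double log_sum = 0.0;
    for (int i = 0; i < 9; i++)
        log_sum += log(t_32x2[i] / t_64x1[i]);

    double geomean = exp(log_sum / 9.0);  /* e.g., 1.18 => 18% slowdown */
    printf("geometric-mean slowdown: %.2fx\n", geomean);
    return 0;
}
```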

The benchmarks with the greatest performance gains from configuration 1 show the following characteristics:

313.swim_l: This is a memory bandwidth-intensive benchmark. For example, there are 14 common arrays accessed throughout the program. All the arrays are of the same size (7702 x 7702), and each array element is 8 bytes long, so the total array size is 6,342 MB. The arrays are seldom reused within the same loop iteration, and the accesses stride through the arrays continuously. When only one core is used per processor, it can fully utilize the L2 cache and the main memory bandwidth available on the processor chip, whereas when two cores are used, the bandwidth is effectively halved between them. This leads to a 1.39x gain of configuration 1 over configuration 2. Unless aggressive compiler optimizations are performed to increase data reuse, the benchmark suffers from the continuous feeding of data to the processor cores, which consumes all the available memory bandwidth.

315.mgrid_l: This benchmark, like 313.swim_l, requires high memory bandwidth. Although it shows some data reuse (group reuse) of the intensively accessed three-dimensional arrays, the same data is reused at most three times, so the accesses effectively stride through the arrays. Using only one core provides much higher memory bandwidth, as in the case of 313.swim_l, which leads to a 1.20x gain.

325.apsi_l and 331.art_l: These benchmarks allocate a large amount of memory per thread at run time. For example, 325.apsi_l allocates a 6,771 MB array at run time, besides many other smaller arrays. The dynamic memory allocation can be parallelized, but it still requires a large memory space per processor core. Thus, instead of allowing 8 threads to allocate large memory on the same UniBoard's memory, allowing only 4 threads, by using only one core per UltraSPARC IV, yields a significant performance benefit. 331.art_l shows similar characteristics.

327.gafort_l: In this benchmark, the two hottest subroutines have critical sections inside their main loops, and both suffer from the intensive memory loads and stores generated by the critical-section loops; these take up a large portion of the total run time. Placing two threads on one UltraSPARC IV (by using both cores) can reduce the overhead of the locks and unlocks. However, spreading 8 threads over two different UniBoards (by using only one core in each UltraSPARC IV) significantly reduces the pressure on the memory bandwidth compared with allocating all 8 threads on the same UniBoard, and the benefit of the latter dominates that of the former.

The other benchmarks (311.wupwise_l, 317.applu_l, 321.equake_l, 329.fma3d_l) put relatively less pressure on the memory bandwidth and/or consume smaller amounts of memory, so the performance gap between configurations 1 and 2 is smaller. These benchmarks are not heavily affected by the resource conflicts and are more suitable for execution on CMP servers.
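To make the streaming access pattern described for 313.swim_l concrete, consider the hypothetical stencil sweep below (the array names and update formula are illustrative; only the 7702 x 7702 size comes from the text). Every element of the large arrays is touched once per sweep with no temporal reuse, so the sweep is bounded by the memory bandwidth that the two cores of a CMP must share.

```c
/* Bandwidth-bound streaming sweep (illustrative, not benchmark code). */
#define N 7702

void sweep(double u[N][N], double v[N][N], double p[N][N])
{
    #pragma omp parallel for schedule(static)
    for (int j = 1; j < N - 1; j++)
        for (int i = 1; i < N - 1; i++)
            /* Each element of u, v, p is used once per sweep: pure
               streaming traffic from main memory, no cache reuse. */
            p[j][i] = 0.25 * (u[j][i+1] - u[j][i-1]
                            + v[j+1][i] - v[j-1][i]);
}
```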

In order to show the performance impact of the resource conflicts from a different perspective, we have calculated the speed-ups from the 32-thread runs to the 64-thread runs in two ways:

1. Scalability from the 32x1 run to the 64x1 run, i.e., when only one core per processor is used.
2. Scalability from the 32x1 run to the 32x2 run, i.e., when the 64-thread run is performed with resource conflicts.

Fig. 3. Scalabilities from 32-thread runs to 64-thread runs

Fig. 3 shows the scalabilities in both cases. For the benchmarks which are affected more by the resource conflicts, the two scalability bars show bigger gaps.

4.3 Algorithmic/Compiler Techniques to Reduce Resource Conflicts on CMP

For benchmarks which suffer heavily from resource conflicts, algorithmic and/or compiler techniques are needed to reduce the penalties. For example, aggressive procedure inlining and the skewed tiling technique [7] can be used for 313.swim_l. Skewed tiling, when applied to 313.swim_l, converts a major portion of the memory accesses into cache accesses by increasing data reuse; it can thus significantly cut down the traffic to main memory and yield a large performance gain.

Using the compiler flags -Qoption iropt -Atile:skewp provided by the Sun Studio 10 Fortran compiler, we have generated a new executable for 313.swim_l. We have run both the original and the new executables on a smaller Sun SMP server (a Sun Fire E2900 employing 12 UltraSPARC IV processors), using both cores of each UltraSPARC IV. For these runs we reduced the array sizes to one quarter of the originals (the fourteen two-dimensional arrays of size 7702 x 7702 in 313.swim_l were reduced to 3802 x 3802) and reduced the number of loop iterations from 2400 to 1200. We then conducted the following two runs:

1. Using 8 threads, the original executable runs in 1431 sec and the new one in 624 sec, a 2.29x speed-up.
2. Using 16 threads, the original executable runs in 1067 sec and the new one in 428 sec, a 2.49x speed-up.

These results show the effectiveness of skewed tiling for 313.swim_l. Other algorithmic/compiler techniques are being sought for the benchmarks which are affected more by the resource conflicts.
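For illustration, the sketch below shows plain (unskewed) loop tiling on a stencil-like update: a block of the source array is brought into cache once and reused for all updates inside the block. This is only a generic example of the idea; the Sun compiler's skewed tiling [7] additionally skews the time loop so that reuse is also captured across successive sweeps.

```c
/* Generic loop tiling on a stencil update (illustrative sketch). */
#define N 3802
#define TILE 64

static int imin(int x, int y) { return x < y ? x : y; }

void smooth_tiled(double a[N][N], const double b[N][N])
{
    /* Process the interior in TILE x TILE blocks; neighboring rows of
       b stay in cache and are reused within each block. */
    for (int jj = 1; jj < N - 1; jj += TILE)
        for (int ii = 1; ii < N - 1; ii += TILE)
            for (int j = jj; j < imin(jj + TILE, N - 1); j++)
                for (int i = ii; i < imin(ii + TILE, N - 1); i++)
                    a[j][i] = 0.25 * (b[j][i-1] + b[j][i+1]
                                    + b[j-1][i] + b[j+1][i]);
}
```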

5 Conclusion

In this paper, we first described the architecture of an example CMP server, the Sun Fire E25K, in detail. We then introduced the OpenMP execution model along with the SPEC OMPL benchmark suite used for our performance study, and showed how to generate highly optimized executables for SPEC OMPL using the Sun Studio 10 compiler. We also described the system settings under which we ran the optimized executables, including the Solaris 10 OS features (MPO, MPSS) which help improve HPC application performance, and the binding of threads to processors. Using these features, we measured the performance impact of resource conflicts on CMPs for SPEC OMPL, using either one core or both cores of the UltraSPARC IV CMPs in the system. It turned out that the benchmarks which have high memory bandwidth requirements and/or use large amounts of memory suffer in the presence of the resource conflicts. Algorithmic and compiler techniques are needed to reduce the conflicts on the limited resources shared among the cores.

Acknowledgments. This work was supported by the Korea Research Foundation Grant funded by the Korean Government (MOEHRD, Basic Research Promotion Fund) (KRF-2006-311-D00785). The authors would like to thank the Center for Computing and Communications of RWTH Aachen University for providing access to the Sun Fire E25K and E2900 servers.

References

1. AMD Multi-Core: Introducing x86 Multi-Core Technology & Dual-Core Processors (2005), http://multicore.amd.com/
2. Chaudhry, S., Caprioli, P., Yip, S., Tremblay, M.: High-Performance Throughput Computing. IEEE Micro (May-June 2005)
3. Intel Dual-Core Server Processor, http://www.intel.com/business/bss/products/server/dual-core.htm
4. Intel Hyper-Threading Technology, http://www.intel.com/technology/hyperthread/index.htm
5. Kalla, R., Sinharoy, B., Tendler, J.: IBM POWER5 Chip: A Dual-Core Multithreaded Processor. IEEE Micro (March-April 2004)
6. Li, Y., Brooks, D., Hu, Z., Skadron, K.: Performance, Energy, and Thermal Considerations for SMT and CMP Architectures. In: 11th International Symposium on High-Performance Computer Architecture (2005)
7. Li, Z.: Optimal Skewed Tiling for Cache Locality Enhancement. In: International Parallel and Distributed Processing Symposium (IPDPS 2003) (2003)

8. Olukotun, K., et al.: The Case for a Single-Chip Multiprocessor. In: International Conference on Architectural Support for Programming Languages and Operating Systems (1996)
9. OpenMP Architecture Review Board, http://www.openmp.org
10. Solaris 10 Operating System, http://www.sun.com/software/solaris
11. The SPEC OMP benchmark suite, http://www.spec.org/omp
12. Sun Fire E25K server, http://www.sun.com/servers/highend/sunfire_e25k/index.xml
13. Sun Studio 10 Software, http://www.sun.com/software/products/studio/index.html
14. Sun UltraSPARC T1 microprocessor, http://www.sun.com/processors/ultrasparc-t1
15. Tullsen, D., Eggers, S., Levy, H.: Simultaneous Multithreading: Maximizing On-Chip Parallelism. In: International Symposium on Computer Architecture (1995)