The Intel move from ILP into Multi-threading


Miguel Pires
Departamento de Informática, Universidade do Minho
Braga, Portugal

Abstract. Multicore technology has reached the consumer market in recent years in response to what appear to be the limits of the single-core paradigm. With a small increase in chip cost and some engineering development, the first implementations of multithreading delivered relevant improvements by dividing programs into threads and mixing those threads in the pipeline of a single core. Core parallelism (multicore architectures) applied multithreading at the core level, on the same chip. This paper presents an introduction to simultaneous multithreading technology and its implementations. It explains Intel's hyperthreading approach and analyses multicore techniques, with special emphasis on the Xeon processor (single and dual core) and its competitors. As single-core technology no longer seems to improve fast enough, multithreading at the core level (core parallelism), sometimes combined with multithreading at the logical level (like hyperthreading) and with powerful "thread-tailored" software, seems to achieve the best results.

1 Introduction

Thread-level parallelism techniques in a single processor, like HT (hyperthreading), tend to minimise both horizontal and vertical waste in the processor pipeline by presenting an additional logical processor and mixing several threads at the same time. MC (multicore) processors extend this technique to multiple physical processors on the same chip. Nowadays some processors combine these techniques at the instruction and thread (logical and processor) levels with powerful thread-oriented software to maximise the processor's multithreading capabilities and thus achieve better throughput. Section 2 of this article describes multithreading on a single core (SC), Section 3 covers multithreading on multicore, and Section 4 compares the HT and MC performance gains on Intel's microarchitectures and the critical factors in performance. Finally, Section 5 presents a comparative analysis of two different approaches to MC architecture, Intel Xeon and AMD Opteron.

2 Multithreading on Single Core

Superscalar single-threaded architectures optimize the processor pipeline but do not account for important factors external to the pipeline, like cache misses, interrupts and branch mispredictions. Figure 1 represents a superscalar, out-of-order processor with multi-level prefetching that can execute several instructions in the same clock cycle. As can be seen, horizontal and vertical execution bubbles appear because only one block of instructions runs at a time [1][2][12]. In fine-grain multithreading several threads are resident at once, but only one thread issues in any given clock cycle. A switching mechanism selects one thread per cycle, which avoids vertical bubbles (entirely idle cycles) in the pipeline but offers no solution to horizontal waste (unused issue slots within a cycle). SMT (simultaneous multithreading) runs several threads at the same time, trying to fill the pipeline vertically and horizontally as much as possible.

Figure 1 Vertical and horizontal waste of non-threaded microarchitectures [3]

As described in [1], the first known implementation of multithreading technology was the TX2, which used multiple threads to support fast context switching for handling I/O functions. Since then the concept has evolved many times, the most significant step being a fine-grained multithreading scheme with interleaved scheduling among threads (CDC 6600). The first simulation of a multithreaded superscalar architecture appeared in 1994, and 1995 brought the first realistic simulation assessment, which coined the term simultaneous multithreading (SMT). As Figure 2 shows, experiments in this field [4] found that a single SMT core performs better even than a parallel multiprocessor without SMT (see Figure 3).

Figure 2 Performance comparison of SMT to superscalars, multithreaded processors and on-chip multiprocessors (instructions/cycle) [4]

SMT is thus an evolutionary technique that minimizes pipeline waste at multiple levels (thread and instruction levels) to raise processor utilization substantially. As a consequence, the number of instructions per clock cycle rises, which benefits both multiprogramming and parallel workloads. Multiscalar processors speculatively execute threads using dynamic branch prediction techniques and squash threads if control (branch) or data (memory) speculation is incorrect. Although all of these architectures exploit multiple forms of parallelism, only SMT has the ability to dynamically share execution resources among all threads. In contrast, the others partition resources either in space or time, thereby limiting their flexibility to adapt to available parallelism.

Figure 3 Pipeline Multiprocessor Architecture (based on [5])

Intel introduced Hyperthreading technology in 2002. It is based on SMT techniques and allows two threads to execute simultaneously, in the same clock cycle, on a single processor. Execution can be divided into two threads mixed in the pipeline. Using optimized algorithms, the threads share physical resources such as caches, execution units, branch predictors, control logic, and buses. APICs (Advanced Programmable Interrupt Controllers) control the state of each logical processor and are therefore duplicated, as shown in Figure 4 [5].
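The difference between vertical and horizontal waste discussed in this section can be made concrete with a small counting exercise. The sketch below is only a toy model, not a pipeline simulator: the per-cycle instruction counts are hypothetical, the function names are illustrative, and instructions that do not fit in a cycle are simply ignored rather than deferred. It counts an entirely idle cycle as vertical waste and unused slots in a partly filled cycle as horizontal waste, then compares one thread alone, a fine-grained scheme (one thread per cycle) and SMT (several threads per cycle).

```python
from typing import List, Tuple


def waste(slots_used: List[int], width: int) -> Tuple[int, int]:
    """Return (vertical, horizontal) wasted issue slots.

    slots_used[i] is how many of the `width` issue slots were filled in cycle i.
    """
    # Vertical waste: cycles in which no slot at all was used.
    vertical = sum(width for used in slots_used if used == 0)
    # Horizontal waste: unused slots in cycles that issued something.
    horizontal = sum(width - used for used in slots_used if 0 < used < width)
    return vertical, horizontal


def fine_grained(a: List[int], b: List[int]) -> List[int]:
    """One thread issues per cycle; assumed here: the one with more ready work."""
    return [max(x, y) for x, y in zip(a, b)]


def smt(a: List[int], b: List[int], width: int) -> List[int]:
    """Both threads may issue in the same cycle, up to the pipeline width."""
    return [min(width, x + y) for x, y in zip(a, b)]


WIDTH = 4
thread_a = [2, 0, 1, 2]  # hypothetical ready instructions per cycle
thread_b = [1, 3, 0, 2]

print(waste(thread_a, WIDTH))                          # (4, 7)
print(waste(fine_grained(thread_a, thread_b), WIDTH))  # (0, 8)
print(waste(smt(thread_a, thread_b, WIDTH), WIDTH))    # (0, 5)
```

In this made-up trace the fine-grained scheme removes the idle cycle but leaves partly filled cycles as they are, while the SMT variant also reduces the number of empty slots within a cycle.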


Figure 4 Dual Core and HyperThreading Intel technology (based on [5])

In a super-pipelined microarchitecture, events like cache misses, interrupts and branch mispredictions can be costly, so when one occurs in one thread, HT processors can fill the pipeline with the other thread and thereby maximise the number of instructions per cycle. Intel reports that, because the logical processors share almost all physical microprocessor resources and only a few small structures were replicated, the die area cost of the implementation was less than 5% of the total area, and the clock-cycle time is not significantly different from the non-multithreaded one [5].

This two-logical-processor architecture has many engineering implications. HT changed many basic assumptions about single-threaded out-of-order design. To introduce multithreading, Intel therefore had to change algorithms and create new ones to prioritize micro-operations (micro-ops) from different logical processors. They had to make design choices about memory sharing between the two threads and about pointer manipulation, a subject already complex enough in the x86 architecture. The increased complexity dramatically increases the validation effort. On the platform side they also reviewed and optimised the chipset, BIOS, operating systems, and applications [5].

HT improves overall performance under multitasking and when applications are already multithreaded. HT gains therefore depend on how well applications are fitted to take advantage of this technology, like those that exploit data-parallel execution, but most of the time this requires some engineering [6]. Intel reported that HT achieved a 15% to 27% increase in processor resource utilization in well-optimized multimedia applications [7].

Figure 5 Hyperthreading technology performance gains on several popular multithreaded software packages [8]

Figure 6 Hyperthreading technology performance boost on multitasking workloads [8]

For years, microprocessor performance improved at an appreciable rate (following Moore's Law) on a single-processor basis. In November 2003, a group of Intel researchers announced the technological limits of microprocessor miniaturization [16]. Given these constraints on single-core technology, the continuous demand for speed gains, the limits on exploiting instruction-level parallelism (which will not support the same growth) and the advances in parallel computing technology, manufacturers saw new opportunities in connecting multiple processors together [1]. As the main aim of parallelism is to maximize the use of the processor, all the accelerating techniques in a single core cause more activity and therefore higher temperatures in the processor. The more single-processor performance growth slows down, the more attractive the multiprocessor field becomes.
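The stall-hiding behaviour attributed to HT in this section (when one thread waits on a cache miss or misprediction, the other thread uses the pipeline) can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the pipeline issues one instruction per cycle, each thread is a hypothetical string of `I` (issue) and `S` (stall) steps, and the first thread always has priority. It is not a model of Intel's actual scheduling.

```python
from typing import List


def run_shared(threads: List[str]) -> int:
    """Cycles needed when the given threads share one single-issue pipeline.

    Each thread is a string of 'I' (wants to issue an instruction) and
    'S' (stalled, e.g. waiting on a cache miss) steps.
    """
    pos = [0] * len(threads)
    cycles = 0
    while any(p < len(t) for p, t in zip(pos, threads)):
        cycles += 1
        slot_free = True  # one issue slot per cycle in this toy pipeline
        for i, t in enumerate(threads):
            if pos[i] >= len(t):
                continue  # thread already finished
            if t[pos[i]] == "S":
                pos[i] += 1  # a stall cycle elapses without using the slot
            elif slot_free:
                pos[i] += 1  # this thread issues and takes the slot
                slot_free = False
            # otherwise the thread wanted to issue but must wait a cycle
    return cycles


def run_back_to_back(threads: List[str]) -> int:
    """Cycles needed when the threads run one after another, without sharing."""
    return sum(run_shared([t]) for t in threads)


workload = ["ISSI", "IISI"]  # hypothetical instruction/stall traces
print(run_back_to_back(workload))  # 8
print(run_shared(workload))        # 5
```

The shared run is shorter only because the stall cycles of one trace overlap with issue cycles of the other. With traces that never stall, the two functions return the same count, which matches the later remark that the pipeline width itself does not grow.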


3 Multithreading on Multicore

Although HT makes it appear that there are two logical processors instead of one, the number of micro-instructions that can execute in a single clock cycle (the pipeline width) stays the same. Furthermore, single-core SMT technologies have a major impact on long pipelines (the case of Intel's NetBurst architecture) but can be inefficient in other microarchitectures [11]. The spread of SMT technology and the good performance achieved in parallel computing encouraged manufacturers to consider new opportunities for SMT applied to multiple cores, as shown in [9]. Researchers realized that SMT would lead to higher degrees of parallelism with MP products. With significant advances in microelectronics and the growing use of heavily threaded software, Intel introduced its MC (multicore) product line in 2005 [10].

Advances in electronics and miniaturization made it possible to place two cores (and their cache memories) on the same chip. It is like having independent processors, but with much faster communication and memory access. Intel first introduced multiprocessor (MP) technology in the server market, with the high-end Xeon processors. In one of the first versions of the multicore Xeon (Figure 7), each Xeon core has its own L1 (16 KB) and L2 (1 MB) caches, and the 16 MB L3 cache is shared between the cores. Thanks to 65 nm technology, it was possible to put millions of transistors on a single chip. This Xeon has a hyper-pipelined architecture with 31 pipeline stages.

Figure 7 View of Intel Xeon Dual Core Chip [16]

The advantage of having two cores on the same chip is the possibility of real processor multithreading with much faster communication between the cores (one of the problems of parallel computing) and more efficient shared memory management. In the first versions all Xeon processors combined MC with HT technology, which means that in each core two threads can be mixed in the same pipeline and run at the same time (four threads for a dual core). If the execution has fewer threads than the maximum allowed by the processor, preference naturally goes to separate cores, because of the limitations of HT compared to MC explained above. In some versions of MC, HT does not exist or can be switched off, because in some computing markets HT is not efficient in the MC approach.

Processors exploit the full advantage of MC when execution is thread-tuned (naturally or by force), and there are computing markets where work tends to be naturally threaded. In such cases, like the server market, it is possible to replicate this multithreaded execution scheme as many times as microelectronics (and the market) allow. One example is Intel's QuadCore, tailored to the server market, which applies the same principle to 4 cores (also with HT). But with many cores available, can ordinary systems take advantage of Hyperthreading? Up to what limit? Many authors think that systems do not yet exploit the possibilities of multithreaded execution, because the paradigm is still relatively recent in the software industry, so there is much more to gain in this way [10]. Research in this domain suggests that a very high number of threads leads to complicated and inefficient resource sharing, even in powerful processors. For this reason, some authors think that the processor should decide the optimal number of threads to process [13].
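The placement preference described in this section (use separate cores first, and only share a core between two HT threads once every core is busy) can be sketched as a simple assignment rule. The function below is a hypothetical illustration of that rule, not an operating-system scheduler or an Intel mechanism. The name `place_threads` and the `(core, context)` numbering are assumptions.

```python
from typing import List, Tuple


def place_threads(n_threads: int, n_cores: int,
                  contexts_per_core: int = 2) -> List[Tuple[int, int]]:
    """Assign threads to (core, hardware context) pairs, filling cores first.

    Thread t goes to core t % n_cores, so no core runs a second thread
    until every core already runs one.
    """
    if n_cores <= 0 or contexts_per_core <= 0:
        raise ValueError("core and context counts must be positive")
    if not 0 <= n_threads <= n_cores * contexts_per_core:
        raise ValueError("thread count exceeds the hardware contexts available")
    return [(t % n_cores, t // n_cores) for t in range(n_threads)]


# Dual core with two HT contexts per core, as in the Xeon described above.
print(place_threads(2, 2))  # [(0, 0), (1, 0)]
print(place_threads(3, 2))  # [(0, 0), (1, 0), (0, 1)]
print(place_threads(4, 2))  # [(0, 0), (1, 0), (0, 1), (1, 1)]
```

With two threads each one gets a core to itself. A core is shared through its second context only from the third thread onward, which is the four-threads-on-a-dual-core case mentioned in the text.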


4 SMT Performance comparison: SC and MC

As noted in Section 2, the benefit of multithreading techniques depends on how well software takes advantage of the processor's multithreading capabilities. Experiments in this field [10] demonstrate that MC demands specific adjustments in compilers and other software. They suggest that, to take full advantage of the MC innovations, compilers should be tuned to the core count (2 cores, 4 cores, etc.). Placing the two SMT techniques side by side, a dual core running two threads is more efficient than a single core running the same threads, because of the resource sharing within the single core. It is not always efficient to use several contexts (virtual processors) in each core, because a high number of threads may cause sharing conflicts (depending on the application), but multicore processors are faster than single-core ones when applications (and mainly compilers) take advantage of multithreading. The MC gain can be up to 30% when there is a high level of parallel execution, but in common applications it will be below that value [7]. Figure 8 shows the MC effect at the micro-instruction level in various scenarios (cores, threads).

Figure 8 Normalized execution time of the benchmarks on the SMT multiprocessor. The sequential execution time is used as a reference for the normalization.

5 Comparison with competitors

As usual, the best technological examples are introduced first in the high-value market, and the high-performance server market is a good example. The following comparison gives an overview of the characteristics and performance of two big competitors in multicore processors, the Intel Xeon 7140M 3.4 GHz and the AMD Opteron 8220SE 2.8 GHz [12]. Intel presents a dual-core, hyper-pipelined (31 stages), superscalar architecture with hyperthreading, L1 (16 KB) and L2 (1 MB) caches and a 16 MB shared L3 cache. The AMD Opteron is a dual core with a 64 KB L1 and a 1 MB L2 per core, HyperTransport and AMD virtualization technologies.

Both processors reach high performance levels despite very distinct architectural options. Xeon's 16 MB L3 cache is an advantage in ERP applications and databases. On the other hand, Opteron scales better thanks to the bus between cores and memory and to its larger bandwidth. Because of its long pipeline, Xeon uses hyperthreading technology to optimize the threads in each core. Intel simply placed two Prescott (previous series) cores on the same chip. AMD, by contrast, developed a new memory controller between the cores. This means there is no need to communicate through the chipset, because memory is addressed directly over a dedicated bus named HyperTransport, which gives better bandwidth. Communication with the other resources also goes through HyperTransport, so there is no need to share the resources of the I/O subsystem (super I/O, IDE controller, SATA, AGP, PCI-Express, USB, etc.). HyperTransport is a high-performance, low-latency, full-duplex connection, and the same scheme can be applied to expand from dual core to quad core (Figure 9).

Figure 9 AMD QuadCore technologies with HyperTransport [15]
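The normalization used for Figure 8 in Section 4, where the sequential execution time serves as the reference, amounts to a simple ratio. The sketch below shows that arithmetic under stated assumptions: the timings and configuration labels are invented for illustration and are not measurements from this paper or from [10].

```python
from typing import Dict


def normalize(sequential_time: float, measured: Dict[str, float]) -> Dict[str, float]:
    """Divide each measured time by the sequential reference time.

    A value below 1.0 means the configuration ran faster than sequentially.
    """
    if sequential_time <= 0:
        raise ValueError("the sequential reference time must be positive")
    return {config: t / sequential_time for config, t in measured.items()}


def speedup(normalized_time: float) -> float:
    """Speedup over the sequential run is the inverse of the normalized time."""
    return 1.0 / normalized_time


# Hypothetical timings in seconds, for illustration only.
timings = {"1 core, 2 threads": 8.0, "2 cores, 2 threads": 5.0}
normalized = normalize(10.0, timings)
for config, value in normalized.items():
    print(config, value, speedup(value))
```

Reading a plot like Figure 8 then reduces to comparing these ratios: the lower the normalized time of a core and thread combination, the larger its gain over sequential execution.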

6 Conclusions

Simultaneous multithreading is a processor design methodology that combines instruction-level parallelism and thread-level parallelism. The aim is to increase the gains of conventional superscalar processors, on a single core or, more recently, on a multiprocessor-in-one-chip basis. Multithreading techniques divide the execution into several independent threads. In single-core SMT technology (HT in Intel's NetBurst architecture), physical resources are shared through an optimized mixing of threads, but the pipeline does not get wider, which may cause some inefficiency at the micro-instruction level. On the other hand, single-core SMT is inefficient in short pipelines like that of AMD's Opteron, because there are not many wait times in the pipeline. More recently, drawing on SMT and parallel computing knowledge and on recent progress in microelectronics, manufacturers moved to the multicore ("many processors in one chip") concept, which allows threads to be distributed across the available cores. Core parallelism is a model that is only at its beginning and can be improved up to the limits of electronic miniaturization. Replicating MC technology seems to give good results and can be a key to faster microprocessor architectures. However, off-chip bandwidth does not seem to increase at the same speed, and this will certainly constrain the number of useful cores in a chip. This means that the data supply to the cores will not be fast enough to deliver all the data that the cores can process [14].

References

[1] Hennessy, John L. and Patterson, David A.: Computer Architecture: A Quantitative Approach, Chapter 6, 3rd edition, Elsevier Science USA, 2003
[2] Lo, J., Emer, J., Levy, H., Stamm, R., Tullsen, M.: Converting Thread-Level Parallelism to Instruction-Level Parallelism via SMT, ACM, Vol. 15, No. 3, August
[3] Tullsen, D., Levy, H.: Simultaneous Multithreading: Maximizing On-Chip Parallelism, ACM Transactions on Computer Systems, 1995
[4] Eggers, S., Emer, J., Levy, H., Lo, J., Stamm, R., Tullsen, D.: SMT: A Platform for Next-Generation Processors, IEEE Micro, 1997
[5] Marr, D., Binns, F., Hill, D., Hinton, G., Koufaty, D., Miller, J., Upton, M.: Hyper-Threading Technology Architecture and Microarchitecture, Intel Technology Journal Q1, 2002
[6] Magro, W., Petersen, P., Shah, S.: Hyper-Threading Technology: Impact on Compute-Intensive Workloads, Intel Technology Journal Q1, 2002
[7] Chen, Y., Holliman, M., Debes, E., Zheltov, S., Knyazev, A., Bratanov, S., Belenov, R., Santos, I.: Media Applications on Hyper-Threading Technology, Intel Technology Journal, 2002
[8] Koufaty, David, Marr, Deborah T.: Hyperthreading Technology in the Netburst Microarchitecture, IEEE Computer Society, 2003
[9] Spracklen, L., Abraham, G.: Chip Multithreading: Opportunities and Challenges, IEEE, 2005
[10] Curtis-Maury, M., Ding, X., Antonopoulos, C., Nikolopoulos, S.: An Evaluation of OpenMP on Current and Emerging Multithreaded/Multicore Processors, DCS The College of William and Mary, 2002
[11] Hewlett-Packard Development Company: Characterizing x86 Processors for Industry-Standard Servers: AMD Opteron and Intel Xeon, Technology Brief, 2nd Edition, 2005
[12] Silva, D., Ferreira, A.: Comparação dos MultiProcessadores Intel Xeon Dual Core e AMD Opteron, IST DEI, 2006/2007
[13] Curtis-Maury, M., Wang, T., Antonopoulos, C., Nikolopoulos, D.: Integrating Multiple Forms of Multithreaded Execution on SMT Processors, College of William and Mary, 2005
[14] yperthreading/ visited in January, 25, 2007 Dua, R., Lokhande, B.: A Comparative Study of SMT and CMP Multiprocessors, Princeton University, ee8365, 2006
[15] Cardoso, B., Rosa, S., Fernandes, T.: Multicore, Unicamp, 2005
