Multi-core Programming Evolution
Based on slides from Intel Software College and "Multi-Core Programming: Increasing Performance through Software Multi-threading" by Shameem Akhter and Jason Roberts. 2008/1/17

Evolution of Multi-Core Technology
Single-threading: only one task processes at a time.
Multitasking and multithreading: two or more tasks execute at one time by using context switching (functionality).
HT Technology: two threads execute simultaneously on the same processor core.
Multi-core technology: the computational work of an application is divided and spread over multiple execution cores (performance).
Formal Definitions of Threads, Processors, and Hyper-Threading in Terms of Resources
Thread: the basic unit of CPU utilization; program counter, CPU state information, stack.
Processor architecture state: general-purpose CPU registers, interrupt controller registers, caches, buses, execution units, branch prediction logic.
Logical processor: duplicates the architecture state of the processor; the execution resources can be shared among the logical processors.
Hyper-Threading: Intel's version of Simultaneous Multi-Threading (SMT). It makes a single physical processor appear as multiple logical processors, and the OS can schedule threads to the logical processors.

Hyper-Threading and Multicore
Hyper-Threading: instructions from the logical processors are persistent and execute on shared execution resources. It is up to the microarchitecture how and when to interleave the instructions: when one thread stalls (including on cache misses and branch mispredictions), another can proceed.
Multicore: chip multiprocessing; 2 or more execution cores in a single processor, i.e., 2 or more processors on a single die. The cores can share an on-chip cache, and multicore can be combined with SMT.
[Figure: Single Processor vs. Multiprocessor]
[Figure: Hyper-Threading vs. Multi-core Processor]
[Figure: Multicore with Shared Cache; Multicore with Hyper-Threading Technology]

Defining Multi-Core Technology
With multi-core technology, two or more complete computational engines, each with its own CPU state, interrupt logic, and cache units, are placed in a single processor. Computers split the computational work of a threaded application and spread it over multiple execution cores. More tasks get completed in less time, increasing the performance and responsiveness of the system.
[Figure: computational work distributed between multiple cores, Core 1 through Core 3]
Evolution to Dual Core
Basic processor: pipelined, 1 instruction/cycle.
Superscalar: parallel functional units.
Evolution to Dual Core
Dual core: double the transistors.

Scope
Instruction level: encoding/pipelined execution to increase throughput.
Instruction level: scheduling, superscalar.
Outermost loops and whole program: data/functional decomposition.
HT Technology and Dual-Core
[Figure: dual processors, a processor with HT, a dual-core (DC) processor, and a dual-core processor with HT, each showing its cores, logical processors, and APICs on the system bus]
APIC (Advanced Programmable Interrupt Controller): APIC bus controllers solve interrupt routing efficiency issues in multiprocessor computer systems.

Hyper-Threading vs. Dual Core
Both threads without multitasking or Hyper-Threading Technology: the threads run one after the other.
Both threads with multitasking: time is saved by interleaving the threads.
Both threads with dual-core processors: the threads run fully in parallel, saving the most time.
Driving Greater Parallelism
[Figure: performance normalized to the initial Intel Pentium 4 processor, 2000 to 2008 (forecast): single-core performance grows about 3x, while dual/multi-core performance grows about 10x]

Power vs. Frequency
With power modeled as P ~ f^2: a single core at frequency f dissipates P0 ~ f^2. Doubling the frequency to 2f gives P ~ (2f)^2 = 4P0, a 4x power increase. Running two cores per die/socket at frequency f instead gives P ~ 2f^2 = 2P0, only 2x the power.
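The slide's power comparison can be written out explicitly, under its simplifying assumption that dynamic power scales with the square of frequency (a fuller model is P proportional to C V^2 f, with voltage tracking frequency):

```latex
P_0 \propto f^2
  \qquad\text{(one core at frequency } f\text{)}\\
P_{\text{1 core at } 2f} \propto (2f)^2 = 4f^2 = 4P_0
  \qquad\text{(4x the power)}\\
P_{\text{2 cores at } f} \propto 2f^2 = 2P_0
  \qquad\text{(only 2x the power)}
```

So two cores at frequency f can roughly double throughput for 2x the power, whereas one core at 2f pays 4x the power for the same nominal doubling.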
How HT Technology and Dual-Core Work
[Figure: physical processor cores and the logical processors visible to the OS; threads 1 and 2 share resources 1-3 over time, and throughput rises as more of the resources stay busy]

Today? Multi-Core
Scaling a large core down to 1/4 the power still yields about 1/2 the performance, so several small cores with a shared cache deliver more performance within the same power budget than one large core. Multi-core is power efficient and allows better power and thermal management.
The Future? Many-core
[Figure: a heterogeneous multi-core platform combining general-purpose cores, special-purpose (SP) hardware, and an interconnect fabric]

Taking full advantage of multi-core requires multi-threaded software:
Increased responsiveness and worker productivity: application responsiveness increases when different tasks run in parallel.
Improved performance in parallel environments when running computations on multiple processors.
More computation per cubic foot of data center: web-based apps are often multi-threaded by nature.
Threading: the Most Common Parallel Technique
The OS creates a process for each program loaded; additional threads can be created within the process. Each thread has its own stack and instruction pointer (IP), while all threads share the code and data of a single shared address space, with thread-local storage if required. Contrast this with distributed-address-space parallelism such as the Message Passing Interface (MPI).
[Figure: a process with shared data and code containing thread1() through threadn(), each with its own stack and IP]

Independent Programs vs. Parallel Programs
Independent applications require a process switch between them; threads within one application require only a thread switch. Threads allow one application to utilize the power of multiple processors.
Multi-Threading on Single-Core vs. Multi-Core
Optimal application performance on multi-core architectures is achieved by effectively using threads to partition software workloads. On a single core, multi-threading can only be used to hide latency: for example, rather than blocking the UI while printing, spawn a thread for the print job and free the UI thread to respond to the user.
Multi-threaded applications running on multi-core platforms have different design considerations than multi-threaded applications running on single-core platforms. On a single core, developers can make assumptions that simplify writing and debugging the code; these assumptions may not be valid on multi-core platforms, e.g., in areas like memory caching and thread priority.

Memory Caching Example
Each core may have its own cache, and the caches may be out of sync. Consider two threads on a dual-core system reading and writing neighboring memory locations. Because of data locality, the data values may sit in the same cache line; when one core writes, the memory system marks the line invalid in the other core's cache. The result is a cache miss even though the data in the cache being read is still valid. This is not an issue on a single core, since there is only one cache.
Thread Priorities Example
Consider an application with two threads of different priorities. To improve performance, the developer assumes that the higher-priority thread will run without interference from the lower-priority thread. On a single core this is valid, since the scheduler will not yield the CPU to the lower-priority thread. On multi-core, the OS may schedule the two threads on different cores, where they run simultaneously, making such code unstable.