ARCHITECTURES FOR PARALLEL COMPUTATION


Datorarkitektur Fö 11/12-1

ARCHITECTURES FOR PARALLEL COMPUTATION

1. Why Parallel Computation
2. Parallel Programs
3. A Classification of Computer Architectures
4. Performance of Parallel Architectures
5. The Interconnection Network
6. Array Processors
7. Multiprocessors
8. Multicomputers
9. Multicore Architectures
10. Multithreading
11. General Purpose Graphic Processing Units
12. Vector Processors
13. Multimedia Extensions to Microprocessors

Datorarkitektur Fö 11/12-2

Why Parallel Computation?

The need for high performance!

Two main factors contribute to the high performance of modern processors:
1. Fast circuit technology
2. Architectural features:
   - large caches
   - multiple fast buses
   - pipelining
   - superscalar architectures (multiple functional units)

However, computers running with a single CPU are often not able to meet performance needs in certain areas:
- fluid flow analysis and aerodynamics;
- simulation of large complex systems, for example in physics, economy, biology, technology;
- computer aided design;
- multimedia.

Applications in the above domains are characterized by a very large amount of numerical computation and/or a large quantity of input data.

Datorarkitektur Fö 11/12-3

A Solution: Parallel Computers

One solution to the need for high performance: architectures in which several CPUs are running in order to solve a certain application.

Such computers have been organized in very different ways. Some key features:
- number and complexity of the individual CPUs
- availability of common (shared) memory
- interconnection topology
- performance of the interconnection network
- I/O devices

Datorarkitektur Fö 11/12-4

Parallel Programs

1. Parallel sorting
[Figure: four unsorted sub-arrays (Unsorted-1 to Unsorted-4) are sorted in parallel by Sort-1 to Sort-4, producing Sorted-1 to Sorted-4, which are then combined by a Merge step into the final SORTED array.]

Datorarkitektur Fö 11/12-5

Parallel Programs (cont'd)

A possible program for parallel sorting:

var t: array[1..1000] of integer;
procedure sort(i,j:integer);
  - sort elements between t[i] and t[j] -
end sort;
procedure merge;
  - merge the four sub-arrays -
end merge;
begin
  cobegin
    sort(1,250)
    sort(251,500)
    sort(501,750)
    sort(751,1000)
  coend;
  merge;
end;

Datorarkitektur Fö 11/12-6

Parallel Programs (cont'd)

2. Matrix addition: C = A + B, where c[i,j] = a[i,j] + b[i,j] for n x m matrices.

var a: array[1..n,1..m] of integer;
    b: array[1..n,1..m] of integer;
    c: array[1..n,1..m] of integer;
    i,j: integer
begin
  for i:=1 to n do
    for j:=1 to m do
      c[i,j]:=a[i,j]+b[i,j];
    end for
  end for
end;

Datorarkitektur Fö 11/12-7

Parallel Programs (cont'd)

Matrix addition - parallel version (a C thread-based sketch is given after slide 8):

var a: array[1..n,1..m] of integer;
    b: array[1..n,1..m] of integer;
    c: array[1..n,1..m] of integer;
    i: integer
procedure add_vector(n_ln:integer);
  var j: integer
  begin
    for j:=1 to m do
      c[n_ln,j]:=a[n_ln,j]+b[n_ln,j];
    end for
  end add_vector;
begin
  cobegin
    for i:=1 to n do
      add_vector(i);
  coend;
end;

Datorarkitektur Fö 11/12-8

Parallel Programs (cont'd)

Matrix addition - vector computation version:

var a: array[1..n,1..m] of integer;
    b: array[1..n,1..m] of integer;
    c: array[1..n,1..m] of integer;
    i,j: integer
begin
  for i:=1 to n do
    c[i,1:m]:=a[i,1:m]+b[i,1:m];
  end for;
end;

Or even so:

begin
  c[1:n,1:m]:=a[1:n,1:m]+b[1:n,1:m];
end;
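As an illustration, here is a minimal C sketch (using POSIX threads) of the row-parallel matrix addition from slide 7. The matrix sizes N and M, the thread-per-row decomposition, and the example operand values are choices made for this sketch, not part of the original slides.

/* Row-parallel matrix addition: one thread per row, mirroring add_vector(). */
#include <pthread.h>
#include <stdio.h>

#define N 4
#define M 5

int a[N][M], b[N][M], c[N][M];

/* Each thread computes one row of c = a + b. */
static void *add_vector(void *arg)
{
    int row = *(int *)arg;
    for (int j = 0; j < M; j++)
        c[row][j] = a[row][j] + b[row][j];
    return NULL;
}

int main(void)
{
    pthread_t tid[N];
    int rows[N];

    /* Initialize the operands with some example values. */
    for (int i = 0; i < N; i++)
        for (int j = 0; j < M; j++) {
            a[i][j] = i + j;
            b[i][j] = i * j;
        }

    /* "cobegin": start one thread per row. */
    for (int i = 0; i < N; i++) {
        rows[i] = i;
        pthread_create(&tid[i], NULL, add_vector, &rows[i]);
    }
    /* "coend": wait for all threads to finish. */
    for (int i = 0; i < N; i++)
        pthread_join(tid[i], NULL);

    printf("c[2][3] = %d\n", c[2][3]);
    return 0;
}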

Datorarkitektur Fö 11/12-9

Parallel Programs (cont'd)

Pipeline model of computation:
[Figure: a two-stage pipeline; the first stage reads x and computes a = 45 + log x, the second stage receives a and computes y = 5 * sqrt(a).]

Datorarkitektur Fö 11/12-10

Parallel Programs (cont'd)

A program for the previous computation (a C sketch of this producer/consumer pipeline is given after slide 12):

channel ch: real;
-
cobegin
  var x: real;
  while true do
    read(x);
    send(ch, 45+log(x));
  end while

  var v: real;
  while true do
    receive(ch,v);
    write(5*sqrt(v));
  end while
coend;
-

Datorarkitektur Fö 11/12-11

Flynn's Classification of Computer Architectures

Flynn's classification is based on the nature of the instruction flow executed by the computer and that of the data flow on which the instructions operate.

1. Single Instruction stream, Single Data stream (SISD)
[Figure: a single CPU; the control unit sends the instruction stream to the processing unit, which exchanges a single data stream with the memory unit.]

Datorarkitektur Fö 11/12-12

Flynn's Classification (cont'd)

2. Single Instruction stream, Multiple Data stream (SIMD)

SIMD with shared memory
[Figure: one control unit broadcasts a single instruction stream (IS) to processing units 1..n; each processing unit handles its own data stream (DS1..DSn) and accesses a shared memory through an interconnection network.]
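Below is a minimal C sketch of the two-stage pipeline from slide 10, assuming a Unix pipe as the "channel" and a fixed number of input values instead of the endless while-true loops of the pseudocode; these are simplifications made for the example.

/* Two-stage pipeline: stage 1 computes 45 + log(x), stage 2 computes 5 * sqrt(v). */
#include <math.h>
#include <stdio.h>
#include <unistd.h>
#include <sys/wait.h>

int main(void)
{
    int fd[2];
    pipe(fd);                       /* fd[1]: write end, fd[0]: read end ("channel ch") */

    if (fork() == 0) {              /* stage 1: producer */
        close(fd[0]);
        for (double x = 1.0; x <= 5.0; x += 1.0) {
            double a = 45.0 + log(x);
            write(fd[1], &a, sizeof a);     /* send(ch, 45 + log(x)) */
        }
        close(fd[1]);
        _exit(0);
    }

    /* stage 2: consumer */
    close(fd[1]);
    double v;
    while (read(fd[0], &v, sizeof v) == sizeof v)  /* receive(ch, v) */
        printf("y = %f\n", 5.0 * sqrt(v));         /* write(5 * sqrt(v)) */
    close(fd[0]);
    wait(NULL);
    return 0;
}

Compile with the math library (e.g. cc pipeline.c -lm); both stages then run as separate processes connected only by the pipe.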

Datorarkitektur Fö 11/12-13

Flynn's Classification (cont'd)

SIMD with no shared memory
[Figure: one control unit broadcasts the instruction stream (IS) to processing units 1..n; each processing unit has its own local memory (LM1..LMn) and data stream (DS1..DSn), and the processing units communicate through an interconnection network.]

Datorarkitektur Fö 11/12-14

Flynn's Classification (cont'd)

3. Multiple Instruction stream, Multiple Data stream (MIMD)

MIMD with shared memory
[Figure: CPU_1..CPU_n, each with its own control unit, processing unit, local memory (LM1..LMn) and instruction stream (IS1..ISn); the processing units access a shared memory through an interconnection network.]

Datorarkitektur Fö 11/12-15

Flynn's Classification (cont'd)

MIMD with no shared memory
[Figure: CPU_1..CPU_n, each with its own control unit, processing unit and local memory (LM1..LMn); there is no shared memory, and the CPUs communicate only through an interconnection network.]

Datorarkitektur Fö 11/12-16

Performance of Parallel Architectures

Important questions:

How fast does a parallel computer run at its maximal potential?

How fast an execution can we expect from a parallel computer for a concrete application?

How do we measure the performance of a parallel computer and the performance improvement we get by using such a computer?

Datorarkitektur Fö 11/12-17

Performance Metrics

Peak rate: the maximal computation rate that can theoretically be achieved when all modules are fully utilized.

The peak rate is of no practical significance for the user. It is mostly used by vendor companies for marketing their computers.

Speedup: measures the gain we get by using a certain parallel computer to run a given parallel program in order to solve a specific problem:

    S = T_S / T_P

T_S: execution time needed with the best sequential algorithm;
T_P: execution time needed with the parallel algorithm.

Datorarkitektur Fö 11/12-18

Performance Metrics (cont'd)

Efficiency: this metric relates the speedup to the number of processors used; by this it provides a measure of how efficiently the processors are used:

    E = S / p

S: speedup;
p: number of processors.

For the ideal situation, in theory:

    S = T_S / (T_S / p) = p, which means E = 1

In practice the ideal efficiency of 1 cannot be achieved!

Datorarkitektur Fö 11/12-19

Amdahl's Law

Consider f to be the ratio of computations that, according to the algorithm, have to be executed sequentially (0 <= f <= 1); p is the number of processors.

    T_P = f * T_S + (1 - f) * T_S / p

    S = T_S / (f * T_S + (1 - f) * T_S / p) = 1 / (f + (1 - f) / p)

Datorarkitektur Fö 11/12-20

Amdahl's Law (cont'd)

Amdahl's law: even a small ratio of sequential computation imposes a limit on the achievable speedup; a speedup higher than 1/f cannot be achieved, regardless of the number of processors.

    E = S / p = 1 / (f * (p - 1) + 1)

To efficiently exploit a high number of processors, f must be small (the algorithm has to be highly parallel).
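A quick numeric check of the formulas on slides 19-20, as a small C sketch; the sequential fraction f = 0.05 and the list of processor counts are example values chosen for this illustration, not taken from the slides.

/* Amdahl's law: speedup S = 1 / (f + (1 - f)/p) and efficiency E = S/p. */
#include <stdio.h>

int main(void)
{
    double f = 0.05;                        /* sequential fraction (example value) */
    int counts[] = { 1, 2, 4, 8, 16, 64, 1024 };

    for (int i = 0; i < 7; i++) {
        int p = counts[i];
        double S = 1.0 / (f + (1.0 - f) / p);
        double E = S / p;
        printf("p = %4d   S = %6.2f   E = %.3f\n", p, S, E);
    }
    /* As p grows, S approaches 1/f = 20 while E drops towards 0. */
    return 0;
}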

Datorarkitektur Fö 11/12-21

Other Aspects which Limit the Speedup

Beside the intrinsic sequentiality of some parts of an algorithm, there are also other factors that limit the achievable speedup:
- communication cost
- load balancing of the processors
- costs of creating and scheduling processes
- I/O operations

There are many algorithms with a high degree of parallelism; for such algorithms the value of f is very small and can be ignored. These algorithms are suited for massively parallel systems; in such cases the other limiting factors, like the cost of communication, become critical.

Datorarkitektur Fö 11/12-22

Efficiency and Communication Cost

Consider a highly parallel computation, so that f (the ratio of sequential computations) can be neglected.

We define f_c, the fractional communication overhead of a processor:

T_calc: time that a processor spends executing computations;
T_comm: time that a processor is idle because of communication;

    f_c = T_comm / T_calc

    T_P = (T_S / p) * (1 + f_c)

    S = T_S / T_P = p / (1 + f_c)

    E = S / p = 1 / (1 + f_c)

(A numeric C sketch of this model is given after slide 24.)

With algorithms that have a high degree of parallelism, massively parallel computers, consisting of a large number of processors, can be used efficiently if f_c is small; this means that the time a processor spends on communication has to be small compared to its effective computation time.

In order to keep f_c reasonably small, the size of the processes cannot go below a certain limit.

Datorarkitektur Fö 11/12-23

The Interconnection Network

The interconnection network (IN) is a key component of the architecture. It has a decisive influence on the overall performance and cost.

The traffic in the IN consists of data transfers and transfers of commands and requests.

The key parameters of the IN are:
- total bandwidth: transferred bits/second
- cost

Datorarkitektur Fö 11/12-24

The Interconnection Network (cont'd)

Single Bus
[Figure: nodes 1..n attached to one shared bus.]

Single bus networks are simple and cheap.

One single communication is allowed at a time; the bandwidth is shared by all nodes.

Performance is relatively poor.

In order to keep a certain performance, the number of nodes is limited (16-20).
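As referenced on slide 22, here is a small C sketch of the communication-overhead model; the processor count and the T_calc/T_comm figures are invented example numbers, not measurements from the slides.

/* Efficiency under communication overhead: E = 1 / (1 + f_c), S = p / (1 + f_c). */
#include <stdio.h>

int main(void)
{
    int    p      = 256;                  /* number of processors (example value) */
    double t_calc = 40.0e-3;              /* computation time per processor, seconds */
    double t_comm[] = { 1e-3, 4e-3, 10e-3, 40e-3 };   /* idle time spent communicating */

    for (int i = 0; i < 4; i++) {
        double f_c = t_comm[i] / t_calc;
        double S   = p / (1.0 + f_c);
        double E   = 1.0 / (1.0 + f_c);
        printf("f_c = %.3f   S = %7.1f   E = %.2f\n", f_c, S, E);
    }
    /* Keeping processes large enough keeps T_comm small relative to T_calc. */
    return 0;
}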

Datorarkitektur Fö 11/12-25

The Interconnection Network (cont'd)

Completely connected network
[Figure: five nodes, each directly linked to every other node.]

Each node is connected to every other one. Communications can be performed in parallel between any pair of nodes.

Both performance and cost are high.

Cost increases rapidly with the number of nodes.

Datorarkitektur Fö 11/12-26

The Interconnection Network (cont'd)

Crossbar network
[Figure: nodes 1..n connected through a grid of switches, so that any node can be switched onto a path to any other node.]

The crossbar is a dynamic network: the interconnection topology can be modified by positioning switches.

The crossbar switch is completely connected: any node can be directly connected to any other.

Fewer interconnections are needed than for the static completely connected network; however, a large number of switches is needed.

A large number of communications can be performed in parallel (although a given node can receive or send only one data item at a time).

Datorarkitektur Fö 11/12-27

The Interconnection Network (cont'd)

Mesh network
[Figure: a 4x4 mesh of nodes 1..16, each node linked to its horizontal and vertical neighbours.]

Mesh networks are cheaper than completely connected ones and provide relatively good performance.

In order to transmit information between certain nodes, routing through intermediate nodes is needed (a maximum of 2*(n-1) intermediate nodes for an n*n mesh).

It is possible to provide wraparound connections: between nodes 1 and 13, 2 and 14, etc.

Three-dimensional meshes have also been implemented.

Datorarkitektur Fö 11/12-28

The Interconnection Network (cont'd)

Hypercube network
[Figure: a 4-dimensional hypercube of nodes N0..N15.]

2^n nodes are arranged in an n-dimensional cube. Each node is connected to n neighbours.

In order to transmit information between certain nodes, routing through intermediate nodes is needed (a maximum of n intermediate nodes).
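In a hypercube, a common labelling assigns each node an n-bit address so that neighbours differ in exactly one bit; the number of hops between two nodes is then the number of differing bits. A minimal C sketch of that routing-distance calculation follows; the labelling convention and the node numbers used are assumptions made for this example.

/* Hops between two hypercube nodes = Hamming distance of their binary labels. */
#include <stdio.h>

static int hypercube_hops(unsigned src, unsigned dst)
{
    unsigned diff = src ^ dst;   /* bits in which the two labels differ */
    int hops = 0;
    while (diff) {
        hops += diff & 1u;       /* one hop per differing dimension */
        diff >>= 1;
    }
    return hops;
}

int main(void)
{
    /* In a 4-dimensional hypercube (16 nodes), N0 to N15 needs 4 hops. */
    printf("N0 -> N15: %d hops\n", hypercube_hops(0, 15));
    printf("N5 -> N7 : %d hops\n", hypercube_hops(5, 7));
    return 0;
}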

Datorarkitektur Fö 11/12-29

SIMD Computers

SIMD computers are usually called array processors.

The processing elements (PEs) are usually very simple: an ALU which executes the instruction broadcast by the CU, a few registers, and some local memory.

The first SIMD computer:
- ILLIAC IV (1970s): 64 relatively powerful processors (mesh connection, see above).

Contemporary commercial computer:
- CM-2 (Connection Machine) by Thinking Machines Corporation: very simple processors (connected as a hypercube).

Array processors are highly specialized for numerical problems that can be expressed in a matrix or vector format (see the program on slide 8).

Each PE computes one element of the result.

Datorarkitektur Fö 11/12-30

MULTIPROCESSORS

Shared memory MIMD computers are called multiprocessors.
[Figure: processors 1..n, each with a local memory, connected to a common shared memory.]

Some multiprocessors have no shared memory which is central to the system and equally accessible to all processors. All the memory is distributed as local memory to the processors. However, each processor has access to the local memory of any other processor: a global physical address space is available.

This memory organization is called distributed shared memory.

Datorarkitektur Fö 11/12-31

Multiprocessors (cont'd)

Communication between processors is through the shared memory. One processor can change the value in a location and the other processors can read the new value.

From the programmer's point of view, communication is realised by shared variables; these are variables which can be accessed by each of the parallel activities (processes):
- table t in slide 5;
- matrices a, b, and c in slide 7.
(A small C sketch of this communication style is given after slide 32.)

With many fast processors memory contention can seriously degrade performance, so multiprocessor architectures do not support a high number of processors.

Datorarkitektur Fö 11/12-32

Multiprocessors (cont'd)

IBM System/370 (1970s): two IBM CPUs connected to shared memory.
IBM System/370/XA (1981): multiple CPUs can be connected to shared memory.
IBM System/390 (1990s): similar features to the S/370/XA, with improved performance. Possibility to connect several multiprocessor systems together through fast fibre-optic connections.

CRAY X-MP (mid 1980s): from one to four vector processors connected to shared memory (cycle time: 8.5 ns).
CRAY Y-MP (1988): from one to eight vector processors connected to shared memory; 3 times more powerful than the CRAY X-MP (cycle time: 4 ns).
C90 (early 1990s): further development of the CRAY Y-MP; 16 vector processors.
CRAY 3 (1993): maximum 16 vector processors (cycle time: 2 ns).

Butterfly multiprocessor system, by BBN Advanced Computers (1985/87): maximum 256 Motorola processors, interconnected by a sophisticated dynamic switching network; distributed shared memory organization.
BBN TC2000 (1990): improved version of the Butterfly using Motorola RISC processors.
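The shared-variable communication described on slide 31 can be sketched in C with POSIX threads: one thread updates a shared location and another reads the new value. The variable names and the use of a mutex plus condition variable for ordering are choices made for this illustration.

/* Shared-memory communication: one thread writes a shared variable, another reads it. */
#include <pthread.h>
#include <stdio.h>

static int shared_value;                 /* the shared variable */
static int ready;                        /* flag: has the value been produced? */
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  cond = PTHREAD_COND_INITIALIZER;

static void *producer(void *arg)
{
    (void)arg;
    pthread_mutex_lock(&lock);
    shared_value = 42;                   /* change the value in a shared location */
    ready = 1;
    pthread_cond_signal(&cond);
    pthread_mutex_unlock(&lock);
    return NULL;
}

static void *consumer(void *arg)
{
    (void)arg;
    pthread_mutex_lock(&lock);
    while (!ready)                       /* wait until the producer has written */
        pthread_cond_wait(&cond, &lock);
    printf("read new value: %d\n", shared_value);
    pthread_mutex_unlock(&lock);
    return NULL;
}

int main(void)
{
    pthread_t p, c;
    pthread_create(&c, NULL, consumer, NULL);
    pthread_create(&p, NULL, producer, NULL);
    pthread_join(p, NULL);
    pthread_join(c, NULL);
    return 0;
}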

Datorarkitektur Fö 11/12-33

Multicomputers

MIMD computers with a distributed address space, so that each processor has its own private memory which is not visible to other processors, are called multicomputers.
[Figure: processors 1..n, each with a private memory, connected only by an interconnection network.]

Communication between processors is only by passing messages over the interconnection network.

Datorarkitektur Fö 11/12-34

Multicomputers (cont'd)

From the programmer's point of view this means that no shared variables are available (a variable can be accessed only by one single process). For communication between parallel activities (processes) the programmer uses channels and send/receive operations (see the program in slide 10; an MPI-based C sketch is given after slide 36).

There is no competition among the processors for a shared memory, so the number of processors is not limited by memory contention.

The speed of the interconnection network is an important parameter for the overall performance.

Datorarkitektur Fö 11/12-35

Multicomputers (cont'd)

Intel iPSC/2 (1989): 128 CPUs of type 80386 interconnected by a 7-dimensional hypercube (2^7 = 128).

Intel Paragon (1991): over 2000 processors of type i860 (high performance RISC) interconnected by a two-dimensional mesh network.

KSR-1 by Kendall Square Research (1992): 1088 processors interconnected by a ring network.

nCUBE/2S by nCUBE (1992): 8192 processors interconnected by a 10-dimensional hypercube.

Cray T3E MC512 (1995): 512 CPUs interconnected by a three-dimensional mesh; each CPU is a DEC Alpha RISC processor.

Network of workstations: a group of workstations connected through a Local Area Network (LAN) can be used together as a multicomputer for parallel computation. Performance will usually be lower than with specialized multicomputers, because of the communication speed over the LAN. However, this is a cheap solution.

Datorarkitektur Fö 11/12-36

Multicore Architectures

Multicore chips: several processors on the same chip.

A parallel computer on a chip.

It is the only way to increase chip performance without an excessive increase in power consumption:
- Instead of increasing processor frequency, use several processors and run each at a lower frequency.

Examples:
Intel x86 multicore architectures
- Intel Core Duo
- Intel Core i7
ARM11 MPCore
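As referenced on slide 34, here is a minimal C sketch of message passing between two processes, using MPI as the messaging library; the single integer payload and the rank assignment are choices made for this example.

/* Message passing between two processes: rank 0 sends, rank 1 receives. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, value;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        value = 42;
        /* send(ch, value): an explicit message over the interconnection network */
        MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        /* receive(ch, value): blocks until the message arrives */
        MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("rank 1 received %d\n", value);
    }

    MPI_Finalize();
    return 0;
}

It can be compiled with mpicc and launched with, for example, mpirun -np 2; there are no shared variables, only the two explicit send/receive calls.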

Datorarkitektur Fö 11/12-37

Intel Core Duo

Composed of two Intel Core superscalar processors (see Fö 7/8, slide 40).
[Figure: two CPU cores sharing a 2 MB L2 cache with cache coherence support, connected to off-chip main memory.]

Datorarkitektur Fö 11/12-38

Intel Core i7

Contains four Nehalem (see Fö 7/8, slide 41) processors.
[Figure: four CPU cores, each with its own 256 KB L2 cache, sharing an 8 MB L3 cache with cache coherence support, connected to off-chip main memory.]

Datorarkitektur Fö 11/12-39

ARM11 MPCore
[Figure: four ARM11 cores connected through a cache coherence unit to off-chip main memory.]

Datorarkitektur Fö 11/12-40

Multithreading

A running program:
- one or several processes;
each process:
- one or several threads.

[Figure: three processes, process 1 with threads 1_1, 1_2, 1_3, process 2 with threads 2_1, 2_2, and process 3 with thread 3_1, whose threads are scheduled onto processor 1 and processor 2.]

A thread is a piece of sequential code executed in parallel with other threads.
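A process containing several threads, as on slide 40, can be sketched in C with POSIX threads; querying the core count via sysconf on a Linux-like system and capping the thread count at 64 are assumptions made for this example.

/* One process, several threads: create one worker thread per available core. */
#include <pthread.h>
#include <stdio.h>
#include <unistd.h>

static void *worker(void *arg)
{
    long id = (long)arg;
    printf("thread %ld running\n", id);   /* a piece of sequential code */
    return NULL;
}

int main(void)
{
    long ncores = sysconf(_SC_NPROCESSORS_ONLN);   /* cores visible to the OS */
    if (ncores < 1)  ncores = 1;
    if (ncores > 64) ncores = 64;

    pthread_t tid[64];

    for (long i = 0; i < ncores; i++)
        pthread_create(&tid[i], NULL, worker, (void *)i);
    for (long i = 0; i < ncores; i++)
        pthread_join(tid[i], NULL);

    printf("the process ran %ld threads\n", ncores);
    return 0;
}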

Datorarkitektur Fö 11/12-41

Multithreading (cont'd)

Several threads can be active simultaneously on the same processor.
- Typically, the operating system schedules the threads on the processor.
- The OS switches between threads so that one thread is active (running) on a processor at a time.

Switching between threads implies saving/restoring the program counter, registers, status flags, etc.

The switching overhead is considerable!

Datorarkitektur Fö 11/12-42

Hardware Multithreading

Multithreaded processors provide hardware support for executing multithreaded code:
- separate program counter & register set for individual threads;
- instruction fetching on a thread basis;
- hardware supported context switching.
=> Efficient execution of multithreaded software.
=> Efficient utilisation of processor resources.

By handling several threads:
- There is a greater chance to find instructions to execute in parallel on the available resources.
- When one thread is blocked, due to e.g. a memory access or data dependencies, instructions from another thread can be executed.

Thus, latencies due to memory accesses (at cache misses), data dependencies, etc. can efficiently be hidden!

Multithreading can be implemented on top of both scalar and superscalar processors.

Datorarkitektur Fö 11/12-43

Approaches to Multithreaded Execution

Scalar (non-superscalar) processors

Interleaved multithreading:
- The processor switches from one thread to another at each clock cycle; if a thread is blocked due to a dependency or memory latency, it is skipped and an instruction from a ready thread is executed.

Blocked multithreading:
- Instructions of the same thread are executed until the thread is blocked; blocking of a thread triggers a switch to another thread that is ready to execute.

[Figure: cycle-by-cycle instruction issue on a scalar processor for no multithreading (thread A only, with idle cycles), interleaved multithreading (threads A, B, C, D issuing in successive cycles), and blocked multithreading (a thread runs until it blocks, then the next ready thread takes over).]
(A small C simulation of the interleaved policy is given after slide 44.)

Datorarkitektur Fö 11/12-44

Approaches to Multithreaded Execution (cont'd)

Interleaved and blocked multithreading can be applied to both scalar and superscalar processors. When applied to superscalars, several instructions are issued simultaneously. However, all instructions issued during a cycle have to be from the same thread.

Simultaneous multithreading (SMT):
- Applies only to superscalar processors. Several instructions are fetched during a cycle and they can belong to different threads.
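To make the interleaved policy concrete, here is a toy C simulation; the per-thread instruction counts and the simplistic "blocked until cycle N" stall model are invented for this sketch and do not correspond to any real processor.

/* Toy simulation of interleaved multithreading: issue one instruction per cycle,
 * rotating over threads and skipping any thread that is currently blocked. */
#include <stdio.h>

#define NTHREADS 4

int main(void)
{
    int remaining[NTHREADS]     = { 5, 5, 5, 5 };   /* instructions left per thread */
    int blocked_until[NTHREADS] = { 0, 3, 0, 6 };   /* example stall cycles */
    int next = 0;

    for (int cycle = 0; cycle < 16; cycle++) {
        int issued = -1;
        /* Look for a ready thread, starting after the one that issued last. */
        for (int k = 0; k < NTHREADS; k++) {
            int t = (next + k) % NTHREADS;
            if (remaining[t] > 0 && cycle >= blocked_until[t]) {
                remaining[t]--;
                issued = t;
                next = (t + 1) % NTHREADS;
                break;
            }
        }
        if (issued >= 0)
            printf("cycle %2d: issue from thread %c\n", cycle, 'A' + issued);
        else
            printf("cycle %2d: idle (all threads blocked or done)\n", cycle);
    }
    return 0;
}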

Datorarkitektur Fö 11/12-45

Approaches to Multithreaded Execution (cont'd)

Superscalar processors
[Figure: issue slots per cycle on a superscalar processor for no multithreading, interleaved multithreading, blocked multithreading, and simultaneous multithreading; only with SMT can instructions from different threads share the issue slots of the same cycle.]

Datorarkitektur Fö 11/12-46

Multithreaded Processors

Are multithreaded processors parallel computers?

Yes:
- they execute parallel threads;
- certain sections of the processor are available in several copies (those that store the state of the architecture);
- the processor appears to the operating system as several processors.

No:
- only certain sections of the processor are available in several copies, but we do not have several processors; the execution resources are common.

In fact, a single physical processor appears as multiple logical processors to the operating system.

Datorarkitektur Fö 11/12-47

Multithreaded Processors (cont'd)

IBM Power5, Power6:
- simultaneous multithreading;
- two threads/core;
- both the Power5 and the Power6 are dual core chips.

Intel Montecito (Itanium 2 family):
- blocked multithreading (called by Intel temporal multithreading);
- two threads/core;
- Itanium 2 Montecito processors are dual core.

Intel Pentium 4, Nehalem:
- the Pentium 4 was the first Intel processor to implement multithreading;
- simultaneous multithreading (called by Intel Hyperthreading);
- two threads/core (8 simultaneous threads per quad core - see slide 38).

Datorarkitektur Fö 11/12-48

General Purpose GPUs

The first GPUs (graphic processing units) were non-programmable 3D-graphic accelerators.

Today's GPUs are highly programmable and efficient.

NVIDIA, AMD, etc. have introduced high performance GPUs that can be used for general purpose high performance computing: general purpose graphic processing units (GPGPUs).

GPGPUs are multicore, multithreaded processors which also include SIMD capabilities.

Datorarkitektur Fö 11/12-49

The NVIDIA Tesla GPGPU
[Figure: the host system is connected to 8 texture processor clusters (TPCs); each TPC contains two streaming multiprocessors (SM1, SM2) under a common SM controller (SMC), per-SM caches and MT issue units, streaming processors (SPs), special function units (SFUs), shared memories, and a texture unit.]

The NVIDIA Tesla (GeForce 8800) architecture:
- 8 independent processing units (texture processor clusters - TPCs);
- each TPC consists of 2 streaming multiprocessors (SMs);
- each SM consists of 8 streaming processors (SPs).

Datorarkitektur Fö 11/12-50

The NVIDIA Tesla GPGPU (cont'd)

Each TPC consists of:
- two SMs, controlled by the SM controller (SMC);
- a texture unit (TU): specialised for graphic processing; the TU can serve four threads per cycle.

Datorarkitektur Fö 11/12-51

The NVIDIA Tesla GPGPU (cont'd)

Each streaming multiprocessor (SM) consists of:
- 8 streaming processors (SPs);
- a multithreaded instruction fetch and issue unit (MT issue);
- cache and shared memory;
- two Special Function Units (SFUs); each SFU also includes 4 floating point multipliers.

Each SM is a multithreaded processor; it executes up to 768 concurrent threads.

The SM architecture is based on a combination of SIMD and multithreading - single instruction multiple thread (SIMT):
- a warp consists of 32 parallel threads;
- each SM manages a group of 24 warps at a time (768 threads, in total).

Datorarkitektur Fö 11/12-52

The NVIDIA Tesla GPGPU (cont'd)

At each instruction issue the SM selects a ready warp for execution. Individual threads belonging to the active warp are mapped for execution onto the SPs.

The 32 threads of the same warp execute code starting from the same address; they execute independently and are free to branch individually.

Once a warp has been selected for execution by the SM, all its threads execute the same instruction at a time; some threads can stay idle (due to branching).

The SP (streaming processor) is the primary processing unit of an SM. It performs:
- floating point add, multiply, multiply-add;
- integer arithmetic, comparison, conversion.

Datorarkitektur Fö 11/12-53

The NVIDIA Tesla GPGPU (cont'd)

GPGPUs are optimised for throughput:
- each thread may take a longer time to execute, but there are hundreds of threads active;
- a large amount of activity is carried out in parallel and a large amount of data is produced at the same time.

Common CPU-based parallel computers are primarily optimised for latency:
- each thread runs as fast as possible, but only a limited number of threads are active.

Throughput oriented applications for GPGPUs:
- extensive data parallelism: thousands of computations on independent data elements;
- limited process-level parallelism: large groups of threads run the same program; there are not many different groups that run different programs;
- latency tolerant: the primary goal is the amount of work completed.

Datorarkitektur Fö 11/12-54

Vector Processors

Vector processors include in their instruction set, beside scalar instructions, also instructions operating on vectors.

Array processors (SIMD computers, see slide 29) can operate on vectors by executing simultaneously the same instruction on pairs of vector elements; each instruction is executed by a separate processing element.

Several computer architectures have implemented vector operations using the parallelism provided by pipelined functional units. Such architectures are called vector processors.

Datorarkitektur Fö 11/12-55

Vector Processors (cont'd)

Strictly speaking, vector processors are not parallel processors, although they behave like SIMD computers. There are not several CPUs in a vector processor running in parallel. They are SISD processors which have implemented vector instructions executed on pipelined functional units.

Vector computers usually have vector registers which can each store 64 up to 128 words.

Vector instructions (see slide 58):
- load a vector from memory into a vector register
- store a vector into memory
- arithmetic and logic operations between vectors
- operations between vectors and scalars
- etc.

From the programmer's point of view this means that he is allowed to use operations on vectors in his programs (see the program in slide 8), and the compiler translates these instructions into vector instructions at machine level. (A C sketch of how long arrays are strip-mined onto vector registers is given after slide 56.)

Datorarkitektur Fö 11/12-56

Vector Processors (cont'd)
[Figure: the instruction decoder routes scalar instructions to the scalar unit (scalar registers and scalar functional units) and vector instructions to the vector unit (vector registers and vector functional units); both units are connected to memory.]

Vector computers:
- CDC Cyber
- CRAY
- IBM 3090 (an extension to the IBM System/370)
- NEC SX
- Fujitsu VP
- HITACHI S8000
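As referenced on slide 55, here is a minimal C sketch of strip-mining: an operation on a long array is split into chunks that fit a vector register. The register length of 64 and the element count are example values, and the inner loop stands in for what a vector machine would execute as a single vector instruction.

/* Strip-mining: process a long array in chunks of at most VLEN elements,
 * the way a compiler maps c[0..n-1] = a[0..n-1] + b[0..n-1] onto vector registers. */
#include <stdio.h>

#define VLEN 64          /* assumed vector register length (example value) */
#define N    200

int main(void)
{
    double a[N], b[N], c[N];

    for (int i = 0; i < N; i++) { a[i] = i; b[i] = 2.0 * i; }

    for (int start = 0; start < N; start += VLEN) {
        int len = (N - start < VLEN) ? (N - start) : VLEN;   /* value for the length register */
        /* One "vector add" of length len; on a vector machine this inner loop
         * would be a single instruction executed on pipelined functional units. */
        for (int j = 0; j < len; j++)
            c[start + j] = a[start + j] + b[start + j];
    }

    printf("c[199] = %g\n", c[199]);
    return 0;
}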

Datorarkitektur Fö 11/12-57

The Vector Unit

A vector unit typically consists of:
- pipelined functional units
- vector registers

Vector registers:
- n general purpose vector registers R_i, 0 <= i <= n-1;
- a vector length register VL; it stores the length l (0 <= l <= s) of the currently processed vector(s); s is the length of the vector registers R_i;
- a mask register M; it stores a set of l bits, one for each element in a vector register, interpreted as boolean values; vector instructions can be executed in masked mode, so that vector register elements corresponding to a false value in M are ignored.

Datorarkitektur Fö 11/12-58

Vector Instructions

LOAD-STORE instructions:
R <- A(x1:x2:incr)        load
A(x1:x2:incr) <- R        store
R <- MASKED(A)            masked load
A <- MASKED(R)            masked store
R <- INDIRECT(A(X))       indirect load
A(X) <- INDIRECT(R)       indirect store

Arithmetic-logic instructions:
R <- R' b_op R''
R <- S b_op R'
R <- u_op R'
M <- R rel_op R'
WHERE(M) R <- R' b_op R''

Chaining:
R2 <- R0 + R1
R3 <- R2 * R4
The execution of the vector multiplication does not have to wait until the vector addition has terminated; as elements of the sum are generated by the addition pipeline, they enter the multiplication pipeline. Thus, addition and multiplication are performed (partially) in parallel.

Datorarkitektur Fö 11/12-59

Vector Instructions (cont'd)

In a language with vector computation instructions:

if T[1..50]>0 then
  T[1..50]:=T[1..50]+1;

A compiler for a vector computer generates something like:

R0 <- T(0:49:1)
VL <- 50
M  <- R0 > 0
WHERE(M) R0 <- R0 + 1

(A C emulation of this masked update is given after slide 60.)

Datorarkitektur Fö 11/12-60

Multimedia Extensions to General Purpose Microprocessors

Video and audio applications very often deal with large arrays of small data types (8 or 16 bits).

Such applications exhibit a large potential of SIMD (vector) parallelism.

General purpose microprocessors have been equipped with special instructions to exploit this potential of parallelism.

The specialised multimedia instructions perform vector computations on bytes, half-words, or words.
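The masked vector update on slide 59 can be emulated in plain C; the explicit mask array below stands in for the mask register M, the example data values are invented, and the vector length of 50 follows the example.

/* Emulation of the masked vector operation: where T[i] > 0, add 1 to T[i]. */
#include <stdio.h>

#define VL 50                      /* vector length, as loaded into the length register */

int main(void)
{
    int t[VL], mask[VL];

    for (int i = 0; i < VL; i++)
        t[i] = i - 25;             /* example data: negative and positive values */

    /* M <- R0 > 0 : build the mask from the comparison */
    for (int i = 0; i < VL; i++)
        mask[i] = (t[i] > 0);

    /* WHERE(M) R0 <- R0 + 1 : elements with a false mask bit are ignored */
    for (int i = 0; i < VL; i++)
        if (mask[i])
            t[i] = t[i] + 1;

    printf("t[10] = %d, t[40] = %d\n", t[10], t[40]);  /* -15 unchanged, 15 became 16 */
    return 0;
}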

Datorarkitektur Fö 11/12-61

Multimedia Extensions to General Purpose Microprocessors (cont'd)

Several vendors have extended the instruction sets of their processors in order to improve performance with multimedia applications:

MMX for the Intel x86 family
VIS for UltraSparc
MDMX for MIPS
MAX-2 for Hewlett-Packard PA-RISC
NEON for ARM Cortex-A8, ARM Cortex-A9

The Pentium line provides 57 MMX instructions. They treat data in a SIMD fashion.

Datorarkitektur Fö 11/12-62

Multimedia Extensions to General Purpose Microprocessors (cont'd)

The basic idea: subword execution.

Use the entire width of a processor data path (32 or 64 bits), even when processing the small data types used in signal processing (8, 12, or 16 bits).

With a word size of 64 bits, the adders can be used to implement eight 8-bit additions in parallel.

This is practically a kind of SIMD parallelism, at a reduced scale. (A C sketch of this trick is given after slide 64.)

Datorarkitektur Fö 11/12-63

Multimedia Extensions to General Purpose Microprocessors (cont'd)

Three packed data types are defined for parallel operations: packed byte, packed word, packed double word.
[Figure: a 64-bit quadword interpreted as eight packed bytes (a7..a0), four packed words (q3..q0), two packed double words (q1, q0), or one quadword (q0).]

Datorarkitektur Fö 11/12-64

Multimedia Extensions to General Purpose Microprocessors (cont'd)

Examples of SIMD arithmetic with the MMX instruction set:

ADD R3 <- R1,R2 (packed byte addition): the eight bytes a7..a0 of R1 and b7..b0 of R2 are added pairwise, producing q_i = a_i + b_i for i = 0..7.

MPADD R3 <- R1,R2 (packed multiply-add): the packed elements of R1 and R2 are multiplied pairwise and adjacent products are summed, producing (a0*b0)+(a1*b1), (a2*b2)+(a3*b3), (a4*b4)+(a5*b5), (a6*b6)+(a7*b7).
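The subword-execution idea from slide 62 can be sketched in portable C: one 64-bit addition performs eight independent 8-bit additions, provided carries are prevented from crossing byte boundaries. The masking trick below is a standard software (SWAR) technique used here only for illustration; real MMX or NEON code would use the corresponding packed-add instruction instead, and the operand values are examples.

/* Eight 8-bit additions carried out with one 64-bit add (subword execution).
 * The high bit of each byte is handled separately so no carry leaks into a neighbour. */
#include <stdint.h>
#include <stdio.h>

static uint64_t paddb(uint64_t a, uint64_t b)
{
    const uint64_t H = 0x8080808080808080ULL;   /* high bit of every byte */
    /* Add the low 7 bits of each byte, then restore the high bits with XOR. */
    return ((a & ~H) + (b & ~H)) ^ ((a ^ b) & H);
}

int main(void)
{
    /* Bytes 0x01..0x08 plus bytes 0x10,0x20,...,0x80 (each byte wraps modulo 256). */
    uint64_t a = 0x0807060504030201ULL;
    uint64_t b = 0x8070605040302010ULL;

    printf("packed sum: 0x%016llx\n", (unsigned long long)paddb(a, b));
    /* Expected: 0x8877665544332211, each byte being the sum of the corresponding bytes. */
    return 0;
}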

Datorarkitektur Fö 11/12-65

Multimedia Extensions to General Purpose Microprocessors (cont'd)

How to get the data ready for computation? How to get the results back in the right format?

Packing and unpacking:

PACK.W R3 <- R1,R2: the words a1, a0 of R1 and b1, b0 of R2 are truncated to the narrower element size and packed into R3 as truncated b1, truncated b0, truncated a1, truncated a0.

UNPACK R3 <- R1: the lower elements a1, a0 of R1 (which holds a3, a2, a1, a0) are unpacked and expanded into the wider elements of R3.

Datorarkitektur Fö 11/12-66

Summary

The growing need for high performance cannot always be satisfied by computers running a single CPU.

With parallel computers, several CPUs are running in order to solve a given application.

Parallel programs have to be available in order to use parallel computers.

Computers can be classified based on the nature of the instruction flow executed and that of the data flow on which the instructions operate: SISD, SIMD, and MIMD architectures.

The performance we effectively can get by using a parallel computer depends not only on the number of available processors but is also limited by characteristics of the executed programs.

The efficiency of using a parallel computer is influenced by features of the parallel program, such as: degree of parallelism, intensity of inter-processor communication, etc.

Datorarkitektur Fö 11/12-67

Summary (cont'd)

A key component of a parallel architecture is the interconnection network.

Array processors execute the same operation on a set of interconnected processing units. They are specialized for numerical problems expressed in matrix or vector formats.

Multiprocessors are MIMD computers in which all CPUs have access to a common shared address space. The number of CPUs is limited.

Multicomputers have a distributed address space. Communication between CPUs is only by message passing over the interconnection network. The number of interconnected CPUs can be high.

Datorarkitektur Fö 11/12-68

Summary (cont'd)

Multicore architectures are an alternative not only to increase performance but also to do so without an excessive increase in power consumption.

Multithreaded processors provide hardware support for executing multithreaded code. This leads to better usage of resources and a higher potential to hide latencies.

General purpose GPUs are multicore, multithreaded processors with SIMD-style capabilities. They are optimised for throughput.

Vector processors are SISD processors which include in their instruction set instructions operating on vectors. They are implemented using pipelined functional units.

Multimedia applications exhibit a large potential of SIMD parallelism. The instruction sets of modern general purpose microprocessors (Pentium, UltraSparc) have been extended to support SIMD-style parallelism with operations on short vectors.


More information

Part VII Advanced Architectures. Feb Computer Architecture, Advanced Architectures Slide 1

Part VII Advanced Architectures. Feb Computer Architecture, Advanced Architectures Slide 1 Part VII Advanced Architectures Feb. 2011 Computer Architecture, Advanced Architectures Slide 1 About This Presentation This presentation is intended to support the use of the textbook Computer Architecture:

More information

COMP4300/8300: Overview of Parallel Hardware. Alistair Rendell

COMP4300/8300: Overview of Parallel Hardware. Alistair Rendell COMP4300/8300: Overview of Parallel Hardware Alistair Rendell COMP4300/8300 Lecture 2-1 Copyright c 2015 The Australian National University 2.2 The Performs: Floating point operations (FLOPS) - add, mult,

More information

THREAD LEVEL PARALLELISM

THREAD LEVEL PARALLELISM THREAD LEVEL PARALLELISM Mahdi Nazm Bojnordi Assistant Professor School of Computing University of Utah CS/ECE 6810: Computer Architecture Overview Announcement Homework 4 is due on Dec. 11 th This lecture

More information

PARALLEL COMPUTER ARCHITECTURES

PARALLEL COMPUTER ARCHITECTURES 8 ARALLEL COMUTER ARCHITECTURES 1 CU Shared memory (a) (b) Figure 8-1. (a) A multiprocessor with 16 CUs sharing a common memory. (b) An image partitioned into 16 sections, each being analyzed by a different

More information

Architectures of Flynn s taxonomy -- A Comparison of Methods

Architectures of Flynn s taxonomy -- A Comparison of Methods Architectures of Flynn s taxonomy -- A Comparison of Methods Neha K. Shinde Student, Department of Electronic Engineering, J D College of Engineering and Management, RTM Nagpur University, Maharashtra,

More information

Advanced Parallel Architecture. Annalisa Massini /2017

Advanced Parallel Architecture. Annalisa Massini /2017 Advanced Parallel Architecture Annalisa Massini - 2016/2017 References Advanced Computer Architecture and Parallel Processing H. El-Rewini, M. Abd-El-Barr, John Wiley and Sons, 2005 Parallel computing

More information

Chapter 6. Parallel Processors from Client to Cloud. Copyright 2014 Elsevier Inc. All rights reserved.

Chapter 6. Parallel Processors from Client to Cloud. Copyright 2014 Elsevier Inc. All rights reserved. Chapter 6 Parallel Processors from Client to Cloud FIGURE 6.1 Hardware/software categorization and examples of application perspective on concurrency versus hardware perspective on parallelism. 2 FIGURE

More information

1. Microprocessor Architectures. 1.1 Intel 1.2 Motorola

1. Microprocessor Architectures. 1.1 Intel 1.2 Motorola 1. Microprocessor Architectures 1.1 Intel 1.2 Motorola 1.1 Intel The Early Intel Microprocessors The first microprocessor to appear in the market was the Intel 4004, a 4-bit data bus device. This device

More information

Parallel Computing Platforms

Parallel Computing Platforms Parallel Computing Platforms Jinkyu Jeong (jinkyu@skku.edu) Computer Systems Laboratory Sungkyunkwan University http://csl.skku.edu SSE3054: Multicore Systems, Spring 2017, Jinkyu Jeong (jinkyu@skku.edu)

More information

CS4961 Parallel Programming. Lecture 3: Introduction to Parallel Architectures 8/30/11. Administrative UPDATE. Mary Hall August 30, 2011

CS4961 Parallel Programming. Lecture 3: Introduction to Parallel Architectures 8/30/11. Administrative UPDATE. Mary Hall August 30, 2011 CS4961 Parallel Programming Lecture 3: Introduction to Parallel Architectures Administrative UPDATE Nikhil office hours: - Monday, 2-3 PM, MEB 3115 Desk #12 - Lab hours on Tuesday afternoons during programming

More information

Chapter 5:: Target Machine Architecture (cont.)

Chapter 5:: Target Machine Architecture (cont.) Chapter 5:: Target Machine Architecture (cont.) Programming Language Pragmatics Michael L. Scott Review Describe the heap for dynamic memory allocation? What is scope and with most languages how what happens

More information

COSC 6385 Computer Architecture - Multi Processor Systems

COSC 6385 Computer Architecture - Multi Processor Systems COSC 6385 Computer Architecture - Multi Processor Systems Fall 2006 Classification of Parallel Architectures Flynn s Taxonomy SISD: Single instruction single data Classical von Neumann architecture SIMD:

More information

Number of processing elements (PEs). Computing power of each element. Amount of physical memory used. Data access, Communication and Synchronization

Number of processing elements (PEs). Computing power of each element. Amount of physical memory used. Data access, Communication and Synchronization Parallel Computer Architecture A parallel computer is a collection of processing elements that cooperate to solve large problems fast Broad issues involved: Resource Allocation: Number of processing elements

More information

Control Hazards. Branch Prediction

Control Hazards. Branch Prediction Control Hazards The nub of the problem: In what pipeline stage does the processor fetch the next instruction? If that instruction is a conditional branch, when does the processor know whether the conditional

More information

Memory Hierarchy Computing Systems & Performance MSc Informatics Eng. Memory Hierarchy (most slides are borrowed)

Memory Hierarchy Computing Systems & Performance MSc Informatics Eng. Memory Hierarchy (most slides are borrowed) Computing Systems & Performance Memory Hierarchy MSc Informatics Eng. 2011/12 A.J.Proença Memory Hierarchy (most slides are borrowed) AJProença, Computer Systems & Performance, MEI, UMinho, 2011/12 1 2

More information

Fundamentals of Quantitative Design and Analysis

Fundamentals of Quantitative Design and Analysis Fundamentals of Quantitative Design and Analysis Dr. Jiang Li Adapted from the slides provided by the authors Computer Technology Performance improvements: Improvements in semiconductor technology Feature

More information

4. Networks. in parallel computers. Advances in Computer Architecture

4. Networks. in parallel computers. Advances in Computer Architecture 4. Networks in parallel computers Advances in Computer Architecture System architectures for parallel computers Control organization Single Instruction stream Multiple Data stream (SIMD) All processors

More information

CS Parallel Algorithms in Scientific Computing

CS Parallel Algorithms in Scientific Computing CS 775 - arallel Algorithms in Scientific Computing arallel Architectures January 2, 2004 Lecture 2 References arallel Computer Architecture: A Hardware / Software Approach Culler, Singh, Gupta, Morgan

More information

CS146 Computer Architecture. Fall Midterm Exam

CS146 Computer Architecture. Fall Midterm Exam CS146 Computer Architecture Fall 2002 Midterm Exam This exam is worth a total of 100 points. Note the point breakdown below and budget your time wisely. To maximize partial credit, show your work and state

More information

CS/COE1541: Intro. to Computer Architecture

CS/COE1541: Intro. to Computer Architecture CS/COE1541: Intro. to Computer Architecture Multiprocessors Sangyeun Cho Computer Science Department Tilera TILE64 IBM BlueGene/L nvidia GPGPU Intel Core 2 Duo 2 Why multiprocessors? For improved latency

More information