Normal computer: 1 CPU & 1 memory. The problem of the Von Neumann bottleneck: slow processing because the CPU is faster than memory


Cornelius Hines
6 years ago

1 Parallel Machine

2 CPU Usage. A normal computer has 1 CPU and 1 memory. Problem (the Von Neumann bottleneck): processing is slow because the CPU is faster than memory. Solution: use multiple CPUs or multiple ALUs for simultaneous processing. Such machines are known as parallel computers or multiprocessor computers. To improve the efficiency of a computer: increase the speed of the processor and improve memory access.

3 Type of Processing: Flynn Taxonomy
Single Instruction: SISD, SIMD
Multiple Instruction: MISD, MIMD

4 Type of Processing. SISD (Single Instruction, Single Data): a single processor executes a single instruction stream to operate on data stored in a single memory. Example: the Von Neumann machine. SIMD (Single Instruction, Multiple Data): a single machine instruction controls the simultaneous execution of a number of processing elements on a lockstep basis. Each processing element has an associated data memory, so each instruction is executed on a different set of data. Types: vector, parallel.

5 Type of Processing. MISD (Multiple Instruction, Single Data): a sequence of data is transmitted to a set of processors, each of which executes a different instruction sequence. This structure is not commercially implemented (extraordinary). MIMD (Multiple Instruction, Multiple Data): a set of processors simultaneously execute different instruction sequences on different data sets. Examples: SMP (Symmetric Multiprocessor) and NUMA (Non-Uniform Memory Access). Organizations: shared memory (switch, bus); distributed / local memory (switch, bus).

6 MIMD Distributed Memory. Uses many connected CPUs. Each CPU controls the execution of its own operations separately, so the machine can perform various tasks simultaneously. Two techniques of connection between CPU and memory: direct connection and net/grid connection. Problem with the grid: the path from one corner to the opposite corner is far. Solution: use the hypercube (n-cube).
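
The corner-to-corner problem of the grid can be made concrete with a small calculation. The sketch below is illustrative only: it assumes a square grid without wraparound links and a hypercube whose node count is a power of two, and the function names are made up for this example.

```python
import math

def mesh_diameter(n_nodes: int) -> int:
    """Hops between opposite corners of a square grid with no wraparound links."""
    side = math.isqrt(n_nodes)
    if side * side != n_nodes:
        raise ValueError("n_nodes must be a perfect square")
    return 2 * (side - 1)

def hypercube_diameter(n_nodes: int) -> int:
    """Hops between the two most distant nodes of a hypercube: log2 of the node count."""
    dim = n_nodes.bit_length() - 1
    if n_nodes < 1 or 1 << dim != n_nodes:
        raise ValueError("n_nodes must be a power of two")
    return dim

for n in (16, 64, 256):
    print(n, "nodes: grid", mesh_diameter(n), "hops, hypercube", hypercube_diameter(n), "hops")
```

With 256 nodes the grid needs 30 hops corner to corner while the hypercube needs 8, which is why the hypercube is offered as the solution.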

7 Direct Connection

8 Net Connection

9 Hypercube Connection. Route from 100 to 111: 100 XOR 111 = 011. Two bits differ, so the route takes two hops, and the possible routes are through 110 and 101.
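
The XOR rule for hypercube routing can be sketched as follows. This is a minimal illustration, not taken from the slides: the function name and the recursive enumeration of routes are assumptions made for the example. Each set bit in `src XOR dst` marks one dimension that must be crossed, and crossing them in different orders gives the alternative routes.

```python
def hypercube_routes(src: int, dst: int, dim: int) -> list[list[str]]:
    """Return every shortest route from src to dst as lists of binary node labels."""
    label = format(src, f"0{dim}b")
    diff = src ^ dst  # set bits mark the dimensions in which src and dst differ
    if diff == 0:
        return [[label]]
    routes = []
    for bit in range(dim):
        if diff & (1 << bit):
            next_node = src ^ (1 << bit)  # cross one differing dimension
            for tail in hypercube_routes(next_node, dst, dim):
                routes.append([label] + tail)
    return routes

print(format(0b100 ^ 0b111, "03b"))  # 011
for route in hypercube_routes(0b100, 0b111, 3):
    print(" -> ".join(route))
# 100 -> 101 -> 111
# 100 -> 110 -> 111
```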

10 MIMD Shared Memory: Bus. Using a bus is simple and easy. [Figure: CPU1, CPU2 and CPU3, each with its own cache memory, connected by a bus to a shared memory.]

11 MIMD Shared Memory: Bus. Problem with using a bus: the Von Neumann bottleneck. Solution: use cache memory in each CPU. Problem: cache coherence. Two processors read the same data; when one of them changes the data, the other processor assumes its copy is still the original and does not know about the change. Solution: software or hardware.

12 MIMD Shared Memory: Bus. Solution in software: classify the data as shared (read-only or read-write) or unshared. The problem arises with shared read-write data. Solution: do not allow such data to be cached.

13 MIMD Shared Memory: Bus. Solution in hardware: use a cache memory controller and a cache memory resolution (coherence) protocol. The block containing the required word is loaded into the cache memory.
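
The coherence problem on a shared bus, and one way hardware can resolve it, can be sketched with a toy simulation. The slides do not name a specific protocol, so the write-invalidate behaviour below is an assumption chosen for illustration, as are the class names and the write-through policy.

```python
class Bus:
    """Shared memory plus the list of CPUs attached to the bus."""
    def __init__(self):
        self.memory = {}
        self.cpus = []

class CPU:
    def __init__(self, bus, invalidate_on_write):
        self.bus = bus
        self.cache = {}  # private cache: address -> value
        self.invalidate_on_write = invalidate_on_write
        bus.cpus.append(self)

    def read(self, addr):
        if addr not in self.cache:  # miss: fetch from shared memory
            self.cache[addr] = self.bus.memory[addr]
        return self.cache[addr]

    def write(self, addr, value):
        self.cache[addr] = value
        self.bus.memory[addr] = value  # write through to shared memory
        if self.invalidate_on_write:   # assumed write-invalidate scheme
            for other in self.bus.cpus:
                if other is not self:
                    other.cache.pop(addr, None)  # drop the stale copy

def demo(invalidate_on_write):
    bus = Bus()
    bus.memory[0x10] = 1
    cpu1 = CPU(bus, invalidate_on_write)
    cpu2 = CPU(bus, invalidate_on_write)
    cpu1.read(0x10)
    cpu2.read(0x10)         # both CPUs now cache the value 1
    cpu1.write(0x10, 2)     # CPU1 changes the data
    return cpu2.read(0x10)  # what does CPU2 see?

print(demo(False))  # 1: CPU2 still sees the stale value
print(demo(True))   # 2: the stale copy was invalidated, so CPU2 re-reads memory
```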

14 MIMD Shared Memory: Switch. A crossbar switch connects n CPUs with k memories. Advantage: the network is non-blocking. Disadvantage: it uses a lot of crosspoints (the number grows as n^2).

15 Omega Network. Has log2(n) stages (levels) with n/2 switches in each stage. Example: an omega network for 8 CPUs x 8 memories. Stages: log2(8) = 3. Switches per stage: 8/2 = 4. Total number of switches = 3 * 4 = 12. Crossbar switch: 8 CPUs x 8 memories = 64 switches. Advantage: fewer crosspoints. Disadvantage: the network is blocking.
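
The switch counts for the crossbar and the omega network can be checked with a short calculation. This is a sketch using the formulas stated on the slides; the function names are made up, and n is assumed to be a power of two.

```python
def crossbar_crosspoints(n_cpus: int, n_memories: int) -> int:
    """A crossbar needs one crosspoint per (CPU, memory) pair."""
    return n_cpus * n_memories

def omega_switch_count(n: int) -> tuple[int, int, int]:
    """Return (stages, switches per stage, total switches) for an n x n omega network."""
    stages = n.bit_length() - 1  # log2(n) when n is a power of two
    if n < 2 or 1 << stages != n:
        raise ValueError("n must be a power of two and at least 2")
    per_stage = n // 2
    return stages, per_stage, stages * per_stage

print(crossbar_crosspoints(8, 8))  # 64
print(omega_switch_count(8))       # (3, 4, 12)
```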

16 Omega Network. [Figure: three stages of four switches each, labeled 1A-1D, 2A-2D and 3A-3D.]

17 Benes Network. Resolves the blocking problem of the omega network. Uses more switches and more stages. Provides more route options from CPU to memory.

18 SIMD Parallel Computer. Executes the same program on multiple sets of data simultaneously. Simpler, cheaper and very fast. Example: the Connection Machine.

19 Connection Machine. Consists of 4 quadrants, which can be operated separately. 1 quadrant = 2 parts of 8K PEs (8192 processors). Each quadrant has: an ALU, 8 Kb of memory, 4-bit flags, an interface with the memory and I/O system, and 1 router.

20 Connection Machine. The compiler is written in C or LISP. Each 8K-PE section of a quadrant is divided into 2 sub-cubes of 4K PEs (256 processor chips). Each 4K-PE sub-cube has its own I/O system. I/O bus width = 64 bits. Has 39 I/O disk drives, with 1 disk per bit.

21 SIMD Vector Computer. The Connection Machine is only suitable for solving artificial intelligence problems. For floating-point arithmetic that involves vectors, such as graphics processing, the Connection Machine is not suitable. Example of a SIMD vector computer: the CRAY-1 supercomputer.

22 CRAY-1. Consists of: multiple ALUs that can operate simultaneously; 2 addressing units to compute addresses; 4 integer scalar units for arithmetic operations; 6 integer vector units for vector operations.

23 Cache Memory: Characteristics of Memory Systems
Location: refers to whether memory is internal or external to the computer. Example: main memory, cache (internal) and optical disk, magnetic disk (external).
Capacity: number of words or number of bytes.
Unit of transfer: word or block.
Access method: sequential, direct, random, associative.
Performance: access time, cycle time and transfer time.
Physical type: semiconductor, magnetic, optical.
Physical characteristic: volatile or erasable.
Organization: memory modules.

24 Cache Memory Principles. Cache memory is intended to give memory speed approaching that of the fastest memory available, while at the same time providing a large memory size at the price of less expensive types of semiconductor memory. The cache contains a copy of portions of main memory. When the processor attempts to read a word of memory, a check is made to determine if the word is in the cache. If so, the word is delivered to the processor. If not, a block of main memory is read into the cache and the word is delivered to the processor. Because of the phenomenon of locality of reference, it is likely that there will be future references to that same memory location or to other words in the block.
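
The read procedure described above can be sketched in a few lines. This is a simplified illustration under stated assumptions: the block size of 4 words and the memory contents are made-up values, the cache is modelled as a dictionary with unlimited capacity, and no mapping or replacement policy is applied.

```python
BLOCK_SIZE = 4  # words per block (assumed value for the example)

def read_word(address, cache, main_memory):
    """Return (word, hit). On a miss, load the whole block into the cache first."""
    block = address // BLOCK_SIZE
    offset = address % BLOCK_SIZE
    hit = block in cache
    if not hit:
        start = block * BLOCK_SIZE
        cache[block] = main_memory[start:start + BLOCK_SIZE]
    return cache[block][offset], hit

main_memory = list(range(100, 164))  # 64 words of made-up data
cache = {}                           # block number -> list of words

print(read_word(5, cache, main_memory))  # (105, False): miss, block 1 is loaded
print(read_word(6, cache, main_memory))  # (106, True): neighbouring word, same block
print(read_word(5, cache, main_memory))  # (105, True): same location again
```

The second and third reads hit because of locality of reference: the first miss brought in the whole block, not just the requested word.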

25 Cache/Main Memory Structure

26 Cache/Main Memory Principles. Main memory consists of up to 2^n addressable words, with each word having a unique n-bit address. For mapping purposes, this memory is considered to consist of a number of fixed-length blocks of K words each. That is, there are M = 2^n / K blocks in main memory. The cache consists of m blocks, called lines. Each line contains K words, plus a tag of a few bits. Each line also includes control bits.

27 Cache Read Operation

28 Cache Mapping Function. An algorithm is needed for mapping main memory blocks to cache lines, because there are fewer cache lines than main memory blocks. The choice of the mapping function dictates how the cache is organized. Three techniques can be used: direct, associative and set associative.

29 Example. A line is an adjacent series of bytes in main memory (that is, their addresses are contiguous). Suppose a line is 16 bytes in size. For example, suppose we have a 2^12 = 4K-byte cache with 2^8 = 256 16-byte lines; a 2^24 = 16M-byte main memory, which is 2^12 = 4K times the size of the cache; and a 400-line program, which will not all fit into the cache at once.

30 Direct Mapping. Under this mapping scheme, each memory line j maps to cache line j mod 128, so the memory address looks like this: Tag (5 bits) | Line (7 bits) | Word (4 bits). Here: the "Word" field selects one from among the 16 addressable words in a line. The "Line" field defines the cache line where this memory line should reside. The "Tag" field of the address is then compared with that cache line's 5-bit tag to determine whether there is a hit or a miss. If there's a miss, we need to swap out the memory line that occupies that position in the cache and replace it with the desired memory line.

31 Direct Mapping. E.g., suppose that we want to read or write a word at the address 357A (hexadecimal), whose 16 bits are 0011 0101 0111 1010. This translates to Tag = 6, Line = 87, and Word = 10 (all in decimal). If line 87 in the cache has the same tag (6), then memory address 357A is in the cache. Otherwise, a miss has occurred and the contents of cache line 87 must be replaced by memory line 855 before the read or write is executed. Direct mapping is the most efficient cache mapping scheme, because only one tag has to be compared, but it is also the least effective in its utilization of the cache; that is, it may leave some cache lines unused.
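
The address split for direct mapping can be reproduced with shifts and masks. This is a minimal sketch using the field widths from the example (5-bit tag, 7-bit line, 4-bit word); the function name is made up.

```python
TAG_BITS, LINE_BITS, WORD_BITS = 5, 7, 4  # field widths of the 16-bit address

def split_direct(address: int) -> tuple[int, int, int]:
    """Split a 16-bit address into (tag, line, word) for direct mapping."""
    word = address & ((1 << WORD_BITS) - 1)
    line = (address >> WORD_BITS) & ((1 << LINE_BITS) - 1)
    tag = address >> (WORD_BITS + LINE_BITS)
    return tag, line, word

address = 0x357A
print(format(address, "016b"))       # 0011010101111010
print(split_direct(address))         # (6, 87, 10)
memory_line = address >> WORD_BITS   # the memory line that contains this address
print(memory_line, memory_line % 128)  # 855 87
```

The last line confirms the rule "memory line j maps to cache line j mod 128": 855 mod 128 = 87.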

32 Associative Mapping. This mapping scheme attempts to improve cache utilization, but at the expense of speed. Here, the cache line tags are 12 bits, rather than 5, and any memory line can be stored in any cache line. The memory address looks like this: Tag (12 bits) | Word (4 bits). Here: the "Tag" field identifies one of the 2^12 = 4096 memory lines; all the cache tags are searched to find out whether or not the Tag field matches one of the cache tags. If so, we have a hit; if not, there's a miss and we need to replace one of the cache lines by this line before reading or writing into the cache. The "Word" field again selects one from among 16 addressable words (bytes) within the line.

33 Associative Mapping. For example, suppose again that we want to read or write a word at the address 357A, whose 16 bits are 0011 0101 0111 1010. Under associative mapping, this translates to Tag = 855 and Word = 10 (in decimal). So we search all of the 128 cache tags to see if any one of them matches 855. If not, there's a miss and we need to replace one of the cache lines with line 855 from memory before completing the read or write. The search of all 128 tags in the cache is time-consuming. However, the cache is fully utilized, since none of its lines will be unused prior to a miss (recall that direct mapping may detect a miss even though the cache is not completely full of active lines).
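
The associative lookup can be sketched as a search over every cache tag. The code below is an illustration only: the sequential loop stands in for the search over all 128 tags, the function names are made up, and the cache contents are invented for the example.

```python
WORD_BITS = 4  # 16 addressable words per line

def split_associative(address: int) -> tuple[int, int]:
    """Split a 16-bit address into (tag, word); the tag is the whole memory line number."""
    return address >> WORD_BITS, address & ((1 << WORD_BITS) - 1)

def find_line(cache_tags: list, address: int):
    """Search all cache tags; return the matching cache line index, or None on a miss."""
    tag, _ = split_associative(address)
    for index, stored_tag in enumerate(cache_tags):
        if stored_tag == tag:
            return index
    return None

cache_tags = [None] * 128  # 128 cache lines, all initially empty
print(split_associative(0x357A))      # (855, 10)
print(find_line(cache_tags, 0x357A))  # None: miss
cache_tags[42] = 855                  # memory line 855 may be placed in any cache line
print(find_line(cache_tags, 0x357A))  # 42: hit
```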

34 Set Associative Mapping. This scheme is a compromise between the direct and associative schemes described above. Here, the cache is divided into sets of tags, and the set number is directly mapped from the memory address (e.g., memory line j is mapped to cache set j mod 64).

35 Set Associative Mapping. The memory address is now partitioned like this: Tag (6 bits) | Set (6 bits) | Word (4 bits). Here: the "Tag" field identifies one of the 2^6 = 64 different memory lines in each of the 2^6 = 64 different "Set" values. Since each cache set has room for only two lines at a time, the search for a match is limited to those two lines (rather than the entire cache). If there's a match, we have a hit and the read or write can proceed immediately. Otherwise, there's a miss and we need to replace one of the two cache lines by this line before reading or writing into the cache. The "Word" field again selects one from among 16 addressable words inside the line.

36 Set Associative Mapping. In set-associative mapping, when the number of lines per set is n, the mapping is called n-way associative. For instance, the above example is 2-way associative. Example: again, suppose that we want to read or write a word at the memory address 357A, whose 16 bits are 0011 0101 0111 1010. Under set-associative mapping, this translates to Tag = 13, Set = 23, and Word = 10 (all in decimal). So we search only the two tags in cache set 23 to see if either one matches tag 13. If so, we have a hit. Otherwise, one of these two must be replaced by the memory line being addressed (good old line 855) before the read or write can be executed.
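
The 2-way set-associative split and lookup can be sketched the same way. The field widths (6-bit tag, 6-bit set, 4-bit word) come from the example; the function names and the tags stored in the cache are assumptions made for illustration.

```python
TAG_BITS, SET_BITS, WORD_BITS = 6, 6, 4  # field widths of the 16-bit address

def split_set_associative(address: int) -> tuple[int, int, int]:
    """Split a 16-bit address into (tag, set, word)."""
    word = address & ((1 << WORD_BITS) - 1)
    set_index = (address >> WORD_BITS) & ((1 << SET_BITS) - 1)
    tag = address >> (WORD_BITS + SET_BITS)
    return tag, set_index, word

def is_hit(sets: list, address: int) -> bool:
    """Only the (at most two) tags stored in the selected set are compared."""
    tag, set_index, _ = split_set_associative(address)
    return tag in sets[set_index]

sets = [[] for _ in range(64)]  # 64 sets, each holding up to 2 tags
print(split_set_associative(0x357A))  # (13, 23, 10)
print(is_hit(sets, 0x357A))           # False: set 23 is empty
sets[23] = [40, 13]                   # made-up contents: tags 40 and 13 occupy set 23
print(is_hit(sets, 0x357A))           # True
```

Memory line 855 lands in set 855 mod 64 = 23 with tag 855 // 64 = 13, matching the figures in the example.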
