
University of California, Berkeley
College of Engineering
Computer Science Division, EECS
Spring 1998
D.A. Patterson

Quiz 2 Solutions (April 22, 1998)
CS252 Graduate Computer Architecture

Question 1: Bigger, Better, Faster?

A computer system has the following characteristics:

- Uses 10 GB disks that rotate at 10000 RPM, have a data transfer rate of 10 MByte/s (for each disk), and have an 8 ms seek time
- Has an average I/O size of 32 KByte
- Is limited only by the disks
- Has a total of 20 disks

Each disk can handle only one request at a time, but each disk in the system can be handling a different request. The data is not striped (all I/O for each request has to go to one disk).

a) What is the average service time for a request?

service time = seek time + rotational latency + transfer time
seek time = 8 ms
rotational latency = 1/2 rotation × (1 min / 10000 rotations) × (60 sec / 1 min) = 3 ms
transfer time = 32 KBytes / (10 × 2^20 Bytes/sec) = 3.125 ms
service time = 8 ms + 3 ms + 3.125 ms = 14.125 ms

b) Given the average I/O size from above and a random distribution of disk locations, what is the maximum number of I/Os per second (IOPS) for the system?

IOPS = 1 / service time = 1 / .014125 sec = 71

So, a single disk can support 71 IOPS. Therefore, the overall IOPS = 20 × 71 = 1420 IOPS.

Someone suggests improving the system by using new, better disks. For the same total price as the original disks, you can get disks that have 9 GB each, rotate at 12000 RPM, transfer at 12 MB/s, and have a 6 ms seek time.

c) What would be the average service time for a request in the new system?

service time = seek time + rotational latency + transfer time
seek time = 6 ms
rotational latency = 1/2 rotation × (1 min / 12000 rotations) × (60 sec / 1 min) = 2.5 ms
transfer time = 32 KBytes / (12 × 2^20 Bytes/sec) = 2.60 ms
service time = 6 ms + 2.5 ms + 2.60 ms = 11.10 ms
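The arithmetic above is easy to script. Here is a minimal Python sketch (not part of the original solutions; the function and variable names are illustrative) that reproduces parts a) through d) under the quiz's service-time model:

    # Per-request service time (ms) for one disk, per the quiz's model:
    # service time = seek + half a rotation + transfer, one request per disk.
    def disk_service_time(seek_ms, rpm, xfer_mb_s, io_kbytes):
        rotational_ms = 0.5 * 60_000.0 / rpm                        # half a rotation
        transfer_ms = io_kbytes * 2**10 / (xfer_mb_s * 2**20) * 1000.0
        return seek_ms + rotational_ms + transfer_ms

    old = disk_service_time(seek_ms=8, rpm=10_000, xfer_mb_s=10, io_kbytes=32)
    new = disk_service_time(seek_ms=6, rpm=12_000, xfer_mb_s=12, io_kbytes=32)
    print(old, 20 * 1000.0 / old)   # 14.125 ms; ~1416 IOPS (1420 with per-disk rounding)
    print(new, 11 * 1000.0 / new)   # ~11.10 ms; ~991 IOPS (990 with per-disk rounding)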

Question 1 (continued)

d) What is the maximum number of IOPS in the new system?

IOPS = 1 / service time = 1 / .01110 sec = 90

So, a single disk can support 90 IOPS. Therefore, the overall IOPS = 11 × 90 = 990 IOPS.

e) Treat the entire system as an M/M/m queue (that is, a system with m servers rather than one), where each disk is a server. All requests are in a single queue. Requests may not overlap. Assume both systems receive an average of 950 I/O requests per second. Assume that any disk can service any request. What is the mean response time of the old system? The new one? You might find the following equations for an M/M/m queue useful:

Server utilization = Arrival rate × Time_server / m
Time_system = Time_server × (1 + Server utilization / (m × (1 − Server utilization)))

Old system:

utilization = 950 × .014125 / 20 ≈ .6709
T_s = .014125 × (1 + .6709 / (20 × (1 − 0.6709))) = 15.56 ms

New system:

utilization = 950 × .01110 / 11 ≈ .9586
T_s = .01110 × (1 + .9586 / (11 × (1 − 0.9586))) = 34.47 ms

f) Which system has a lower average response time? Why?

The system with 20 disks has a lower average response time. Even though each disk has worse performance, the larger number of disks means that the old system is capable of more IOPS, and hence has a lower utilization. Thus, the waiting time is much lower than on the new system.
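The two formulas translate directly into code; a small sketch (again not from the original solutions; names illustrative) evaluating both configurations:

    # Mean response time of an M/M/m queue, per the formulas above.
    def mm_m_response_time(arrival_rate, service_time_s, m):
        utilization = arrival_rate * service_time_s / m
        return service_time_s * (1 + utilization / (m * (1 - utilization)))

    print(mm_m_response_time(950, 0.014125, 20))  # ~0.01556 s: old system, 15.56 ms
    print(mm_m_response_time(950, 0.01110, 11))   # ~0.03449 s: new system, 34.47 ms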

Question 2: A MESI Situation

Figure 1 below shows the three-phase write-back cache coherence protocol from the book.

[Figure 1: Three-Phase Protocol. A state diagram over the states Invalid, Shared (read only), and Exclusive (read/write); the transitions are labeled with the codes defined below.]

The following terminology is used (a label denotes either a CPU stimulus causing a transition, an operation on the bus causing a transition, or a CPU action on the bus):

Label   Stimulus or action
CRH     CPU read hit
CRM     CPU read miss
CWH     CPU write hit
CWM     CPU write miss
BRM     read miss for this block
BWM     write miss for this block
PRM     place CPU read miss on bus
PWM     place CPU write miss on bus
WB      write back cache block

Question 2 (continued)

Figure 2 below shows a write-back MESI (Modified, Exclusive, Shared, Invalid) protocol. Assume that the processor is able to detect whether a read miss is a shared read miss or an exclusive read miss.

[Figure 2: MESI Protocol. A state diagram over the states Invalid, Read Only (Shared), Read Only (unshared, or clean Exclusive), and Read/Write (dirty exclusive, or Modified); the transitions are labeled with the codes defined below.]

The following terminology is used (a label denotes either a CPU stimulus causing a transition, an operation on the bus causing a transition, or a CPU action on the bus):

Label   Stimulus or Action
CRH     CPU read hit
CRMs    CPU read miss (shared)
CRMx    CPU read miss (exclusive)
CWH     CPU write hit
CWM     CPU write miss
BRM     read miss for this block
BWM     write miss for this block
PRM     place CPU read miss on bus
PWM     place CPU write miss on bus
WB      write back cache block
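The state diagrams themselves did not survive transcription, but the per-block transitions they encode are the standard MESI ones and can be written down as a table. Below is a minimal Python sketch (a reconstruction under that assumption, not the original figure), using the quiz's labels; it is consistent with the trace in the next part:

    # Per-block MESI transitions: (state, stimulus) -> (next state, bus actions).
    # CPU stimuli: CRH, CRMs, CRMx, CWH, CWM; snooped bus operations: BRM, BWM.
    MESI = {
        ("Inv",  "CRMs"): ("Shar", ["PRM"]),   # read miss, another cache has a copy
        ("Inv",  "CRMx"): ("Excl", ["PRM"]),   # read miss, no other copy
        ("Inv",  "CWM"):  ("Mod",  ["PWM"]),
        ("Shar", "CRH"):  ("Shar", []),
        ("Shar", "CWH"):  ("Mod",  ["PWM"]),   # must invalidate the other copies
        ("Shar", "BWM"):  ("Inv",  []),
        ("Excl", "CRH"):  ("Excl", []),
        ("Excl", "CWH"):  ("Mod",  []),        # silent upgrade: no bus traffic
        ("Excl", "BRM"):  ("Shar", []),
        ("Excl", "BWM"):  ("Inv",  []),
        ("Mod",  "CRH"):  ("Mod",  []),
        ("Mod",  "CWH"):  ("Mod",  []),
        ("Mod",  "BRM"):  ("Shar", ["WB"]),    # supply the dirty block
        ("Mod",  "BWM"):  ("Inv",  ["WB"]),
    }

    def step(state, stimulus):
        """Apply one stimulus to a block; return (next state, bus actions)."""
        return MESI[(state, stimulus)]

The clean Exclusive state is the whole point of MESI: a write hit to an unshared, clean block upgrades to Modified with no bus transaction, which is exactly the "(none)" entry in the trace below.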

Question 2 (continued)

Here is a sequence of memory accesses. Assume only 2 processors, with the value 5 stored in address A. All cache locations start out in the invalid state.

- P1 reads A
- P1 writes 10 to A
- P2 reads A
- P2 writes 15 to A

Below are the actions that occur for the above sequence on a group of machines using the three-phase protocol. Mark in the table any of the actions that change when the machines use the MESI protocol. Only show the items that change. Extra blank lines have been provided for you to show your changes. There may be more blank lines than you need. "Read" for a bus action means that a processor is reading the value that is on the bus. In the table below, a bus action in one line affects processors and memory in the next line. For bus actions, denote a shared read miss as "RdMsS" and an exclusive read miss as "RdMsX". For states, use "Mod", "Excl", "Shar", or "Inv" to represent the read/write, read only unshared, read only shared, and invalid states. Use "(none)" to represent an item that exists in the three-phase protocol, but not in the MESI protocol.

                    P1                P2                Bus                     Memory
Operation       State Addr Val   State Addr Val   Action Proc Addr Val      Addr Val
P1 Rd A         Shar  A                           RdMs   P1   A             A    5
  MESI:         Excl  A                           RdMsX  P1   A
                Shar  A    5                      Read        A    5        A    5
  MESI:         Excl  A    5
P1 Wr 10 to A   Excl  A    10                     WrMs   P1   A             A    5
  MESI:         Mod   A    10                     (none)
P2 Rd A         Excl  A    10    Shar  A          RdMs   P2   A             A    5
  MESI:         Mod   A    10    Shar  A          RdMsS  P2   A
                Shar  A    10    Shar  A          WrBk   P1   A    10       A    5
                Shar  A    10    Shar  A    10                              A    10
P2 Wr 15 to A   Shar  A    10    Excl  A    15    WrMs   P2   A             A    10
  MESI:         Shar  A    10    Mod   A    15    WrMs   P2   A
                Inv   A          Excl  A    15                              A    10
  MESI:         Inv   A          Mod   A    15

(The write-back and data-return lines in the middle of the table are the same in both protocols.)
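Driving the transition table from the earlier sketch through this access sequence reproduces the MESI rows above; a hypothetical driver (it reuses the step function defined before):

    # Walk both caches through the access sequence with the MESI table above.
    p1 = p2 = "Inv"
    p1, _ = step(p1, "CRMx")   # P1 reads A: no other copy, so RdMsX; P1 -> Excl
    p1, _ = step(p1, "CWH")    # P1 writes 10: silent upgrade, no bus action; P1 -> Mod
    p1, _ = step(p1, "BRM")    # P2's read miss snooped: write back; P1 -> Shar
    p2, _ = step(p2, "CRMs")   # P2 reads A: P1 has a copy, so RdMsS; P2 -> Shar
    p2, _ = step(p2, "CWH")    # P2 writes 15: write miss on bus; P2 -> Mod
    p1, _ = step(p1, "BWM")    # P1 snoops the write miss; P1 -> Inv
    print(p1, p2)              # Inv Mod, matching the last MESI line of the table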

Question 3: Cluster vs SMP*

Evaluate the resource utilization while performing streaming I/O on the following three architectures:

- A single workstation
- A cluster of workstations
- A symmetric multiprocessor (SMP)

The basis for the first two architectures is shown in Figure 3. The cluster is built of 8 copies of the single workstation and is shown in Figure 4. The workstation contains a 167 MHz processor (with 512 KB of L2 cache) and 128 Mbyte of memory. The memory bus is 128 bits wide and operates at 83.3 MHz. The workstation contains one 32-bit, 25 MHz I/O bus (called the S-Bus). Attached to this I/O bus are two fast-wide (16-bit, 10 MHz) SCSI controllers. In the cluster, a Myrinet network interface, which is a switch-based network that can support 1280 Mbit/s in each direction, is also installed in each machine; the machines are all connected to a single eight-port switch.

[Figure 3: The Workstation. Processor and memory sit on the 128-bit, 83.3 MHz memory bus; an I/O chip bridges to the 32-bit, 25 MHz S-Bus, which carries two 16-bit, 10 MHz SCSI controllers (with disks) and the Myrinet network interface (1280 Mbit/s).]

* This problem is based on a simplified version of the study "The Architectural Costs of Streaming I/O: A Comparison of Workstations, Clusters, and SMPs" by Remzi H. Arpaci-Dusseau, Andrea C. Arpaci-Dusseau, David E. Culler, Joseph M. Hellerstein, and David A. Patterson from the Fourth International Symposium on High-Performance Computer Architecture.

Question 3 (continued)

[Figure 4: The Cluster. Eight workstations, each with a Myrinet network interface, connected to a single eight-port Myrinet switch.]

The SMP is shown in Figure 5. The system consists of four CPU/Memory boards and four S-Bus I/O boards connected via the GigaPlane memory bus. The GigaPlane is a 256-bit wide, 83.3 MHz bus. Each CPU/Memory board contains two 167 MHz processors (each with 512 KB of L2 cache) and 256 Mbyte of memory. Each I/O board contains two S-Busses. Each S-Bus has one fast-wide (16-bit, 10 MHz) SCSI controller. All communication is performed via loads and stores to shared memory. All memory accesses have uniform access time.

[Figure 5: The SMP. Four CPU/Memory boards (two processors plus memory each) and four S-Bus I/O cards (two 32-bit, 25 MHz S-Busses each, with one 16-bit, 10 MHz SCSI controller per S-Bus) on the 256-bit, 83.3 MHz GigaPlane.]

Question 3 (continued)

The streaming I/O benchmark we will use is a sorting benchmark. The benchmark processes 100-byte records that include 10-byte keys. The basic algorithm is the same on all three platforms. In the first step, the records must be converted from the layout on disk to a format more suitable for efficient sorting. As records are read from disk, the key (which is part of the record) and a pointer to the full record are placed into buckets based on the top few bits of the key; this improves the cache behavior of the sort in two ways. First, the sort operates on only <partial key, pointer> pairs, thus copying only 8 bytes rather than 100-byte records as keys are compared and swapped. Second, the number of keys in each bucket matches the size of the second-level cache. The next step sorts the keys in each bucket. Assume that the data is initially randomly placed over all disks.

The basic algorithm has been slightly tailored for best performance on each platform. Figures 6, 7, and 8 show a graphical representation of the read phase for each platform. The arrows show the order and direction of data that moves across busses, but do not show the relative sizes of each transfer. The following paragraphs refer to the numbers in those figures.

[Figure 6: Workstation Sort Read Phase. Numbered data movements between disk, memory, and processor.]

In the workstation read phase, the input file is read into the user's address space (1). These records are then copied to an input buffer (2, 3). Each key is examined (4), and a <partial key, pointer> pair is written into the correct bucket (5).

[Figure 7: Cluster Sort Read Phase. Numbered data movements between disk/network, memory, and processor.]

In the cluster read phase, the input file is read into the user's address space (1). Records are then copied into one of 8 send buffers (2, 3); as each buffer fills, it is sent to the appropriate destination processor (4). Upon receipt of records from other processors (5), records are copied into a record buffer (6, 7). Then, each key is examined, and a <partial key, pointer> pair is written into the bucket array (8, 9), as in the single workstation sort.
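The bucketing step itself is simple; a minimal Python sketch for illustration only (the 4-byte key prefix making an 8-byte <partial key, pointer> pair, and all names, are assumptions, and whole records are assumed):

    import struct

    REC, KEY, TOP_BITS = 100, 10, 4      # 100-byte records, 10-byte keys, 16 buckets

    def bucketize(data):
        """Scan records; file each <partial key, pointer> pair by the key's top bits."""
        buckets = [[] for _ in range(2 ** TOP_BITS)]
        for off in range(0, len(data), REC):         # assumes len(data) % REC == 0
            key = data[off:off + KEY]
            partial = struct.unpack(">I", key[:4])[0]   # 4-byte key prefix
            buckets[key[0] >> (8 - TOP_BITS)].append((partial, off))
        return buckets

In the real sort, the number of top bits is chosen so that each bucket's pairs fit in the 512 KB second-level cache, which is the second cache benefit described above.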

Question 3 (continued)

[Figure 8: SMP Sort Read Phase. Numbered data movements between disk, memory, and processor.]

In the SMP read phase, the input file is read into the user's address space (1). Records are then copied into an input buffer (2, 3). Each key is examined (4), and a <partial key, pointer> pair is written into the correct bucket buffer (5). When a bucket buffer fills, the processor copies the <partial key, pointer> pairs (6, 7) and records (8, 9) into a global array.

The GigaPlane bus can sustain 94% of its theoretical maximum transfer rate. The SCSI bus can sustain 80% of its theoretical maximum transfer rate. The workstation and cluster memory bus can sustain 75% of its theoretical maximum transfer rate. The S-Bus can sustain 55% of its theoretical maximum transfer rate.

The table below shows the number of millions of instructions required to process each megabyte of data on the disk for the different platforms. The differences are mainly from the overhead of sending and receiving network messages, and from slightly different ways of zeroing pages on the different platforms.

             Cluster   SMP
Read Phase     5.5     4.6

The table below shows the measured CPI for each platform while running the benchmark.

             Cluster   SMP
Read Phase     2.2     2.2

Question 3 (continued)

a) Determine how much of each resource (I/O bus and memory bus) is used during the read phase of the sort for each platform. First, write a general equation for how much of each resource is used in terms of the rate data is read from disk (D_r), the number of processors in the cluster or SMP (P), and the sizes of the records (rec), keys (key), and <partial key, pointer> pairs (bucket). D_r is the total rate that data is read from disks (the sum of all the individual disk rates). Give the combined bandwidth required for all the busses. Then, fill in the table on the next page with the summary. Provide a short justification for these equations. The resource usage for the workstation sort has been completed as an example.

Workstation

Memory Bus: During the read phase, data is read from disk (D_r), then copied into memory (2 D_r). The keys are read ((key/rec) D_r), and <partial key, pointer> pairs are written to the right bucket ((bucket/rec) D_r).

I/O Bus: Data is read from the disk (D_r).

Cluster

Memory Bus: During the read phase, data is read from disk (D_r), then copied into buffers (2 D_r). Then, blocks are sent to other processors (((P-1)/P) D_r), and received from other processors (((P-1)/P) D_r), then copied into buffers (2 D_r). After this, keys are read ((key/rec) D_r), and <partial key, pointer> pairs are written ((bucket/rec) D_r).

I/O Bus: Data is read from the disk (D_r), and blocks are sent to and received from other processors (2 ((P-1)/P) D_r).

SMP

Memory Bus: During the read phase, data is read from disk (D_r), then copied into buffers (2 D_r). Then, each key is examined ((key/rec) D_r), and <partial key, pointer> pairs are written ((bucket/rec) D_r). Once the buckets fill, <partial key, pointer> pairs are copied (2 (bucket/rec) D_r) and records are copied (2 D_r) into a global array.

I/O Bus: Data is read from the disk (D_r).

Question 3 (continued)

Resource Usage:

Workstation Memory Bus:  D_r + 2 D_r + (key/rec) D_r + (bucket/rec) D_r
Workstation I/O Bus:     D_r
Cluster Memory Bus:      D_r + 2 D_r + 2 ((P-1)/P) D_r + 2 D_r + (key/rec) D_r + (bucket/rec) D_r
Cluster I/O Bus:         D_r + 2 ((P-1)/P) D_r
SMP Memory Bus:          D_r + 4 D_r + (key/rec) D_r + 3 (bucket/rec) D_r
SMP I/O Bus:             D_r

b) Fill in the values of the general equations from part (a), using the following values: 8 processors in the cluster and SMP (P), 100-byte records (rec), 10-byte keys (key), and 8-byte <partial key, pointer> pairs (bucket). Leave the term D_r in your equations. The read phase of the workstation sort has been completed as an example.

             Memory Bus Usage   I/O Bus Usage
Workstation      3.18 D_r           D_r
Cluster          6.93 D_r          2.75 D_r
SMP              5.34 D_r           D_r
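The part (b) multipliers follow mechanically from the part (a) equations; a small Python sketch (names illustrative) as a cross-check:

    # Evaluate the part (a) traffic equations as multiples of D_r.
    P, rec, key, bucket = 8, 100, 10, 8

    ws_mem  = 1 + 2 + key/rec + bucket/rec                      # 3.18
    cl_mem  = 1 + 2 + 2*(P - 1)/P + 2 + key/rec + bucket/rec    # 6.93
    cl_io   = 1 + 2*(P - 1)/P                                   # 2.75
    smp_mem = 1 + 4 + key/rec + 3*bucket/rec                    # 5.34
    print(ws_mem, cl_mem, cl_io, smp_mem)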

Question 3 (continued)

c) Each disk can read data at 5.5 Mbyte/s. Assume disks are organized the most efficient way possible (the disks are equally spread over all the busses available). If we use 2 disks per processor, what is the utilization of each resource (SCSI bus, I/O bus, memory bus, processor) during the read phase of the sort for only the cluster and SMP platforms? (Determine utilization as a percent of the maximum sustainable transfer rate for each bus.)

SCSI Bus

Cluster: 2 disks per processor, 2 SCSI busses per processor, therefore one disk per SCSI bus.

5.5 MB/s / (.8 × 2 Bytes × 10 MHz) = 34.38%

SMP: 2 disks per processor, 1 SCSI bus per processor, therefore two disks per SCSI bus.

(2 × 5.5 MB/s) / (.8 × 2 Bytes × 10 MHz) = 68.75%

I/O Bus

Cluster: The cluster bandwidth required on the I/O bus is 2.75 times the bandwidth read from disk.

(2.75 × 2 × 5.5 MB/s) / (.55 × 4 Bytes × 25 MHz) = 55%

SMP: The SMP bandwidth required on the I/O bus is the same as the bandwidth read from disk.

(1 × 2 × 5.5 MB/s) / (.55 × 4 Bytes × 25 MHz) = 20%
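Each utilization above is just required bandwidth over sustainable bandwidth; a sketch (function name illustrative):

    # Utilization of a bus: required MB/s over sustainable MB/s
    # (width in bytes x clock in MHz x sustainable fraction).
    def utilization(required_mb_s, width_bytes, mhz, sustain_frac):
        return required_mb_s / (sustain_frac * width_bytes * mhz)

    disk = 5.5  # MB/s per disk; 2 disks per processor
    print(utilization(1 * disk,        2, 10, 0.80))   # cluster SCSI bus: ~0.344
    print(utilization(2 * disk,        2, 10, 0.80))   # SMP SCSI bus:     ~0.688
    print(utilization(2.75 * 2 * disk, 4, 25, 0.55))   # cluster S-Bus:     0.55
    print(utilization(1 * 2 * disk,    4, 25, 0.55))   # SMP S-Bus:         0.20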

Question 3 (continued)

Memory Bus

Cluster: The cluster bandwidth required on the memory bus is 6.93 times the bandwidth read from disk.

(6.93 × 2 × 5.5 MB/s) / (.75 × 16 Bytes × 83.3 MHz) = 7.6%

SMP: The SMP bandwidth required on the memory bus is 5.34 times the bandwidth read from disk. Since this is a shared memory bus, the total bandwidth required will be 8 times greater (since we have 8 processors).

(8 × 5.34 × 2 × 5.5 MB/s) / (.94 × 32 Bytes × 83.3 MHz) = 18.75%

CPU

Cluster: The cluster requires 5.5 million instructions per megabyte of data. The CPI during the read phase is 2.2.

(5.5 MI/MB × 2.2 CPI × 2 × 5.5 MB/s) / 167 MHz = 79.70%

SMP: The SMP requires 4.6 million instructions per megabyte of data. The CPI during the read phase is 2.2.

(4.6 MI/MB × 2.2 CPI × 2 × 5.5 MB/s) / 167 MHz = 66.66%

e) Explain briefly which system scales the best (in terms of adding more disks) for this benchmark.

The SMP can add another disk at full bandwidth for this benchmark while the cluster cannot, because of the CPU utilization: at 79.70%, a cluster node cannot absorb the instruction cost of a third disk (1.5 × 79.70% exceeds 100%), while at 66.66% the SMP processors just can.
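The memory-bus and CPU numbers come from the same recipe as the bus utilizations earlier; a final Python sketch (names illustrative) collecting them:

    # Memory-bus and CPU utilization for the read phase (2 disks per processor).
    disk = 5.5   # MB/s per disk
    cl_mem  = (6.93 * 2 * disk) / (0.75 * 16 * 83.3)       # ~0.076  (7.6%)
    smp_mem = (8 * 5.34 * 2 * disk) / (0.94 * 32 * 83.3)   # ~0.1875 (18.75%)
    cl_cpu  = (5.5 * 2.2 * 2 * disk) / 167                 # ~0.797  (MI/MB x CPI x MB/s / MHz)
    smp_cpu = (4.6 * 2.2 * 2 * disk) / 167                 # ~0.667
    print(cl_mem, smp_mem, cl_cpu, smp_cpu)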
