CDA3101 Recitation Section 13

Godwin Wells
6 years ago

1 CDA3101 Recitation Section 13 Storage + Bus + Multicore and some exam tips

2 Hard Disks Traditional disk performance is limited by the moving parts.

3 Disk Performance: Some disk terms
Platters - the surfaces on which data is written.
Track - one "ring" of data on a platter.
Cylinder - the collection of tracks on all platters at equal distance from the shaft.
Drive Head - the device that reads from the platter, mounted on a moving arm.
Sector - one unit of data on a hard drive, a subdivision of the track. All tracks have the same number of sectors.
partition-table/harddisk-physical -structure.php

4 Disk Performance
Partitions are defined in terms of sectors.
A "block" is 1024 bytes in Linux.

5 Disk Performance: Sources of disk delay
Seek Time - time to move the head into the correct position.
Rotational Delay - time for the spinning platter to come around to the correct position (on average, ½ rotation time).
Controller Delay - overhead introduced by the controller.
Transfer Delay - time to actually read the data from the disk.

6 Problem 1: Disk Performance
Compute the average read time for a 512B sector from this drive: 5200rpm, *5.7ms average seek time, 3.0Gb/s transfer rate, *0.2ms controller overhead. $129.00, Apr
Answer:
One disk revolution = 60 s/min / 5200 rev/min = 11.54 ms
An average read requires ½ revolution, or 5.77 ms rotational delay
3.0 Gb/s = 3.0 * 1000^3 / 8 = 375,000,000 B/s
512 B / 375,000,000 B/s = 0.0014 ms transfer delay
0.2 ms + 5.7 ms + 5.77 ms + 0.0014 ms = 11.67 ms
* Specs were ambiguous for these items
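
The four delay terms from Problem 1 can be combined in a small helper. This is only a sketch: the function name and parameter names are made up for illustration, and it assumes the transfer rate is given in bits per second with 1 Gb = 10^9 bits and that the average rotational delay is half a revolution.

```c
/* Average time, in milliseconds, to read one sector.
 * Assumptions (not from the slides): transfer rate is in bits/second,
 * and the average rotational delay is half of one revolution. */
double avg_read_time_ms(double rpm, double seek_ms, double controller_ms,
                        double sector_bytes, double transfer_bits_per_s)
{
    double rotation_ms = 60.0 * 1000.0 / rpm;        /* one full revolution */
    double rotational_delay_ms = rotation_ms / 2.0;  /* average: half a turn */
    double transfer_ms = sector_bytes * 8.0 / transfer_bits_per_s * 1000.0;

    return seek_ms + rotational_delay_ms + controller_ms + transfer_ms;
}
```

Calling avg_read_time_ms(5200.0, 5.7, 0.2, 512.0, 3.0e9) gives roughly 11.67 ms. The transfer term is tiny next to the seek and rotational terms.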

7 RAID 0
Uses striping to increase disk throughput. Consecutive blocks are written to different disks.
Advantage: requests for several blocks can be serviced in parallel.
Disadvantage: the MTTF of the array is T/N, where N is the number of disks and T is the MTTF of a single disk.
Source: Wikipedia
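
As a rough illustration of striping, the sketch below maps a logical block number to a disk and an offset on that disk, and computes the T/N array MTTF from the slide. The struct and function names are hypothetical, and the round-robin layout with a stripe unit of one block is an assumption; real arrays often use larger stripe units.

```c
/* Where a logical block lives in a RAID 0 array (hypothetical layout:
 * round-robin striping with a stripe unit of exactly one block). */
struct stripe_loc {
    unsigned disk;    /* index of the disk that holds the block */
    unsigned offset;  /* block index on that disk */
};

struct stripe_loc raid0_locate(unsigned logical_block, unsigned num_disks)
{
    struct stripe_loc loc;
    loc.disk = logical_block % num_disks;    /* consecutive blocks hit different disks */
    loc.offset = logical_block / num_disks;  /* row of the stripe */
    return loc;
}

/* Array MTTF = T / N, as stated on the slide: any single failure loses data. */
double raid0_mttf(double disk_mttf, unsigned num_disks)
{
    return disk_mttf / (double)num_disks;
}
```

With 4 disks, logical blocks 4, 5, 6, and 7 land on disks 0, 1, 2, and 3 at offset 1, so a request for all four can be serviced in parallel.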

8 RAID 1
Mirrors data onto multiple disks. Multiple reads to different locations can be serviced in parallel by different disks.
Improves reliability: all drives must fail before the array will fail.
Disadvantage: higher write cost, as each copy must be updated.
Disadvantage: expensive, with 100% capacity overhead.
Source: Wikipedia

9 RAID 2 No longer used.

10 RAID 3 Uses byte level striping and a parity disk. Can recover from 1 failure.
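
Recovery from a single failure works because the parity disk stores the XOR of the data disks. The sketch below shows the idea for one byte position; the function names are hypothetical, and it assumes simple XOR parity over num_disks data disks.

```c
#include <stddef.h>

/* Parity for one byte position: XOR of that byte on every data disk. */
unsigned char parity_byte(const unsigned char *data, size_t num_disks)
{
    unsigned char p = 0;
    for (size_t i = 0; i < num_disks; i++)
        p ^= data[i];
    return p;
}

/* Rebuild the byte of one failed data disk by XOR-ing the parity byte
 * with the bytes of all surviving disks. data[failed] is never read. */
unsigned char recover_byte(const unsigned char *data, size_t num_disks,
                           size_t failed, unsigned char parity)
{
    unsigned char r = parity;
    for (size_t i = 0; i < num_disks; i++)
        if (i != failed)
            r ^= data[i];
    return r;
}
```

If two disks fail at once, the XOR no longer has enough information to separate their contents, which is why only one failure can be tolerated.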

11 Problem 2: IO Issues
Define: Memory Mapped IO
Answer: Control registers for IO devices are in the same address space as main memory, and are accessed using ordinary lw/sw instructions. Special IO instructions are not needed. This scheme is advantageous because it does not require extra instructions, and provides access to all the addressing modes already supported for memory access. A downside is that the entire address space is not available for main memory. Note that x86, including AMD64, also provides dedicated IO instructions (in and out) alongside memory mapped IO.

12 Problem 3: IO Issues Define: Polling, for IO systems Answer: When waiting for an IO request to complete, the CPU repeatedly queries the IO status register in a tight loop, and does not do any additional work until the operation is finished. This is in contrast to interrupt driven IO, where the CPU does other things while waiting for a special interrupt signal from the IO controller.
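
A polling loop over a memory mapped status register might look like the sketch below. Everything device-specific here is an assumption made for illustration: the io_device layout, the READY bit, and the max_polls bound are hypothetical. On real hardware the pointer would refer to a fixed address defined by the platform; here it can point at an ordinary struct so the loop can be tried out.

```c
#include <stdint.h>

/* Hypothetical device register block; the layout is an assumption. */
struct io_device {
    volatile uint32_t status;  /* volatile: re-read from the device every time */
    volatile uint32_t data;
};

#define IO_STATUS_READY 0x1u   /* hypothetical "operation finished" bit */

/* Spin on the status register until the device is ready, then read the data.
 * Returns 0 on success, or -1 if the device was not ready after max_polls
 * checks. The CPU does no other work while it spins. */
int poll_read(struct io_device *dev, unsigned long max_polls, uint32_t *out)
{
    for (unsigned long i = 0; i < max_polls; i++) {
        if (dev->status & IO_STATUS_READY) {
            *out = dev->data;
            return 0;
        }
    }
    return -1;
}
```

The volatile qualifier matters: without it the compiler could read status once and spin forever on the cached value. With interrupt driven IO this loop disappears, and the read happens in an interrupt handler instead.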

13 Problem 4: IO Issues Define: Direct Memory Access Answer: When transferring data from an IO device to memory, the CPU instructs the IO controller to copy data directly into main memory without passing through the CPU. An interrupt occurs when the operation is complete. DMA is more efficient than repeatedly loading data into the CPU and then writing it to memory.

14 Bus Performance
Assumptions: Bandwidth (B), Bit Error Rate (BER), Failure Rate (FR), Failure Cost (FC) in bits/sec
Actual Bandwidth (B'): B' = B x (1 - BER) - FC / (1 - FR)
B x (1 - BER) is the error effect; FC / (1 - FR) is the failure penalty.
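
The actual-bandwidth formula translates directly into code. The function name is hypothetical; b and fc share the same unit (for example bits/sec), while ber and fr are fractions between 0 and 1.

```c
/* Actual bandwidth B' = B x (1 - BER) - FC / (1 - FR).
 * b and fc share one unit (e.g. bits/sec); ber and fr are fractions, fr < 1. */
double actual_bandwidth(double b, double ber, double fr, double fc)
{
    double error_effect = b * (1.0 - ber);     /* bandwidth left after bit errors */
    double failure_penalty = fc / (1.0 - fr);  /* cost of failures */
    return error_effect - failure_penalty;
}
```

For example, with B = 100, BER = 0.1, FR = 0.5, and FC = 5, the error effect is 90 and the failure penalty is 10, so B' comes out at about 80.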

15 Problem 5: Buses Define: Throughput, for buses Answer: how much data can be transmitted through the bus in a given time. In contrast, latency is a measurement of bus performance that corresponds to the amount of time for the first unit of data to pass through the bus. ** Note that even though you don't need to mention latency in a throughput definition, comparing the two metrics beefs up your answer. If your throughput definition is lacking, you can get points back from the comparison. This is very helpful for when you understand a concept but have a hard time formulating a definition.

16 Bus Performance with EC
Modern buses have error correction technology. Some errors can be corrected without the need to retransmit packets. Buses of this form have these extra parameters:
C_D = cost of error detection
C_C = cost of error correction, when possible
C_R = cost of a retransmission
Typically, C_R dominates C_D and C_C. This means C_D and C_C are typically negligible in computations. (As long as you state that as an assumption!)

17 Problem 5: Bus Bandwidth (HW help)
Given a bus with a nominal bandwidth of 36 Mbits/sec and a clock rate of 1.8 MHz, what is the actual bandwidth of the bus in duplex mode if each packet has a mean size of 2 Kbyte, 57 percent of the packets sent along the bus result in a collision, with retransmission occurring only once, and there is an additional error rate of 0.02 percent that also causes retransmission of packets?
Answer with error correction: Let us assume that f is the fraction of packets that can NOT be error corrected, and C_D, C_C, C_R are the costs of detection, correction, and retransmission.
Error Penalty ~ (1 - f) (C_D + C_C) + f (C_D + C_R)
Take f from the problem statement. Also assume C_D = 0 and C_C = 0, since C_R >> C_C, C_D (look at the Web notes).
Answer is: 15.47 Mbit/sec
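
One way to reproduce the 15.47 Mbit/sec figure is sketched below. This is a reading of the problem, not the official solution. It assumes f = 0.57 + 0.0002 = 0.5702 (the collision fraction plus the additional error rate), that costs are expressed as lost bandwidth in Mbit/sec, and that a retransmission costs the full nominal bandwidth (C_R = 36). The function names are hypothetical.

```c
/* Error Penalty ~ (1 - f)(C_D + C_C) + f(C_D + C_R), from the slide.
 * Costs are treated as lost bandwidth in Mbit/sec (an assumption). */
double error_penalty(double f, double c_d, double c_c, double c_r)
{
    return (1.0 - f) * (c_d + c_c) + f * (c_d + c_r);
}

/* Actual bandwidth = nominal bandwidth minus the error penalty.
 * Subtracting the penalty from the nominal rate is an assumption. */
double bandwidth_with_ec(double nominal, double f,
                         double c_d, double c_c, double c_r)
{
    return nominal - error_penalty(f, c_d, c_c, c_r);
}
```

Under those assumptions, bandwidth_with_ec(36.0, 0.5702, 0.0, 0.0, 36.0) evaluates to about 15.47, matching the stated answer. On an exam, state the assumptions C_D = 0 and C_C = 0 explicitly.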

18 Motivation for Parallelism
Unicore Processors: Moore's law cannot continue indefinitely.
Recall Moore's Law: the number of transistors on integrated circuits doubles every two years.
Fundamental Clock Rate Limitations
Power Requirements: grow with the square of clock speed
Heat Dissipation

19 Parallelism
Bit Level Parallelism
  Deal with data bits in parallel rather than serially
Instruction Level Parallelism
  Pipelining
  Tomasulo's Algorithm: permits out of order execution of instructions
Thread/Process Level Parallelism
  Multi-threaded applications
  Multiple processes running concurrently

21 Tomasulo ILP (you don't need to know this)
From CDA5155, Peir, UFL

22 Flynn's Taxonomy
SISD (Single Instruction Single Data)
  Uniprocessors
MISD (Multiple Instruction Single Data)
  Multiple processors on a single data stream
  No commercial prototypes. Can be thought of as successive refinement of a given set of data by multiple processors (units).
SIMD (Single Instruction Multiple Data)
  Simple programming model, low overhead, and flexibility
  Examples: Illiac-IV, CM-2; all custom integrated circuits
MIMD (Multiple Instruction Multiple Data)
  Flexible
  Difficult to program: no unifying model of parallelism
  Use off-the-shelf microprocessors
  Examples: Sun Enterprise 5000, Cray T3D, SGI Origin
These are classifications, but there is no unifying model of parallel computing.

23 Shared Memory vs Distributed Memory
Shared Memory: processors share access to a single centralized memory.
  Symmetric relationship (SMP)
  Uniform access time (UMA)
Distributed Memory: scales memory bandwidth and also permits lower latencies.
  DSM: Distributed Shared Memory
  NUMA: Non-Uniform Memory Access

24 Shared Memory vs Distributed Memory
Compulsory misses (aka cold start misses)
  First access to a block
Capacity misses
  Due to finite cache size, a replaced block is later accessed again
Conflict misses (aka collision misses)
  Occur in a non-fully associative cache, due to competition for entries in a set
Coherence misses
  Miss because the cached value is stale (invalidated by another processor's write)

25 GPU Technology
GPUs consist of several streaming multiprocessors (SM). Each SM contains several (32 for modern GPUs) SIMD cores, and is connected to a large global memory.
Each SM can be used for MIMD (good for GPGPU computing).
GPUs are good for problems that can be partitioned into blocks of atomic operations.
GPUs are bad for sequential problems: low clock rate, plus latency to move data to and from the GPU.
CPUs contain fewer, more functional cores, and typically have access to larger blocks of low latency memory.
CPUs consume less power, but also have lower throughput.

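26 A Simple GPU (CUDA) Program
Write a program that fills a 50 x 50 float array with 1.0f, in a region of memory already allocated and assigned to the variable foo.
C Code:
    for (i = 0; i < 50; i++) {
        for (j = 0; j < 50; j++) {
            foo[i*50 + j] = 1.0f;
        }
    }
CUDA Code
Host:
    float *d_foo;                              /* device copy of foo */
    cudaMalloc((void **)&d_foo, 50*50*sizeof(float));  /* allocation added; the slide omits it */
    cudaMemcpy(d_foo, foo, 50*50*sizeof(float), cudaMemcpyHostToDevice);
    dim3 grid(5, 5);                           /* 5 x 5 blocks */
    dim3 threads(10, 10);                      /* 10 x 10 threads per block, 50 x 50 threads in total */
    kernel<<<grid, threads>>>(d_foo);          /* one thread per array element */
    cudaMemcpy(foo, d_foo, 50*50*sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(d_foo);
Device:
    __global__ void kernel(float *d_foo) {
        int i, j;
        i = blockIdx.x * blockDim.x + threadIdx.x;   /* row index, 0..49 */
        j = blockIdx.y * blockDim.y + threadIdx.y;   /* column index, 0..49 */
        d_foo[i*50 + j] = 1.0f;
    }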

27 Exam Review
Review recitation slides + quizzes for calculation problems likely to be on the exam.
Review HW and exam reviews for the knowledge-based questions.
Reread the web notes.
Understand RISC/CISC, CPI, Pipelining, Memory Hierarchies.
Parallelism (Multicore + GPUs) is the state of the art. You will probably be tested on it.
Be able to use Amdahl's law.
Exam will probably be biased toward OLD material.
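
As a refresher on Amdahl's law, here is a minimal sketch. The slides do not state the formula, so this uses the standard form, speedup = 1 / ((1 - p) + p / n), where p is the fraction of execution time that benefits from the enhancement and n is the speedup of that fraction. The function name is hypothetical.

```c
/* Amdahl's law: overall speedup when a fraction p of the execution time
 * is sped up by a factor n and the remaining (1 - p) is unchanged. */
double amdahl_speedup(double p, double n)
{
    return 1.0 / ((1.0 - p) + p / n);
}
```

For example, if 90% of a program can be parallelized across 10 cores, amdahl_speedup(0.9, 10.0) is about 5.26, not 10. As n grows, the speedup is capped at 1 / (1 - p), which is 10 in this case.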
