
DHANALAKSHMI SRINIVASAN INSTITUTE OF RESEARCH AND TECHNOLOGY
CS2354 ADVANCED COMPUTER ARCHITECTURE
PART-A QUESTIONS WITH ANSWERS

UNIT I

1. What is ILP?
ILP (instruction-level parallelism) means that multiple operations (or instructions) can be executed in parallel.

2. What are the needs of ILP?
Sufficient resources; parallel scheduling, by a hardware solution, a software solution, or both; and an application that actually contains ILP.

3. What are the various hazards?
There are three types of hazards: structural, data dependence, and control dependence.

4. What is dynamic scheduling?
In dynamic scheduling, the hardware rearranges instruction execution to reduce stalls, allowing instructions behind a stall to proceed.

5. What are the advantages of dynamic scheduling?
It handles cases where dependences are unknown at compile time (e.g., because they may involve a memory reference); it simplifies the compiler; and it allows code compiled for one machine to run efficiently on a different machine, with a different number of functional units (FUs) and different pipelining.

6. What is branch prediction?
Branch prediction addresses the high branch penalties in pipelined processors: with on average 20% of the instructions being branches, the maximum ILP is about five.
CPI = CPI_base + f_branch * f_mispredict * penalty
The impact is large if the penalty is high (a long pipeline) or if CPI_base is low (multiple-issue processors).

7. What is speculation?
Hardware-based speculation follows the predicted flow of data values to choose when to execute instructions. This method of executing programs is essentially a data-flow execution: operations execute as soon as their operands are available. Speculative instruction execution requires an additional set of hardware buffers.
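As a worked instance of the formula in question 6, with assumed illustrative numbers (not from the syllabus): take CPI_base = 1, a branch frequency of 20%, a misprediction rate of 10%, and a 15-cycle penalty. Then

CPI = 1 + 0.20 * 0.10 * 15 = 1 + 0.30 = 1.30

so mispredictions alone add 30% to the cycle count; a longer pipeline (larger penalty) or a lower CPI_base makes the relative impact worse, as the answer notes.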

8. What is instruction commit?
When an instruction is no longer speculative, we allow it to update the register file or memory; we call this additional step in the instruction execution sequence instruction commit.

9. What is a reorder buffer?
A hardware buffer that holds the results of instructions that have finished execution but have not yet committed; we call this the reorder buffer.

10. What are the four steps involved in instruction execution?
Issue, execute, write result, and commit.

11. What is the principle of locality?
An implication of locality is that we can predict with reasonable accuracy what instructions and data a program will use in the near future based on its accesses in the recent past. The principle of locality also applies to data accesses, though not as strongly as to code accesses.

12. What are the types of locality?
Two different types of locality have been observed. Temporal locality states that recently accessed items are likely to be accessed in the near future. Spatial locality says that items whose addresses are near one another tend to be referenced close together in time.

13. How is the value of CPI calculated?
The CPI (cycles per instruction) for a pipelined processor is the sum of the base CPI and all contributions from stalls:
Pipeline CPI = Ideal pipeline CPI + Structural stalls + Data hazard stalls + Control stalls

14. What is the ideal pipeline CPI?
The ideal pipeline CPI is a measure of the maximum performance attainable by the implementation. By reducing each of the terms on the right-hand side, we minimize the overall pipeline CPI or, alternatively, increase the IPC (instructions per clock).

15. What is loop-level parallelism?
The simplest and most common way to increase ILP is to exploit parallelism among iterations of a loop. This type of parallelism is often called loop-level parallelism.
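A minimal C sketch of the loop-level parallelism in question 15 (the array and function names are illustrative): every iteration below is independent of every other, so the n additions could in principle execute in parallel.

void vadd(const int *a, const int *b, int *c, int n) {
    /* no iteration reads a value written by another iteration */
    for (int i = 0; i < n; i++)
        c[i] = a[i] + b[i];
}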

16. What are the types of dependences?
There are three different types of dependences: data dependences (also called true data dependences), name dependences, and control dependences.

17. What are the various data hazards?
RAW (read after write), WAR (write after read), and WAW (write after write).

18. What is RAW?
Instruction j tries to read a source before instruction i writes it, so j incorrectly gets the old value. This hazard is the most common type and corresponds to a true data dependence. Program order must be preserved to ensure that j receives the value from i.

19. What is WAW?
WAW (write after write): j tries to write an operand before it is written by i. The writes end up being performed in the wrong order, leaving the value written by i rather than the value written by j in the destination. This hazard corresponds to an output dependence. WAW hazards are present only in pipelines that write in more than one pipe stage or allow an instruction to proceed even when a previous instruction is stalled.

20. What is WAR?
WAR (write after read): j tries to write a destination before it is read by i, so i incorrectly gets the new value. This hazard arises from an antidependence. WAR hazards cannot occur in most static-issue pipelines, even deeper pipelines or floating-point pipelines, because all reads are early (in ID) and all writes are late (in WB).

21. What is control dependence?
A control dependence determines the ordering of an instruction i with respect to a branch instruction so that instruction i is executed in correct program order and only when it should be. Every instruction, except for those in the first basic block of the program, is control dependent on some set of branches, and, in general, these control dependences must be preserved to preserve program order.

22. Write one of the simplest examples of control dependence.
if p1 { S1; };
if p2 { S2; }

23. What is loop unrolling? (May 2011)
To reduce loop overhead and expose more independent instructions for scheduling, the loop body is replicated (unrolled) several times, with the loop termination code adjusted accordingly.
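A minimal C sketch of the unrolling in question 23, assuming the trip count n is a multiple of 4 (the names are illustrative):

/* original loop: one branch and one counter update per element */
void add_scalar(int *y, int n, int s) {
    for (int i = 0; i < n; i++)
        y[i] = y[i] + s;
}

/* unrolled by 4: the loop overhead is paid once per four elements,
   and the four independent additions expose ILP to the scheduler */
void add_scalar_unrolled(int *y, int n, int s) {
    for (int i = 0; i < n; i += 4) {
        y[i]     = y[i]     + s;
        y[i + 1] = y[i + 1] + s;
        y[i + 2] = y[i + 2] + s;
        y[i + 3] = y[i + 3] + s;
    }
}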

UNIT II

1. What is the goal of multiple-issue processors?
The goal of multiple-issue processors is to allow multiple instructions to issue in a clock cycle.

2. What are the types of multiple-issue processors?
a) Statically scheduled superscalar processors, b) VLIW (very long instruction word) processors, and c) dynamically scheduled superscalar processors.

3. How does a superscalar processor differ from a VLIW processor?
The two types of superscalar processors issue varying numbers of instructions per clock and use in-order execution if they are statically scheduled or out-of-order execution if they are dynamically scheduled. VLIW processors, in contrast, issue a fixed number of instructions, formatted either as one large instruction or as a fixed instruction packet, with the parallelism among instructions explicitly indicated by the instruction. VLIW processors are inherently statically scheduled by the compiler.

4. What is register renaming?
Register renaming is a technique to eliminate anti- and output dependences; that is, it eliminates false dependences.

5. How can register renaming be implemented?
It can be implemented by the compiler (advantage: low cost; disadvantage: old binaries perform poorly) or in hardware (advantage: binary compatibility; disadvantage: extra hardware needed).

6. What are the limitations of VLIW processors?
A very smart compiler is needed (though this problem is largely solved); loop unrolling increases code size; unfilled slots waste bits; and a cache miss stalls the whole pipeline.

7. How can superscalar complexity be avoided?
An alternative that avoids superscalar complexity is EPIC (explicitly parallel instruction computing).

8. How is EPIC a better alternative?
A superscalar is expensive but binary compatible; a VLIW is simple but not compatible. EPIC aims for the simplicity of VLIW while preserving compatibility.
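A compilable C sketch of the renaming in question 4; the variables r1, r4, t1, and so on are illustrative stand-ins for architectural and physical registers.

#include <stdio.h>

int main(void) {
    int r2 = 1, r3 = 2, r5 = 10, r6 = 3, r7 = 4;
    int r1, r4;
    r1 = r2 + r3;    /* i: writes r1                                    */
    r4 = r1 + r5;    /* j: reads r1 (RAW on i, a true data dependence)  */
    r1 = r6 + r7;    /* k: WAR with j and WAW with i on r1 (both false) */
    /* renaming k's destination to a fresh register t1 removes the false
       dependences, so k could have executed before i and j:            */
    int t1 = r6 + r7;
    printf("%d %d %d\n", r4, r1, t1);
    return 0;
}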

9. Write about the Itanium instruction format.
Instructions are grouped in 128-bit bundles: 3 * 41-bit instructions plus 5 template bits that indicate the instruction types and the stop location.

10. What is control speculation?
Loads incur high latency, so loads need to be scheduled as early as possible. There are two barriers: branches and stores. Control speculation moves loads above branches.

11. What is data speculation?
Data speculation moves loads above potentially overlapping stores.

12. What is the register stack?
The register stack is used to save and restore procedure contexts across calls; a stack area in memory holds the saved procedure contexts.

13. What is the register stack engine?
The register stack engine (RSE) automatically saves and restores stack registers without software intervention. It avoids explicit spill/fill code (eliminating stack-management overhead) and provides the illusion of infinite physical registers. The RSE uses unused memory bandwidth (cycle stealing) to perform register spill and fill operations in the background.

14. What is a superscalar processor?
A superscalar processor issues multiple instructions per cycle; it may be statically or dynamically scheduled.

15. What is VLIW? (May 2011)
Single instruction issue, but multiple operations per instruction.

16. What is SIMD / vector?
Single instruction issue, single operation, but multiple data sets per operation.

17. What is a multiple-issue processor? (May 2011)
Multiple-issue processors come in two styles. Superscalar: a varying number of instructions per cycle (0 to 8), scheduled by hardware (dynamic issue capability), e.g., IBM PowerPC, Sun UltraSPARC, DEC Alpha, Pentium III/4. VLIW (very long instruction word): a fixed number of instructions (4-16) scheduled by the compiler.
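A quick consistency check of the bundle layout in question 9:

3 instructions * 41 bits + 5 template bits = 123 + 5 = 128 bits, exactly one bundle.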

18. What are superpipelined processors?
Superpipelined processors divide the pipeline into more, shorter stages so that the clock can run faster. The anticipated success of issuing multiple instructions per cycle also led to the instructions-per-cycle (IPC) metric being used instead of CPI.

19. What is vector processing?
Vector processing codes independent loops as operations on large vectors of numbers; multimedia instructions in this style are being added to many processors.

20. What are the advanced compiler support techniques?
Detecting loop-level parallelism, software pipelining, and global scheduling (across basic blocks).

21. What is software pipelining?
Software pipelining is a related technique that consumes less code space than loop unrolling; it interleaves instructions from different iterations of a loop, as shown in the sketch below.

22. What is global code scheduling?
Loop unrolling and software pipelining work well when there are no control statements (if statements) in the loop body, i.e., when the loop is a single basic block. If there are control statements, global code scheduling is used instead: code is scheduled and moved across branches, giving a larger scheduling scope.
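A hedged C sketch of the software pipelining in question 21 (names are illustrative). Each source iteration of y[i] = y[i] + s has three dependent steps (load, add, store); the pipelined loop's steady state mixes steps from three different iterations, so the operations executed together are independent of one another.

void add_scalar_swp(int *y, int n, int s) {
    if (n < 3) {                  /* too short to pipeline */
        for (int i = 0; i < n; i++)
            y[i] += s;
        return;
    }
    int loaded = y[0];            /* prologue: load iteration 0 */
    int added  = loaded + s;      /* prologue: add  iteration 0 */
    loaded = y[1];                /* prologue: load iteration 1 */
    for (int i = 2; i < n; i++) {
        y[i - 2] = added;         /* store from iteration i-2 */
        added    = loaded + s;    /* add   from iteration i-1 */
        loaded   = y[i];          /* load  from iteration i   */
    }
    y[n - 2] = added;             /* epilogue */
    y[n - 1] = loaded + s;
}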

UNIT-III

1. What is write serialization?
Serializing the writes ensures that every processor sees writes to the same location in the same order, avoiding the case that some processor could see the write of P2 first and then see the write of P1, maintaining the value written by P1 indefinitely. The simplest way to avoid such difficulties is to ensure that all writes to the same location are seen in the same order; this property is called write serialization.

2. What are snooping caches and write-through caches? (May 2011)
Every cache that has a copy of the data from a block of physical memory also has a copy of the sharing status of the block, but no centralized state is kept. The caches are all accessible via some broadcast medium (a bus or switch), and all cache controllers monitor, or snoop on, the medium to determine whether or not they have a copy of a block that is requested on a bus or switch access.

3. What is symmetric shared memory?
Symmetric shared-memory machines usually support the caching of both shared and private data.

4. What are private data and shared data?
Private data are used by a single processor, while shared data are used by multiple processors, essentially providing communication among the processors through reads and writes of the shared data.

5. What happens when a private or shared item is cached?
When a private item is cached, its location is migrated to the cache, reducing the average access time as well as the memory bandwidth required. Since no other processor uses the data, the program behavior is identical to that in a uniprocessor.

6. What is cache coherence?
When shared data are cached, the shared value may be replicated in multiple caches. In addition to the reduction in access latency and required memory bandwidth, this replication also reduces contention for shared data items that are being read by multiple processors simultaneously. Caching of shared data, however, introduces a new problem, called the cache coherence problem.

7. What is multiprocessor cache coherence?
Caching shared data introduces a new problem because the view of memory held by two different processors is through their individual caches, which, without any additional precautions, could end up seeing two different values; that is, two different processors can have two different values for the same location. This difficulty is generally referred to as the cache coherence problem.

8. What is meant by coherence?
Informally, we could say that a memory system is coherent if any read of a data item returns the most recently written value of that data item. This aspect, called coherence, defines what values can be returned by a read.

9. What is meant by consistency?
The aspect called consistency determines when a written value will be returned by a read.

10. What are the schemes provided by a coherent multiprocessor?
In a coherent multiprocessor, the caches provide both migration and replication of shared data items.

11. On what attributes is overall cache performance based?
The overall cache performance is a combination of the behavior of uniprocessor cache-miss traffic and the traffic caused by communication, which results in invalidations and subsequent cache misses.

12. What are the types of coherence misses?
The misses that arise from interprocessor communication, often called coherence misses, can be broken into two separate sources: true sharing misses and false sharing misses.
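A small timeline illustrating the coherence problem of question 7, assuming two processors A and B with write-through caches (the values are illustrative):

Time  Event                 Cache A   Cache B   Memory X
1     CPU A reads X            1                   1
2     CPU B reads X            1         1         1
3     CPU A writes 0 to X      0         1         0

After time 3, a read of X by CPU B still returns the stale value 1 from its own cache: the two processors see different values for the same location until a coherence mechanism (e.g., snooping invalidation) intervenes.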

13. What is a true sharing miss?
True sharing misses arise from the communication of data through the cache coherence mechanism. In an invalidation-based protocol, the first write by a processor to a shared cache block causes an invalidation to establish ownership of that block. Additionally, when another processor attempts to read a modified word in that cache block, a miss occurs and the resultant block is transferred. Both misses are classified as true sharing misses since they arise directly from the sharing of data among processors.

14. What is a false sharing miss?
False sharing arises from the use of an invalidation-based coherence algorithm with a single valid bit per cache block. False sharing occurs when a block is invalidated (and a subsequent reference causes a miss) because some word in the block, other than the one being read, is written into. If the word being written and the word being read are different and the invalidation does not cause a new value to be communicated, but only causes an extra cache miss, it is a false sharing miss. In a false sharing miss the block is shared, but no word in the cache is actually shared, and the miss would not occur if the block size were a single word.

15. How can memory bandwidth be increased?
We can increase the memory bandwidth and interconnection bandwidth by distributing the memory. This immediately separates local memory traffic from remote memory traffic, reducing the bandwidth demands on the memory system and on the interconnection network.
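A minimal C sketch of the false sharing described in question 14. The struct layout is an assumption chosen to force two logically independent counters into the same cache block (typical blocks are 64 bytes); no value is actually shared, yet each write by one thread invalidates the block in the other thread's cache.

/* two fields that are never logically shared, but that live
   in the same cache block */
struct counters {
    long a;    /* written only by thread 1 */
    long b;    /* written only by thread 2 */
};

void bump_a(struct counters *c) { c->a++; }   /* invalidates B's copy */
void bump_b(struct counters *c) { c->b++; }   /* invalidates A's copy */

Padding each field out to its own cache block (or giving each thread its own variable) removes these misses, matching the answer's remark that the miss would not occur if the block size were a single word.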

UNIT-IV

1. What is spatial locality?
Spatial locality: items close in space to a recently accessed item have a high probability of being accessed next.

2. What are the basic cache optimizations?
To reduce the miss rate: a larger block size, a bigger cache, and higher associativity (which reduces conflict misses). To reduce the miss penalty: multilevel caches, and giving priority to read misses over write misses. To reduce the hit time: avoiding address translation (from virtual to physical address) while indexing the cache.

3. What are the advanced cache optimizations?
Reducing hit time, increasing cache bandwidth, reducing miss penalty, reducing miss rate, and reducing miss penalty or miss rate via parallelism.

4. What is a non-blocking cache?
A non-blocking (lockup-free) cache allows the data cache to continue to supply cache hits during a miss; it requires an out-of-order-execution CPU.

5. What is hit under miss?
Hit under miss reduces the effective miss penalty by letting the cache continue to serve hits during a miss, overlapping work with one or more outstanding misses.

6. How is the average memory access time of a 2-way cache calculated?
Average memory access time (2-way) = Hit time + Miss rate * Miss penalty

7. How is the average memory access time of a 4-way cache calculated?
Average memory access time (4-way) = Hit time + Miss rate * Miss penalty, using the 4-way cache's (usually longer) hit time and (usually lower) miss rate.
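A worked instance of the AMAT formula in questions 6 and 7, with assumed illustrative numbers: hit time 1 cycle (2-way) and 1.1 cycles (4-way), miss penalty 20 cycles, miss rates 5% (2-way) and 4% (4-way):

AMAT(2-way) = 1.0 + 0.05 * 20 = 2.0 cycles
AMAT(4-way) = 1.1 + 0.04 * 20 = 1.9 cycles

Higher associativity lowers the miss rate but can lengthen the hit time, so the formula is needed to decide which effect wins.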

10. What is way prediction?
In way prediction, extra bits are kept in the cache to predict the way, or block within the set, of the next cache access. This prediction means the multiplexor is set early to select the desired block, and only a single tag comparison is performed that clock cycle, in parallel with reading the cache data.

11. What is sequential interleaving?
A simple mapping that works well is to spread the addresses of blocks sequentially across the banks; this is called sequential interleaving.

12. What are the two basic strategies in the seventh optimization?
Critical word first: request the missed word first from memory and send it to the processor as soon as it arrives; let the processor continue execution while filling the rest of the words in the block. Early restart: fetch the words in normal order, but as soon as the requested word of the block arrives, send it to the processor and let the processor continue execution.

13. What is the eighth optimization?
The eighth optimization merges write buffers to reduce the miss penalty.

14. What is a victim buffer?
In a write-back cache, the block that is replaced is sometimes called the victim. Hence, the AMD Opteron calls its write buffer a victim buffer.

15. What is a victim cache?
The victim buffer contains the dirty blocks that are discarded from a cache because of a miss. Rather than stalling on a subsequent cache miss, the contents of the buffer are checked on a miss to see whether they have the desired data before going to the next lower-level memory. Used this way, the buffer is a victim cache.

16. How is compiler optimization done to reduce miss rates?
Merging arrays, loop interchange, loop fusion, and blocking.
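A minimal C sketch of the loop interchange listed in question 16 (the array dimensions are illustrative). C stores arrays row-major, so the interchanged version walks memory sequentially and captures the spatial locality the original wastes.

/* before: the inner loop strides down a column, touching a new
   cache block on almost every access */
void scale_bad(double x[100][100]) {
    for (int j = 0; j < 100; j++)
        for (int i = 0; i < 100; i++)
            x[i][j] = 2.0 * x[i][j];
}

/* after interchange: the inner loop walks a row, so many accesses
   hit in each fetched block */
void scale_good(double x[100][100]) {
    for (int i = 0; i < 100; i++)
        for (int j = 0; j < 100; j++)
            x[i][j] = 2.0 * x[i][j];
}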

17. How is memory latency quoted?
Memory latency is traditionally quoted using two measures: access time and cycle time.

18. What is access time?
Access time is the time between when a read is requested and when the desired word arrives.

19. What is cycle time?
Cycle time is the minimum time between requests.

20. What does RAID stand for?
RAID stands for redundant array of inexpensive disks.

21. What is JBOD?
RAID 0 has no redundancy and is sometimes nicknamed JBOD, for "just a bunch of disks," although the data may be striped across the disks in the array.

22. What are I/O bandwidth and latency?
I/O throughput is sometimes called I/O bandwidth, and response time is sometimes called latency.

23. What is transaction time?
Transaction time is the sum of entry time, system response time, and think time.

24. Differentiate between SRAM and DRAM. (May 2011)
SRAMs don't need to be refreshed, so the access time is very close to the cycle time. SRAMs typically use six transistors per bit to prevent the information from being disturbed when read, and need only minimal power to retain the charge in standby mode. SRAM designs are concerned with speed and capacity, while in DRAM designs the emphasis is on cost per bit and capacity. The capacity of DRAMs is roughly 4 to 8 times that of SRAMs; the cycle time of SRAMs is 8 to 16 times faster than that of DRAMs, but they are also 8 to 16 times as expensive.

UNIT V

1. What is single threading?
Single threading performs only one task at a time.

2. What are multitasking and multithreading?
Multitasking is the execution of two or more tasks at one time by means of context switching. Multithreading is a technique wherein multiple threads share the functional units of one processor via overlapping.

3. What is meant by multicore technology?
The computational work of an application is divided and spread over multiple execution cores, improving performance.

4. What is HT Technology?
In HT (Hyper-Threading) Technology, two threads execute simultaneously on the same processor core.

5. Define a thread.
A thread is a basic unit of CPU utilization; it has its own program counter, CPU state information, and stack.

6. What is a logical processor?
A logical processor duplicates the architectural state of the processor; execution resources can then be shared among the logical processors.

7. What is hyper-threading?
Hyper-threading is Intel's version of simultaneous multithreading (SMT). It makes a single physical processor appear as multiple logical processors, and the operating system can schedule work onto the logical processors.

8. What is the advantage of multicore technology?
More tasks get completed in less time, which increases the performance and responsiveness of the system.

9. What is simultaneous multithreading?
Simultaneous multithreading (SMT) is a variation on multithreading that uses the resources of a multiple-issue, dynamically scheduled processor to exploit TLP at the same time it exploits ILP.

11. What is the key insight that motivates SMT?
The key insight that motivates SMT is that modern multiple-issue processors often have more functional-unit parallelism available than a single thread can effectively use.

12. What are the various processor configurations?
A superscalar with no multithreading support, a superscalar with coarse-grained multithreading, a superscalar with fine-grained multithreading, and a superscalar with simultaneous multithreading.

13. What happens to a superscalar without multithreading support?
In the superscalar without multithreading support, the use of issue slots is limited by a lack of ILP. In addition, a major stall, such as an instruction cache miss, can leave the entire processor idle.

14. What happens in a coarse-grained multithreaded superscalar?
In the coarse-grained multithreaded superscalar, long stalls are partially hidden by switching to another thread that uses the resources of the processor. Although this reduces the number of completely idle clock cycles, ILP limitations still lead to idle slots within each clock cycle.

15. What happens in fine-grained multithreading?
In the fine-grained case, the interleaving of threads eliminates fully empty cycles. Because only one thread issues instructions in a given clock cycle, however, ILP limitations still lead to a significant number of idle slots within individual clock cycles.

16. What happens in SMT?
In the SMT case, TLP and ILP are exploited simultaneously, with multiple threads using the issue slots in a single clock cycle.

17. What are the trade-offs of SMT?
SMT increases hardware design flexibility; on the other hand, simultaneous multithreading increases the complexity of instruction scheduling.

18. What are the hardware mechanisms that support multithreading?
A large set of virtual registers that can hold the register sets of independent threads; register renaming, which provides unique register identifiers so that instructions from multiple threads can be mixed in the datapath without confusing sources and destinations across threads; and out-of-order completion, which allows the threads to execute out of order and gives better utilization of the hardware.

19. What are vertical and horizontal waste?
Vertical waste is introduced when the processor issues no instructions in a cycle; horizontal waste arises when not all issue slots can be filled in a cycle.

20. What are the advantages of a multicore processor?
Increased responsiveness and worker productivity, and improved performance in parallel environments when running computations on multiple processors.

21. In what way is a multicore processor superior to a single-core processor?
On a single-core processor, multithreading can only be used to hide latency; a multicore processor can execute threads truly in parallel.
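An illustrative issue-slot picture for question 19, assuming a 4-wide issue processor (x = slot used, . = slot wasted):

Cycle 1:  x x x .   <- horizontal waste: one slot unfilled
Cycle 2:  x x . .   <- horizontal waste: two slots unfilled
Cycle 3:  . . . .   <- vertical waste: no instruction issued at all

Fine-grained multithreading attacks vertical waste; only simultaneous multithreading attacks both kinds, as questions 15 and 16 describe.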

22. Define a multicore processor. (May 2011)
A multicore design takes several processor cores and packages them as a single processor. The goal is to enable the system to run more tasks simultaneously and thereby achieve greater overall system performance.

23. What is a Cell processor? (May 2011)
Cell is a heterogeneous multicore processor comprising control-intensive and compute-intensive SIMD processor cores: one control-intensive processor core (the PPE) and eight compute-intensive processor cores (the SPEs). The EIB (element interconnect bus) is a high-speed bus used for interconnecting the processor cores within a Cell.
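Finally, a minimal POSIX-threads sketch of the thread concept in question 5 of this unit: each thread gets its own program counter and stack but shares the process's address space and, on an SMT or multicore processor, can run simultaneously with its siblings. The worker function and its argument are illustrative.

#include <pthread.h>
#include <stdio.h>

static void *worker(void *arg) {
    /* runs with its own stack and program counter */
    printf("thread %ld running\n", (long)arg);
    return NULL;
}

int main(void) {
    pthread_t t[2];
    for (long i = 0; i < 2; i++)
        pthread_create(&t[i], NULL, worker, (void *)i);
    for (int i = 0; i < 2; i++)
        pthread_join(t[i], NULL);   /* wait for both threads to finish */
    return 0;
}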
