A Distributed-multicore Hybrid ATPG System


X. Cai and P. Wohl
Synopsys, Inc., Mountain View, CA, USA
{xcai,

Abstract

We present a distributed-multicore hybrid ATPG system which leverages the computing power of multiple machines, each with multiple CPUs. The system is versatile and scalable and supports flexible configuration. Experimental results are compared to a highly efficient multicore ATPG system.

1. Introduction

Even using today's most advanced CPUs, ATPG on large industrial designs can take so long that it must be truncated, resulting in less than optimal test coverage and design quality. A parallel ATPG system is a way to cut ATPG time as design sizes grow. In general, a parallel ATPG system can be classified as distributed over different machines or localized on one machine. In a localized system (e.g., multicore), speedup is limited by the number of CPUs and the performance of memory management on the machine. In a distributed system, inter-process communication can become the bottleneck as more machines get involved. For even more speedup, a hybrid mode that combines the advantages of multicore and distributed configurations is a good option.

In practice, most enterprise server farms have machines with 4-8 idle CPUs, but memory size limits the degree to which multicore ATPG can exploit the speedup scalability of such a system. A distributed-multicore hybrid system best fits these conditions: it can fully use the memory and CPUs on each machine while adding more speedup with more machines. Such a system can be cheaper to run than a multicore system on a high-end machine.

The paper is organized as follows. In Section 2, we present our job partitioning scheme and hybrid architecture. Experimental results are reported in Section 3. Section 4 discusses limitations and future improvements of the system. Finally, Section 5 concludes.

2. Our Hybrid System

An ATPG system contains 3 major components: test generation, good machine simulation and fault machine simulation. A parallel ATPG system should consider parallelization in all 3 components in order to get scalable speedup. Job partitioning is an important aspect of a parallel ATPG system; a good job partitioning scheme is the foundation of scalable speedup. Various job partitioning schemes have been discussed in previous works, and we compare our job partitioning to them in the following section. System architecture is another foundational block of a parallel ATPG system: a partitioning scheme may not perform well if the system architecture cannot implement the job partitioning efficiently.

2.1 Job partitioning

Various job partitioning schemes have been previously presented. These schemes can be classified as fault partitioning, search-space partitioning, circuit partitioning, heuristic partitioning and algorithm partitioning. For fault simulation, fault partitioning and sometimes pattern partitioning are commonly used methods [1, 2, 3, 4, 5]. Search-space partitioning can only be used in test generation and has the benefit of improving coverage for hard-to-detect faults [6, 7, 8, 9]. While circuit partitioning is sometimes possible [10, 11, 12], it can be tricky to handle the boundaries between partitions; this partitioning is also highly dependent on circuit topology, which makes it hard to get consistent performance across different designs. Heuristic partitioning refers to applying different test generation heuristics in parallel for a given fault [13].
Algorithm partitioning distributes different components of ATPG to be executed in a pipelined parallel fashion [14]. Such a partitioning scheme usually requires complex synchronization and a large amount of data transfer between the pipelined components.

We employ fault partitioning as our main partitioning scheme, as it can be applied to both test generation and fault simulation. Various fault partitioning schemes have been tested for fault simulation, and several heuristic algorithms on how to group faults together within a partition have been discussed. In [4], a predetermined fault partition based on gate level and fault effect cone correlation is presented. In [3], a multistage pipelined synchronous algorithm is implemented to reduce the possibility of missed detections when combined with pattern partitioning. However, all of these works mainly used static fault partitions, which may lead to an unbalanced work load among different processes. In [15], faults are not partitioned but multiple faults are simulated in parallel to exploit the vast vector parallelism of a GPU system. In [2], static partitioning based on fault correlation is used first and then dynamic partitioning is used as a supplement.

Both static and dynamic partitioning have limitations, as seen in previously proposed fault partitioning schemes. With static partitioning, the heuristic used can have significant effects on pattern set size and speedup. With dynamic partitioning, the synchronization overhead can be substantial [16]. Current industrial designs can have tens or even hundreds of millions of gates, and applying a correlation analysis to the entire fault list can be time consuming. With ever growing circuit size, such analysis provides diminishing benefits compared to a simple partition of the fault list roughly built on localities. On the other hand, increased network communication bandwidth and memory access speed make dynamic partitioning more applicable. We found that a simple locality-based initial partitioning at the beginning of ATPG combined with dynamic fault partitioning gives the best results [17]. The key idea is to blend static and dynamic fault partitioning at different stages of ATPG based on the number of undetected faults.

Unlike previous works, we do not limit one process to work on only one part of the fault list. We first order the fault list based on fan-out-free regions. Each slave process has access to the full fault list but starts picking a primary fault target from a different section of the fault set. The dynamic pattern compaction algorithm then picks secondary targets according to certain heuristics. If a fault has been picked as a primary target or has been detected, the fault is skipped by the next fault target selection. Other than that, each slave can choose any remaining fault as a target fault. Since we do not limit fault target selection to a subset of faults like a static fault partition scheme, we avoid the pattern inflation problem that may occur in a static partition system.

At the beginning of the ATPG process, there are plenty of faults, so different slaves are unlikely to select the same fault targets. However, as the fault population decreases, we rely more on fault status communication to avoid duplicate fault targets. We keep a global fault status table, which is checked before each fault target is selected. Each slave sends a request to update the global status table as soon as it detects a fault. Unlike previous works, our dynamic fault partitioning technique exchanges fault status information much more frequently but with less overhead. This is achieved by eliminating the locking/unlocking associated with such communication. We keep a local copy of the global fault status table on each sub-master machine so that slave processes need not go through the network for fault status lookup. Since each slave works on a different part of the fault list, it is unlikely that slaves would update the same fault entry in the table at the same time, so the locking/unlocking for updating the table can also be largely eliminated.

Another key part of our hybrid system is pattern partitioning. In our system, each slave process only simulates the patterns it has created locally. This essentially partitions the patterns for good and fault machine simulation. Many previous papers only addressed the fault simulation problem [3, 4, 15]. To avoid duplicating good machine simulations by different processes, it is necessary to partition the patterns [3].

2.2 System Architecture

Depending on the communication method between processes, two types of systems have been designed. One type is based on shared memory on one machine; the other is message passing between machines.
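To make the shared-memory style, and in particular the lock-free fault status table of Section 2.1, more concrete, the following minimal Python sketch shows several slave processes on one machine sharing an unlocked fault status table. The fault count, the 4-worker layout and the random stand-in for detection are illustrative assumptions, not the paper's implementation.

# Minimal sketch (not the authors' implementation) of a lock-free shared
# fault status table: slaves check the table before targeting a fault and
# mark detections without taking a lock.
import multiprocessing as mp
import random

NUM_FAULTS = 1000                 # hypothetical fault list size
UNDETECTED, DETECTED = 0, 1

def slave(worker_id, num_workers, status):
    # Each slave starts from a different section of the ordered fault list
    # but may target any fault that is still undetected.
    start = worker_id * NUM_FAULTS // num_workers
    order = list(range(start, NUM_FAULTS)) + list(range(0, start))
    for fault in order:
        if status[fault] == DETECTED:   # cheap read, no lock
            continue                    # skip faults already detected
        if random.random() < 0.9:       # stand-in for test generation + simulation
            status[fault] = DETECTED    # single-byte write, no lock taken

if __name__ == "__main__":
    # lock=False: the table is read and updated without locking
    status = mp.Array('b', [UNDETECTED] * NUM_FAULTS, lock=False)
    workers = [mp.Process(target=slave, args=(i, 4, status)) for i in range(4)]
    for w in workers:
        w.start()
    for w in workers:
        w.join()
    print("faults detected:", sum(status))

Because each entry only ever moves from undetected to detected, and slaves mostly work on different parts of the list, concurrent updates rarely collide, which is consistent with the paper's observation that locking for table updates can be largely eliminated; the real system additionally mirrors this table per machine and synchronizes the copies through the master.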
Shared memory based systems have been discussed earlier [11, 15]. In [11], a system combining fine-grained search-space partitioning in test generation with circuit partitioning in fault simulation is presented. This system requires different strategies for easy and hard-to-detect faults. It also has many synchronization points that undermine speedup; as a result, the speedup numbers are unpredictable. Several GPU-based fault simulation systems have also been implemented in recent years [15, 19, 20]. In such systems, fault simulation takes advantage of the large vector parallelism of a GPU to simulate multiple faults at the same time. This is an extension of the 32-bit-wise parallelism of existing parallel fault simulation and requires special hardware (GPUs) to implement.

Examples of message passing based systems can be found in [21, 2, 5]. In [2, 5], a distributed ATPG system is designed with a master/server setup in which each node can talk to other nodes for workloads. Such a system is not scalable because the number of communication links grows quadratically with the number of nodes. In [21], the good machine simulation was not parallelized, based on the assumption that it is not a major bottleneck. Such an assumption may no longer be true with current circuit sizes and complicated clocking features. Even if it were still true, good machine simulation could become a bottleneck as the number of nodes increases, as predicted by Amdahl's law [22].

In a shared memory system, the scale of parallelism is limited by the number of CPUs and the memory size of the machine. A message passing system can extend the computing power beyond one machine. As more and more machines become multi-core, a hybrid system can combine the strengths of the two approaches.

In a computing system, locality is an important means of achieving efficient system throughput. We designed our system to follow this principle as much as possible: all generated vectors are simulated by the same process that created them. As concluded in [5], vector broadcasting is not as efficient as fault broadcasting, mainly because good machine simulation has to be repeated for the same vectors with vector broadcasting. Thus we designed our system to eliminate vector broadcasting. While this could cause some lost coverage with a partitioned fault list, we solve this problem with a scheme wherein every process can potentially simulate any fault in the remaining fault list.

This scheme in turn may cause duplicate simulation work; we solve this second problem with efficient dynamic fault partitioning.

Overall scheme

Our hybrid system has 3 types of processes: the master process, the sub-master processes and the slave processes. The architecture for the entire system is shown in Figure 1. This is a hierarchical architecture meant to reduce the number of direct communication links in the system. All ATPG work is done by the slave processes, which generate and simulate their own test patterns. Master and sub-master processes consume very little memory and CPU, for logistics and communication purposes only. The hierarchy allows scaling to a large number of slave processes.

Figure 1. System architecture for hybrid mode

Master process

There is only one master process, which is the original process launched. The full design database and ATPG constraints are stored in the master process. The responsibilities of the master process are controlling the entire system, collecting fault status and patterns, and reporting progress (Figure 2). Before launching sub-masters, the master process saves a binary copy of the database including the ATPG constraints. Once launched, the sub-masters on remote machines can then read in the saved database. The fault list is then sent to the sub-masters from the master process. The master process then enters a waiting loop for pattern or fault status events from sub-masters, and exits this loop only after all sub-masters terminate. After launching sub-masters, the master process only accesses fault and pattern data; the rest of the design database can be swapped out to disk.

Sub-master processes

The sub-masters are launched on remote machines through server farm utilities. They are responsible for collecting patterns and fault status from slave processes and communicating with the master. The sub-master processes are mostly idle and are awakened from time to time by fault status or pattern activities (Figure 2). To avoid jamming up the communication channels between the master and sub-masters, the sub-masters consolidate the fault and pattern information from slaves to make the data transfer to the master more compact and efficient; such consolidation is not necessary in the communication between the sub-masters and the slaves, because they share the same memory system and communication is much more efficient. For example, the sub-master can combine two or more fault status events which arrived at the same time into one message. Since each message has a fixed overhead, this consolidation improves communication efficiency. Like the master process, a sub-master process only needs to access fault and pattern data after launching slave processes. However, unlike the master process, the rest of the database is shared by slave processes during ATPG. In our multicore system, no sub-master process is needed since all slave processes are on the same machine, and fault status changes can be updated directly in the shared fault status table.

Slave processes

The real ATPG and simulation work is done by the slave processes. They communicate only with their sub-masters. Each slave works similarly to a single-process ATPG flow. The process first performs test generation with dynamic pattern compaction to accumulate an interval of (usually 32) patterns. A primary fault target is randomly selected for test generation. We try to keep the primary targets evenly distributed over the entire circuit for one interval.
Then test generation packs as many care bits as possible for secondary fault targets. After the test generation phase, a parallel-pattern single-fault simulation is performed for all active faults. Fault targets and fault status changes are broadcast to other slaves on the same machine and also to the sub-master. The sub-master sends the information to the master, which in turn forwards it to other sub-masters. Unlike single-process ATPG, each slave checks the shared fault status table before a fault is targeted or simulated. The global fault status table records the most recent status for each fault after collecting information from all slave processes. Figure 2 shows flow charts for all 3 types of processes.
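As a rough summary of the slave flow just described, the sketch below outlines one slave's main loop in Python. The helpers pick_primary, generate_pattern, fault_simulate, broadcast_target and broadcast_detections are hypothetical stand-ins for the real test generation and simulation engines; only the interval size of 32 patterns comes from the paper.

# Illustrative sketch of one slave's main loop, not the actual implementation.
INTERVAL = 32  # patterns accumulated before fault grading, as in the paper

def slave_loop(fault_status, pick_primary, generate_pattern, fault_simulate,
               broadcast_target, broadcast_detections):
    while True:
        # --- test generation phase: build one interval of patterns ---
        patterns = []
        for _ in range(INTERVAL):
            primary = pick_primary(fault_status)   # skips faults already
            if primary is None:                    # detected or targeted
                return                             # no faults left: done
            broadcast_target(primary)              # tell other slaves/sub-master
            # dynamic pattern compaction adds secondary targets to the pattern
            patterns.append(generate_pattern(primary, fault_status))
        # --- simulation phase: grade only the locally created patterns ---
        detected = fault_simulate(patterns, fault_status)
        broadcast_detections(detected)             # update local table + sub-master

The two points this sketch is meant to capture are that a slave fault-simulates only the patterns it created itself (the pattern partitioning of Section 2.1) and that every target selection and detection goes through the shared fault status table.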

Figure 2. Process flows for master, sub-master and slave processes

Fault status communication

Fault status communication is crucial to the quality of results of the hybrid system. Different processes rely on efficient fault status communication to reduce duplicate work. There are two levels of communication. At the machine level, slave processes communicate with each other through shared memory. At the system level, different machines communicate through TCP/IP message passing. The slave processes associated with the same sub-master process share a fault status table. This table is updated immediately with any fault detection by slave processes. The sub-master process periodically collects new fault detections from this shared table. The new detections are sent to the master process as messages. On the other hand, the sub-master also processes new detection messages received from the master process. These new detections from other sub-masters are also updated in the locally shared fault status table.

The fault status changes at an uneven rate during ATPG: at the beginning, large numbers of faults are detected per pattern, and the rate gradually reduces to a few near the end. We designed the sub-master message package size to dynamically adjust to such changes in detection rate. The size of a fault status message is set to roughly the average number of detections per pattern, measured within a sliding window of 100 patterns. At the beginning of ATPG, most faults are detected by random patterns. As different slaves are working on different parts of the fault list, it is not critical to avoid duplicate work through communication. Thus, at the beginning, a larger message size is used to reduce the overall number of messages to be sent. Since there is an overhead for each message sent, a larger message size utilizes the communication bandwidth more efficiently. The message size is gradually reduced as detections per pattern decrease.

User Interface

Based on their needs and knowledge of available resources, users can specify the list of machines or get a fixed number of machines from the computer farm, and also the number of slave processes each machine should run. If we specify m sub-masters and n slaves for each sub-master, then m x n slave processes perform ATPG. Therefore, there are multiple possible configurations that launch the same total number of slave processes. It is possible to specify m=1 and n=1; such a system has 1 master process, 1 sub-master process and 1 slave process. It adds no value compared to traditional single-process ATPG, but such a configuration is allowed for debugging purposes.

3. Experimental Results

We selected 8 industrial circuits with roughly 1.5 to 70 million gates to evaluate our hybrid system. The circuit characteristics are listed in Table 1 (all numbers are in millions). D1 to D4 are bigger designs, for which we used the stuck-at fault model; D5 to D8 are smaller designs, for which we used the transition fault model. Since transition fault ATPG runs very long on D1-D4 and stuck-at ATPG runs very fast on D5-D8, we did not collect data for both fault models on all designs.

Table 1. Circuit characteristics.
name    Fault Model    #gates (M)    #flops (M)    #faults (M)
D1      stuck-at
D2      stuck-at
D3      stuck-at
D4      stuck-at
D5      transition
D6      transition
D7      transition
D8      transition

A 4x4 hybrid configuration was compared to single-process, 4-core and 16-core multicore runs. We also ran a 5x6 hybrid configuration to demonstrate the versatility of the system. The single-process and multicore results were obtained on one 2666MHz Intel Xeon machine with 256GB memory and 24 cores; the 24 cores are installed in 4 sockets with 6 cores in each socket.

Since we could not find other machines with the same CPUs, the hybrid results were obtained on 2666MHz Intel Xeon servers with 72GB memory and 8 cores each; each machine has 2 sockets with 4 cores in each socket. Although the CPU speed of the two types of machines is the same, the internal cache performance of the two may not be identical; however, a thorough investigation of caching effects is beyond the scope of this paper.

The final test coverage results are listed in Table 2. Single-process test coverage is shown in column 2 as the baseline. The final test coverage differences of the various multicore and hybrid configurations are listed in columns 3 to 6; a plus sign means higher coverage. In all multicore or hybrid runs, the final test coverage was higher than single process. The last row lists the average coverage gain. On average, the test coverage gain increases with the number of slave processes. The explanation of the coverage gain is that combining dynamic pattern compaction with dynamic fault partitioning brings more randomness into the system; some hard-to-detect faults may get fortuitous detections in a multicore system [17]. Further, in most cases, the final test coverage of hybrid is higher than multicore. The explanation for this behavior could be that the longer communication delays in a hybrid system cause some of the hard-to-detect faults to be targeted more times than in a multicore system. When a slave targets a fault, it informs the other slaves to avoid the duplicate work of targeting the same fault; this is good in general because it avoids pattern inflation. However, with hard-to-detect faults and slow communication, multiple slaves can end up targeting the same fault, thus increasing coverage but also pattern count.

Table 2. Final test coverage
name    Cov (%)    Final cov diff (%)
        Single     4      16     4x4    5x6
D1
D2
D3
D4
D5
D6
D7
D8
Ave

Table 3 lists the speedup at the end of each run; the table is arranged similarly to Table 2. For all runs in Table 3, hybrid mode has higher speedup than a multicore configuration with the same number of slave processes. This is due to the fact that a full-fledged multicore ATPG run may stress the performance of a machine's memory system. For example, in a multi-socket NUMA system, remote memory is more time consuming to access than local memory. If all required processes fit on one socket, all shared memory may be local to these processes. However, if the number of required processes exceeds the number of CPUs that one socket can schedule, additional processes will be scheduled on another socket. Since these processes all share the same database memory, some processes may have to transfer data from remote memory. A hybrid system can distribute working processes to different machines so that remote memory usage is reduced on each machine.

For designs D5 to D8, we used the transition fault model. The speedup of the multicore and hybrid configurations is lower for transition fault ATPG than for stuck-at ATPG. This is mainly due to the additional ATPG effort to detect more faults, which contributed to the additional fault coverage. In transition fault ATPG, the search space of a target fault is much greater than for a stuck-at fault, and with dynamic compaction the detection of a fault is more sensitive to fault ordering. With multicore or hybrid, more fault orderings are tested in parallel, one in each process, thus increasing the chance that a hard-to-test fault is detected.
Table 3. Speedup at the end of run
name    CPU(s) at end of run    Speedup
        Single                  4      16     4x4    5x6
D1
D2
D3
D4
D5
D6
D7
D8
Ave

For example, in D4 the speedup for 16 cores is far less than for the 4x4 hybrid. This design has a very long and flat tail in ATPG, so a little more test coverage gain (+0.01% from Table 2) diminished speedup considerably.

Table 4 lists the pattern count comparison at the end of each run. A + sign means pattern inflation and a - sign means pattern reduction. We can see that both multicore and hybrid create more patterns, but also deliver additional coverage. However, hybrid produces less compact pattern sets than a multicore configuration with the same number of slaves. This is due to the fact that network communication is much slower than on-chip shared memory communication, so that duplicate work is more likely to happen in a hybrid system than in a multicore system.

Table 4. Pattern count at the end of run
name    Patterns    Pattern diffs (%)
        Single      4      16     4x4    5x6
D1
D2
D3
D4
D5
D6
D7
D8
Ave

The designs we used may not be large enough to justify a run with 30 slave processes. We can see that the 5x6 hybrid configuration did not produce significantly better speedup with the designs we have, and the pattern inflation gets worse as design size decreases. However, we demonstrated the versatility of our hybrid system: a hybrid system can leverage the computing resources of multiple machines to achieve the expected speedup. The memory consumption on each sub-master host is similar to our multicore system described in [17]. Each sub-master copies the master process memory, and each slave process adds about 25-30% memory overhead.

4. Limitations and Discussions

Our ATPG system generates 32 patterns before fault grading these patterns, and we cannot guarantee that these 32 patterns do not have duplicate detections among them. With multiple ATPG processes, the number of patterns created in parallel increases even more. So our multicore or hybrid solution effectively increases the parallel pattern size, which can have a negative effect on pattern compactness.

We rely on fault status communication among multiple processes to avoid duplicated work in ATPG. The communication delay among machines has a significant impact on the compactness of the pattern set. As fewer and fewer faults remain, it is harder to avoid duplicate fault targets. This may also contribute to the pattern inflation of our multicore and hybrid systems. Communication over the network is much slower than communication through shared memory. In general, it is hard to share detailed information about fault targets, such as merging attempt counts, across different machines. This causes the ATPG algorithms to behave slightly differently between multicore and hybrid, which contributes to the slight difference in results between the two. It may also explain the increased average pattern inflation between a 16-process multicore system and a 4x4 hybrid system, as shown in Table 4. These areas can be further optimized in the future.

In some cases shown in Table 4, multicore and hybrid create a more compact pattern set than single-process ATPG. The reason is that fault status sharing in our parallel system essentially explores multiple fault orderings with different processes. With dynamic merging, some hard-to-detect faults in one process can be detected early in another process. This may also explain why our parallel ATPG results have slightly higher coverage than single-process ATPG. However, this benefit may be offset by the duplicated detection discussed earlier; the final result depends on how these two effects balance each other out.

An implication of the above property of our parallel ATPG system is the non-repeatability of results. We can see the final coverage and pattern set size change with the parallel configuration in Tables 2 and 4. It is hard to predict which configuration will produce the best results, since this is very design dependent. However, with a few experimental runs, the best configuration can easily be decided based on speedup and pattern inflation. In practice, the configuration is often fixed due to resource constraints.

5. Conclusions

We presented the architecture and algorithms of a novel distributed-multicore hybrid ATPG system.
Results indicate that on average our hybrid ATPG system can achieve similar speedup to multicore with the same number of slave processes. The hybrid system uses much less expensive machines, so it is easier for users to find resources to run the parallel ATPG job. The experimental results also showed significant pattern inflation when a large number of slave processes is used for relatively small or medium sized designs. This suggests future optimizations for a large number of slaves competing for a small number of remaining faults.

6. References

[1] An Analysis of Fault Partitioned Parallel Test Generation. Joseph M. Wolf, Lori M. Kaufman, Robert H. Klenke, James H. Aylor, Ron Waxman, IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, Vol. 15, No. 5, May.
[2] Distributed Implementation of an ATPG System Using Dynamic Fault Allocation. M. J. Aguado, E. de la Torre, M. A. Miranda, C. Lopez-Barrio, Proceedings of International Test Conference.
[3] SPITFIRE: Scalable Parallel Algorithms for Test Set Partitioned Fault Simulation. D. Krishnaswamy, E. M. Rudnick, J. H. Patel, P. Banerjee, Proceedings of the VLSI Test Symposium, April.

[4] Data Parallel-Fault Simulation. Minesh B. Amin and Bapiraju Vinnakota, IEEE Transactions on Very Large Scale Integration (VLSI) Systems, Vol. 7, No. 2, June.
[5] An Analysis of Fault Partitioned Parallel Test Generation. Joseph M. Wolf, Lori M. Kaufman, Robert H. Klenke, James H. Aylor, and Ron Waxman, IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, Vol. 15, No. 5, May.
[6] On the Efficiency of Parallel Backtracking. V. Nageshwara Rao and Vipin Kumar, IEEE Transactions on Parallel and Distributed Systems, 4(4), April.
[7] Parallel Test Generation with Low Communication Overhead. Sivaramakrishnan Venkatraman, Sharad Seth, Prathima Agrawal, Proceedings of International Conference on VLSI.
[8] A Parallel Branch and Bound Algorithm for Test Generation. Srinivas Patil and Prithviraj Banerjee, IEEE Transactions on Computer-Aided Design, Vol. 9, No. 3, March.
[9] ProperHITEC: A Portable, Parallel, Object-Oriented Approach to Sequential Test Generation. Steven Parkes, Prithviraj Banerjee, Janak Patel, Proceedings of Design Automation Conference, 1994.
[10] Parallel Test Generation Using Circuit Partitioning and Spectral Techniques. Consolacion Gil, Julio Ortega, Proceedings of the Sixth Euromicro Workshop on Parallel and Distributed Processing.
[11] Parallel Test Generation for Sequential Circuits on General-Purpose Multiprocessors. S. Patil, P. Banerjee, and J. H. Patel, Proceedings of Design Automation Conference, 1991.
[12] Parallelization Methods for Circuit Partitioning Based Parallel Automatic Test Pattern Generation. Robert H. Klenke, Ronald D. Williams, James H. Aylor, Proceedings of the VLSI Test Symposium.
[13] Experimental Evaluation of Testability Measures for Test Generation. Susheel J. Chandra and Janak H. Patel, IEEE Transactions on Computer-Aided Design, Vol. 8, No. 1, Jan.
[14] VLSI Logic and Fault Simulation on General-Purpose Parallel Computers. R. B. Mueller-Thurns, D. G. Saab, and R. F. D. J. A. Abraham, IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, Vol. 12, No. 3, Mar.
[15] Towards Acceleration of Fault Simulation Using Graphics Processing Units. Kanupriya Galati and Sunil P. Khatri, Proceedings of Design Automation Conference, 2008.
[16] Performance Trade-Offs in a Parallel Test Generation/Fault Simulation Environment. Srinivas Patil and Prithviraj Banerjee, IEEE Transactions on Computer-Aided Design, Vol. 10, No. 12, Dec.
[17] Highly Efficient Parallel ATPG Based on Shared Memory. Xiaolei Cai, Peter Wohl, John A. Waicukauski, and Pramod Notiyath, Proceedings of International Test Conference.
[18] Optimal Granularity and Scheme of Parallel Test Generation in a Distributed System. Hideo Fujiwara and Tomoo Inoue, IEEE Transactions on Parallel and Distributed Systems, Vol. 6, No. 7, July.
[19] GPU-Accelerated Fault Simulation and Its New Applications. Huawei Li, Dawen Xu, and Kwang-Ting Cheng, International Symposium on VLSI Design, Automation and Test.
[20] 3-D Parallel Fault Simulation with GPGPU. Min Li, Michael S. Hsiao, IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, Vol. 30, No. 10, October.
[21] Design and Implementation of a Parallel Automatic Test Pattern Generation Algorithm with Low Test Vector Count. Robert Butler, Brion Keller, Sarala Paliwal, Richard Schoonover, Joseph Swenton, Proceedings of International Test Conference.
[22] Validity of the Single Processor Approach to Achieving Large Scale Computing Capabilities. Gene M. Amdahl, Proceedings of AFIPS Spring Joint Computer Conference, 1967.


More information

Von Neumann architecture. The first computers used a single fixed program (like a numeric calculator).

Von Neumann architecture. The first computers used a single fixed program (like a numeric calculator). Microprocessors Von Neumann architecture The first computers used a single fixed program (like a numeric calculator). To change the program, one has to re-wire, re-structure, or re-design the computer.

More information

Chapter 4 NETWORK HARDWARE

Chapter 4 NETWORK HARDWARE Chapter 4 NETWORK HARDWARE 1 Network Devices As Organizations grow, so do their networks Growth in number of users Geographical Growth Network Devices : Are products used to expand or connect networks.

More information

A Study of High Performance Computing and the Cray SV1 Supercomputer. Michael Sullivan TJHSST Class of 2004

A Study of High Performance Computing and the Cray SV1 Supercomputer. Michael Sullivan TJHSST Class of 2004 A Study of High Performance Computing and the Cray SV1 Supercomputer Michael Sullivan TJHSST Class of 2004 June 2004 0.1 Introduction A supercomputer is a device for turning compute-bound problems into

More information

Performance impact of dynamic parallelism on different clustering algorithms

Performance impact of dynamic parallelism on different clustering algorithms Performance impact of dynamic parallelism on different clustering algorithms Jeffrey DiMarco and Michela Taufer Computer and Information Sciences, University of Delaware E-mail: jdimarco@udel.edu, taufer@udel.edu

More information

Design Issues 1 / 36. Local versus Global Allocation. Choosing

Design Issues 1 / 36. Local versus Global Allocation. Choosing Design Issues 1 / 36 Local versus Global Allocation When process A has a page fault, where does the new page frame come from? More precisely, is one of A s pages reclaimed, or can a page frame be taken

More information

Today. SMP architecture. SMP architecture. Lecture 26: Multiprocessing continued Computer Architecture and Systems Programming ( )

Today. SMP architecture. SMP architecture. Lecture 26: Multiprocessing continued Computer Architecture and Systems Programming ( ) Lecture 26: Multiprocessing continued Computer Architecture and Systems Programming (252-0061-00) Timothy Roscoe Herbstsemester 2012 Systems Group Department of Computer Science ETH Zürich SMP architecture

More information

CS6303 Computer Architecture Regulation 2013 BE-Computer Science and Engineering III semester 2 MARKS

CS6303 Computer Architecture Regulation 2013 BE-Computer Science and Engineering III semester 2 MARKS CS6303 Computer Architecture Regulation 2013 BE-Computer Science and Engineering III semester 2 MARKS UNIT-I OVERVIEW & INSTRUCTIONS 1. What are the eight great ideas in computer architecture? The eight

More information

Lecture 7: Parallel Processing

Lecture 7: Parallel Processing Lecture 7: Parallel Processing Introduction and motivation Architecture classification Performance evaluation Interconnection network Zebo Peng, IDA, LiTH 1 Performance Improvement Reduction of instruction

More information

Millisort: An Experiment in Granular Computing. Seo Jin Park with Yilong Li, Collin Lee and John Ousterhout

Millisort: An Experiment in Granular Computing. Seo Jin Park with Yilong Li, Collin Lee and John Ousterhout Millisort: An Experiment in Granular Computing Seo Jin Park with Yilong Li, Collin Lee and John Ousterhout Massively Parallel Granular Computing Massively parallel computing as an application of granular

More information

LRU. Pseudo LRU A B C D E F G H A B C D E F G H H H C. Copyright 2012, Elsevier Inc. All rights reserved.

LRU. Pseudo LRU A B C D E F G H A B C D E F G H H H C. Copyright 2012, Elsevier Inc. All rights reserved. LRU A list to keep track of the order of access to every block in the set. The least recently used block is replaced (if needed). How many bits we need for that? 27 Pseudo LRU A B C D E F G H A B C D E

More information

Parallel Programming Multicore systems

Parallel Programming Multicore systems FYS3240 PC-based instrumentation and microcontrollers Parallel Programming Multicore systems Spring 2011 Lecture #9 Bekkeng, 4.4.2011 Introduction Until recently, innovations in processor technology have

More information

Parallel graph traversal for FPGA

Parallel graph traversal for FPGA LETTER IEICE Electronics Express, Vol.11, No.7, 1 6 Parallel graph traversal for FPGA Shice Ni a), Yong Dou, Dan Zou, Rongchun Li, and Qiang Wang National Laboratory for Parallel and Distributed Processing,

More information

Lecture 13: March 25

Lecture 13: March 25 CISC 879 Software Support for Multicore Architectures Spring 2007 Lecture 13: March 25 Lecturer: John Cavazos Scribe: Ying Yu 13.1. Bryan Youse-Optimization of Sparse Matrix-Vector Multiplication on Emerging

More information

Survey on Virtual Memory

Survey on Virtual Memory Survey on Virtual Memory 1. Introduction Today, Computers are a part of our everyday life. We want to make our tasks easier and we try to achieve these by some means of tools. Our next preference would

More information

TUNING CUDA APPLICATIONS FOR MAXWELL

TUNING CUDA APPLICATIONS FOR MAXWELL TUNING CUDA APPLICATIONS FOR MAXWELL DA-07173-001_v7.0 March 2015 Application Note TABLE OF CONTENTS Chapter 1. Maxwell Tuning Guide... 1 1.1. NVIDIA Maxwell Compute Architecture... 1 1.2. CUDA Best Practices...2

More information

1. Introduction. Traditionally, a high bandwidth file system comprises a supercomputer with disks connected

1. Introduction. Traditionally, a high bandwidth file system comprises a supercomputer with disks connected 1. Introduction Traditionally, a high bandwidth file system comprises a supercomputer with disks connected by a high speed backplane bus such as SCSI [3][4] or Fibre Channel [2][67][71]. These systems

More information

COMPUTER ORGANIZATION AND DESIGN The Hardware/Software Interface. 5 th. Edition. Chapter 5. Large and Fast: Exploiting Memory Hierarchy

COMPUTER ORGANIZATION AND DESIGN The Hardware/Software Interface. 5 th. Edition. Chapter 5. Large and Fast: Exploiting Memory Hierarchy COMPUTER ORGANIZATION AND DESIGN The Hardware/Software Interface 5 th Edition Chapter 5 Large and Fast: Exploiting Memory Hierarchy Principle of Locality Programs access a small proportion of their address

More information

Accelerating Implicit LS-DYNA with GPU

Accelerating Implicit LS-DYNA with GPU Accelerating Implicit LS-DYNA with GPU Yih-Yih Lin Hewlett-Packard Company Abstract A major hindrance to the widespread use of Implicit LS-DYNA is its high compute cost. This paper will show modern GPU,

More information

Design of Parallel Algorithms. Course Introduction

Design of Parallel Algorithms. Course Introduction + Design of Parallel Algorithms Course Introduction + CSE 4163/6163 Parallel Algorithm Analysis & Design! Course Web Site: http://www.cse.msstate.edu/~luke/courses/fl17/cse4163! Instructor: Ed Luke! Office:

More information

Mark Sandstrom ThroughPuter, Inc.

Mark Sandstrom ThroughPuter, Inc. Hardware Implemented Scheduler, Placer, Inter-Task Communications and IO System Functions for Many Processors Dynamically Shared among Multiple Applications Mark Sandstrom ThroughPuter, Inc mark@throughputercom

More information

CS24: INTRODUCTION TO COMPUTING SYSTEMS. Spring 2014 Lecture 14

CS24: INTRODUCTION TO COMPUTING SYSTEMS. Spring 2014 Lecture 14 CS24: INTRODUCTION TO COMPUTING SYSTEMS Spring 2014 Lecture 14 LAST TIME! Examined several memory technologies: SRAM volatile memory cells built from transistors! Fast to use, larger memory cells (6+ transistors

More information