Filtering Metadata Lookups in Instruction-Grain Application Monitoring


EDIC RESEARCH PROPOSAL

Filtering Metadata Lookups in Instruction-Grain Application Monitoring

Yusuf Onur Kocberber
Parallel Systems Architecture Lab (PARSA)
Ecole Polytechnique Fédérale de Lausanne, Lausanne, Switzerland

Abstract — Dynamic Information Flow Tracking (DIFT) is a promising technique for instruction-grain application monitoring. DIFT detects software bugs by checking and analyzing every individual instruction at runtime. In software-only implementations of DIFT, performance degrades significantly (10-100x) because processor resources are shared between the application and the DIFT tool. Hardware-only implementations of DIFT eliminate this overhead, but they either focus on a specific monitoring tool or require invasive changes to the processor core. Log-Based Architectures (LBA) are flexible hardware frameworks that accelerate a wide range of instruction-grain DIFT tools. LBA leverages general-purpose multi-core chips with modest hardware modifications but still incurs a 3-5x slowdown. In this paper, we introduce a custom metadata cache to eliminate the slowdown in LBA. By profiling metadata, we observed that the vast majority of metadata lookups are redundant. Our mechanism performs fast metadata lookups and avoids invoking the monitoring functionality by filtering events.

I. INTRODUCTION

Software debugging and verification are becoming challenging as computing systems grow faster and more complex. Misbehaving systems negate all the design effort invested in increasing performance or reducing energy consumption. Bugs in complex software, often introduced by humans, are not only hard to catch but also hard to recreate.

This research plan has been approved: Proposal submitted to committee: September 30th, 2010; Candidacy exam date: October 7th, 2010; Candidacy exam committee: Giovanni De Micheli, Babak Falsafi, Paolo Ienne.
Date: Doctoral candidate: (name and signature); Thesis director: (name and signature); Thesis co-director (if applicable): (name and signature); Doct. prog. director: (R. Urbanke) (signature)

There have been tremendous efforts to remove or detect software bugs. These efforts include static tools that analyze software before execution, post-mortem tools that analyze software after it crashes, and dynamic tools that monitor an application as it executes. Lifeguards are dynamic tools that perform instruction-grain application monitoring to catch problems as the program executes. Instruction-grain monitoring collects very detailed information, such as the memory references of an instruction or branch address computations. The collected information is critical for diagnosing software problems such as memory access violations, data races, and security exploits. Moreover, dynamic monitoring prevents bugs and enables just-in-time notifications and on-the-fly fixes.

Dynamic Information Flow Tracking (DIFT) is a promising and widely known technique to detect software bugs at runtime [1]. The main idea of DIFT is to keep track of the status of data as the application executes. For example, when DIFT is used for security, it marks spurious data and tracks their propagation through the system. Lifeguards [2] are DIFT tools that monitor an application's execution at the instruction level. Metadata are associated with every byte of memory and every register. As the application executes, the metadata are updated and checked by lifeguards. Metadata checks ensure that an operation is safe or correct (e.g., not a memory access violation). In this study, we focus on three lifeguards: (i) TaintCheck [3], which detects security exploits, (ii) AddrCheck [4], which detects accesses to unallocated data, and (iii) Memcheck [5], which detects accesses to both unallocated and uninitialized data.
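As a concrete illustration of the DIFT idea described above, the following Python sketch (our own illustration with hypothetical names — none of this code appears in the cited tools) keeps one TaintCheck-style taint bit per application byte, propagates it on copies, and raises a security exception when tainted bytes supply a jump target:

```python
# Illustrative behavioral model of TaintCheck-style shadow metadata:
# one metadata bit per application byte (True = tainted).
class ShadowMemory:
    def __init__(self):
        self.taint = {}  # address -> bool (sparse shadow map)

    def mark_tainted(self, addr, size):
        # Data arriving from an untrusted channel (e.g., the network).
        for a in range(addr, addr + size):
            self.taint[a] = True

    def propagate(self, dst, src, size):
        # On a copy, the destination inherits the source's taint status;
        # otherwise the track of the spurious data would be lost.
        for i in range(size):
            self.taint[dst + i] = self.taint.get(src + i, False)

    def check_jump_target(self, addr):
        # Raise a security exception only when tainted bytes are used
        # in a critical way, such as supplying a control-transfer target.
        if self.taint.get(addr, False):
            raise RuntimeError("security exception: tainted jump target")

shadow = ShadowMemory()
shadow.mark_tainted(0x1000, 4)       # 4 bytes of network input
shadow.propagate(0x2000, 0x1000, 4)  # taint follows the copy
assert shadow.taint[0x2003]          # destination is now tainted
```

Note that merely marking data as tainted triggers no error; only the check at a critical use (here, `check_jump_target`) does, mirroring the check/propagate split discussed in Section II.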
Unfortunately, instruction-grain lifeguards are slow because, for every instruction the application executes, the lifeguard must take an action according to its functionality. There are both software and hardware implementations of instruction-grain program monitoring. Software-only implementations of lifeguards, based on Dynamic Binary Instrumentation (DBI), require no modification to the existing system and no recompilation. However, hardware resource sharing between the application and the monitoring tool typically results in a 10-100x slowdown.

Figure 1: Three design alternatives for hardware-accelerated DIFT: (a) in-core DIFT, (b) off-core coprocessor DIFT, and (c) multi-core offloading DIFT.

Hardware implementations remove resource sharing. Figure 1 shows three alternatives for hardware-accelerated DIFT: (i) integrated in-core DIFT, (ii) off-core coprocessor DIFT, and (iii) multi-core-based offloading DIFT. Integrated in-core DIFT performs checks in parallel with the processor pipeline. The slowdown is eliminated, but significant changes to the core are required. Off-core coprocessor DIFT uses a small (compared to the main core) specialized core without modifying the existing core, but the dedicated hardware can only perform statically defined security checks. In contrast, multi-core DIFT offers a flexible, general-purpose framework.

Log-Based Architectures (LBA), which are multi-core DIFT frameworks, use hardware as a substrate for instruction-grain monitoring. LBA is built on Chip Multiprocessors (CMP), where the application runs on one core and the application-monitoring lifeguard runs on another. The application core and the monitoring core communicate through a log buffer, so the two cores are decoupled from each other. Because application monitoring is performed on a general-purpose core (enhanced with log capturing), any lifeguard can be used without changing the hardware framework. LBA has a performance overhead of 3-5x. Our aim is to further reduce this overhead. We propose a custom metadata cache, a lightweight hardware mechanism, to filter out metadata lookups in instruction-grain application monitoring.

The remainder of the paper is organized as follows. Section 2 discusses three design alternatives for hardware DIFT.
Section 3 presents the idea of filtering metadata lookups and the design of the custom metadata cache. Finally, Section 4 offers our conclusions.

II. RELATED WORK

Lifeguards monitor the application to check for possible misbehaviours. This work focuses on the TaintCheck, AddrCheck, and Memcheck lifeguards.

TaintCheck [3] detects overwrite-related security exploits. TaintCheck monitors all unverified input data (e.g., data coming from the network) and marks memory locations as suspect, or tainted. For every register and application byte, it keeps a single metadata bit that shows whether the location is tainted. Data coming from insecure channels are tainted, and the tainted status is propagated to other locations when the tainted data are used during execution. A security exception is raised when tainted data are used in critical ways (e.g., as the program counter, as an instruction, or as sensitive function or system call arguments).

AddrCheck [4] detects accesses to unallocated data. Memory allocation is tracked by intercepting malloc- and free-related system calls. AddrCheck maintains one accessibility bit per application byte, and an exception is raised if unallocated data are accessed. The accessibility property is not propagated during the application's execution.

Memcheck [6] extends AddrCheck to detect the use of uninitialized data. Memcheck maintains accessibility metadata, like AddrCheck, and extends the metadata with one initialization bit per application byte. Accessibility metadata are updated and checked as described above. A memory location is considered initialized if a constant value is assigned to it. Initialization metadata are propagated for every instruction, and the destination becomes uninitialized if at least one of the instruction's source operands is uninitialized. Initialization metadata are cleared after free system calls.

A. Integrated In-core Implementation of Lifeguards

Integrated in-core implementations of lifeguards (hardware-only lifeguards) perform metadata propagation and checks in parallel with the processor pipeline. Dedicated logic and storage are added to perform the parallel checks. Figure 3 depicts the design of the hardware-only lifeguard implementation. The integrated approach eliminates two main sources of slowdown in the software-only approach: (i) metadata checks and updates, and (ii) recreating the state of the application. Metadata checks and updates add minimal performance overhead because the metadata are maintained by dedicated logic. Recreating the state of the application is not needed because the lifeguard can access the state of the

metadata at any time during execution. Moreover, the inter-core communication overhead is avoided because the in-core DIFT approach does not need a separate core. However, in-core DIFT requires significant modifications to the existing core. For example, all pipeline stages must be modified to buffer the metadata associated with pending instructions.

Figure 3: Additional storage components (dark) needed to support hardware-only DIFT: decoupling queues, instruction tuples (PC, instruction, memory address, valid bit), and check logic alongside the main pipeline.

Suh et al. [1] propose DIFT, an integrated in-core implementation of the TaintCheck lifeguard. The operating system monitors input channels for spurious data, and the hardware tracks these data through the system by propagating and updating the metadata. As execution continues, metadata are checked transparently within the modified pipeline. If a suspect value is copied from one location to another, the value's metadata are copied as well; otherwise, the track of the spurious data would be lost. Although data are marked as unsafe, a security assertion does not necessarily occur for every event that uses unsafe metadata. A security assertion is raised only if the suspect data are used as an instruction or as a jump target address. Moreover, the DIFT technique also enables controlling the level of propagation according to different security policies and system resources. Suh et al. group instructions into four categories: (i) copy, (ii) computation, (iii) load, and (iv) store. Depending on the memory space and performance overheads, one can choose to track only a single group of instructions or all of them at the same time.

Suh et al. also target the metadata storage overhead. A naïve implementation of metadata management uses 12.5% of the storage and bandwidth resources of the memory.
However, as programs execute, a large part of memory remains unchanged. This observation leads the authors to propose an efficient metadata management system that varies the granularity of the metadata: every page is extended with two bits that record the granularity of the metadata in that page.

For the performance evaluation, 17 benchmarks of the SPEC CPU2000 suite are simulated. The security policy that tracks copy, load, and store instructions incurs 0.2% memory overhead, because 95% of pages are maintained at page-level granularity. There is no performance overhead for all but one benchmark, which has 0.3% overhead. Another security policy, which tracks all four instruction categories, has a performance overhead of 0.8% on average and 6% in the worst case.

Figure 2: Pipeline of the coprocessor DIFT.

DIFT has no performance overhead, but the modifications to the core are significant. These modifications not only can have a negative impact on design and verification time, but can also affect the clock frequency of the processor. Moreover, supporting only a limited number of security policies is not attractive, given the amount of modification needed.

B. Off-core Coprocessor Implementation of Lifeguards

Specialized processors or coprocessors are amenable to the straightforward operations of DIFT. However, DIFT must synchronize with the main processor periodically to prevent damage from bugs or security attacks. Fine-grain (instruction-grain) synchronization between the application and the monitoring tool is not practical, because the monitoring latency would directly affect overall system performance. Coarse-grain synchronization, on the other hand, decouples the monitoring functionality from the application until a synchronization point (e.g., a system call) is reached. Decoupling enables detaching the DIFT functionality from the main core. In general, DIFT does not need computationally complex operations.
Metadata are handled in small numbers of bytes, and the necessary action most of the time involves a few simple logic operations. For this reason, when decoupling is possible, specialized hardware is the best solution in terms of performance and physical area. This approach minimizes not only the amount of storage and bandwidth needed, but also the number of components (e.g., ALUs) in the processor.

Kannan et al. [7] propose an off-core DIFT coprocessor design that decouples the DIFT operations from the application. The DIFT coprocessor is specialized hardware that runs the TaintCheck lifeguard; Figure 2 shows the corresponding design. The application and the lifeguard are synchronized only at system calls, so the whole DIFT state and logic can be moved to the coprocessor. A small FIFO queue between the main core and the coprocessor enables decoupled execution. The main processor inserts an instruction tuple into the

decoupling queue. An instruction tuple is decoded by the coprocessor and contains the PC, the instruction (opcode, operands, etc.), and the memory addresses used. When the decoupling queue is full or a system call is encountered, the main processor must stall. If the decoupling queue fills up because the coprocessor is suffering cache misses, application performance degrades. However, such misses should occur at roughly the same time as the application's own misses, because we expect at least the same locality in the metadata, since they mirror the application data. The application slowdown thus partially hides the coprocessor cache misses, but the misses still cause memory contention between the coprocessor and the main processor.

The dedicated coprocessor approach runs on an FPGA-based full-system prototype and executes SPECint2000 applications with less than 1% performance overhead. The DIFT coprocessor is small and requires no modification to the design, pipeline, or layout of the general-purpose core, or to the cache hierarchy. The coprocessor implementation uses only 7% more resources than a RISC core. The main source of slowdown is the memory contention described earlier: an application that behaves badly (e.g., has poor locality) can increase the slowdown to 10%. On the other hand, the DIFT coprocessor approach is not flexible: it supports only a single lifeguard, while there are various lifeguards that are as important as TaintCheck. Even for the same lifeguard, if the metadata semantics change, the coprocessor must be redesigned.

C. General Hardware-Accelerated Solutions for Lifeguards

The previous hardware-assisted approaches either lack support for a variety of lifeguards or require significant changes to the existing processor. Leveraging multiple cores for DIFT, known as Log-Based Architectures (LBA) [2], provides flexible hardware for instruction-grain application monitoring. Figure 4 shows the LBA system.
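Both the coprocessor design above and the LBA design rely on a bounded buffer that decouples the application from the monitor and stalls the application when the buffer fills. The following Python sketch (our own behavioral model with hypothetical names, not the prototype's interface) captures that producer/consumer contract for the coprocessor's decoupling queue of instruction tuples:

```python
# Behavioral model of a bounded decoupling queue between the main core
# (producer) and the DIFT coprocessor (consumer).
from collections import deque, namedtuple

# An instruction tuple carries the PC, the instruction, and any
# memory address the instruction used (None if no memory access).
InstrTuple = namedtuple("InstrTuple", ["pc", "insn", "mem_addr"])

class DecouplingQueue:
    def __init__(self, capacity):
        self.capacity = capacity
        self.fifo = deque()
        self.main_core_stalls = 0  # times the producer had to stall

    def push(self, tup):
        # The main core stalls while the queue is full (modeled here
        # as a counter plus a failed push that the caller retries).
        if len(self.fifo) >= self.capacity:
            self.main_core_stalls += 1
            return False
        self.fifo.append(tup)
        return True

    def pop(self):
        # The coprocessor consumes tuples in program order.
        return self.fifo.popleft() if self.fifo else None

q = DecouplingQueue(capacity=2)
q.push(InstrTuple(0x400000, "load", 0x1000))
q.push(InstrTuple(0x400004, "add", None))
assert not q.push(InstrTuple(0x400008, "store", 0x2000))  # full: stall
q.pop()                                                   # consumer drains one
assert q.push(InstrTuple(0x400008, "store", 0x2000))      # retry succeeds
```

The stall-on-full behavior is exactly why a slow consumer (e.g., a coprocessor suffering cache misses) feeds back into application performance, as discussed above.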
When an instruction commits on the application side, an event record corresponding to that instruction is captured and compressed on core1. The record is then delivered to core2 through the log buffer, which is stored in the L2 cache. The application-monitoring lifeguard fetches the event records from the log buffer one by one. When an event record is fetched, it is first decompressed and then dispatched on core2. Because the lifeguard is just another application running on a core, the hardware does not need to know about the DIFT metadata or policies explicitly. Hence, a large variety of lifeguards can be supported simply by changing the application running on core2. In addition, the overhead of resource sharing and the need to recreate the hardware state are eliminated. The former does not exist because the application-monitoring lifeguard does not share resources with the application; the latter is removed because the event record carries all the necessary information when the log is captured on the application side.

Figure 4: Dual-core Log-Based Architecture system: core1 captures and compresses the log; core2 decompresses it and runs the lifeguard analysis (with IT & IF support).

However, the instruction-grain nature of the monitoring still causes slowdowns, because nearly every instruction running on core1 (the application core) needs an action on core2 (the lifeguard core), which means several lifeguard instructions per application instruction. As a result, although the LBA framework reduces the performance overhead significantly, there is still a 3-5x slowdown.

Chen et al. [8] propose three hardware mechanisms to reduce the overhead of the LBA framework: Unary Inheritance Tracking, Idempotent Filters, and the Metadata-TLB. Unary Inheritance Tracking (IT) aims at reducing the cost of metadata propagation events. Propagation tracking is one of the key sources of overhead in the DIFT technique.
For lifeguards that track the flow of data through registers extensively (e.g., TaintCheck, Memcheck), metadata update and propagation events account for a large part of the execution time. IT tracks the inheritance of metadata for unary operations and delivers update events to the lifeguard only when necessary. For all lifeguards to which IT is applicable, the destination operand of binary operations is set clean. However, the implementation of IT may differ among lifeguards depending on their metadata semantics. For Memcheck, a non-unary operation's destination operand is set clean, and the source operands are sent to the lifeguard so their metadata can be checked. If the source operands are uninitialized, an error is issued at this point; hence, cascading errors caused by the same source are eliminated before the metadata are propagated. For TaintCheck, checking only the unary operations is, for practical purposes, enough to detect security exploits. Inheritance Tracking can reduce update events by 24-74%.

The Idempotent Filter (IF) targets lifeguards that perform metadata checks very frequently (e.g., AddrCheck, Memcheck). Many checks can be filtered because they are idempotent (redundant). For example, after a memory location is allocated, subsequent loads and/or stores to the same address do not need to be checked until the next free event. The Idempotent Filter is designed as a cache, and even with a

small size (32 entries), it can filter nearly 50% of AddrCheck's check events.

The Metadata-TLB (M-TLB) attacks the cost of metadata mapping. As described earlier, metadata accesses are very common in a lifeguard's execution unless they are filtered. If an event needs to access metadata, it must first translate the application address into the metadata address. To achieve fast and efficient translation, the M-TLB, a TLB-like hardware structure, caches the latest address translations. The M-TLB is accessed by a new instruction called LMA (Load Metadata Address). The dynamic instruction count of lifeguards is thus reduced by 16-49%, because the lifeguard uses a single instruction to translate an address instead of several.

The LBA framework has a performance overhead of 3x for AddrCheck, 3.5x for TaintCheck, and 8x for Memcheck. Memcheck is a heavyweight lifeguard because it monitors both the allocation and the initialization status of the application. However, LBA is the only platform among those described that supports a variety of lifeguards, because LBA uses general-purpose cores that are not explicitly aware of the DIFT metadata or policies. Moreover, the modifications to the core are not significant.

III. FILTERING METADATA LOOKUPS

LBA, as a general-purpose framework, is an effective technique for instruction-grain application monitoring. Our aim is to further improve the framework and remove the performance overheads caused by the instruction-grain nature of the lifeguards.

Instruction-grain monitoring is slow. When an application executes under LBA, almost every instruction committed in the application's pipeline results in the execution of a corresponding event handler on the lifeguard side. An event handler is a set of instructions that takes the necessary actions according to the lifeguard's functionality.
Therefore, the number of instructions executed on the lifeguard side is roughly the number of instructions executed on the application side multiplied by the average handler size. There are two ways to reduce the slowdown of LBA: (i) reduce the average handler size, and (ii) reduce the number of events processed by the lifeguard. The average handler size is directly related to the lifeguard's functionality, and reducing the instruction count is possible only by exploiting a well-accepted property that applies to all lifeguards. For example, the M-TLB reduces the number of instructions in each handler by removing the metadata mapping operation from all event handlers across different lifeguards. Although nearly all handlers include address translation, the resulting reduction in instruction count is still not enough to overcome the performance overheads. The second option is to reduce the number of event handlers dispatched. Metadata checks are frequent, but updates are not. An event handler can be filtered when executing it would not change the metadata state or when the check would not assert any error.

Figure 5: Filtering effectiveness of the metadata cache for AddrCheck, Memcheck, and TaintCheck.

We propose a general, efficient, and practical way of filtering out metadata lookups. A general technique should be compatible with the metadata semantics of different lifeguards. An efficient technique should have a high filtering rate (i.e., filtering should be the common case throughout the lifeguards' execution). A practical technique should be implementable with a modest amount of hardware (i.e., less area than an L1 cache in modern processors).

A significant percentage of metadata lookups are redundant. Every lifeguard checks the state of metadata in order to monitor the application, and the metadata semantics imply that there is a clean value. For example, for the AddrCheck lifeguard, if the accessed memory location is allocated, then the metadata value is clean.
For TaintCheck, if the value used is not spurious, it is clean; for Memcheck, accessed data are clean when they are both allocated and initialized. Fortunately, metadata lookups return a clean value for the vast majority of checks. Although applications have bugs, the number of bugs is negligible compared to the number of checks performed in an instruction-grain monitoring environment. This observation implies that most metadata lookups are redundant: if the metadata already have the desired property (a clean value), there is no need to dispatch a handler to verify it. Indeed, profiling experiments with Valgrind [6] show that the majority of accessed metadata are clean. Figure 5 depicts the filtering effectiveness of the metadata cache for three diverse lifeguards, where each bar corresponds to the fraction of clean accesses in the SPECint2000 benchmarks with the ref input. Our profiling results thus show that there is a large filtering opportunity to be exploited.

Fast hardware lookups with a custom metadata cache. In order to filter out the redundant log entries, we need to know, at any point of execution, the status of the data corresponding to the log entry being dispatched. Therefore, the status of registers and memory operands must be known when handlers are dispatched. For this purpose, we want to obtain the metadata state by performing a fast lookup in a small hardware structure.
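The fast-lookup filtering step motivated above can be sketched in software as follows. This is our own behavioral model with hypothetical names; a direct-mapped organization and per-block clean bits are chosen for brevity, and the proposal does not prescribe this particular organization:

```python
# Behavioral model of a small metadata cache indexed by APPLICATION
# addresses: a clean hit filters the event (no handler dispatched);
# a miss or a not-clean entry dispatches the normal event handler.
class CustomMetadataCache:
    def __init__(self, num_entries=32, block_size=4):
        self.num_entries = num_entries
        self.block_size = block_size
        self.tags = [None] * num_entries    # application-address tags
        self.clean = [False] * num_entries  # metadata state per block
        self.filtered = 0
        self.dispatched = 0

    def _index_tag(self, addr):
        block = addr // self.block_size
        return block % self.num_entries, block

    def dispatch(self, addr, metadata_is_clean):
        """Filter the event on a clean hit; otherwise run the handler.
        `metadata_is_clean` stands in for the handler's metadata mapping
        and lookup; it is used here only to fill the cache on a miss."""
        idx, tag = self._index_tag(addr)
        if self.tags[idx] == tag and self.clean[idx]:
            self.filtered += 1   # redundant check: no handler needed
            return
        self.dispatched += 1     # miss or not-clean: run the handler
        self.tags[idx] = tag
        self.clean[idx] = metadata_is_clean(addr)

# A freshly allocated region is clean, so repeated checks are filtered:
cmc = CustomMetadataCache()
for _ in range(10):
    cmc.dispatch(0x1000, lambda addr: True)
assert cmc.dispatched == 1 and cmc.filtered == 9
```

The model makes the key property concrete: once a clean block is cached, every further check of that block is filtered, so a high hit rate translates directly into a high filtering rate.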

Figure 6: Custom metadata cache (CMC) for filtering: on dispatch, the lifeguard core looks up the CMC; clean events are filtered, and fills are served by the L1 cache.

Figure 6 shows the custom metadata cache (CMC) we propose. It is characterized as custom because application addresses are used to access the metadata cache. When an event record is fetched, a lookup is performed in the CMC to check whether the event can be filtered. If the metadata value is clean, the corresponding event is filtered and no handler is dispatched. Conversely, if the metadata value is not clean, the event cannot be filtered and the corresponding handler is dispatched. CMC misses are also handled by event handlers, but in that case the handler performs only the address calculations for the lifeguard's metadata mapping; data operations such as evictions and fills are done in hardware. The next level below the CMC in the hierarchy is the L1 cache, which serves the fill and evict requests. The filtering opportunity in metadata lookups can only be exploited with high hit rates. Our experiment shows that even with a memory-bound benchmark (e.g., mcf), the CMC hit rate is 95%; we expect even higher hit rates on other benchmarks. The only limitation is the miss penalty of the cache. The event handler dispatched on a cache miss should not increase the average handler size of the lifeguard.

Using a conventional L1 cache is also an option; however, the required address translation from application address to physical address adds extra delay to the critical path, whereas our technique is built into the dispatch logic and every lookup adds just one extra cycle. Furthermore, the custom metadata cache is smaller than conventional caches because, for the three diverse lifeguards we studied, one byte of application data can be represented by two bits of metadata. As a result, the custom metadata cache can be smaller and faster than a conventional L1 cache, while maintaining a miss rate as low as that of the application core's L1 cache.

IV. CONCLUSIONS

In this work, we discussed three hardware design alternatives for instruction-grain application monitoring. Hardware mechanisms that are integrated into the core are not feasible because of their negative impact on design and verification time. The hardware coprocessor approach requires minimal modifications to the system but lacks flexibility. Finally, the multi-core DIFT (LBA) framework is the desired solution, supporting diverse lifeguards with minimal changes to the system. To eliminate the performance overheads of LBA completely, we propose a custom metadata cache to filter metadata lookups in hardware. Profiling results show that the vast majority of accesses to metadata are clean. Hence, the custom metadata cache will eliminate the slowdown by filtering the redundant metadata checks of all lifeguards.

V. REFERENCES

[1] G. E. Suh, J. W. Lee, D. Zhang, and S. Devadas, "Secure program execution via dynamic information flow tracking," in Architectural Support for Programming Languages and Operating Systems, Boston, 2004.
[2] S. Chen, B. Falsafi, P. B. Gibbons, M. Kozuch, T. C. Mowry, R. Teodorescu, A. Ailamaki, L. Fix, G. R. Ganger, B. Lin, and S. W. Schlosser, "Log-based architectures for general-purpose monitoring of deployed code," in Architectural and System Support for Improving Software Dependability, New York, 2006.
[3] J. Newsome and D. Song, "Dynamic Taint Analysis for Automatic Detection, Analysis, and Signature Generation of Exploits on Commodity Software," in Network and Distributed System Security, San Diego.
[4] N. Nethercote, "Dynamic binary analysis and instrumentation," PhD thesis, U. Cambridge.
[5] J. Seward and N. Nethercote, "Using Valgrind to detect undefined value errors with bit-precision," in USENIX Annual Technical Conference, Berkeley, 2005.
[6] N. Nethercote and J. Seward, "Valgrind: a framework for heavyweight dynamic binary instrumentation," in Programming Language Design and Implementation, San Diego, 2007.
[7] H. Kannan, M. Dalton, and C. Kozyrakis, "Decoupling Dynamic Information Flow Tracking with a dedicated coprocessor," in Dependable Systems & Networks, Estoril, 2009.
[8] S. Chen, M. Kozuch, T. Strigkos, B. Falsafi, P. B. Gibbons, T. C. Mowry, V. Ramachandran, O. Ruwase, M. Ryan, and E. Vlachos, "Flexible Hardware Acceleration for Instruction-Grain Program Monitoring," in International Symposium on Computer Architecture, Washington, 2008.


More information

Handout 2 ILP: Part B

Handout 2 ILP: Part B Handout 2 ILP: Part B Review from Last Time #1 Leverage Implicit Parallelism for Performance: Instruction Level Parallelism Loop unrolling by compiler to increase ILP Branch prediction to increase ILP

More information

Chapter 03. Authors: John Hennessy & David Patterson. Copyright 2011, Elsevier Inc. All rights Reserved. 1

Chapter 03. Authors: John Hennessy & David Patterson. Copyright 2011, Elsevier Inc. All rights Reserved. 1 Chapter 03 Authors: John Hennessy & David Patterson Copyright 2011, Elsevier Inc. All rights Reserved. 1 Figure 3.3 Comparison of 2-bit predictors. A noncorrelating predictor for 4096 bits is first, followed

More information

High Performance SMIPS Processor

High Performance SMIPS Processor High Performance SMIPS Processor Jonathan Eastep 6.884 Final Project Report May 11, 2005 1 Introduction 1.1 Description This project will focus on producing a high-performance, single-issue, in-order,

More information

Aries: Transparent Execution of PA-RISC/HP-UX Applications on IPF/HP-UX

Aries: Transparent Execution of PA-RISC/HP-UX Applications on IPF/HP-UX Aries: Transparent Execution of PA-RISC/HP-UX Applications on IPF/HP-UX Keerthi Bhushan Rajesh K Chaurasia Hewlett-Packard India Software Operations 29, Cunningham Road Bangalore 560 052 India +91-80-2251554

More information

Improving Data Cache Performance via Address Correlation: An Upper Bound Study

Improving Data Cache Performance via Address Correlation: An Upper Bound Study Improving Data Cache Performance via Address Correlation: An Upper Bound Study Peng-fei Chuang 1, Resit Sendag 2, and David J. Lilja 1 1 Department of Electrical and Computer Engineering Minnesota Supercomputing

More information

Introduction to Operating Systems Prof. Chester Rebeiro Department of Computer Science and Engineering Indian Institute of Technology, Madras

Introduction to Operating Systems Prof. Chester Rebeiro Department of Computer Science and Engineering Indian Institute of Technology, Madras Introduction to Operating Systems Prof. Chester Rebeiro Department of Computer Science and Engineering Indian Institute of Technology, Madras Week - 04 Lecture 17 CPU Context Switching Hello. In this video

More information

Pipelined processors and Hazards

Pipelined processors and Hazards Pipelined processors and Hazards Two options Processor HLL Compiler ALU LU Output Program Control unit 1. Either the control unit can be smart, i,e. it can delay instruction phases to avoid hazards. Processor

More information

Profiling-Based L1 Data Cache Bypassing to Improve GPU Performance and Energy Efficiency

Profiling-Based L1 Data Cache Bypassing to Improve GPU Performance and Energy Efficiency Profiling-Based L1 Data Cache Bypassing to Improve GPU Performance and Energy Efficiency Yijie Huangfu and Wei Zhang Department of Electrical and Computer Engineering Virginia Commonwealth University {huangfuy2,wzhang4}@vcu.edu

More information

FADE: A Programmable Filtering Accelerator for Instruction-Grain Monitoring

FADE: A Programmable Filtering Accelerator for Instruction-Grain Monitoring In Proceedings of the th International Symposium On High Performance Computer Architecture (HPCA 4) FADE: A Programmable Filtering Accelerator for Instruction-Grain Monitoring Sotiria Fytraki, Evangelos

More information

Multithreaded Processors. Department of Electrical Engineering Stanford University

Multithreaded Processors. Department of Electrical Engineering Stanford University Lecture 12: Multithreaded Processors Department of Electrical Engineering Stanford University http://eeclass.stanford.edu/ee382a Lecture 12-1 The Big Picture Previous lectures: Core design for single-thread

More information

A Streaming Multi-Threaded Model

A Streaming Multi-Threaded Model A Streaming Multi-Threaded Model Extended Abstract Eylon Caspi, André DeHon, John Wawrzynek September 30, 2001 Summary. We present SCORE, a multi-threaded model that relies on streams to expose thread

More information

TDT Coarse-Grained Multithreading. Review on ILP. Multi-threaded execution. Contents. Fine-Grained Multithreading

TDT Coarse-Grained Multithreading. Review on ILP. Multi-threaded execution. Contents. Fine-Grained Multithreading Review on ILP TDT 4260 Chap 5 TLP & Hierarchy What is ILP? Let the compiler find the ILP Advantages? Disadvantages? Let the HW find the ILP Advantages? Disadvantages? Contents Multi-threading Chap 3.5

More information

Banshee: Bandwidth-Efficient DRAM Caching via Software/Hardware Cooperation!

Banshee: Bandwidth-Efficient DRAM Caching via Software/Hardware Cooperation! Banshee: Bandwidth-Efficient DRAM Caching via Software/Hardware Cooperation! Xiangyao Yu 1, Christopher Hughes 2, Nadathur Satish 2, Onur Mutlu 3, Srinivas Devadas 1 1 MIT 2 Intel Labs 3 ETH Zürich 1 High-Bandwidth

More information

Chapter 9 Pipelining. Jin-Fu Li Department of Electrical Engineering National Central University Jungli, Taiwan

Chapter 9 Pipelining. Jin-Fu Li Department of Electrical Engineering National Central University Jungli, Taiwan Chapter 9 Pipelining Jin-Fu Li Department of Electrical Engineering National Central University Jungli, Taiwan Outline Basic Concepts Data Hazards Instruction Hazards Advanced Reliable Systems (ARES) Lab.

More information

Simultaneous Multithreading Architecture

Simultaneous Multithreading Architecture Simultaneous Multithreading Architecture Virendra Singh Indian Institute of Science Bangalore Lecture-32 SE-273: Processor Design For most apps, most execution units lie idle For an 8-way superscalar.

More information

1. Memory technology & Hierarchy

1. Memory technology & Hierarchy 1 Memory technology & Hierarchy Caching and Virtual Memory Parallel System Architectures Andy D Pimentel Caches and their design cf Henessy & Patterson, Chap 5 Caching - summary Caches are small fast memories

More information

Beyond ILP. Hemanth M Bharathan Balaji. Hemanth M & Bharathan Balaji

Beyond ILP. Hemanth M Bharathan Balaji. Hemanth M & Bharathan Balaji Beyond ILP Hemanth M Bharathan Balaji Multiscalar Processors Gurindar S Sohi Scott E Breach T N Vijaykumar Control Flow Graph (CFG) Each node is a basic block in graph CFG divided into a collection of

More information

Chapter 2. Parallel Hardware and Parallel Software. An Introduction to Parallel Programming. The Von Neuman Architecture

Chapter 2. Parallel Hardware and Parallel Software. An Introduction to Parallel Programming. The Von Neuman Architecture An Introduction to Parallel Programming Peter Pacheco Chapter 2 Parallel Hardware and Parallel Software 1 The Von Neuman Architecture Control unit: responsible for deciding which instruction in a program

More information

CS 252 Graduate Computer Architecture. Lecture 4: Instruction-Level Parallelism

CS 252 Graduate Computer Architecture. Lecture 4: Instruction-Level Parallelism CS 252 Graduate Computer Architecture Lecture 4: Instruction-Level Parallelism Krste Asanovic Electrical Engineering and Computer Sciences University of California, Berkeley http://wwweecsberkeleyedu/~krste

More information

Hardware Support for Software Debugging

Hardware Support for Software Debugging Hardware Support for Software Debugging Mohammad Amin Alipour Benjamin Depew Department of Computer Science Michigan Technological University Report Documentation Page Form Approved OMB No. 0704-0188 Public

More information

Darek Mihocka, Emulators.com Stanislav Shwartsman, Intel Corp. June

Darek Mihocka, Emulators.com Stanislav Shwartsman, Intel Corp. June Darek Mihocka, Emulators.com Stanislav Shwartsman, Intel Corp. June 21 2008 Agenda Introduction Gemulator Bochs Proposed ISA Extensions Conclusions and Future Work Q & A Jun-21-2008 AMAS-BT 2008 2 Introduction

More information

CS 426 Parallel Computing. Parallel Computing Platforms

CS 426 Parallel Computing. Parallel Computing Platforms CS 426 Parallel Computing Parallel Computing Platforms Ozcan Ozturk http://www.cs.bilkent.edu.tr/~ozturk/cs426/ Slides are adapted from ``Introduction to Parallel Computing'' Topic Overview Implicit Parallelism:

More information

Limiting the Number of Dirty Cache Lines

Limiting the Number of Dirty Cache Lines Limiting the Number of Dirty Cache Lines Pepijn de Langen and Ben Juurlink Computer Engineering Laboratory Faculty of Electrical Engineering, Mathematics and Computer Science Delft University of Technology

More information

COMPUTER ORGANIZATION AND DESI

COMPUTER ORGANIZATION AND DESI COMPUTER ORGANIZATION AND DESIGN 5 Edition th The Hardware/Software Interface Chapter 4 The Processor 4.1 Introduction Introduction CPU performance factors Instruction count Determined by ISA and compiler

More information

250P: Computer Systems Architecture. Lecture 9: Out-of-order execution (continued) Anton Burtsev February, 2019

250P: Computer Systems Architecture. Lecture 9: Out-of-order execution (continued) Anton Burtsev February, 2019 250P: Computer Systems Architecture Lecture 9: Out-of-order execution (continued) Anton Burtsev February, 2019 The Alpha 21264 Out-of-Order Implementation Reorder Buffer (ROB) Branch prediction and instr

More information

Raksha: A Flexible Information Flow Architecture for Software Security

Raksha: A Flexible Information Flow Architecture for Software Security Raksha: A Flexible Information Flow Architecture for Software Security Michael Dalton, Hari Kannan, Christos Kozyrakis Computer Systems Laboratory Stanford University {mwdalton, hkannan, kozyraki}@stanford.edu

More information

Dynamic Scheduling. CSE471 Susan Eggers 1

Dynamic Scheduling. CSE471 Susan Eggers 1 Dynamic Scheduling Why go out of style? expensive hardware for the time (actually, still is, relatively) register files grew so less register pressure early RISCs had lower CPIs Why come back? higher chip

More information

Hardware-based Speculation

Hardware-based Speculation Hardware-based Speculation Hardware-based Speculation To exploit instruction-level parallelism, maintaining control dependences becomes an increasing burden. For a processor executing multiple instructions

More information

CSF Cache Introduction. [Adapted from Computer Organization and Design, Patterson & Hennessy, 2005]

CSF Cache Introduction. [Adapted from Computer Organization and Design, Patterson & Hennessy, 2005] CSF Cache Introduction [Adapted from Computer Organization and Design, Patterson & Hennessy, 2005] Review: The Memory Hierarchy Take advantage of the principle of locality to present the user with as much

More information

Tradeoff between coverage of a Markov prefetcher and memory bandwidth usage

Tradeoff between coverage of a Markov prefetcher and memory bandwidth usage Tradeoff between coverage of a Markov prefetcher and memory bandwidth usage Elec525 Spring 2005 Raj Bandyopadhyay, Mandy Liu, Nico Peña Hypothesis Some modern processors use a prefetching unit at the front-end

More information

Embedded Systems Dr. Santanu Chaudhury Department of Electrical Engineering Indian Institute of Technology, Delhi

Embedded Systems Dr. Santanu Chaudhury Department of Electrical Engineering Indian Institute of Technology, Delhi Embedded Systems Dr. Santanu Chaudhury Department of Electrical Engineering Indian Institute of Technology, Delhi Lecture - 13 Virtual memory and memory management unit In the last class, we had discussed

More information

Spectre and Meltdown. Clifford Wolf q/talk

Spectre and Meltdown. Clifford Wolf q/talk Spectre and Meltdown Clifford Wolf q/talk 2018-01-30 Spectre and Meltdown Spectre (CVE-2017-5753 and CVE-2017-5715) Is an architectural security bug that effects most modern processors with speculative

More information

Data/Thread Level Speculation (TLS) in the Stanford Hydra Chip Multiprocessor (CMP)

Data/Thread Level Speculation (TLS) in the Stanford Hydra Chip Multiprocessor (CMP) Data/Thread Level Speculation (TLS) in the Stanford Hydra Chip Multiprocessor (CMP) A 4-core Chip Multiprocessor (CMP) based microarchitecture/compiler effort at Stanford that provides hardware/software

More information

SPECULATIVE MULTITHREADED ARCHITECTURES

SPECULATIVE MULTITHREADED ARCHITECTURES 2 SPECULATIVE MULTITHREADED ARCHITECTURES In this Chapter, the execution model of the speculative multithreading paradigm is presented. This execution model is based on the identification of pairs of instructions

More information

Outline EEL 5764 Graduate Computer Architecture. Chapter 3 Limits to ILP and Simultaneous Multithreading. Overcoming Limits - What do we need??

Outline EEL 5764 Graduate Computer Architecture. Chapter 3 Limits to ILP and Simultaneous Multithreading. Overcoming Limits - What do we need?? Outline EEL 7 Graduate Computer Architecture Chapter 3 Limits to ILP and Simultaneous Multithreading! Limits to ILP! Thread Level Parallelism! Multithreading! Simultaneous Multithreading Ann Gordon-Ross

More information

CAD for VLSI 2 Pro ject - Superscalar Processor Implementation

CAD for VLSI 2 Pro ject - Superscalar Processor Implementation CAD for VLSI 2 Pro ject - Superscalar Processor Implementation 1 Superscalar Processor Ob jective: The main objective is to implement a superscalar pipelined processor using Verilog HDL. This project may

More information

Data/Thread Level Speculation (TLS) in the Stanford Hydra Chip Multiprocessor (CMP)

Data/Thread Level Speculation (TLS) in the Stanford Hydra Chip Multiprocessor (CMP) Data/Thread Level Speculation (TLS) in the Stanford Hydra Chip Multiprocessor (CMP) Hydra is a 4-core Chip Multiprocessor (CMP) based microarchitecture/compiler effort at Stanford that provides hardware/software

More information

The Impact of Instruction Compression on I-cache Performance

The Impact of Instruction Compression on I-cache Performance Technical Report CSE-TR--97, University of Michigan The Impact of Instruction Compression on I-cache Performance I-Cheng K. Chen Peter L. Bird Trevor Mudge EECS Department University of Michigan {icheng,pbird,tnm}@eecs.umich.edu

More information

Key Point. What are Cache lines

Key Point. What are Cache lines Caching 1 Key Point What are Cache lines Tags Index offset How do we find data in the cache? How do we tell if it s the right data? What decisions do we need to make in designing a cache? What are possible

More information

Meltdown or "Holy Crap: How did we do this to ourselves" Meltdown exploits side effects of out-of-order execution to read arbitrary kernelmemory

Meltdown or Holy Crap: How did we do this to ourselves Meltdown exploits side effects of out-of-order execution to read arbitrary kernelmemory Meltdown or "Holy Crap: How did we do this to ourselves" Abstract Meltdown exploits side effects of out-of-order execution to read arbitrary kernelmemory locations Breaks all security assumptions given

More information

EECS 570 Final Exam - SOLUTIONS Winter 2015

EECS 570 Final Exam - SOLUTIONS Winter 2015 EECS 570 Final Exam - SOLUTIONS Winter 2015 Name: unique name: Sign the honor code: I have neither given nor received aid on this exam nor observed anyone else doing so. Scores: # Points 1 / 21 2 / 32

More information

The Nios II Family of Configurable Soft-core Processors

The Nios II Family of Configurable Soft-core Processors The Nios II Family of Configurable Soft-core Processors James Ball August 16, 2005 2005 Altera Corporation Agenda Nios II Introduction Configuring your CPU FPGA vs. ASIC CPU Design Instruction Set Architecture

More information

CS6303 Computer Architecture Regulation 2013 BE-Computer Science and Engineering III semester 2 MARKS

CS6303 Computer Architecture Regulation 2013 BE-Computer Science and Engineering III semester 2 MARKS CS6303 Computer Architecture Regulation 2013 BE-Computer Science and Engineering III semester 2 MARKS UNIT-I OVERVIEW & INSTRUCTIONS 1. What are the eight great ideas in computer architecture? The eight

More information

Wednesday, September 13, Chapter 4

Wednesday, September 13, Chapter 4 Wednesday, September 13, 2017 Topics for today Introduction to Computer Systems Static overview Operation Cycle Introduction to Pep/9 Features of the system Operational cycle Program trace Categories of

More information

Question 13 1: (Solution, p 4) Describe the inputs and outputs of a (1-way) demultiplexer, and how they relate.

Question 13 1: (Solution, p 4) Describe the inputs and outputs of a (1-way) demultiplexer, and how they relate. Questions 1 Question 13 1: (Solution, p ) Describe the inputs and outputs of a (1-way) demultiplexer, and how they relate. Question 13 : (Solution, p ) In implementing HYMN s control unit, the fetch cycle

More information

Data/Thread Level Speculation (TLS) in the Stanford Hydra Chip Multiprocessor (CMP)

Data/Thread Level Speculation (TLS) in the Stanford Hydra Chip Multiprocessor (CMP) Data/Thread Level Speculation (TLS) in the Stanford Hydra Chip Multiprocessor (CMP) Hydra ia a 4-core Chip Multiprocessor (CMP) based microarchitecture/compiler effort at Stanford that provides hardware/software

More information

Lecture 11: SMT and Caching Basics. Today: SMT, cache access basics (Sections 3.5, 5.1)

Lecture 11: SMT and Caching Basics. Today: SMT, cache access basics (Sections 3.5, 5.1) Lecture 11: SMT and Caching Basics Today: SMT, cache access basics (Sections 3.5, 5.1) 1 Thread-Level Parallelism Motivation: a single thread leaves a processor under-utilized for most of the time by doubling

More information

MCD: A Multiple Clock Domain Microarchitecture

MCD: A Multiple Clock Domain Microarchitecture MCD: A Multiple Clock Domain Microarchitecture Dave Albonesi in collaboration with Greg Semeraro Grigoris Magklis Rajeev Balasubramonian Steve Dropsho Sandhya Dwarkadas Michael Scott Caveats We started

More information

Sudhakar Yalamanchili, Georgia Institute of Technology (except as indicated) Active thread Idle thread

Sudhakar Yalamanchili, Georgia Institute of Technology (except as indicated) Active thread Idle thread Intra-Warp Compaction Techniques Sudhakar Yalamanchili, Georgia Institute of Technology (except as indicated) Goal Active thread Idle thread Compaction Compact threads in a warp to coalesce (and eliminate)

More information

The Design Complexity of Program Undo Support in a General-Purpose Processor

The Design Complexity of Program Undo Support in a General-Purpose Processor The Design Complexity of Program Undo Support in a General-Purpose Processor Radu Teodorescu and Josep Torrellas Department of Computer Science University of Illinois at Urbana-Champaign http://iacoma.cs.uiuc.edu

More information

ParalleX. A Cure for Scaling Impaired Parallel Applications. Hartmut Kaiser

ParalleX. A Cure for Scaling Impaired Parallel Applications. Hartmut Kaiser ParalleX A Cure for Scaling Impaired Parallel Applications Hartmut Kaiser (hkaiser@cct.lsu.edu) 2 Tianhe-1A 2.566 Petaflops Rmax Heterogeneous Architecture: 14,336 Intel Xeon CPUs 7,168 Nvidia Tesla M2050

More information

The ARM10 Family of Advanced Microprocessor Cores

The ARM10 Family of Advanced Microprocessor Cores The ARM10 Family of Advanced Microprocessor Cores Stephen Hill ARM Austin Design Center 1 Agenda Design overview Microarchitecture ARM10 o o Memory System Interrupt response 3. Power o o 4. VFP10 ETM10

More information

Chapter 4. Advanced Pipelining and Instruction-Level Parallelism. In-Cheol Park Dept. of EE, KAIST

Chapter 4. Advanced Pipelining and Instruction-Level Parallelism. In-Cheol Park Dept. of EE, KAIST Chapter 4. Advanced Pipelining and Instruction-Level Parallelism In-Cheol Park Dept. of EE, KAIST Instruction-level parallelism Loop unrolling Dependence Data/ name / control dependence Loop level parallelism

More information

Design of Out-Of-Order Superscalar Processor with Speculative Thread Level Parallelism

Design of Out-Of-Order Superscalar Processor with Speculative Thread Level Parallelism ISSN (Online) : 2319-8753 ISSN (Print) : 2347-6710 International Journal of Innovative Research in Science, Engineering and Technology Volume 3, Special Issue 3, March 2014 2014 International Conference

More information

Checker Processors. Smruti R. Sarangi. Department of Computer Science Indian Institute of Technology New Delhi, India

Checker Processors. Smruti R. Sarangi. Department of Computer Science Indian Institute of Technology New Delhi, India Advanced Department of Computer Science Indian Institute of Technology New Delhi, India Outline Introduction Advanced 1 Introduction 2 Checker Pipeline Checking Mechanism 3 Advanced Core Checker L1 Failure

More information

Chapter 8 & Chapter 9 Main Memory & Virtual Memory

Chapter 8 & Chapter 9 Main Memory & Virtual Memory Chapter 8 & Chapter 9 Main Memory & Virtual Memory 1. Various ways of organizing memory hardware. 2. Memory-management techniques: 1. Paging 2. Segmentation. Introduction Memory consists of a large array

More information

Chapter 6 Memory 11/3/2015. Chapter 6 Objectives. 6.2 Types of Memory. 6.1 Introduction

Chapter 6 Memory 11/3/2015. Chapter 6 Objectives. 6.2 Types of Memory. 6.1 Introduction Chapter 6 Objectives Chapter 6 Memory Master the concepts of hierarchical memory organization. Understand how each level of memory contributes to system performance, and how the performance is measured.

More information

Software Pipelining for Coarse-Grained Reconfigurable Instruction Set Processors

Software Pipelining for Coarse-Grained Reconfigurable Instruction Set Processors Software Pipelining for Coarse-Grained Reconfigurable Instruction Set Processors Francisco Barat, Murali Jayapala, Pieter Op de Beeck and Geert Deconinck K.U.Leuven, Belgium. {f-barat, j4murali}@ieee.org,

More information

Page 1. Multilevel Memories (Improving performance using a little cash )

Page 1. Multilevel Memories (Improving performance using a little cash ) Page 1 Multilevel Memories (Improving performance using a little cash ) 1 Page 2 CPU-Memory Bottleneck CPU Memory Performance of high-speed computers is usually limited by memory bandwidth & latency Latency

More information

Jackson Marusarz Intel Corporation

Jackson Marusarz Intel Corporation Jackson Marusarz Intel Corporation Intel VTune Amplifier Quick Introduction Get the Data You Need Hotspot (Statistical call tree), Call counts (Statistical) Thread Profiling Concurrency and Lock & Waits

More information

SUPERSCALAR AND VLIW PROCESSORS

SUPERSCALAR AND VLIW PROCESSORS Datorarkitektur I Fö 10-1 Datorarkitektur I Fö 10-2 What is a Superscalar Architecture? SUPERSCALAR AND VLIW PROCESSORS A superscalar architecture is one in which several instructions can be initiated

More information

Memory. Objectives. Introduction. 6.2 Types of Memory

Memory. Objectives. Introduction. 6.2 Types of Memory Memory Objectives Master the concepts of hierarchical memory organization. Understand how each level of memory contributes to system performance, and how the performance is measured. Master the concepts

More information

Pipelining, Instruction Level Parallelism and Memory in Processors. Advanced Topics ICOM 4215 Computer Architecture and Organization Fall 2010

Pipelining, Instruction Level Parallelism and Memory in Processors. Advanced Topics ICOM 4215 Computer Architecture and Organization Fall 2010 Pipelining, Instruction Level Parallelism and Memory in Processors Advanced Topics ICOM 4215 Computer Architecture and Organization Fall 2010 NOTE: The material for this lecture was taken from several

More information

Cache Memory COE 403. Computer Architecture Prof. Muhamed Mudawar. Computer Engineering Department King Fahd University of Petroleum and Minerals

Cache Memory COE 403. Computer Architecture Prof. Muhamed Mudawar. Computer Engineering Department King Fahd University of Petroleum and Minerals Cache Memory COE 403 Computer Architecture Prof. Muhamed Mudawar Computer Engineering Department King Fahd University of Petroleum and Minerals Presentation Outline The Need for Cache Memory The Basics

More information

A Cool Scheduler for Multi-Core Systems Exploiting Program Phases

A Cool Scheduler for Multi-Core Systems Exploiting Program Phases IEEE TRANSACTIONS ON COMPUTERS, VOL. 63, NO. 5, MAY 2014 1061 A Cool Scheduler for Multi-Core Systems Exploiting Program Phases Zhiming Zhang and J. Morris Chang, Senior Member, IEEE Abstract Rapid growth

More information

1. Creates the illusion of an address space much larger than the physical memory

1. Creates the illusion of an address space much larger than the physical memory Virtual memory Main Memory Disk I P D L1 L2 M Goals Physical address space Virtual address space 1. Creates the illusion of an address space much larger than the physical memory 2. Make provisions for

More information

CMSC 411 Computer Systems Architecture Lecture 13 Instruction Level Parallelism 6 (Limits to ILP & Threading)

CMSC 411 Computer Systems Architecture Lecture 13 Instruction Level Parallelism 6 (Limits to ILP & Threading) CMSC 411 Computer Systems Architecture Lecture 13 Instruction Level Parallelism 6 (Limits to ILP & Threading) Limits to ILP Conflicting studies of amount of ILP Benchmarks» vectorized Fortran FP vs. integer

More information

Donn Morrison Department of Computer Science. TDT4255 Memory hierarchies

Donn Morrison Department of Computer Science. TDT4255 Memory hierarchies TDT4255 Lecture 10: Memory hierarchies Donn Morrison Department of Computer Science 2 Outline Chapter 5 - Memory hierarchies (5.1-5.5) Temporal and spacial locality Hits and misses Direct-mapped, set associative,

More information

LECTURE 3: THE PROCESSOR

LECTURE 3: THE PROCESSOR LECTURE 3: THE PROCESSOR Abridged version of Patterson & Hennessy (2013):Ch.4 Introduction CPU performance factors Instruction count Determined by ISA and compiler CPI and Cycle time Determined by CPU

More information

Wednesday, February 4, Chapter 4

Wednesday, February 4, Chapter 4 Wednesday, February 4, 2015 Topics for today Introduction to Computer Systems Static overview Operation Cycle Introduction to Pep/8 Features of the system Operational cycle Program trace Categories of

More information

HW1 Solutions. Type Old Mix New Mix Cost CPI

HW1 Solutions. Type Old Mix New Mix Cost CPI HW1 Solutions Problem 1 TABLE 1 1. Given the parameters of Problem 6 (note that int =35% and shift=5% to fix typo in book problem), consider a strength-reducing optimization that converts multiplies by

More information

Review on ichat: Inter Cache Hardware Assistant Data Transfer for Heterogeneous Chip Multiprocessors. By: Anvesh Polepalli Raj Muchhala

Review on ichat: Inter Cache Hardware Assistant Data Transfer for Heterogeneous Chip Multiprocessors. By: Anvesh Polepalli Raj Muchhala Review on ichat: Inter Cache Hardware Assistant Data Transfer for Heterogeneous Chip Multiprocessors By: Anvesh Polepalli Raj Muchhala Introduction Integrating CPU and GPU into a single chip for performance

More information

A Comparison of Capacity Management Schemes for Shared CMP Caches

A Comparison of Capacity Management Schemes for Shared CMP Caches A Comparison of Capacity Management Schemes for Shared CMP Caches Carole-Jean Wu and Margaret Martonosi Princeton University 7 th Annual WDDD 6/22/28 Motivation P P1 P1 Pn L1 L1 L1 L1 Last Level On-Chip

More information

EE382A Lecture 7: Dynamic Scheduling. Department of Electrical Engineering Stanford University

EE382A Lecture 7: Dynamic Scheduling. Department of Electrical Engineering Stanford University EE382A Lecture 7: Dynamic Scheduling Department of Electrical Engineering Stanford University http://eeclass.stanford.edu/ee382a Lecture 7-1 Announcements Project proposal due on Wed 10/14 2-3 pages submitted

More information

Chapter 8. Pipelining
