Filtering Metadata Lookups in Instruction-Grain Application Monitoring


EDIC RESEARCH PROPOSAL

Filtering Metadata Lookups in Instruction-Grain Application Monitoring

Yusuf Onur Kocberber
Parallel Systems Architecture Lab (PARSA)
Ecole Polytechnique Fédérale de Lausanne, Lausanne, Switzerland

Abstract — Dynamic Information Flow Tracking (DIFT) is a promising technique for instruction-grain application monitoring. DIFT detects software bugs by checking and analyzing every individual instruction at runtime. In software-only implementations of DIFT, performance degrades significantly (10-100x) because processor resources are shared between the application and the DIFT tool. Hardware-only implementations of DIFT eliminate this overhead, but they either focus on a specific monitoring tool or require invasive changes to the processor core. Log-Based Architectures (LBA) are flexible hardware frameworks that accelerate a wide range of instruction-grain DIFT tools. LBA leverages general-purpose multi-core chips with modest hardware modifications but still incurs a 3-5x slowdown. In this paper, we introduce a custom metadata cache to eliminate the slowdown in LBA. By profiling metadata, we observed that the vast majority of metadata lookups are redundant. Our mechanism performs fast metadata lookups and avoids invoking the monitoring functionality by filtering events.

I. INTRODUCTION

Software debugging and verification are becoming challenging as computing systems grow faster and more complex. Misbehaving systems negate all the design effort invested in increasing performance or reducing energy consumption. Bugs in complex software, often introduced by humans, are not only hard to catch but also hard to recreate.

This research plan has been approved: Proposal submitted to committee: September 30th, 2010; Candidacy exam date: October 7th, 2010; Candidacy exam committee: Giovanni De Micheli, Babak Falsafi, Paolo Ienne.
Date: Doctoral candidate: (name and signature); Thesis director: (name and signature); Thesis co-director (if applicable): (name and signature); Doct. prog. director: (R. Urbanke) (signature)

There have been tremendous efforts to remove or detect software bugs. These efforts include static tools that analyze software before execution, post-mortem tools that analyze software after it crashes, and dynamic tools that monitor an application as it executes. Lifeguards are dynamic tools that perform instruction-grain application monitoring to catch problems as the program executes. Instruction-grain monitoring collects very detailed information, such as the memory references of an instruction or branch address computations. The collected information is critical for diagnosing software problems such as memory access violations, data races, and security exploits. Moreover, dynamic monitoring prevents bugs and enables just-in-time notifications and on-the-fly fixes.

Dynamic Information Flow Tracking (DIFT) is a promising and widely known technique to detect software bugs at runtime [1]. The main idea of DIFT is to keep track of the status of data as the application executes. For example, when DIFT is used for security, it marks spurious data and tracks their propagation through the system. Lifeguards [2] are DIFT tools that monitor an application's execution at the instruction level. Metadata are associated with every byte of memory and every register. As the application executes, the metadata are updated and checked by lifeguards. Metadata checks ensure that an operation is safe or correct (e.g., not a memory access violation). In this study, we focus on three lifeguards: (i) TaintCheck [3], which detects security exploits, (ii) AddrCheck [4], which detects accesses to unallocated data, and (iii) Memcheck [5], which detects accesses to both unallocated and uninitialized data.
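As a concrete illustration of the DIFT idea described above, the following Python sketch (our own illustration with hypothetical names — none of this code appears in the cited tools) keeps one TaintCheck-style taint bit per application byte, propagates it on copies, and raises a security exception when tainted bytes supply a jump target:

```python
# Illustrative behavioral model of TaintCheck-style shadow metadata:
# one metadata bit per application byte (True = tainted).
class ShadowMemory:
    def __init__(self):
        self.taint = {}  # address -> bool (sparse shadow map)

    def mark_tainted(self, addr, size):
        # Data arriving from an untrusted channel (e.g., the network).
        for a in range(addr, addr + size):
            self.taint[a] = True

    def propagate(self, dst, src, size):
        # On a copy, the destination inherits the source's taint status;
        # otherwise the track of the spurious data would be lost.
        for i in range(size):
            self.taint[dst + i] = self.taint.get(src + i, False)

    def check_jump_target(self, addr):
        # Raise a security exception only when tainted bytes are used
        # in a critical way, such as supplying a control-transfer target.
        if self.taint.get(addr, False):
            raise RuntimeError("security exception: tainted jump target")

shadow = ShadowMemory()
shadow.mark_tainted(0x1000, 4)       # 4 bytes of network input
shadow.propagate(0x2000, 0x1000, 4)  # taint follows the copy
assert shadow.taint[0x2003]          # destination is now tainted
```

Note that merely marking data as tainted triggers no error; only the check at a critical use (here, `check_jump_target`) does, mirroring the check/propagate split discussed in Section II.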
Unfortunately, instruction-grain lifeguards are slow because, for every instruction the application executes, the lifeguard must take an action according to its functionality. There are both software and hardware implementations of instruction-grain program monitoring. Software-only implementations of lifeguards, based on Dynamic Binary Instrumentation (DBI), require no modification to the existing system and no recompilation. However, hardware resource sharing between the application and the monitoring tool typically results in a 10-100x slowdown.

Figure 1: Three design alternatives for hardware-accelerated DIFT: (a) in-core DIFT, (b) off-core coprocessor DIFT, and (c) multi-core offloading DIFT.

Hardware implementations remove resource sharing. Figure 1 shows three alternatives for hardware-accelerated DIFT: (i) integrated in-core DIFT, (ii) off-core coprocessor DIFT, and (iii) multi-core-based offloading DIFT. Integrated in-core DIFT performs checks in parallel with the processor pipeline. The slowdown is eliminated, but significant changes to the core are required. Off-core coprocessor DIFT uses a small (compared to the main core) specialized core without modifying the existing core, but the dedicated hardware can only perform statically defined security checks. In contrast, multi-core DIFT offers a flexible, general-purpose framework.

Log-Based Architectures (LBA), which are multi-core DIFT frameworks, use hardware as a substrate for instruction-grain monitoring. LBA is built on Chip Multiprocessors (CMP), where the application runs on one core and the application-monitoring lifeguard runs on another. The application core and the monitoring core communicate through a log buffer, so the two cores are decoupled from each other. Because application monitoring is performed on a general-purpose core (enhanced with log capturing), any lifeguard can be used without changing the hardware framework. LBA has a performance overhead of 3-5x. Our aim is to further reduce this overhead. We propose a custom metadata cache, a lightweight hardware mechanism, to filter out metadata lookups in instruction-grain application monitoring.

The remainder of the paper is organized as follows. Section 2 discusses three design alternatives for hardware DIFT.
Section 3 presents the idea of filtering metadata lookups and the design of the custom metadata cache. Finally, Section 4 offers our conclusions.

II. RELATED WORK

Lifeguards monitor the application to check for possible misbehaviours. This work focuses on the TaintCheck, AddrCheck, and Memcheck lifeguards.

TaintCheck [3] detects overwrite-related security exploits. TaintCheck monitors all unverified input data (e.g., data coming from the network) and marks memory locations as suspect, or tainted. For every register and application byte, it keeps a single metadata bit that shows whether the location is tainted. Data coming from insecure channels are tainted, and the tainted status is propagated to other locations when the tainted data are used during execution. A security exception is raised when tainted data are used in critical ways (e.g., as the program counter, as an instruction, or as sensitive function or system call arguments).

AddrCheck [4] detects accesses to unallocated data. Memory allocation is tracked by intercepting malloc- and free-related system calls. AddrCheck maintains one accessibility bit per application byte, and an exception is raised if unallocated data are accessed. The accessibility property is not propagated during the application's execution.

Memcheck [6] extends AddrCheck to detect the use of uninitialized data. Memcheck maintains accessibility metadata, like AddrCheck, and extends the metadata with one initialization bit per application byte. Accessibility metadata are updated and checked as described above. A memory location is considered initialized if a constant value is assigned to it. Initialization metadata are propagated for every instruction, and the destination becomes uninitialized if at least one of the instruction's source operands is uninitialized. Initialization metadata are cleared after free system calls.

A. Integrated In-core Implementation of Lifeguards

Integrated in-core implementations of lifeguards (hardware-only lifeguards) perform metadata propagation and checks in parallel with the processor pipeline. Dedicated logic and storage are added to perform the parallel checks. Figure 3 depicts the design of the hardware-only lifeguard implementation. The integrated approach eliminates two main sources of slowdown in the software-only approach: (i) metadata checks and updates, and (ii) recreating the state of the application. Metadata checks and updates add minimal performance overhead because the metadata are maintained by dedicated logic. Recreating the state of the application is not needed because the lifeguard can access the state of the

metadata at any time during execution. Moreover, the inter-core communication overhead is avoided because the in-core DIFT approach does not need a separate core. However, in-core DIFT requires significant modifications to the existing core. For example, all pipeline stages must be modified to buffer the metadata associated with pending instructions.

Figure 3: Additional storage components (dark) needed to support hardware-only DIFT: decoupling queues, instruction tuples (PC, instruction, memory address, valid bit), and check logic alongside the main pipeline.

Suh et al. [1] propose DIFT, an integrated in-core implementation of the TaintCheck lifeguard. The operating system monitors input channels for spurious data, and the hardware tracks these data through the system by propagating and updating the metadata. As execution continues, metadata are checked transparently within the modified pipeline. If a suspect value is copied from one location to another, the value's metadata are copied as well; otherwise, the track of the spurious data would be lost. Although data are marked as unsafe, a security assertion does not necessarily occur for every event that uses unsafe metadata. A security assertion is raised only if the suspect data are used as an instruction or as a jump target address. Moreover, the DIFT technique also enables controlling the level of propagation according to different security policies and system resources. Suh et al. group instructions into four categories: (i) copy, (ii) computation, (iii) load, and (iv) store. Depending on the memory space and performance overheads, one can choose to track only a single group of instructions or all of them at the same time.

Suh et al. also target the metadata storage overhead. A naïve implementation of metadata management uses 12.5% of the storage and bandwidth resources of the memory.
However, as programs execute, a large part of memory remains unchanged. This observation leads the authors to propose an efficient metadata management system that varies the granularity of the metadata: every page is extended with two bits that record the granularity of the metadata in that page.

For the performance evaluation, 17 benchmarks of the SPEC CPU2000 suite are simulated. The security policy that tracks copy, load, and store instructions incurs 0.2% memory overhead, because 95% of pages are maintained at page-level granularity. There is no performance overhead for all but one benchmark, which has 0.3% overhead. Another security policy, which tracks all four instruction categories, has a performance overhead of 0.8% on average and 6% in the worst case.

Figure 2: Pipeline of the coprocessor DIFT.

DIFT has no performance overhead, but the modifications to the core are significant. These modifications not only can have a negative impact on design and verification time, but can also affect the clock frequency of the processor. Moreover, supporting only a limited number of security policies is not attractive, given the amount of modification needed.

B. Off-core Coprocessor Implementation of Lifeguards

Specialized processors or coprocessors are amenable to the straightforward operations of DIFT. However, DIFT must synchronize with the main processor periodically to prevent damage from bugs or security attacks. Fine-grain (instruction-grain) synchronization between the application and the monitoring tool is not practical, because the monitoring latency would directly affect overall system performance. Coarse-grain synchronization, on the other hand, decouples the monitoring functionality from the application until a synchronization point (e.g., a system call) is reached. Decoupling enables detaching the DIFT functionality from the main core. In general, DIFT does not need computationally complex operations.
Metadata are handled in small numbers of bytes, and the necessary action most of the time involves a few simple logic operations. For this reason, when decoupling is possible, specialized hardware is the best solution in terms of performance and physical area. This approach minimizes not only the amount of storage and bandwidth needed, but also the number of components (e.g., ALUs) in the processor.

Kannan et al. [7] propose an off-core DIFT coprocessor design that decouples the DIFT operations from the application. The DIFT coprocessor is specialized hardware that runs the TaintCheck lifeguard; Figure 2 shows the corresponding design. The application and the lifeguard are synchronized only at system calls, so the whole DIFT state and logic can be moved to the coprocessor. A small FIFO queue between the main core and the coprocessor enables decoupled execution. The main processor inserts an instruction tuple into the

decoupling queue. An instruction tuple is decoded by the coprocessor and contains the PC, the instruction (opcode, operands, etc.), and the memory addresses used. When the decoupling queue is full or a system call is encountered, the main processor must stall. If the decoupling queue fills up because the coprocessor is suffering cache misses, application performance degrades. However, such misses should occur at roughly the same time as the application's own misses, because we expect at least the same locality in the metadata, since they mirror the application data. The application slowdown thus partially hides the coprocessor cache misses, but the misses still cause memory contention between the coprocessor and the main processor.

The dedicated coprocessor approach runs on an FPGA-based full-system prototype and executes SPECint2000 applications with less than 1% performance overhead. The DIFT coprocessor is small and requires no modification to the design, pipeline, or layout of the general-purpose core, or to the cache hierarchy. The coprocessor implementation uses only 7% more resources than a RISC core. The main source of slowdown is the memory contention described earlier: an application that behaves badly (e.g., has poor locality) can increase the slowdown to 10%. On the other hand, the DIFT coprocessor approach is not flexible: it supports only a single lifeguard, while there are various lifeguards that are as important as TaintCheck. Even for the same lifeguard, if the metadata semantics change, the coprocessor must be redesigned.

C. General Hardware-Accelerated Solutions for Lifeguards

The previous hardware-assisted approaches either lack support for a variety of lifeguards or require significant changes to the existing processor. Leveraging multiple cores for DIFT, known as Log-Based Architectures (LBA) [2], provides flexible hardware for instruction-grain application monitoring. Figure 4 shows the LBA system.
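Both the coprocessor design above and the LBA design rely on a bounded buffer that decouples the application from the monitor and stalls the application when the buffer fills. The following Python sketch (our own behavioral model with hypothetical names, not the prototype's interface) captures that producer/consumer contract for the coprocessor's decoupling queue of instruction tuples:

```python
# Behavioral model of a bounded decoupling queue between the main core
# (producer) and the DIFT coprocessor (consumer).
from collections import deque, namedtuple

# An instruction tuple carries the PC, the instruction, and any
# memory address the instruction used (None if no memory access).
InstrTuple = namedtuple("InstrTuple", ["pc", "insn", "mem_addr"])

class DecouplingQueue:
    def __init__(self, capacity):
        self.capacity = capacity
        self.fifo = deque()
        self.main_core_stalls = 0  # times the producer had to stall

    def push(self, tup):
        # The main core stalls while the queue is full (modeled here
        # as a counter plus a failed push that the caller retries).
        if len(self.fifo) >= self.capacity:
            self.main_core_stalls += 1
            return False
        self.fifo.append(tup)
        return True

    def pop(self):
        # The coprocessor consumes tuples in program order.
        return self.fifo.popleft() if self.fifo else None

q = DecouplingQueue(capacity=2)
q.push(InstrTuple(0x400000, "load", 0x1000))
q.push(InstrTuple(0x400004, "add", None))
assert not q.push(InstrTuple(0x400008, "store", 0x2000))  # full: stall
q.pop()                                                   # consumer drains one
assert q.push(InstrTuple(0x400008, "store", 0x2000))      # retry succeeds
```

The stall-on-full behavior is exactly why a slow consumer (e.g., a coprocessor suffering cache misses) feeds back into application performance, as discussed above.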
When an instruction commits on the application side, an event record corresponding to that instruction is captured and compressed on core1. The record is then delivered to core2 through the log buffer, which is stored in the L2 cache. The application-monitoring lifeguard fetches the event records from the log buffer one by one. When an event record is fetched, it is first decompressed and then dispatched on core2. Because the lifeguard is just another application running on a core, the hardware does not need to know about the DIFT metadata or policies explicitly. Hence, a large variety of lifeguards can be supported simply by changing the application running on core2. In addition, the overhead of resource sharing and the need to recreate the hardware state are eliminated. The former does not exist because the application-monitoring lifeguard does not share resources with the application; the latter is removed because the event record carries all the necessary information when the log is captured on the application side.

Figure 4: Dual-core Log-Based Architecture system: core1 captures and compresses the log; core2 decompresses it and runs the lifeguard analysis (with IT & IF support).

However, the instruction-grain nature of the monitoring still causes slowdowns, because nearly every instruction running on core1 (the application core) needs an action on core2 (the lifeguard core), which means several lifeguard instructions per application instruction. As a result, although the LBA framework reduces the performance overhead significantly, there is still a 3-5x slowdown.

Chen et al. [8] propose three hardware mechanisms to reduce the overhead of the LBA framework: Unary Inheritance Tracking, Idempotent Filters, and the Metadata-TLB. Unary Inheritance Tracking (IT) aims at reducing the cost of metadata propagation events. Propagation tracking is one of the key sources of overhead in the DIFT technique.
For lifeguards that track the flow of data through registers extensively (e.g., TaintCheck, Memcheck), metadata update and propagation events account for a large part of the execution time. IT tracks the inheritance of metadata for unary operations and delivers update events to the lifeguard only when necessary. For all lifeguards to which IT is applicable, the destination operand of binary operations is set clean. However, the implementation of IT may differ among lifeguards depending on their metadata semantics. For Memcheck, a non-unary operation's destination operand is set clean, and the source operands are sent to the lifeguard so their metadata can be checked. If the source operands are uninitialized, an error is issued at this point; hence, cascading errors caused by the same source are eliminated before the metadata are propagated. For TaintCheck, checking only the unary operations is, for practical purposes, enough to detect security exploits. Inheritance Tracking can reduce update events by 24-74%.

The Idempotent Filter (IF) targets lifeguards that perform metadata checks very frequently (e.g., AddrCheck, Memcheck). Many checks can be filtered because they are idempotent (redundant). For example, after a memory location is allocated, subsequent loads and/or stores to the same address do not need to be checked until the next free event. The Idempotent Filter is designed as a cache, and even with a

small size (32 entries), it can filter nearly 50% of AddrCheck's check events.

The Metadata-TLB (M-TLB) attacks the cost of metadata mapping. As described earlier, metadata accesses are very common in a lifeguard's execution unless they are filtered. If an event needs to access metadata, it must first translate the application address into the metadata address. To achieve fast and efficient translation, the M-TLB, a TLB-like hardware structure, caches the latest address translations. The M-TLB is accessed by a new instruction called LMA (Load Metadata Address). The dynamic instruction count of lifeguards is thus reduced by 16-49%, because the lifeguard uses a single instruction to translate an address instead of several.

The LBA framework has a performance overhead of 3x for AddrCheck, 3.5x for TaintCheck, and 8x for Memcheck. Memcheck is a heavyweight lifeguard because it monitors both the allocation and the initialization status of the application. However, LBA is the only platform among those described that supports a variety of lifeguards, because LBA uses general-purpose cores that are not explicitly aware of the DIFT metadata or policies. Moreover, the modifications to the core are not significant.

III. FILTERING METADATA LOOKUPS

LBA, as a general-purpose framework, is an effective technique for instruction-grain application monitoring. Our aim is to further improve the framework and remove the performance overheads caused by the instruction-grain nature of the lifeguards.

Instruction-grain monitoring is slow. When an application executes under LBA, almost every instruction committed in the application's pipeline results in the execution of a corresponding event handler on the lifeguard side. An event handler is a set of instructions that takes the necessary actions according to the lifeguard's functionality.
Therefore, the number of instructions executed on the lifeguard side is roughly the number of instructions executed on the application side multiplied by the average handler size. There are two ways to reduce the slowdown of LBA: (i) reduce the average handler size, and (ii) reduce the number of events processed by the lifeguard. The average handler size is directly related to the lifeguard's functionality, and reducing the instruction count is possible only by exploiting a well-accepted property that applies to all lifeguards. For example, the M-TLB reduces the number of instructions in each handler by removing the metadata mapping operation from all event handlers across different lifeguards. Although nearly all handlers include address translation, the resulting reduction in instruction count is still not enough to overcome the performance overheads. The second option is to reduce the number of event handlers dispatched. Metadata checks are frequent, but updates are not. An event handler can be filtered when executing it would not change the metadata state or when the check would not assert any error.

Figure 5: Filtering effectiveness of the metadata cache for AddrCheck, Memcheck, and TaintCheck.

We propose a general, efficient, and practical way of filtering out metadata lookups. A general technique should be compatible with the metadata semantics of different lifeguards. An efficient technique should have a high filtering rate (i.e., filtering should be the common case throughout the lifeguards' execution). A practical technique should be implementable with a modest amount of hardware (i.e., less area than an L1 cache in modern processors).

A significant percentage of metadata lookups are redundant. Every lifeguard checks the state of metadata in order to monitor the application, and the metadata semantics imply that there is a clean value. For example, for the AddrCheck lifeguard, if the accessed memory location is allocated, then the metadata value is clean.
For TaintCheck, if the value used is not spurious, it is clean; for Memcheck, accessed data are clean when they are both allocated and initialized. Fortunately, metadata lookups return a clean value for the vast majority of checks. Although applications have bugs, the number of bugs is negligible compared to the number of checks performed in an instruction-grain monitoring environment. This observation implies that most metadata lookups are redundant: if the metadata already have the desired property (a clean value), there is no need to dispatch a handler to verify it. Indeed, profiling experiments with Valgrind [6] show that the majority of accessed metadata are clean. Figure 5 depicts the filtering effectiveness of the metadata cache for three diverse lifeguards, where each bar corresponds to the fraction of clean accesses in the SPECint2000 benchmarks with the ref input. Our profiling results thus show that there is a large filtering opportunity to be exploited.

Fast hardware lookups with a custom metadata cache. In order to filter out the redundant log entries, we need to know, at any point of execution, the status of the data corresponding to the log entry being dispatched. Therefore, the status of registers and memory operands must be known when handlers are dispatched. For this purpose, we want to obtain the metadata state by performing a fast lookup in a small hardware structure.
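The fast-lookup filtering step motivated above can be sketched in software as follows. This is our own behavioral model with hypothetical names; a direct-mapped organization and per-block clean bits are chosen for brevity, and the proposal does not prescribe this particular organization:

```python
# Behavioral model of a small metadata cache indexed by APPLICATION
# addresses: a clean hit filters the event (no handler dispatched);
# a miss or a not-clean entry dispatches the normal event handler.
class CustomMetadataCache:
    def __init__(self, num_entries=32, block_size=4):
        self.num_entries = num_entries
        self.block_size = block_size
        self.tags = [None] * num_entries    # application-address tags
        self.clean = [False] * num_entries  # metadata state per block
        self.filtered = 0
        self.dispatched = 0

    def _index_tag(self, addr):
        block = addr // self.block_size
        return block % self.num_entries, block

    def dispatch(self, addr, metadata_is_clean):
        """Filter the event on a clean hit; otherwise run the handler.
        `metadata_is_clean` stands in for the handler's metadata mapping
        and lookup; it is used here only to fill the cache on a miss."""
        idx, tag = self._index_tag(addr)
        if self.tags[idx] == tag and self.clean[idx]:
            self.filtered += 1   # redundant check: no handler needed
            return
        self.dispatched += 1     # miss or not-clean: run the handler
        self.tags[idx] = tag
        self.clean[idx] = metadata_is_clean(addr)

# A freshly allocated region is clean, so repeated checks are filtered:
cmc = CustomMetadataCache()
for _ in range(10):
    cmc.dispatch(0x1000, lambda addr: True)
assert cmc.dispatched == 1 and cmc.filtered == 9
```

The model makes the key property concrete: once a clean block is cached, every further check of that block is filtered, so a high hit rate translates directly into a high filtering rate.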

Figure 6: Custom metadata cache (CMC) for filtering: on dispatch, the lifeguard core looks up the CMC; clean events are filtered, and fills are served by the L1 cache.

Figure 6 shows the custom metadata cache (CMC) we propose. It is characterized as custom because application addresses are used to access the metadata cache. When an event record is fetched, a lookup is performed in the CMC to check whether the event can be filtered. If the metadata value is clean, the corresponding event is filtered and no handler is dispatched. Conversely, if the metadata value is not clean, the event cannot be filtered and the corresponding handler is dispatched. CMC misses are also handled by event handlers, but in that case the handler performs only the address calculations for the lifeguard's metadata mapping; data operations such as evictions and fills are done in hardware. The next level below the CMC in the hierarchy is the L1 cache, which serves the fill and evict requests. The filtering opportunity in metadata lookups can only be exploited with high hit rates. Our experiment shows that even with a memory-bound benchmark (e.g., mcf), the CMC hit rate is 95%; we expect even higher hit rates on other benchmarks. The only limitation is the miss penalty of the cache. The event handler dispatched on a cache miss should not increase the average handler size of the lifeguard.

Using a conventional L1 cache is also an option; however, the required address translation from application address to physical address adds extra delay to the critical path, whereas our technique is built into the dispatch logic and every lookup adds just one extra cycle. Furthermore, the custom metadata cache is smaller than conventional caches because, for the three diverse lifeguards we studied, one byte of application data can be represented by two bits of metadata. As a result, the custom metadata cache can be smaller and faster than a conventional L1 cache, while maintaining a miss rate as low as that of the application core's L1 cache.

IV. CONCLUSIONS

In this work, we discussed three hardware design alternatives for instruction-grain application monitoring. Hardware mechanisms that are integrated into the core are not feasible because of their negative impact on design and verification time. The hardware coprocessor approach requires minimal modifications to the system but lacks flexibility. Finally, the multi-core DIFT (LBA) framework is the desired solution, supporting diverse lifeguards with minimal changes to the system. To eliminate the performance overheads of LBA completely, we propose a custom metadata cache to filter metadata lookups in hardware. Profiling results show that the vast majority of accesses to metadata are clean. Hence, the custom metadata cache will eliminate the slowdown by filtering the redundant metadata checks of all lifeguards.

V. REFERENCES

[1] G. E. Suh, J. W. Lee, D. Zhang, and S. Devadas, "Secure program execution via dynamic information flow tracking," in Architectural Support for Programming Languages and Operating Systems, Boston, 2004.
[2] S. Chen, B. Falsafi, P. B. Gibbons, M. Kozuch, T. C. Mowry, R. Teodorescu, A. Ailamaki, L. Fix, G. R. Ganger, B. Lin, and S. W. Schlosser, "Log-based architectures for general-purpose monitoring of deployed code," in Architectural and System Support for Improving Software Dependability, New York, 2006.
[3] J. Newsome and D. Song, "Dynamic Taint Analysis for Automatic Detection, Analysis, and Signature Generation of Exploits on Commodity Software," in Network and Distributed System Security, San Diego.
[4] N. Nethercote, "Dynamic binary analysis and instrumentation," PhD thesis, U. Cambridge.
[5] J. Seward and N. Nethercote, "Using Valgrind to detect undefined value errors with bit-precision," in USENIX Annual Technical Conference, Berkeley, 2005.
[6] N. Nethercote and J. Seward, "Valgrind: a framework for heavyweight dynamic binary instrumentation," in Programming Language Design and Implementation, San Diego, 2007.
[7] H. Kannan, M. Dalton, and C. Kozyrakis, "Decoupling Dynamic Information Flow Tracking with a dedicated coprocessor," in Dependable Systems & Networks, Estoril, 2009.
[8] S. Chen, M. Kozuch, T. Strigkos, B. Falsafi, P. B. Gibbons, T. C. Mowry, V. Ramachandran, O. Ruwase, M. Ryan, and E. Vlachos, "Flexible Hardware Acceleration for Instruction-Grain Program Monitoring," in International Symposium on Computer Architecture, Washington, 2008.


More information

Handout 2 ILP: Part B

Handout 2 ILP: Part B Handout 2 ILP: Part B Review from Last Time #1 Leverage Implicit Parallelism for Performance: Instruction Level Parallelism Loop unrolling by compiler to increase ILP Branch prediction to increase ILP

More information

Chapter 03. Authors: John Hennessy & David Patterson. Copyright 2011, Elsevier Inc. All rights Reserved. 1

Chapter 03. Authors: John Hennessy & David Patterson. Copyright 2011, Elsevier Inc. All rights Reserved. 1 Chapter 03 Authors: John Hennessy & David Patterson Copyright 2011, Elsevier Inc. All rights Reserved. 1 Figure 3.3 Comparison of 2-bit predictors. A noncorrelating predictor for 4096 bits is first, followed

More information

High Performance SMIPS Processor

High Performance SMIPS Processor High Performance SMIPS Processor Jonathan Eastep 6.884 Final Project Report May 11, 2005 1 Introduction 1.1 Description This project will focus on producing a high-performance, single-issue, in-order,

More information

Aries: Transparent Execution of PA-RISC/HP-UX Applications on IPF/HP-UX

Aries: Transparent Execution of PA-RISC/HP-UX Applications on IPF/HP-UX Aries: Transparent Execution of PA-RISC/HP-UX Applications on IPF/HP-UX Keerthi Bhushan Rajesh K Chaurasia Hewlett-Packard India Software Operations 29, Cunningham Road Bangalore 560 052 India +91-80-2251554

More information

Improving Data Cache Performance via Address Correlation: An Upper Bound Study

Improving Data Cache Performance via Address Correlation: An Upper Bound Study Improving Data Cache Performance via Address Correlation: An Upper Bound Study Peng-fei Chuang 1, Resit Sendag 2, and David J. Lilja 1 1 Department of Electrical and Computer Engineering Minnesota Supercomputing

More information

Introduction to Operating Systems Prof. Chester Rebeiro Department of Computer Science and Engineering Indian Institute of Technology, Madras

Introduction to Operating Systems Prof. Chester Rebeiro Department of Computer Science and Engineering Indian Institute of Technology, Madras Introduction to Operating Systems Prof. Chester Rebeiro Department of Computer Science and Engineering Indian Institute of Technology, Madras Week - 04 Lecture 17 CPU Context Switching Hello. In this video

More information

Pipelined processors and Hazards

Pipelined processors and Hazards Pipelined processors and Hazards Two options Processor HLL Compiler ALU LU Output Program Control unit 1. Either the control unit can be smart, i,e. it can delay instruction phases to avoid hazards. Processor

More information

Profiling-Based L1 Data Cache Bypassing to Improve GPU Performance and Energy Efficiency

Profiling-Based L1 Data Cache Bypassing to Improve GPU Performance and Energy Efficiency Profiling-Based L1 Data Cache Bypassing to Improve GPU Performance and Energy Efficiency Yijie Huangfu and Wei Zhang Department of Electrical and Computer Engineering Virginia Commonwealth University {huangfuy2,wzhang4}@vcu.edu

More information

FADE: A Programmable Filtering Accelerator for Instruction-Grain Monitoring

FADE: A Programmable Filtering Accelerator for Instruction-Grain Monitoring In Proceedings of the th International Symposium On High Performance Computer Architecture (HPCA 4) FADE: A Programmable Filtering Accelerator for Instruction-Grain Monitoring Sotiria Fytraki, Evangelos

More information

Multithreaded Processors. Department of Electrical Engineering Stanford University

Multithreaded Processors. Department of Electrical Engineering Stanford University Lecture 12: Multithreaded Processors Department of Electrical Engineering Stanford University http://eeclass.stanford.edu/ee382a Lecture 12-1 The Big Picture Previous lectures: Core design for single-thread

More information

A Streaming Multi-Threaded Model

A Streaming Multi-Threaded Model A Streaming Multi-Threaded Model Extended Abstract Eylon Caspi, André DeHon, John Wawrzynek September 30, 2001 Summary. We present SCORE, a multi-threaded model that relies on streams to expose thread

More information

TDT Coarse-Grained Multithreading. Review on ILP. Multi-threaded execution. Contents. Fine-Grained Multithreading

TDT Coarse-Grained Multithreading. Review on ILP. Multi-threaded execution. Contents. Fine-Grained Multithreading Review on ILP TDT 4260 Chap 5 TLP & Hierarchy What is ILP? Let the compiler find the ILP Advantages? Disadvantages? Let the HW find the ILP Advantages? Disadvantages? Contents Multi-threading Chap 3.5

More information

Banshee: Bandwidth-Efficient DRAM Caching via Software/Hardware Cooperation!

Banshee: Bandwidth-Efficient DRAM Caching via Software/Hardware Cooperation! Banshee: Bandwidth-Efficient DRAM Caching via Software/Hardware Cooperation! Xiangyao Yu 1, Christopher Hughes 2, Nadathur Satish 2, Onur Mutlu 3, Srinivas Devadas 1 1 MIT 2 Intel Labs 3 ETH Zürich 1 High-Bandwidth

More information

Chapter 9 Pipelining. Jin-Fu Li Department of Electrical Engineering National Central University Jungli, Taiwan

Chapter 9 Pipelining. Jin-Fu Li Department of Electrical Engineering National Central University Jungli, Taiwan Chapter 9 Pipelining Jin-Fu Li Department of Electrical Engineering National Central University Jungli, Taiwan Outline Basic Concepts Data Hazards Instruction Hazards Advanced Reliable Systems (ARES) Lab.

More information

Simultaneous Multithreading Architecture

Simultaneous Multithreading Architecture Simultaneous Multithreading Architecture Virendra Singh Indian Institute of Science Bangalore Lecture-32 SE-273: Processor Design For most apps, most execution units lie idle For an 8-way superscalar.

More information

1. Memory technology & Hierarchy

1. Memory technology & Hierarchy 1 Memory technology & Hierarchy Caching and Virtual Memory Parallel System Architectures Andy D Pimentel Caches and their design cf Henessy & Patterson, Chap 5 Caching - summary Caches are small fast memories

More information

Beyond ILP. Hemanth M Bharathan Balaji. Hemanth M & Bharathan Balaji

Beyond ILP. Hemanth M Bharathan Balaji. Hemanth M & Bharathan Balaji Beyond ILP Hemanth M Bharathan Balaji Multiscalar Processors Gurindar S Sohi Scott E Breach T N Vijaykumar Control Flow Graph (CFG) Each node is a basic block in graph CFG divided into a collection of

More information

Chapter 2. Parallel Hardware and Parallel Software. An Introduction to Parallel Programming. The Von Neuman Architecture

Chapter 2. Parallel Hardware and Parallel Software. An Introduction to Parallel Programming. The Von Neuman Architecture An Introduction to Parallel Programming Peter Pacheco Chapter 2 Parallel Hardware and Parallel Software 1 The Von Neuman Architecture Control unit: responsible for deciding which instruction in a program

More information

CS 252 Graduate Computer Architecture. Lecture 4: Instruction-Level Parallelism

CS 252 Graduate Computer Architecture. Lecture 4: Instruction-Level Parallelism CS 252 Graduate Computer Architecture Lecture 4: Instruction-Level Parallelism Krste Asanovic Electrical Engineering and Computer Sciences University of California, Berkeley http://wwweecsberkeleyedu/~krste

More information

Hardware Support for Software Debugging

Hardware Support for Software Debugging Hardware Support for Software Debugging Mohammad Amin Alipour Benjamin Depew Department of Computer Science Michigan Technological University Report Documentation Page Form Approved OMB No. 0704-0188 Public

More information

Darek Mihocka, Emulators.com Stanislav Shwartsman, Intel Corp. June

Darek Mihocka, Emulators.com Stanislav Shwartsman, Intel Corp. June Darek Mihocka, Emulators.com Stanislav Shwartsman, Intel Corp. June 21 2008 Agenda Introduction Gemulator Bochs Proposed ISA Extensions Conclusions and Future Work Q & A Jun-21-2008 AMAS-BT 2008 2 Introduction

More information

CS 426 Parallel Computing. Parallel Computing Platforms

CS 426 Parallel Computing. Parallel Computing Platforms CS 426 Parallel Computing Parallel Computing Platforms Ozcan Ozturk http://www.cs.bilkent.edu.tr/~ozturk/cs426/ Slides are adapted from ``Introduction to Parallel Computing'' Topic Overview Implicit Parallelism:

More information

Limiting the Number of Dirty Cache Lines

Limiting the Number of Dirty Cache Lines Limiting the Number of Dirty Cache Lines Pepijn de Langen and Ben Juurlink Computer Engineering Laboratory Faculty of Electrical Engineering, Mathematics and Computer Science Delft University of Technology

More information

COMPUTER ORGANIZATION AND DESI

COMPUTER ORGANIZATION AND DESI COMPUTER ORGANIZATION AND DESIGN 5 Edition th The Hardware/Software Interface Chapter 4 The Processor 4.1 Introduction Introduction CPU performance factors Instruction count Determined by ISA and compiler

More information

250P: Computer Systems Architecture. Lecture 9: Out-of-order execution (continued) Anton Burtsev February, 2019

250P: Computer Systems Architecture. Lecture 9: Out-of-order execution (continued) Anton Burtsev February, 2019 250P: Computer Systems Architecture Lecture 9: Out-of-order execution (continued) Anton Burtsev February, 2019 The Alpha 21264 Out-of-Order Implementation Reorder Buffer (ROB) Branch prediction and instr

More information

Raksha: A Flexible Information Flow Architecture for Software Security

Raksha: A Flexible Information Flow Architecture for Software Security Raksha: A Flexible Information Flow Architecture for Software Security Michael Dalton, Hari Kannan, Christos Kozyrakis Computer Systems Laboratory Stanford University {mwdalton, hkannan, kozyraki}@stanford.edu

More information

Dynamic Scheduling. CSE471 Susan Eggers 1

Dynamic Scheduling. CSE471 Susan Eggers 1 Dynamic Scheduling Why go out of style? expensive hardware for the time (actually, still is, relatively) register files grew so less register pressure early RISCs had lower CPIs Why come back? higher chip

More information

Hardware-based Speculation

Hardware-based Speculation Hardware-based Speculation Hardware-based Speculation To exploit instruction-level parallelism, maintaining control dependences becomes an increasing burden. For a processor executing multiple instructions

More information

CSF Cache Introduction. [Adapted from Computer Organization and Design, Patterson & Hennessy, 2005]

CSF Cache Introduction. [Adapted from Computer Organization and Design, Patterson & Hennessy, 2005] CSF Cache Introduction [Adapted from Computer Organization and Design, Patterson & Hennessy, 2005] Review: The Memory Hierarchy Take advantage of the principle of locality to present the user with as much

More information

Tradeoff between coverage of a Markov prefetcher and memory bandwidth usage

Tradeoff between coverage of a Markov prefetcher and memory bandwidth usage Tradeoff between coverage of a Markov prefetcher and memory bandwidth usage Elec525 Spring 2005 Raj Bandyopadhyay, Mandy Liu, Nico Peña Hypothesis Some modern processors use a prefetching unit at the front-end

More information

Embedded Systems Dr. Santanu Chaudhury Department of Electrical Engineering Indian Institute of Technology, Delhi

Embedded Systems Dr. Santanu Chaudhury Department of Electrical Engineering Indian Institute of Technology, Delhi Embedded Systems Dr. Santanu Chaudhury Department of Electrical Engineering Indian Institute of Technology, Delhi Lecture - 13 Virtual memory and memory management unit In the last class, we had discussed

More information

Spectre and Meltdown. Clifford Wolf q/talk

Spectre and Meltdown. Clifford Wolf q/talk Spectre and Meltdown Clifford Wolf q/talk 2018-01-30 Spectre and Meltdown Spectre (CVE-2017-5753 and CVE-2017-5715) Is an architectural security bug that effects most modern processors with speculative

More information

Data/Thread Level Speculation (TLS) in the Stanford Hydra Chip Multiprocessor (CMP)

Data/Thread Level Speculation (TLS) in the Stanford Hydra Chip Multiprocessor (CMP) Data/Thread Level Speculation (TLS) in the Stanford Hydra Chip Multiprocessor (CMP) A 4-core Chip Multiprocessor (CMP) based microarchitecture/compiler effort at Stanford that provides hardware/software

More information

SPECULATIVE MULTITHREADED ARCHITECTURES

SPECULATIVE MULTITHREADED ARCHITECTURES 2 SPECULATIVE MULTITHREADED ARCHITECTURES In this Chapter, the execution model of the speculative multithreading paradigm is presented. This execution model is based on the identification of pairs of instructions

More information

Outline EEL 5764 Graduate Computer Architecture. Chapter 3 Limits to ILP and Simultaneous Multithreading. Overcoming Limits - What do we need??

Outline EEL 5764 Graduate Computer Architecture. Chapter 3 Limits to ILP and Simultaneous Multithreading. Overcoming Limits - What do we need?? Outline EEL 7 Graduate Computer Architecture Chapter 3 Limits to ILP and Simultaneous Multithreading! Limits to ILP! Thread Level Parallelism! Multithreading! Simultaneous Multithreading Ann Gordon-Ross

More information

CAD for VLSI 2 Pro ject - Superscalar Processor Implementation

CAD for VLSI 2 Pro ject - Superscalar Processor Implementation CAD for VLSI 2 Pro ject - Superscalar Processor Implementation 1 Superscalar Processor Ob jective: The main objective is to implement a superscalar pipelined processor using Verilog HDL. This project may

More information

Data/Thread Level Speculation (TLS) in the Stanford Hydra Chip Multiprocessor (CMP)

Data/Thread Level Speculation (TLS) in the Stanford Hydra Chip Multiprocessor (CMP) Data/Thread Level Speculation (TLS) in the Stanford Hydra Chip Multiprocessor (CMP) Hydra is a 4-core Chip Multiprocessor (CMP) based microarchitecture/compiler effort at Stanford that provides hardware/software

More information

The Impact of Instruction Compression on I-cache Performance

The Impact of Instruction Compression on I-cache Performance Technical Report CSE-TR--97, University of Michigan The Impact of Instruction Compression on I-cache Performance I-Cheng K. Chen Peter L. Bird Trevor Mudge EECS Department University of Michigan {icheng,pbird,tnm}@eecs.umich.edu

More information

Key Point. What are Cache lines

Key Point. What are Cache lines Caching 1 Key Point What are Cache lines Tags Index offset How do we find data in the cache? How do we tell if it s the right data? What decisions do we need to make in designing a cache? What are possible

More information

Meltdown or "Holy Crap: How did we do this to ourselves" Meltdown exploits side effects of out-of-order execution to read arbitrary kernelmemory

Meltdown or Holy Crap: How did we do this to ourselves Meltdown exploits side effects of out-of-order execution to read arbitrary kernelmemory Meltdown or "Holy Crap: How did we do this to ourselves" Abstract Meltdown exploits side effects of out-of-order execution to read arbitrary kernelmemory locations Breaks all security assumptions given

More information

EECS 570 Final Exam - SOLUTIONS Winter 2015

EECS 570 Final Exam - SOLUTIONS Winter 2015 EECS 570 Final Exam - SOLUTIONS Winter 2015 Name: unique name: Sign the honor code: I have neither given nor received aid on this exam nor observed anyone else doing so. Scores: # Points 1 / 21 2 / 32

More information

The Nios II Family of Configurable Soft-core Processors

The Nios II Family of Configurable Soft-core Processors The Nios II Family of Configurable Soft-core Processors James Ball August 16, 2005 2005 Altera Corporation Agenda Nios II Introduction Configuring your CPU FPGA vs. ASIC CPU Design Instruction Set Architecture

More information

CS6303 Computer Architecture Regulation 2013 BE-Computer Science and Engineering III semester 2 MARKS

CS6303 Computer Architecture Regulation 2013 BE-Computer Science and Engineering III semester 2 MARKS CS6303 Computer Architecture Regulation 2013 BE-Computer Science and Engineering III semester 2 MARKS UNIT-I OVERVIEW & INSTRUCTIONS 1. What are the eight great ideas in computer architecture? The eight

More information

Wednesday, September 13, Chapter 4

Wednesday, September 13, Chapter 4 Wednesday, September 13, 2017 Topics for today Introduction to Computer Systems Static overview Operation Cycle Introduction to Pep/9 Features of the system Operational cycle Program trace Categories of

More information

Question 13 1: (Solution, p 4) Describe the inputs and outputs of a (1-way) demultiplexer, and how they relate.

Question 13 1: (Solution, p 4) Describe the inputs and outputs of a (1-way) demultiplexer, and how they relate. Questions 1 Question 13 1: (Solution, p ) Describe the inputs and outputs of a (1-way) demultiplexer, and how they relate. Question 13 : (Solution, p ) In implementing HYMN s control unit, the fetch cycle

More information

Data/Thread Level Speculation (TLS) in the Stanford Hydra Chip Multiprocessor (CMP)

Data/Thread Level Speculation (TLS) in the Stanford Hydra Chip Multiprocessor (CMP) Data/Thread Level Speculation (TLS) in the Stanford Hydra Chip Multiprocessor (CMP) Hydra ia a 4-core Chip Multiprocessor (CMP) based microarchitecture/compiler effort at Stanford that provides hardware/software

More information

Lecture 11: SMT and Caching Basics. Today: SMT, cache access basics (Sections 3.5, 5.1)

Lecture 11: SMT and Caching Basics. Today: SMT, cache access basics (Sections 3.5, 5.1) Lecture 11: SMT and Caching Basics Today: SMT, cache access basics (Sections 3.5, 5.1) 1 Thread-Level Parallelism Motivation: a single thread leaves a processor under-utilized for most of the time by doubling

More information

MCD: A Multiple Clock Domain Microarchitecture

MCD: A Multiple Clock Domain Microarchitecture MCD: A Multiple Clock Domain Microarchitecture Dave Albonesi in collaboration with Greg Semeraro Grigoris Magklis Rajeev Balasubramonian Steve Dropsho Sandhya Dwarkadas Michael Scott Caveats We started

More information

Sudhakar Yalamanchili, Georgia Institute of Technology (except as indicated) Active thread Idle thread

Sudhakar Yalamanchili, Georgia Institute of Technology (except as indicated) Active thread Idle thread Intra-Warp Compaction Techniques Sudhakar Yalamanchili, Georgia Institute of Technology (except as indicated) Goal Active thread Idle thread Compaction Compact threads in a warp to coalesce (and eliminate)

More information

The Design Complexity of Program Undo Support in a General-Purpose Processor

The Design Complexity of Program Undo Support in a General-Purpose Processor The Design Complexity of Program Undo Support in a General-Purpose Processor Radu Teodorescu and Josep Torrellas Department of Computer Science University of Illinois at Urbana-Champaign http://iacoma.cs.uiuc.edu

More information

ParalleX. A Cure for Scaling Impaired Parallel Applications. Hartmut Kaiser

ParalleX. A Cure for Scaling Impaired Parallel Applications. Hartmut Kaiser ParalleX A Cure for Scaling Impaired Parallel Applications Hartmut Kaiser (hkaiser@cct.lsu.edu) 2 Tianhe-1A 2.566 Petaflops Rmax Heterogeneous Architecture: 14,336 Intel Xeon CPUs 7,168 Nvidia Tesla M2050

More information

The ARM10 Family of Advanced Microprocessor Cores

The ARM10 Family of Advanced Microprocessor Cores The ARM10 Family of Advanced Microprocessor Cores Stephen Hill ARM Austin Design Center 1 Agenda Design overview Microarchitecture ARM10 o o Memory System Interrupt response 3. Power o o 4. VFP10 ETM10

More information

Chapter 4. Advanced Pipelining and Instruction-Level Parallelism. In-Cheol Park Dept. of EE, KAIST

Chapter 4. Advanced Pipelining and Instruction-Level Parallelism. In-Cheol Park Dept. of EE, KAIST Chapter 4. Advanced Pipelining and Instruction-Level Parallelism In-Cheol Park Dept. of EE, KAIST Instruction-level parallelism Loop unrolling Dependence Data/ name / control dependence Loop level parallelism

More information

Design of Out-Of-Order Superscalar Processor with Speculative Thread Level Parallelism

Design of Out-Of-Order Superscalar Processor with Speculative Thread Level Parallelism ISSN (Online) : 2319-8753 ISSN (Print) : 2347-6710 International Journal of Innovative Research in Science, Engineering and Technology Volume 3, Special Issue 3, March 2014 2014 International Conference

More information

Checker Processors. Smruti R. Sarangi. Department of Computer Science Indian Institute of Technology New Delhi, India

Checker Processors. Smruti R. Sarangi. Department of Computer Science Indian Institute of Technology New Delhi, India Advanced Department of Computer Science Indian Institute of Technology New Delhi, India Outline Introduction Advanced 1 Introduction 2 Checker Pipeline Checking Mechanism 3 Advanced Core Checker L1 Failure

More information

Chapter 8 & Chapter 9 Main Memory & Virtual Memory

Chapter 8 & Chapter 9 Main Memory & Virtual Memory Chapter 8 & Chapter 9 Main Memory & Virtual Memory 1. Various ways of organizing memory hardware. 2. Memory-management techniques: 1. Paging 2. Segmentation. Introduction Memory consists of a large array

More information

Chapter 6 Memory 11/3/2015. Chapter 6 Objectives. 6.2 Types of Memory. 6.1 Introduction

Chapter 6 Memory 11/3/2015. Chapter 6 Objectives. 6.2 Types of Memory. 6.1 Introduction Chapter 6 Objectives Chapter 6 Memory Master the concepts of hierarchical memory organization. Understand how each level of memory contributes to system performance, and how the performance is measured.

More information

Software Pipelining for Coarse-Grained Reconfigurable Instruction Set Processors

Software Pipelining for Coarse-Grained Reconfigurable Instruction Set Processors Software Pipelining for Coarse-Grained Reconfigurable Instruction Set Processors Francisco Barat, Murali Jayapala, Pieter Op de Beeck and Geert Deconinck K.U.Leuven, Belgium. {f-barat, j4murali}@ieee.org,

More information

Page 1. Multilevel Memories (Improving performance using a little cash )

Page 1. Multilevel Memories (Improving performance using a little cash ) Page 1 Multilevel Memories (Improving performance using a little cash ) 1 Page 2 CPU-Memory Bottleneck CPU Memory Performance of high-speed computers is usually limited by memory bandwidth & latency Latency

More information

Jackson Marusarz Intel Corporation

Jackson Marusarz Intel Corporation Jackson Marusarz Intel Corporation Intel VTune Amplifier Quick Introduction Get the Data You Need Hotspot (Statistical call tree), Call counts (Statistical) Thread Profiling Concurrency and Lock & Waits

More information

SUPERSCALAR AND VLIW PROCESSORS

SUPERSCALAR AND VLIW PROCESSORS Datorarkitektur I Fö 10-1 Datorarkitektur I Fö 10-2 What is a Superscalar Architecture? SUPERSCALAR AND VLIW PROCESSORS A superscalar architecture is one in which several instructions can be initiated

More information

Memory. Objectives. Introduction. 6.2 Types of Memory

Memory. Objectives. Introduction. 6.2 Types of Memory Memory Objectives Master the concepts of hierarchical memory organization. Understand how each level of memory contributes to system performance, and how the performance is measured. Master the concepts

More information

Pipelining, Instruction Level Parallelism and Memory in Processors. Advanced Topics ICOM 4215 Computer Architecture and Organization Fall 2010

Pipelining, Instruction Level Parallelism and Memory in Processors. Advanced Topics ICOM 4215 Computer Architecture and Organization Fall 2010 Pipelining, Instruction Level Parallelism and Memory in Processors Advanced Topics ICOM 4215 Computer Architecture and Organization Fall 2010 NOTE: The material for this lecture was taken from several

More information

Cache Memory COE 403. Computer Architecture Prof. Muhamed Mudawar. Computer Engineering Department King Fahd University of Petroleum and Minerals

Cache Memory COE 403. Computer Architecture Prof. Muhamed Mudawar. Computer Engineering Department King Fahd University of Petroleum and Minerals Cache Memory COE 403 Computer Architecture Prof. Muhamed Mudawar Computer Engineering Department King Fahd University of Petroleum and Minerals Presentation Outline The Need for Cache Memory The Basics

More information

A Cool Scheduler for Multi-Core Systems Exploiting Program Phases

A Cool Scheduler for Multi-Core Systems Exploiting Program Phases IEEE TRANSACTIONS ON COMPUTERS, VOL. 63, NO. 5, MAY 2014 1061 A Cool Scheduler for Multi-Core Systems Exploiting Program Phases Zhiming Zhang and J. Morris Chang, Senior Member, IEEE Abstract Rapid growth

More information

1. Creates the illusion of an address space much larger than the physical memory

1. Creates the illusion of an address space much larger than the physical memory Virtual memory Main Memory Disk I P D L1 L2 M Goals Physical address space Virtual address space 1. Creates the illusion of an address space much larger than the physical memory 2. Make provisions for

More information

CMSC 411 Computer Systems Architecture Lecture 13 Instruction Level Parallelism 6 (Limits to ILP & Threading)

CMSC 411 Computer Systems Architecture Lecture 13 Instruction Level Parallelism 6 (Limits to ILP & Threading) CMSC 411 Computer Systems Architecture Lecture 13 Instruction Level Parallelism 6 (Limits to ILP & Threading) Limits to ILP Conflicting studies of amount of ILP Benchmarks» vectorized Fortran FP vs. integer

More information

Donn Morrison Department of Computer Science. TDT4255 Memory hierarchies

Donn Morrison Department of Computer Science. TDT4255 Memory hierarchies TDT4255 Lecture 10: Memory hierarchies Donn Morrison Department of Computer Science 2 Outline Chapter 5 - Memory hierarchies (5.1-5.5) Temporal and spacial locality Hits and misses Direct-mapped, set associative,

More information

LECTURE 3: THE PROCESSOR

LECTURE 3: THE PROCESSOR LECTURE 3: THE PROCESSOR Abridged version of Patterson & Hennessy (2013):Ch.4 Introduction CPU performance factors Instruction count Determined by ISA and compiler CPI and Cycle time Determined by CPU

More information

Wednesday, February 4, Chapter 4

Wednesday, February 4, Chapter 4 Wednesday, February 4, 2015 Topics for today Introduction to Computer Systems Static overview Operation Cycle Introduction to Pep/8 Features of the system Operational cycle Program trace Categories of

More information

HW1 Solutions. Type Old Mix New Mix Cost CPI

HW1 Solutions. Type Old Mix New Mix Cost CPI HW1 Solutions Problem 1 TABLE 1 1. Given the parameters of Problem 6 (note that int =35% and shift=5% to fix typo in book problem), consider a strength-reducing optimization that converts multiplies by

More information

Review on ichat: Inter Cache Hardware Assistant Data Transfer for Heterogeneous Chip Multiprocessors. By: Anvesh Polepalli Raj Muchhala

Review on ichat: Inter Cache Hardware Assistant Data Transfer for Heterogeneous Chip Multiprocessors. By: Anvesh Polepalli Raj Muchhala Review on ichat: Inter Cache Hardware Assistant Data Transfer for Heterogeneous Chip Multiprocessors By: Anvesh Polepalli Raj Muchhala Introduction Integrating CPU and GPU into a single chip for performance

More information

A Comparison of Capacity Management Schemes for Shared CMP Caches

A Comparison of Capacity Management Schemes for Shared CMP Caches A Comparison of Capacity Management Schemes for Shared CMP Caches Carole-Jean Wu and Margaret Martonosi Princeton University 7 th Annual WDDD 6/22/28 Motivation P P1 P1 Pn L1 L1 L1 L1 Last Level On-Chip

More information

EE382A Lecture 7: Dynamic Scheduling. Department of Electrical Engineering Stanford University

EE382A Lecture 7: Dynamic Scheduling. Department of Electrical Engineering Stanford University EE382A Lecture 7: Dynamic Scheduling Department of Electrical Engineering Stanford University http://eeclass.stanford.edu/ee382a Lecture 7-1 Announcements Project proposal due on Wed 10/14 2-3 pages submitted

More information

Chapter 8. Pipelining
