Reusing Cache for Real-Time Memory Address Trace Compression



Ing-Jer Huang
Dept. of Computer Science and Engineering, National Sun Yat-Sen University, Kaohsiung 804, Taiwan
ijhuang@cse.nsysu.edu.tw

Chung-Fu Kao
Dept. of Computer Science and Engineering, National Sun Yat-Sen University, Kaohsiung 804, Taiwan
cfkao@eslab.cse.nsysu.edu.tw

Abstract - Instruction traces help designers debug the system architecture and understand program behavior. However, one of the major problems of tracing is the high cost of storing the trace result. How to reduce the trace information and compress the trace volume is therefore an important issue when tracing a program. The cache is one of the basic components in modern microprocessor design, and it is usually disabled while the system is being debugged. In this paper, we present a technique that reuses the system cache for memory trace compression during system debugging; a cache simulator with the same behavior as the hardware cache can then restore the full trace result. The cache can compress the memory address trace volume in real time, which means the memory references can be recorded successively. Because modern microprocessor designs such as the ARM922T [3] and MIPS processor cores [4] almost always have embedded caches (an instruction cache and a data cache), we mainly aim at using the system cache to decrease the hardware cost while still recording the most important trace information without stopping microprocessor execution. After the trace data are compressed, they can be transferred quickly to the debug host and saved as trace files.

I. Introduction

In the era of system-on-chip (SoC), system and microprocessor debugging are more and more important due to design complexity and time-to-market pressure. The IEEE-ISTO Nexus 5001 Forum [1] established an open industry standard that provides a general-purpose interface for the software development and debug of embedded processors. When debugging a microprocessor, the dynamically executed instructions are among the most useful information: they help the designer not only to detect hardware bugs but also to analyze program behavior. When the microprocessor runs a program, the dynamically executed instructions must be collected in a trace file. Usually we record the addresses of the executed instructions in the trace file.

The difficulties in obtaining a complete program trace stem from the high cost of recording every executed instruction while the program runs and from the large size of the resulting trace files [2]. A program that runs for minutes on even a 200 MHz RISC microprocessor produces gigabytes or terabytes of trace data. For example, a 10 MHz, 32-bit microprocessor may generate the full trace at a rate of (10 × 32)/8 = 40 MBytes/sec; a 100 MHz, 32-bit microprocessor generates the full trace at a rate of (100 × 32)/8 = 400 MBytes/sec. Therefore, various techniques have been proposed to reduce the trace volume. We propose a technique that reuses the system cache for real-time memory address trace compression.

II. Related Work

The use of memory traces is an established technique for simulation-based research in computer architecture and systems [6]. Tracing techniques can be broadly divided into two categories, hardware approaches and software approaches [5][7], as shown in Fig. 1.

Address Tracing Techniques
- H/W Approaches: H/W monitoring, Embedded tracer, Nexus 5001, ETM, Cached trace
- S/W Approaches: Interrupt based, Instrumented program, Trap-bit method

Fig. 1. The classification of address tracing techniques
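The full-trace data rates quoted in the Introduction follow from simple arithmetic. The sketch below reproduces them, assuming one address is traced per clock cycle; the function name `full_trace_rate_mbytes_per_s` is hypothetical and used only for illustration.

```python
def full_trace_rate_mbytes_per_s(clock_mhz, address_bits=32):
    """Full-trace data rate in MBytes/sec.

    Assumption (for illustration only): one address of `address_bits`
    bits is recorded on every clock cycle.
    """
    return clock_mhz * address_bits / 8


print(full_trace_rate_mbytes_per_s(10))   # 40.0
print(full_trace_rate_mbytes_per_s(100))  # 400.0
```

At these rates, a few minutes of execution already reaches the gigabyte range, which is the motivation for filtering and compressing the trace before it leaves the chip.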

Modifying either the trace mechanism (hardware) or the debugging tools (software) to record the addresses of executed instructions can reduce the tracing overhead. As shown in Fig. 1, the software approaches are easy to implement, but they have difficulty storing the trace result in real time because the microprocessor operates faster than the trace can be collected [7]. The H/W monitoring approach [8][9] in Fig. 1 records processor memory requests directly; this monitoring captures both user and operating system references. The embedded tracer in Fig. 1 integrates the trace mechanism and the microprocessor into a single chip, e.g. the ARM Embedded Trace Macrocell (ETM) [10] and the MIPS E/M family processor cores [4].

III. Proposed Approach

In this section, we present the proposed technique of reusing the cache for memory trace compression, which is suitable for real-time compression of memory reference traces. Fig. 2 shows the primary components of an embedded system: microprocessor, caches, on-chip memory, on-chip bus, etc. The gray rectangle in Fig. 2 is our real-time address compression mechanism; it contains a branch/target filter (B/T filter) and a cached tracer.

[Fig. 2 blocks: microprocessor, B/T filter, I-cache, cached tracer, trace file, embedded memory, on-chip bus, off-chip; signals (1), (2), (3)]

Fig. 2. Embedding the trace mechanism into a system-on-chip

In Fig. 2, the signals (1), (2), and (3) denote the enable, address, and cache index signals, respectively. We introduce the branch/target filter, the system cache (instruction cache), and the cached tracer as follows.

A. Branch/Target Filter

When the microprocessor executes a program, the dynamic execution flow exhibits two kinds of locality: spatial locality and temporal locality. Spatial locality is the property that the next accessed address will probably be very close to the last accessed address. As a result, we define a basic block as a segment of program instructions that is executed contiguously, i.e. the instruction addresses in a basic block increase by a constant offset. Fig. 3 shows a simple basic block diagram. The constant offset depends on the machine instruction set architecture (ISA). For example, the successive address references of the ARM7 microprocessor increase by 4, e.g. a, a+4, a+8, a+C, etc. We denote the first instruction in the basic block as the target and the last instruction in the basic block as the branch. After the branch instruction is executed, the program execution flow jumps to another basic block, beginning at its target. Given this definition of a basic block, we can record only the target and branch addresses and ignore the contiguous addresses in between to reduce the trace volume.

[Fig. 3 content: an address/instruction listing divided into basic blocks, each beginning at a target and ending at a branch]

Fig. 3. Illustration of the basic block diagram

The microprocessor sends an address for each memory request, and this information is received by the branch/target filter (B/T filter). Fig. 4 shows the B/T filter block diagram. After the B/T filter receives the current address reference, the previous address value is subtracted from the current address value. The operation of the B/T filter is shown in Fig. 5. The B/T filter sets the enable signal to the system cache and the cached tracer module to true if the offset between the previous and current addresses is not equal to the constant offset. The system cache then caches the current address, and the cached tracer module records the trace information.

[Fig. 4 labels: address from microprocessor, B/T filter, previous address, constant offset, subtract (-), compare (=), enable]
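The B/T filter operation described above can be simulated in software. The following is a minimal sketch, not the authors' hardware: it follows the rule that enable is raised whenever the difference between the current and previous address is not the constant offset. The function name `bt_filter` and the sample address stream are assumptions made for illustration.

```python
def bt_filter(addresses, constant_offset=4):
    """Return the addresses for which the B/T filter would raise `enable`.

    Assumptions (illustrative only): the previous-address register starts
    at 0, and `constant_offset` is 4, as for a 32-bit fixed-width ISA.
    """
    previous = 0
    recorded = []
    for current in addresses:
        offset = current - previous
        if offset != constant_offset:
            # Non-contiguous reference: enable is true, so record it.
            recorded.append(current)
        previous = current
    return recorded


# Hypothetical stream: a block at 0x00..0x10, a jump to 0x40, then back to 0x00.
stream = [0x00, 0x04, 0x08, 0x0C, 0x10, 0x40, 0x44, 0x48, 0x00, 0x04]
kept = bt_filter(stream)
print([hex(a) for a in kept])  # ['0x0', '0x40', '0x0']
```

In this sample, ten references shrink to three recorded addresses; the contiguous references inside each basic block are dropped because they can be regenerated by adding the constant offset.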

Fig. 4. Branch/Target filter architecture

    previous_address = 0;
    offset = 0;
    constant_offset = 4;  // user defined

    Branch-Target filter (current_address) {
        offset = current_address - previous_address;
        if (offset == constant_offset)
            enable = False;   // contiguous instruction
        else                  // offset != constant_offset
            enable = True;    // save branch instruction
        previous_address = current_address;
    }

Fig. 5. Pseudo-code illustrating the B/T filter operation

B. System Cache

After the B/T filter operation, only the branch and target references are recorded, but the trace volume may still be large. For example, a 32-bit RISC microprocessor sends a 32-bit address request to memory. During the system tracing stage, if the total number of branch and target references is m, the total trace size is m × 32 bits. Memory references from programs are known to show copious temporal locality: if a memory address is accessed, it will probably be accessed again in the very near future. With advances in chip technology, most new microprocessors incorporate embedded caches, which allow some memory references to be handled internally to increase system performance. When the system is in the debugging stage, the cache is usually disabled and memory references bypass it. Therefore, we reuse the system cache and emulate with it a small direct-mapped structure, called a compression cache, that keeps the recently seen memory references. If the next memory reference hits in the cache, the hit index value is sent to the cached tracer module. Otherwise a cache miss occurs, and the full address information is sent to the cached tracer module. Whether the cache hits or misses, the hit/miss information is also sent to the cached tracer module.

C. Cached Tracer

If a branch or target address is already stored in the cache, the access is a cache hit, and the system cache sends a hit signal and the hit index to the cached tracer module. On a cache miss, the system cache sends a miss signal and the full address to the cached tracer module. Accordingly, the trace file has the following format, where H denotes a hit and M denotes a miss: H(index) H(index) M(address) H(index) M(address) M(address), etc. If a program is loop-intensive or has strong temporal locality, as multimedia programs in particular do, cache hits outnumber misses, and the trace file size is reduced because the full address information need not be recorded for every branch/target. The cached tracer module transmits the trace information to the off-chip host, which saves it as trace files. The cached tracer can transmit the data to the host via a JTAG port or other transmission interfaces such as the IEEE-ISTO Nexus 5001 AUX port or an Ethernet port.

IV. Experiment Results

In this paper, we show five critical programs executed on an ARM7 microprocessor, tracing 1,000,000 instructions for each program. These five programs are DCT (Discrete Cosine Transform), FFT (Fast Fourier Transform), a JPEG encoder, the Fibonacci sequence, and the Tower of Hanoi. All five are loop-intensive benchmarks, and the Fibonacci sequence and Tower of Hanoi programs are recursive as well. We also emulate different cache sizes to analyze the relation between cache size and trace compression. As mentioned above, the trace file size can be calculated by Eq. 1:

$$S = H(1 + I) + M(1 + A) \ \text{(bits)} \qquad \text{(Eq. 1)}$$

where
I: cache index size
A: memory address size
H: number of cache hits
M: number of cache misses
S: trace file size

In Eq. 1, the term (1 + I) means that on a cache hit the trace record has the format H(index): a leading 1 followed by the hit index. The leading 1 denotes a hit, so the hit information takes 1 bit. The term (1 + A) means that on a cache miss the trace record has the format M(full address): for a 32-bit address, a leading 0 followed by the full address. The leading 0 denotes a miss. So, for each cache hit the record size is 1 plus the hit index size, and for each cache miss the record size is 1 plus the full address bit-width. Within a trace file, the cache index size and the full address bit-width are fixed, so it is easy to decompress the full trace information.

For each table, the instruction number indicates the total traced instruction count. The address size is 32 bits because the ARM7TDMI is a 32-bit RISC microprocessor whose address bus is 32 bits wide. The file size indicates the trace file size and is the product of the instruction number and the address size. For the full trace, no compression technique is used and all dynamically executed instruction addresses are recorded. B/T filter means the contiguous addresses are omitted from the total referenced instruction addresses. Hit/miss shows the cache hit and miss counts; their sum should equal the B/T filter instruction number. The index size depends on the cache size and equals log2(cache size). The file size for the cached trace is calculated according to Eq. 1. The compression ratio of each row is computed relative to the full trace; for example, the compression ratio of the 7th row of Table 1 (cache size 2K) is

$$\text{compression ratio} = \left(1 - \frac{\text{2K file size}}{\text{full trace file size}}\right) \times 100\%$$

[Table 1. Compression ratio of the DCT program]
[Table 2. Compression ratio of the FFT program]
[Table 3. Compression ratio of the JPEG encoder program]
[Table 4. Compression ratio of the Fibonacci sequence program]
[Table 5. Compression ratio of the Tower of Hanoi program]

V. Conclusions

In this paper, we present a novel approach that reuses the system cache to reduce the trace volume and the hardware design overhead. The trace of dynamically executed instructions is among the most useful information for the designer, both for detecting hardware bugs and for analyzing program behavior. As the operating frequency of microprocessors increases, the trace volume grows very quickly, so the trace data must be compressed in order to be recorded in real time. First, we define the basic-block concept and implement the branch/target filter to omit the successively referenced memory addresses. Second, since the cache is usually disabled during system debugging, we reuse it, sharing an existing hardware resource, to compress the memory references. This technique relies on the temporal locality of programs. The experiment results show that the average compression ratio is approximately 90% after the cache operation. In the future, we will investigate other compression methods, such as LZ and LZW, to obtain a higher compression ratio.
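As a supplementary sketch of the compression cache, the H/M record format, and Eq. 1, the code below models the scheme in software. It is an illustration under stated assumptions, not the authors' implementation: the cache is taken to be direct mapped and indexed by the word address modulo the number of entries, each entry holds a full address, and the names `compress`, `decompress`, and `trace_size_bits` are hypothetical.

```python
def _index(addr, entries, word_bytes):
    # Assumed mapping: word address modulo the number of cache entries.
    return (addr // word_bytes) % entries


def compress(addresses, entries=8, word_bytes=4):
    """Turn filtered addresses into ("H", index) / ("M", address) records."""
    cache = [None] * entries
    records = []
    for addr in addresses:
        index = _index(addr, entries, word_bytes)
        if cache[index] == addr:
            records.append(("H", index))   # hit: only the index is recorded
        else:
            cache[index] = addr            # miss: fill the entry
            records.append(("M", addr))    # and record the full address
    return records


def decompress(records, entries=8, word_bytes=4):
    """Replay the same cache behavior on the host to restore the addresses."""
    cache = [None] * entries
    addresses = []
    for kind, value in records:
        if kind == "H":
            addr = cache[value]
        else:
            addr = value
            cache[_index(addr, entries, word_bytes)] = addr
        addresses.append(addr)
    return addresses


def trace_size_bits(records, entries=8, address_bits=32):
    """Eq. 1: S = H(1 + I) + M(1 + A), with I = log2(number of entries)."""
    index_bits = (entries - 1).bit_length()
    hits = sum(1 for kind, _ in records if kind == "H")
    misses = len(records) - hits
    return hits * (1 + index_bits) + misses * (1 + address_bits)


# Hypothetical branch/target stream with some temporal locality.
targets = [0x00, 0x44, 0x00, 0x44, 0x00, 0x88]
records = compress(targets)
print(records)
print(trace_size_bits(records))  # 111
```

For this sample stream there are three hits and three misses with a 3-bit index, so Eq. 1 gives 3 × 4 + 3 × 33 = 111 bits against 6 × 32 = 192 bits for the unfiltered records. Because `decompress` repeats exactly the same fill policy as `compress`, the host can recover every address from the hit indices alone, which mirrors the role of the cache simulator mentioned in the abstract.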

References

[1] IEEE-ISTO Nexus 5001 Forum,
[2] J. R. Larus, "Efficient program tracing," Computer, Vol. 26, No. 5, pp. 52-61, May 1993.
[3] ARM Ltd., ARM922T Technical Reference Manual, ARM,
[4] MIPS web site, tation/processorcores/doclibrary
[5] S.-M. Huang, I.-J. Huang, and C.-F. Kao, "Reconfigurable real-time address trace compressor for embedded microprocessors," Proc. IEEE Int'l Conf. Field Programmable Technology (FPT), pp. , 2003.
[6] R. A. Uhlig and T. N. Mudge, "Trace-driven memory simulation: a survey," ACM Computing Surveys, Vol. 29, No. 2, pp. , 1997.
[7] C. B. Stunkel, B. Janssens, and W. K. Fuchs, "Collecting address traces from parallel computers," Proc. 24th Annual Hawaii Int'l Conf. System Sciences, Vol. 1, pp. , 1991.
[8] D. W. Clark, "Cache performance in the VAX-11/780," ACM Trans. Computer Systems, Vol. 1, pp. 24-37, 1983.
[9] A. Malony, "Cedar performance measurements," CSRD Report No. 579, Center for Supercomputing Research and Development, Univ. of Illinois, Urbana, IL, 1986.
[10] ARM Ltd., ETM9 Technical Reference Manual, ARM,
