
Design of A Memory Latency Tolerant Processor (SCALT)

Naohiko SHIMIZU*, Kazuyuki MIYASAKA**, Hiroaki HARAMIISHI**
*Faculty of Eng., Tokai Univ.  **Graduate School of Eng., Tokai Univ.
1117 Kitakaname, Hiratuka-shi, Kanagawa 259-1292, Japan
TEL: +81-463-58-1211 (ex.484)
*email: nshimizu@keyaki.cc.u-tokai.ac.jp
**email: aepm49@keyaki.cc.u-tokai.ac.jp
**email: hiroaki@drive.co.jp

Abstract

In this paper we introduce the design of a memory latency tolerant processor called SCALT. We show that adding latency tolerant features to a conventional load/store architecture processor does not bring complexity to the design and does not affect the processor's critical path. To deal with the problem of latency deviation, we have proposed an instruction that checks whether data has arrived in a buffer. This report describes SCALT, which uses the buffer check instruction, and its performance evaluation results, obtained by analyzing an SMP system with an event-driven simulator.

Keywords: Latency Tolerant, Out of Order Execution, Software Prefetch, Decoupled Architecture

1 Introduction

While processor technology has been improving, the clock frequency has also been increasing. With the increased requirement for memory capacity, DRAM is still used as main memory because of its capacity. Though data throughput has increased thanks to new technologies such as SDRAM or RAMBUS, the speed gap between main memory and processor still exists: the memory latency. Fast cache memory is used to fill this gap. Indeed, performance goes up with cache capacity for small-scale programs, i.e. the larger the cache, the higher the cache hit rate. But as we know, the overhead of a cache miss is significant, and the overhead measured in clock counts grows as the clock frequency rises.

Historically, architects approached this problem by decoupling the access issue from the data usage. We got 1) hit-under-miss, 2) miss-under-miss, and 3) out-of-order execution combined with many registers. The idea is simple: "issue the loads as fast as possible". But if we need more and more registers to cover more latency, we must change the processor architecture, which introduces compatibility problems, or we must use hidden registers, which introduces great complexity into the design. Cache prefetching is thought to be a good idea for these problems. But the cache is a software-transparent resource and is not controllable from software: software must still assume that prefetched data may not be in the cache, or that someone may rewrite the value. And if we have multiple processors in a system, prefetching causes other performance problems.

If we have a reliable buffer which is under the control of the program, the processor design becomes simpler. We do not need a complex out-of-order core only to wait out the memory latency, nor do we need hundreds of registers for deep unrolling. Issuing decoupled buffer transfer requests, together with a well-balanced number of registers, covers the latency for the buffer. This is what SCALT does. The decoupled transfer requests are not restricted to the register's word size, and the number of on-the-fly requests can be vast. With this architecture we can therefore build a processor with a supercomputer's range of memory bandwidth. The buffer-to-register transfer is still restricted by the processor's load instruction issue rate, but that is a smaller problem than the main memory latency.
1.1 Problems related to the memory system

1.1.1 Physical limits of data throughput

Related to LSI packaging, it has been difficult to assign large numbers of pins for data I/O. Recently this restriction became less serious when BGA packaging was introduced, which allows arranging pins across the face of a chip. The upper limit of signal frequency for CMOS I/O signals was also comparatively low in the past, but here too transfer rates have improved through impedance-matched signaling. As data throughput improves, instruction processing performance also improves. Ways to achieve higher processing performance are increasing the clock frequency or superscalar architectures. Having reached the gigahertz range, further increases of the clock frequency are becoming technologically difficult, and for superscalar architectures, complexity diminishes the speed improvement. Therefore a need for new innovations in processor architecture arose. VLIW is one attempt: it enables parallel data processing but has no effect on data transfer speed. Thus many VLIW-based processors can efficiently process basic data types, but for large amounts of data, memory throughput becomes a major bottleneck.

1.1.2 Memory latency

Another bottleneck is memory latency. If data cannot be provided from memory as fast as the processor requests it, the processor has to wait for memory and performance goes down. Therefore, ways of working around memory latency become a key concern when designing high-performance architectures. In recent years, multiprocessor systems with shared memory have commonly been used for high-performance applications. With multiple processors accessing the same memory, data response time becomes even longer, so the need for latency tolerant architectures became even stronger. Traditional high-performance computers (e.g. vector machines) cope with long latency by processing multiple huge data vectors simultaneously. But processing small data vectors, and the time needed for proper vectorization, create an overhead that lowers general performance. Furthermore, this type of high-performance computer needs a long development period and is quite expensive compared to microprocessors. Apart from vector machines there are not many architectures that secure high data throughput at the instruction level. Recently other architectures provide SIMD (Single Instruction Multiple Data stream) instructions, but these are restricted to basic data types.

There are two major methods of providing high performance with latency tolerance. One is processing independent data while the processor is waiting for data from memory. A simple example is thread switching: when a thread waits for data from memory, another thread whose data is available can be executed in the meantime. Another method is to prefetch data into the cache so that it is available when requested. This is implemented by analyzing the program flow at compile time and inserting prefetch instructions at the proper places. The problem with implementing prefetches is maintaining consistency between cache and memory. Imagine a store instruction between a prefetch and the corresponding load instruction, with all three referring to the same memory address. The memory control unit of the processor has to deal with such overlapping cases: data exchange between memory and processor has to be buffered, and the more overlaps a processor can handle simultaneously, the more complex it is. Today's normal end-user processors can usually handle up to two or four overlaps. Another problem is cache thrashing caused by prefetches. If too many prefetches occur, the cache capacity or associativity is exceeded, so already-cached data is evicted from the cache. If this data is requested again, performance drops because it has to be fetched from memory again. Also, on multiprocessor systems, performance drops occur when two or more processors compute on data from the same memory area: given two processors holding the same memory area in cache, one processor has to refill its cache every time the other processor makes changes to that area.
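As an illustration of the overlap case described above, the fragment below (our own example, using GCC's __builtin_prefetch only to make the ordering explicit) has a prefetch, a store, and a load all referring to the same address; the memory control unit must guarantee that the final load observes the stored value even though the prefetch is already in flight.

/* Illustrative only: prefetch, store and load to the same address. */
void overlap_case(volatile int *p) {
    __builtin_prefetch((const void *)p);  /* prefetch *p into the cache */
    *p = 42;                              /* store to the same address  */
    int v = *p;                           /* this load must return 42   */
    (void)v;
}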
Because the cache has to be consistent with memory, a processor with cache prefetching has to treat specially any access to memory locations which have been prefetched but not yet used: prefetched data which has become invalid has to be discarded.

2 The Basics of SCALT

We propose an architecture called SCALT, which transfers data between memory and processor asynchronously to the instructions requesting it, and which provides a buffer as a resource that does not depend on consistency with memory. The buffer consists of many line entries. Each line has the same length and boundary as a cache line, and we define three instructions to manipulate the buffer entries. We designed the instructions carefully so that they will not prevent high clock rates. The instructions for the asynchronous transfer designate one memory location and one buffer entry number, which means we can use the usual address mapping scheme to issue the transfer. The buffer entry number directly designates the location within the buffer, and the processor issues the memory request with the entry number as a tag in a register. The memory system can process the requests out of order, attaching the tag. When a request returns its data, the processor stores the data into the associated entry by referencing the tag. There are two instructions, "B.FETCH" and "B.STORE", for the transfers.

The required size of the buffer is determined by the processing performance and the latency. For example, assume a 1 GHz processor and a 2000 ns memory latency: if the processor issues one load per clock, the required outstanding buffer is 16 Kbytes, that is, 1 x 10^9 x sizeof(double) x 2000 x 10^-9. If we use 64-byte lines, only 256 entries are required. Since these entries do not have to be consistent with memory, the processor does not have to care about possible instructions which alter the memory locations from which data has been fetched or prefetched.
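To spell out this arithmetic, the following small C program (illustrative only; the variable names are ours) reproduces the sizing estimate under the stated assumptions of one 8-byte load per 1 GHz clock, a 2000 ns latency, and 64-byte lines.

#include <stdio.h>

/* Buffer sizing sketch for the example in section 2:
 * outstanding data = issue rate x element size x memory latency. */
int main(void) {
    double issue_rate_hz = 1e9;              /* one load issued per clock at 1 GHz */
    double element_bytes = sizeof(double);   /* 8 bytes per load                   */
    double latency_s     = 2000e-9;          /* 2000 ns memory latency             */
    double line_bytes    = 64.0;             /* buffer line = cache line length    */

    double outstanding = issue_rate_hz * element_bytes * latency_s;  /* 16000 B    */
    double entries     = outstanding / line_bytes;                   /* ~250       */

    printf("outstanding data: %.0f bytes (about 16 Kbytes)\n", outstanding);
    printf("64-byte entries : %.0f (about 256 after rounding up)\n", entries);
    return 0;
}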

The processor design also becomes compact, as no address checking or special buffering is required. The size and number of buffer entries are configurable to fit the needs of the target application; for small-scale applications such as desktop computers, the architecture can compensate for low memory bus capacity. The number and size of buffer entries are software-independent, so no recompilation is needed to run well-designed software on different versions of a system.

Figure 1: An example load path configuration with SCALT (load request carrying tag and address; cache with valid bit and address; buffer with valid bits; streaming return from memory carrying tag and data).

3 The data flow in SCALT

Figure 2 shows an example of assembly code for a general RISC processor architecture and for the SCALT architecture. Here, the constant (a) of DAXPY is in register r1. The data flow corresponding to the assembly code is shown in figure 3. "B.FETCH" is an instruction that fetches data from main memory into the specified buffer entry; the data size fetched each time has the same length as a cache line. "B.CHECK" checks the valid bit of the specified buffer entry and is an instruction that returns that value. "B.STORE" is an instruction that stores data from a buffer entry to main memory. Figure 2 shows fetches only to buffer entries 0 and 1; however, the fetch instructions can use more buffer entries. Data that has arrived in a buffer entry is read from its virtually mapped address and loaded into a register with the usual load instruction, then used in the arithmetic operation. Arithmetic results are stored into a buffer entry, which is mapped to a virtual address, by the usual store instruction. When the usual store instructions have filled a buffer entry with data, the B.STORE instruction stores the specified entry's data to main memory.

Figure 2: An example of the SCALT assembly code for DAXPY, y[i] = y[i] + a*x[i]: plain RISC code using load/mult/add/store, SCALT code with B.FETCH and B.STORE, and SCALT code that also uses B.CHECK.

Figure 3: Data flow (registers, buffer entries mapped to virtual addresses, cache, and main memory).

Sparse matrices are important for many real applications. There are several implementation methods for such a matrix; one of them is row-major compaction, represented as A[L[i]]. Stride prediction cannot predict the stride in this case, because L[i] does not increase consistently. SCALT treats a sparse matrix as follows: it prefetches the portion of L, and then issues a prefetch instruction for each value A[L[i]].

4 Buffer Check Instruction

When data has not arrived in a buffer entry, the processor must wait for the data arrival, i.e. the processor stalls (figure 4). The program checks for data arrival using the "B.CHECK" instruction; if the data has not arrived, the program checks the next buffer entry (for example, entry 2, entry 3, and so on). The program processes only data that has already arrived in a buffer entry. When data has not arrived, processing advances to the next buffer entry, and the entry is checked again later; if the data has arrived, it is processed, and so on.
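As a concrete illustration of this data flow and of the entry-skipping scheme just described, here is a minimal C sketch of the DAXPY loop of figure 2. The buffer array and the b_fetch, b_check and b_store helpers are our own software stand-ins for the virtually mapped buffer and the B.FETCH, B.CHECK and B.STORE instructions; the entry numbering, sizes and pending-list handling are illustrative assumptions, not details taken from the paper.

#include <stdio.h>
#include <string.h>

#define LINE_BYTES   64                      /* buffer line = cache line length */
#define LINE_DOUBLES (LINE_BYTES / 8)        /* 8 doubles per 64-byte entry     */
#define ENTRIES      256                     /* entries sized for the latency   */

/* Software model of the buffer: data lines plus a valid bit per entry.
 * b_fetch/b_check/b_store are hypothetical stand-ins for B.FETCH, B.CHECK
 * and B.STORE; here they complete synchronously. */
static double buf[ENTRIES][LINE_DOUBLES];
static int    valid[ENTRIES];

static void b_fetch(int e, const double *mem) {   /* B.FETCH: memory -> entry e */
    memcpy(buf[e], mem, LINE_BYTES);
    valid[e] = 1;   /* in hardware, set when the reply tagged with e arrives    */
}
static int b_check(int e) { return valid[e]; }    /* B.CHECK: read the valid bit */
static void b_store(int e, double *mem) {         /* B.STORE: entry e -> memory  */
    memcpy(mem, buf[e], LINE_BYTES);
    valid[e] = 0;
}

/* y[i] = y[i] + a*x[i], one line at a time.  Entries whose data has not
 * arrived are kept on a pending list and retried, as in figure 5.
 * This sketch assumes all lines fit in the buffer at once (n <= 1024). */
static void daxpy_scalt(int n, double a, const double *x, double *y) {
    int lines = n / LINE_DOUBLES;
    int pending[ENTRIES], npend = 0;

    for (int l = 0; l < lines; l++) {             /* issue all transfers up front */
        b_fetch(2 * l,     &x[l * LINE_DOUBLES]); /* even entries hold x lines    */
        b_fetch(2 * l + 1, &y[l * LINE_DOUBLES]); /* odd entries hold y lines     */
        pending[npend++] = l;
    }
    while (npend > 0) {                           /* process whatever has arrived */
        int still = 0;
        for (int p = 0; p < npend; p++) {
            int l = pending[p];
            if (!b_check(2 * l) || !b_check(2 * l + 1)) { pending[still++] = l; continue; }
            for (int j = 0; j < LINE_DOUBLES; j++)
                buf[2 * l + 1][j] += a * buf[2 * l][j];   /* ordinary loads/stores */
            b_store(2 * l + 1, &y[l * LINE_DOUBLES]);
        }
        npend = still;
    }
}

int main(void) {
    enum { N = 1024 };                            /* 128 lines -> 256 entries     */
    static double x[N], y[N];
    for (int i = 0; i < N; i++) { x[i] = i; y[i] = 1.0; }
    daxpy_scalt(N, 2.0, x, y);
    printf("y[10] = %.1f\n", y[10]);              /* expect 1 + 2*10 = 21.0       */
    return 0;
}

In this sketch every fetch completes immediately, so b_check never fails; on the real hardware the valid bit is set only when the tagged reply returns from memory, and the pending list then plays the role of the non-arrived data list described in section 4.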

Figure 4: An example of the B.CHECK instruction (data is prefetched into buffer entries; when B.CHECK finds that an entry's data has not arrived, the entry's number is added to the non-arrived list and the processor proceeds without stalling; arrived entries are processed and then prefetched again).

Figure 5 shows an example program flow with the buffer check instruction (B.CHECK). Because of the deviation of the memory latency, the data in a buffer entry may not have arrived yet. In that case, buffer entries with non-arrived data are identified by the buffer check instruction, and the arithmetic operation is performed only for the ready entries. Buffer entries with non-arrived data are managed in a non-arrived data list and are checked in every loop iteration until the data arrives. When the data arrives, it is processed and the respective buffer entry is removed from the non-arrived data list. The non-arrived data list is controlled by software. B.CHECK takes one clock cycle to finish, and the wait time for data arrival becomes insensitive because the processor does not stall.

Figure 5: The flow chart using the buffer check instruction (prefetch all buffer entries; check each entry; add non-arrived entries to the non-arrived list; process arrived entries and prefetch them again; then process and prefetch the entries on the non-arrived list until all data has been processed).

5 Performance evaluation

For the hardware performance evaluation we used an event-driven simulator coded in C. We used the Livermore Loops, and the address generation patterns were specified by a special compiler for this simulator. The simulation model is shown in figure 6. The Livermore Loops program resides in INST MEM in the model; the processor fetches one instruction per cycle from INST MEM and executes it. This model is the same as a general processor model. The number of processors, used as a parameter, ranged from 1 to 64. The connection between main memory and processors was a crossbar. The number of memory banks was increased as 2^N (N = 0, 1, 2, ..., 9). When the number of memory banks changed, we investigated the variation of the average rate of bank utilization. With the number of memory banks held constant, we varied the number of processors and the number of buffer entries and measured how many loop executions occurred. The parameters used in the simulation are shown in table 1.

Table 1: Simulation parameters
  CPU cycle time: 5 nS
  Number of memory banks: L
  Data transmission time between LSIs: 5 nS
  SC request pitch: 1 nS
  Length of memory bank: 16 B
  Length of buffer entry: 32 B
  Access cycle of memory: 25 nS
  Memory access time: M nS
  Number of buffer entries: N

The Livermore Loops were executed repeatedly for a constant time. Memory bank conflicts occurred frequently in the SMP system, so the memory latency of a fetch changed dynamically. The buffer check instruction finishes in one clock. Accesses to the non-arrived data list are cache accesses on actual hardware; in the simulation, the list is accessed in one clock, a buffer entry with non-arrived data is added to the list in one clock, and it is removed from the list in one clock. These costs appear as overhead at low rates of memory utilization in the following simulation results.
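For reference, the inner loop of Livermore kernel 1 (the hydro fragment) in its usual C form is shown below; this is the standard published kernel rather than code taken from the paper, and kernel 4 (banded linear equations) is evaluated in the same manner.

/* Livermore Loops kernel 1 (hydro fragment), standard C form.
 * q, r, t are scalar constants; x and y have length n, z has length n + 11. */
void kernel1(int n, double q, double r, double t,
             double *x, const double *y, const double *z) {
    for (int k = 0; k < n; k++)
        x[k] = q + y[k] * (r * z[k + 10] + t * z[k + 11]);
}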

Figure 6: Simulation model (CPUs with INST MEM and buffer, request and load buses, SCU with request and load queues, MCU queues, and memory banks).

6 Simulation result

The simulation evaluated Livermore Loops kernels 1 and 4. The simulation results are shown in figures 7 to 14. In figures 7 to 10 we compare the performance of a common RISC processor and SCALT; the number of load requests the RISC processor issues to the memory system on a cache miss was set to 2 or 4. Figures 7 and 8 show that the performance of SCALT is almost proportional to the number of memory banks; this shows that systems of different performance levels can be built by changing the number of memory banks in SCALT. Figures 9 and 10 show that SCALT achieves high performance with a small number of processors; the saturation of processing is caused by the performance limit of the memory system (figure 10). The performance of SCALT using the buffer check instruction was lower than without it when there are few processors. With few processors the rate of memory utilization is low, which shows that the instruction overhead of the buffer check instruction causes a performance disadvantage when there is little non-arrived data. Figures 7 to 12 evaluate the performance running kernel 1. When the buffer check instruction is used, about a 2% performance improvement can be expected, which shows that the effect of the buffer check instruction was higher than its overhead. Figure 13 (evaluation running kernel 4) shows almost the same result as figure 11; that is, when there is a stride, the effect of the buffer check instruction can be expected. Figure 14 shows that about 3 buffer entries are enough for this kernel 4 simulation.

7 About implementation

The implementation of the buffer is almost the same as that of a cache. As the simulation shows, it is effective even with a few kilobytes; that is, the buffer can be smaller than a cache. Because we do not add a heavy load to the processor's critical path, we consider that SCALT will not hurt the processor's clock frequency.

8 Conclusion

We described our proposed architecture with a software-controllable buffer, made a performance evaluation of the architecture, and showed its effectiveness. Our proposed SCALT achieved higher performance than a general RISC processor under deviation of the memory latency, improved performance almost proportionally to the number of memory banks, and achieved higher performance with fewer processors. However, the effect of the buffer check instruction depends on the application and the memory performance, as shown in the simulation results, so further evaluation is needed. An HDL version is now in the logic simulation phase. From now on, we intend to run simulations on various workloads, develop a compiler and library, and make an FPGA version of SCALT.

Figure 7: Performance vs. number of memory banks (1 CPU).
Figure 8: Performance vs. number of memory banks (4 CPUs).
Figure 9: Performance vs. number of CPUs (256 banks).
Figure 10: Average rate of bank utilization vs. number of CPUs (256 banks).
Figure 11: Performance vs. number of memory banks (kernel 1, 8 and 16 CPUs).
Figure 12: Average rate of bank utilization vs. number of memory banks (kernel 1, 8 and 16 CPUs).
Figure 13: Performance vs. number of banks (kernel 4, 8 and 16 CPUs).
Figure 14: Performance vs. number of buffer entries (kernel 4, 8 and 16 CPUs).