Michael Adler 2017/05


FPGA Design Philosophy
Standard platform code should:
- Provide base services and semantics needed by all applications
- Consume as little FPGA area as possible
FPGAs overcome a huge frequency disadvantage relative to CPUs through application-specific, spatial solutions
- Application-specific memory semantics are part of this solution
- Instantiate exactly the memory semantics required
- Avoid wasting FPGA resources on unnecessary memory management

CCI: Core Cache Interface
CCI provides a base platform memory interface: just reads and writes
- Simple request/response interface
- Physical addresses
- No order guarantees
  - Writes may bypass other writes, even to the same address
  - Few guarantees of consistency between FPGA-generated reads and writes
These minimal requirements satisfy major classes of algorithms, e.g.:
- Double-buffered kernels that read from and write to different buffers
- Streaming kernels that read from one memory-mapped FIFO and write to another

Higher Level Memory Services and Semantics
Some applications need one or more of:
- Virtually addressed memory for large, contiguous buffers or sharing pointers with software
- Ordered read responses
- Write/read memory consistency guarantees
Applications may not need all of these attributes

MPF: Memory Properties Factory
- Provides a common collection of memory semantic extensions to CCI
- Applications instantiate only the semantics they require
- Each MPF block is implemented as a CCI to CCI shim
  - Consume CCI requests
  - Implement some feature (e.g. translate virtual addresses to physical)
  - Produce transformed CCI requests
- Application-specific memory hierarchies are formed by composing MPF shims
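The shim idea can be sketched in software. What follows is a minimal behavioral model in Python, not MPF RTL: each shim is a function that consumes a request and produces a transformed request, and a memory hierarchy is a chain of such functions. The names `Request`, `make_vtp`, `make_vc_map`, and `compose`, the 2MB page size, and the modulo channel choice are all assumptions made for illustration.

```python
from dataclasses import dataclass, replace
from functools import reduce
from typing import Callable, Dict, List

PAGE_BITS = 21  # 2MB pages, assumed for this illustration
LINE_BITS = 6   # 64-byte lines


@dataclass(frozen=True)
class Request:
    """A toy read request: an address, a virtual flag, and a channel."""
    addr: int
    is_virtual: bool = True
    channel: int = -1  # -1 means "no explicit channel chosen yet"


Shim = Callable[[Request], Request]


def make_vtp(page_table: Dict[int, int]) -> Shim:
    """Return a shim that translates virtual page numbers to physical ones."""
    def vtp(req: Request) -> Request:
        if not req.is_virtual:
            return req
        vpn = req.addr >> PAGE_BITS
        offset = req.addr & ((1 << PAGE_BITS) - 1)
        phys = (page_table[vpn] << PAGE_BITS) | offset
        return replace(req, addr=phys, is_virtual=False)
    return vtp


def make_vc_map(num_channels: int) -> Shim:
    """Return a shim that picks a channel as a pure function of the line address."""
    def vc_map(req: Request) -> Request:
        return replace(req, channel=(req.addr >> LINE_BITS) % num_channels)
    return vc_map


def compose(shims: List[Shim]) -> Shim:
    """Chain shims in order: AFU side first, FIU side last."""
    return lambda req: reduce(lambda r, shim: shim(r), shims, req)


# Only the shims an application needs are placed in the chain.
pipeline = compose([make_vtp({0x5: 0x123}), make_vc_map(3)])
out = pipeline(Request(addr=(0x5 << PAGE_BITS) | 0x40))
print(hex(out.addr), out.is_virtual, out.channel)
```

Because every stage has the same request-in, request-out shape, dropping a shim from the list leaves the rest of the chain unchanged, which mirrors the CCI to CCI property described above.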


Abstract Architecture
- AFU: Accelerated Function Unit (user IP), also known as the Green Bitstream
- FIU: FPGA Interface Unit (Intel-provided), the Blue Bitstream
- FPGA connects to system memory via one or more physical channels
- CCI exposes physical channels as a single, multiplexed read/write memory interface
[Figure: inside the FPGA, the AFU connects to the FIU over CCI; the FIU connects to system memory over three physical channels.]

Abstract Architecture
- AFU: Accelerated Function Unit (user IP), also known as the Green Bitstream
- FIU: FPGA Interface Unit (Intel-provided), the Blue Bitstream
- FPGA connects to system memory via one or more physical channels
- CCI exposes physical channels as a single, multiplexed read/write memory interface
- AFU may instantiate MPF as a CCI to CCI bridge, maintaining the same interface but adding new semantics
[Figure: inside the FPGA, MPF sits between the AFU and the FIU with CCI on both sides; the FIU connects to system memory over three physical channels.]

Base CCI
- Various clocks, reset, power control
- One request struct: pck_af2cp_stx (type t_if_ccip_tx)
- One response struct: pck_cp2af_srx (type t_if_ccip_rx)
- Request and response structures contain:
  - One channel for memory reads (c0)
  - One channel for memory writes (c1)
- Requests may target specific system physical channels or command the FIU to choose the least busy channel

Physical Channels
- Multiplexed as a single bus in CCI-P
  - Addressable independently using the vc_sel field in the request header
  - Ganged together by the blue bitstream as a single high-bandwidth logical channel with the evc_va tag
- CCI-P channels have deliberate races:
  - No guarantee that reads to one channel will return the result of a write to a different channel, even when the write has already returned an ACK!
  - Consistent with design philosophy: base platform supports only universal requirements. Races are no problem when streaming or double buffering.
  - A write fence to the evc_va channel synchronizes all channels but is too slow for frequent use.

CCI Memory Spaces: Without IOMMU
- AAL allocates shared memory in process virtual space
- Physical memory pinned to I/O space
- FIU performs no address translation
  - AFU requests host physical addresses
  - FIU emits host physical addresses
[Figure: on the CPU side, AAL maps process virtual to host physical; on the FPGA side, both the AFU and the FIU use host physical addresses.]

CCI Memory Spaces: Without IOMMU
- AAL allocates shared memory in process virtual space
- Physical memory pinned to I/O space
- FIU performs no address translation
  - AFU requests host physical addresses
  - FIU emits host physical addresses
- Virtual addresses in AFU require translation
- MPF VTP (Virtual to Physical) acts as a TLB
  - Accepts process virtual addresses from AFU
  - Translates to host physical addresses
[Figure: on the CPU side, AAL maps process virtual to host physical; on the FPGA side, the AFU uses process virtual addresses, MPF VTP sits between the AFU and the FIU, and the FIU uses host physical addresses.]

CCI Memory Spaces: With IOMMU
- Host kernel defines guest virtual machine
  - Guest physical address space protects host memory
  - Guest can write only to AAL managed memory
  - FIU translates guest physical to host physical by querying IOMMU
- Ideally: Guest physical == Process virtual
  - Would require no AFU translation!
  - Can't have it yet. Kernels don't support it. Hypervisors are designed for guest kernels that manage disjoint guest virtual spaces in virtual machines.
[Figure: on the CPU side, AAL maps process virtual to host physical and the IOMMU serves the FIU; on the FPGA side, both the AFU and the FIU use guest physical addresses.]

CCI Memory Spaces: With IOMMU
- Host kernel defines guest virtual machine
  - Guest physical address space protects host memory
  - Guest can write only to AAL managed memory
  - FIU translates guest physical to host physical by querying IOMMU
- Ideally: Guest physical == Process virtual
  - Would require no AFU translation!
  - Can't have it yet. Kernels don't support it. Hypervisors are designed for guest kernels that manage disjoint guest virtual spaces in virtual machines.
- MPF translates process virtual to guest physical
[Figure: on the CPU side, AAL maps process virtual to host physical and the IOMMU serves the FIU; on the FPGA side, the AFU uses process virtual addresses, MPF VTP sits between the AFU and the FIU, and the FIU uses guest physical addresses.]


MPF Composable Shims
All MPF shims may be enabled or disabled independently:
- VTP: Virtual to physical address translation
- ROB: Reorder buffer to sort read responses and return them in request order
- WRO: Intra-line write/read ordering
- VC Map: Map requests to system memory channels explicitly
- PWRITE: Partial (masked) write emulation using read-modify-write
Note: Some shims depend on other shims, e.g.:
- WRO on VC Map
- PWRITE on WRO

VTP: Virtual to Physical
- Resembles a traditional TLB
- Separate translation tables for 4KB and 2MB pages
  - Level 1: 512 entry TLB, direct mapped, one per request channel
  - Level 2: 512 entry four-way set-associative TLB, shared across all channels
  - Hardware, caching page table walker
  - Size and associativity of each table are configurable
- No prefetcher: we have not found a need
  - L2 misses are rare with 2MB pages
  - MPF's caching page table walker generates only one memory read per 16MB of memory with stride one streams and 2MB pages
- Planning to add support for 1GB pages for SKX
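A direct-mapped lookup of the kind the Level 1 TLB performs can be sketched as a small Python model. This is an illustration only: the class name, the choice of the low virtual-page-number bits as the index, and the fill policy are assumptions, not details taken from the MPF sources. The last lines work through the 16MB figure under the further assumption that one 64-byte page table line holds eight translations.

```python
class DirectMappedTLB:
    """Toy direct-mapped TLB for a single page size."""

    def __init__(self, entries: int = 512, page_bits: int = 21):
        self.entries = entries
        self.page_bits = page_bits
        self.tags = [None] * entries   # virtual page number held in each slot
        self.ppns = [0] * entries      # physical page number held in each slot

    def lookup(self, vaddr: int):
        """Return the physical address on a hit, or None on a miss."""
        vpn = vaddr >> self.page_bits
        slot = vpn % self.entries
        if self.tags[slot] != vpn:
            return None
        offset = vaddr & ((1 << self.page_bits) - 1)
        return (self.ppns[slot] << self.page_bits) | offset

    def fill(self, vpn: int, ppn: int) -> None:
        """Install a translation, evicting whatever shared the slot."""
        slot = vpn % self.entries
        self.tags[slot] = vpn
        self.ppns[slot] = ppn


tlb = DirectMappedTLB()
tlb.fill(vpn=3, ppn=0x40)
hit = tlb.lookup((3 << 21) + 0x1000)   # same page, so the lookup hits
miss = tlb.lookup(515 << 21)           # 515 % 512 == 3: same slot, other tag
tlb.fill(vpn=515, ppn=0x41)            # direct mapped: evicts vpn 3
evicted = tlb.lookup(3 << 21)

# Reach of one cached page table line, ASSUMING 8 translations per line.
ENTRIES_PER_LINE = 8                   # assumption, not stated on the slide
PAGE_MB = 2
reach_mb = ENTRIES_PER_LINE * PAGE_MB
print(hex(hit), miss, evicted, reach_mb)
```

Under that assumption a stride one stream touches a new page table line once every 16MB, which is consistent with the read rate quoted above.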

VTP Software Component
- VTP maintains its own page table, shared with the VTP FPGA shim
- Applications allocate & deallocate memory with the VTP software service:
  - mpfvtpbufferallocate()
  - mpfvtpbufferfree()
- The VTP page table is updated as a side effect of allocation
- Allocation & deallocation may occur at any time during a run
- See test/test-mpf/base/sw/cci_test.h and
  - fpga_svc_wrapper.cpp (new driver version)
  - aal_svc_wrapper.cpp (AAL version)

ROB: Reorder Buffer
- CCI returns read responses unordered
  - AFU tags read requests with a unique number
  - FIU returns the tag with the response
- ROB sorts read responses
  - Eliminates need for AFU tagging
  - CCI reads behave more like FPGA-attached DRAM
- ROB is sized to enable maximum bandwidth
- ROB adds latency, especially when physical channels have different latencies
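The sorting behavior can be modeled in a few lines of Python. This is a sketch of a generic circular reorder buffer, assuming the slot index doubles as the request tag; it is not a description of how the MPF ROB is actually built, and the class and method names are hypothetical.

```python
class ReorderBuffer:
    """Toy reorder buffer: responses arrive in any order, leave in request order."""

    def __init__(self, size: int):
        self.size = size
        self.data = [None] * size
        self.valid = [False] * size
        self.next_alloc = 0   # slot handed to the next request
        self.oldest = 0       # slot whose response must be delivered next
        self.in_flight = 0

    def allocate(self) -> int:
        """Reserve a slot for a new read; the slot index is used as the tag."""
        if self.in_flight == self.size:
            raise RuntimeError("ROB full: the requester must stall")
        tag = self.next_alloc
        self.next_alloc = (self.next_alloc + 1) % self.size
        self.in_flight += 1
        return tag

    def respond(self, tag: int, payload) -> None:
        """Record a response that came back carrying the given tag."""
        self.data[tag] = payload
        self.valid[tag] = True

    def drain(self) -> list:
        """Return every response that is now deliverable in request order."""
        out = []
        while self.in_flight and self.valid[self.oldest]:
            out.append(self.data[self.oldest])
            self.valid[self.oldest] = False
            self.oldest = (self.oldest + 1) % self.size
            self.in_flight -= 1
        return out


rob = ReorderBuffer(4)
tags = [rob.allocate() for _ in range(3)]
rob.respond(2, "C")          # youngest request answers first
first = rob.drain()          # nothing deliverable: request 0 is still pending
rob.respond(0, "A")
second = rob.drain()
rob.respond(1, "B")
third = rob.drain()
print(first, second, third)
```

The model also shows where the added latency comes from: response "C" is held until the slower responses ahead of it arrive.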

WRO: Write/Read Ordering
- CCI provides no intra- or inter-line order guarantees
  - Even conflicting writes are unordered
- CCI leaves synchronization to the AFU: track write ACKs or use fences
  - Fences are slow!
  - No problem for kernels that:
    - Maintain discrete read and write spaces
    - Write each address only once
    - Emit fences infrequently
  - Avoid fences when they would be required frequently

WRO: Write/Read Ordering
- WRO guarantees that requests within a line complete in order
  - Writes to the same line retire in order
  - Reads always return the most recently written value
  - Reads have priority when arriving in the same cycle as a conflicting write
  - Still no guarantees about inter-line consistency!
- Write/read hazard detection is implemented as a collection of filters
  - CAMs would be too expensive to support the number of requests in flight
  - Filters are sized to balance FPGA area against the rate of false conflicts
  - Multiple reads to the same location are permitted
  - Filter sizes are configurable
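Filter-based hazard detection, and the false conflicts it trades for area, can be illustrated with a small Python model. The sketch assumes a counter per bucket for in-flight reads, a single bit per bucket for in-flight writes, and a plain modulo hash; the class name and these choices are illustrative assumptions. Same-cycle read priority is not modeled.

```python
class HazardFilter:
    """Toy hashed filter that blocks requests conflicting with in-flight ones."""

    def __init__(self, buckets: int = 64):
        self.buckets = buckets
        self.reads = [0] * buckets        # count of in-flight reads per bucket
        self.writes = [False] * buckets   # is a write in flight in this bucket?

    def _bucket(self, line_addr: int) -> int:
        # Assumed hash: a real design would likely mix more address bits.
        return line_addr % self.buckets

    def try_issue(self, line_addr: int, is_write: bool) -> bool:
        """Return True and record the request if it has no apparent conflict."""
        b = self._bucket(line_addr)
        if self.writes[b]:
            return False                  # anything behind a write must wait
        if is_write:
            if self.reads[b] > 0:
                return False              # a write must wait for earlier reads
            self.writes[b] = True
        else:
            self.reads[b] += 1            # multiple reads may share a bucket
        return True

    def retire(self, line_addr: int, is_write: bool) -> None:
        """Release the filter state when a request completes."""
        b = self._bucket(line_addr)
        if is_write:
            self.writes[b] = False
        else:
            self.reads[b] -= 1


f = HazardFilter(buckets=64)
read_a = f.try_issue(0x10, is_write=False)
read_b = f.try_issue(0x10, is_write=False)     # second read is permitted
write_blocked = f.try_issue(0x10, is_write=True)
f.retire(0x10, is_write=False)
f.retire(0x10, is_write=False)
write_ok = f.try_issue(0x10, is_write=True)
false_conflict = f.try_issue(0x50, is_write=False)  # other line, same bucket
print(read_a, read_b, write_blocked, write_ok, false_conflict)
```

The last request targets a different line but is still blocked because 0x50 and 0x10 share a bucket. Growing the filter lowers the rate of such false conflicts at the cost of area.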

VC Map: Address-Based Host Channel Mapping
- AFUs that enable WRO almost always require VC Map
  - CCI channels have deliberate races (see Physical Channels slide)
- VC Map avoids inter-channel races:
  - AFU passes requests to MPF using evc_va, the same mechanism used for CCI mapping
  - VC Map selects explicit physical channels before routing requests to CCI
  - Channel mapping is a function of a request's address: a given address is always mapped to the same channel

VC Map Optimizations
- Channel mapping is surprisingly complicated
  - Optimal throughput is achieved only when requests are balanced across physical channels
  - Optimal request rate balance varies with the types and sizes of requests
- VC Map dynamically responds to AFU-generated traffic, picking the optimal request rates to each physical channel
- VC Map may choose to rebalance traffic as request patterns vary. It must:
  - Stop all AFU traffic by asserting Almost Full
  - Wait for all current traffic to retire
  - Emit a write fence
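Why rebalancing needs a drain and a fence can be seen in a small Python sketch. It assumes, purely for illustration, that the mapping divides line addresses among channels according to a tuple of relative weights; the function name and the weighting scheme are hypothetical and are not the algorithm MPF uses.

```python
def map_channel(line_addr: int, weights: tuple) -> int:
    """Map a line address to a channel; weights give each channel's share."""
    slot = line_addr % sum(weights)
    for channel, weight in enumerate(weights):
        if slot < weight:
            return channel
        slot -= weight
    raise AssertionError("unreachable: slot is always below sum(weights)")


# With fixed weights the mapping is a pure function of the address,
# so a read always follows an earlier write to the same channel.
weights = (2, 1, 1)
channels = [map_channel(a, weights) for a in range(16)]
counts = [channels.count(c) for c in range(len(weights))]

# Changing the weights moves some addresses to a different channel.
before = [map_channel(a, (1, 1, 1)) for a in range(12)]
after = [map_channel(a, (2, 1, 1)) for a in range(12)]
moved = [a for a in range(12) if before[a] != after[a]]
print(counts, moved)
```

Any address in `moved` could have a write still in flight on its old channel while a new read goes out on the new one. Stopping traffic, waiting for it to retire, and emitting a write fence closes that window before the new mapping takes effect.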

PWRITE: Partial Write Emulation
- CCI currently provides no mechanism for masked writes
- PWRITE emulates masked writes by:
  - Reading the requested line
  - Updating the masked bytes
  - Writing the merged data
- MPF extends the write request header with byte mask bits
- PWRITE does not lock the line. It is not an atomic operation!
  - Conflicting CPU stores in the middle of a PWRITE sequence may be lost
  - WRO and VC Map may be used to guarantee order within the FPGA
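The read-modify-write sequence and its lack of atomicity can be illustrated with a short Python model. Memory is a dictionary of 64-byte lines, and bit i of the mask selects byte i of the new data; the function names and the mask bit order are assumptions made for this sketch.

```python
LINE_BYTES = 64


def merge_line(old: bytes, new: bytes, mask: int) -> bytes:
    """Take byte i from new where mask bit i is set, else keep byte i of old."""
    if len(old) != LINE_BYTES or len(new) != LINE_BYTES:
        raise ValueError("lines must be exactly 64 bytes")
    return bytes(n if (mask >> i) & 1 else o
                 for i, (o, n) in enumerate(zip(old, new)))


def partial_write(memory: dict, line_addr: int, new: bytes, mask: int) -> None:
    """Emulate a masked write as read, merge, then full-line write."""
    old = memory.get(line_addr, bytes(LINE_BYTES))   # 1. read the line
    merged = merge_line(old, new, mask)              # 2. update masked bytes
    memory[line_addr] = merged                       # 3. write merged data


memory = {0x40: bytes([0xAA] * LINE_BYTES)}
partial_write(memory, 0x40, bytes([0x55] * LINE_BYTES), mask=0b1111)
line = memory[0x40]

# The sequence is not atomic: a CPU store between the read and the write
# is overwritten by the stale copy of the line.
stale = memory[0x40]                                              # read
memory[0x40] = memory[0x40][:10] + b"\x01" + memory[0x40][11:]    # CPU store
memory[0x40] = merge_line(stale, bytes([0x77] * LINE_BYTES), 0b1)  # write
lost_byte = memory[0x40][10]
print(line[:6].hex(), hex(lost_byte))
```

In the second half, the CPU's store to byte 10 disappears even though the masked write only targeted byte 0, which is the hazard the slide warns about.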


Instantiating MPF in an AFU
- See test/test-mpf/test_mem_perf in the MPF distribution for a relatively simple streaming access example
- MPF uses SystemVerilog interfaces to represent CCI-P wires
  - 1:1 mapping from CCI-P structs to cci_mpf_if buses
  - MPF shims have multiple CCI-P buses: one toward AFU, one toward FIU
  - Interfaces simplify arguments to MPF shims
- MPF module ccip_wires_to_mpf() converts CCI-P wires to a cci_mpf_if

CCI-P Wires to MPF Interface

    //
    // Expose FIU as an MPF interface
    //
    cci_mpf_if fiu(.clk(pclk));

    // The CCI wires to MPF mapping connections have identical naming to
    // the standard AFU. The module exports an interface named "fiu".
    ccip_wires_to_mpf
      #(
        // All inputs and outputs in PR region (AFU) must be registered!
        .register_inputs(1),
        .register_outputs(1)
        )
      map_ifc(
        // All CCI-P wire names are passed in along with fiu
        .*
        );

Instantiate MPF
hw/rtl/cci_mpf.sv has extensive comments

    cci_mpf_if afu(.clk(pclk));

    cci_mpf
      #(
        .SORT_READ_RESPONSES(1),
        .PRESERVE_WRITE_MDATA(0),
        .ENABLE_VTP(1),
        .ENABLE_VC_MAP(0),
        .ENABLE_DYNAMIC_VC_MAPPING(1),
        .ENFORCE_WR_ORDER(0),
        .ENABLE_PARTIAL_WRITES(0),
        .DFH_MMIO_BASE_ADDR(MPF_DFH_MMIO_ADDR)
        )
      mpf
       (
        .clk(pclk),
        .fiu,
        .afu
        );

AFU Option #1: Expose MPF as CCI
- See the use of mpf2af_srxport and af2mpf_stxport in sample/afu/ccip_mpf_nlb.sv
- Note MPF extension header bits that must be set, e.g.:

    // Treat all addresses as virtual.
    afu.c0tx.hdr.ext.addrisvirtual = 1'b1;

    // Enable evc_va to physical channel mapping. This will only
    // be triggered when ENABLE_VC_MAP is enabled.
    afu.c0tx.hdr.ext.mapvatophyschannel = 1'b1;

    // Enforce load/store and store/store ordering within lines.
    // This will only be triggered when ENFORCE_WR_ORDER is enabled.
    afu.c0tx.hdr.ext.checkloadstoreorder = 1'b1;

AFU Option #2: Use MPF Interface Directly
- Example: test/test-mpf/base/hw/rtl/cci_test_afu.sv and test/test-mpf/test_random/hw/rtl/test_random.sv
- MPF interface defined in hw/rtl/cci-mpf-if/cci_mpf_if.vh

AFU Option #2 Example

    t_cci_mpf_reqmemhdrparams rd_params;
    t_cci_mpf_c0_reqmemhdr rd_hdr;

    always_comb
    begin
        // Construct a request header
        rd_params = cci_mpf_defaultreqhdrparams();
        rd_params.checkloadstoreorder = enable_wro;
        rd_params.vc_sel = evc_va;
        rd_params.mapvatophyschannel = 1'b1;

        rd_hdr = cci_mpf_c0_genreqhdr((rdline_mode_s ? ereq_rdline_s : ereq_rdline_i),
                                      rd_rand_addr,
                                      t_cci_mdata'(0),
                                      rd_params);
        rd_hdr.base.cl_len = rd_addr_num_beats;
    end

    always_ff @(posedge clk)
    begin
        // Write the request header (2nd argument sets the valid bit)
        fiu.c0tx <= cci_mpf_genc0txreadreq(rd_hdr,
                                           (state == STATE_RUN) && ! c0txalmfull);

        // Reset takes precedence: no request is valid while in reset
        if (reset)
        begin
            fiu.c0tx.valid <= 1'b0;
        end
    end

Building MPF Software Libraries
- Both new FPGA driver and AAL are currently supported
  - CMake script builds each one only when header files from FPGA driver or AAL are found
- Put FPGA/AAL header paths in C/C++ include environment variables
- Put compiled FPGA/AAL paths in LD_LIBRARY_PATH and LIBRARY_PATH
- Use CMake (build directory can be anywhere; just adjust ../ to match):

    cd <MPF Path>/sw
    mkdir build; cd build
    cmake -DCMAKE_C_COMPILER=gcc -DCMAKE_INSTALL_PREFIX=<target dir> ../
    make; make install

MPF Libraries in Software
- Add MPF's installed include directory to C_INCLUDE_PATH and CPLUS_INCLUDE_PATH
- Add MPF's installed lib directory to LD_LIBRARY_PATH and LIBRARY_PATH
  - MPF install target may be the same tree as the base FPGA library or AAL
- Building with MPF and the new FPGA library (libmpf.so):
  - Link your application with libmpf.so (both hardware and ASE are supported)
  - See test/test-mpf/base/sw/fpga_svc_wrapper.cpp for an example
- Building with MPF and AAL (libmpf_aal.so):
  - Load libmpf_aal dynamically using the AAL service discovery mechanism
  - See test/test-mpf/base/sw/aal_svc_wrapper.cpp for an example


MPF Internal Configuration Options
- Some applications, especially those with large memory footprints, may benefit from resizing MPF structures
- hw/rtl/cci_mpf_config.vh defines and documents many options
  - Configure VTP TLB size and associativity
  - Configure WRO hashing parameters
- These options may be set without modifying source, e.g.:

    set_global_assignment -name VERILOG_MACRO "VTP_N_C0_TLB_2MB_SETS=1024"

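MPF Primary Pipeline
[Figure: 400 MHz pipeline between the CCI-P interface to the FIU (blue) and the CCI-P interface to the AFU (green). Blocks shown: FIU, MPF Edge, VTP with its VTP TLB, WRO, ROB, MPF Edge, AFU, plus a separate Write Data path.]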

MPF Primary Pipeline
- Edge modules always instantiated
- Write data bypasses MPF pipeline to save area. Quartus automatically deletes unconsumed data wires in internal MPF interface objects.
- AFU edge validates header settings
- Multi-beat write requests converted to a single representative request
- AFU edge forwards all write data beats to FIU edge module
- FIU edge reconstructs all beats from the representative request
- FIU edge (EOP module) guarantees that all write responses are packed. AFUs using MPF may ignore the c1rx format header field.
- MPF shims are instantiated as needed based on configuration parameters

VTP Micro-Architecture
- L1 hits flow around L1 misses
- Separate direct-mapped L1 cache for reads and writes, 2MB and 4KB pages
- Shared set-associative L2 cache for reads and writes, separate 2MB and 4KB
- Page table walker is not pipelined but does cache recent page table lines
- Default sizes are optimized for M20k minimum depth and bandwidth
- Programs with some page locality should require no VTP tuning
- Programs with completely random access patterns to very large footprints may benefit from larger caches. See hw/rtl/cci_mpf_config.vh.

Debugging Address Translation Failures
- When VTP encounters an untranslatable address it halts
- VTP statistics software interface indicates a failure:
  - numfailedtranslations will be non-zero
  - ptwalklastvaddr will hold the untranslatable virtual address

WRO Micro-Architecture: Ingress Buffer
- Computes read/write conflict epochs using hashed filters:
  - Ingress filters are small, only tracking lifetimes inside the request channels
  - WRO primary pipeline has separate large filters, described later
- Non-conflicting read and write pipelines flow independently
- Independent flow required to avoid starving multi-beat write requests:
  - Multi-beat writes require multiple AFU request cycles
  - Multi-beat reads require only one AFU request cycle
  - Lock-step 4 line read/write requests would emit one write for every 4 reads

WRO Micro-Architecture: Primary Pipeline
- Hashed counting filter for reads, single bit filter for writes
- Non-conflicting requests flow around blocked requests
- Conflicting requests loop back to the head of the primary pipeline
- New traffic is blocked only when the pipeline fills with conflicting requests
- Default depth (and latency) is 4 cycles. Growing the pipeline adds latency but increases the capacity for flowing around blocked requests.
- Configurable hashing. See hw/rtl/cci_mpf_config.vh.

MPF Shim Latencies for READ Requests (Cycles)
(Writes are similar but less latency sensitive)
Each entry lists request cycles / response cycles:
- AFU Edge (mandatory): 0 when WRO disabled, 1 when WRO enabled
- ROB (response order): 0 / 4 (minimum)
- VC Map: 1 / 0
- WRO (write/read order): 10 (minimum) / 0
- VTP (virtual to physical): 5 (L1 hit typical case) / 0
- EOP (mandatory): 0 / 1
- FIU Edge (mandatory)


More information

Architecture Specification

Architecture Specification PCI-to-PCI Bridge Architecture Specification, Revision 1.2 June 9, 2003 PCI-to-PCI Bridge Architecture Specification Revision 1.1 December 18, 1998 Revision History REVISION ISSUE DATE COMMENTS 1.0 04/05/94

More information

Review: Hardware user/kernel boundary

Review: Hardware user/kernel boundary Review: Hardware user/kernel boundary applic. applic. applic. user lib lib lib kernel syscall pg fault syscall FS VM sockets disk disk NIC context switch TCP retransmits,... device interrupts Processor

More information

Structure of Computer Systems

Structure of Computer Systems 222 Structure of Computer Systems Figure 4.64 shows how a page directory can be used to map linear addresses to 4-MB pages. The entries in the page directory point to page tables, and the entries in a

More information

First-In-First-Out (FIFO) Algorithm

First-In-First-Out (FIFO) Algorithm First-In-First-Out (FIFO) Algorithm Reference string: 7,0,1,2,0,3,0,4,2,3,0,3,0,3,2,1,2,0,1,7,0,1 3 frames (3 pages can be in memory at a time per process) 15 page faults Can vary by reference string:

More information

Memory Management. Disclaimer: some slides are adopted from book authors slides with permission 1

Memory Management. Disclaimer: some slides are adopted from book authors slides with permission 1 Memory Management Disclaimer: some slides are adopted from book authors slides with permission 1 CPU management Roadmap Process, thread, synchronization, scheduling Memory management Virtual memory Disk

More information

SRAMs to Memory. Memory Hierarchy. Locality. Low Power VLSI System Design Lecture 10: Low Power Memory Design

SRAMs to Memory. Memory Hierarchy. Locality. Low Power VLSI System Design Lecture 10: Low Power Memory Design SRAMs to Memory Low Power VLSI System Design Lecture 0: Low Power Memory Design Prof. R. Iris Bahar October, 07 Last lecture focused on the SRAM cell and the D or D memory architecture built from these

More information

Copyright 2012, Elsevier Inc. All rights reserved.

Copyright 2012, Elsevier Inc. All rights reserved. Computer Architecture A Quantitative Approach, Fifth Edition Chapter 2 Memory Hierarchy Design 1 Introduction Programmers want unlimited amounts of memory with low latency Fast memory technology is more

More information

Virtual Memory: From Address Translation to Demand Paging

Virtual Memory: From Address Translation to Demand Paging Constructive Computer Architecture Virtual Memory: From Address Translation to Demand Paging Arvind Computer Science & Artificial Intelligence Lab. Massachusetts Institute of Technology November 12, 2014

More information

Chapter 9 Memory Management

Chapter 9 Memory Management Contents 1. Introduction 2. Computer-System Structures 3. Operating-System Structures 4. Processes 5. Threads 6. CPU Scheduling 7. Process Synchronization 8. Deadlocks 9. Memory Management 10. Virtual

More information

Chapter Seven. Memories: Review. Exploiting Memory Hierarchy CACHE MEMORY AND VIRTUAL MEMORY

Chapter Seven. Memories: Review. Exploiting Memory Hierarchy CACHE MEMORY AND VIRTUAL MEMORY Chapter Seven CACHE MEMORY AND VIRTUAL MEMORY 1 Memories: Review SRAM: value is stored on a pair of inverting gates very fast but takes up more space than DRAM (4 to 6 transistors) DRAM: value is stored

More information

Page 1. Memory Hierarchies (Part 2)

Page 1. Memory Hierarchies (Part 2) Memory Hierarchies (Part ) Outline of Lectures on Memory Systems Memory Hierarchies Cache Memory 3 Virtual Memory 4 The future Increasing distance from the processor in access time Review: The Memory Hierarchy

More information

Chapter 5 (Part II) Large and Fast: Exploiting Memory Hierarchy. Baback Izadi Division of Engineering Programs

Chapter 5 (Part II) Large and Fast: Exploiting Memory Hierarchy. Baback Izadi Division of Engineering Programs Chapter 5 (Part II) Baback Izadi Division of Engineering Programs bai@engr.newpaltz.edu Virtual Machines Host computer emulates guest operating system and machine resources Improved isolation of multiple

More information

Lecture-18 (Cache Optimizations) CS422-Spring

Lecture-18 (Cache Optimizations) CS422-Spring Lecture-18 (Cache Optimizations) CS422-Spring 2018 Biswa@CSE-IITK Compiler Optimizations Loop interchange Merging Loop fusion Blocking Refer H&P: You need it for PA3 and PA4 too. CS422: Spring 2018 Biswabandan

More information

Advanced cache optimizations. ECE 154B Dmitri Strukov

Advanced cache optimizations. ECE 154B Dmitri Strukov Advanced cache optimizations ECE 154B Dmitri Strukov Advanced Cache Optimization 1) Way prediction 2) Victim cache 3) Critical word first and early restart 4) Merging write buffer 5) Nonblocking cache

More information

Memory Hierarchy Computing Systems & Performance MSc Informatics Eng. Memory Hierarchy (most slides are borrowed)

Memory Hierarchy Computing Systems & Performance MSc Informatics Eng. Memory Hierarchy (most slides are borrowed) Computing Systems & Performance Memory Hierarchy MSc Informatics Eng. 2011/12 A.J.Proença Memory Hierarchy (most slides are borrowed) AJProença, Computer Systems & Performance, MEI, UMinho, 2011/12 1 2

More information

LECTURE 4: LARGE AND FAST: EXPLOITING MEMORY HIERARCHY

LECTURE 4: LARGE AND FAST: EXPLOITING MEMORY HIERARCHY LECTURE 4: LARGE AND FAST: EXPLOITING MEMORY HIERARCHY Abridged version of Patterson & Hennessy (2013):Ch.5 Principle of Locality Programs access a small proportion of their address space at any time Temporal

More information

Memory Hierarchy Computing Systems & Performance MSc Informatics Eng. Memory Hierarchy (most slides are borrowed)

Memory Hierarchy Computing Systems & Performance MSc Informatics Eng. Memory Hierarchy (most slides are borrowed) Computing Systems & Performance Memory Hierarchy MSc Informatics Eng. 2012/13 A.J.Proença Memory Hierarchy (most slides are borrowed) AJProença, Computer Systems & Performance, MEI, UMinho, 2012/13 1 2

More information

Overview. TCP & router queuing Computer Networking. TCP details. Workloads. TCP Performance. TCP Performance. Lecture 10 TCP & Routers

Overview. TCP & router queuing Computer Networking. TCP details. Workloads. TCP Performance. TCP Performance. Lecture 10 TCP & Routers Overview 15-441 Computer Networking TCP & router queuing Lecture 10 TCP & Routers TCP details Workloads Lecture 10: 09-30-2002 2 TCP Performance TCP Performance Can TCP saturate a link? Congestion control

More information

EE382 Processor Design. Processor Issues for MP

EE382 Processor Design. Processor Issues for MP EE382 Processor Design Winter 1998 Chapter 8 Lectures Multiprocessors, Part I EE 382 Processor Design Winter 98/99 Michael Flynn 1 Processor Issues for MP Initialization Interrupts Virtual Memory TLB Coherency

More information

The Nios II Family of Configurable Soft-core Processors

The Nios II Family of Configurable Soft-core Processors The Nios II Family of Configurable Soft-core Processors James Ball August 16, 2005 2005 Altera Corporation Agenda Nios II Introduction Configuring your CPU FPGA vs. ASIC CPU Design Instruction Set Architecture

More information

Computer Architecture A Quantitative Approach, Fifth Edition. Chapter 2. Memory Hierarchy Design. Copyright 2012, Elsevier Inc. All rights reserved.

Computer Architecture A Quantitative Approach, Fifth Edition. Chapter 2. Memory Hierarchy Design. Copyright 2012, Elsevier Inc. All rights reserved. Computer Architecture A Quantitative Approach, Fifth Edition Chapter 2 Memory Hierarchy Design 1 Introduction Programmers want unlimited amounts of memory with low latency Fast memory technology is more

More information

Chapter 8: Virtual Memory. Operating System Concepts

Chapter 8: Virtual Memory. Operating System Concepts Chapter 8: Virtual Memory Silberschatz, Galvin and Gagne 2009 Chapter 8: Virtual Memory Background Demand Paging Copy-on-Write Page Replacement Allocation of Frames Thrashing Memory-Mapped Files Allocating

More information

Department of Computer Science, Institute for System Architecture, Operating Systems Group. Real-Time Systems '08 / '09. Hardware.

Department of Computer Science, Institute for System Architecture, Operating Systems Group. Real-Time Systems '08 / '09. Hardware. Department of Computer Science, Institute for System Architecture, Operating Systems Group Real-Time Systems '08 / '09 Hardware Marcus Völp Outlook Hardware is Source of Unpredictability Caches Pipeline

More information

Computer Architecture Computer Science & Engineering. Chapter 5. Memory Hierachy BK TP.HCM

Computer Architecture Computer Science & Engineering. Chapter 5. Memory Hierachy BK TP.HCM Computer Architecture Computer Science & Engineering Chapter 5 Memory Hierachy Memory Technology Static RAM (SRAM) 0.5ns 2.5ns, $2000 $5000 per GB Dynamic RAM (DRAM) 50ns 70ns, $20 $75 per GB Magnetic

More information

Memory management. Last modified: Adaptation of Silberschatz, Galvin, Gagne slides for the textbook Applied Operating Systems Concepts

Memory management. Last modified: Adaptation of Silberschatz, Galvin, Gagne slides for the textbook Applied Operating Systems Concepts Memory management Last modified: 26.04.2016 1 Contents Background Logical and physical address spaces; address binding Overlaying, swapping Contiguous Memory Allocation Segmentation Paging Structure of

More information

Cache Coherence (II) Instructor: Josep Torrellas CS533. Copyright Josep Torrellas

Cache Coherence (II) Instructor: Josep Torrellas CS533. Copyright Josep Torrellas Cache Coherence (II) Instructor: Josep Torrellas CS533 Copyright Josep Torrellas 2003 1 Sparse Directories Since total # of cache blocks in machine is much less than total # of memory blocks, most directory

More information

Intel Acceleration Stack for Intel Xeon CPU with FPGAs Version 1.2 Release Notes

Intel Acceleration Stack for Intel Xeon CPU with FPGAs Version 1.2 Release Notes Intel Acceleration Stack for Intel Xeon CPU with FPGAs Version 1.2 Updated for Intel Acceleration Stack for Intel Xeon CPU with FPGAs: 1.2 Subscribe Latest document on the web: PDF HTML Contents Contents

More information

L9: Storage Manager Physical Data Organization

L9: Storage Manager Physical Data Organization L9: Storage Manager Physical Data Organization Disks and files Record and file organization Indexing Tree-based index: B+-tree Hash-based index c.f. Fig 1.3 in [RG] and Fig 2.3 in [EN] Functional Components

More information

and data combined) is equal to 7% of the number of instructions. Miss Rate with Second- Level Cache, Direct- Mapped Speed

and data combined) is equal to 7% of the number of instructions. Miss Rate with Second- Level Cache, Direct- Mapped Speed 5.3 By convention, a cache is named according to the amount of data it contains (i.e., a 4 KiB cache can hold 4 KiB of data); however, caches also require SRAM to store metadata such as tags and valid

More information

Memory Hierarchy. Slides contents from:

Memory Hierarchy. Slides contents from: Memory Hierarchy Slides contents from: Hennessy & Patterson, 5ed Appendix B and Chapter 2 David Wentzlaff, ELE 475 Computer Architecture MJT, High Performance Computing, NPTEL Memory Performance Gap Memory

More information

Donn Morrison Department of Computer Science. TDT4255 Memory hierarchies

Donn Morrison Department of Computer Science. TDT4255 Memory hierarchies TDT4255 Lecture 10: Memory hierarchies Donn Morrison Department of Computer Science 2 Outline Chapter 5 - Memory hierarchies (5.1-5.5) Temporal and spacial locality Hits and misses Direct-mapped, set associative,

More information

Lecture: DRAM Main Memory. Topics: virtual memory wrap-up, DRAM intro and basics (Section 2.3)

Lecture: DRAM Main Memory. Topics: virtual memory wrap-up, DRAM intro and basics (Section 2.3) Lecture: DRAM Main Memory Topics: virtual memory wrap-up, DRAM intro and basics (Section 2.3) 1 TLB and Cache 2 Virtually Indexed Caches 24-bit virtual address, 4KB page size 12 bits offset and 12 bits

More information

Lecture 7: Implementing Cache Coherence. Topics: implementation details

Lecture 7: Implementing Cache Coherence. Topics: implementation details Lecture 7: Implementing Cache Coherence Topics: implementation details 1 Implementing Coherence Protocols Correctness and performance are not the only metrics Deadlock: a cycle of resource dependencies,

More information

Portland State University ECE 588/688. Cray-1 and Cray T3E

Portland State University ECE 588/688. Cray-1 and Cray T3E Portland State University ECE 588/688 Cray-1 and Cray T3E Copyright by Alaa Alameldeen 2014 Cray-1 A successful Vector processor from the 1970s Vector instructions are examples of SIMD Contains vector

More information

COSC 6385 Computer Architecture. - Memory Hierarchies (II)

COSC 6385 Computer Architecture. - Memory Hierarchies (II) COSC 6385 Computer Architecture - Memory Hierarchies (II) Fall 2008 Cache Performance Avg. memory access time = Hit time + Miss rate x Miss penalty with Hit time: time to access a data item which is available

More information

Chapter 5. Large and Fast: Exploiting Memory Hierarchy

Chapter 5. Large and Fast: Exploiting Memory Hierarchy Chapter 5 Large and Fast: Exploiting Memory Hierarchy Memory Technology Static RAM (SRAM) 0.5ns 2.5ns, $2000 $5000 per GB Dynamic RAM (DRAM) 50ns 70ns, $20 $75 per GB Magnetic disk 5ms 20ms, $0.20 $2 per

More information

CSE 4/521 Introduction to Operating Systems. Lecture 14 Main Memory III (Paging, Structure of Page Table) Summer 2018

CSE 4/521 Introduction to Operating Systems. Lecture 14 Main Memory III (Paging, Structure of Page Table) Summer 2018 CSE 4/521 Introduction to Operating Systems Lecture 14 Main Memory III (Paging, Structure of Page Table) Summer 2018 Overview Objective: To discuss how paging works in contemporary computer systems. Paging

More information

Chapter 12. File Management

Chapter 12. File Management Operating System Chapter 12. File Management Lynn Choi School of Electrical Engineering Files In most applications, files are key elements For most systems except some real-time systems, files are used

More information

Chapter 8 Main Memory

Chapter 8 Main Memory COP 4610: Introduction to Operating Systems (Spring 2014) Chapter 8 Main Memory Zhi Wang Florida State University Contents Background Swapping Contiguous memory allocation Paging Segmentation OS examples

More information

The levels of a memory hierarchy. Main. Memory. 500 By 1MB 4GB 500GB 0.25 ns 1ns 20ns 5ms

The levels of a memory hierarchy. Main. Memory. 500 By 1MB 4GB 500GB 0.25 ns 1ns 20ns 5ms The levels of a memory hierarchy CPU registers C A C H E Memory bus Main Memory I/O bus External memory 500 By 1MB 4GB 500GB 0.25 ns 1ns 20ns 5ms 1 1 Some useful definitions When the CPU finds a requested

More information

Introduction to cache memories

Introduction to cache memories Course on: Advanced Computer Architectures Introduction to cache memories Prof. Cristina Silvano Politecnico di Milano email: cristina.silvano@polimi.it 1 Summary Summary Main goal Spatial and temporal

More information

Chapter 8: Memory-Management Strategies

Chapter 8: Memory-Management Strategies Chapter 8: Memory-Management Strategies Chapter 8: Memory Management Strategies Background Swapping Contiguous Memory Allocation Segmentation Paging Structure of the Page Table Example: The Intel 32 and

More information

Memory systems. Memory technology. Memory technology Memory hierarchy Virtual memory

Memory systems. Memory technology. Memory technology Memory hierarchy Virtual memory Memory systems Memory technology Memory hierarchy Virtual memory Memory technology DRAM Dynamic Random Access Memory bits are represented by an electric charge in a small capacitor charge leaks away, need

More information

Computer Systems Architecture I. CSE 560M Lecture 17 Guest Lecturer: Shakir James

Computer Systems Architecture I. CSE 560M Lecture 17 Guest Lecturer: Shakir James Computer Systems Architecture I CSE 560M Lecture 17 Guest Lecturer: Shakir James Plan for Today Announcements and Reminders Project demos in three weeks (Nov. 23 rd ) Questions Today s discussion: Improving

More information

Chapter 5. Large and Fast: Exploiting Memory Hierarchy

Chapter 5. Large and Fast: Exploiting Memory Hierarchy Chapter 5 Large and Fast: Exploiting Memory Hierarchy Static RAM (SRAM) Dynamic RAM (DRAM) 50ns 70ns, $20 $75 per GB Magnetic disk 0.5ns 2.5ns, $2000 $5000 per GB 5.1 Introduction Memory Technology 5ms

More information

CS3350B Computer Architecture

CS3350B Computer Architecture CS335B Computer Architecture Winter 25 Lecture 32: Exploiting Memory Hierarchy: How? Marc Moreno Maza wwwcsduwoca/courses/cs335b [Adapted from lectures on Computer Organization and Design, Patterson &

More information

Accelerator Functional Unit (AFU) Developer s Guide

Accelerator Functional Unit (AFU) Developer s Guide Accelerator Functional Unit (AFU) Developer s Guide Updated for Intel Acceleration Stack for Intel Xeon CPU with FPGAs: 1.1 Production Subscribe Latest document on the web: PDF HTML Contents Contents 1.

More information

CS 152 Computer Architecture and Engineering. Lecture 7 - Memory Hierarchy-II

CS 152 Computer Architecture and Engineering. Lecture 7 - Memory Hierarchy-II CS 152 Computer Architecture and Engineering Lecture 7 - Memory Hierarchy-II Krste Asanovic Electrical Engineering and Computer Sciences University of California at Berkeley http://www.eecs.berkeley.edu/~krste

More information

Chapter 7: Main Memory. Operating System Concepts Essentials 8 th Edition

Chapter 7: Main Memory. Operating System Concepts Essentials 8 th Edition Chapter 7: Main Memory Operating System Concepts Essentials 8 th Edition Silberschatz, Galvin and Gagne 2011 Chapter 7: Memory Management Background Swapping Contiguous Memory Allocation Paging Structure

More information

COS 318: Operating Systems. NSF, Snapshot, Dedup and Review

COS 318: Operating Systems. NSF, Snapshot, Dedup and Review COS 318: Operating Systems NSF, Snapshot, Dedup and Review Topics! NFS! Case Study: NetApp File System! Deduplication storage system! Course review 2 Network File System! Sun introduced NFS v2 in early

More information

Cache Memory COE 403. Computer Architecture Prof. Muhamed Mudawar. Computer Engineering Department King Fahd University of Petroleum and Minerals

Cache Memory COE 403. Computer Architecture Prof. Muhamed Mudawar. Computer Engineering Department King Fahd University of Petroleum and Minerals Cache Memory COE 403 Computer Architecture Prof. Muhamed Mudawar Computer Engineering Department King Fahd University of Petroleum and Minerals Presentation Outline The Need for Cache Memory The Basics

More information

3Introduction. Memory Hierarchy. Chapter 2. Memory Hierarchy Design. Computer Architecture A Quantitative Approach, Fifth Edition

3Introduction. Memory Hierarchy. Chapter 2. Memory Hierarchy Design. Computer Architecture A Quantitative Approach, Fifth Edition Computer Architecture A Quantitative Approach, Fifth Edition Chapter 2 Memory Hierarchy Design 1 Introduction Programmers want unlimited amounts of memory with low latency Fast memory technology is more

More information

Effect of memory latency

Effect of memory latency CACHE AWARENESS Effect of memory latency Consider a processor operating at 1 GHz (1 ns clock) connected to a DRAM with a latency of 100 ns. Assume that the processor has two ALU units and it is capable

More information

IBM PSSC Montpellier Customer Center. Blue Gene/P ASIC IBM Corporation

IBM PSSC Montpellier Customer Center. Blue Gene/P ASIC IBM Corporation Blue Gene/P ASIC Memory Overview/Considerations No virtual Paging only the physical memory (2-4 GBytes/node) In C, C++, and Fortran, the malloc routine returns a NULL pointer when users request more memory

More information

Algorithm Performance Factors. Memory Performance of Algorithms. Processor-Memory Performance Gap. Moore s Law. Program Model of Memory II

Algorithm Performance Factors. Memory Performance of Algorithms. Processor-Memory Performance Gap. Moore s Law. Program Model of Memory II Memory Performance of Algorithms CSE 32 Data Structures Lecture Algorithm Performance Factors Algorithm choices (asymptotic running time) O(n 2 ) or O(n log n) Data structure choices List or Arrays Language

More information

EEC 170 Computer Architecture Fall Improving Cache Performance. Administrative. Review: The Memory Hierarchy. Review: Principle of Locality

EEC 170 Computer Architecture Fall Improving Cache Performance. Administrative. Review: The Memory Hierarchy. Review: Principle of Locality Administrative EEC 7 Computer Architecture Fall 5 Improving Cache Performance Problem #6 is posted Last set of homework You should be able to answer each of them in -5 min Quiz on Wednesday (/7) Chapter

More information