IBM PSSC Montpellier Customer Center. Blue Gene/P ASIC IBM Corporation


1 Blue Gene/P ASIC

2 Memory Overview / Considerations
No virtual paging: only the physical memory (2-4 GB per node) is available.
In C, C++, and Fortran, malloc() returns a NULL pointer when a request exceeds the physical memory that is available.
We recommend you always check malloc() return values for validity, as in the sketch below.
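A minimal C sketch of the recommended check (the request size is purely illustrative):

    /* Always verify that malloc() succeeded before using the buffer,
       since CNK has no paging to fall back on. Size is illustrative. */
    #include <stdio.h>
    #include <stdlib.h>

    int main(void)
    {
        size_t n = 64UL * 1024 * 1024;            /* 64 Mi doubles = 512 MB, for example */
        double *buf = malloc(n * sizeof(double));
        if (buf == NULL) {                        /* request exceeded available physical memory */
            fprintf(stderr, "malloc of %zu bytes failed\n", n * sizeof(double));
            return EXIT_FAILURE;
        }
        /* ... use buf ... */
        free(buf);
        return EXIT_SUCCESS;
    }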

3 Memory Overview / Considerations
Be careful when prefetching data into the L1 cache: the processor can only fill three L1 cache lines concurrently, so it is mandatory to keep the number of prefetch streams at three or fewer.
Applications tuned for POWER4-POWER7, whose hardware supports more than 4 prefetch streams, will choke the BG/P memory subsystem.
To take advantage of the single-instruction, multiple-data (SIMD) instructions, it is essential to keep data in the L1 cache as much as possible: optimization should be based on the 32 KB L1 data cache, and the benefits of SIMD can be cancelled out if the data does not fit in L1 (see the blocking sketch below).
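One common way to keep a working set inside the 32 KB L1 data cache is loop blocking (tiling); a minimal, illustrative C sketch, with the tile size chosen only as an example:

    /* Illustrative loop blocking: with BS = 32, three 32x32 double tiles
       occupy 24 KB, which fits in the 32 KB L1 data cache (example value,
       not a tuned figure). c is accumulated into and assumed pre-initialized. */
    #define BS 32

    void dgemm_blocked(int n, const double *a, const double *b, double *c)
    {
        for (int ii = 0; ii < n; ii += BS)
            for (int kk = 0; kk < n; kk += BS)
                for (int jj = 0; jj < n; jj += BS)
                    for (int i = ii; i < ii + BS && i < n; i++)
                        for (int k = kk; k < kk + BS && k < n; k++)
                            for (int j = jj; j < jj + BS && j < n; j++)
                                c[i*n + j] += a[i*n + k] * b[k*n + j];
    }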

4 L1 Cache
Architecture
32 KB I-cache + 32 KB D-cache per core
32-byte lines, 64-way set-associative, 16 sets
Round-robin replacement, write-through
2 buses towards the L2, 128 bits wide
3-line fetch buffer
Performance
L1 load hit: 8 B/cycle, 4-cycle latency (floating point)
L1 load miss, L2 hit: 4.6 B/cycle, 12-cycle latency
Store: write-through, limited by external logic to about one request every 2.9 cycles (about 5.5 B/cycle peak)
Important
Avoid excessive prefetching into L1; keep the number of prefetch streams at three or fewer.
The programmer can use the XL compiler directives or the assembler instruction dcbt (see the sketch below).
Without intensive reuse of the L1 cache and registers, the memory subsystem cannot feed the double FPU.
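A minimal sketch of explicit software prefetch, assuming the IBM XL C __dcbt() built-in is available (GCC users would reach for __builtin_prefetch() instead); the prefetch distance is purely illustrative:

    /* Touch a single stream a few 32-byte L1 lines ahead of the current index.
       The distance (4 lines here) is an example value, not a tuned one. */
    void scale(int n, double *x, double s)
    {
        const int ahead = 4 * 32 / sizeof(double);   /* 4 L1 lines ahead */
        for (int j = 0; j < n; j++) {
            if (j + ahead < n)
                __dcbt(&x[j + ahead]);               /* prefetch the line in advance */
            x[j] *= s;
        }
    }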

5 L2 Cache
3 independent PPC ports: instruction read, data read (L2 prefetch), data write
L3 arbiter switch function: 1 read and 1 write switched to each L3 every 425 MHz cycle
The L2 boosts the overall performance and does not require any special software provisions.
[Block diagram: each PPC450 core's L1 instruction and data caches feed its L2 instruction-read, data-read and data-write ports; the L3 arbiter switches them towards L3 0 and L3 1 at 16 B @ 850 MHz]

6 L2 Data Prefetch
Prefetch engine
128-byte line prefetch
15-entry, fully associative prefetch buffer
1- or 2-line-deep sequential prefetching
Supports up to 7 streams
Supports strides of less than 128 B
4.6 B/cycle, 12-cycle latency
[Block diagram: PPC450 data and address paths into the L2 data read / prefetch unit, with 128-byte line buffers, 15 stream engines, a stream detector with miss history, data from other units, and the L3 data/address interface]

7 L3 Cache
Per port: 8 read requests and 4 write requests
4 eDRAM banks per chip, each with an independent directory
15-entry, 128-byte-wide write combining buffer
Hit-under-miss resolution, limited by the request queues and the write buffer:
Up to 8 read misses per port
Up to 15 write misses per write combining buffer
Limitation: banking conflicts (a dedicated L3 per core can be configured, but this needs IBM Lab support)
[Block diagram: Core0/1, Core2/3 and the DMA feed L3 cache 0 and L3 cache 1, each built from eDRAM banks]

8 DRAM Controller
20 8-bit-wide DDR2 DRAM modules per controller
4 module-internal banks (4 x 512 MB)
Command reordering based on bank availability
Theoretical bandwidth: 128 bytes per 16 cycles
[Block diagram: the L3 cache eDRAM banks and miss handler connect to the DDR controller at 16 B @ 425 MHz]
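As a rough sanity check, assuming the 16 cycles above are 850 MHz processor cycles and the 425 MHz figure is the L3/DDR clock:

    128 B / 16 cycles = 8 B/cycle at 850 MHz, i.e. about 6.8 GB/s per controller
    16 B x 425 MHz    = 6.8 GB/s per controller, roughly 13.6 GB/s for the node's two controllers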

9 Performance
Latencies
L1: 4 cycles
L2: 12 cycles
L3: 50 cycles
DDR: 104 cycles
Bandwidth (peak)
Registers <-> L1 cache: 8 B/cycle
L1 <-> L2: 4.6 B/cycle
L2 <-> L3: 16 B/cycle per core
L3 <-> DDR: less than 16 B/cycle per chip

10 Bottlenecks
L2 <-> L3 switch
Not a full core-to-L3-bank crossbar: rate and bandwidth are limited if the two cores of one dual-processor group access the same L3 cache.
Banking
4 banks on the 512 Mb DDR modules
Burst-8 transfer (128 B): 16 cycles
Page open, access, precharge: 64 cycles
Peak bandwidth is only achievable when 3 other banks are accessed before the same bank is accessed again, since four 16-cycle bursts are needed to hide the 64-cycle open/access/precharge sequence.

11 Main Memory Banking Optimization Example
For sequential access, two arrays used in a single operation must not be aligned on the same bank.
Original code:
      parameter (n= ... )
      real(8) x(n), y(n), w(n)
      ...
      do j=1,n
         x(j) = x(j) + y(j)*w(j)
      enddo
BG memory tuning (offset the declarations so the streams fall in different banks):
      parameter (n= ... , offset=16)
      real(8) x(n+offset), y(n+2*offset), w(n)
      ...
      do j=1,n
         x(j) = x(j) + y(j)*w(j)
      enddo

12 CNK Memory Usage
Static TLB mappings
Only 64 mappings per core
Kernel mapping is protected; application read-only text and data are protected
Alignment restrictions apply
Static TLB benefits
No penalty for random data access patterns
Greatly simplifies DMA
An implementation of mmap() that used dynamically assigned physical memory would have to take page-translation faults in order to map virtual addresses to physical addresses. At this time, there are no plans to implement such a TLB miss handler.

13 CNK Memory Design (Memory leaks)
The design point of CNK is to have a statically partitioned physical memory.
Reasons
No memory translation exceptions: if a virtual mapping is not known by the processor, the hardware raises an exception and the operating system must install the mapping (if it exists) or fail with a segfault (if it does not). Processing this exception costs roughly 0.25-1 usec per translation, depending on how sophisticated the miss handler needs to be. For an application that performs a lot of pointer chasing or pseudo-random memory accesses, this would effectively add a 0.25-1 usec latency to every memory access.
Deterministic performance from run to run: no locking contention between processes.
Deterministic out-of-memory behaviour: for example, process 0 can use the same memory footprint from run to run because process 0 and processes 1-3 are not competing for the same memory allocations.
No OS noise: for example, if the nodes enter a barrier and node 0 takes a translation fault, its participation in the barrier is delayed by 1 usec; summed over the many nodes waiting in the barrier, the application loses milliseconds of useful work.

14 CNK Memory Usage
When an application stores data in memory, the data can be classified as follows:
data = initialized static and common variables
bss = uninitialized static and common variables
heap = dynamically allocated (e.g. allocatable) arrays
stack = automatic arrays and variables
You can use the Linux size command to get an idea of the static memory size of the program (example output below); however, size does not provide any information about runtime memory use.
The stack section starts from the top: at address 0x7FFFD31C in SMP Node Mode (approximately 2 GB), at 0x403FD31C in Dual Node Mode (approximately 1 GB), and at 0x209FD31C in Virtual Node Mode (approximately 512 MB).
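For reference, this is the kind of output size gives (the numbers below are purely illustrative, not measured on BG/P):

    $ size ./a.out
       text    data     bss     dec     hex filename
     123456    7890   45678  177024   2b380 ./a.out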

15 CNK / Shared Memory Support
Shared memory is supported in Virtual Node and Dual mode.
Normal theme: do it the Linux way, through the standard shm_open() interface.
Allocate:
    fd = shm_open( SHM_FILE, O_RDWR, 0600 );
    ftruncate( fd, MAX_SHARED_SIZE );
    shmptr1 = mmap( NULL, MAX_SHARED_SIZE, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0 );
Deallocate:
    munmap( shmptr1, MAX_SHARED_SIZE );
    close( fd );
    shm_unlink( SHM_FILE );
A complete sketch of the sequence follows below.
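Put together as a minimal, self-contained C sketch; the region name and size are illustrative, and on a standard Linux system you would typically also pass O_CREAT and link with -lrt:

    /* Minimal illustrative shm_open()/mmap() sequence; names and sizes are examples. */
    #include <fcntl.h>
    #include <stdio.h>
    #include <sys/mman.h>
    #include <sys/stat.h>
    #include <unistd.h>

    #define SHM_FILE        "/my_shared_region"     /* example name */
    #define MAX_SHARED_SIZE (4 * 1024 * 1024)       /* example size: 4 MB */

    int main(void)
    {
        int fd = shm_open(SHM_FILE, O_CREAT | O_RDWR, 0600);
        if (fd < 0) { perror("shm_open"); return 1; }

        if (ftruncate(fd, MAX_SHARED_SIZE) != 0) { perror("ftruncate"); return 1; }

        void *shmptr = mmap(NULL, MAX_SHARED_SIZE, PROT_READ | PROT_WRITE,
                            MAP_SHARED, fd, 0);
        if (shmptr == MAP_FAILED) { perror("mmap"); return 1; }

        /* ... processes on the node can now share data through shmptr ... */

        munmap(shmptr, MAX_SHARED_SIZE);
        close(fd);
        shm_unlink(SHM_FILE);
        return 0;
    }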

16 CNK / Persistent Memory
Persistent memory is process memory that retains its contents from job to job.
To allocate persistent memory, the environment variable BG_PERSISTMEMSIZE=X must be specified, where X is the number of megabytes (units of 1024*1024 bytes) to be allocated for use as persistent memory.
For the persistent memory to be maintained across jobs, all job submissions must specify the same value for BG_PERSISTMEMSIZE.
The contents of persistent memory can be re-initialized during job startup either by changing the value of BG_PERSISTMEMSIZE or by specifying the environment variable BG_PERSISTMEMRESET=1.
The following new kernel function was added to support persistent memory: persist_open()
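As a simple sanity check from inside the application (persist_open() itself is CNK-specific and its exact signature is not reproduced here), the environment variable can be inspected with the standard getenv():

    /* Check that BG_PERSISTMEMSIZE was propagated to the job; a missing or
       changed value means persistent memory is not requested or is reset. */
    #include <stdio.h>
    #include <stdlib.h>

    int main(void)
    {
        const char *sz = getenv("BG_PERSISTMEMSIZE");
        if (sz == NULL)
            printf("BG_PERSISTMEMSIZE not set: no persistent memory requested\n");
        else
            printf("Persistent memory requested: %s MB\n", sz);
        return 0;
    }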
