Beyond One Core. Multi-Cores. Evolution in Frequency Has Stalled. Source: ISCA 2009 Keynote of Kathy Yelick

Maria Bishop
5 years ago

1 Beyond One Core: Multi-Cores
Evolution in Frequency Has Stalled (Source: ISCA 2009 Keynote of Kathy Yelick)

2 Multi-Cores
[Figure: cores with private and/or shared caches, connected to memory by a network or bus]
Interconnect: bus, or network (on chip = NoC)
Example: ARM Cortex A9 (Source: ARM)
Example: Intel Nehalem (Source: Intel)

3 Network on Chip (Source: Eyal Friedman, 2008)
Various topologies: mesh, torus, ...
Static or dynamic routing (wormhole)
Coherence
[Figure: a Load A request is sent over the network]

4 Coherence
[Figure sequence: Load A travels over the network; the reply carries A=3]

5 Coherence
[Figure: copies of A=3 held by other caches are marked Invalid]
Keep data coherent among caches
Coherence Protocol: MESI (Source: Mehmet Senvar)
  M (Modified): no other cache has the block in M, E or S state; value different from memory
  E (Exclusive): no other cache has the block in M, E or S state; value same as memory
  S (Shared): other caches share the block
  I (Invalid): cache block is invalid
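
The four MESI states can be illustrated with a small simulation of one memory block tracked across several caches. This is a minimal sketch, not the protocol of any particular processor: the class name `Block`, the methods `read` and `write`, and the simplification that write-backs and bus messages are not modelled are all assumptions made for illustration.

```python
from enum import Enum


class State(Enum):
    M = "Modified"
    E = "Exclusive"
    S = "Shared"
    I = "Invalid"


class Block:
    """MESI state of ONE memory block in each of n caches (hypothetical model)."""

    def __init__(self, n_caches):
        self.states = [State.I] * n_caches

    def read(self, cache):
        """Cache number `cache` loads the block."""
        if self.states[cache] is not State.I:
            return  # hit: M, E and S are all readable, no state change
        holders = [i for i, s in enumerate(self.states)
                   if i != cache and s is not State.I]
        for i in holders:
            # An M or E copy is downgraded to S (write-back not modelled).
            self.states[i] = State.S
        # Alone with the block: Exclusive. Otherwise: Shared.
        self.states[cache] = State.S if holders else State.E

    def write(self, cache):
        """Cache number `cache` stores to the block: all other copies are invalidated."""
        for i in range(len(self.states)):
            if i != cache:
                self.states[i] = State.I
        self.states[cache] = State.M


block = Block(3)
block.read(0)    # cache 0 is the only holder
block.read(1)    # caches 0 and 1 now share the block
block.write(1)   # cache 1 modifies it, cache 0 is invalidated
print([s.name for s in block.states])
```

The write step corresponds to the figure above: once one cache updates A, the other copies become Invalid.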

6 Parallelization
Decompose program into independent tasks
But tasks usually share data and synchronize
Pthreads Example (Source: Charles Leiserson)
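
The Pthreads example itself is not reproduced in this transcript. As a stand-in, the sketch below shows the same idea (tasks that share data and must synchronize) with Python's `threading` module rather than Pthreads; the function name `run_workers` and the shared counter are assumptions chosen for illustration.

```python
import threading


def run_workers(n_threads, n_increments):
    """Start n_threads tasks that all update one shared counter."""
    counter = 0
    lock = threading.Lock()

    def worker():
        nonlocal counter
        for _ in range(n_increments):
            # The lock makes the read-modify-write of the shared data atomic.
            with lock:
                counter += 1

    threads = [threading.Thread(target=worker) for _ in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()  # synchronize: wait until every task has finished
    return counter


print(run_workers(4, 1000))
```

Without the lock, the increments of different threads could interleave and some updates could be lost, which is the reason tasks that share data need synchronization.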

7 Hardware System & Operating System
Hardware system = processor + I/O devices
Operating system = software for:
  providing a hardware abstraction to the user
  managing all hardware resources
Main hardware resources:
  Processor: which process to execute? For how long?
  I/O (disk, network, keyboard, screen, ...): how to communicate with the processor?
  Memory: where to place data and programs for users? How to allocate memory to users?
Input/Output
[Figure: processor, keyboard controller and screen controller connected by a bus]
The device controller hides internal operations from the system
Main commands: read and write
Transfers by bytes or bursts
Special I/O registers and buffers
Registers and buffers are either memory-mapped or hardware elements

8 Drivers
[Figure: operating system (upper layer of O/S) above Controllers 1 to 4, each attached to one of Devices 1 to 4]
Hardware specifics of a device should have no impact on the O/S
The O/S is shielded from device specifics through drivers
Drivers provide an abstract view of a category of devices
Drivers are hardware-specific software components added to the O/S
The O/S provides a standard interface to the drivers of each category
Using an I/O Device
Permanent polling:
  1. user calls a specific polling routine
  2. I/O registers are read (or modified)
  3. processor checks whether data has arrived in the buffer (or has been used)
  4. new data is collected (or sent)
The processor must frequently probe the device
Processor and I/O devices work on different time scales
I/O Programming
Polling (input):
  KBDR: code of keystroke
  KBSR: memory-mapped input register
  KBSR[15] = 1 if keystroke and KBDR not yet read
  KBSR[15] = 0 if KBDR read
  Code (operands lost in extraction): START LDI / BRzp START / LDI / BR / xf401
Polling (output):
  CRTDR: ASCII code of character to display
  CRTSR: memory-mapped output register
  CRTSR[15] = 1 if character not yet used
  CRTSR[15] = 0 if character used
  Code (operands lost in extraction): START LDI / BRzp START / STI / BR / xf401
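
The input polling loop can be sketched in Python as a simulation of the two keyboard registers. The register names KBSR and KBDR and the use of bit 15 as the ready flag come from the slide; the `Keyboard` class, its `press` method and the `max_polls` bound are hypothetical helpers added so the example terminates and can be run.

```python
READY = 1 << 15  # bit 15 of the status register


class Keyboard:
    """Simulated keyboard controller with a status and a data register."""

    def __init__(self):
        self.kbsr = 0  # status register: bit 15 set when a key is waiting
        self.kbdr = 0  # data register: code of the last keystroke

    def press(self, ch):
        """Device side: a keystroke arrives."""
        self.kbdr = ord(ch)
        self.kbsr |= READY

    def read_data(self):
        """Processor side: reading KBDR clears KBSR[15]."""
        value = self.kbdr
        self.kbsr &= ~READY
        return value


def poll_char(kbd, max_polls=1000):
    """Busy-wait on KBSR[15]; return the character, or None if nothing arrived."""
    for _ in range(max_polls):
        if kbd.kbsr & READY:          # same test as the BRzp loop on the slide
            return chr(kbd.read_data())
    return None


kbd = Keyboard()
print(poll_char(kbd, max_polls=10))   # no key yet
kbd.press("a")
print(poll_char(kbd))
```

Every iteration of the loop that finds the bit clear is processor time spent waiting, which is why polling is costly when the device is much slower than the processor.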

9 Using an I/O Device
Interrupts:
  1. user calls a specific routine
  2. I/O registers are modified
  3. processor can execute other tasks
  4. when data arrives (read), the controller signals the processor
  5. processor interrupts its task and processes the data
I/O Programming
Interrupts:
  Processor must allow the interrupt (it can refuse)
  Interrupt controller sends the interrupt routine address
  Upon interrupt: context backup, PC set to the interrupt routine, context restored at the end
  Code (operands lost in extraction): START LDI / xf401
Managing Several I/O Devices
[Figure: several devices connected to an interrupt controller by interrupt lines; the controller sends an address to the processor]
Simultaneous interrupts are possible
Each device signals its interrupt request through a dedicated bus line
The interrupt controller manages interrupt priority
The interrupt controller signals authorization to devices
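
How an interrupt controller arbitrates between simultaneous requests can be sketched as a priority queue in front of a table of handler routines. This is an illustrative model only: the class `InterruptController`, the convention that a lower level number means higher priority, and the `enabled` flag standing for "processor allows interrupts" are assumptions, not details given on the slides.

```python
import heapq


class InterruptController:
    """Toy interrupt controller: queues requests, serves the most urgent first."""

    def __init__(self, vector_table):
        self.vector_table = vector_table  # interrupt level -> handler routine
        self.pending = []                 # min-heap of requested levels
        self.enabled = True               # the processor may refuse interrupts

    def request(self, level):
        """A device raises its interrupt line."""
        heapq.heappush(self.pending, level)

    def dispatch(self):
        """Serve one pending interrupt, if interrupts are allowed."""
        if not self.enabled or not self.pending:
            return None
        level = heapq.heappop(self.pending)  # assumed: lowest level = most urgent
        handler = self.vector_table[level]   # "interrupt routine address"
        return handler()


log = []
pic = InterruptController({
    1: lambda: log.append("keyboard"),
    5: lambda: log.append("disk"),
})
pic.request(5)
pic.request(1)      # arrives later but is more urgent
pic.dispatch()
pic.dispatch()
print(log)
```

Context backup and restore are not modelled here; in the sketch they would surround the call to `handler()`.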

10 Example: Plug and Play on a PC
Initially, each device (card) had an interrupt level (0 to 7) and a fixed I/O register address (e.g., keyboard: 0x60 to 0x64)
Possible conflicts among devices
Later, it became possible to change interrupt levels and register addresses
Plug and Play: devices have programmable interrupt levels and register addresses
  upon start-up, the system collects all devices and their desired interrupts
  it assigns interrupts to devices
Simplifying and Protecting Access to I/O Devices
TRAP trapvect
Similar to a subroutine call, but with special instructions: TRAP/RTI
Prevents the user from directly accessing I/O device registers
System call table: Index -> Subroutine Address (0x3000, 0x3100, 0x4F01, 0x2A0B, ...)
DMA
Direct Memory Access (DMA):
  1. processor sends the starting target address, the number of bytes and the device address to the DMA controller
  2. DMA controller starts the communication between device and memory
  3. once the transfer is completed, the processor is notified
  4. either a new block is transferred or the transfer stops
Processor is not required for I/O transfers
Example application: DVD viewing on a PC
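
The TRAP mechanism amounts to an indirect call through a table that only the system controls. The sketch below models that idea in Python; the class `Machine`, the vector numbers 0x21 and 0x25, and the two services are illustrative assumptions and do not correspond to the table entries shown on the slide.

```python
class Machine:
    """Toy machine where user code reaches devices only through trap()."""

    def __init__(self):
        self._screen = []        # device state, not touched directly by user code
        self.halted = False
        # System call table: trap vector -> service routine.
        self._trap_table = {
            0x21: self._put_char,
            0x25: self._halt,
        }

    def _put_char(self, arg):
        self._screen.append(chr(arg))
        return 0

    def _halt(self, arg):
        self.halted = True
        return 0

    def trap(self, trapvect, arg=0):
        """TRAP trapvect: look up the routine in the table and call it."""
        routine = self._trap_table.get(trapvect)
        if routine is None:
            raise ValueError(f"unknown trap vector {trapvect:#x}")
        return routine(arg)      # returning plays the role of RTI

    def screen_text(self):
        return "".join(self._screen)


m = Machine()
for ch in "ok":
    m.trap(0x21, ord(ch))
m.trap(0x25)
print(m.screen_text(), m.halted)
```

Because user code only supplies an index into the table, the system decides which routines exist and can check arguments before touching device registers.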

11 Putting It All Together: PC Organization
[Figure: processor and disk controller on a bus; I/O registers for the disk controller: disk address (sector), memory address, number of sectors, read/write; buffer]
Example: disk
[Figure: storage surface, read/write heads]
Disk divided into cylinders, tracks, sectors
Controller provides an abstract view of the disk to the O/S: N contiguous sectors
"Read k sectors starting at sector i and send to address A" triggers:
  i is converted into (cylinder, head, sector)
  head is moved to the track
  head waits for the sector to pass under it
  information is sent as a sequence of bytes
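
The first step, converting the linear sector number i into a (cylinder, head, sector) triple, is plain integer arithmetic once the disk geometry is known. The sketch below assumes a fixed geometry (same number of sectors on every track) and 0-based numbering for all three coordinates; the function name and the geometry values are hypothetical.

```python
def sector_to_chs(i, heads, sectors_per_track):
    """Convert linear sector number i to (cylinder, head, sector), all 0-based."""
    if i < 0:
        raise ValueError("sector number must be non-negative")
    sectors_per_cylinder = heads * sectors_per_track
    cylinder = i // sectors_per_cylinder
    head = (i // sectors_per_track) % heads
    sector = i % sectors_per_track
    return cylinder, head, sector


def chs_to_sector(cylinder, head, sector, heads, sectors_per_track):
    """Inverse mapping, used here to check the conversion."""
    return (cylinder * heads + head) * sectors_per_track + sector


# Assumed geometry: 4 heads, 8 sectors per track.
print(sector_to_chs(37, heads=4, sectors_per_track=8))
```

With this numbering, consecutive sector numbers first fill a track, then move to the next head of the same cylinder, and only then to the next cylinder, so reading k contiguous sectors needs few head movements.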

12 Example: Screen/Visualization
Contains a memory area (VRAM) accessed by the processor via the controller
1 pixel is associated with 1 to 24 bits (bitmap):
  1 bit: black and white
  24 bits: 3x8 bits for RGB (Red Green Blue)
  8/16 bits: index into a color palette
Example: displaying graphics
  no graphics accelerator: processor computes the image bitmap and sends it to the controller
  with graphics accelerator: program sends an abstract command (draw rectangle) through an API (e.g., DirectX); the graphics card converts the command into a bitmap and stores the image in VRAM
  RAMDAC (Random Access Memory Digital Analog Converter) transfers the image from VRAM to the screen
[Figure: Processor, Controller, Video RAM, Graphics card, Screen]
Increasing role of the graphics card:
  Initially: all in CPU
  Then: 2D in graphics processor, 3D in CPU
  Then: 3D in graphics processor
  Now: graphics processor can do general computations (GPGPU: General-Purpose Graphics Processing Unit)
GPU computational power > CPU
CPU manufacturers add graphics extensions, develop GPU-like architectures
PC architecture may revolve around GPU rather than CPU?
Intel Larrabee vs. NVIDIA Fermi
[Figure: graphics card example: GPU, Video RAM, Video BIOS (fonts, graphics primitives, ...), RAMDAC to screen]
A GPU Example
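
The pixel formats above translate directly into bit manipulation and memory sizes. The sketch below packs a 24-bit RGB pixel, resolves a palette index, and computes how much VRAM a bitmap needs. The channel order (red in the high byte), the function names and the tiny palette are assumptions for illustration.

```python
def pack_rgb(r, g, b):
    """Pack three 8-bit channels into one 24-bit pixel (assumed order: R, G, B)."""
    for channel in (r, g, b):
        if not 0 <= channel <= 255:
            raise ValueError("each channel must fit in 8 bits")
    return (r << 16) | (g << 8) | b


def unpack_rgb(pixel):
    """Split a 24-bit pixel back into its (r, g, b) channels."""
    return (pixel >> 16) & 0xFF, (pixel >> 8) & 0xFF, pixel & 0xFF


def framebuffer_bytes(width, height, bits_per_pixel):
    """VRAM needed for one full image, rounded up to whole bytes."""
    return (width * height * bits_per_pixel + 7) // 8


# Palette mode: the pixel stores a small index, the palette stores the color.
palette = [pack_rgb(0, 0, 0), pack_rgb(255, 255, 255), pack_rgb(255, 128, 0)]
pixel_index = 2
print(unpack_rgb(palette[pixel_index]))
print(framebuffer_bytes(640, 480, 24), framebuffer_bytes(640, 480, 1))
```

The size computation shows the trade-off behind palettes: an 8-bit index needs a third of the memory of a 24-bit pixel, at the cost of a limited number of simultaneous colors.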

13 NVIDIA Fermi
16 multiprocessors
Each multiprocessor contains 32 cores
512 cores in total
40nm, 3 billion transistors
Fermi Multiprocessor
Evolved from GPU shaders
Warp = bundle of 32 threads
Simple cores
Shared register file
Shared L/S (load/store) units
Shared L1 cache
SFUs: complex math
(Source: Eyal Friedman, 2008)
Multithreading in Fermi
Dual-issue, but pipelines are independent (no dependences)
48 warps per multiprocessor
48 x 32 = 1536 threads per multiprocessor
512 threads each cycle
No delay for thread switching
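
The thread counts follow from three numbers given on the slides: 16 multiprocessors, 32 cores per multiprocessor and 48 warps of 32 threads per multiprocessor. The short computation below spells out the arithmetic; the variable names are chosen here for illustration, and the chip-wide resident thread count is derived from those figures rather than quoted from the slides.

```python
MULTIPROCESSORS = 16
CORES_PER_MULTIPROCESSOR = 32
WARPS_PER_MULTIPROCESSOR = 48
THREADS_PER_WARP = 32

# One thread executes per core per cycle.
cores_total = MULTIPROCESSORS * CORES_PER_MULTIPROCESSOR
threads_per_cycle = cores_total

# Threads kept resident so that a stalled warp can be swapped out at no cost.
threads_per_multiprocessor = WARPS_PER_MULTIPROCESSOR * THREADS_PER_WARP
resident_threads_on_chip = MULTIPROCESSORS * threads_per_multiprocessor

print(cores_total, threads_per_multiprocessor, resident_threads_on_chip)
```

Each multiprocessor therefore holds 48 times more threads than it has cores, which is what lets it hide memory latency by switching warps instead of stalling.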

14 Memory Hierarchy
User-managed memory location
ECC (Error Correction Code) everywhere: registers, caches, memory
Configurable local memory: can be partitioned into shared memory or small L1 caches
[Figure: the host launches a grid of blocks; each block, e.g. Block (0, 0) and Block (1, 0), has its own shared memory; each thread, e.g. Thread (0, 0) and Thread (1, 0), has its own registers; all blocks access global memory and constant memory]
Programming Fermi
C/C++
Close to CPU programming
Compiled into an abstract representation (PTX)
Parallel & memory extensions
Memory
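
The grid/block/thread organization in the figure can be emulated sequentially to show how each thread finds the data element it is responsible for. This is a plain Python emulation, not GPU code: the `launch` helper, the one-dimensional grid and the `vec_add` kernel are assumptions made for illustration, and a real GPU would run the threads in parallel.

```python
def launch(kernel, grid_dim, block_dim, *args):
    """Run `kernel` once per (block, thread) pair, one after the other."""
    for block_idx in range(grid_dim):
        for thread_idx in range(block_dim):
            kernel(block_idx, block_dim, thread_idx, *args)


def vec_add(block_idx, block_dim, thread_idx, a, b, out):
    """Each thread handles the single element given by its global index."""
    i = block_idx * block_dim + thread_idx
    if i < len(out):          # the last block may have idle threads
        out[i] = a[i] + b[i]


a = [1, 2, 3, 4, 5]
b = [10, 20, 30, 40, 50]
out = [0] * len(a)
launch(vec_add, 2, 4, a, b, out)   # 2 blocks of 4 threads cover 5 elements
print(out)
```

The bounds check matters because the number of elements is rarely a multiple of the block size, so some threads of the last block have nothing to do.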

15 Simple DRAM Component
Few pins: address is multiplexed
Capacitors: periodic refresh (ms time scale)
SIMM (Single In-line Memory Module)
4 MB: 4 chips of 1 Mbit x 8; 32 bits per node
SDRAM
Page mode:
  Keep row address
  Select column address
  Faster access to same-row bits
  Output latch to store the row: faster column address change
SDRAM: CAS & RAS synchronized with CPU clock
Register indicates # bits to read
[Figure: Page mode, EDO, SDRAM]
Virtual Memory
Words: 32 or 64 bits. Address space: 4 GB or 16 exabytes (1 exabyte = 2^60 bytes).
Physical memory size << address space size
Virtual memory: give the illusion that physical memory size = address space size
Physical memory is a cache of virtual memory
Managed by the operating system
Memory Hierarchy
  Memory: access time in ns, 2^27 bytes
  Disk: access time in ms, 2^35 bytes

16 Memory Management Unit (MMU)
Processor uses virtual addresses
Data at virtual address A_v is in physical location A_p: convert A_v to A_p
Conversion done by the MMU
[Figure: Processor -> virtual address (32 bits) -> MMU -> physical address -> Memory]
Pages
Pages: 512 B to 64 KB blocks
Disk fetch: more efficient with blocks
Byte address in page: same for physical and virtual addresses
[Figure: 32-bit word, 4 KB page: (virtual page address, byte address in page) maps to (physical page address, byte address in page)]
Address Translation
Page-based translation
Page tables (in memory)
Page table: indexed by virtual address, hashed to physical address
[Figure: processor request (virtual address, MSB to LSB): the virtual page address goes through the page table, which returns the physical page address]
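
For the 32-bit address and 4 KB page case shown on the slide, translation splits the address into a 20-bit virtual page number and a 12-bit byte offset, replaces the page number through the page table, and keeps the offset unchanged. The sketch below uses a Python dictionary as the page table; the function name, the table contents and the use of `LookupError` to stand for a page fault are assumptions for illustration.

```python
PAGE_BITS = 12                      # 4 KB pages: 2**12 bytes
PAGE_SIZE = 1 << PAGE_BITS
OFFSET_MASK = PAGE_SIZE - 1


def translate(virtual_address, page_table):
    """Map a 32-bit virtual address to a physical address."""
    vpn = virtual_address >> PAGE_BITS        # virtual page address (MSBs)
    offset = virtual_address & OFFSET_MASK    # byte address in page (LSBs)
    if vpn not in page_table:
        # Page not in memory: the O/S would fetch it from disk.
        raise LookupError(f"page fault on virtual page {vpn:#x}")
    ppn = page_table[vpn]                     # physical page address
    return (ppn << PAGE_BITS) | offset        # offset is copied unchanged


# Hypothetical page table: virtual page 0x12345 lives in physical page 0x42.
page_table = {0x12345: 0x00042}
print(hex(translate(0x12345ABC, page_table)))
```

Only the page number goes through the table, so one table entry serves all 4096 bytes of a page.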

17 Page Table Entry
[Figure: entry fields: I/O, Modified, r w x (Protection), Physical page address, Used, In memory]
Physical/virtual address translation
But also:
  process protection (illegal reads/writes among processes)
  information for page replacement
  page availability (in memory/not in memory)
  information for I/Os
Page table entry size: 32 bits
Segmentation
Different and complementary approach
A "user" view
Segments correspond to user-known categories: stack, heap, program
Segments grow dynamically, can be limited
Segment pointer/offset
Can be combined with virtual memory (x86)
[Figure: stack segment and heap segment in the virtual address space of the Alpha]
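
A 32-bit page table entry holding the fields listed above can be sketched as a bit field. The slide names the fields but not their positions, so the layout below (flags in the low bits, a 20-bit physical page address in bits 12 to 31) and all constant and function names are assumptions for illustration.

```python
# Assumed flag positions inside the 32-bit entry.
IN_MEMORY = 1 << 0
USED      = 1 << 1   # set on access, helps page replacement
MODIFIED  = 1 << 2   # page must be written back before eviction
IO        = 1 << 3   # page is involved in an I/O transfer
READ      = 1 << 4
WRITE     = 1 << 5
EXECUTE   = 1 << 6
PPN_SHIFT = 12       # physical page address occupies bits 12..31


def make_pte(physical_page, flags):
    """Build a 32-bit entry from a 20-bit physical page address and flags."""
    if not 0 <= physical_page < (1 << 20):
        raise ValueError("physical page address must fit in 20 bits")
    return (physical_page << PPN_SHIFT) | flags


def check_access(pte, needed):
    """Return the physical page if the access is legal, else raise."""
    if not pte & IN_MEMORY:
        raise LookupError("page fault: page not in memory")
    if pte & needed != needed:
        raise PermissionError("protection violation")
    return pte >> PPN_SHIFT


pte = make_pte(0x42, IN_MEMORY | READ)
print(hex(check_access(pte, READ)))
```

A write to this page would raise `PermissionError`, which is how the same entry that performs translation also enforces protection between processes.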
