Commercially Available Chip Multiprocessors for Research. Welcome to the Multi-core Era


Bruce Childers, University of Pittsburgh
4/2/11
http:// AAO http:// http:// team.org http://

Welcome to the Multi-core Era
Chip multiprocessors are everywhere!
- Cellular phone
- Tablets
- Netbooks
- Laptops
- Desktops
- Servers


Qualcomm MSM8660
- Dual 1.5 GHz Scorpion
- GPU & cellular modem
- Up to 10 giga-operations per second
- App + media + radio operation
- Increasing by 10x every 5 years
- 1 W available (from total) for computing
- Battery power determines limits
- May be single to multiple chips
[Diagram: Processor containing Modem (CDMA), GPU (OpenGL), Core (ARM9), Core (ARM9)]

Intel Sandy Bridge NB
- 4 cores, 8 HW threads
- Integrated GPU, MC
- Powerful applications from consumer to science to business
- Single processor ("socket")
- Moving toward high integration
- Moving more toward heterogeneous
[Diagram: Processor containing four x86 cores, GPU, MC (DDR3)]

AMD Opteron
- cores, 6 MB L3
- 4 sockets, HyperTransport
- Range of services, e.g., cloud computing
- Virtualization for server consolidation
- Power consumption (effective utilization)
- Multiple cores per processor
- Multiple processors per machine (node)
- Multiple machines per cabinet
[Diagram: Processor ("Socket") containing x86 cores, INT (HT 3.1), MC (DDR3)]

Important Attributes
- Core
- Nearby caches (L1, L2)
- Core architecture
- Last-level cache (L3)
- Memory
- Power management
- Interconnection
- Graphics processing
- Uncore architecture
The uncore is what can matter for multi-core. It may also soon be the graphics processing capabilities.

Intel Processors
- NetBurst
- Core
- Nehalem: Nehalem (45nm), Westmere (32nm), Westmere E
- Sandy Bridge: Sandy Bridge (32nm), Ivy Bridge (22nm)
- Timeline: 2004, 2005 (first Intel dual core), 2006, 2008

- Nehalem: cores, HT, loosely integrated MC/GPU
- Westmere: 6 cores, HT, loosely integrated MC/GPU, VM
- Westmere E: server variant, 10 cores, 4 processors (QPI), 2011?
- Sandy Bridge: 6 cores, new uarch, closely integrated GPU & MC

Sandy Bridge
- Desktop, mobile & server variants
- Features:
  - Enhanced core microarchitecture
  - More closely coupled & integrated components
  - Hyper-threading with up to 8 cores (16 threads)
  - On-chip shared L3 cache
  - Turbo Boost power/speed management
- Later server versions will feature improved QuickPath Interconnect


Intel Sandy Bridge
[Diagram: cores 0 to 3, GPU, cache, North Bridge (PCIe x16, display), DMI link to South Bridge]
- 4 cores with L1, L2, L3 cache
- Hyper-threaded: 8 logical cores
- Advanced vector extensions (256-bit SIMD)
- Micro-architecture changes (improved branch predictor, changed register renaming for AVX, 2x load ports)

Intel Sandy Bridge: L1 instruction cache
- 32 KB L1 I-cache
- Decode 4 x86 instr/cycle
- Converted to u-ops
- 1.5K-entry (L0) u-op cache (just caches, not a trace cache)
- Gain is power


Intel Sandy Bridge: L1 data and L2 caches
- 32 KB L1 data cache
- 256 KB L2 cache (unified, private)

Intel Sandy Bridge: L3 cache
- 8 MB L3 cache (shared)
- Designed for high bandwidth
- Shared by cores + GPU
- 435 GB/sec at 3.4 GHz*
* Source: Sandy Bridge Spans Generations, Linley Gwennap, MPR, Sept. 2010

[Diagram: Sandy Bridge ring layout. System Agent (PCIe, display, memory controller); cores 0 to 3, each paired with a 2 MB L3 cache slice; Graphics Processing Unit]

Ring interconnect
- Composed of 4 rings:
  - 32-byte data
  - Request
  - Acknowledgement
  - Snooping
- Up to 26-31 clock traversal
- Distributed coherence
[Diagram: System Agent (PCIe, display), cores 0 to 3 with 2 MB L3 cache slices, Graphics Processing Unit]

Intel Sandy Bridge: graphics processing unit
- Integrated on chip
- More closely coupled with cores (via L3 cache)
- New FUs & video codec
Source: Sandy Bridge Spans Generations, Linley Gwennap, MPR, Sept. 2010
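As a sanity check on the 435 GB/sec L3 bandwidth quoted for a 3.4 GHz part, the sketch below multiplies the 32-byte data ring width by the four L3 slices and the clock rate. That each slice delivers one 32-byte transfer per clock is an assumption made here for illustration; the slides give only the ring width, the slice count, and the aggregate figure.

```python
# Back-of-the-envelope check of the aggregate L3 bandwidth.
# Assumption (not stated in the slides): each L3 slice can deliver one
# 32-byte ring transfer per core clock.

RING_BYTES_PER_CLOCK = 32   # width of the data ring, from the slides
L3_SLICES = 4               # one 2 MB slice per core, from the slides
CLOCK_HZ = 3.4e9            # 3.4 GHz, from the slides


def aggregate_l3_bandwidth_gb_per_s(bytes_per_clock, slices, clock_hz):
    """Return peak bandwidth in GB/s, taking 1 GB as 1e9 bytes."""
    return bytes_per_clock * slices * clock_hz / 1e9


peak = aggregate_l3_bandwidth_gb_per_s(RING_BYTES_PER_CLOCK, L3_SLICES, CLOCK_HZ)
print(f"{peak:.1f} GB/s")
```

Under that assumption the result lands at roughly 435 GB/s, in line with the quoted figure.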


Intel Sandy Bridge: uncore
- Uncore logic to connect to memory, display and I/O
- Dual-channel memory
Source: Sandy Bridge Spans Generations, Linley Gwennap, MPR, Sept. 2010

Intel Sandy Bridge: Platform Controller Hub (PCH)
- Connects to I/O devices
- E.g., SATA disk, USB, PCI Express, etc.
Source: Sandy Bridge Spans Generations, Linley Gwennap, MPR, Sept. 2010


Turbo Boost Power Management
- Thermal design point (TDP)
  - Maximum power dissipated
  - Baseline: consider impact of all cores
  - But not all cores are always active
- Change power allocation
  - Introduced in Nehalem
  - Shifts available budget (under TDP) to boost speed of cores based on workload

Feedback Loop
- Monitoring
- Adjust voltage, frequency
- Boost to remain under TDP
- May temporarily exceed TDP
[Diagram: core state (active, inactive), OS state change, temperature, power, and estimated current feed the Power Manager, which outputs the speed setting]


Feedback Loop: baseline frequency
- Cores are active/inactive
- Frequency with four cores
- OS state change trigger
- Speed setting: 200

Feedback Loop: inactive cores
- Cores moved to inactive (3/6)
- Leaves headroom in TDP
- Spend on other cores
- Power Manager action: boost cores

Feedback Loop: adjust speed upward
- Change in small steps (10 MHz)
- Up to maximum speed
- Stay under TDP
- Speed setting: 210, then 220
- Power Manager action: boost cores

Feedback Loop: adjust speed downward
- Move back under TDP
- Temporarily exceeds TDP because thermals change slowly
- Speed setting: 230, then 220
- Power Manager action: reduce cores
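The stepping behavior in the feedback-loop slides can be sketched as a small control loop. This is a toy model, not Intel's algorithm: the linear power estimate, the coefficient, the TDP value, and the function names are all hypothetical, chosen only so that the trace follows the 200, 210, 220, 230, 220 sequence of speed settings shown in the slides.

```python
# Toy model of the Turbo Boost feedback loop: step the speed setting up by
# 10 MHz while the estimated power is under TDP, and step back down once
# it is over. The power model and every constant here are hypothetical.

STEP_MHZ = 10


def estimated_power(active_cores, speed, coeff):
    """Hypothetical estimate: power grows linearly with cores and speed."""
    return active_cores * speed * coeff


def next_speed(speed, power, tdp, max_speed):
    """One decision of the power manager."""
    if power > tdp:
        return speed - STEP_MHZ      # move back under TDP
    if speed < max_speed:
        return speed + STEP_MHZ      # spend the remaining headroom
    return speed                     # already at maximum speed


def run(active_cores, start, tdp, max_speed, coeff, iterations):
    """Return the sequence of speed settings visited by the loop."""
    speed = start
    trace = [speed]
    for _ in range(iterations):
        power = estimated_power(active_cores, speed, coeff)
        speed = next_speed(speed, power, tdp, max_speed)
        trace.append(speed)
    return trace


trace = run(active_cores=3, start=200, tdp=100.0, max_speed=300,
            coeff=0.15, iterations=4)
print(trace)
```

With three active cores the loop overshoots the hypothetical TDP at a setting of 230 and steps back to 220, mirroring the temporary excursion above TDP described in the slides.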

Feedback Loop: Core i7 2920XM, 4 cores, base 2.5 GHz
- Active cores vs. max speed: 3.2 GHz, 3.3 GHz, 3.4 GHz, 3.5 GHz

AMD Processors
- Shanghai (2008): 4 cores (no HWT), 2.9 GHz, 45nm, 6 MB L3, DDR2
- Istanbul (2009): 6 cores, 2.8 GHz, 45nm, 6 MB L3, DDR2, HT Assist
- Magny-Cours (2010): 12 cores, 2.6 GHz, 45nm, 12 MB L3, DDR3, HT Assist
- Bulldozer (2011?): tightly coupled cores, separate sched & FUs (HWT-like), 16 cores, 32nm, 16 MB L3, 256-bit FPU
- Timeline labels: November 2008, January

AMD Opteron 6100
12-core x86 processor, Istanbul core architecture
- Per package (Multi-chip Module)
  - 12 Istanbul cores
  - 2 dies (nodes), 6 cores each, 45nm
- Per node
  - 6 MB shared L3
  - Memory controller
  - 2x memory channels
  - 4x HyperTransport links

AMD Opteron 6100: non-uniform memory access
12-core x86 processor, Istanbul core architecture
- Shared address space
- Physically distributed to nodes
- Transparently access any address
- Local address: faster access
- Remote address: slower (going across the interconnect)

HyperTransport Links
- HyperTransport: point-to-point interconnect (LVDS)
- Arranged as multiple links (e.g., x16 links)
- Up to 25.6 GB/second (x32 links)
- 4 x16 HT ports/processor allocated for within-package communication, cross-processor communication & I/O
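The 25.6 GB/second figure for an x32 link can be reproduced with simple arithmetic. The sketch below assumes a 3.2 GHz link clock with double-data-rate signaling (the HyperTransport 3.1 maximum) and counts bandwidth in one direction only; neither assumption comes from the slides.

```python
# Per-direction HyperTransport link bandwidth for different link widths.
# Assumptions (not from the slides): 3.2 GHz link clock, two transfers per
# clock (double data rate), bandwidth counted in one direction.

HT_CLOCK_HZ = 3.2e9
TRANSFERS_PER_CLOCK = 2


def link_bandwidth_gb_per_s(width_bits):
    """Return one-direction bandwidth in GB/s for a link of the given width."""
    transfers_per_second = HT_CLOCK_HZ * TRANSFERS_PER_CLOCK
    return transfers_per_second * width_bits / 8 / 1e9


for width in (8, 16, 32):
    print(f"x{width}: {link_bandwidth_gb_per_s(width):.1f} GB/s")
```

Under these assumptions an x16 link carries half the x32 figure and an x8 link a quarter of it, which is why the link width assigned to each pair of nodes matters for remote memory accesses.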


Interconnection (2 processors)
- 4 x16 HyperTransport links
- x16 adjacent off-package nodes
- x8 diagonal off-package nodes
- x16 + x8 on-package nodes
- x16 noncoherent I/O
[Diagram: nodes P0 to P3 joined by x16 and x8 coherent HyperTransport (cHT) links, with x16 noncoherent HyperTransport (ncHT) links to I/O]

Interconnection (4 processors)
- 4 x16 HyperTransport links
- x8 between off-package nodes
- x16 + x8 on-package nodes
- x16 noncoherent I/O
[Diagram: eight nodes (P) across four packages]

Coherence Traffic
- Explosion in coherence traffic
  - 4 processors, 48 cores!
- Coherence
  - Data may reside in multiple caches
  - Need to keep it consistent
  - Single writer, multiple readers
- Broadcast
  - Request which core has the most recent data
  - Clearly, doesn't scale well

HT Assist
- Home is the location where a memory address resides
- Data can be cached anywhere, though
- Need to find the location
- Reader: deliver potentially most recent copy
- Writer: get exclusive ownership to update data

HT Assist: broadcast example
1) P2 requests data from P0 (home)
2) P0 broadcasts for most recent copy
3) Wait for reply from each processor
4) Data forwarded from P1 to P2

HT Assist: directory
- Maintain directory of data location
- 1 MB of L3 dedicated to directory
- Reduces traffic (location known)

Without directory:
1) P2 requests data from P0 (home)
2) P0 broadcasts for most recent copy
3) Wait for reply from each processor
4) Data forwarded from P1 to P2

With directory:
1) P2 requests data from P0 (home)
2) P0 broadcasts for most recent copy
3) Data forwarded from P1 to P2
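To make the traffic difference concrete, the sketch below counts messages for a single read miss under a deliberately simplified accounting: one request to the home node, one probe and one reply per other processor when broadcasting, and one data transfer. The counting scheme and function names are illustrative assumptions, not AMD's actual protocol message set.

```python
# Simplified message counts for one read miss.
# Assumed accounting (hypothetical, for illustration only):
#   broadcast: request to home + probe to every other processor
#              + one reply per probe + data transfer to the requester
#   directory: request to home + one directed probe to the known owner
#              + data transfer to the requester


def broadcast_messages(num_processors):
    """Messages when the home node must ask every other processor."""
    others = num_processors - 1
    return 1 + others + others + 1


def directory_messages():
    """Messages when the home node already knows which node holds the data."""
    return 1 + 1 + 1


for n in (2, 4, 8):
    print(n, broadcast_messages(n), directory_messages())
```

In this model the broadcast cost grows linearly with the processor count while the directed case stays constant, and with only two processors the two approaches are nearly equal.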

HT Assist: trade-offs
- Only makes sense for >2 nodes
- Avoids most broadcasts
- Reduces L3 cache capacity

HT Assist: Where to Keep Directory?
- 6 MB L3 cache with 16 ways
- 64-byte line with 16 directory entries
- 4-byte directory entry (probe filter): Tag, State, Owner
- Same processor in 1P and 4P systems
- Reduce costs by reusing the L3 cache for directory
- 16 ways, 4 ways dedicated to directory
- Sparse directory structure
- Maintain coherence state
  - Modified (owner, dirty)
  - Owned (owner, with sharers)
  - Exclusive (one owner, consistent)
  - Shared (shared, clean/dirty)
  - Invalid (identified by lack of entry)
Source: Hot Chips
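The directory figures in the slides can be combined into a rough capacity estimate. The sketch below assumes that 1 MB means 2^20 bytes and that each 4-byte entry tracks exactly one 64-byte cache line; both are assumptions made for this illustration.

```python
# Rough sizing of the HT Assist directory (probe filter).
# Assumptions (not from the slides): 1 MB = 2**20 bytes, and each 4-byte
# entry tracks exactly one 64-byte cache line.

DIRECTORY_BYTES = 1 * 1024 * 1024   # L3 capacity reserved for the directory
ENTRY_BYTES = 4                     # one directory entry
LINE_BYTES = 64                     # cache line size

entries_per_line = LINE_BYTES // ENTRY_BYTES       # entries packed per L3 line
total_entries = DIRECTORY_BYTES // ENTRY_BYTES     # entries in the directory
tracked_bytes = total_entries * LINE_BYTES         # cached data it can cover

print(entries_per_line, total_entries, tracked_bytes // (1024 * 1024))
```

Sixteen entries per line matches the slide, and under these assumptions the directory can track far more cached data than the L3 capacity it gives up.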

AMD Opteron 6100

Model    Speed    Cores  ACP    TDP    Price
6180 SE  2.5 GHz  ?      ?      140 W  $1514 *
?        ?        12     80 W   115 W  $1265 *
?        ?        12     80 W   115 W  ?
?        ?        12     80 W   115 W  ?
?        ?        8      80 W   115 W  ?
?        ?        8      80 W   115 W  ?
? HE     1.8 GHz  ?      ?      85 W   ?
? HE     1.7 GHz  ?      ?      85 W   ?
? HE     2.2 GHz  8      65 W   85 W   ?
? HE     1.8 GHz  8      65 W   85 W   $455
(? = value missing in the source text)

- SE: optimized for performance
- HE: optimized for low power
- ACP: average CPU power (workload-derived power)
- All have 12 MB L3 (2x 6 MB), HT3, AMD-V
- Introduced March 29, 2010
- * Introduced February 14, 2011

What's available?
- AVA Direct Supermicro SuperServer, $5087: Quad AMD Opteron, 2.0 GHz (32 cores), 64 GB memory, 50GB SATA drive
- Dell PowerEdge R415, $2457: Dual AMD Opteron 4170HE (6), 2.1 GHz (12 cores), 16 GB memory, 25GB SATA drive
- Dell XPS 8300 (desktop), $1453: Intel Core i7 2600 (8 MB, 3.4 GHz), 16 GB memory, 1TB SATA drive

Summary
- Multi-core is certainly here!
- Significant research challenges
  - Platform infrastructure
  - Core architecture
  - Cache architecture
  - Interconnection
  - Power management
  - Integration and fusing of CPU+GPU
- Today's processors offer many of these capabilities for research!
