Computer Structure: The Uncore (2013)



2nd Generation Intel Core (Sandy Bridge)
- Integrates CPU cores, Graphics, Memory Controller, and PCI Express* on a single chip
- Next Generation Intel Turbo Boost Technology
- Intel Hyper-Threading Technology: 4 cores / 8 threads or 2 cores / 4 threads
- Intel Advanced Vector Extensions (Intel AVX)
- High-bandwidth Last Level Cache, with a high-BW / low-latency core/graphics interconnect; substantial performance improvement
- Next generation Graphics and Media; Display; Embedded DisplayPort
- Integrated Memory Controller: 2ch DDR3
- Discrete graphics support: 1x16 or 2x8 PCIe
- System Agent; PCH attached via DMI

Foil taken from IDF 2011

3rd Generation Intel Core (Ivy Bridge)
- 22nm process
- Quad-core die with Intel HD Graphics 4000
- 1.4 billion transistors
- Die size: 160 mm²

The Uncore Subsystem
- A high-bandwidth bi-directional ring bus connects the IA cores to the various uncore sub-systems
- The uncore subsystem includes:
  - The System Agent (SA)
  - The Graphics unit (GT)
  - The Last Level Cache (LLC)
- The Intel Xeon Processor E5 family has no graphics unit (GT); instead it contains many more components:
  - An LLC with larger capacity and snooping capabilities to support multiple processors
  - Intel QuickPath Interconnect (QPI) interfaces that can support multi-socket platforms
  - Power management control hardware
  - A system agent capable of supporting high-bandwidth traffic from memory and I/O devices

From the Optimization Manual

Scalable Ring On-die Interconnect
- Ring-based interconnect between the cores, Graphics, LLC, and the System Agent domain
- Composed of 4 rings: a 32-byte Data ring, a Request ring, an Acknowledge ring, and a Snoop ring
- Fully pipelined at core frequency/voltage: bandwidth, latency, and power scale with the number of cores
- Massive ring wire routing runs over the LLC with no area impact
- An access on the ring always picks the shortest path, to minimize latency
- Distributed arbitration; the ring protocol handles coherency, ordering, and the core interface
- Scalable to servers with a large number of processors

High bandwidth, low latency, modular

Foil taken from IDF 2011

Last Level Cache (LLC)
- The LLC consists of multiple cache slices; the number of slices equals the number of IA cores
  - Each slice contains a full cache port that can supply 32 bytes/cycle
- Each slice has a logic portion and a data-array portion
  - The logic portion handles data coherency, memory ordering, access to the data-array portion, LLC misses, and write-back to memory
  - The data-array portion stores the cache lines; it may have 4/8/12/16 ways, corresponding to 0.5MB/1MB/1.5MB/2MB slice size
- The GT sits on the same ring interconnect
  - It uses the LLC for its data operations as well, and may in some cases compete with the cores for it

From the Optimization Manual

Ring Interconnect and the LLC
- The physical addresses of data kept in the LLC are distributed among the cache slices by a hash function; addresses are uniformly distributed
- From the cores' and the GT's point of view, the LLC acts as one shared cache, with multiple ports and bandwidth that scales with the number of cores
  - Since the number of cache slices increases with the number of cores, the ring and LLC are not likely to be a bandwidth limiter to core operation
- The LLC hit latency, ranging between 26 and 31 cycles, depends on the core's location relative to the LLC slice (how far the request must travel on the ring)
- Traffic that cannot be satisfied by the LLC (misses, dirty-line write-backs, non-cacheable operations, and MMIO/IO operations) travels through the cache-slice logic portion and the ring to the system agent

From the Optimization Manual
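Intel does not document the actual slice-selection hash, but the idea of spreading physical addresses uniformly over slices can be sketched with a hypothetical XOR-folding hash over cache-line address bits (the function and fold width below are illustrative assumptions, not the real hardware hash):

```python
def llc_slice(phys_addr: int, n_slices: int = 4) -> int:
    """Map a physical address to an LLC slice.

    Hypothetical XOR-fold hash: the real Intel hash is undocumented.
    n_slices is assumed to be a power of two.
    """
    bits = n_slices.bit_length() - 1   # log2(n_slices)
    line = phys_addr >> 6              # drop the 64-byte line offset
    h = 0
    while line:
        h ^= line & (n_slices - 1)     # fold the next address chunk in
        line >>= bits
    return h
```

A stream of consecutive cache lines lands evenly on all slices, which is the property the slide describes: the slices together behave like one shared cache whose bandwidth scales with the core count.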

Cache Box
- Interface block between core/Graphics/Media and the ring, and between the cache controller and the ring
- Implements the ring logic, arbitration, and the cache controller
- Communicates with the System Agent for LLC misses, external snoops, and non-cacheable accesses
- Full cache pipeline in each cache box
  - Physical addresses are hashed at the source to prevent hot spots and increase bandwidth
  - Each cache box maintains coherency and ordering for the addresses that are mapped to it
- The LLC is fully inclusive; Core Valid Bits (CVB) eliminate unnecessary snoops to the cores
  - A per-core CVB indicates whether that core needs to be snooped for a given cache line
- Runs at core voltage/frequency; scales with the number of cores

Distributed coherency & ordering; scalable bandwidth, latency & power

Foil taken from IDF 2011

LLC Sharing
- The LLC is shared among all cores, Graphics, and Media
  - The graphics driver controls which streams are cached/coherent
  - Any agent can access all data in the LLC, independent of who allocated the line, after memory range checks
  - A controlled way-allocation mechanism prevents thrashing between the cores and graphics
- Multiple coherency domains:
  - IA domain (fully coherent via cross-snoops)
  - Graphics domain (graphics virtual caches, flushed to the IA domain by the graphics engine)
  - Non-coherent domain (display data, flushed to memory by the graphics engine)
- Result: much higher graphics performance, DRAM power savings, and more DRAM bandwidth available to the cores

Foil taken from IDF 2011

Cache Hierarchy

Level            Capacity  Ways    Line Size  Write Policy  Inclusive  Latency (cycles)  Bandwidth (bytes/cycle)
L1 Data          32KB      8       64B        Write-back    -          4                 2 x 16
L1 Instruction   32KB      8       64B        N/A           -          -                 -
L2 (Unified)     256KB     8       64B        Write-back    No         12                1 x 32
LLC              Varies    Varies  64B        Write-back    Yes        26-31             1 x 32

- The LLC is inclusive of all cache levels above it: data contained in the core caches must also reside in the LLC
  - Each LLC cache line holds an indication of the cores that may have this line in their L2 and L1 caches
- Fetching data from the LLC when another core has the data:
  - Clean hit (the data is not modified in the other core): 43 cycles
  - Dirty hit (the data is modified in the other core): 60 cycles

From the Optimization Manual
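The latencies in the table combine into an average load latency once hit rates are known. A minimal sketch, using the table's cycle counts (LLC taken as the 26-31 midpoint) and hypothetical hit rates and DRAM latency:

```python
def avg_load_latency(l1_hit, l2_hit, llc_hit, mem_ns, core_ghz,
                     l1_lat=4, l2_lat=12, llc_lat=28):
    """Average load latency in core cycles.

    Cache latencies follow the hierarchy table; the hit rates and
    the DRAM latency (mem_ns) are hypothetical inputs.
    """
    mem_lat = mem_ns * core_ghz  # DRAM latency converted to core cycles
    return (l1_hit * l1_lat
            + (1 - l1_hit) * l2_hit * l2_lat
            + (1 - l1_hit) * (1 - l2_hit) * llc_hit * llc_lat
            + (1 - l1_hit) * (1 - l2_hit) * (1 - llc_hit) * mem_lat)
```

For example, with 90% L1, 80% L2, and 90% LLC hit rates, 60ns DRAM latency, and a 3GHz core, the average works out to about 5.4 cycles: the rare trips to DRAM dominate the tail of the cost.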


Data Prefetch to the MLC and LLC
- Two hardware prefetchers fetch data from memory to the MLC and LLC
  - The streamer and the spatial prefetcher prefetch data into the LLC
  - Typically the data is also brought into the MLC, unless the MLC is heavily loaded with missing demand requests
- Spatial Prefetcher
  - Strives to complete every cache line fetched to the MLC with the pair line that completes it to a 128-byte aligned chunk
- Streamer Prefetcher
  - Monitors read requests from the L1 caches for ascending and descending sequences of addresses
    - L1 D$ requests: loads, stores, and L1 D$ hardware prefetcher requests
    - L1 I$ code fetch requests
  - When a forward or backward stream of requests is detected, the anticipated cache lines are prefetched
  - Prefetched cache lines must be in the same 4K page

From the Optimization Manual
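The spatial prefetcher's "pair line" has a simple arithmetic form: the two 64-byte lines of a 128-byte aligned chunk differ only in bit 6 of the address, so the companion line is found by flipping that bit (a sketch of the address arithmetic, not hardware code):

```python
LINE = 64  # bytes per cache line

def pair_line(addr: int) -> int:
    """Address of the companion 64B line that completes the
    128-byte aligned chunk containing addr."""
    line_addr = addr & ~(LINE - 1)   # align down to the 64B line
    return line_addr ^ LINE          # flip bit 6: the other half of the chunk
```

So fetching the line at 0x1000 would trigger a prefetch of 0x1040, and vice versa.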

Data Prefetch to the MLC and LLC: Streamer Prefetcher Enhancements
- The streamer may issue two prefetch requests on every MLC lookup, and may run up to 20 lines ahead of the load request
- It adjusts dynamically to the number of outstanding requests per core:
  - Few outstanding requests: prefetch further ahead
  - Many outstanding requests: prefetch into the LLC only, and less far ahead
- When cache lines are far ahead, prefetch into the LLC only and not into the MLC
  - This avoids replacement of useful cache lines in the MLC
- Detects and maintains up to 32 streams of data accesses
  - For each 4K-byte page, it can maintain one forward and one backward stream

From the Optimization Manual
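The stream-detection idea (ascending or descending line addresses within one 4KB page) can be sketched in a strongly simplified form; the detection threshold and the single-stream tracking below are illustrative assumptions, and the real hardware tracks up to 32 streams concurrently:

```python
def detect_stream(addrs, min_hits: int = 3):
    """Return +1 for a forward stream, -1 for backward, 0 for none.

    Simplified model: a stream is min_hits consecutive requests moving
    one 64B line in the same direction, all within one 4KB page.
    min_hits is a hypothetical threshold.
    """
    lines = [a >> 6 for a in addrs]          # 64B line addresses
    for direction in (+1, -1):
        hits = 0
        for prev, cur in zip(lines, lines[1:]):
            same_page = (prev >> 6) == (cur >> 6)  # 4KB page = 64 lines
            if same_page and cur - prev == direction:
                hits += 1
                if hits >= min_hits:
                    return direction
            else:
                hits = 0
    return 0
```

Once a direction is confirmed, the prefetcher starts requesting lines ahead of (or behind) the stream, stopping at the 4KB page boundary as the slide notes.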

The System Agent
- Contains the PCI Express, DMI, Memory Controller, and Display Engine
- Contains the Power Control Unit
  - A programmable microcontroller that handles all power management and reset functions in the chip
- Smart integration with the ring
  - Provides the cores/Graphics/Media with high-bandwidth, low-latency access to DRAM/IO for best performance
  - Handles IO-to-cache coherency
- Separate voltage and frequency from the ring/cores; display integration for better battery life
- Extensive power and thermal management for PCI Express* and DDR

Smart I/O integration

Foil taken from IDF 2011

System Agent Components
- PCIe controllers that connect to external PCIe devices
  - Support different lane configurations: 16+4, 8+8+4, 8+4+4+4
- DMI (Direct Media Interface) controller
  - Connects to the PCH (Platform Controller Hub)
- Integrated display engine
  - Handles delivering the pixels to the screen
- Flexible Display Interface (FDI)
  - Connects to the PCH, where the display connectors (HDMI, DVI) are attached
- DisplayPort (used for an integrated display), e.g., a laptop's LCD
- Integrated Memory Controller (IMC)
  - Connects to and controls the DRAM (2ch DDR3)
- An arbiter that handles accesses from the ring and from I/O (PCIe & DMI), and routes each access to the right place
  - Routes main memory traffic to the IMC

The Memory Controller
- The memory controller supports two channels of DDR3, with data rates of 1066MHz, 1333MHz, and 1600MHz
- Each channel has its own resources and handles memory requests independently
  - Each channel contains a 32 cache-line write-data buffer and supports 8 bytes per cycle
- A hash function distributes addresses between the channels
  - It attempts to balance the load between the channels, to achieve maximum bandwidth and minimum hotspot collisions
- For best performance:
  - Populate both channels with equal amounts of memory, preferably the exact same types of DIMMs
  - Use the highest supported speed DRAM, with the best DRAM timings

From the Optimization Manual
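The peak bandwidth implied by these numbers follows directly from the slide's parameters (8 bytes per transfer, two channels); a small helper to do the arithmetic:

```python
def peak_dram_bw_gb(data_rate_mt: float, channels: int = 2,
                    bytes_per_transfer: int = 8) -> float:
    """Peak DRAM bandwidth in GB/s.

    Each DDR3 channel moves 8 bytes per transfer; data_rate_mt is
    the channel data rate in MT/s (e.g. 1600 for DDR3-1600).
    """
    return data_rate_mt * bytes_per_transfer * channels / 1000
```

So two channels of DDR3-1600 peak at 25.6 GB/s, and two channels of DDR3-1066 at about 17.1 GB/s, which is why populating both channels equally matters for bandwidth.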

The Memory Controller (cont.)
- High-performance out-of-order scheduler
  - Attempts to maximize memory bandwidth while minimizing latency
- Writes to the memory controller are considered completed when they are written to the write-data buffer
  - The write-data buffer is flushed out to main memory at a later time, so it does not impact write latency
- Partial writes are not handled efficiently by the memory controller
  - They may result in read-modify-write operations on the DDR channel if the partial writes do not complete a full cache line in time
  - Software should avoid creating partial write transactions whenever possible, e.g., by buffering partial writes into full cache-line writes
- The IMC also supports high-priority isochronous requests, e.g., USB isochronous and display isochronous requests
  - The high bandwidth of memory requests from the integrated display engine takes up some of the memory bandwidth, and impacts core access latency to some degree

From the Optimization Manual

Integration: Optimization Opportunities
- Dynamically redistribute power between the cores & Graphics
- Tight power management control of all components, providing better granularity and deeper idle/sleep states
  - Three separate power/frequency domains: System Agent (fixed), cores + ring, Graphics (variable)
- High-bandwidth Last Level Cache, shared among the cores and Graphics
  - Significant performance boost; saves memory bandwidth and power
- Integrated Memory Controller and PCI Express ports
  - Tightly integrated with the core/Graphics/LLC domain
  - Provides low latency & low power by removing intermediate busses
  - Bandwidth is balanced across the whole machine, from the cores/Graphics all the way to the Memory Controller
- Modular uarch for optimal cost/power/performance
  - Derivative products done with minimal effort/time

Foil taken from IDF 2011

DRAM

Basic DRAM Chip
- A DRAM chip contains the memory array, a Row Address Latch with a row address decoder, a Column Address Latch with a column address decoder, and data in/out buffers
- DRAM access sequence:
  - Put the row address on the address bus and assert RAS# (Row Address Strobe) to latch the row
  - Put the column address on the address bus and assert CAS# (Column Address Strobe) to latch the column
  - Get the data on the data bus

DRAM Operation
- A DRAM cell consists of a transistor + a capacitor
  - The capacitor keeps the state; the transistor guards access to the state
- Reading the cell state: raise the access line AL and sense the data line DL
  - A charged capacitor causes current to flow on the data line DL
- Writing the cell state: set DL and raise AL to charge/drain the capacitor
- Charging and draining a capacitor is not instantaneous
- Leakage current drains the capacitor even when the transistor is closed
  - Therefore each DRAM cell must be periodically refreshed, every 64ms

DRAM Access Sequence Timing
- Put the row address on the address bus and assert RAS#
- Wait the RAS#-to-CAS# delay (tRCD) between asserting RAS# and CAS#
- Put the column address on the address bus and assert CAS#
- Wait the CAS latency (CL) between the time CAS# is asserted and the data being ready
- Row precharge time (tRP): the time to close the current row and open a new row
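The sequence above translates into a simple latency sum, depending on whether the required row is already open. A sketch of the arithmetic (the three row-state cases are the standard interpretation of tRP/tRCD/CL; the example timings in the test are hypothetical):

```python
def dram_access_ns(cycle_ns: float, cl: int, trcd: int, trp: int,
                   row_open: bool, row_hit: bool) -> float:
    """Access latency in ns from the three timing parameters.

    - Row hit (right row already open): pay only CL.
    - Open-row miss (wrong row open): pay tRP + tRCD + CL.
    - Closed row: pay tRCD + CL.
    All parameters are in bus cycles except cycle_ns.
    """
    cycles = cl                        # CAS latency is always paid
    if not row_open:
        cycles += trcd                 # open the row first
    elif not row_hit:
        cycles += trp + trcd           # close the current row, open the new one
    return cycles * cycle_ns
```

With 3-3-3 timings at a 5ns cycle, a row hit costs 15ns while an open-row miss costs 45ns, which is why keeping the right rows open (see the paged-mode and SDRAM bank slides below) matters so much.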

DRAM Controller
- The DRAM controller gets an address and a command
  - It splits the address into row and column parts (e.g., column on A[0:9], row on A[10:19], chip select decoded from A[20:23])
  - It generates the DRAM control signals (RAS#, CAS#, R/W#, chip select) at the proper timing, using a time delay generator
- DRAM data must be periodically refreshed
  - The DRAM controller performs the DRAM refresh, using a refresh counter
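The address-splitting step can be sketched as plain bit manipulation; the 10-bit column / 10-bit row widths follow the slide's A[0:9] / A[10:19] example (other devices use different widths):

```python
def split_address(addr: int, row_bits: int = 10, col_bits: int = 10):
    """Split a flat memory address into (row, column) as the DRAM
    controller multiplexes it onto the shared address pins.

    Widths follow the slide's example: column on A[0:9], row on A[10:19].
    """
    col = addr & ((1 << col_bits) - 1)
    row = (addr >> col_bits) & ((1 << row_bits) - 1)
    return row, col
```

The controller drives the row value with RAS# first, then the column value with CAS#, which is why the same pins can address a much larger array than their count suggests.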

Improved DRAM Schemes
- Paged Mode DRAM
  - Multiple accesses to different columns of the same row
  - Saves the RAS and the RAS-to-CAS delay
- Extended Data Output RAM (EDO RAM)
  - A data output latch enables presenting the next column address in parallel with the current column's data

Improved DRAM Schemes (cont.)
- Burst DRAM
  - Generates consecutive column addresses by itself

Synchronous DRAM (SDRAM)
- All signals are referenced to an external clock (100MHz-200MHz)
  - Makes timing more precise with other system devices
- 4 banks: multiple pages can be open simultaneously (one per bank)
- Command-driven functionality instead of signal-driven:
  - ACTIVE: selects both the bank and the row to be activated
    - An ACTIVE to a new bank can be issued while accessing the current bank
  - READ/WRITE: selects the column
- Burst-oriented read and write accesses
  - Successive column locations in the given row are accessed
  - The burst length is programmable: 1, 2, 4, 8, or full-page
    - A full-page burst may be ended by BURST TERMINATE to get an arbitrary burst length
- A user-programmable Mode Register holds the CAS latency, burst length, and burst type
- Auto precharge: may close the row after the last read/write in a burst
- Auto refresh: internal counters generate the refresh address
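The slide only says the burst type is programmable; the two orderings defined by JEDEC for SDRAM (sequential and interleaved) can be sketched as follows - a model of the column-address sequence the chip generates internally, not device code:

```python
def burst_order(start_col: int, burst_len: int, interleaved: bool):
    """Column addresses generated internally for one burst.

    Sequential: counts up, wrapping within the burst-length-aligned block.
    Interleaved: XORs the burst counter with the start address
    (the JEDEC interleaved ordering). burst_len is a power of two.
    """
    base = start_col & ~(burst_len - 1)   # aligned block containing start_col
    low = start_col & (burst_len - 1)
    if interleaved:
        return [base | (low ^ i) for i in range(burst_len)]
    return [base | ((low + i) % burst_len) for i in range(burst_len)]
```

For a burst of 4 starting at column 1, sequential ordering yields 1, 2, 3, 0 while interleaved yields 1, 0, 3, 2; interleaved ordering suits critical-word-first cache-line fills in some CPUs.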

SDRAM Timing
- Example timing diagram: ACTIVE and READ commands interleaved across Bank 0 and Bank 1, CL = 2, burst length 1
- tRCD: minimum gap between ACTIVE and a READ/WRITE to the same bank (tRCD > 20ns; in cycles: tRCD(MIN) / clock period)
- tRC: minimum gap between successive ACTIVE commands to different rows in the same bank (tRC > 70ns)
- tRRD: minimum gap between successive ACTIVE commands to different banks (tRRD > 20ns)

DDR-SDRAM: 2n-Prefetch Architecture
- The DRAM cells are clocked at the same speed as SDR SDRAM cells, but the internal data bus is twice the width of the external data bus
- Data capture occurs twice per clock cycle
  - The lower half of the bus is sampled at the clock rise, the upper half at the clock fall
  - e.g., a 200MHz clock yields 400M transfers/sec
- Uses 2.5V (vs. 3.3V in SDRAM)
  - Reduced power consumption

DDR SDRAM Timing
- Example timing diagram with a 133MHz clock and CL = 2: each READ returns a burst of 4 data transfers (two per clock, one on each edge)
- The same constraints apply as for SDRAM: tRCD > 20ns, tRRD > 20ns, tRC > 70ns

DIMMs
- DIMM: Dual In-line Memory Module
  - A small circuit board that holds memory chips
- 64-bit wide data path (72-bit with parity)
  - Single sided: 9 chips, each with an 8-bit data bus
  - Dual sided: 18 chips, each with a 4-bit data bus
- Data bandwidth: 64 bits transferred on each rising and falling edge of the clock
- Other pins: address (14), RAS, CAS, chip select (4), VDC (17), Gnd (18), clock (4), serial address (3), ...
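The DIMM's peak bandwidth follows from the two facts above: a 64-bit data path and a transfer on each clock edge. A one-line helper for the arithmetic:

```python
def dimm_bandwidth_mb(io_clock_mhz: float, bus_bits: int = 64) -> float:
    """Peak DIMM bandwidth in MB/s: 64-bit data path, data on both
    rising and falling clock edges (DDR)."""
    transfers = 2 * io_clock_mhz     # MT/s: one transfer per clock edge
    return transfers * bus_bits / 8  # bytes moved per transfer
```

A 200MHz DDR clock gives 400 MT/s x 8 bytes = 3200 MB/s, matching the PC-3200 module name in the table that follows.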

DDR Standards
- DRAM timing, measured in I/O bus cycles, is specified by 3 numbers: CAS Latency - RAS-to-CAS Delay - RAS Precharge Time (CL-tRCD-tRP)
- CAS latency (the latency to get data in an open page) in ns = CAS Latency x I/O bus cycle time
- Total bandwidth for DDR-400: 3200 MB/s = 64 bits x 2 x 200MHz / 8 (bits/byte); 6400 MB/s for dual-channel DDR SDRAM

Standard  Mem clock  I/O bus clock  Cycle time  Data rate  VDDQ  Module   Transfer rate
name      (MHz)      (MHz)          (ns)        (MT/s)     (V)   name     (MB/s)
DDR-200   100        100            10          200        2.5   PC-1600  1600
DDR-266   133⅓       133⅓           7.5         266⅔       2.5   PC-2100  2133⅓
DDR-333   166⅔       166⅔           6           333⅓       2.5   PC-2700  2666⅔
DDR-400   200        200            5           400        2.6   PC-3200  3200

- DDR-400 timings (CL-tRCD-tRP): 2.5-3-3, 3-3-3, or 3-4-4, giving a CAS latency of 12.5, 15, or 15 ns respectively
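The table's naming and latency columns both follow the formulas stated above; a small sketch that reproduces them:

```python
def ddr_module_name(data_rate_mt: float) -> str:
    """DDR module naming: PC-<peak MB/s>, i.e. data rate x 8 bytes
    (64-bit bus)."""
    return f"PC-{int(data_rate_mt * 8)}"

def cas_latency_ns(cl: float, io_clock_mhz: float) -> float:
    """CAS latency in ns = CL (in I/O bus cycles) x I/O bus cycle time."""
    return cl * 1000 / io_clock_mhz
```

For example, DDR-400 (400 MT/s) maps to PC-3200, and CL 2.5 at a 200MHz bus is 12.5ns: a lower CL number at the same clock means a genuinely faster open-page access.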

DDR2
- DDR2 doubles the bandwidth
  - 4n prefetch: internally reads/writes 4x the amount of data of the external bus
  - A DDR2-533 cell works at the same frequency as a DDR-266 cell or a PC133 cell
  - Prefetching increases latency
- Smaller page size: 1KB vs. 2KB
  - Reduces activation power: the ACTIVATE command reads all bits in the page
- 8 banks in 1Gb densities and above
  - Increases random-access capability
- 1.8V (vs. 2.5V) operation voltage
  - Significantly lower power

DDR2 Standards

Standard   Mem clock  Cycle time  I/O bus clock  Data rate  Module    Peak transfer  Timings              CAS Latency
name       (MHz)      (ns)        (MHz)          (MT/s)     name      rate (MB/s)    (CL-tRCD-tRP)        (ns)
DDR2-400   100        10          200            400        PC2-3200  3200           3-3-3, 4-4-4         15, 20
DDR2-533   133        7.5         266            533        PC2-4200  4266           3-3-3, 4-4-4         11.25, 15
DDR2-667   166        6           333            667        PC2-5300  5333           4-4-4, 5-5-5         12, 15
DDR2-800   200        5           400            800        PC2-6400  6400           4-4-4, 5-5-5, 6-6-6  10, 12.5, 15
DDR2-1066  266        3.75        533            1066       PC2-8500  8533           6-6-6, 7-7-7         11.25, 13.125

DDR3
- 30% power consumption reduction compared to DDR2
  - 1.5V supply voltage, compared to DDR2's 1.8V
  - 90 nanometer fabrication technology
- Higher bandwidth
  - 8-bit deep prefetch buffer (vs. 4-bit in DDR2 and 2-bit in DDR)
- Transfer data rate
  - Effective clock rate of 800-1600 MHz, using both rising and falling edges of a 400-800 MHz I/O clock
  - DDR2: 400-800 MHz using a 200-400 MHz I/O clock
  - DDR: 200-400 MHz based on a 100-200 MHz I/O clock
- DDR3 DIMMs
  - 240 pins, the same number as DDR2, and the same size
  - Electrically incompatible, with a different key notch location
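The three generations relate in a single formula: the external data rate is the memory-cell clock times the prefetch depth (2 for DDR, 4 for DDR2, 8 for DDR3), since the cell array runs at roughly the same speed in all three. A sketch of that relationship:

```python
# Prefetch depth per generation, from the slide.
PREFETCH = {"DDR": 2, "DDR2": 4, "DDR3": 8}

def data_rate_mt(mem_clock_mhz: float, gen: str) -> float:
    """External data rate in MT/s = memory-cell clock x prefetch depth."""
    return mem_clock_mhz * PREFETCH[gen]
```

With the same 200MHz cell clock, DDR gives 400 MT/s, DDR2 gives 800 MT/s, and DDR3 gives 1600 MT/s: the bandwidth gains come from wider internal fetches, not faster cells, which is also why absolute CAS latency in ns barely improves across generations.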

DDR3 Standards

Standard   Mem clock  I/O bus clock  I/O bus cycle  Data rate  Module     Peak transfer  Timings                    CAS Latency
name       (MHz)      (MHz)          time (ns)      (MT/s)     name       rate (MB/s)    (CL-tRCD-tRP)              (ns)
DDR3-800   100        400            2.5            800        PC3-6400   6400           5-5-5, 6-6-6               12½, 15
DDR3-1066  133⅓       533⅓           1.875          1066⅔      PC3-8500   8533⅓          6-6-6, 7-7-7, 8-8-8        11¼, 13⅛, 15
DDR3-1333  166⅔       666⅔           1.5            1333⅓      PC3-10600  10666⅔         8-8-8, 9-9-9               12, 13½
DDR3-1600  200        800            1.25           1600       PC3-12800  12800          9-9-9, 10-10-10, 11-11-11  11¼, 12½, 13¾
DDR3-1866  233⅓       933⅓           1.07           1866⅔      PC3-14900  14933⅓         11-11-11, 12-12-12         11 11/14, 12 6/7
DDR3-2133  266⅔       1066⅔          0.9375         2133⅓      PC3-17000  17066⅔         12-12-12, 13-13-13         11¼, 12 3/16

DDR2 vs. DDR3 Performance
- The higher latency of DDR3 SDRAM has a negative effect on streaming operations

Source: xbitlabs

How to Get the Most of Memory?
- Each DIMM supports 4 open pages simultaneously
  - The more open pages, the better random accesses are served
- It is better to have more DIMMs: n DIMMs give 4n open pages
- Dual-sided DIMMs may have a separate chip select (CS) for each side
  - Such DIMMs support 8 open pages
- Dual-sided DIMMs may also have a common CS

SRAM: Static RAM
- True random access
- High speed, low density, high power
- No refresh
- Address not multiplexed
- DDR SRAM
  - 2 READs or 2 WRITEs per clock
  - Common or separate I/O
  - DDRII: 200MHz to 333MHz operation; density: 18/36/72Mb+
- QDR SRAM
  - Two separate DDR ports: one read and one write
  - One DDR address bus, alternating between the read address and the write address
  - QDRII: 250MHz to 333MHz operation; density: 18/36/72Mb+

SRAM vs. DRAM
- Both are random access: the access time is the same for all locations

                DRAM (Dynamic RAM)        SRAM (Static RAM)
Refresh         Refresh needed            No refresh needed
Address         Muxed: row + column       Not multiplexed
Access          Not truly random access   True random access
Density         High (1 transistor/bit)   Low (6 transistors/bit)
Power           Low                       High
Speed           Slow                      Fast
Price/bit       Low                       High
Typical usage   Main memory               Cache

Read-Only Memory (ROM)
- Random access, non-volatile
- ROM types:
  - PROM: Programmable ROM
    - Burnt once using special equipment
  - EPROM: Erasable PROM
    - Can be erased by exposure to UV light, and then reprogrammed
  - E²PROM: Electrically Erasable PROM
    - Can be erased and reprogrammed on board
    - Write (programming) time much longer than RAM
    - Limited number of writes (thousands)