Barriers. CS252 Graduate Computer Architecture, Lecture 22: Synchronization (con't), Memory Technology, Error Correction Codes. April 18th, 2010


CS252 Graduate Computer Architecture
Lecture 22
Synchronization (con't), Memory Technology, Error Correction Codes
April 18th, 2010
John Kubiatowicz
Electrical Engineering and Computer Sciences
University of California, Berkeley

4/13/2011 cs252-s11, Lecture 21

Review: Zoo of hardware primitives

  test&set (&address) {            /* most architectures */
    result = M[address];
    M[address] = 1;
    return result;
  }

  swap (&address, register) {      /* x86 */
    temp = M[address];
    M[address] = register;
    register = temp;
  }

  compare&swap (&address, reg1, reg2) {   /* */
    if (reg1 == M[address]) {
      M[address] = reg2;
      return success;
    } else {
      return failure;
    }
  }

  load-linked&store-conditional(&address) {   /* R4000, alpha */
    loop:
      ll   r1, M[address];
      mov  r2, 1;                  /* Can do arbitrary comp */
      sc   r2, M[address];
      beqz r2, loop;
  }

Barriers
  Software algorithms implemented using locks, flags, counters
  Hardware barriers
    Wired-AND line separate from address/data bus
      » Set input high when arrive, wait for output to be high to leave
    In practice, multiple wires to allow reuse
    Useful when barriers are global and very frequent
    Difficult to support arbitrary subset of processors
      » even harder with multiple processes per processor
    Difficult to dynamically change number and identity of participants
      » e.g. latter due to process migration
    Not common today on bus-based machines

A Simple Centralized Barrier
  Shared counter: # of processes that have arrived
    increment when arrive (lock), check until reaches numprocs
    Problem?

  struct bar_type {
    int counter;
    struct lock_type lock;
    int flag = 0;
  } bar_name;

  BARRIER (bar_name, p) {
    LOCK(bar_name.lock);
    if (bar_name.counter == 0)
      bar_name.flag = 0;               /* reset flag if first to reach */
    mycount = ++bar_name.counter;      /* mycount is private; pre-increment so the last arrival sees p */
    UNLOCK(bar_name.lock);
    if (mycount == p) {                /* last to arrive */
      bar_name.counter = 0;            /* reset for next barrier */
      bar_name.flag = 1;               /* release waiters */
    }
    else
      while (bar_name.flag == 0) {};   /* busy wait for release */
  }

A Working Centralized Barrier
  Consecutively entering the same barrier doesn't work
    Must prevent process from entering until all have left previous instance
    Could use another counter, but increases latency and contention
  Sense reversal: wait for flag to take different value consecutive times
    Toggle this value only when all processes reach

  BARRIER (bar_name, p) {
    local_sense = !(local_sense);      /* toggle private sense variable */
    LOCK(bar_name.lock);
    mycount = bar_name.counter++;      /* mycount is private */
    if (bar_name.counter == p) {       /* last to arrive */
      bar_name.counter = 0;            /* reset for next barrier */
      UNLOCK(bar_name.lock);
      bar_name.flag = local_sense;     /* release waiters */
    }
    else {
      UNLOCK(bar_name.lock);
      while (bar_name.flag != local_sense) {};
    }
  }

Improved Barrier Algorithms for a Bus
  Software combining tree
    Only k processors access the same location, where k is degree of tree
  [Figure: flat structure (contention) vs. tree structured (little contention)]
  Separate arrival and exit trees, and use sense reversal
  Valuable in distributed network: communicate along different paths
  On bus, all traffic goes on same bus, and no less total traffic
  Higher latency (log p steps of work, and O(p) serialized bus xactions)
  Advantage on bus is use of ordinary reads/writes instead of locks

Lock-Free Synchronization
  What happens if process grabs lock, then goes to sleep???
    Page fault
    Processor scheduling
    Etc
  Lock-free synchronization:
    Operations do not require mutual exclusion of multiple insts
  Nonblocking:
    Some process will complete in a finite amount of time even if other processors halt
  Wait-Free (Herlihy):
    Every (nonfaulting) process will complete in a finite amount of time
  Systems based on LL&SC can implement these

Transactional Memory
  Transaction-based model of memory
    Interface:
      start transaction();
      read/write data
      commit transaction():
    If conflicts detected, commit will abort and must be retried
    What is a conflict?
      » If values you read are written by others in middle of transaction
      » If values you write are written by others in middle of transaction
  Hardware support for transactions
    Typically uses cache coherence protocol to help process
    How to detect conflict?
      » Set R/W flags on cache line when access
      » Conflicts detected when cache line invalidates (and/or interventions) notice bits set
    Eager Conflict detection:
      » Newer transaction is assumed to conflict with older one
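The sense-reversal barrier from the "Working Centralized Barrier" slide can be sketched in Python. This is an illustration only, not the lecture's code: the class name SenseReversingBarrier and the helper run_demo are hypothetical, Python's threading.Lock stands in for LOCK/UNLOCK, and a thread-local variable plays the role of the private local_sense.

```python
import threading
import time


class SenseReversingBarrier:
    """Centralized barrier with sense reversal for a fixed number of threads."""

    def __init__(self, num_threads):
        self.num_threads = num_threads
        self.counter = 0                 # arrivals in the current episode
        self.flag = False                # shared release flag
        self.lock = threading.Lock()     # protects counter
        self.local = threading.local()   # holds each thread's private sense

    def wait(self):
        # Toggle this thread's private sense (False before its first use).
        local_sense = not getattr(self.local, "sense", False)
        self.local.sense = local_sense
        with self.lock:
            self.counter += 1
            last = self.counter == self.num_threads
            if last:
                self.counter = 0         # reset for the next episode
        if last:
            self.flag = local_sense      # release the waiters
        else:
            while self.flag != local_sense:
                time.sleep(0)            # busy-wait, yielding the interpreter


def run_demo(num_threads=4, phases=3):
    """Each thread logs its phase number, then waits at the barrier."""
    barrier = SenseReversingBarrier(num_threads)
    log = []

    def worker():
        for phase in range(phases):
            log.append(phase)            # list.append is atomic in CPython
            barrier.wait()

    threads = [threading.Thread(target=worker) for _ in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return log


if __name__ == "__main__":
    print(run_demo())
```

Because no thread can start phase k+1 until every thread has logged phase k, the returned log is always in nondecreasing order, even when the same barrier object is entered consecutively.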


4/18/2011 cs252-s11, Lecture 22

Brief discussion of Transactional Memory
  LogTM: Log-based Transactional Memory
    Kevin Moore, Jayaram Bobba, Michelle Moravan, Mark Hill & David Wood
    Use of Cache Coherence protocol to detect transaction conflicts
  Transactional Interface:
    begin_transaction(): Request that subsequent statements form a transaction
    commit_transaction(): Ends successful transaction begun by matching begin_transaction(). Discards any transaction state saved for potential abort
    abort_transaction(): Transfers control to a previously registered conflict handler which should undo and discard work since last begin_transaction()

Specific Logging Mechanism

Main Memory Background
  Performance of Main Memory:
    Latency: Cache Miss Penalty
      » Access Time: time between request and word arrives
      » Cycle Time: time between requests
    Bandwidth: I/O & Large Block Miss Penalty (L2)
  Main Memory is DRAM: Dynamic Random Access Memory
    Dynamic since needs to be refreshed periodically (8 ms, 1% time)
    Addresses divided into 2 halves (Memory as a 2D matrix):
      » RAS or Row Address Strobe
      » CAS or Column Address Strobe
  Cache uses SRAM: Static Random Access Memory
    No refresh (6 transistors/bit vs. 1 transistor)
    Size: DRAM/SRAM 4-8, Cost/Cycle time: SRAM/DRAM 8-16

DRAM Architecture
  [Figure: an (N+M)-bit address split into N row bits and M column bits. The Row Address Decoder drives word lines Row 1 ... Row 2^N; bit lines Col. 1 ... Col. 2^M feed the Column Decoder & Sense Amplifiers, which produce Data D. A memory cell (one bit) sits at each word-line/bit-line crossing.]
  Bits stored in 2-dimensional arrays on chip
  Modern chips have around 4 logical banks on each chip
    each logical bank physically implemented as many smaller arrays
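As a small illustration of "addresses divided into 2 halves", the sketch below splits a cell address into the part sent with RAS and the part sent with CAS. The function name split_address is hypothetical, and placing the row in the upper bits is an assumption; real parts and controllers may map bits differently.

```python
def split_address(addr, row_bits, col_bits):
    """Split a (row_bits + col_bits)-bit cell address into (row, column)."""
    if not 0 <= addr < (1 << (row_bits + col_bits)):
        raise ValueError("address out of range")
    row = addr >> col_bits                 # upper half, presented with RAS
    col = addr & ((1 << col_bits) - 1)     # lower half, presented with CAS
    return row, col


if __name__ == "__main__":
    # An 18-bit address multiplexed over 9 address pins (9 row + 9 column bits).
    print(split_address(0x2ABCD, 9, 9))
```

Multiplexing the two halves over the same pins is what lets a 2^18-location part get by with 9 address pins, at the cost of two address phases per access.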


1-T Memory Cell (DRAM)
  Write:
    1. Drive bit line
    2. Select row
  Read:
    1. Precharge bit line to Vdd/2
    2. Select row
    3. Cell and bit line share charges
       » Very small voltage changes on the bit line
    4. Sense (fancy sense amp)
       » Can detect changes of ~1 million electrons
    5. Write: restore the value
  Refresh
    1. Just do a dummy read to every cell.
  [Figure: one-transistor cell connected to row select and bit line]

DRAM Capacitors: more capacitance in a small area
  Trench capacitors:
    Logic ABOVE capacitor
    Gain in surface area of capacitor
    Better Scaling properties
    Better Planarization
  Stacked capacitors
    Logic BELOW capacitor
    Gain in surface area of capacitor
    2-dim cross-section quite small

DRAM Operation: Three Steps
  Precharge
    charges bit lines to known value, required before next row access
  Row access (RAS)
    decode row address, enable addressed row (often multiple Kb in row)
    bitlines share charge with storage cell
    small change in voltage detected by sense amplifiers which latch whole row of bits
    sense amplifiers drive bitlines full rail to recharge storage cells
  Column access (CAS)
    decode column address to select small number of sense amplifier latches (4, 8, 16, or 32 bits depending on DRAM package)
    on read, send latched bits out to chip pins
    on write, change sense amplifier latches, which then charge storage cells to required value
    can perform multiple column accesses on same row without another row access (burst mode)

DRAM Read Timing (Example)
  Every DRAM access begins at:
    The assertion of the RAS_L
    2 ways to read: early or late v. CAS
  [Figure: 256K x 8 DRAM with 9 address pins (A), 8 data pins (D), and control inputs RAS_L, CAS_L, WE_L, OE_L. Timing diagram: A carries Row Address, then Col Address, then Junk; D stays High Z until Data Out appears after the Read Access Time and Output Enable Delay; the DRAM Read Cycle Time spans one RAS_L assertion to the next.]
  Early Read Cycle: OE_L asserted before CAS_L
  Late Read Cycle: OE_L asserted after CAS_L
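The three-step operation (precharge, row access, column access) can be mimicked with a toy model that only counts operations. Everything here is a simplifying assumption for illustration: the class name DramBank is hypothetical, timing is ignored, and a write updates the latches and the stored row in one step.

```python
class DramBank:
    """Toy model of one DRAM bank with a single row of sense-amplifier latches."""

    def __init__(self, num_rows, num_cols):
        self.array = [[0] * num_cols for _ in range(num_rows)]
        self.open_row = None     # row currently held in the latches
        self.latches = None
        self.counts = {"precharge": 0, "row_access": 0, "col_access": 0}

    def _open(self, row):
        if self.open_row == row:
            return                               # row already latched: burst mode
        self.counts["precharge"] += 1            # bit lines back to a known value
        self.latches = list(self.array[row])     # RAS: latch the whole row
        self.open_row = row
        self.counts["row_access"] += 1

    def read(self, row, col):
        self._open(row)
        self.counts["col_access"] += 1           # CAS: select a latch
        return self.latches[col]

    def write(self, row, col, value):
        self._open(row)
        self.counts["col_access"] += 1
        self.latches[col] = value                # change the latch ...
        self.array[row][col] = value             # ... which recharges the cell


if __name__ == "__main__":
    bank = DramBank(num_rows=4, num_cols=8)
    bank.write(1, 3, 1)
    bank.read(1, 3)
    bank.read(1, 4)      # same row: no new row access
    bank.read(2, 0)      # different row: precharge + row access
    print(bank.counts)
```

Three accesses to row 1 cost a single row access, which is the effect that page mode and burst mode exploit.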

Main Memory Performance
  [Figure: timeline showing Access Time as the first part of Cycle Time]
  DRAM (Read/Write) Cycle Time >> DRAM (Read/Write) Access Time
    2:1; why?
  DRAM (Read/Write) Cycle Time:
    How frequent can you initiate an access?
    Analogy: A little kid can only ask his father for money on Saturday
  DRAM (Read/Write) Access Time:
    How quickly will you get what you want once you initiate an access?
    Analogy: As soon as he asks, his father will give him the money
  DRAM Bandwidth Limitation analogy:
    What happens if he runs out of money on Wednesday?

Increasing Bandwidth - Interleaving
  Access Pattern without Interleaving:
    [Figure: CPU and a single Memory; Start Access for D1, D1 available, Start Access for D2]
  Access Pattern with 4-way Interleaving:
    [Figure: CPU and Memory Banks 0-3; Access Bank 0, Access Bank 1, Access Bank 2, Access Bank 3, then "We can Access Bank 0 again"]

Main Memory Performance
  Wide:
    CPU/Mux 1 word; Mux/Cache, Bus, Memory N words (Alpha: 64 bits & 256 bits)
  Simple:
    CPU, Cache, Bus, Memory same width (32 bits)
  Interleaved:
    CPU, Cache, Bus 1 word: Memory N Modules (4 Modules); example is word interleaved

Quest for DRAM Performance
  1. Fast Page mode
     Add timing signals that allow repeated accesses to row buffer without another row access time
     Such a buffer comes naturally, as each array will buffer 1024 to 2048 bits for each access
  2. Synchronous DRAM (SDRAM)
     Add a clock signal to DRAM interface, so that the repeated transfers would not bear overhead to synchronize with DRAM controller
  3. Double Data Rate (DDR SDRAM)
     Transfer data on both the rising edge and falling edge of the DRAM clock signal, doubling the peak data rate
     DDR2 lowers power by dropping the voltage from 2.5 to 1.8 volts + offers higher clock rates: up to 400 MHz
     DDR3 drops to 1.5 volts + higher clock rates: up to 800 MHz
  Improved Bandwidth, not Latency
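To make the interleaving picture concrete, here is a rough timing sketch under assumed numbers (access time 2, cycle time 4, one access issued per time unit). The function name finish_time and all parameters are hypothetical; the model assumes word interleaving (bank = address mod number of banks) and ignores bus transfer time.

```python
def finish_time(addresses, num_banks, access_time, cycle_time, issue_time=1):
    """Time at which the last word arrives for a sequence of word addresses."""
    bank_free = [0] * num_banks     # earliest time each bank can start again
    next_issue = 0                  # earliest time the CPU can issue again
    done = 0
    for addr in addresses:
        bank = addr % num_banks                     # word interleaving
        start = max(next_issue, bank_free[bank])    # wait for a busy bank
        bank_free[bank] = start + cycle_time
        next_issue = start + issue_time
        done = max(done, start + access_time)
    return done


if __name__ == "__main__":
    sequential = list(range(8))
    print(finish_time(sequential, 1, access_time=2, cycle_time=4))   # one bank
    print(finish_time(sequential, 4, access_time=2, cycle_time=4))   # 4-way
    stride4 = [4 * i for i in range(8)]
    print(finish_time(stride4, 4, access_time=2, cycle_time=4))      # bank conflicts
```

With these assumed numbers, eight sequential words take 30 time units on one bank and 9 with four banks, while a stride of 4 sends every access to bank 0 and loses the whole benefit.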


Fast Memory Systems: DRAM specific
  Multiple CAS accesses: several names (page mode)
    Extended Data Out (EDO): 30% faster in page mode
  Newer DRAMs to address gap; what will they cost, will they survive?
    RAMBUS: startup company; reinvented DRAM interface
      » Each chip a module vs. slice of memory
      » Short bus between CPU and chips
      » Does own refresh
      » Variable amount of data returned
      » 1 byte / 2 ns (500 MB/s per chip)
    Synchronous DRAM: 2 banks on chip, a clock signal to DRAM, transfer synchronous to system clock ( MHz)
      » DDR DRAM: Two transfers per clock (on rising and falling edge)
    Intel claims FB-DIMM is the next big thing
      » Stands for Fully-Buffered Dual-Inline RAM
      » Same basic technology as DDR, but utilizes a serial daisy-chain channel between different memory components.

Fast Page Mode Operation
  Regular DRAM Organization:
    N rows x N column x M-bit
    Read & Write M-bit at a time
    Each M-bit access requires a RAS / CAS cycle
  Fast Page Mode DRAM
    N x M SRAM to save a row
  After a row is read into the register
    Only CAS is needed to access other M-bit blocks on that row
    RAS_L remains asserted while CAS_L is toggled
  [Figure: DRAM array (N rows x N cols) feeding an N x M SRAM row register with M-bit Output. Timing diagram: RAS_L held asserted with one Row Address on A; CAS_L toggled with four Col Addresses for the 1st, 2nd, 3rd, and 4th M-bit accesses.]

SDRAM timing (Single Data Rate)
  [Figure: timing diagram showing RAS (New Bank), CAS, CAS Latency, Burst READ, Precharge]
  Micron 128M-bit dram (using 2Meg x 16bit x 4bank ver)
    Row (12 bits), bank (2 bits), column (9 bits)

Double-Data Rate (DDR2) DRAM
  [Figure: timing diagram with 200MHz Clock; commands Row, Column, Precharge, Row; Data; 400Mb/s Data Rate]
  [ Micron, 256Mb DDR2 SDRAM datasheet ]
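The numbers quoted on these slides can be checked with a few lines of arithmetic. The helper names below are hypothetical, and the 16-bit device width used for the Micron part is an assumption read off the "2Meg x 16bit x 4bank" organization.

```python
def peak_rate_per_pin_mbps(clock_mhz, transfers_per_clock):
    """Peak data rate on one data pin, in Mb/s."""
    return clock_mhz * transfers_per_clock


def capacity_bits(row_bits, bank_bits, col_bits, width_bits):
    """Total bits addressed by row, bank, and column fields of a given width."""
    return (1 << (row_bits + bank_bits + col_bits)) * width_bits


def mb_per_s(bytes_per_transfer, ns_per_transfer):
    """Bandwidth in MB/s (1 MB = 10**6 bytes) for a fixed transfer interval."""
    return bytes_per_transfer * 1000 / ns_per_transfer


if __name__ == "__main__":
    print(peak_rate_per_pin_mbps(200, 2))          # DDR: two transfers per clock
    print(capacity_bits(12, 2, 9, 16) // 2**20)    # Micron part, in Mbit
    print(mb_per_s(1, 2))                          # RAMBUS: 1 byte every 2 ns
```

A 200 MHz clock with two transfers per cycle gives the 400 Mb/s per-pin rate on the DDR2 slide, 12 + 2 + 9 address bits at 16 bits wide give 128 Mbit, and one byte every 2 ns is 500 MB/s.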


DDR vs DDR2 vs DDR3 vs DDR4
  All about increasing the rate at the pins
  Not an improvement in latency
    In fact, latency can sometimes be worse
  Internal banks often consumed for increased bandwidth
  DDR4 (January 2011)
    Samsung, Currently 2.13Gb/sec
    Target: 4 Gb/sec

DRAM Power: Not always up, but

DRAM Packaging
  [Figure: DRAM chip with ~7 Clock and control signals, ~12 Address lines (multiplexed row/column address), and a Data bus (4b, 8b, 16b, 32b)]
  DIMM (Dual Inline Memory Module) contains multiple chips arranged in ranks
  Each rank has clock/control/address signals connected in parallel (sometimes need buffers to drive signals to all chips), and data pins work together to return wide word
    e.g., a rank could implement a 64-bit data bus using 16 x4-bit chips, or a 64-bit data bus using 8 x8-bit chips.
  A modern DIMM usually has one or two ranks (occasionally 4 if high capacity)
    A rank will contain the same number of banks as each constituent chip (e.g., 4-8)

DRAM Channel
  [Figure: Memory Controller driving a 64-bit Data Bus and a Command/Address Bus shared by two Ranks]
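The rank example (a 64-bit bus built from x4 or x8 chips) amounts to simple slicing of the data word. The sketch below uses hypothetical helper names and assumes chip 0 supplies the least significant bits, which is an arbitrary choice made for illustration.

```python
def chips_per_rank(bus_width_bits, chip_width_bits):
    """Number of chips whose data pins together form the data bus."""
    if bus_width_bits % chip_width_bits != 0:
        raise ValueError("bus width must be a multiple of chip width")
    return bus_width_bits // chip_width_bits


def assemble_word(chip_outputs, chip_width_bits):
    """Concatenate per-chip outputs (chip 0 = least significant) into one word."""
    word = 0
    for i, value in enumerate(chip_outputs):
        if not 0 <= value < (1 << chip_width_bits):
            raise ValueError("chip output wider than the chip's data pins")
        word |= value << (i * chip_width_bits)
    return word


if __name__ == "__main__":
    print(chips_per_rank(64, 4), chips_per_rank(64, 8))
    slices = [0xEF, 0xCD, 0xAB, 0x89, 0x67, 0x45, 0x23, 0x01]
    print(hex(assemble_word(slices, 8)))
```

Every chip in the rank sees the same command and address, so one column access returns all slices of the wide word at once.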


FB-DIMM Memories
  [Figure: FB-DIMM vs. Regular DIMM, each attached to a Controller]
  Uses Commodity DRAMs with special controller on actual DIMM board
  Connection is in a serial form

FLASH Memory
  [Figure: Samsung 2007 NAND Flash]
  Like a normal transistor but:
    Has a floating gate that can hold charge
    To write: raise or lower wordline high enough to cause charges to tunnel
    To read: turn on wordline as if normal transistor
      » presence of charge changes threshold and thus measured current
  Two varieties:
    NAND: denser, must be read and written in blocks
    NOR: much less dense, fast to read and write

Tunneling Magnetic Junction (MRAM)
  Tunneling Magnetic Junction RAM (TMJ-RAM)
    Speed of SRAM, density of DRAM, non-volatile (no refresh)
    "Spintronics": combination quantum spin and electronics
    Same technology used in high-density disk-drives

Phase Change memory (IBM, Samsung, Intel)
  Phase Change Memory (called PRAM or PCM)
    Chalcogenide material can change from amorphous to crystalline state with application of heat
    Two states have very different resistive properties
    Similar to material used in CD-RW process
  Exciting alternative to FLASH
    Higher speed
    May be easy to integrate with CMOS processes

Error Correction Codes (ECC)
  Memory systems generate errors (accidentally flipped bits)
    DRAMs store very little charge per bit
    "Soft" errors occur occasionally when cells are struck by alpha particles or other environmental upsets.
    Less frequently, "hard" errors can occur when chips permanently fail.
    Problem gets worse as memories get denser and larger
  Where is "perfect" memory required?
    servers, spacecraft/military computers, ebay, ...
  Memories are protected against failures with ECCs
  Extra bits are added to each data-word
    used to detect and/or correct faults in the memory system
    in general, each possible data word value is mapped to a unique "code word". A fault changes a valid code word to an invalid one - which can be detected.

ECC Approach: Redundancy
  Approach: Redundancy
    Add extra information so that we can recover from errors
    Can we do better than just create complete copies?
  Block Codes: Data Coded in blocks
    k data bits coded into n encoded bits
    Measure of overhead: Rate of Code: k/n
    Often called an (n,k) code
    Consider data as vectors in GF(2) [i.e. vectors of bits]
  Code Space is set of all $2^n$ vectors, Data space set of $2^k$ vectors
    Encoding function: $C = f(d)$
    Decoding function: $d = f(C')$
    Not all possible code vectors, C, are valid!
General Idea: Code Vector Space
  [Figure: Code Space containing the valid code word $C_0 = f(v_0)$ for data value $v_0$; the Code Distance (Hamming Distance) separates valid code words]
  Not every vector in the code space is valid
  Hamming Distance (d):
    Minimum number of bit flips to turn one code word into another
  Number of errors that we can detect: $d - 1$
  Number of errors that we can fix: $\lfloor (d-1)/2 \rfloor$

Some Code Types
  Linear Codes: $C = G \cdot d$, $S = H \cdot C$
    Code is generated by G and in null-space of H
    (n,k) code: Data space $2^k$, Code space $2^n$
    (n,k,d) code: specify distance d as well
  Random code:
    Need to both identify errors and correct them
    Distance d ⇒ correct $\lfloor (d-1)/2 \rfloor$ errors
  Erasure code:
    Can correct errors if we know which bits/symbols are bad
    Example: RAID codes, where "symbols" are blocks of disk
    Distance d ⇒ correct (d-1) errors
  Error detection code:
    Distance d ⇒ detect (d-1) errors
  Hamming Codes
    d = 3 ⇒ Columns nonzero, Distinct
    d = 4 ⇒ Columns nonzero, Distinct, Odd-weight
  Binary Golay code: based on quadratic residues mod 23
    Binary code: [24, 12, 8] and [23, 12, 7].
    Often used in space-based schemes, can correct 3 errors
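A brief sketch of how code distance translates into detection and correction capability. The function names are hypothetical, and code words are represented as Python integers whose bits are the code bits.

```python
from itertools import combinations


def hamming_distance(a, b):
    """Number of bit positions in which two code words differ."""
    return bin(a ^ b).count("1")


def code_distance(code_words):
    """Minimum Hamming distance over all pairs of distinct code words."""
    return min(hamming_distance(a, b) for a, b in combinations(code_words, 2))


def capabilities(d):
    """Errors detectable, errors correctable, erasures correctable at distance d."""
    return {"detect": d - 1, "correct": (d - 1) // 2, "erasures": d - 1}


if __name__ == "__main__":
    repetition = [0b000, 0b111]                 # (3,1) repetition code
    parity = [0b000, 0b011, 0b101, 0b110]       # (3,2) even-parity code
    for name, code in (("repetition", repetition), ("parity", parity)):
        d = code_distance(code)
        print(name, d, capabilities(d))
```

The repetition code has distance 3, so it detects two errors or corrects one; the single parity bit gives distance 2, which detects one error but corrects none unless the bad position is known (an erasure).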

Hamming Bound, symbols in GF(2)
  Consider an (n,k) code with distance d
    How do n, k, and d relate to one another?
  First question: How big are spheres?
    For distance d, spheres are of radius $\lfloor (d-1)/2 \rfloor$,
      » i.e. all errors with weight $\lfloor (d-1)/2 \rfloor$ or less must fit within sphere
    Thus, size of sphere is at least:
      1 + Num(1-bit err) + Num(2-bit err) + ... + Num($\lfloor (d-1)/2 \rfloor$-bit err)
      $\text{Size} \ge \sum_{e=0}^{\lfloor (d-1)/2 \rfloor} \binom{n}{e}$
  Hamming bound reflects bin-packing of spheres:
    need $2^k$ of these spheres within code space
      $2^k \cdot \sum_{e=0}^{\lfloor (d-1)/2 \rfloor} \binom{n}{e} \le 2^n$
      $2^k \cdot (1 + n) \le 2^n, \quad d = 3$

How to Generate code words?
  Consider a linear code. Need a Generator Matrix.
    Let $v_i$ be the data value (k bits), $C_i$ be resulting code (n bits):
      $C_i = G \cdot v_i$        G must be an $n \times k$ matrix
  Are there $2^k$ unique code values?
    Only if the k columns of G are linearly independent!
  Of course, need some way of decoding as well.
      $v_i = f_d(C_i')$
    Is this linear??? Why or why not?
  A code is systematic if the data is directly encoded within the code words.
    Means Generator has form: $G = \begin{bmatrix} I \\ P \end{bmatrix}$
    Can always turn non-systematic code into a systematic one (row ops)
  But What is distance of code? Not Obvious!

Implicitly Defining Codes by Check Matrix
  Consider a parity-check matrix H ($[n-k] \times n$)
    Define valid code words $C_i$ as those that give $S_i = 0$ (null space of H):
      $S_i = H \cdot C_i = 0$
    Size of null space? (null-rank H) = k if (n-k) linearly independent columns in H
  Suppose we transmit code word C with error:
    Model this as vector E which flips selected bits of C to get R (received):
      $R = C \oplus E$
    Consider what happens when we multiply by H:
      $S = H \cdot R = H \cdot (C \oplus E) = H \cdot E$
  What is distance of code?
    Code has distance d if no sum of d-1 or less columns yields 0
    I.e. No error vectors, E, of weight < d have zero syndromes
    So Code design is designing H matrix

How to relate G and H (Binary Codes)
  Defining H makes it easy to understand distance of code, but hard to generate code (H defines code implicitly!)
  However, let H be of following form:
      $H = [\, P \mid I \,]$        P is $(n-k) \times k$, I is $(n-k) \times (n-k)$. Result: H is $(n-k) \times n$
  Then, G can be of following form (maximal code size):
      $G = \begin{bmatrix} I \\ P \end{bmatrix}$        P is $(n-k) \times k$, I is $k \times k$. Result: G is $n \times k$
  Notice: G generates values in null-space of H and has k independent columns so generates $2^k$ unique values:
      $S_i = H \cdot G \cdot v_i = [\, P \mid I \,] \cdot \begin{bmatrix} I \\ P \end{bmatrix} \cdot v_i = (P \oplus P) \cdot v_i = 0$
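The G = [I; P], H = [P | I] construction can be tried out on a (7,4) Hamming code. This is a sketch under assumptions: the particular matrix P below is one valid choice (its columns are the four 3-bit vectors of weight two or more, so all seven columns of H are nonzero and distinct), and the function names are hypothetical.

```python
from math import comb

K, R = 4, 3                  # data bits, check bits
N = K + R                    # code bits
P = [[1, 1, 0, 1],           # (n-k) x k parity part, chosen for illustration
     [1, 0, 1, 1],
     [0, 1, 1, 1]]


def encode(data):
    """C = G.v with G = [I; P]: data bits followed by parity bits (mod 2)."""
    parity = [sum(p * d for p, d in zip(row, data)) % 2 for row in P]
    return list(data) + parity


def syndrome(received):
    """S = H.R with H = [P | I] (mod 2)."""
    return [(sum(p * r for p, r in zip(row, received[:K])) + received[K + i]) % 2
            for i, row in enumerate(P)]


def h_column(j):
    """Column j of H = [P | I]."""
    if j < K:
        return [row[j] for row in P]
    return [1 if i == j - K else 0 for i in range(R)]


def correct(received):
    """Fix at most one flipped bit by matching the syndrome to a column of H."""
    s = syndrome(received)
    if not any(s):
        return list(received)
    for j in range(N):
        if h_column(j) == s:
            fixed = list(received)
            fixed[j] ^= 1
            return fixed
    raise ValueError("syndrome matches no single-bit error")


def hamming_bound_ok(n, k, d):
    """Check 2^k * sum_{e <= (d-1)/2} C(n, e) <= 2^n."""
    return 2 ** k * sum(comb(n, e) for e in range((d - 1) // 2 + 1)) <= 2 ** n


if __name__ == "__main__":
    word = encode([1, 0, 1, 1])
    damaged = list(word)
    damaged[2] ^= 1                          # flip one bit
    print(word, syndrome(damaged), correct(damaged) == word)
    print(hamming_bound_ok(7, 4, 3), hamming_bound_ok(7, 5, 3))
```

A single flipped bit in position j produces the syndrome H·E, which is exactly column j of H, so distinct nonzero columns are what make the error locatable. The (7,4) code meets the Hamming bound with equality (16 spheres of size 8 fill the 128-vector code space), while a hypothetical (7,5) code of distance 3 would violate it.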

Conclusion
  Main memory is Dense, Slow
    Cycle time > Access time!
  Techniques to optimize memory
    Wider Memory
    Interleaved Memory: for sequential or independent accesses
    Avoiding bank conflicts: SW & HW
    DRAM specific optimizations: page mode & Specialty DRAM
  ECC: add redundancy to correct for errors
    (n,k,d): n code bits, k data bits, distance d
    Linear codes: code vectors computed by linear transformation
  Erasure code: after identifying "erasures", can correct
