T1 ("Niagara"): Case Study of a Multi-core, Multithreaded Processor


Constance Palmer
5 years ago


Reference

Case Study of a Multi-core, Multithreaded Processor: The Sun T1 ("Niagara")
Computer Architecture: A Quantitative Approach, Fourth Edition, by John Hennessy and David Patterson, Chapter 4.

T1 ("Niagara")

Target: commercial server applications
  - High thread-level parallelism (TLP)
      - Large numbers of parallel client requests
  - Low instruction-level parallelism (ILP)
      - High cache miss rates
      - Many unpredictable branches
      - Frequent load-load dependencies
Power, cooling, and space are major concerns for data centers
Metric: Performance/Watt/Sq. Ft.
Approach: multicore, fine-grained multithreading, simple pipeline, small L1 caches, shared L2

T1 Architecture

Also ships with 6 or 4 processors

T1 pipeline

Single issue, in-order, 6-deep pipeline: F, S, D, E, M, W
3 clock delays for loads & branches
Shared units:
  - L1 $, L2 $
  - TLB
  - X units
  - Pipe registers
Hazards:
  - Data
  - Structural

Integer Register File

One register file / thread
SPARC window: in, out, local registers
Highly integrated cell structure to support 4 threads:
  - 8 register windows / thread
  - 3 read ports + 2 write ports
  - Read/write: single cycle latency
  - Active Window Cell (copy of the architectural set window)

Thread Scheduling

Thread selection based on:
  - Previous long-latency instruction in pipe
  - Instruction type
  - LRU status
Select & Fetch coupled

T1 Fine-Grained Multithreading

  - Each core supports four threads and has its own level-one caches (16 KB for instructions and 8 KB for data)
  - The core switches to a new thread on each clock cycle
  - Idle threads (those waiting due to a pipeline delay or cache miss) are bypassed in the scheduling
  - The processor is idle only when all threads are idle or stalled
  - Both loads and branches incur a 3-cycle delay that can only be hidden by other threads
  - A single set of floating-point functional units is shared by all 8 cores; floating-point performance was not a focus for T1
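The scheduling rules above can be sketched as a toy simulation. This is a minimal model under stated assumptions, not the T1's actual selection logic: the function name `simulate_core`, the two instruction kinds `"alu"` and `"long"`, and the `stall_cycles` parameter are all hypothetical, and only the LRU criterion and the bypassing of stalled threads are modeled.

```python
def simulate_core(programs, stall_cycles=3):
    """Toy model of one single-issue core with fine-grained multithreading.

    programs: one list per thread; each entry is "alu" (no delay) or
              "long" (a load or branch; the thread cannot issue again
              until stall_cycles extra cycles have passed).
    Returns (total_cycles, idle_cycles).
    All names and the instruction encoding are illustrative assumptions.
    """
    n = len(programs)
    pc = [0] * n            # next instruction index per thread
    ready_at = [0] * n      # first cycle at which each thread may issue again
    last_issue = [-1] * n   # cycle of most recent issue, used for LRU selection
    cycle = 0
    idle = 0
    while any(pc[t] < len(programs[t]) for t in range(n)):
        # Stalled or finished threads are bypassed.
        candidates = [t for t in range(n)
                      if pc[t] < len(programs[t]) and ready_at[t] <= cycle]
        if candidates:
            # Pick the least recently used ready thread.
            t = min(candidates, key=lambda i: last_issue[i])
            if programs[t][pc[t]] == "long":
                ready_at[t] = cycle + 1 + stall_cycles
            pc[t] += 1
            last_issue[t] = cycle
        else:
            idle += 1       # core idles only when no thread is ready
        cycle += 1
    return cycle, idle


prog = ["long", "alu", "long", "alu"]
print(simulate_core([prog]))       # one thread: stalls are exposed
print(simulate_core([prog] * 4))   # four threads: stalls are overlapped
```

With a single thread the four instructions take 10 cycles, 6 of them idle; with four threads running the same program, 16 instructions issue in 16 cycles with no idle cycles, which illustrates how other threads hide the load and branch delay.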

Memory, Clock, Power

16 KB 4-way set assoc. I$ / core
8 KB 4-way set assoc. D$ / core
3 MB 12-way set assoc. L2 $ shared
  - 4 x 750 KB independent banks
  - Crossbar switch to connect
  - 2 cycle throughput, 8 cycle latency
  - Direct link to DRAM & Jbus
  - Manages cache coherence for the 8 cores
  - CAM-based directory
Write through
  - Allocate LD
  - No-allocate ST

Coherency is enforced among the L1 caches by a directory associated with each L2 cache block.
  - The directory tracks which L1 caches have copies of an L2 block
  - By associating each L2 with a particular memory bank and enforcing the subset property, T1 can place the directory at L2 rather than at the memory, which reduces the directory overhead
  - Because the L1 data cache is write-through, only invalidation messages are required; the data can always be retrieved from the L2 cache

1.2 GHz at ≈72 W typical, 79 W peak power consumption

Miss Rates: L2 Cache Size, Block Size

[Figure: L2 miss rate for TPC-C and SPECJBB across L2 cache sizes and block sizes, including the T1 configuration]

Miss Latency: L2 Cache Size, Block Size

[Figure: L2 miss latency across L2 cache sizes and block sizes, including the T1 configuration]

CPI Breakdown of Performance

[Table: per-thread CPI, per-core CPI, effective CPI for 8 cores, and effective IPC for 8 cores, by benchmark (TPC-C, SPECJBB, SPECWeb)]
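The coherence scheme described above (write-through L1 caches, a directory kept at the L2, invalidation-only messages) can be illustrated with a small model. This is a sketch under simplifying assumptions, not the T1 hardware design: the class `L2Directory` and its methods are hypothetical names, blocks are single values keyed by address, and banking, the crossbar, and timing are ignored.

```python
class L2Directory:
    """Toy model: write-through L1 data caches kept coherent by a directory at L2.

    Hypothetical illustration only; one value per block address, no banks,
    no timing. Loads allocate in L1, stores do not (allocate LD, no-allocate ST).
    """

    def __init__(self, num_cores=8):
        self.l2 = {}                                  # block address -> value
        self.sharers = {}                             # block address -> set of core ids
        self.l1 = [dict() for _ in range(num_cores)]  # per-core L1 contents
        self.invalidations = 0

    def load(self, core, addr):
        if addr not in self.l1[core]:
            # L1 miss: the block is always available from L2 (subset property).
            self.l1[core][addr] = self.l2.setdefault(addr, 0)
            self.sharers.setdefault(addr, set()).add(core)
        return self.l1[core][addr]

    def store(self, core, addr, value):
        self.l2[addr] = value                         # write-through to L2
        if addr in self.l1[core]:
            self.l1[core][addr] = value               # update own copy if present
        # Invalidate every other L1 copy recorded in the directory.
        for other in self.sharers.get(addr, set()) - {core}:
            del self.l1[other][addr]
            self.invalidations += 1
        self.sharers[addr] = {core} if addr in self.l1[core] else set()


d = L2Directory(num_cores=2)
d.store(0, 0x40, 7)     # store miss: written to L2 only
d.load(1, 0x40)         # core 1 caches the block
d.load(0, 0x40)         # core 0 caches the block
d.store(0, 0x40, 9)     # core 1's copy is invalidated
print(d.load(1, 0x40), d.invalidations)
```

Because every store reaches the L2, a core whose copy was invalidated simply reloads the current value from the L2; no data needs to be forwarded between L1 caches.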

Not Ready Breakdown

[Figure: fraction of cycles not ready for TPC-C, SPECJBB, and SPECWeb99, broken down into L1 I miss, L1 D miss, L2 miss, pipeline delay, and other]

Contributors to the "other" category:
  - TPC-C: store buffer full is the largest contributor
  - SPECJBB: atomic instructions are the largest contributor
  - SPECWeb99: both factors contribute

Performance: Benchmarks + Sun Marketing

Benchmark \ Architecture                                      Sun Fire T2000   IBM p5-550 with 2 dual-core Power5 chips   Dell PowerEdge
SPECjbb2005 (Java server software), business operations/sec   63,378           61,789                                     24,208 (SC1425 with dual single-core Xeon)
SPECweb2005 (Web server performance)                          14,001           7,881                                      4,850 (2850 with two dual-core Xeon processors)
NotesBench (Lotus Notes performance)                          16,061           14,740

Space, Watts, and Performance

Note the paradigm shift

Microprocessor Comparison

Processor                          SUN T1         Opteron     Pentium D      IBM Power5
Cores                              8              2           2              2
Instruction issues / clock / core  1              3           3              4
Peak instr. issues / chip          8              6           6              8
Multithreading                     Fine-grained   No          SMT            SMT
L1 I/D in KB per core              16/8           64/64       12K uops/16    64/32
L2 per core/shared                 3 MB shared    1 MB/core   1 MB/core      1.9 MB shared
Clock rate (GHz)                   1.2            2.4         3.2            1.9
Transistor count (M)               300            233         230            276
Die size (mm^2)                    (values not recoverable)
Power (W)                          79             110         130            125
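The per-thread, per-core, and chip-level CPI figures reported for T1, and the peak-issue row of the processor comparison, are related by simple arithmetic. The sketch below shows that relationship; the function names are hypothetical, it assumes 4 threads per core and 8 cores, and the per-thread CPI of 8.0 in the usage lines is a made-up input, not a measured value.

```python
def chip_level_cpi(per_thread_cpi, threads_per_core=4, cores=8):
    """Derive per-core CPI, effective chip CPI, and effective chip IPC.

    Assumes threads and cores all make progress at the same average rate,
    so throughput simply adds up (an idealization for illustration).
    """
    per_core_cpi = per_thread_cpi / threads_per_core
    effective_cpi = per_core_cpi / cores
    effective_ipc = 1.0 / effective_cpi
    return per_core_cpi, effective_cpi, effective_ipc


def peak_issue_per_chip(cores, issues_per_clock_per_core):
    """Upper bound on instructions issued per clock across the whole chip."""
    return cores * issues_per_clock_per_core


# Hypothetical input: a per-thread CPI of 8.0.
per_core, eff_cpi, eff_ipc = chip_level_cpi(8.0)
print(per_core, eff_cpi, eff_ipc)     # 2.0 0.25 4.0
print(peak_issue_per_chip(8, 1))      # 8 single-issue cores -> 8
```

Comparing the effective IPC against the peak issue rate gives a rough sense of how much of the chip's issue bandwidth a workload actually uses.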


Performance Relative to Pentium D

[Figure: performance relative to Pentium D for Power5, Opteron, and Sun T1 on SPECIntRate, SPECFPRate, SPECJBB, SPECWeb, and TPC-like benchmarks]

Performance/mm^2, Performance/Watt

[Figure: efficiency normalized to Pentium D for Power5, Opteron, and Sun T1: SPECIntRate/mm^2, SPECIntRate/Watt, SPECFPRate/mm^2, SPECFPRate/Watt, SPECJBB/mm^2, SPECJBB/Watt, TPC-C/mm^2, TPC-C/Watt]

Niagara 2

  - Improve performance by increasing threads supported per chip from 32 to 64
      - 8 cores * 8 threads per core
  - Floating-point unit for each core, not for each chip
  - Hardware support for encryption standards AES, DES, and elliptic-curve cryptography
  - Niagara 2 will add a number of 8x PCI Express interfaces directly into the chip in addition to integrated Gigabit Ethernet XAUI interfaces and Gigabit Ethernet ports
  - Integrated memory controllers will shift support from DDR to FB-DIMMs and double the maximum amount of system memory

Kevin Krewell, "Sun's Niagara Begins CMT Flood - The Sun UltraSPARC T1 Processor Released," Microprocessor Report, January 2006.

Sun Niagara 2 at a Glance

  - 8 cores x 8 threads = 64 threads
  - Dual single-issue pipelines
  - FPU per core
  - L2: 8 banks, 16-way set associative
  - Dual-channel FB-DIMM ports
  - Higher throughput and throughput/watt than Niagara 1, for both integer and floating-point work
  - Available 2007
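The performance/Watt and performance/mm^2 comparisons normalized to Pentium D follow a simple recipe: divide each chip's score by its power or die area, then divide by the same ratio for the baseline chip. The sketch below shows that calculation; the `Chip` class, the function name, and every number in the usage lines are hypothetical placeholders, not measurements from the slides.

```python
from dataclasses import dataclass


@dataclass
class Chip:
    name: str
    score: float      # benchmark result, higher is better
    watts: float      # power consumption
    area_mm2: float   # die area


def normalized_efficiency(chips, baseline_name):
    """Return performance, performance/Watt, and performance/mm^2
    for each chip, each expressed relative to the baseline chip."""
    base = next(c for c in chips if c.name == baseline_name)
    rows = {}
    for c in chips:
        rows[c.name] = {
            "perf": c.score / base.score,
            "perf_per_watt": (c.score / c.watts) / (base.score / base.watts),
            "perf_per_mm2": (c.score / c.area_mm2) / (base.score / base.area_mm2),
        }
    return rows


# Made-up numbers purely to exercise the function.
chips = [
    Chip("baseline", score=100.0, watts=100.0, area_mm2=200.0),
    Chip("chip_x", score=200.0, watts=50.0, area_mm2=400.0),
]
for name, row in normalized_efficiency(chips, "baseline").items():
    print(name, row)
```

In this made-up case the second chip is 2x faster and 4x better per Watt, yet no better per mm^2, which shows why the three metrics can rank the same processors differently.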
