Godson Processor and its Application in High Performance Computers


1 Godson Processor and its Application in High Performance Computers
Weiwu Hu
Institute of Computing Technology, Chinese Academy of Sciences
Loongson Technologies Corporation Limited

2 Contents
- Godson Processor Briefs
- Godson-3B Processor for Servers and HPCs
- TeraFLOPS Godson-3D for HPCs
Godson is the academic name of Loongson™.

3 Godson Processor Briefs

4 From Academic To Commercial
- Ten years of research from 2001
  - Institute of Computing Technology, Chinese Academy of Sciences
  - Supported by: National Major S&T Project (MIIT), National 863 and 973 Projects (MOST), National Science Foundation of China (NSFC), National Knowledge Innovation Project (CAS)
  - Technology achieved: superscalar out-of-order (OOO) core, multi-core, 1.5GHz, 32nm, 1 billion transistors
- Commercialization from 2010
  - Loongson Technologies Corporation Limited
  - A start-up company: from sample to product
  - Three CPU series: Big CPU, Middle CPU, Small CPU

5 Loongson CPU Roadmap

6 Godson-3 Scalable Architecture
- Scalable interconnection network: mesh + crossbar
- Directory-based cache-coherence (CC) protocol
- Directory entry stored in the LLC cache block

7 Godson-3B Processor for Servers and HPCs

8 8-core LS3B for Server and HPC
- LS3B1000
  - GP, 4MB LLC
  - 8 four-issue 64-bit cores
  - 2*256-bit vector extension per core
  - 128GFLOPS@60W
  - 2*DDR3 800, 2*HT1.0 controllers
  - 583M transistors, 300mm²
- LS3B1200
  - 1.25GHz@32nm LP, private L2, 8MB LLC
  - 8 four-issue 64-bit cores
  - 2*256-bit vector extension per core
  - 160GFLOPS@40W
  - 2*DDR3 1600, 2*HT2.0 controllers
  - 1.1B transistors, 180mm²

9 Vector Extension of CPU Core
- 4-issue out-of-order core
- Two 256-bit SIMD vector units
- 8 64-bit MACs per core
- Keeps MIPS64 compatibility
- 128-entry 256-bit register file
- 300+ SIMD instructions
- Target workloads: Linpack, FFT, filter, media
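The peak figures quoted for the two LS3B parts can be cross-checked from the core count and the MAC count per core. The following is a minimal sketch, assuming each MAC retires one multiply and one add (2 flops) per cycle; the function names are illustrative and the LS3B1000 clock is derived from the quoted 128GFLOPS rather than read from the slides.

```python
# Cross-check of the peak throughput figures quoted for LS3B.
# Assumption: one MAC = 2 floating-point operations per cycle.
FLOPS_PER_MAC = 2


def peak_gflops(cores: int, macs_per_core: int, clock_ghz: float) -> float:
    """Peak GFLOPS for a chip with the given core count, MACs per core, and clock."""
    return cores * macs_per_core * FLOPS_PER_MAC * clock_ghz


def implied_clock_ghz(peak: float, cores: int, macs_per_core: int) -> float:
    """Clock (GHz) that a quoted peak GFLOPS figure implies under the same assumption."""
    return peak / (cores * macs_per_core * FLOPS_PER_MAC)


# 8 cores x 8 MACs per core, 1.25 GHz clock for the 32nm part.
ls3b1200_peak = peak_gflops(cores=8, macs_per_core=8, clock_ghz=1.25)

# Derived value, not a figure taken from the slides.
ls3b1000_clock = implied_clock_ghz(128.0, cores=8, macs_per_core=8)

print(f"LS3B1200 peak: {ls3b1200_peak} GFLOPS")
print(f"LS3B1000 implied clock: {ls3b1000_clock} GHz")
```

Under these assumptions the 1.25GHz part lands on the quoted 160GFLOPS, and the 128GFLOPS part implies a clock of 1GHz.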

10 Vector Unit Features
- Long vector
  - Two 256-bit vector units per core
  - 64 MACs per chip
- Streamed data link for the vector unit
  - Traditional load/store cannot feed the starving vector unit
- Data format transform on the fly
  - Matrix transpose, FFT butterfly, etc.
- Shuffle and computing in one instruction
[Figure: data paths among L1, L2, vector registers (VR), and memory]

11 Vector Instructions (Shuffle and Calculation in One)
[Figure panels: FFT (two panels), Matrix Multiplication, Media Decode]
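To illustrate what combining a shuffle with a calculation buys for matrix multiplication, here is a minimal software model. It is a sketch only: `shuffle_mac` is a hypothetical operation invented for this example, not an actual Godson-3B instruction, and a 256-bit register is modeled as a list of four 64-bit lanes.

```python
from typing import List, Sequence

LANES = 4  # a 256-bit register holds four 64-bit elements


def shuffle_mac(acc: Sequence[float], a: Sequence[float],
                b: Sequence[float], pattern: Sequence[int]) -> List[float]:
    """Hypothetical fused op: lane i computes acc[i] + a[pattern[i]] * b[i].

    The lane permutation of `a` (the shuffle) and the multiply-accumulate
    happen in one step instead of two separate instructions.
    """
    return [acc[i] + a[pattern[i]] * b[i] for i in range(LANES)]


def matmul_4x4(A: Sequence[Sequence[float]],
               B: Sequence[Sequence[float]]) -> List[List[float]]:
    """4x4 matrix product built only from shuffle_mac steps."""
    C = []
    for i in range(LANES):
        acc = [0.0] * LANES
        for k in range(LANES):
            # Broadcast A[i][k] to every lane (the shuffle) and accumulate
            # it against row k of B (the MAC) in a single operation.
            acc = shuffle_mac(acc, A[i], B[k], [k] * LANES)
        C.append(acc)
    return C


A = [[1.0, 2.0, 3.0, 4.0],
     [5.0, 6.0, 7.0, 8.0],
     [9.0, 10.0, 11.0, 12.0],
     [13.0, 14.0, 15.0, 16.0]]
B = [[1.0, 0.0, 2.0, 0.0],
     [0.0, 1.0, 0.0, 2.0],
     [3.0, 0.0, 1.0, 0.0],
     [0.0, 3.0, 0.0, 1.0]]
C = matmul_4x4(A, B)
```

In this model each row of the result needs four fused steps; without fusion, every step would be preceded by a separate broadcast instruction, doubling the instruction count of the inner loop.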

12 HPCs based on Godson-3B
- Personal HPC
  - For research
  - TFLOPS~10TFLOPS, <1000W, <30dB noise
  - Built by USTC, running well
- PetaFLOPS HPC
  - For industrial computing
  - 100TFLOPS~1PFLOPS
  - Built by Dawning, will be installed in the Shenzhen HPC center
- 10 PetaFLOPS HPC
  - For climate simulation, 1P X P LS3B
  - Based on 32nm LS3B1200
  - Built by Lenovo+Dawning, proposal approved

13 PFLOPS HPC with Loongson-3B
- Dawning blades
  - 5U28P, 224P/rack
- PetaFLOPS HPC based on Dawning blades
  - 35 racks/PetaFLOPS
  - Connected through 10G Ethernet or InfiniBand
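The rack figures can be checked against the one-PetaFLOPS target with a few lines of arithmetic. This sketch assumes that "5U28P" means a 5U blade enclosure holding 28 processors and that each processor is an LS3B1000 with a 128GFLOPS peak; both readings are assumptions, and the variable names are illustrative.

```python
# Rack-level arithmetic for the PetaFLOPS system.
PROCESSORS_PER_BLADE = 28      # assumed reading of "5U28P"
BLADE_HEIGHT_U = 5             # assumed reading of "5U28P"
PROCESSORS_PER_RACK = 224      # "224P/Rack"
RACKS = 35                     # "35Rack/PetaFLOPS"
GFLOPS_PER_PROCESSOR = 128.0   # assumed: LS3B1000 peak

blades_per_rack = PROCESSORS_PER_RACK // PROCESSORS_PER_BLADE
rack_units_used = blades_per_rack * BLADE_HEIGHT_U
processors = PROCESSORS_PER_RACK * RACKS
peak_pflops = processors * GFLOPS_PER_PROCESSOR / 1e6  # GFLOPS -> PFLOPS

print(f"{blades_per_rack} blades per rack, {rack_units_used}U used")
print(f"{processors} processors, {peak_pflops:.5f} PFLOPS peak")
```

Under these assumptions, 35 racks hold 7840 processors and reach a peak of just over one PetaFLOPS, which matches the quoted rack count.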

14 Flaws in Current Design
- Virtual-to-physical address translation in the streamed data link should be improved
  - The current design uses direct address translation, which needs a special segment in OS memory management
  - A TLB mechanism will be implemented in the next design
- Register consistency and synchronization between load/store and stream data movement should be improved
  - The current design maintains L1/L2 data consistency automatically
  - VR data consistency between process switches and stream data movement should be improved in the next design
- Bandwidth problem
  - Two DDR3 800 controllers are not enough for 128GFLOPS peak performance; ~70% Linpack performance at chip level
  - LS3B1200 can double the bandwidth
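The bandwidth problem can be quantified as bytes of DRAM bandwidth available per peak flop. The sketch below assumes that "DDR3 800" and "DDR3 1600" denote transfer rates in MT/s and that each controller drives a 64-bit (8-byte) channel; the helper names are illustrative.

```python
# DRAM bandwidth versus peak compute for the two LS3B parts.
BYTES_PER_TRANSFER = 8  # assumption: 64-bit DDR3 channel per controller


def dram_gb_per_s(controllers: int, mega_transfers_per_s: int) -> float:
    """Aggregate peak DRAM bandwidth in GB/s."""
    return controllers * mega_transfers_per_s * BYTES_PER_TRANSFER / 1000.0


def bytes_per_flop(gb_per_s: float, peak_gflops: float) -> float:
    """DRAM bytes available per peak floating-point operation."""
    return gb_per_s / peak_gflops


ls3b1000_bw = dram_gb_per_s(controllers=2, mega_transfers_per_s=800)
ls3b1200_bw = dram_gb_per_s(controllers=2, mega_transfers_per_s=1600)

ls3b1000_ratio = bytes_per_flop(ls3b1000_bw, 128.0)
ls3b1200_ratio = bytes_per_flop(ls3b1200_bw, 160.0)

print(f"LS3B1000: {ls3b1000_bw} GB/s, {ls3b1000_ratio:.2f} bytes/flop")
print(f"LS3B1200: {ls3b1200_bw} GB/s, {ls3b1200_ratio:.2f} bytes/flop")
```

With these assumptions the bandwidth doubles from 12.8 to 25.6 GB/s while peak compute grows by only 25%, so the bytes-per-flop balance improves from 0.10 to 0.16.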

15 TeraFLOPS Godson-3D processor plan

16 100 PetaFLOPS HPC in 2015/2016
- New goal: to build a 100 PetaFLOPS HPC in 2015/2016
  - With a limited financial and power consumption budget
- Dedicated CPU for HPC
  - LS3B is designed for both servers and HPCs; it works at the <10 PFLOPS stage
  - 100 PFLOPS and EFLOPS HPCs need a dedicated CPU design
  - LS3C for servers, LS3D for HPCs
  - Godson-3D needs a TFLOPS design

17 16-core Godson-3C
- 16 general purpose cores
- 256 GFLOPS
- 4 DDR3 1600, 2 HT controllers

18 TFLOPS Godson-3D
- 1.25GHz, 512 MACs, 1.25TFLOPS, W
- Challenge: performance vs. memory bandwidth
  - Peak performance: 1.25TFLOPS
  - Peak bandwidth: 64GB/s (4 DDR3 1GHz controllers)
  - 1 : 25
- Solution
  - Large register file for locality
  - Maximize utilization of memory bandwidth
  - Longest burst for multi-channel DDR3
  - Balanced workload across the four DDR3 channels at any time
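The role of the large register file can be sketched with a simple data-reuse model. The code below is an illustration under stated assumptions, not the Godson-3D design: it assumes a 64-bit double-data-rate channel per controller, double-precision operands, and a register-blocked matrix multiply that keeps an n x n accumulator tile in registers, so that each inner-product step loads n values of A and n values of B and performs 2*n*n flops.

```python
# Compute-to-bandwidth balance and the register tile size it implies.
PEAK_GFLOPS = 1250.0    # 1.25 TFLOPS quoted peak
BYTES_PER_DOUBLE = 8


def controller_gb_per_s(clock_ghz: float, bytes_per_transfer: int = 8) -> float:
    """Peak GB/s of one DDR controller (two transfers per clock cycle)."""
    return clock_ghz * 2 * bytes_per_transfer


def tile_flops_per_byte(n: int) -> float:
    """Flops per DRAM byte for an n x n accumulator tile held in registers."""
    flops = 2 * n * n                    # n*n multiply-adds per step
    loaded = 2 * n * BYTES_PER_DOUBLE    # n values of A plus n values of B
    return flops / loaded


def min_square_tile(required_flops_per_byte: float) -> int:
    """Smallest n whose tile reuse meets the required flops per byte."""
    n = 1
    while tile_flops_per_byte(n) < required_flops_per_byte:
        n += 1
    return n


peak_gb_per_s = 4 * controller_gb_per_s(clock_ghz=1.0)
required = PEAK_GFLOPS / peak_gb_per_s
tile = min_square_tile(required)

print(f"{peak_gb_per_s} GB/s peak, {required:.2f} flops needed per byte")
print(f"smallest square register tile in this model: {tile} x {tile}")
```

In this model the four controllers supply the quoted 64GB/s, and sustaining peak compute requires roughly 20 flops for every byte fetched from DRAM, which only a large amount of on-chip reuse can provide.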

19 Many-core vs. Long Vector
- Many-core examples: Intel MIC, Nvidia GPU
- Long vector is more area-efficient than many-core
  - Only one fetch/decode/map unit for many MACs
- Long vector is more power-efficient than many-core
  - SIMD reduces instruction fetch/decode/regmap power consumption
- Which is better for programming?
  - Hard to say for scientific applications
  - Many-core will be better for high-throughput applications (however, bandwidth will be the bottleneck)
  - The data re-organization in the streamed data link helps to meet the requirements of more applications

20 Microarchitecture of Godson-3D Core
- Sixteen cores
- The same infrastructure as the general purpose Godson-3C
- Replaces the powerful general purpose core (four-issue, OOO, private L2) with a long vector core
  - Dual-issue 64-bit core with long vector extension
  - In-order vector queue, elastic connection between MACs
  - Large register file, shared LLC
- Expectation: 64K efficiency

21 Reconfigurable Memory Hierarchy
- The memory hierarchy is reconfigurable to fit different applications
- Data transforming (e.g., matrix transpose) on the fly
- Cache/uncache can be reconfigured while keeping cache coherence
[Figure: example configuration for Linpack. Labels in order: Areg 4-entry, auto, Aram 128KB, Breg 4-entry, inst, Bram 128KB, Creg 16-entry, auto, Cram 512KB, uncache, stream data, LLC 4MB, uncache, DDR, auto, DDR, DDR]

22 Conclusion
- The vector unit and streamed data link architecture of LS3B works for HPC applications
  - Long vector, streamed data link, data format transform on the fly, shuffle and computing in one instruction
  - To be improved: TLB for the streamed data link, vector register consistency, flexibility of on-the-fly data transform
- Different CPUs for servers and HPCs next time
  - LS3B is designed for both servers and HPCs
  - Separate CPUs should be designed for servers (LS3C) and HPCs (LS3D) next time
- LS3D will adopt a long vector architecture
  - More area- and power-efficient
  - More powerful streamed data link

23 Thanks
