System X: Virginia Tech's Supercomputer, the fastest academic supercomputer


1 System X: Virginia Tech's Supercomputer, the fastest academic supercomputer. Project #2, CS466, Fall 2004. By Raj Bharath Swaminathan and Hareesh Nagarajan {rswamina, hn cs.uic.edu, University of Illinois at Chicago

2 How was it built? VTech faculty (the Terascale Computing Facility, TCF) worked closely with vendor partners. 1100 Power Mac G5s were put into racks and the construction began. In parallel, device drivers, hand optimization of numerical libraries, and code porting were going on. The supercomputer was on paper in Feb 2003 and was built by September. Unfortunately the system couldn't perform scientific computation, as ECC RAM was required and the G5 didn't support it. Enter Xserve G5.

3 The TCF lab went from looking like this (Left) to this (Bottom)

4 Specification. Nodes: 1100 Apple Xserve G5 2.3 GHz dual-processor cluster nodes (4 GB RAM, 80 GB S-ATA HD); 4.4 TB (4400 GB) of RAM; 88 TB (88000 GB) of HDD; 2200 processors. Primary communication: 24 Mellanox 96-port InfiniBand switches (4X InfiniBand, 10 Gbps). Secondary communication: 6 Cisco 4506 Gigabit Ethernet switches. Cooling: Liebert X-treme Density System cooling. Software: Mac OS X, MVAPICH, XLC & XLF. Current Linpack: Rpeak = Teraflop, Rmax = Teraflops, Nmax =
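The aggregate figures on this slide follow directly from the per-node numbers. A minimal sketch of that arithmetic is below; the constant and function names are illustrative, and it assumes the slide's decimal convention (1 TB = 1000 GB).

```python
# Per-node figures taken from the specification slide.
NODES = 1100
RAM_PER_NODE_GB = 4
DISK_PER_NODE_GB = 80
CPUS_PER_NODE = 2


def cluster_totals(nodes=NODES):
    """Return (total RAM in GB, total disk in GB, total processors)."""
    return (
        nodes * RAM_PER_NODE_GB,
        nodes * DISK_PER_NODE_GB,
        nodes * CPUS_PER_NODE,
    )


ram_gb, disk_gb, cpus = cluster_totals()
# Decimal terabytes, matching the slide (4400 GB is written as 4.4 TB).
print(f"{ram_gb / 1000} TB RAM, {disk_gb / 1000} TB disk, {cpus} processors")
```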

5 Some facts. System X comes in at #7 on top500.org's list. Each of the 1100 Xserve servers was custom built by Apple. $5.8 million price tag ($5.2 million for the initial machines, and $600,000 for the Xserve upgrade). The new (custom-built!) Xserve servers are about 15% faster than the desktop machines. The new System X operates about 20 percent faster, almost adding 2 teraflops. The extra 5 percent performance boost came from optimized software. Typically, System X runs several projects simultaneously, each tying up 400 to 500 processors for research into weather and molecular modeling.

6 PowerPC G5 processor key features. Based on IBM's PowerPC 970FX series. 64-bit PowerPC architecture. Native support for 32-bit applications. Front-side bus speed up to 1.25 GHz. Superscalar execution core with 12 functional units supporting up to 215 in-flight instructions. Uses a dedicated, optimized 128-bit Velocity Engine for accelerated SIMD processing. Can address up to 4 TB of RAM.

7 Specifications. 90 nm silicon-on-insulator (SOI) process with copper interconnects. Consumes 42 W of power at 1.3 V. Around 58 million transistors. Uses a 2-level cache. Registers: bit general purpose registers, bit floating-point registers, bit vector registers. Eight-deep issue queues for each functional unit. Uses a 16-stage pipeline.

9 Front-side bus. It runs at 1/2 the core clock speed, DDR, so for the 2.3 GHz processor the front-side bus runs at 1.15 GHz DDR. The bus is composed of two unidirectional channels, each 32 bits wide. The total theoretical peak bandwidth for the 1.15 GHz bus is close to 10 GB/sec. Dual processors mean twice the bandwidth, i.e. around 20 GB/sec.
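The bandwidth figures on this slide can be checked with a short calculation. This is a rough sketch that treats 1.15 GHz as the effective transfer rate and ignores protocol overhead; the function and parameter names are illustrative. The raw result is 9.2 GB/s per processor and 18.4 GB/s for two, which the slide rounds to "close to 10" and "around 20".

```python
def fsb_bandwidth_gb_s(core_clock_ghz, channel_bits=32, channels=2):
    """Theoretical peak front-side bus bandwidth in GB/s for one processor.

    Assumes the bus transfers data at half the core clock rate and that both
    unidirectional channels are busy at the same time.
    """
    bus_rate_ghz = core_clock_ghz / 2        # 2.3 GHz core -> 1.15 GHz bus
    bytes_per_transfer = channel_bits // 8   # 32 bits -> 4 bytes
    return bus_rate_ghz * bytes_per_transfer * channels


per_cpu = fsb_bandwidth_gb_s(2.3)
print(f"per processor: {per_cpu:.1f} GB/s")
print(f"dual processor node: {2 * per_cpu:.1f} GB/s")
```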

10 Cache. L1 data cache: 32 KB, write-through, 2-way set associative. L1 instruction cache: 64 KB, direct mapped. L2 cache: 512 KB, 8-way set associative. L1 cache is parity protected. L2 cache is protected using ECC (Error Correction Code) logic.

11 Fetch, Decode & Issue. Eight instructions per cycle are fetched from the 64 KB instruction cache into an instruction queue. 9 pipeline stages are devoted to instruction fetch and decode. The "decode, crack, and group formation" phase breaks down instructions into simpler IOPs (internal operations), which resemble RISC instructions. 5 IOPs are dispatched per clock (4 instructions + 1 branch) in program order to a set of issue queues. Out-of-order execution logic pulls instructions from these issue queues to feed the chip's eight functional units.

12 Branch prediction. On each instruction fetch, the front end's branch unit scans the eight instructions and picks out up to two branches. Prediction is done using one of two branch prediction schemes. 1. Standard BHT scheme: 16K entries, 1-bit branch predictor. 2. Global predictor table scheme: 16K entries. Each entry has an associated 11-bit vector that records the actual execution path taken by the previous 11 fetch groups, and a 1-bit branch predictor. A third 16K-entry selector table keeps track of which of the two schemes works best for each branch. When each branch is finally evaluated, the processor compares the success of both schemes and records in this selector table which scheme has done the best job so far of predicting the
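The two-scheme selection described on this slide can be illustrated with a toy simulation. This is a minimal sketch under stated assumptions, not the processor's actual logic: the class name, the XOR indexing of the global table, a history made of branch outcomes (rather than fetch-group paths), the 1-bit selector, and the zero-initialized tables are all illustrative choices.

```python
class TournamentPredictor:
    """Toy two-scheme branch predictor with a per-branch selector table."""

    def __init__(self, entries=16 * 1024, history_bits=11):
        self.entries = entries
        self.history_mask = (1 << history_bits) - 1
        self.bht = [0] * entries       # scheme 1: 1-bit, indexed by address
        self.gpt = [0] * entries       # scheme 2: 1-bit, indexed by address XOR history
        self.selector = [0] * entries  # 0 = use BHT, 1 = use global table
        self.history = 0               # last 11 outcomes (assumed encoding)

    def _indices(self, pc):
        local = pc % self.entries
        glob = (pc ^ self.history) % self.entries
        return local, glob

    def predict(self, pc):
        local, glob = self._indices(pc)
        if self.selector[local]:
            return bool(self.gpt[glob])
        return bool(self.bht[local])

    def update(self, pc, taken):
        local, glob = self._indices(pc)
        bht_ok = bool(self.bht[local]) == taken
        gpt_ok = bool(self.gpt[glob]) == taken
        # Record the winner only when exactly one scheme was right.
        if bht_ok != gpt_ok:
            self.selector[local] = 1 if gpt_ok else 0
        self.bht[local] = int(taken)
        self.gpt[glob] = int(taken)
        self.history = ((self.history << 1) | int(taken)) & self.history_mask


def run(pattern, pc=0x40):
    """Count correct predictions for one branch at address pc."""
    predictor = TournamentPredictor()
    correct = 0
    for taken in pattern:
        if predictor.predict(pc) == taken:
            correct += 1
        predictor.update(pc, taken)
    return correct


alternating = [True, False] * 50
print(run(alternating), "of", len(alternating), "predicted correctly")
```

On a strictly alternating branch a 1-bit BHT entry is wrong every time, so the selector in this sketch settles on the history-based table, which mispredicts only while the history register warms up.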

13 Integer unit. 2 integer units attached to 80 GPRs (32 architectural + 48 rename). Simple, non-dependent integer IOPs can issue and finish at a rate of one per cycle. Dependent integer IOPs need 2 cycles. Condition register logical unit (CRU): a dedicated unit for handling logical operations related to the PowerPC's condition register.

14 Load-Store Unit. Two identical load-store units execute all of the LOADs and STOREs. Dedicated address generation hardware is part of the load-store units, so address generation takes place as part of the execution phase of the load-store units' pipeline.

15 Integer Issue Queue

16 Floating-point unit. Two identical FPUs, each of which can execute the fastest floating-point instructions in 6 cycles. Single- and double-precision operations take the same amount of time to execute. FPUs are fully pipelined for all operations except floating-point divides. 80 total microarchitectural registers, where 32 are PowerPC architectural registers and the remaining 48 are rename registers. The floating-point units can complete both a multiply operation and an add operation as part of the same machine instruction (fused multiply-add), thereby accelerating matrix multiplication, vector dot products, and other scientific computations.
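The two pipelined FPUs and the fused multiply-add give a simple upper bound on floating-point throughput. The sketch below is a back-of-the-envelope estimate, not a measured Linpack figure: it assumes each FPU retires one fused multiply-add (counted as 2 floating-point operations) every cycle, ignores the vector unit, and uses the 2.3 GHz clock and 2200 processors from the specification slide. The function name is illustrative.

```python
def peak_flops(clock_hz, fpus=2, flops_per_fma=2):
    """Theoretical peak FLOP/s of one processor.

    Assumes every FPU completes one fused multiply-add per cycle, which
    counts as two floating-point operations.
    """
    return clock_hz * fpus * flops_per_fma


CLOCK_HZ = 2_300_000_000   # 2.3 GHz Xserve G5 (specification slide)
PROCESSORS = 2200          # 1100 dual-processor nodes

per_cpu = peak_flops(CLOCK_HZ)
cluster = per_cpu * PROCESSORS
print(f"per processor: {per_cpu / 1e9:.1f} GFLOP/s")
print(f"whole cluster: {cluster / 1e12:.2f} TFLOP/s")
```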

17 Floating point Issue queue

18 Vector Unit. Contains 4 fully pipelined vector processing units: 1. Vector Permute Unit (VPU); and, within the Vector Arithmetic Logic Unit (VALU): 2. Vector Simple Integer Unit (VSIU) 3. Vector Complex Integer Unit (VCIU) 4. Vector Floating-point Unit (VFPU). Up to four vector IOPs per cycle total can be issued to the two vector issue queues: two IOPs per cycle maximum to the 16-entry VPU queue and two IOPs per cycle maximum to the 20-entry VALU queue.

19 Vector Issue Queue

20 Conclusion (On the processor. The presentation isn't over!) Dual processors provide the high-density power and scalability required by the research and computational clustering environments of System X. The PowerPC G5 is designed for symmetric multiprocessing. Dual independent front-side buses allow each processor to handle its own tasks at maximum speed with minimal interruption. With sophisticated multiprocessing capabilities built in, Mac OS X and Mac OS X Server dynamically manage multiple processing tasks across the two processors. This allows dual PowerPC G5 systems to accomplish up to twice as much as a single-processor system in the same amount of time, without requiring any special

21 A brief intro to Interconnection Networks. Shared media has disadvantages (collisions). Switches allow communication directly from source to destination, without intermediate nodes interfering with the signals. A crossbar switch allows any node to communicate with any other node in one pass through the interconnection. An Omega interconnection uses less hardware, but contention (called blocking) is more likely. A fat tree switch has more bandwidth added higher in the tree to match the requirements of common communication patterns.
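The hardware trade-off between a crossbar and an Omega network can be made concrete with the standard textbook counts, which are not stated on the slide: an N-port crossbar needs N * N crosspoints, while an N-port Omega network built from 2x2 switches needs log2(N) stages of N/2 switches. The sketch below uses those formulas; the function names are illustrative.

```python
def crossbar_crosspoints(n):
    """Crosspoints in a full n-by-n crossbar."""
    return n * n


def omega_switches(n):
    """Number of 2x2 switches in an n-port Omega network (n a power of two)."""
    if n < 2 or n & (n - 1):
        raise ValueError("n must be a power of two and at least 2")
    stages = n.bit_length() - 1   # log2(n) for a power of two
    return stages * (n // 2)


for ports in (8, 64, 1024):
    xbar = crossbar_crosspoints(ports)
    omega = omega_switches(ports)
    # Each 2x2 switch is itself a tiny crossbar with 4 crosspoints.
    print(f"{ports} ports: crossbar {xbar} crosspoints, "
          f"Omega {omega} switches ({4 * omega} crosspoints)")
```

The crossbar grows quadratically while the Omega network grows as N log N, which is the "less hardware" side of the trade; the price is that two messages may need the same internal link, which is the blocking the slide mentions.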

22 More... InfiniBand is a Storage Area Network (SAN) that tries to optimize based on shorter distances. High-performance clusters such as the System X utilize Fat Tree or Constant Bidirectional Bandwidth (CBB) networks to construct large-node-count non-blocking switch configurations. Here, integrated crossbars with a relatively low number of ports are used to build a non-blocking switch topology supporting a much larger number of endpoints.

23 Switches. Crossbar switch (left). CBB network (below) used in the System X. P = 96 (ports), 24 Mellanox switches. 96/2 * 24 = 1152 ~ 1100 nodes
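The node-count estimate on this slide can be written out as a small calculation. This is a sketch of the slide's arithmetic only: it assumes, as the 96/2 term implies, that half of each switch's ports face compute nodes and the other half face the rest of the fabric. The actual wiring of System X is not given here, and the function name is illustrative.

```python
def cbb_endpoints(ports_per_switch, switches):
    """Endpoints supported when half of each switch's ports face nodes."""
    node_facing_ports = ports_per_switch // 2
    return node_facing_ports * switches


capacity = cbb_endpoints(96, 24)
spare = capacity - 1100   # System X has 1100 nodes
print(f"capacity: {capacity} endpoints, spare ports: {spare}")
```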

24 Used in the System X. How does it apply to System X? InfiniBand is a switch-based serial I/O interconnect architecture operating at a base speed of 10 Gb/s in each direction per port.

25 A cluster making use of InfiniBand system fabric. Note: We were unable to obtain the exact schema of the System

26 The Mellanox Switch

27 Apple's new liquid cooling 1. G5 processor at point of contact to the heatsink. 2. G5 processor card from IBM 3. Heatsink 4. Cooling fluid output from the radiator to the pump 5. Liquid cooling system pump 6. Pump power cable 7. Cooling fluid radiator input from the G5 processor 8. Radiant grille 9. Airflow direction system


28 More on the cooling system... 1. Liquid cooling system pump 2. G5 processors 3. Radiator output 4. Radiator 5. Pump power cable 6. Radiator input


29 The cooling system used for System X. Liebert's XDR system utilizes a cooling module that is attached to the back door of the computer rack enclosure. Fans in the module move room-temperature air from the front of the enclosure, past the equipment in the rack, past a cooling coil, and expel it from the back of the unit, chilled to the point where the impact on the room is close to neutral. The XDR system can be configured to take care of uneven heat loads within the room.

30 Software used. Operating system: Mac OS X. MVAPICH (pronounced 'em-vah-pich') is a high-performance implementation of MPI-1 over InfiniBand based on MPICH1. Compilers: XL C/C++ Advanced Edition V6.0 for Mac OS X and XL Fortran Advanced Edition for Mac OS X (both are made by IBM).

31 Performance of MVAPICH2 on G5. Testbed: Each node of our testbed has dual 2.0 GHz PowerPC G5 processors with 512 KB L2 cache. Each node also has 512 MB of memory and one PCI-X 64-bit 133 MHz bus. The nodes are equipped with MT23108 HCAs with PCI-X interfaces. An InfiniScale MTS2400 switch is used to connect all the nodes. Experiments were conducted using the Small Tree 3.2 VAPI driver. The operating system used was OS X. GCC compilers were used for all the test programs.

32 The point is: By using InfiniBand and highly optimized software for message passing, the System X keeps overheads low and maximizes

33 Any questions? Thank you!


System X - A review CS466 Project 2 Fall 04 Instructor: Prof. Mitchell Theys Hareesh Nagarajan Dept. of Computer Science University of Illinois at Chicago hnagaraj@cs.uic.edu Raj Bharath Swaminathan Dept.