The Top500 Supercomputer List
- The Top500 list ranks parallel computers; distributed systems such as SETI@home are not considered.
- Storage and I/O issues are not considered.
- Both custom-designed machines and commodity machines win positions in the list.
- General trend towards commodity machines (COTS: Commodity Off-The-Shelf): connecting a large number of machines with relatively lower performance is more rewarding than connecting a small number of machines each with high performance.
- Reference paper: "A note on the Zipf distribution of Top500 supercomputers".
- Performance doubles roughly every year.
- Ranked in terms of Flops, as measured by the Linpack benchmark.
Challenging the Top500 List
Prof. Jack Dongarra, the developer of the Top500 list and Linpack, rebuts:
- "Graph 500 isn't the first new benchmark to challenge the High-Performance Linpack."
- "A Graph 500 score shouldn't be seen as some definitive number any more than the Linpack score used today."
- "If Graph 500 was the only benchmark we had, we'd criticize that too."
Understand the list
Understand the list
- Rmax: the maximum benchmarked performance.
- Rpeak: the theoretical maximum performance.
- The ratio of Rmax to Rpeak indicates how efficient the implementation is.
- Nmax: the problem size at which Rmax is obtained.
- Nhalf: the problem size at which half of Rmax is obtained.
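The Rmax/Rpeak efficiency ratio can be computed directly; a minimal sketch, using BlueGene/L's figures (Rmax 478.2 TFLOPS, Rpeak 596.4 TFLOPS) from a later slide:

```python
def efficiency(rmax_tflops: float, rpeak_tflops: float) -> float:
    """Fraction of the theoretical peak achieved by the Linpack run."""
    return rmax_tflops / rpeak_tflops

# BlueGene/L: Rmax = 478.2 TFLOPS, Rpeak = 596.4 TFLOPS
print(f"BlueGene/L efficiency: {efficiency(478.2, 596.4):.1%}")  # -> about 80%
```

A ratio close to 1 indicates that the interconnect and memory system keep the processors well fed during the benchmark run.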
Supercomputer distributions
Performance distribution
Supercomputers in the UK
- 25 supercomputers in the list come from the UK.
- University of Edinburgh: No. 25.
- AWE: No. 53.
- ECMWF: No. 56.
- University of Warwick: Francesca, formerly No. 196 (11/2007), now No. 476 (06/2008). http://www.top500.org/site/2860
TianHe-1A
- 112 compute cabinets, 12 storage cabinets, 6 communications cabinets, and 8 I/O cabinets.
- Each compute cabinet is composed of four frames; each frame contains eight blades plus a 16-port switching board.
- Each blade is composed of two compute nodes; each compute node contains two 6-core Xeon X5670 processors (2.93 GHz) and one Nvidia Tesla M2050 GPU.
- In total: 14,336 Xeon X5670 processors and 7,168 Nvidia Tesla M2050 GPUs.
- Total disk storage is 2 Petabytes; total memory is 262 Terabytes.
TianHe-1A
- Interconnect: the Chinese-designed NUDT* custom proprietary high-speed interconnect, called Arch, runs at 160 Gbps, twice the bandwidth of InfiniBand.
- Applications: computations for petroleum exploration and aircraft simulation.
- It is an "open access" computer, meaning it provides services for other countries.
- Cost: $88 million to build, plus around $20 million annually for energy and operating costs.
- Employs 200 workers to keep it operating and functioning properly.
*National University of Defense Technology
BlueGene/L
- No. 12 now; was No. 1 from 2004 to 2007.
- First supercomputer in the Blue Gene project.
- Philosophy: high density of processors in a small area; comparatively slow processors, just lots of them; fast, low-latency interconnects.
Architecture of BlueGene/L
- Individual PowerPC 440 processors at 700 MHz.
- Two processors reside on a single chip; two chips reside on a compute card with 512 MB memory.
- 16 compute cards are placed on a node board; 32 node boards fit into one cabinet, and there are 64 cabinets.
- 212,992 CPUs in total, with Rmax of 478.2 TFLOPS and Rpeak of 596.4 TFLOPS.
- Multiple network topologies are available, selectable depending on the application.
Jaguar (No. 1 from 2009 to 2010, No. 2 now)
Jaguar
How is it built? http://www.youtube.com/watch?v=zx677ccebac
Architecture Classifications
A taxonomy of parallel architectures: in 1972, Flynn categorised HPC architectures into four classes, based on how many instruction and data streams can be observed in the architecture:
- SISD (Single Instruction, Single Data): instructions operate sequentially on a single stream of data in a single memory. The classic von Neumann architecture. Machines may still consist of multiple processors operating on independent data; these can be considered as multiple SISD systems.
- SIMD (Single Instruction, Multiple Data): a single instruction stream, broadcast to all processors, acting on multiple data. The most common form of this architecture class is the vector processor, which can deliver results several times faster than scalar processors.
Architecture Classifications
- MISD (Multiple Instruction, Single Data): no practical implementations of this architecture.
- MIMD (Multiple Instruction, Multiple Data): multiple instruction streams acting on different (but related) data. Note the difference between multiple SISD systems and MIMD.
Architecture Classifications
- MIMD: MPPs and clusters.
- SISD: a machine with a single scalar processor.
- SIMD: a machine with vector processors.
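The SISD/SIMD distinction can be sketched in software (an illustration only, not the hardware itself): a scalar loop processes one datum per step, while a NumPy vectorized operation applies a single operation to a whole data stream at once, which is the SIMD model.

```python
import numpy as np

# SISD-style: one instruction handles one datum at a time.
def scale_scalar(data, factor):
    out = [0.0] * len(data)
    for i in range(len(data)):
        out[i] = data[i] * factor
    return out

# SIMD-style: one operation is applied across the whole data stream
# (NumPy dispatches to vectorized machine code under the hood).
def scale_vector(data, factor):
    return data * factor

x = np.array([0.0, 1.0, 2.0, 3.0])
print(scale_scalar(x.tolist(), 2.0))   # element-by-element
print(scale_vector(x, 2.0))            # whole array in one operation
```

The function names here are hypothetical; the point is the shape of the computation, not any particular API.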
Parallelism in single-processor systems
- Pipelines: perform more operations per clock cycle (reduces the idle time of hardware components). It is difficult to keep pipelines full, because of dependencies among instructions and branches.
- Vector architectures: one master processor and multiple maths co-processors; large memory bandwidth and low-latency access; no cache, because of the above; perform operations involving large matrices, commonly encountered in engineering areas.
Multiprocessor Parallelism
- Use multiple processors on the same program: divide the workload up between processors, often by dividing up a data structure; each processor works on its own data.
- Typically the processors need to communicate, via:
  - shared memory,
  - sending messages explicitly, or
  - distributed shared memory (a virtual global address space).
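The "divide up a data structure" approach can be sketched with Python's multiprocessing module (the function names, chunking scheme, and worker count are illustrative choices, not a prescribed method):

```python
from multiprocessing import Pool

def partial_sum(chunk):
    # Each worker process operates on its own slice of the data.
    return sum(x * x for x in chunk)

def parallel_sum_of_squares(data, nworkers=4):
    # Divide the data structure into one contiguous chunk per worker.
    size = (len(data) + nworkers - 1) // nworkers
    chunks = [data[i:i + size] for i in range(0, len(data), size)]
    with Pool(nworkers) as pool:
        # The only "communication" needed is gathering partial results.
        return sum(pool.map(partial_sum, chunks))

if __name__ == "__main__":
    print(parallel_sum_of_squares(list(range(1000))))  # -> 332833500
```

Because the chunks are independent, no communication is needed during the computation itself; in a message-passing system such as MPI, the final `sum` would correspond to a reduction operation.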
Granularity of Parallelism
Defined as the size of the computations that are being performed in parallel. Four types of parallelism, in order of granularity:
- Instruction-level parallelism (e.g. pipelining)
- Thread-level parallelism (e.g. running a multi-threaded Java program)
- Process-level parallelism (e.g. running an MPI job on a cluster)
- Job-level parallelism (e.g. running a batch of independent jobs on a cluster)
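Thread-level parallelism, one step coarser than instruction-level, can be sketched in Python (a toy illustration; note that CPython's global interpreter lock limits true CPU parallelism for threads, so real thread-level speedups usually come from languages like Java or C):

```python
import threading

results = {}
lock = threading.Lock()

def work(item):
    # Each thread handles one independent work item.
    value = item * item
    with lock:              # protect the shared results dict
        results[item] = value

threads = [threading.Thread(target=work, args=(i,)) for i in range(8)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(results)  # squares of 0..7; insertion order is nondeterministic
```

Contrast this with the process-level granularity of an MPI job, where each worker has its own address space and must exchange messages instead of sharing a dictionary.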
Dependency and Parallelism
- Dependency: if event A must occur before event B, then B is dependent on A.
- Two types of dependency:
  - Control dependency: waiting for the instruction that controls the execution flow to complete. In IF (X != 0) THEN Y = 1.0/X, the assignment Y = 1.0/X has a control dependency on the test X != 0.
  - Data dependency: dependency because of calculations or memory access:
    - Flow dependency: A = X + Y; B = A + C;
    - Anti-dependency: B = A + C; A = X + Y;
    - Output dependency: A = 2; X = A + 1; A = 5;
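The three data dependencies above can be written out executably; the point in each case is which pairs of statements a compiler or parallel scheduler may NOT reorder (a sketch; variable names follow the slide, with X2 standing in for the slide's second X):

```python
X, Y, C = 3, 4, 5

# Flow (true) dependency: B reads A, so B must run after A is written.
A = X + Y          # writes A (7)
B = A + C          # reads A -> cannot be moved before the line above

# Anti-dependency: the first statement reads A, the second overwrites it.
# Swapping them would make B see the wrong value of A.
B = A + C          # reads the old A (7), so B == 12
A = X + Y          # writes a new A

# Output dependency: two writes to A; their order fixes the final value.
A = 2
X2 = A + 1         # must read the first write, so X2 == 3
A = 5              # must come last so that A ends up as 5

assert (B, X2, A) == (12, 3, 5)
```

Flow dependencies are fundamental and must be respected; anti- and output dependencies are artifacts of reusing the same name and can often be removed by renaming (e.g. register renaming in hardware), which re-enables parallel execution.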