CPS 303 High Performance Computing Wensheng Shen Department of Computational Science SUNY Brockport
Chapter 1: Introduction to High Performance Computing von Neumann Architecture CPU and Memory Speed Motivation of Parallel Computing Applications of Parallel Computing
1.1. von Neumann Architecture A fixed-program computer vs. a stored-program computer. The dominant computer model for more than 40 years: the CPU executes a stored program, and operation is sequential, a sequence of read and write operations on the memory. von Neumann proposed the use of ROM: a read-only stored program.
John von Neumann Born on December 28, 1903; died on February 8, 1957. Mastered calculus at the age of 8, graduate-level mathematics at the age of 12, and obtained his Ph.D. at the age of 23. Originated the stored-program concept.
A Typical Example of von Neumann Architecture (diagram): CPU (Control Unit, Arithmetic Logic Unit), Memory, Input Devices, Output Devices, External Storage
Modern Personal Computers
1. Monitor
2. Motherboard
3. CPU (Microprocessor)
4. Primary storage (RAM)
5. Expansion cards (graphics cards, sound cards, network cards, modems), typically attached via the PCI (Peripheral Component Interconnect) bus
6. Power supply
7. Optical disc drive
8. Secondary storage (Hard disk)
9. Keyboard
10. Mouse
http://en.wikipedia.org/wiki/personal_computer
CISC and RISC machines CISC stands for complex instruction set computer: a single-bus system, in which each individual instruction can execute several low-level operations, such as a memory load, an arithmetic operation, and a memory store. RISC stands for reduced instruction set computer: a two-bus system, with a data bus and an address bus. Both are SISD machines: Single Instruction stream on a Single Data stream.
1.2 CPU and memory speed
Cray 1: 12 ns (1975)
Cray 2: 6 ns (1986)
Cray T-90: 2 ns (1997)
Intel PC: 1 ns (2000)
Today's PC: 0.3 ns (2006, P4)
Moore's Law Moore's law (1965): the number of transistors per square inch on integrated circuits has doubled about every two years since the integrated circuit was invented. How about the future? (Does the price of computers with the same computing power fall by half every two years?) In a 2008 article in InfoWorld, Randall C. Kennedy, formerly of Intel, made a related observation using successive versions of Microsoft Office between the years 2000 and 2007 as his premise: despite the gains in computational performance during this time period according to Moore's law, Office 2007 performed the same task at half the speed on a prototypical year-2007 computer as Office 2000 did on a year-2000 computer.
CPU and memory speed comparison In 20 years, CPU speed (clock rate) has increased by a factor of 1000, while DRAM speed has increased by a factor of less than 4. CPU speed: 1-2 ns. Cache speed: 10 ns. DRAM speed: 50-60 ns. How do we feed data fast enough to keep the CPU busy?
Possible Solutions A hierarchy of successively faster memory devices (multilevel caches). Locality of data reference (efficient programming can be an issue). Parallel systems, which may provide (1) a large aggregate cache and (2) high aggregate bandwidth to the memory system.
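A minimal sketch of locality of data reference, assuming a row-major layout (as in C or NumPy): traversing memory in storage order reuses cached lines, while striding across rows does not. Both loops below compute the same sum; in a compiled language on a large array, the row-order loop is typically much faster.

```python
# Locality-of-reference sketch: the same sum computed in two traversal orders.
# With a row-major layout, row-order traversal touches memory sequentially
# (cache-friendly); column-order traversal strides across rows (cache-hostile).

def sum_row_order(a):
    """Visit elements in storage order: a[i][0], a[i][1], ..."""
    total = 0
    for row in a:
        for x in row:
            total += x
    return total

def sum_col_order(a):
    """Visit elements column by column: a[0][j], a[1][j], ..."""
    total = 0
    n_rows, n_cols = len(a), len(a[0])
    for j in range(n_cols):
        for i in range(n_rows):
            total += a[i][j]
    return total

if __name__ == "__main__":
    a = [[i * 100 + j for j in range(100)] for i in range(100)]
    # Same result either way; only the memory-access pattern differs.
    print(sum_row_order(a), sum_col_order(a))
```

The two functions are interchangeable in result, which is exactly why the compiler and programmer are free to choose the traversal order that matches the memory layout.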
1.3 Price and Performance Comparison The price of high-end CPUs rises sharply with performance. (Chart: Intel processor price/performance.)
1.4 Computation for special purposes Weather forecasting Information retrieval Car and aircraft design NASA space discovery
Problems: insufficient memory; slow speed.
Example: predicting the weather over the US and Canada for the next two days. The region covers about 20 million square kilometers (2.0x10^7 km^2), roughly 5,000 km by 4,000 km, with an atmosphere about 20 km high. Grid spacing: Δx = Δy = Δz = 0.1 kilometer.
Mesh size Number of cells: n = (5000/0.1) x (4000/0.1) x (20/0.1) = 4x10^11. Assuming it takes 100 calculations to determine the weather at a typical grid point, and we want to predict the weather condition at each hour for the next 48 hours, the total number of calculations is 4x10^11 x 100 x 48 ≈ 2x10^15.
Assuming that our computer can execute one billion (10^9) calculations per second, it will take 2x10^15 / 10^9 = 2x10^6 seconds, or about 23 days. Increase the CPU speed to one trillion calculations per second? We still need more than half an hour. What happens if we want to predict the weather for the whole earth, or if we want to use a smaller grid size, Δx = Δy = Δz = 0.05 kilometer, for better accuracy?
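The arithmetic above can be checked with a short script (the grid dimensions, the 100-calculations-per-point figure, and the hourly time step are the slide's assumptions):

```python
# Weather-forecast cost estimate from the slides:
# 5000 km x 4000 km x 20 km domain, 0.1 km grid spacing,
# 100 calculations per grid point per hourly step, 48 steps.

dx = 0.1                                        # grid spacing in km
cells = (5000 / dx) * (4000 / dx) * (20 / dx)   # ~4e11 cells
total_ops = cells * 100 * 48                    # ~2e15 calculations

seconds_at_1e9 = total_ops / 1e9                # one billion ops/sec
days = seconds_at_1e9 / (3600 * 24)             # ~23 days

seconds_at_1e12 = total_ops / 1e12              # one trillion ops/sec
minutes = seconds_at_1e12 / 60                  # ~32 minutes, "more than half an hour"

print(cells, total_ops, days, minutes)
```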
The memory requirement If we need 7 variables (u, v, w, p, T, ρ, ω) at each location, the memory cost is 7 x 4x10^11 = 2.8x10^12 words = 112x10^11 bytes = 11,200 Gbytes. There is also data transfer latency among the CPU, registers, and memory.
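The memory figure follows directly from the cell count (assuming 4-byte words, as the slide's 112x10^11-byte total implies):

```python
# Memory estimate: 7 variables (u, v, w, p, T, rho, omega) per grid cell.
cells = 4e11                  # from the mesh-size slide
words = 7 * cells             # 2.8e12 words
bytes_total = words * 4       # 4-byte words -> 1.12e13 bytes
gbytes = bytes_total / 1e9    # 11,200 GB (decimal gigabytes)
print(words, bytes_total, gbytes)
```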
Possible solution: build a processor executing 1 trillion operations per second.

for (i = 0; i < one_trillion; i++)
    z[i] = x[i] + y[i];

Each iteration must fetch x[i] and y[i], add them, and store z[i]. So at least 3x10^12 copies of data must be transferred between registers and memory per second, and data travel at most at the speed of light, 3x10^8 m/s.
We assume that r is the average distance of a word of memory from the CPU. For 3x10^12 words to reach the CPU each second, r must satisfy r x 3x10^12 = 3x10^8 meters/second x 1 second, so r = 10^-4 meters. We need at least three trillion words of memory to store x, y, and z. Memory words are typically arranged in a rectangular grid in hardware. If we use a square grid with side length s and connect the CPU to the center of the square, then the average distance from a memory location to the CPU is about s/2 = r, so s = 2x10^-4 meters. A typical row of the square grid will contain √(3x10^12) ≈ 1.7x10^6 words.
Therefore, we need to fit a single word of memory into a square with a side length of (2x10^-4) / (√3 x 10^6) ≈ 10^-10 meters, the size of an atom. That is to say, we would need to figure out how to represent a 32-bit word with a single atom. Building a computer performing one trillion operations per second this way is extremely difficult. Other solutions?
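The back-of-the-envelope argument above can be sketched numerically (the 3x10^12 words-per-second rate and speed-of-light signal speed are the slide's assumptions):

```python
import math

# Speed-of-light limit on memory distance for a 1-trillion-op/s processor.
c = 3e8              # signal speed, m/s (speed of light)
transfers = 3e12     # words moved per second (3 per operation)

r = c / transfers    # average CPU-to-word distance: 1e-4 m
s = 2 * r            # side of a square memory grid whose average distance is ~ s/2

row_words = math.sqrt(3e12)   # words per row of a square grid of 3e12 words
word_size = s / row_words     # side length available per word: ~1e-10 m, atomic scale
print(r, s, row_words, word_size)
```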
To invite one hundred people for a dinner, should we build one big table to seat everyone? No: use more tables. How do we perform the task of 10^15 calculations in minutes? More computers. We divide one big problem into many smaller subproblems.
1.5 Challenges: Communication. At a dinner, people sitting at different tables can walk around to talk to each other. How about data on different processors? How do processors communicate?
Tasks: Decide on and implement an interconnection network for the processors and memory modules Design and implement system software for the hardware Devise algorithms and data structures for solving our problem Divide the algorithm and data structures up into subproblems Identify the communications that will be needed among the subproblems Assign subproblems to processors and memory modules.
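The last two tasks, dividing the data among processors and identifying the needed communication, can be sketched for a 1-D domain (a hypothetical block decomposition for illustration; MPI, which implements the actual communication, is introduced later in the course):

```python
# Block decomposition of n grid points among p processes.
# Each process gets a contiguous chunk; neighboring chunks must exchange
# boundary ("ghost") points, which is where inter-processor communication arises.

def block_range(n, p, rank):
    """Return (start, end) of the rank-th of p nearly equal blocks of n items."""
    base, extra = divmod(n, p)
    start = rank * base + min(rank, extra)
    end = start + base + (1 if rank < extra else 0)
    return start, end

def neighbors(p, rank):
    """Left/right neighbor ranks this rank must communicate with (None at the ends)."""
    left = rank - 1 if rank > 0 else None
    right = rank + 1 if rank < p - 1 else None
    return left, right

if __name__ == "__main__":
    n, p = 10, 3
    for rank in range(p):
        print(rank, block_range(n, p, rank), neighbors(p, rank))
```

The blocks cover the whole domain with no overlap, and each process only needs data from its immediate neighbors, which keeps the communication cost small relative to the computation.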
1.6 Topics covered in the course The architecture, interconnection network, and system software for parallel computing Message passing interface (MPI) libraries for parallel computing Basic communication of MPI Applications of using MPI in numerical computation Collective communication of MPI Designing and coding issues in parallel computing The performance of parallel computing