Midterm 3 Revision and Parallel Computers. Prof. Sin-Min Lee Department of Computer Science


1 Midterm 3 Revision and Parallel Computers Prof. Sin-Min Lee Department of Computer Science

2 Solution to Quiz 6 problems 1 and 4. Drawn by: Alice Cotti. Thanks!

3 Solution to problem 1 Step 1: Create a table with a column for each input value and for the output values Q1+ and Q0+. [IMAGE] Step 2: Start by entering all the possible values for the inputs X, Q0 and Q1. Table columns: X, T1, T0, Q1, Q0, Q1+, Q0+

4 Solution to problem 1 [IMAGE] Step 3: By following the circuit diagram, we can tell that T1 is equal to Q1 + X + Q0. Fill in the T1 column according to this Boolean expression. Table columns: X, T1, T0, Q1, Q0, Q1+, Q0+

5 Solution to problem 1 [IMAGE] Step 4: By looking at the image, notice that T0 is always 1. Set it to 1 everywhere. Table columns: X, T1, T0, Q1, Q0, Q1+, Q0+

6 Solution to problem 1 [IMAGE] Step 5: Q1+ is the next state of the T1 (toggle) flip-flop. The value in the Q1 column is the current state of the flip-flop, and the T1 column is its input. Using this information, fill in the values for the Q1+ column. Draw a T flip-flop truth table if needed. Table columns: X, T1, T0, Q1, Q0, Q1+, Q0+

7 Solution to problem 1 [IMAGE] Step 6: Q0+ is the next state of the T0 (toggle) flip-flop. The value in the Q0 column is the current state of the flip-flop, and the T0 column is its input. Using this information, fill in the values for the Q0+ column. Draw a T flip-flop truth table if needed. Table columns: X, T1, T0, Q1, Q0, Q1+, Q0+
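The table-filling procedure in Steps 1 to 6 can be sketched in Python. This is only a minimal sketch: it assumes that the "+" in the T1 expression denotes Boolean OR and that a T flip-flop toggles when T = 1 (next state = T XOR current state). The function name `next_state_table` is hypothetical and not part of the quiz.

```python
from itertools import product

def next_state_table():
    """Build the state table rows (X, Q1, Q0, T1, T0, Q1+, Q0+)."""
    rows = []
    for x, q1, q0 in product((0, 1), repeat=3):   # Step 2: all input combinations
        t1 = q1 | x | q0      # Step 3: T1 = Q1 + X + Q0 (assumed to be OR)
        t0 = 1                # Step 4: T0 is always 1
        q1_next = q1 ^ t1     # Step 5: a T flip-flop toggles when T = 1
        q0_next = q0 ^ t0     # Step 6: same rule for the T0 flip-flop
        rows.append((x, q1, q0, t1, t0, q1_next, q0_next))
    return rows

if __name__ == "__main__":
    print("X Q1 Q0 | T1 T0 | Q1+ Q0+")
    for x, q1, q0, t1, t0, q1n, q0n in next_state_table():
        print(f"{x}  {q1}  {q0} |  {t1}  {t0} |  {q1n}   {q0n}")
```

Because T0 is fixed at 1, Q0 toggles on every clock edge, while Q1 toggles whenever any of X, Q1 or Q0 is 1.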

8 Solution to problem 4 [Timing diagram with signals clk, J, K, Clear and Q0]

9 Solution to problem 4 [Timing diagram with signals clk, J, K, Clear and Q0, annotated as follows.] Because Clear is zero, Q0 is also zero (J and K don't matter). Since Clear is 1, look at J and K; in this case both are zero, so Q0 stays zero. Because Clear is zero, Q0 is again zero (J and K don't matter). [Truth table columns: Clear, JK, Q0; when Clear = 0, JK = XX (don't care).]
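The reasoning in the annotations can be sketched as a small simulation. This is a simplified sketch, not the quiz's official solution: it assumes an active-low Clear (as the annotations imply), uses the standard JK behaviour (hold, reset, set, toggle), and samples Clear once per clock edge even though a real Clear input is usually asynchronous. The names `jk_next` and `simulate` are hypothetical.

```python
def jk_next(q, j, k, clear):
    """Next state of a JK flip-flop with an active-low Clear input."""
    if clear == 0:
        return 0          # Clear overrides J and K
    if j == 0 and k == 0:
        return q          # hold
    if j == 0 and k == 1:
        return 0          # reset
    if j == 1 and k == 0:
        return 1          # set
    return 1 - q          # J = K = 1: toggle

def simulate(inputs, q=0):
    """Apply one (j, k, clear) triple per clock edge and record Q0."""
    trace = []
    for j, k, clear in inputs:
        q = jk_next(q, j, k, clear)
        trace.append(q)
    return trace
```

For example, `simulate([(1, 1, 0), (0, 0, 1)])` reproduces the first two annotations: Q0 is forced to zero while Clear is zero, then holds zero when J and K are both zero.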

13 Uniprocessor Systems Improve performance by allowing multiple simultaneous memory accesses. This requires multiple address, data, and control buses (one set for each simultaneous memory access), and the memory chip has to be able to handle multiple transfers simultaneously.

14 Uniprocessor Systems Multiport memory: Has two sets of address, data, and control pins to allow simultaneous data transfers to occur. The CPU and DMA controller can transfer data concurrently. A system with more than one CPU could handle simultaneous requests from two different processors.

15 Uniprocessor Systems Multiport memory (cont.): Can: handle two requests to read data from the same location at the same time. Cannot: process two simultaneous requests to write data to the same memory location, or simultaneous requests to read from and write to the same memory location.
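The can/cannot rules for a two-port memory can be expressed as a small conflict check. This is an illustrative sketch only; the function name `conflicts` and the `(op, address)` request format are assumptions, not part of any real memory controller interface.

```python
def conflicts(req_a, req_b):
    """Return True if two same-cycle requests cannot be served together.

    Each request is a tuple (op, address) with op in {"read", "write"}.
    """
    op_a, addr_a = req_a
    op_b, addr_b = req_b
    if addr_a != addr_b:
        return False                 # different locations never conflict
    # Same location: only read + read is allowed.
    return op_a == "write" or op_b == "write"
```

A conflicting pair would have to be serialized by arbitration logic, which this sketch does not model.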

17 Multiprocessors [Diagram labels: Device, Device Controller, I/O Port, CPU (three), Bus, Memory]

21 Multiprocessors Systems designed to have 2 to 8 CPUs. The CPUs all share the other parts of the computer: memory, disk, system bus, etc. CPUs communicate via memory and the system bus.

22 Multiprocessors Each CPU shares memory, disks, etc. Cheaper than clusters. Performance not as good as clusters. Often used for small servers and high-end workstations.

23 Multiprocessors The OS automatically shares work among the available CPUs. On a workstation, one CPU can be running an engineering design program while another CPU does complex graphics formatting.

27 Applications of Parallel Computers Traditionally: government labs, numerically intensive applications, research institutions. Recent growth in industrial applications (236 of the top 500): financial analysis, drug design and analysis, oil exploration, aerospace and automotive.

28 Flynn's Classification (1966) Michael Flynn, Professor at Stanford University

30 Multiprocessor Systems Flynn's Classification Single instruction multiple data (SIMD): [Diagram labels: Main Memory, Control Unit, three Processor/Memory pairs, Communications Network] Executes a single instruction on multiple data values simultaneously using many processors. Since only one instruction is processed at any given time, it is not necessary for each processor to fetch and decode the instruction. This task is handled by a single control unit that sends the control signals to each processor. Example: array processor.
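The SIMD idea, one control unit decoding an instruction once and every processor applying it to its own data, can be modelled with a toy class. This is a conceptual sketch with hypothetical names (`ArrayProcessor`, `execute`); the Python loop runs sequentially, whereas a real array processor performs the operations simultaneously.

```python
import operator

class ArrayProcessor:
    """Toy SIMD machine: one control unit, many processing elements."""

    OPS = {"add": operator.add, "mul": operator.mul}

    def __init__(self, local_data):
        # One local memory value per processing element.
        self.local = list(local_data)

    def execute(self, opcode, operand):
        # The control unit fetches and decodes the instruction once...
        op = self.OPS[opcode]
        # ...and every processing element applies it to its own data.
        self.local = [op(value, operand) for value in self.local]
        return self.local
```

For example, `ArrayProcessor([1, 2, 3, 4]).execute("add", 10)` applies the same add instruction to all four data values.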

31 Why Multiprocessors? 1. Microprocessors are the fastest CPUs; collecting several is much easier than redesigning one. 2. Complexity of current microprocessors: Do we have enough ideas to sustain 1.5X/yr? Can we deliver such complexity on schedule? 3. Slow (but steady) improvement in parallel software (scientific apps, databases, OS). 4. Emergence of embedded and server markets driving microprocessors in addition to desktops. Embedded: functional parallelism, producer/consumer model. Server: figure of merit is tasks per hour vs. latency.

32 Parallel Processing Intro Long-term goal of the field: scale the number of processors to the size of the budget and the desired performance. Machines today: Sun Enterprise (8/00): MHz UltraSPARC II CPUs, 64 GB SDRAM memory, GB disk, tape; $4,720,800 total. 64 CPUs 15%, 64 GB DRAM 11%, disks 55%, cabinet 16% ($10,800 per processor, or ~0.2% per processor). Minimal E10K: 1 CPU, 1 GB DRAM, 0 disks, tape; ~$286,700. $10,800 (4%) per CPU, plus $39,600 board/4 CPUs (~8%/CPU). Machines today: Dell Workstation 220 (2/01): 866 MHz Intel Pentium III (in minitower), GB RDRAM memory, 1 10 GB disk, 12X CD, 17-inch monitor, nvidia GeForce 2 GTS, 32 MB DDR graphics card, 1 yr service; $1,600. For an extra processor, add $350 (~20%).

33 Major MIMD Styles 1. Centralized shared memory ("Uniform Memory Access" time, or "Shared Memory Processor"). 2. Decentralized memory (a memory module with each CPU): more memory bandwidth and lower memory latency. Drawback: longer communication latency. Drawback: more complex software model.

35 Multiprocessor Systems Flynn's Classification

36 Multiprocessor Systems Flynn's Classification Four categories of Flynn's classification: SISD (single instruction, single data); SIMD (single instruction, multiple data); MISD (multiple instruction, single data) **; MIMD (multiple instruction, multiple data). ** The MISD classification is not practical to implement. In fact, no significant MISD computers have ever been built. It is included only for completeness.

37 MIMD computers usually have a different program running on every processor. This makes for a very complex programming environment. What's doing what, when? Which processor? Doing which task? At what time?

39 Memory latency: The time between issuing a memory fetch and receiving the response. Simply put, if execution proceeds before the memory request responds, unexpected results will occur. What values are being used? Not the ones requested!

40 A similar problem can occur with instruction executions themselves. Synchronization: The need to enforce the ordering of instruction executions according to their data dependencies. For example, instruction b must occur before instruction a.
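The ordering requirement "b must occur before a" can be illustrated with two Python threads. This is a minimal sketch using the standard library's `threading.Event`; the function and variable names are hypothetical, and the two "instructions" are stand-ins for work that has a data dependency.

```python
import threading

def run_ordered():
    """Run instruction a only after instruction b has produced its result."""
    shared = {}
    b_done = threading.Event()
    log = []

    def instruction_b():
        shared["x"] = 21          # b produces the value that a depends on
        log.append("b")
        b_done.set()              # signal that the dependency is satisfied

    def instruction_a():
        b_done.wait()             # block until b has finished
        shared["y"] = shared["x"] * 2
        log.append("a")

    ta = threading.Thread(target=instruction_a)
    tb = threading.Thread(target=instruction_b)
    ta.start()                    # a is started first but still waits for b
    tb.start()
    ta.join()
    tb.join()
    return log, shared["y"]
```

Without the `b_done.wait()` call, instruction a could read `shared["x"]` before it exists, which is the kind of unexpected result the slide warns about.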

41 Despite potential problems, MIMD can prove larger than life. MIMD successes: IBM Deep Blue, the computer that beat a professional chess player. Some may not consider this to be a fair example, because Deep Blue was built to beat Kasparov alone. It knew his play style, so it could counter his projected moves. Still, Deep Blue's win marked a major victory for computing.

42 IBM's latest: a supercomputer that models nuclear explosions. IBM Poughkeepsie built the world's fastest supercomputer for the U.S. Department of Energy. Its job was to model nuclear explosions.

43 MIMD: it's the most complex, fastest, and most flexible parallel paradigm. It has beaten a world-class chess player at his own game. It models things that few people understand. It is parallel processing at its finest.

44 Multiprocessor Systems System Topologies: The topology of a multiprocessor system refers to the pattern of connections between its processors. It is quantified by standard metrics. Diameter: the maximum distance between two processors in the computer system. Bandwidth: the capacity of a communications link multiplied by the number of such links in the system (best case). Bisection bandwidth: the total bandwidth of the links connecting the two halves of the processors, split so that the number of links between the two halves is minimized (worst case).

46 Multiprocessor Systems System Topologies Six categories of system topologies: shared bus, ring, tree, mesh, hypercube, completely connected.

49 Multiprocessor Systems System Topologies Shared bus: The simplest topology. Processors communicate with each other exclusively via this bus. Can handle only one data transmission at a time. Can be easily expanded by connecting additional processors to the shared bus, along with the necessary bus arbitration circuitry. [Diagram: three processor (P) and memory (M) pairs attached to a shared bus, together with global memory.]

53 Multiprocessor Systems System Topologies Ring: Uses direct dedicated connections between processors. Allows all communication links to be active simultaneously. A piece of data may have to travel through several processors to reach its final destination. All processors must have two communication links. [Diagram: six processors (P) connected in a ring.]
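The diameter and bandwidth metrics defined earlier can be computed for a ring with a short script. This is a sketch under stated assumptions: distance is measured in hops, every link has the same capacity, and the helper names (`ring`, `diameter`, `total_bandwidth`) are hypothetical.

```python
from collections import deque

def ring(n):
    """Adjacency list of a ring of n processors (n >= 3)."""
    return {p: [(p - 1) % n, (p + 1) % n] for p in range(n)}

def diameter(adj):
    """Longest shortest-path distance (in hops) between any two processors."""
    worst = 0
    for start in adj:
        dist = {start: 0}
        queue = deque([start])
        while queue:                      # breadth-first search from start
            node = queue.popleft()
            for nxt in adj[node]:
                if nxt not in dist:
                    dist[nxt] = dist[node] + 1
                    queue.append(nxt)
        worst = max(worst, max(dist.values()))
    return worst

def total_bandwidth(adj, link_capacity):
    """Link capacity times the number of links (each link counted once)."""
    links = sum(len(neighbours) for neighbours in adj.values()) // 2
    return link_capacity * links
```

For a six-processor ring, the farthest processor is three hops away, and there are six links in total. The same `diameter` function works on any adjacency list, so it can be reused for the other topologies.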

54 Multiprocessor Systems System Topologies Tree topology: Uses direct connections between processors. Each processor has three connections. Its primary advantage is its relatively low diameter. Example: DADO computer. [Diagram: seven processors (P) arranged as a tree.]

58 Multiprocessor Systems System Topologies Mesh topology: Every processor connects to the processors above, below, left, and right. Left-to-right and top-to-bottom wraparound connections may or may not be present. [Diagram: nine processors (P) arranged in a mesh.]
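The mesh connection rule, with and without wraparound, can be sketched as a neighbour function. The function name and the `(row, col)` coordinate scheme are assumptions made for illustration.

```python
def mesh_neighbours(row, col, rows, cols, wraparound=False):
    """Processors above, below, left and right of (row, col) in a mesh."""
    result = []
    for d_row, d_col in ((-1, 0), (1, 0), (0, -1), (0, 1)):
        r, c = row + d_row, col + d_col
        if wraparound:
            r, c = r % rows, c % cols     # edges wrap to the opposite side
        if 0 <= r < rows and 0 <= c < cols:
            result.append((r, c))
    return result
```

In a 3 x 3 mesh without wraparound, a corner processor has only two neighbours; with wraparound, every processor has four.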

61 Multiprocessor Systems System Topologies Hypercube: A multidimensional mesh. Has n processors, each with log2 n connections.
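One common way to see why each processor has log2 n connections is to number the processors in binary and connect those whose addresses differ in exactly one bit. The sketch below assumes that numbering scheme; the function name is hypothetical.

```python
def hypercube_neighbours(node, n):
    """Neighbours of `node` in an n-processor hypercube (n a power of two)."""
    if n < 1 or n & (n - 1) != 0:
        raise ValueError("n must be a power of two")
    dimensions = n.bit_length() - 1       # log2(n) for a power of two
    # Flipping one address bit moves along one dimension of the cube.
    return [node ^ (1 << bit) for bit in range(dimensions)]
```

With n = 8, each processor has three neighbours, one per address bit.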

64 Multiprocessor Systems System Topologies Completely connected: Every processor has n-1 connections, one to each of the other processors. The complexity of the processors increases as the system grows. Offers maximum communication capabilities.
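The growth in complexity can be made concrete by counting links. Assuming each pair of processors shares exactly one bidirectional link, a completely connected system of n processors has the following link count and diameter:

```latex
\[
  L_{\text{complete}} = \frac{n(n-1)}{2}, \qquad \text{diameter} = 1 .
\]
% Example with n = 8: a completely connected system needs 8 * 7 / 2 = 28 links,
% whereas an 8-processor hypercube (3 connections per processor) needs 8 * 3 / 2 = 12.
```

The link count grows quadratically with n, which is why this topology becomes costly for large systems even though it offers the shortest possible paths.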

65 Architecture Details: from computers to MPPs. World's simplest computer: processor and memory (P, M). Standard computer: add cache and disk (C, P, M, D). MPP: several standard computers (C, P, M, D) connected by a network.

66 A Supercomputer at $5.2 million: Virginia Tech's 1,100-node Mac G5 supercomputer

67 The Virginia Polytechnic Institute and State University has built a supercomputer composed of a cluster of 1,100 dual-processor Macintosh G5 computers. Based on preliminary benchmarks, Big Mac is capable of 8.1 teraflops. The Mac supercomputer is still being fine-tuned, and the full extent of its computing power will not be known until November. But the 8.1 teraflops figure would make the Big Mac the world's fourth-fastest supercomputer.

68 Big Mac's cost relative to similar machines is as noteworthy as its performance. The Apple supercomputer was constructed for just over US$5 million, and the cluster was assembled in about four weeks. In contrast, the world's leading supercomputers cost well over $100 million to build and require several years to construct. The Earth Simulator, which clocked in at 38.5 teraflops in 2002, reportedly cost up to $250 million.

69 October Time: 7:30pm - 9:00pm Location: Santa Clara Ballroom Srinidhi Varadarajan, Ph.D. Dr. Srinidhi Varadarajan is an Assistant Professor of Computer Science at Virginia Tech. He was honored with the NSF Career Award in 2002 for "Weaving a Code Tapestry: A Compiler Directed Framework for Scalable Network Emulation." He has focused his research on building a distributed network emulation system that can scale to emulate hundreds of thousands of virtual nodes.

70 Parallel Computers Two common types: cluster and multiprocessor

71 Cluster Computers

72 Clusters on the Rise Using clusters of small machines to build a supercomputer is not a new concept. Another of the world's top machines, housed at the Lawrence Livermore National Laboratory, was constructed from 2,304 Xeon processors. The machine was built by Utah-based Linux Networx. Clustering technology has meant that traditional big-iron leaders like Cray (Nasdaq: CRAY) and IBM have new competition from makers of smaller machines. Dell (Nasdaq: DELL), among other companies, has sold high-powered computing clusters to research institutions.

73 Cluster Computers Each computer in a cluster is a complete computer by itself: CPU, memory, disk, etc. Computers communicate with each other via some interconnection bus.

74 Cluster Computers Typically used where one computer does not have enough capacity to do the expected work, for example large servers. Cheaper than building one GIANT computer.

75 Although not new, supercomputing clustering technology is still impressive. It works by farming out chunks of data to individual machines, and it works better for some types of computing problems than others. For example, a cluster would not be ideal to compete against IBM's Deep Blue supercomputer in a chess match; in this case, all the data must be available to one processor at the same moment, much in the same way as the human brain handles tasks. However, a cluster would be ideal for the processing of seismic data for oil exploration, because that computing job can be divided into many smaller tasks.

76 Cluster Computers Need to break up work among the computers in the cluster. Example: Microsoft.com search engine. 6 computers run SQL Server, and each has a copy of the MS Knowledge Base. Search requests come to one computer, which sends each request to one of the 6 and attempts to keep all 6 busy.
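The front-end computer's job of keeping all 6 servers busy can be sketched as a simple dispatcher. The slide does not say which policy the real system used; the least-busy rule below is an assumption chosen for illustration, and the class and method names are hypothetical.

```python
class Dispatcher:
    """Front-end computer that forwards each search to a back-end server."""

    def __init__(self, server_count=6):
        # Number of requests currently in progress on each server.
        self.active = [0] * server_count

    def assign(self):
        """Pick the least busy server and record the new request on it."""
        server = min(range(len(self.active)), key=self.active.__getitem__)
        self.active[server] += 1
        return server

    def finish(self, server):
        """Mark one request on `server` as completed."""
        self.active[server] -= 1
```

Because every server holds a full copy of the knowledge base, any server can answer any request, which is what makes this kind of dispatching possible.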

77 The Virginia Tech Mac supercomputer should be fully functional and in use by January 2004. It will be used for research into nanoscale electronics, quantum chemistry, computational chemistry, aerodynamics, molecular statics, computational acoustics and the molecular modeling of proteins.

78 Supercomputers in China

88 The previous speed leader is a computer called Cray XT5 Jaguar located at the Oak Ridge National Laboratory of the United States. China has invested billions in computing in recent years, and supercomputers are being pressed into service for everything from designing aeroplanes to probing the origins of the universe. They're also being used all over the world to model climate change scenarios.

89 Tianhe-1A: China's New Supercomputer Beats Cray XT5 Jaguar of US

90 It contains a massive 7,000 graphics processors and 14,000 Intel chips, but it was Chinese researchers who worked out how to wire them up to create the lightning-fast data transfer and computational power. The system was designed at the National University of Defense Technology in China. This supercomputer, based at China's National Center for Supercomputing, has already started working for the local weather service and the National Offshore Oil Corporation.

91 The supercomputer set a performance record of 2.5 petaflops (two-and-a-half thousand trillion operations per second), which is 40 percent more than the Cray XT5 Jaguar's speed of 1.75 petaflops. The Tianhe-1A is twenty-nine million times more powerful than the earliest supercomputers of the 1970s.

92 Linux operating system: Tianhe-1A runs on the Linux operating system. While the thousands of individual processors used in the supercomputer are made in America, the switches that connect those computer chips were built by Chinese scientists. The interconnect and the switches are a critical success factor for a supercomputer: the faster you make the interconnect, the better your overall computation will flow. Milky Way (the English meaning of Tianhe) is a reported 47% faster than the XT5 and achieves this by uniting its thousands of Intel chips with graphics processors made by rival firm Nvidia.

93 The supercomputer consists of 20,000 smaller computers linked together and covers more than a third of an acre (17,000 square feet). It fills more than 100 fridge-sized computer racks, which together weigh more than 155 tonnes. The Tianhe-1A is powered by 7,168 Nvidia Tesla M2050 GPUs and Intel Xeon CPUs. The system consumes 4.04 megawatts of electricity. Five new supercomputers are being built that are supposed to be four times more powerful than China's new machine. Three are in the U.S.; two are in Japan.

94 Luckily the Tianhe-1A can. The fact that China currently owns the world's fastest supercomputer is not really relevant to an understanding of the international league tables of computing power. It almost goes without saying that in 18 months' time even petaflops will look like pocket calculator stuff. The real story here is that China's unprecedented level of investment in supercomputing is resulting in huge numbers of software engineers coming out of the country. It is not the Tianhe-1A that spells the future of computing dominance, but the legions of computing experts of the future.
