Organizational issues (I)

Angela Porter
5 years ago

1 COSC 6385 Computer Architecture: Introduction and Organizational Issues, Fall 2009

Organizational issues (I)
Classes:
  Monday, 1.00pm-2.30pm, SEC 202
  Wednesday, 1.00pm-2.30pm, SEC 202
Evaluation:
  25% homework
  75% three quizzes (25% each)
In case of questions:
  gabriel@cs.uh.edu
  Tel: (713)
  Office hours: PGH 524, Tue, 11am-12pm or by appointment
All slides are available on the course website.
Videos of some lectures will be posted on the course web page.

2 Organizational Issues (II)
TAs for the course:
  Sarat Poluri, PGH 526, scpoluri@cs.uh.edu
  Anup Prakash, PGH 526, anup@cs.uh.edu
Tentative dates for the quizzes:
  Monday, September 21st
  Wednesday, October 21st
  Wednesday, December 2nd
Homework:
  Announced on Monday, September 28
  Due on Wednesday, October 14

Contents
Textbook: John L. Hennessy, David A. Patterson, "Computer Architecture: A Quantitative Approach", 4th Edition, Morgan Kaufmann Publishers

3 Contents (II)
Most of chapters 1 to 5
Appendix A, B, C
Selected sections regarding:
  Storage systems
  Vector processors
Selected literature on multi-core processors
Selected literature on virtualization

Contents (III)
  Aug. 24  Overview, Motivation, Organization
  Aug. 26  Performance Measurement
  Aug. 31  Instruction Set Architectures
  Sep. 2   Memory Hierarchy (I)
  Sep. 7   Labor Day, no lectures
  Sep. 9   Memory Hierarchy (II) (online)
  Sep. 14  Pipelining (I)
  Sep. 16  Recap for 1st quiz
  Sep. 21  1st quiz
  Sep. 23  Pipelining (II) (online)
  Sep. 28  Homework announcement
  Sep. 30  Tomasulo's algorithm (I)
  Oct. 5   Tomasulo's algorithm (II)
  Oct. 7   ILP with software approaches
  Oct. 12  Discussion of 1st quiz
  Oct. 14  Recap for 2nd quiz; homework due
  Oct. 19  Vector processors
  Oct. 21  2nd quiz
  Oct. 26  Multi-processor systems (I)
  Oct. 28  Multi-processor systems (II)
  Nov. 2   Multi-processor systems (III)
  Nov. 4   Discussion of 2nd quiz
  Nov. 9   Multi-processor systems (IV)
  Nov. 11  Virtualization
  Nov. 16  File I/O
  Nov. 18  cancelled?
  Nov. 23  Recap for 3rd quiz
  Nov. 25  Thanksgiving holiday, no class
  Nov. 30  History of Computers
  Dec. 2   3rd quiz

4 Why learn about Computer Architecture?

  for (i = 0; i < n; i++) {
      c[i] = a[i] + b[i];
  }

Every loop iteration requires 3 memory operations: 2 loads and 1 store.
For a microprocessor with a frequency of 2 GHz, and assuming 4-byte operands and one loop iteration per cycle, this loop requires a memory bandwidth of
  $3 \cdot 4\,\text{Bytes} \cdot 2 \cdot 10^{9}\,\text{s}^{-1} = 24\,\text{GBytes/s}$
to satisfy one Floating Point Unit (FPU).
Most modern processors have 2 FPUs and two or more integer units, which can work in parallel.
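The bandwidth estimate above can be reproduced with a few lines of C. This is a minimal sketch, not part of the original slides: the function name `required_bandwidth` and its parameters are hypothetical, and it assumes, as the slide's arithmetic does, that one loop iteration completes per clock cycle.

```c
#include <assert.h>

/* Bytes per second the memory system must deliver so that a loop issuing
 * mem_ops_per_iter loads/stores of bytes_per_operand bytes each can finish
 * one iteration per clock cycle (assumption) at a clock rate of clock_hz. */
double required_bandwidth(int mem_ops_per_iter, int bytes_per_operand,
                          double clock_hz)
{
    assert(mem_ops_per_iter >= 0 && bytes_per_operand > 0 && clock_hz > 0.0);
    return (double)mem_ops_per_iter * (double)bytes_per_operand * clock_hz;
}
```

Calling `required_bandwidth(3, 4, 2.0e9)` corresponds to the 2 loads and 1 store of 4-byte values at 2 GHz and gives 24e9 bytes per second, matching the 24 GBytes/s on the slide.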

5 Memory technology

Bandwidth of a memory module:
  $SB_{max} = SB_{BUS} \cdot f_{BUS} \cdot Op/Cycle$
with
  $SB_{max}$: max. memory bandwidth
  $SB_{BUS}$: bandwidth of the memory bus (64 Bit = 8 Bytes)
  $f_{BUS}$: frequency of the memory bus

Memory bandwidth
  Name          Frequency of memory bus (MHz)   max. bandwidth
  PC100 SDRAM                                   MB/s
  PC133 SDRAM                                   GB/s
  PC1600 DDR                                    GB/s
  PC2100 DDR                                    GB/s
  PC2700 DDR                                    GB/s
  PC3200 DDR                                    GB/s
  PC3700 DDR                                    GB/s
  PC4200 DDR                                    GB/s
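The module bandwidth formula can be written as a small C helper. This is a sketch under stated assumptions: the name `max_memory_bandwidth` is hypothetical, and the `ops_per_cycle` values in the usage note (1 for SDRAM, 2 for DDR, which transfers on both clock edges) are general knowledge about these technologies rather than values taken from the slide.

```c
#include <assert.h>

/* SB_max = SB_BUS * f_BUS * Op/Cycle
 * bus_width_bytes: width of the memory bus in bytes (64 bit = 8 bytes)
 * bus_freq_hz:     frequency of the memory bus in Hz
 * ops_per_cycle:   transfers per bus cycle (assumed 1 for SDRAM, 2 for DDR)
 * Returns the maximum bandwidth in bytes per second. */
double max_memory_bandwidth(double bus_width_bytes, double bus_freq_hz,
                            int ops_per_cycle)
{
    assert(bus_width_bytes > 0.0 && bus_freq_hz > 0.0 && ops_per_cycle > 0);
    return bus_width_bytes * bus_freq_hz * (double)ops_per_cycle;
}
```

For example, an 8-byte bus at 100 MHz yields 800e6 bytes per second with one transfer per cycle and 1.6e9 bytes per second with two.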

6 Memory modules (cont.)
Dual Channel Memory: 2 I/O channels between memory controller and memory module
DDR2 and DDR3: further evolution of the DDR technology

DDR2
  Name   Frequency of memory bus   Bandwidth of a module   Dual Channel bandwidth
  PC     MHz                       3.2 GB/s                6.4 GB/s
  PC     MHz                       4.2 GB/s                8.4 GB/s
  PC     MHz                       5.3 GB/s                10.6 GB/s
  PC     MHz                       6.4 GB/s                12.8 GB/s
  PC     MHz                       8.5 GB/s                17.0 GB/s
  PC     MHz                       10.6 GB/s               21.2 GB/s
  PC     MHz                       12.8 GB/s               25.6 GB/s

Memory hierarchies
  Level                         Size          Access time [cycles]
  Backup (tape)                 TB, PB
  Primary data storage (disk)   ~ 100 GB      > 10^6
  Main memory                   ~ 1-4 GB
  Caches                        ~ 1-4 MB      2-50
  Register                      < 256 Words   1-2

7 Memory hierarchies
Do I have to care about memory hierarchies?
Example: matrix multiply of two dense matrices
Trivial code:

  for (i = 0; i < dim; i++) {
      for (j = 0; j < dim; j++) {
          for (k = 0; k < dim; k++) {
              c[i][j] += a[i][k] * b[k][j];
          }
      }
  }

Matrix-multiply
Performance of the trivial implementation on a 2.2 GHz AMD Opteron with 2 GB main memory and 1 MB 2nd level cache:
  Matrix dimension   Execution time [sec]   Performance [MFLOPS]
  256x x
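The MFLOPS column of such a table follows from the matrix dimension and the measured execution time. The sketch below is an illustration, not the author's measurement code: the function names are hypothetical, and it assumes the usual counting of one multiply and one add per innermost iteration, i.e. 2 * dim^3 floating point operations in total.

```c
#include <assert.h>

/* Floating point operations of a dense dim x dim matrix multiply, counting
 * one multiply and one add per innermost loop iteration (assumption). */
double matmul_flops(int dim)
{
    assert(dim >= 0);
    return 2.0 * (double)dim * (double)dim * (double)dim;
}

/* Achieved performance in MFLOPS for `flops` operations in `seconds`. */
double mflops(double flops, double seconds)
{
    assert(seconds > 0.0);
    return flops / seconds / 1.0e6;
}
```

A 256x256 multiply performs about 33.6 million floating point operations, so `mflops(matmul_flops(256), t)` converts a measured time `t` in seconds into the performance figure reported in the table.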

8 Matrix-multiply (II)
Peak floating point performance of the processor:
  $2 \cdot (2.2 \cdot 10^{9})$ floating point operations/sec $= 4.4 \cdot 10^{9}$ FLOPS $= 4.4$ GFLOPS
where
  2: number of floating point units
  $2.2 \cdot 10^{9}$: frequency of the processor, assuming that each FPU can finish one operation per cycle
  4.4 GFLOPS: theoretical floating point peak performance of the processor
Where are the missing FLOPS between theoretical peak and achieved performance? Memory wait time.

Blocked code:

  for (i = 0; i < dim; i += block) {
      for (j = 0; j < dim; j += block) {
          for (k = 0; k < dim; k += block) {
              for (ii = i; ii < (i + block); ii++) {
                  for (jj = j; jj < (j + block); jj++) {
                      for (kk = k; kk < (k + block); kk++) {
                          c[ii][jj] += a[ii][kk] * b[kk][jj];
                      }
                  }
              }
          }
      }
  }
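For experimenting with the blocked loop nest, the following self-contained C sketch places the trivial and the blocked variant side by side. It is not the code used for the measurements in these slides: the function names, the flat row-major array layout, and the clamping of block boundaries (so that `dim` need not be a multiple of `block`) are choices made for this illustration.

```c
#include <assert.h>
#include <stddef.h>

static size_t min_size(size_t x, size_t y) { return x < y ? x : y; }

/* c += a * b for row-major dim x dim matrices, plain triple loop. */
void matmul_trivial(size_t dim, const double *a, const double *b, double *c)
{
    for (size_t i = 0; i < dim; i++)
        for (size_t j = 0; j < dim; j++)
            for (size_t k = 0; k < dim; k++)
                c[i * dim + j] += a[i * dim + k] * b[k * dim + j];
}

/* Same computation, traversed in block x block tiles so that the tiles of
 * a, b and c being worked on can stay in cache while they are reused. */
void matmul_blocked(size_t dim, size_t block,
                    const double *a, const double *b, double *c)
{
    assert(block > 0);
    for (size_t i = 0; i < dim; i += block) {
        size_t i_end = min_size(i + block, dim);   /* clamp the last tile */
        for (size_t j = 0; j < dim; j += block) {
            size_t j_end = min_size(j + block, dim);
            for (size_t k = 0; k < dim; k += block) {
                size_t k_end = min_size(k + block, dim);
                for (size_t ii = i; ii < i_end; ii++)
                    for (size_t jj = j; jj < j_end; jj++)
                        for (size_t kk = k; kk < k_end; kk++)
                            c[ii * dim + jj] +=
                                a[ii * dim + kk] * b[kk * dim + jj];
            }
        }
    }
}
```

Both functions accumulate into `c`, so the caller should zero it first. Timing the two variants for several values of `block` is one way to reproduce the kind of comparison shown in the performance tables, with results that will depend on the machine's cache sizes.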

9 Performance of the blocked code
  Matrix dimension   block   Execution time [sec]   Performance [MFLOPS]   trivial [MFLOPS]
  256x                       046                    726
                     16      0.51                   657
                     32      0.043                  777
                     64      0.049                  677
                     128     0.113                  296
  512x512            4       0.


11 Top 500 List


12 IBM Roadrunner
First computer to surpass the 1 Petaflop ($10^{15}$ FLOPS, approximately $2^{50}$) barrier
Installed at Los Alamos National Laboratory
Hybrid architecture:
  13,824 AMD Opteron cores
  116,640 IBM PowerXCell 8i cores
Cost: $120 million
