Last time. Last Time. Last time. Dynamic Array Multiplication. Dynamic Nested Arrays

Similar documents
More in-class activities / questions

Referencing Examples

Introduction to Computer Systems , fall th Lecture, Sep. 14 th

Machine-Level Programming IV: Structured Data

Machine-Level Prog. IV - Structured Data

Data Structures in Memory!

Basic Data Types The course that gives CMU its Zip! Machine-Level Programming IV: Data Feb. 5, 2008

CISC 360. Machine-Level Programming IV: Structured Data Sept. 24, 2008

Integral. ! Stored & operated on in general registers. ! Signed vs. unsigned depends on instructions used

Basic Data Types The course that gives CMU its Zip! Machine-Level Programming IV: Structured Data Sept. 18, 2003.

X86 Assembly Buffer Overflow III:1

Page 1. Basic Data Types CISC 360. Machine-Level Programming IV: Structured Data Sept. 24, Array Access. Array Allocation.

Basic Data Types. Lecture 6A Machine-Level Programming IV: Structured Data. Array Allocation. Array Access Basic Principle

Lecture 7: More procedures; array access Computer Architecture and Systems Programming ( )

How to Write Fast Numerical Code Spring 2013 Lecture: Architecture/Microarchitecture and Intel Core

Giving credit where credit is due

Giving credit where credit is due

The course that gives CMU its Zip! Floating Point Arithmetic Feb 17, 2000

Assembly IV: Complex Data Types. Jin-Soo Kim Computer Systems Laboratory Sungkyunkwan University

h"p://news.illinois.edu/news/12/0927transient_electronics_johnrogers.html

2 Systems Group Department of Computer Science ETH Zürich. Argument Build 3. %r8d %rax. %r9d %rbx. Argument #6 %rcx. %r10d %rcx. Static chain ptr %rdx

Computer Systems CEN591(502) Fall 2011

Machine Representa/on of Programs: Arrays, Structs and Unions. This lecture. Arrays Structures Alignment Unions CS Instructor: Sanjeev Se(a

Machine-Level Programming IV: Structured Data

MACHINE-LEVEL PROGRAMMING IV: Computer Organization and Architecture

Systems I. Machine-Level Programming VII: Structured Data

Computer Organization: A Programmer's Perspective

Data Representa/ons: IA32 + x86-64

Carnegie Mellon. Bryant and O Hallaron, Computer Systems: A Programmer s Perspective, Third Edition

u Arrays One-dimensional Multi-dimensional (nested) Multi-level u Structures Allocation Access Alignment u Floating Point

Machine-level Programs Data

Machine-Level Programming V: Miscellaneous Topics Sept. 24, 2002

Giving credit where credit is due

Assembly I: Basic Operations. Jo, Heeseung

ASSEMBLY I: BASIC OPERATIONS. Jo, Heeseung

Linux Memory Layout The course that gives CMU its Zip! Machine-Level Programming IV: Miscellaneous Topics Sept. 24, Text & Stack Example

CS241 Computer Organization Spring Loops & Arrays

Assembly I: Basic Operations. Computer Systems Laboratory Sungkyunkwan University

Machine- level Programming IV: Data Structures. Topics Arrays Structs Unions

CS , Fall 2001 Exam 1

Meet & Greet! Come hang out with your TAs and Fellow Students (& eat free insomnia cookies) When : TODAY!! 5-6 pm Where : 3rd Floor Atrium, CIT

Binghamton University. CS-220 Spring x86 Assembler. Computer Systems: Sections

1.'(*2#3%* 2+6.)* !!"##$%$&'()&*$! +',-.$+'/-$! #0!"#$1"&2"/#',$! &'!)&,21'$3)*!40*,$ ! 5*'6728'*,20*"#$! 9)#46728'*,20*"#$:*',('7;$!

Turning C into Object Code Code in files p1.c p2.c Compile with command: gcc -O p1.c p2.c -o p Use optimizations (-O) Put resulting binary in file p

Memory Organization and Addressing

CMSC 313 Fall2009 Midterm Exam 2 Section 01 Nov 11, 2009

Machine-Level Programming I: Introduction Jan. 30, 2001

CS241 Computer Organization Spring 2015 IA

CS 261 Fall Mike Lam, Professor. x86-64 Data Structures and Misc. Topics

The Hardware/Software Interface CSE351 Spring 2013

Machine Programming 4: Structured Data

Assembly Language: Function Calls

Machine Programming 3: Procedures

Assembly Language: Function Calls" Goals of this Lecture"

EECS 213 Fall 2007 Midterm Exam

Assembly IV: Complex Data Types

Assembly Programmer s View Lecture 4A Machine-Level Programming I: Introduction

administrivia today start assembly probably won t finish all these slides Assignment 4 due tomorrow any questions?

Assembly Language: Function Calls" Goals of this Lecture"

Assembly Language: Function Calls. Goals of this Lecture. Function Call Problems

1 /* file cpuid2.s */ 4.asciz "The processor Vendor ID is %s \n" 5.section.bss. 6.lcomm buffer, section.text. 8.globl _start.

Machine Level Programming I: Basics

CISC 360. Machine-Level Programming I: Introduction Sept. 18, 2008

Introduction to Computer Systems. Exam 1. February 22, Model Solution fp

Procedure Calls. Young W. Lim Sat. Young W. Lim Procedure Calls Sat 1 / 27

Carnegie Mellon. 5 th Lecture, Jan. 31, Instructors: Todd C. Mowry & Anthony Rowe

CS241 Computer Organization Spring Data Alignment

Assembly III: Procedures. Jo, Heeseung

Assembly IV: Complex Data Types. Jin-Soo Kim Computer Systems Laboratory Sungkyunkwan University

ASSEMBLY III: PROCEDURES. Jo, Heeseung

The von Neumann Machine

Introduction to Computer Systems. Exam 1. February 22, This is an open-book exam. Notes are permitted, but not computers.

CSC 252: Computer Organization Spring 2018: Lecture 9

Machine- Level Programming IV: x86-64 Procedures, Data

Instruction Set Architectures

Assembly I: Basic Operations. Jin-Soo Kim Computer Systems Laboratory Sungkyunkwan University

CS429: Computer Organization and Architecture

The course that gives CMU its Zip! Machine-Level Programming I: Introduction Sept. 10, 2002

Machine-Level Programming Introduction

Machine-level Programming (3)

Roadmap. Java: Assembly language: OS: Machine code: Computer system:

Instruction Set Architecture

Instruction Set Architecture

Assembly III: Procedures. Jin-Soo Kim Computer Systems Laboratory Sungkyunkwan University

Homework. In-line Assembly Code Machine Language Program Efficiency Tricks Reading PAL, pp 3-6, Practice Exam 1

Lecture 4 CIS 341: COMPILERS

Machine Program: Procedure. Zhaoguo Wang

Today: Machine Programming I: Basics

Systems I. Machine-Level Programming I: Introduction

Procedure Calls. Young W. Lim Mon. Young W. Lim Procedure Calls Mon 1 / 29

CS429: Computer Organization and Architecture

Arrays. Young W. Lim Mon. Young W. Lim Arrays Mon 1 / 17

CAS CS Computer Systems Spring 2015 Solutions to Problem Set #2 (Intel Instructions) Due: Friday, March 20, 1:00 pm

University of Washington

CIT Week13 Lecture

The Hardware/So=ware Interface CSE351 Winter 2013

Sungkyunkwan University

Question 4.2 2: (Solution, p 5) Suppose that the HYMN CPU begins with the following in memory. addr data (translation) LOAD 11110

Machine-Level Programming I Introduction

Arrays. Young W. Lim Wed. Young W. Lim Arrays Wed 1 / 19

Transcription:

Last time Lecture 8: Structures, alignment, floats Computer Architecture and Systems Programming (252-0061-00) Timothy Roscoe Herbstsemester 2012 %rax %rbx %rcx %rdx %rsi %rdi %rsp Return alue Callee saed Argument #4 Argument #3 Argument #2 Argument #1 Stack pointer %r8 %r9 %r10 %r11 %r12 %r13 %r14 Argument #5 Argument #6 Callee saed Used for linking C: Callee saed Callee saed Callee saed %rbp Callee saed %r15 Callee saed Systems Group Department of Computer Science ETH Zürich 1 2 Last time Procedures (x86-64): Optimizations No base/frame pointer Passing arguments to functions through registers (if possible) Sometimes: Writing into the red zone (below stack pointer) rtn Ptr %rsp 8 unused 16 loc[1] 24 loc[0] Sometimes: Function call using jmp (instead of call) Reason: Performance use stack as little as possible while obeying rules (e.g., caller/callee sae registers) Arrays Nested Multi-leel Last Time int al[5]; 1 5 2 1 3 x x + 4 x + 8 x + 12 x + 16 x + 20 int pgh[4][5]; int *uni[3] 3 4 Dynamic Nested Arrays Strength Can create matrix of any size Programming Must do index computation explicitly Performance Accessing single element costly Must do multiplication int * new_ar_matrix(int n) return (int *) calloc(sizeof(int), n*n); int ar_ele (int *a, int i, int j, int n) return a[i*n+j]; mol 12(%ebp),%eax # i mol 8(%ebp),%edx # a imull 20(%ebp),%eax # n*i addl 16(%ebp),%eax # n*i+j mol (%edx,%eax,4),%eax # Mem[a+4*(i*n+j)] 5 Dynamic Array Multiplication Per iteration: Multiplies: 3 2 for subscripts 1 for data Adds: 4 2 for array indexing 1 for loop index 1 for data i-th row /* Compute element i,k of ariable matrix product */ int ar_prod_ele (int *a, int *b, int i, int k, int n) int j; int result = 0; for (j = 0; j < n; j++) result += a[i*n+j] * b[j*n+k]; a x b j-th column 6

Optimizing Dynamic Array Multiplication Optimizations Performed when set optimization leel to -O2 Code Motion Expression i*n can be computed outside loop Strength Reduction Incrementing j has effect of incrementing j*n+k by n Operations count 4 adds, 1 mult 4 adds, 3 mults int j; int result = 0; for (j = 0; j < n; j++) result += a[i*n+j] * b[j*n+k]; 4 adds, 1 mult int j; int result = 0; int itn = i*n; int jtnpk = k; for (j = 0; j < n; j++) result += a[itn+j] * b[jtnpk]; jtnpk += n; 7 8 struct rec int a[3]; int *p; ; oid set_i(struct rec *r, int al) r->i = al; Structures Memory Layout i a p 0 4 16 20 Concept Contiguously-allocated region of memory Refer to members within structure by names Members may be of different types Accessing Structure Member IA32 Assembly # %eax = al # %edx = r mol %eax,(%edx) # Mem[r] = al 9 Generating Pointer to Structure Member struct rec int a[3]; int *p; ; Generating Pointer to Array Element Offset of each structure member determined at compile time r r+4+4*idx i a p 0 4 16 20 int *find_a (struct rec *r, int idx) return &r->a[idx]; What does it do? # %ecx = idx # %edx = r leal 0(,%ecx,4),%eax # 4*idx leal 4(%eax,%edx),%eax # r+4*idx+4 10 Structure Referencing (Cont.) C Code struct rec int a[3]; int *p; ; oid set_p(struct rec *r) r->p = &r->a[r->i]; What does it do? i a p 0 4 16 20 i a 0 4 16 20 Element i # %edx = r mol (%edx),%ecx # r->i leal 0(,%ecx,4),%eax # 4*(r->i) leal 4(%edx,%eax),%eax # r+4+4*(r->i) mol %eax,16(%edx) # Update r->p 11 12

Alignment Aligned Data Primitie data type requires K bytes Address must be multiple of K Required on some machines; adised on IA32 treated differently by IA32 Linux, x86-64 Linux, and Windows! Motiation for Aligning Data Memory accessed by (aligned) chunks of 4 or 8 bytes (system dependent) Inefficient to load or store datum that spans quad word boundaries Virtual memory ery tricky when datum spans 2 pages Compiler Inserts gaps in structure to ensure correct alignment of fields Specific Cases of Alignment (IA32) 1 byte: char, no restrictions on address 2 bytes: short, lowest 1 bit of address must be 0 2 4 bytes: int, float, char *, lowest 2 bits of address must be 00 2 8 bytes: double, Windows (and most other OS s & instruction sets): lowest 3 bits of address must be 000 2 Linux: lowest 2 bits of address must be 00 2 i.e., treated the same as a 4-byte primitie data type 12 bytes: long double Windows, Linux: lowest 2 bits of address must be 00 2 i.e., treated the same as a 4-byte primitie data type 13 14 Specific Cases of Alignment (x86-64) 1 byte: char, no restrictions on address 2 bytes: short, lowest 1 bit of address must be 0 2 4 bytes: int, float, lowest 2 bits of address must be 00 2 8 bytes: double, char *, Windows & Linux: lowest 3 bits of address must be 000 2 16 bytes: long double Linux: lowest 3 bits of address must be 000 2 i.e., treated the same as a 8-byte primitie data type 15 Satisfying Alignment with Structures Within structure: Must satisfy element s alignment requirement Oerall structure placement Each structure has alignment requirement K K = Largest alignment of any element Initial address & structure length must be multiples of K Example (under Windows or x86-64): K = 8, due to double element struct S1 double ; *p; c 3 bytes i[0] i[1] 4 bytes p+0 p+4 p+8 p+16 p+24 Multiple of 4 Multiple of 8 Multiple of 8 Multiple of 8 16 Different Alignment Conentions x86-64 or IA32 Windows: K = 8, due to double element c 3 bytes i[0] i[1] 4 bytes p+0 p+4 p+8 p+16 p+24 IA32 Linux K = 4; double treated like a 4-byte data type struct S1 double ; *p; Saing Space Put large data types first struct S1 double ; *p; struct S2 double ; *p; Effect (example x86-64, both hae K=8) c 3 bytes i[0] i[1] 4 bytes p+0 p+4 p+8 p+16 p+24 c 3 bytes i[0] i[1] p+0 p+4 p+8 p+12 p+20 17 i[0] i[1] c p+0 p+8 p+16 18

Arrays of Structures Accessing Array Elements Satisfy alignment requirement for eery element struct S2 double ; a[10]; struct S3 short i; float ; short j; a[10]; a[0] a+0 a[1] a[2] a+24 a+48 a+36 a[0] a+0 a[i] a+12i i 2 bytes j 2 bytes a+12i a+12i+8 i[0] i[1] c a+24 a+32 a+40 7 bytes a+48 19 short get_j(int idx) return a[idx].j; # %eax = idx leal (%eax,%eax,2),%eax # 3*idx moswl a+8(,%eax,4),%eax 20 Accessing Array Elements Compute array offset 12i Compute offset 8 with structure Assembler gies offset a+8 Resoled during linking a[0] a+0 struct S3 short i; float ; short j; a[10]; a[i] a+12i i 2 bytes j 2 bytes a+12i a+12i+8 short get_j(int idx) return a[idx].j; # %eax = idx leal (%eax,%eax,2),%eax # 3*idx moswl a+8(,%eax,4),%eax 21 22 Union Allocation Allocate according to largest element Can only use one field at a time union U1 c i[0] i[1] double ; *up; up+0 up+4 up+8 struct S1 double ; *sp; Using Union to Access Bit Patterns typedef union float f; unsigned u; bit_float_t; float bit2float(unsigned u) bit_float_t arg; arg.u = u; return arg.f; u f 0 4 unsigned float2bit(float f) bit_float_t arg; arg.f = f; return arg.u; Same as (float) u? Same as (unsigned) f? c 3 bits i[0] i[1] 4 bits sp+0 sp+4 sp+8 sp+16 sp+24 23 24

Summary Arrays in C Contiguous allocation of memory Aligned to satisfy eery element s alignment requirement Pointer to first element No bounds checking Allocate bytes in order declared Pad in middle and at end to satisfy alignment Oerlay declarations Way to circument type system x87 (aailable with IA32, becoming obsolete) SSE3 (aailable with x86-64) 31 32 IA32 Floating Point (x87) FPU Data Register Stack (x87) History 8086: first computer to implement IEEE FP separate 8087 FPU (floating point unit) 486: merged FPU and Integer Unit onto one chip Becoming obsolete with x86-64 Summary Hardware to add, multiply, and diide Floating point data registers Various control & status registers Floating Point Formats single precision (C float): 32 bits double precision (C double): 64 bits extended precision (C long double): 80 bits Integer Unit Instruction decoder and sequencer Memory FPU 33 FPU register format (80 bit extended precision) 79 78 s exp 64 63 frac FPU registers 8 registers - %st(7) Logically form stack Top: Bottom disappears (drops out) after too many pushes Top %st(3) %st(2) %st(1) 0 34 FPU instructions (x87) FP Code Example (x87) Large number of floating point instructions and formats ~50 basic instruction types load, store, add, multiply sin, cos, tan, arctan, and log Often slower than math lib Sample instructions: Instruction Effect Description fldz push 0.0 Load zero flds Addr push Mem[Addr] Load single precision real fmuls Addr *M[Addr] Multiply faddp %st(1) +%st(1);pop Add and pop 35 Compute inner product of two ectors Single precision arithmetic Common computation float ipf (float x[], float y[], int n) float result = 0.0; for (i = 0; i < n; i++) result += x[i]*y[i]; pushl %ebp mol %esp,%ebp pushl %ebx # setup mol 8(%ebp),%ebx # %ebx=&x mol 12(%ebp),%ecx # %ecx=&y mol 16(%ebp),%edx # %edx=n fldz # push +0.0 xorl %eax,%eax # i=0 cmpl %edx,%eax # if i>=n done jge.l3.l5: flds (%ebx,%eax,4) # push x[i] fmuls (%ecx,%eax,4) # st(0)*=y[i] faddp # st(1)+=st(0); pop incl %eax # i++ cmpl %edx,%eax # if i<n repeat jl.l5.l3: mol -4(%ebp),%ebx # finish mol %ebp, %esp popl %ebp ret # st(0) = result36

1. fldz Inner Product Stack Trace Initialization 0.0 Iteration 0 Iteration 1 2. flds (%ebx,%eax,4) 0.0 %st(1) x[0] 3. fmuls (%ecx,%eax,4) 0.0 %st(1) x[0]*y[0] 5. flds (%ebx,%eax,4) x[0]*y[0] x[1] 6. fmuls (%ecx,%eax,4) x[0]*y[0] ] x[1]*y[1] eax = i ebx = *x ecx = *y %st(1) %st(1) x87 (aailable with IA32, becoming obsolete) SSE3 (aailable with x86-64) 4. faddp 0.0+x[0]*y[0] 7. faddp x[0]*y[0]+x[1]*y[1] 37 38 Vector Instructions: SSE Family SIMD (single-instruction, multiple data) ector instructions New data types, registers, operations Parallel operation on small (length 2-8) ectors of integers or floats Example: + x 4-way Intel Architectures (Floating Point) Processors 8086 286 386 486 Pentium Pentium MMX Architectures x86-16 x86-32 MMX Features ector instructions Aailable with Intel s SSE (streaming SIMD extensions) family SSE starting with Pentium III: 4-way single precision SSE2 starting with Pentium 4: 2-way double precision All x86-64 hae SSE3 (superset of SSE2, SSE) 39 time Pentium III Pentium 4 Pentium 4E Pentium 4F Core 2 Duo SSE SSE2 SSE3 x86-64 / em64t SSE4 4-way single precision fp 2-way double precision fp Our focus: SSE3 used for scalar (non-ector) floating point 40 SSE3 Registers All caller saed %xmm0 for floating point return alue %xmm0 %xmm1 %xmm2 %xmm3 128 bit = 2 doubles = 4 singles Argument #1 Argument #2 Argument #3 Argument #4 %xmm8 %xmm9 %xmm10 %xmm11 SSE3 Registers Different data types and associated instructions Integer ectors: 16-way byte 8-way 2 bytes 4-way 4 bytes ectors: 4-way single 2-way double 128 bit LSB %xmm4 %xmm5 %xmm6 %xmm7 Argument #5 Argument #6 Argument #7 Argument #8 %xmm12 %xmm13 %xmm14 %xmm15 41 scalars: single double 42

SSE3 Instructions: Examples SSE3 Instruction Names Single precision 4-way ector add: addps %xmm0 %xmm1 %xmm0 + %xmm1 Single precision scalar add: addss %xmm0 %xmm1 packed (ector) addps single precision single slot (scalar) addss %xmm0 addpd addsd + this course double precision %xmm1 43 44 SSE3 Basic Instructions x86-64 FP Code Example Moes: Usual operand form: reg reg, reg mem, mem reg Arithmetic: Single Double Effect moss mosd Single Double Effect addss addsd subss subsd S mulss mulsd diss disd maxss maxsd minss minsd S) sqrtss sqrtsd sqrt(s) 45 Compute inner product of two ectors Single precision arithmetic Uses SSE3 instructions ipf: xorps %xmm1, %xmm1 # result = 0.0 xorl %ecx, %ecx # i = 0 jmp.l8 # goto middle.l10: # loop: moslq %ecx,%rax # icpy = i incl %ecx # i++ moss (%rsi,%rax,4), %xmm0 # t = y[icpy] mulss (%rdi,%rax,4), %xmm0 # t *= x[icpy] addss %xmm0, %xmm1 # result += t.l8: # middle: cmpl %edx, %ecx # i:n jl.l10 # if < goto loop moaps %xmm1, %xmm0 # return result ret float ipf (float x[], float y[], int n) float result = 0.0; for (i = 0; i < n; i++) result += x[i]*y[i]; 46 SSE3 Conersion Instructions x86-64 FP Code Example Conersions Same operand forms as moes double funct(double a, float x, double b, int i) return a*x - b/i; Instruction Description ctss2sd a %xmm0 double x %xmm1 float ctsd2ss single b %xmm2 double ctsi2ss int single i %edi int ctsi2sd int funct: ctsi2ssq quad int ctss2sd %xmm1, %xmm1 # %xmm1 = (double) x ctsi2sdq quad int mulsd %xmm0, %xmm1 # %xmm1 = a*x cttss2si int (truncation) ctsi2sd %edi, %xmm0 # %xmm0 = (double) i disd %xmm0, %xmm2 # %xmm2 = b/i cttsd2si int (truncation) mosd %xmm1, %xmm0 # %xmm0 = a*x cttss2siq int (truncation) subsd %xmm2, %xmm0 # return a*x - b/i cttss2siq ret int (truncation) 47 48

Constants Checking Constant double cel2fahr(double temp) return 1.8 * temp + 32.0; Here: Constants in decimal format compiler decision hex more readable # Constant declarations.lc2:.long 3435973837 # Low order four bytes of 1.8.long 1073532108 # High order four bytes of 1.8.LC4:.long 0 # Low order four bytes of 32.0.long 1077936128 # High order four bytes of 32.0 Preious slide: Claim.LC4:.long 0 # Low order four bytes of 32.0.long 1077936128 # High order four bytes of 32.0 Conert to hex format:.lc4:.long 0x0 # Low order four bytes of 32.0.long 0x40400000 # High order four bytes of 32.0 Conert to double: Remember: e = 11 exponent bits, bias = 2 e-1-1 = 1023 # Code cel2fahr: mulsd.lc2(%rip), %xmm0 # Multiply by 1.8 addsd.lc4(%rip), %xmm0 # Add 32.0 ret 49 50 Comments SSE3 floating point Uses lower ½ (double) or ¼ (single) of ector Finally departure from awkward x87 Assembly ery similar to integer code x87 still supported Een mixing with SSE3 possible Not recommended For highest floating point performance Vectorization a must (but not in this course ) See next slide Vector Instructions Recently, gcc can autoectorize to some extent -O3 or ftree-ectorize No speed-up guaranteed Very limited icc is currently much better For highest performance ectorize yourself using intrinsics Intrinsics = C interface to ector instructions The latest: Intel AVX: 4-way double, 8-way single 51 52 Next time Memory layout Buffer oerflow, worms, and iruses Program optimization Remoing unnecessary procedure calls Code motion/precomputation Strength reduction Sharing of common subexpressions 53