L9: Project Discussion and Floating Point Issues (2/13/12)


Outline
- Discussion of semester projects
- Floating point
  - Mostly single precision until recent architectures
  - Accuracy
  - What's fast and what's not
- Reading:
  - Ch 6/7 in Kirk and Hwu, http://courses.ece.illinois.edu/ece498/al/textbook/chapter6-FloatingPoint.pdf
  - NVIDIA CUDA Programmer's Guide, Appendix C

Project Proposal (due 3/8)
- Team of 2-3 people; please let me know if you need a partner
- Proposal logistics:
  - Significant implementation, worth 50% of grade
  - Each person turns in the proposal (should be the same as the other team members')
  - Proposal: 3-4 page document (11pt, single-spaced)
  - Submit with handin program: handin prop <pdf-file>

Project Parts (Total = 50%)
- Proposal (5%): short written document, next few slides
- Design Review (10%): oral, in-class presentation 2 weeks before end
- Presentation and Poster (15%): poster session last week of class, dry run the week before
- Final Report (20%): due during finals (no final exam for this class)

Project Schedule
- Thursday, March 8: Proposals due
- Monday, April 2: Design Reviews
- Wednesday, April 18: Poster Dry Run
- Monday, April 23: In-Class Poster Presentation
- Wednesday, April 25: Guest Speaker

Content of Proposal
I. Team members: name and a sentence on expertise for each member
II. Problem description
- What is the computation and why is it important?
- Abstraction of computation: equations, graphic or pseudo-code, no more than 1 page
III. Suitability for GPU acceleration
- Amdahl's Law: describe the inherent parallelism. Argue that it is close to 100% of the computation. Use measurements from CPU execution of the computation if possible.
- Synchronization and communication: discuss what data structures may need to be protected by synchronization, or communicated through the host.
- Copy overhead: discuss the data footprint and the anticipated cost of copying to/from host memory.
IV. Intellectual challenges
- Generally, what makes this computation worthy of a project?
- Point to any difficulties you anticipate at present in achieving high speedup.

Projects: How to Approach
Some questions:
1. Amdahl's Law: target the bulk of the computation; profile to identify key computations.
2. What is your strategy for gradually adding GPU execution to CPU code while maintaining correctness?
3. How will you partition data and computation to avoid synchronization?
4. What types of floating point operations and accuracy requirements does the computation have?
5. How will you manage copy overhead? Can you overlap computation and copying?

Floating Point Incompatibility
- Most scientific apps are double precision codes!
- Graphics applications do not need double precision (the criteria are speed and whether the picture looks OK, not whether it accurately models some scientific phenomenon).
- Consequently, prior to the GTX and Tesla platforms, double precision floating point was not supported at all, and there were some inaccuracies in single-precision operations.
- In general:
  - Double precision is needed for convergence on fine meshes, or on large sets of values.
  - Single precision is OK for coarse meshes.

Some Key Features
- Hardware intrinsics implemented in special functional units: faster but less precise than software implementations
- Double precision is slower than single precision, but new architectural enhancements have increased its performance
- Measures of accuracy:
  - IEEE compliance
  - Units in the last place (ulp): the gap between the two floating-point numbers nearest to x, even if x is one of them

What Is IEEE Floating-Point Format?
- A floating point binary number consists of three parts: sign (S), exponent (E), and mantissa (M).
- Each (S, E, M) pattern uniquely identifies a floating point number.
- For each bit pattern, its IEEE floating-point value is derived as:
  value = (-1)^S * M * 2^E, where 1.0 <= M < 10.0 in binary (i.e., 1 <= M < 2 in decimal)
- The interpretation of S is simple: S=0 yields a positive number and S=1 a negative number.

Single Precision vs. Double Precision
- Platforms of compute capability 1.2 and below support only single precision floating point.
- Some systems (GTX 200 series, Tesla) include double precision, but it is much slower than single precision:
  - A single double-precision arithmetic unit is shared by all SPs in an SM
  - Similarly, a single fused multiply-add unit
- Greatly improved in Fermi:
  - Up to 16 double precision operations performed per warp (subsequent slides)

Fermi Architecture
- 512 cores: 32 cores per SM, 16 SMs
- 6 64-bit memory partitions

Closer Look at Fermi Core and SM
- 48 KB L1 cache in lieu of 16 KB shared memory
- 32-bit integer multiplies in a single operation
- Fused multiply-add, IEEE-compliant for the latest standard
- Double precision arithmetic: up to 16 DP fused multiply-adds can be performed per clock per SM

Hardware Comparison: GPU Floating Point Features

Feature                            | G80                                | SSE                                          | IBM Altivec                | Cell SPE
Precision                          | IEEE 754                           | IEEE 754                                     | IEEE 754                   | IEEE 754
Rounding modes for FADD and FMUL   | Round to nearest and round to zero | All 4 IEEE: round to nearest, zero, inf, -inf | Round to nearest only      | Round to zero/truncate only
Denormal handling                  | Flush to zero                      | Supported, 1000s of cycles                   | Supported, 1000s of cycles | Flush to zero
NaN support                        | Yes                                | Yes                                          | Yes                        | No
Overflow and infinity support      | Yes, only clamps to max norm       | Yes                                          | Yes                        | No, infinity
Flags                              | No                                 | Yes                                          | Yes                        | Some
Square root                        | Software only                      | Hardware                                     | Software only              | Software only
Division                           | Software only                      | Hardware                                     | Software only              | Software only
Reciprocal estimate accuracy       | 24 bit                             | 12 bit                                       | 12 bit                     | 12 bit
Reciprocal sqrt estimate accuracy  | 23 bit                             | 12 bit                                       | 12 bit                     | 12 bit
log2(x) and 2^x estimates accuracy | 23 bit                             | No                                           | 12 bit                     | No

Summary: Accuracy vs. Performance
- A few operators are IEEE 754-compliant: addition and multiplication
- ...but some give up precision, presumably in favor of speed or hardware simplicity
  - Particularly division
- Many built-in intrinsics perform common complex operations very fast
- Some intrinsics have multiple implementations to trade off speed and accuracy
  - e.g., the intrinsic __sin() (fast but imprecise) versus sin() (much slower)

Deviations from IEEE-754
- Addition and multiplication are IEEE 754 compliant: maximum 0.5 ulp (units in the last place) error
- However, they are often combined into a multiply-add (FMAD) whose intermediate result is truncated
- Division is non-compliant (2 ulp)
- Not all rounding modes are supported in G80, but they are supported now
- Denormalized numbers are not supported in G80, but are supported later
- No mechanism to detect floating-point exceptions (seems to be still true)

Arithmetic Instruction Throughput (G80)
- int and float add, shift, min, max and float mul, mad: 4 cycles per warp
  - int multiply (*) is by default 32-bit and requires multiple cycles per warp
  - Use the __mul24() / __umul24() intrinsics for a 4-cycle 24-bit int multiply
- Integer divide and modulo are expensive
  - The compiler will convert literal power-of-2 divides to shifts
  - Be explicit in cases where the compiler can't tell that the divisor is a power of 2!
  - Useful trick: foo % n == foo & (n-1) if n is a power of 2

Arithmetic Instruction Throughput (G80), continued
- Reciprocal, reciprocal square root, sin/cos, log, exp: 16 cycles per warp
  - These are the versions prefixed with __
  - Examples: __rcp(), __sin(), __exp()
- Other functions are combinations of the above:
  - y / x == __rcp(x) * y == 20 cycles per warp
  - sqrt(x) == __rcp(__rsqrt(x)) == 32 cycles per warp

Runtime Math Library
- There are two types of runtime math operations:
  - __func(): direct mapping to hardware ISA
    - Fast but lower accuracy (see the programming guide for details)
    - Examples: __sin(x), __exp(x), __pow(x,y)
  - func(): compiles to multiple instructions
    - Slower but higher accuracy (5 ulp, units in the last place, or less)
    - Examples: sin(x), exp(x), pow(x,y)
- The -use_fast_math compiler option forces every func() to compile to __func()

Next Class
- Discuss the CUBLAS 2/3 implementation of matrix multiply and sample projects
- Remainder of the semester: focus on applications and parallel programming patterns