Parallel Numerical Algorithms 2016 Report 2


Assignments

(i) Peak performance of the Core i7-4500U

When using the FMA instructions of AVX2:

Single precision floating point: 32 FLOP/clock * 3.0 GHz * 2 cores = 192 GFLOPS
Double precision floating point: 16 FLOP/clock * 3.0 GHz * 2 cores = 96 GFLOPS

(ii) Measured performance in FLOPS

Single precision floating point: 38.1 GFLOPS (19.8% of peak performance)
Double precision floating point: 18.2 GFLOPS (19.0% of peak performance)

However, while the test program was running the CPU clock was 2.66 GHz. Taking this into account, the percentages of peak performance are 22.3% and 21.4%, respectively.

The performance was measured with the program shown below, a multithreaded program that uses AVX2 FMA instructions.
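As a side note before the listing, here is a minimal sketch of where the 32 and 16 FLOP/clock figures come from, assuming a Haswell-class core with two 256-bit FMA ports (the 3.0 GHz clock and two cores are the values used above):

#include <iostream>

// Peak-FLOPS sketch for the Core i7-4500U (Haswell):
// 2 FLOP per lane per FMA, 2 FMA ports per core, 256-bit AVX2 registers.
int main() {
    const double ghz = 3.0;         // maximum turbo clock used above
    const int cores = 2;
    const int fma_ports = 2;        // FMA execution ports per core
    const int flop_per_fma = 2;     // multiply + add

    const int lanes_sp = 256 / 32;  // 8 floats per AVX2 register
    const int lanes_dp = 256 / 64;  // 4 doubles per AVX2 register

    // 8 * 2 * 2 = 32 FLOP/clock (SP), 4 * 2 * 2 = 16 FLOP/clock (DP) per core
    std::cout << lanes_sp * flop_per_fma * fma_ports * ghz * cores << " GFLOPS (SP)\n";  // 192
    std::cout << lanes_dp * flop_per_fma * fma_ports * ghz * cores << " GFLOPS (DP)\n";  // 96
    return 0;
}

The measurement program itself follows.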

#define _USE_MATH_DEFINES
#include <iostream>
#include <vector>
#include <windows.h>
#include <cmath>
#include <immintrin.h>
#include <thread>

#define NUM            // number of repetitions of the FMA loop
#define SIZE 64        // elements per vector
#define THREAD 4
#define TYPED          // defined: double precision; undefined: single precision

#ifdef TYPED
#define TYPE double
#define VTYPE __m256d
#define FUNC _mm256_fmadd_pd
#else
#define TYPE float
#define VTYPE __m256
#define FUNC _mm256_fmadd_ps
#endif

void vec_add(TYPE **a) {
    VTYPE *va = (VTYPE *)a[0];
    VTYPE *vb = (VTYPE *)a[1];
    VTYPE *vc = (VTYPE *)a[2];
    VTYPE *vd = (VTYPE *)a[3];
    for (size_t j = 0; j < NUM; j++) {
        // one AVX2 FMA per 256-bit (32-byte) chunk: d = a * b + c
        for (size_t i = 0; i < SIZE / (32 / sizeof(TYPE)); i++) {
            vd[i] = FUNC(va[i], vb[i], vc[i]);
        }
    }
}

void Initialize(TYPE **a) {
    a[0] = (TYPE*)_mm_malloc(sizeof(TYPE) * SIZE, 32);
    a[1] = (TYPE*)_mm_malloc(sizeof(TYPE) * SIZE, 32);
    a[2] = (TYPE*)_mm_malloc(sizeof(TYPE) * SIZE, 32);
    a[3] = (TYPE*)_mm_malloc(sizeof(TYPE) * SIZE, 32);
    for (int j = 0; j < SIZE; j++) {
        a[0][j] = (TYPE)M_PI;
        a[1][j] = (TYPE)M_E;
        a[2][j] = (TYPE)1;
    }
}

void Finalize(TYPE **a) {
    _mm_free(a[0]);
    _mm_free(a[1]);
    _mm_free(a[2]);
    _mm_free(a[3]);
}

int main() {
    LARGE_INTEGER freq;
    if (!QueryPerformanceFrequency(&freq))
        return 1;
    LARGE_INTEGER start, end;
    TYPE **data[THREAD];
    for (int i = 0; i < THREAD; i++) {
        data[i] = (TYPE**)malloc(sizeof(TYPE*) * 4);
        Initialize(data[i]);
    }
    QueryPerformanceCounter(&start);

    std::vector<std::thread> threads;
    for (int i = 0; i < THREAD; i++) {
        threads.push_back(std::thread(vec_add, data[i]));
    }
    for (int i = 0; i < THREAD; i++) {
        threads[i].join();
    }
    QueryPerformanceCounter(&end);
    auto dur = end.QuadPart - start.QuadPart;
    // elapsed time in nanoseconds
    std::cout << (double)(dur) / freq.QuadPart * 1E9 << std::endl;
    // FLOPS = SIZE * ITERATION * CALC * THREAD / elapsed seconds
    std::cout << (double)SIZE * (double)NUM * 2 * 4 / ((double)(dur) / freq.QuadPart) << std::endl;
    for (int i = 0; i < THREAD; i++) {
        Finalize(data[i]);
    }
    return 0;
}

(iii) Execution time of Copy, Inner Product, and Sum

The figure below shows the execution times of the three kernels: the execution times of Copy and Sum are almost the same, while that of Inner Product is larger. This system has DDR memory with a peak transfer rate of 12.8 GB/s, and the bandwidths achieved by Copy and Sum are 8-11 GB/s, which indicates that the execution times of Copy and Sum are bottlenecked by memory speed.

[Figure: execution time (ns) vs. vector size n for copy, inner product (inn), and sum]

The program below measures the execution time of the sum of two vectors; Copy and Inner Product are measured with variations of the same program.

#define _USE_MATH_DEFINES
#include <iostream>
#include <vector>
#include <windows.h>
#include <cmath>
#include <random>

#define Num            // number of vector pairs
#define N 2048         // vector length

void vec_copy(float* v1, float* v2) {
    for (int i = 0; i < N; i++) {
        v1[i] = v2[i];
    }
}

void vec_sum(float* v1, float* v2) {
    for (int i = 0; i < N; i++) {
        v1[i] += v2[i];
    }
}

float vec_inn(float* v1, float* v2) {
    float sum = 0;
    for (int i = 0; i < N; i++) {
        sum += v1[i] * v2[i];
    }
    return sum;
}

std::random_device rnd;
std::mt19937 mt(rnd());

float* init_vector() {
    float* p = new float[N];
    for (int j = 0; j < N; j++) {
        p[j] = (float)mt();
    }
    return p;
}

void free_vector(float* p) {
    delete[] p;
}

int main() {
    LARGE_INTEGER freq;
    if (!QueryPerformanceFrequency(&freq))
        return 1;

    LARGE_INTEGER start, end;
    std::vector<float*> a;
    std::vector<float*> b;
    std::vector<float> c(Num);
    for (int i = 0; i < Num; i++) {
        a.push_back(init_vector());
        b.push_back(init_vector());
    }
    QueryPerformanceCounter(&start);
    for (int i = 0; i < Num; i++) {
        vec_sum(a[i], b[i]);
    }
    QueryPerformanceCounter(&end);
    for (int i = 0; i < Num; i++) {
        free_vector(a[i]);
        free_vector(b[i]);
    }
    auto dur = end.QuadPart - start.QuadPart;
    // average time per call in nanoseconds
    std::cout << N << "\t" << (double)(dur) / freq.QuadPart * 1E9 / Num << std::endl;
    return 0;
}
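For reference, the 8-11 GB/s figure quoted in (iii) can be recovered from the per-call time that this program prints. Below is a minimal sketch of the conversion, assuming a simple byte count per element (two reads and one write for Sum, ignoring write-allocate traffic); the 2500 ns value in the example is hypothetical:

#include <iostream>

// Converts a measured per-call time (in ns) for vectors of length n into an
// effective bandwidth in GB/s. bytes_per_elem depends on the kernel:
//   copy: 8  (one float read + one float write)
//   sum:  12 (two reads + one write, since v1[i] += v2[i])
//   inn:  8  (two reads, the result stays in a register)
double bandwidth_gbs(double time_ns, int n, int bytes_per_elem) {
    double bytes = (double)n * bytes_per_elem;
    return bytes / time_ns;  // bytes per nanosecond equals GB/s
}

int main() {
    // Example: if one call to vec_sum over n = 2048 takes about 2500 ns,
    // the effective bandwidth is roughly 9.8 GB/s, within the 8-11 GB/s range above.
    std::cout << bandwidth_gbs(2500.0, 2048, 12) << " GB/s" << std::endl;
    return 0;
}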
