Parallel Numerical Algorithms 2016 Report 2
- Natalie Gallagher
Assignments

(i) Peak performance of the Core i7 4500U

When using the FMA instructions of AVX2:

Single-precision floating point: 32 FLOPS/clock * 3.0 GHz * 2 cores = 192 GFLOPS
Double-precision floating point: 16 FLOPS/clock * 3.0 GHz * 2 cores = 96 GFLOPS

(ii) Measured performance in FLOPS

Single-precision floating point: 38.1 GFLOPS (19.8% of peak performance)
Double-precision floating point: 18.2 GFLOPS (19.0% of peak performance)

However, while the test program was running, the CPU clock was 2.66 GHz. Taking this into account, the percentages of peak performance are 22.3% and 21.4%.

The measurements were taken with the program shown below. It is a multithreaded program that uses AVX2 instructions.

    #define _USE_MATH_DEFINES
    #include <iostream>
    #include <vector>
    #include <windows.h>
    #include <cmath>
    #include <immintrin.h>
    #include <thread>

    #define NUM          // iteration count; value missing in the source
    #define SIZE 64      // elements per vector
    #define THREAD 4     // number of worker threads
    #define TYPED        // defined: double precision; undefined: single precision

    #ifdef TYPED
    #define TYPE double
    #define VTYPE __m256d
    #define FUNC _mm256_fmadd_pd
    #else
    #define TYPE float
    #define VTYPE __m256
    #define FUNC _mm256_fmadd_ps
    #endif

    // Repeats d = a * b + c over the whole vector NUM times using 256-bit FMA.
    void vec_add(TYPE **a) {
        VTYPE *va = (VTYPE *)a[0];
        VTYPE *vb = (VTYPE *)a[1];
        VTYPE *vc = (VTYPE *)a[2];
        VTYPE *vd = (VTYPE *)a[3];
        for (size_t j = 0; j < NUM; j++) {
            // One 256-bit register holds 32 / sizeof(TYPE) elements.
            for (size_t i = 0; i < SIZE / (32 / sizeof(TYPE)); i++) {
                vd[i] = FUNC(va[i], vb[i], vc[i]);
            }
        }
    }

    // Allocates four 32-byte-aligned vectors and fills the three inputs.
    void Initialize(TYPE **a) {
        a[0] = (TYPE*)_mm_malloc(sizeof(TYPE) * SIZE, 32);
        a[1] = (TYPE*)_mm_malloc(sizeof(TYPE) * SIZE, 32);
        a[2] = (TYPE*)_mm_malloc(sizeof(TYPE) * SIZE, 32);
        a[3] = (TYPE*)_mm_malloc(sizeof(TYPE) * SIZE, 32);
        for (int j = 0; j < SIZE; j++) {
            a[0][j] = (TYPE)M_PI;
            a[1][j] = (TYPE)M_E;
            a[2][j] = (TYPE)1;
        }
    }

    void Finalize(TYPE **a) {
        _mm_free(a[0]);
        _mm_free(a[1]);
        _mm_free(a[2]);
        _mm_free(a[3]);
    }

    int main() {
        LARGE_INTEGER freq;
        if (!QueryPerformanceFrequency(&freq))
            return 1;
        LARGE_INTEGER start, end;
        TYPE **data[THREAD];
        // Loop header missing in the source; reconstructed from the use of data[i].
        for (int i = 0; i < THREAD; i++) {
            data[i] = (TYPE**)malloc(sizeof(TYPE*) * 4);
            Initialize(data[i]);
        }
        QueryPerformanceCounter(&start);
        std::vector<std::thread> threads;
        std::vector<int> da;
        // Loop headers missing in the source; reconstructed as launch-all, then join-all.
        for (int i = 0; i < THREAD; i++) {
            threads.push_back(std::thread(vec_add, data[i]));
        }
        for (int i = 0; i < THREAD; i++) {
            threads[i].join();
        }
        QueryPerformanceCounter(&end);
        auto dur = end.QuadPart - start.QuadPart;
        // Elapsed time in nanoseconds.
        std::cout << (double)(dur) / freq.QuadPart * 1E9 << std::endl;
        // FLOPS = SIZE * ITERATION * CALC * THREAD / elapsed seconds
        std::cout << (double)SIZE * (double)NUM * 2 * 4 / ((double)(dur) / freq.QuadPart) << std::endl;
        for (int i = 0; i < THREAD; i++) {
            Finalize(data[i]);
        }
        return 0;
    }

(iii) Execution time of Copy, Inner Product and Sum

The figure below plots the execution times of the three programs. Copy and Sum take almost the same time, while Inner Product takes longer than either. The system uses DDR memory with a bandwidth of 12.8 GB/s. The bandwidths achieved by Copy and Sum are 8 to 11 GB/s, which indicates that their execution times are bottlenecked by memory speed.

[Figure: execution time (ns) versus vector size (n) for copy, inn (inner product), and sum]

The following program measures the execution time of the sum of two vectors. Copy and Inner Product are measured with variants of it.

    #define _USE_MATH_DEFINES
    #include <iostream>
    #include <vector>
    #include <windows.h>
    #include <cmath>
    #include <random>

    #define Num          // number of vector pairs; value missing in the source
    #define N 2048       // vector size

    void vec_copy(float* v1, float* v2) {
        for (int i = 0; i < N; i++) {
            v1[i] = v2[i];
        }
    }

    void vec_sum(float* v1, float* v2) {
        for (int i = 0; i < N; i++) {
            v1[i] += v2[i];
        }
    }

    float vec_inn(float* v1, float* v2) {
        float sum = 0;
        for (int i = 0; i < N; i++) {
            sum += v1[i] * v2[i];
        }
        return sum;
    }

    std::random_device rnd;
    std::mt19937 mt(rnd());

    // Allocates a vector of N floats filled with random values.
    float* init_vector() {
        float* p = new float[N];
        for (int j = 0; j < N; j++) {
            p[j] = (float)mt();
        }
        return p;
    }

    void free_vector(float *p) {
        delete[] p;  // arrays allocated with new[] must be released with delete[]
    }

    int main() {
        LARGE_INTEGER freq;
        if (!QueryPerformanceFrequency(&freq))
            return 1;
        LARGE_INTEGER start, end;
        std::vector<float*> a;
        std::vector<float*> b;
        std::vector<float> c(Num);
        for (int i = 0; i < Num; i++) {
            a.push_back(init_vector());
            b.push_back(init_vector());
        }
        QueryPerformanceCounter(&start);
        for (int i = 0; i < Num; i++) {
            vec_sum(a[i], b[i]);
        }
        QueryPerformanceCounter(&end);
        for (int i = 0; i < Num; i++) {
            free_vector(a[i]);
            free_vector(b[i]);
        }
        auto dur = end.QuadPart - start.QuadPart;
        // Vector size and average time per call in nanoseconds.
        std::cout << N << "\t" << (double)(dur) / freq.QuadPart * 1E9 / Num << std::endl;
        return 0;
    }