Parallel Numerical Algorithms 2016 Report 2


Assignments

(i) Peak performance of the Core i7-4500U

When using the FMA instructions of AVX2:

Single precision floating point: 32 FLOP/clock * 3.0 GHz * 2 cores = 192 GFLOPS
Double precision floating point: 16 FLOP/clock * 3.0 GHz * 2 cores = 96 GFLOPS

(ii) Measured performance in FLOPS

Single precision floating point: 38.1 GFLOPS (19.8% of peak performance)
Double precision floating point: 18.2 GFLOPS (19.0% of peak performance)

However, while the test program was running the CPU clock was 2.66 GHz. Taking this into account, the percentages of peak performance are 22.3% and 21.4%, respectively.

The performance was measured with the program shown below, a multithreaded program that uses AVX2 FMA instructions.
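As a side note before the listing, here is a minimal sketch of where the 32 and 16 FLOP/clock figures come from, assuming a Haswell-class core with two 256-bit FMA ports (the 3.0 GHz clock and two cores are the values used above):

#include <iostream>

// Peak-FLOPS sketch for the Core i7-4500U (Haswell):
// 2 FLOP per lane per FMA, 2 FMA ports per core, 256-bit AVX2 registers.
int main() {
    const double ghz = 3.0;         // maximum turbo clock used above
    const int cores = 2;
    const int fma_ports = 2;        // FMA execution ports per core
    const int flop_per_fma = 2;     // multiply + add

    const int lanes_sp = 256 / 32;  // 8 floats per AVX2 register
    const int lanes_dp = 256 / 64;  // 4 doubles per AVX2 register

    // 8 * 2 * 2 = 32 FLOP/clock (SP), 4 * 2 * 2 = 16 FLOP/clock (DP) per core
    std::cout << lanes_sp * flop_per_fma * fma_ports * ghz * cores << " GFLOPS (SP)\n";  // 192
    std::cout << lanes_dp * flop_per_fma * fma_ports * ghz * cores << " GFLOPS (DP)\n";  // 96
    return 0;
}

The measurement program itself follows.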

#define _USE_MATH_DEFINES
#include <iostream>
#include <vector>
#include <windows.h>
#include <cmath>
#include <immintrin.h>
#include <thread>

#define NUM            // number of repetitions of the FMA loop
#define SIZE 64        // elements per vector
#define THREAD 4
#define TYPED          // defined: double precision; undefined: single precision

#ifdef TYPED
#define TYPE double
#define VTYPE __m256d
#define FUNC _mm256_fmadd_pd
#else
#define TYPE float
#define VTYPE __m256
#define FUNC _mm256_fmadd_ps
#endif

void vec_add(TYPE **a) {
    VTYPE *va = (VTYPE *)a[0];
    VTYPE *vb = (VTYPE *)a[1];
    VTYPE *vc = (VTYPE *)a[2];
    VTYPE *vd = (VTYPE *)a[3];
    for (size_t j = 0; j < NUM; j++) {
        // one AVX2 FMA per 256-bit (32-byte) chunk: d = a * b + c
        for (size_t i = 0; i < SIZE / (32 / sizeof(TYPE)); i++) {
            vd[i] = FUNC(va[i], vb[i], vc[i]);
        }
    }
}

void Initialize(TYPE **a) {
    a[0] = (TYPE*)_mm_malloc(sizeof(TYPE) * SIZE, 32);
    a[1] = (TYPE*)_mm_malloc(sizeof(TYPE) * SIZE, 32);
    a[2] = (TYPE*)_mm_malloc(sizeof(TYPE) * SIZE, 32);
    a[3] = (TYPE*)_mm_malloc(sizeof(TYPE) * SIZE, 32);
    for (int j = 0; j < SIZE; j++) {
        a[0][j] = (TYPE)M_PI;
        a[1][j] = (TYPE)M_E;
        a[2][j] = (TYPE)1;
    }
}

void Finalize(TYPE **a) {
    _mm_free(a[0]);
    _mm_free(a[1]);
    _mm_free(a[2]);
    _mm_free(a[3]);
}

int main() {
    LARGE_INTEGER freq;
    if (!QueryPerformanceFrequency(&freq))
        return 1;
    LARGE_INTEGER start, end;
    TYPE **data[THREAD];
    for (int i = 0; i < THREAD; i++) {
        data[i] = (TYPE**)malloc(sizeof(TYPE*) * 4);
        Initialize(data[i]);
    }
    QueryPerformanceCounter(&start);

    std::vector<std::thread> threads;
    for (int i = 0; i < THREAD; i++) {
        threads.push_back(std::thread(vec_add, data[i]));
    }
    for (int i = 0; i < THREAD; i++) {
        threads[i].join();
    }
    QueryPerformanceCounter(&end);
    auto dur = end.QuadPart - start.QuadPart;
    // elapsed time in nanoseconds
    std::cout << (double)(dur) / freq.QuadPart * 1E9 << std::endl;
    // FLOPS = SIZE * ITERATION * CALC * THREAD / elapsed seconds
    std::cout << (double)SIZE * (double)NUM * 2 * 4 / ((double)(dur) / freq.QuadPart) << std::endl;
    for (int i = 0; i < THREAD; i++) {
        Finalize(data[i]);
    }
    return 0;
}

(iii) Execution time of Copy, Inner Product, and Sum

The figure below shows the execution times of the three kernels: the execution times of Copy and Sum are almost the same, while that of Inner Product is larger. This system has DDR memory with a peak transfer rate of 12.8 GB/s, and the bandwidths achieved by Copy and Sum are 8-11 GB/s, which indicates that the execution times of Copy and Sum are bottlenecked by memory speed.

[Figure: execution time (ns) vs. vector size n for copy, inner product (inn), and sum]

The program below measures the execution time of the sum of two vectors; Copy and Inner Product are measured with variations of the same program.

#define _USE_MATH_DEFINES
#include <iostream>
#include <vector>
#include <windows.h>
#include <cmath>
#include <random>

#define Num            // number of vector pairs
#define N 2048         // vector length

void vec_copy(float* v1, float* v2) {
    for (int i = 0; i < N; i++) {
        v1[i] = v2[i];
    }
}

void vec_sum(float* v1, float* v2) {
    for (int i = 0; i < N; i++) {
        v1[i] += v2[i];
    }
}

float vec_inn(float* v1, float* v2) {
    float sum = 0;
    for (int i = 0; i < N; i++) {
        sum += v1[i] * v2[i];
    }
    return sum;
}

std::random_device rnd;
std::mt19937 mt(rnd());

float* init_vector() {
    float* p = new float[N];
    for (int j = 0; j < N; j++) {
        p[j] = (float)mt();
    }
    return p;
}

void free_vector(float* p) {
    delete[] p;
}

int main() {
    LARGE_INTEGER freq;
    if (!QueryPerformanceFrequency(&freq))
        return 1;

    LARGE_INTEGER start, end;
    std::vector<float*> a;
    std::vector<float*> b;
    std::vector<float> c(Num);
    for (int i = 0; i < Num; i++) {
        a.push_back(init_vector());
        b.push_back(init_vector());
    }
    QueryPerformanceCounter(&start);
    for (int i = 0; i < Num; i++) {
        vec_sum(a[i], b[i]);
    }
    QueryPerformanceCounter(&end);
    for (int i = 0; i < Num; i++) {
        free_vector(a[i]);
        free_vector(b[i]);
    }
    auto dur = end.QuadPart - start.QuadPart;
    // average time per call in nanoseconds
    std::cout << N << "\t" << (double)(dur) / freq.QuadPart * 1E9 / Num << std::endl;
    return 0;
}
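For reference, the 8-11 GB/s figure quoted in (iii) can be recovered from the per-call time that this program prints. Below is a minimal sketch of the conversion, assuming a simple byte count per element (two reads and one write for Sum, ignoring write-allocate traffic); the 2500 ns value in the example is hypothetical:

#include <iostream>

// Converts a measured per-call time (in ns) for vectors of length n into an
// effective bandwidth in GB/s. bytes_per_elem depends on the kernel:
//   copy: 8  (one float read + one float write)
//   sum:  12 (two reads + one write, since v1[i] += v2[i])
//   inn:  8  (two reads, the result stays in a register)
double bandwidth_gbs(double time_ns, int n, int bytes_per_elem) {
    double bytes = (double)n * bytes_per_elem;
    return bytes / time_ns;  // bytes per nanosecond equals GB/s
}

int main() {
    // Example: if one call to vec_sum over n = 2048 takes about 2500 ns,
    // the effective bandwidth is roughly 9.8 GB/s, within the 8-11 GB/s range above.
    std::cout << bandwidth_gbs(2500.0, 2048, 12) << " GB/s" << std::endl;
    return 0;
}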
