Intel Tools for Parallel Programming, Windows HPC, RWTH Aachen 2007

1 Intel Tools for Parallel Programming. Windows HPC, RWTH Aachen 2007. Dr. Mario Deilmann, Intel Compiler Group

2 Processor Evolution (x86). Figure: from the single-core Intel Xeon processor, through the Dual-Core Intel Xeon processor 5100 series, to the new Quad-Core Intel Xeon 5300 for 2006.

3 Parallel computing is omnipresent. Over the next few years, all computers will be parallel computers in some form: servers, laptops, cell phones. What about software? Herb Sutter of Microsoft wrote in Dr. Dobb's Journal: "The Free Lunch Is Over: A Fundamental Turn Toward Concurrency in Software". Software performance will no longer increase from one generation to the next as hardware improves, unless the software is parallel.

4 Intel Processor and Platform Evolution for the Next Decade

5 Independent programs versus parallel programs. Figure labels: independent programs (several applications, separated by process switches); parallel programs (one application with several threads, separated by thread switches). Threads allow one application to use the power of multiple processors.

6 How do we make use of the additional cores and CPUs on the software side?

7 What is Parallelism? Two or more processes or threads execute at the same time. Multiple processes communicate through an inter-process protocol. A single process with multiple threads communicates through shared memory.

8 Process model. The OS creates a process for each program loaded. Each process starts with a master thread; additional threads can be created within the process. Threads share code and data (code segment, data segment). Each thread has its own registers and stack.
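The split between shared data and per-thread stacks can be illustrated with a minimal sketch. It assumes standard C++11 threads rather than any Intel tool, and the names `shared_total`, `add_range`, and `threaded_sum` are hypothetical, chosen only for this illustration.

```cpp
#include <cassert>
#include <mutex>
#include <thread>
#include <vector>

// Shared by all threads of the process (lives in the data segment).
static long shared_total = 0;
static std::mutex total_mutex;

// Each thread runs this function on its own stack, so local_sum is private.
void add_range(long first, long last) {
    long local_sum = 0;
    for (long i = first; i <= last; ++i) local_sum += i;
    // Only the update of the shared variable needs protection.
    std::lock_guard<std::mutex> guard(total_mutex);
    shared_total += local_sum;
}

// Sum 1..n with num_threads threads created inside one process.
long threaded_sum(long n, int num_threads) {
    assert(num_threads > 0);
    shared_total = 0;
    std::vector<std::thread> workers;
    const long chunk = n / num_threads;
    for (int t = 0; t < num_threads; ++t) {
        const long first = t * chunk + 1;
        // The last thread also takes the remainder of the range.
        const long last = (t == num_threads - 1) ? n : (t + 1) * chunk;
        workers.emplace_back(add_range, first, last);
    }
    for (auto& w : workers) w.join();
    return shared_total;
}
```

Calling `threaded_sum(100, 4)` sums the numbers 1 to 100 with four threads that all write to the same global variable while keeping their partial sums on their own stacks.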

9 Parallel Programming Models: Message Passing. Create/fork multiple processes, typically one per node/core. Explicit communication: send messages with send(tid, tag, message) and receive(tid, tag, message). Synchronization: block on messages; barriers. Figure: nodes, each with a processor (P) and memory (M), connected by an interconnect.
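The send/receive pattern with blocking on messages can be sketched in plain C++. This is only an in-process illustration: threads stand in for the separate processes of a real message-passing program, which would normally use a library such as MPI or PVM. The `Mailbox` class and `ping_pong` function are hypothetical names, not part of any such library.

```cpp
#include <condition_variable>
#include <mutex>
#include <queue>
#include <string>
#include <thread>
#include <utility>

// Hypothetical mailbox: send() enqueues a tagged message, receive() blocks
// until a message is available ("block on messages").
class Mailbox {
public:
    void send(int tag, const std::string& message) {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            queue_.push(std::make_pair(tag, message));
        }
        ready_.notify_one();
    }

    std::pair<int, std::string> receive() {
        std::unique_lock<std::mutex> lock(mutex_);
        ready_.wait(lock, [this] { return !queue_.empty(); });
        std::pair<int, std::string> message = queue_.front();
        queue_.pop();
        return message;
    }

private:
    std::mutex mutex_;
    std::condition_variable ready_;
    std::queue<std::pair<int, std::string>> queue_;
};

// The master sends one request; the worker echoes it back with the same tag.
std::string ping_pong(const std::string& text) {
    Mailbox to_worker;
    Mailbox to_master;
    std::thread worker([&to_worker, &to_master] {
        std::pair<int, std::string> request = to_worker.receive();
        to_master.send(request.first, "echo: " + request.second);
    });
    to_worker.send(1, text);
    std::pair<int, std::string> reply = to_master.receive();  // blocks
    worker.join();
    return reply.second;
}
```

No data is shared directly between the two sides; everything travels through explicit messages, which is the defining property of the model.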

10 Parallel Programming Models: Shared Memory. Create one process with multiple threads, typically one per node/core. Implicit communication using the shared address space (loads and stores). Synchronization: locks, atomic memory operators, barriers. Figure: the threads (T) of one process (P) connected by a bus to memory.
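As a minimal sketch of implicit communication through shared memory, the threads below update one counter with an atomic memory operation instead of a lock. It assumes standard C++11 (`std::atomic`, `std::thread`); the function name `count_with_atomic` is hypothetical.

```cpp
#include <atomic>
#include <cassert>
#include <thread>
#include <vector>

// Every thread increments the same counter in the shared address space.
// fetch_add is an atomic read-modify-write, so no increment is lost.
long count_with_atomic(int num_threads, int increments_per_thread) {
    assert(num_threads > 0 && increments_per_thread >= 0);
    std::atomic<long> counter{0};
    std::vector<std::thread> workers;
    for (int t = 0; t < num_threads; ++t) {
        workers.emplace_back([&counter, increments_per_thread] {
            for (int i = 0; i < increments_per_thread; ++i) {
                counter.fetch_add(1, std::memory_order_relaxed);
            }
        });
    }
    // join() acts as the final synchronization point before reading the result.
    for (auto& w : workers) w.join();
    return counter.load();
}
```

With a plain `long` and `counter++` in place of the atomic, the same program would contain a data race and could return less than the expected total.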

11 Parallel programming approaches. API / library: threads (Pthreads (POSIX), Win32* threading API), Intel Threading Building Blocks (C++), MPI, PVM. Programming language mechanisms: Java*, C#, Erlang. Programming language extensions: OpenMP (C, C++, Fortran, ...), UPC (Unified Parallel C), Cilk (extension to GCC).

12 Multithreading introduces a new class of problems. Developing threaded or MPI applications is hard, but new and advanced Intel architectures and software tools help to support these approaches. The new class of problems comes from the interactions between threads, which are complicated and hard to find: correctness problems (data races), performance problems (contention), and runtime problems (crashes). We have opened Pandora's box.

13 Performance versus effort. Figure: code performance plotted against development effort (code restructuring). Labels: serial optimization (theoretical speedup limited by the core); OpenMP and threads (theoretical speedup limited by the number of cores per CPU); MPI on a cluster (theoretical speedup limited by the number of CPUs per cluster).

14 Intel Software Development Tools for Parallel Programming

15 How can we support parallel development? Analysis (serial / parallel): VTune Performance Analyzer, Intel Trace Analyzer. Design (introduce or extend parallelism): Intel performance libraries (IPP and MKL), OpenMP* (Intel Compiler), Intel MPI. Debug for correctness (data races, locks): Intel Thread Checker, Intel MPI Correctness Checker. Tune for performance (bottlenecks): Intel Thread Profiler, VTune Performance Analyzer, Intel Trace Collector.

16 Intel Threading Tools

17 Intel VTune Analyzer 9.0: identifies hard-to-find performance bottlenecks. Features: tune process-parallel or thread-parallel code; low-overhead sampling; graphical call graph; view results on source or assembly. What's new: new tuning methodology; stall cycle accounting for Core 2 Duo and Core 2 Quad processors; Windows: Microsoft Vista* support; Linux: connection to Intel compiler analysis and an intuitive hotspot navigator. Supports Windows*, Linux*, Mac*; IA32, Intel64, IA64; multicore. Testimonial: "The Intel VTune Performance Analyzer took a multi-day task and turned it into a subday task." Randy Camp, VP, Software R&D, MUSICMATCH Inc.

18 VTune Performance Analyzer helps you identify and characterize performance issues by: collecting performance data from the system running your application (sampling, event-based or time-based; call graph; counter monitor); organizing and displaying the data in a variety of views, from system-wide down to the source code or processor instruction perspective; offering both a GUI and a CLI. VTune Analyzer Driver Kit: rebuild the VTune Analyzer Linux driver for non-standard kernels; Red Hat, SuSE, and Red Flag distributions are supported.

19 What can you profile with VTune? Windows/Linux applications; stand-alone Win* DLLs; stand-alone COM+ DLLs; Java applications; .NET* applications; ASP.NET applications.

20 Performance Analysis Technologies. Sampling identifies performance bottlenecks: interrupt-based sampling using CPU registers (PMU); lower overhead. Call graph examines the flow of control through the application: which functions took the longest, which functions were blocked the longest, and the critical path of the calling sequence; higher overhead, more data.

21 Performance Hotspots: Sampling. Sampling captures the CPU's execution context by periodically interrupting the processor. Time-based: triggered at a certain time interval. Event-based: triggered by the occurrence of certain events. Each sample collects the execution context: the execution address in memory (CS:IP), the operating system process and thread ID, and the executable module loaded at that address. If you have symbols for the module, post-processing can identify the function or method at the memory address, and line numbers from the symbol file can direct you to the relevant line of source code. Sampling can measure performance-sensitive CPU events such as cache misses and branch mispredictions.
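The post-processing step, mapping sampled addresses back to functions, can be sketched as a lookup in a sorted symbol table. This is an illustration of the idea only, not VTune's implementation; the `Symbol` struct and the functions `resolve` and `hotspot_histogram` are hypothetical. As a simplification, the table stores only start addresses, so an address past the last function is attributed to that last function.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <iterator>
#include <map>
#include <string>
#include <vector>

// One entry of a (hypothetical) symbol table: function start address and name.
struct Symbol {
    std::uint64_t start;
    std::string name;
};

// Returns the name of the function whose start address is the largest one
// not exceeding `address`. The table must be sorted by start address.
std::string resolve(const std::vector<Symbol>& symbols, std::uint64_t address) {
    auto it = std::upper_bound(
        symbols.begin(), symbols.end(), address,
        [](std::uint64_t addr, const Symbol& s) { return addr < s.start; });
    if (it == symbols.begin()) return "<unknown>";  // below the first function
    return std::prev(it)->name;
}

// Counts how many samples fall into each function: the hotspot list.
std::map<std::string, int> hotspot_histogram(
        const std::vector<Symbol>& symbols,
        const std::vector<std::uint64_t>& samples) {
    assert(std::is_sorted(symbols.begin(), symbols.end(),
        [](const Symbol& a, const Symbol& b) { return a.start < b.start; }));
    std::map<std::string, int> histogram;
    for (std::uint64_t address : samples) {
        ++histogram[resolve(symbols, address)];
    }
    return histogram;
}
```

The function with the most samples is the likely hotspot, because a periodic interrupt lands most often where the program spends most of its time.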

22 Sampling: Module of Interest

23 Sampling over time

24 Sampling: Source View

25 Programmatic flow of control: Call Graph. An instrumentation-based technology with some performance degradation. The binary is instrumented; no special build is needed. Identifies function-to-function calling sequences. Reports statistics for each called function: execution time, blocked time, calling sequences and frequency of occurrence.

26 Call graph: application workflow. Filter the view by self time. The red lines show the critical path, which is the most time-consuming call path; it is based on self time. Bright orange nodes indicate functions with the highest self time. Intel, VTune, and the Intel logo are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States or other countries. *Other brands and names are the property of their respective owners.
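Self time can be made concrete with a small sketch. It assumes the common definition that a function's total time includes the time spent in its callees, so that self time is the total time minus the callees' total time; the exact definitions used by VTune may differ. The `CallNode` struct and the functions `self_time` and `hottest` are hypothetical.

```cpp
#include <string>
#include <vector>

// A node of a (hypothetical) call tree. total_time is assumed to include
// the time spent in all callees.
struct CallNode {
    std::string name;
    double total_time;
    std::vector<CallNode> callees;
};

// Self time: time spent in the function itself, excluding its callees.
double self_time(const CallNode& node) {
    double in_callees = 0.0;
    for (const CallNode& callee : node.callees) {
        in_callees += callee.total_time;
    }
    return node.total_time - in_callees;
}

// Returns the node with the highest self time in the tree rooted at `root`
// (the kind of node a call graph view would highlight).
const CallNode& hottest(const CallNode& root) {
    const CallNode* best = &root;
    for (const CallNode& callee : root.callees) {
        const CallNode& candidate = hottest(callee);
        if (self_time(candidate) > self_time(*best)) best = &candidate;
    }
    return *best;
}
```

A function with a large total time but a small self time is mostly waiting on its callees, so tuning effort is usually better spent further down the call path.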

27 Graph Navigation Window. Use the graph navigation window for an overview of the entire call graph.

28 VTune Tuning Assistant. For more detail, click the hyperlink.

29 ICC optimization report. VTune displays optimization reports generated with ICC 9.0 and later. This simplifies performance optimization work when ICC and VTune are used together, and it handles all optimization phases supported by ICC.

30 Intel Thread Checker v3.1: confidently pinpoint threading errors. Features: detects challenging data races and deadlocks; pinpoints errors to the source code line; command-line interface for Windows and Linux; works on standard debug builds without recompiling; batch script integration for regression test runs. What's new: faster analysis through performance optimizations; Microsoft Vista* support. Supports Windows*, Linux*, Mac*; IA32, Intel64, IA64; multicore. Testimonial: "We couldn't have gotten the networking up and running as quickly and as efficiently without Thread Checker. Thread Checker is simply an awesome tool and we are not going to develop multi-threaded code without it." Doug Service, Dir. of Tech. Dev., and Chris Stark, Software Engineer, Ritual Entertainment.

31 Example: Not Quite Right

    #include <stdio.h>

    const long N = ;
    long primes[N], number_of_primes = 0;

    main()
    {
        printf( "Determining primes from 1-%d \n", N );
        primes[ number_of_primes++ ] = 2; // special case
        #pragma omp parallel for
        for ( long number = 3; number <= N; number += 2 )
        {
            long factor = 3;
            while ( number % factor ) factor += 2;
            if ( factor == number )
                primes[ number_of_primes++ ] = number;
        }
        printf( "Found %d primes\n", number_of_primes );
    }
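The problem in this example is that all threads execute `number_of_primes++` on a shared variable without synchronization, which is a data race. One possible repair, shown here as a sketch rather than as the presenter's own solution, is to protect the update with an OpenMP `critical` section. The function name `find_primes` and the use of `std::vector` are assumptions made to keep the sketch self-contained; compiled without OpenMP support, the pragmas are ignored and the code runs serially with the same result.

```cpp
#include <algorithm>
#include <vector>

// Returns all primes up to n (inclusive), in ascending order.
std::vector<long> find_primes(long n) {
    std::vector<long> primes;
    if (n >= 2) primes.push_back(2);  // special case, as in the slide

    #pragma omp parallel for
    for (long number = 3; number <= n; number += 2) {
        // Trial division by odd factors, unchanged from the slide.
        long factor = 3;
        while (number % factor) factor += 2;
        if (factor == number) {
            // Only one thread at a time may modify the shared container.
            #pragma omp critical
            primes.push_back(number);
        }
    }

    // Threads finish in an arbitrary order, so sort for a stable result.
    std::sort(primes.begin(), primes.end());
    return primes;
}
```

The critical section serializes only the rare insertions, so the expensive trial division still runs in parallel. A tool such as Intel Thread Checker is aimed at catching the original, unsynchronized version.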

32 Intel Thread Checker: key benefits. Detects challenging data races and deadlocks. Pinpoints errors to the source code line. Works on standard debug builds without recompiling. Recommends modules to instrument by usage (minimizing instrumentation overhead). Scriptable interface for test environment integration (enabling batch file runs). Supports 32-bit and 64-bit applications.

33 Intel Thread Checker workflow. Figure: Primes.exe passes through binary instrumentation, producing Primes.exe (instrumented) and instrumented DLLs; the runtime data collector writes the results to threadchecker.thr. Supported threading models: Win32* threads, POSIX* threads, OpenMP*.

34 Intel Thread Checker (screenshot): pinpoints source code

35 Intel Thread Profiler v3.1: pinpoints threading inefficiencies. Features: view the application concurrency level to ensure full core utilization; identify where thread-related overhead impacts performance; find out which created threads are active and which are inactive; included with VTune Analyzer for Windows*; support for the Threading Building Blocks API. What's new: easier to use (recall custom configuration settings); faster to use (user-selectable stack walking); Microsoft Vista* support. Supports Windows*, Linux*, Mac*; IA32, Intel64, IA64; multicore. Testimonial: "Intel Thread Profiler was very useful for analyzing bottlenecks in our threaded code. Thread Profiler quickly pinpointed problem areas and showed us the reasons for the slowdown, so we were able to restructure the code for better threaded performance." Martin Watt, Software Architect, Alias.

36 Performance Profile: Recap. Figure: speedup plotted against the number of threads. Possible causes for this scalability profile: 1. Insufficient parallel work. 2. Memory bandwidth limitations. 3. Synchronization overhead. 4. Load imbalance.
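The first cause, insufficient parallel work, can be quantified with Amdahl's law, which is not named on the slide but is the standard model for it: if a fraction p of the runtime can be parallelized over n threads, the speedup is at most 1 / ((1 - p) + p / n). The helper below is a hypothetical sketch of that formula; it ignores the other three causes on the list.

```cpp
#include <cassert>

// Upper bound on speedup according to Amdahl's law.
// parallel_fraction: share of the serial runtime that can run in parallel (0..1).
// threads: number of threads working on the parallel part.
double amdahl_speedup(double parallel_fraction, int threads) {
    assert(parallel_fraction >= 0.0 && parallel_fraction <= 1.0);
    assert(threads > 0);
    const double serial_fraction = 1.0 - parallel_fraction;
    return 1.0 / (serial_fraction + parallel_fraction / threads);
}
```

For example, with half of the runtime parallelized, the speedup can never reach 2 no matter how many threads are added, which produces the flattening curve typical of such a scalability profile.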

37 Intel Thread Profiler: key benefits. Shows how much of your application is not optimally parallel, and where. Identifies where thread-specific overhead impacts performance. Highlights thread workload imbalances and thread activity. Shows the number of cores utilized. Pinpoints issues to the source code line.

38 Intel Thread Profiler (screenshot): pinpoints inefficiencies

39 Intel Threading Building Blocks: a kind of STL for parallel C++ programming. You specify task patterns instead of threads: the library maps user-defined logical tasks onto physical threads, using the cache efficiently and balancing load, with full support for nested parallelism. Targets threading for robust performance: designed to provide scalable, portable performance for computationally intense portions of shrink-wrapped applications. Compatible with other threading packages: designed to work well for CPU-bound computation, not for I/O-bound or real-time work; the library can be used in concert with other threading packages such as native threads and OpenMP. Emphasizes scalable, data-parallel programming: solutions based on functional decomposition usually do not scale.

40 An Example using ParallelFor: independent iterations and fixed/known bounds

    const int N = ;

    void change_array(float *array, int M) {
        for (int i = 0; i < M; i++) {
            array[i] *= 2;
        }
    }

    int main () {
        float A[N];
        initialize_array(A);
        change_array(A, N);
        return 0;
    }

41 An Example using ParallelFor: include and initialize the library. Slide color legend: blue = original code, red = provided by TBB, black = boilerplate for library.

    // Include library headers
    #include <tbb/taskschedulerinit.h>
    #include <tbb/blockedrange.h>
    #include <tbb/parallelfor.h>

    // Use namespace
    using namespace ThreadingBuildingBlocks;

    int main () {
        TaskSchedulerInit init;   // Initialize scheduler
        float A[N];
        initialize_array(A);
        parallel_change_array(A, N);
        return 0;
    }

42 An Example using ParallelFor: use the ParallelFor pattern. Slide color legend: blue = original code, red = provided by TBB, black = boilerplate for library.

    // Define task
    class ChangeArrayBody {
        float *array;
    public:
        ChangeArrayBody (float *a): array(a) {}
        void operator()( const BlockedRange<int>& r ) const {
            for (int i = r.begin(); i != r.end(); i++) {
                array[i] *= 2;
            }
        }
    };

    void parallel_change_array(float *array, int M) {
        // Use pattern; IdealGrainSize establishes the grain size
        ParallelFor (BlockedRange<int>(0, M, IdealGrainSize), ChangeArrayBody(array));
    }
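The idea behind a blocked range and a grain size can be sketched without the library. The code below is not TBB and does not use its API: it is a simplified stand-in built on standard C++11 threads, and the names `chunked_parallel_for` and the three-argument `parallel_change_array` are hypothetical. It starts one thread per chunk, whereas TBB maps chunks onto a pool of worker threads through its task scheduler.

```cpp
#include <algorithm>
#include <cassert>
#include <thread>
#include <vector>

// Splits [0, n) into chunks of at most `grain` elements and calls
// body(begin, end) for each chunk, one thread per chunk.
template <typename Body>
void chunked_parallel_for(int n, int grain, Body body) {
    assert(grain > 0);
    std::vector<std::thread> workers;
    for (int begin = 0; begin < n; begin += grain) {
        const int end = std::min(begin + grain, n);
        workers.emplace_back(body, begin, end);
    }
    for (auto& w : workers) w.join();
}

// Same loop body as in the slide, applied to one sub-range at a time.
void parallel_change_array(float* array, int m, int grain) {
    chunked_parallel_for(m, grain, [array](int begin, int end) {
        for (int i = begin; i != end; ++i) {
            array[i] *= 2;
        }
    });
}
```

The grain size trades scheduling overhead against load balance: a very small grain creates many tiny chunks, while a very large grain leaves some cores without work. Because the iterations are independent, no synchronization is needed inside the body.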

43 Intel Threading Building Blocks overview. Generic parallel algorithms: parallel_for, parallel_while, parallel_reduce, pipeline, parallel_sort, parallel_scan. Concurrent containers: concurrent_hash_map, concurrent_queue, concurrent_vector. Task scheduler. Low-level synchronization primitives: atomic, spin_mutex, queuing_mutex, spin_rw_mutex, mutex. Memory allocation: cache_aligned_allocator, scalable_allocator. Timing: tick_count.

44 Intel Threading Building Blocks: programming with Intel TBB vs. OS threads. Intel TBB offers cleaner design and competitive performance.

POSIX* threads, parallel work:

    void parallel_thread (void *arg)
    {
      int y1, y2;
      while (schedule_thread_work (y1, y2)) {
        for (int y = y1; y <= y2; y++) {
          for (int x = startx; x <= stopx; x++) {
            render_one_pixel (x, y);
          }
        }
        if (scene.displaymode == RT_DISPLAY_ENABLED) {
          pthread_mutex_lock (&MyMutex3);
          for (int y = y1; y <= y2; y++) {
            GraphicsDrawRow(startx-1, y-1, totalx,
                            (unsigned char *) &global_buffer[(y-starty)*totalx*3]);
          }
          pthread_mutex_unlock (&MyMutex3);
        }
      }
    }

POSIX* threads, data decomposition:

    const int MINPATCH = 150;
    const int DIVFACTOR = 2;

    typedef struct work_queue_entry_s {
      patch pch;
      struct work_queue_entry_s *next;
    } work_queue_entry_t;

    work_queue_entry_t *work_queue_head = NULL;
    work_queue_entry_t *work_queue_tail = NULL;

    void generate_work (patch* pchin)
    {
      int startx, stopx, starty, stopy;
      int xs, ys;
      startx = pchin->startx;
      stopx = pchin->stopx;
      starty = pchin->starty;
      stopy = pchin->stopy;
      if (((stopx-startx) >= MINPATCH) || ((stopy-starty) >= MINPATCH)) {
        int xpatchsize = (stopx-startx)/DIVFACTOR + 1;
        int ypatchsize = (stopy-starty)/DIVFACTOR + 1;
        for (ys = starty; ys <= stopy; ys += ypatchsize)
          for (xs = startx; xs <= stopx; xs += xpatchsize) {
            patch pch;
            pch.startx = xs;
            pch.starty = ys;
            pch.stopx = MIN(xs+xpatchsize-1, stopx);
            pch.stopy = MIN(ys+ypatchsize-1, stopy);
            generate_work (&pch);
          }
      } else {
        /* just trace this patch */
        work_queue_entry_t *q = (work_queue_entry_t *) malloc (sizeof (work_queue_entry_t));
        q->pch.starty = starty;
        q->pch.stopy = stopy;
        q->pch.startx = startx;
        q->pch.stopx = stopx;
        q->next = NULL;
        if (work_queue_head == NULL) {
          work_queue_head = q;
        } else {
          work_queue_tail->next = q;
        }
        work_queue_tail = q;
      }
    }

    void generate_worklist (void)
    {
      patch pch;
      pch.startx = startx;
      pch.stopx = stopx;
      pch.starty = starty;
      pch.stopy = stopy;
      generate_work (&pch);
    }

    bool schedule_thread_work (patch &pch)
    {
      pthread_mutex_lock (&MyMutex3);
      work_queue_entry_t *q = work_queue_head;
      if (q != NULL) {
        pch = q->pch;
        work_queue_head = work_queue_head->next;
      }
      pthread_mutex_unlock (&MyMutex3);
      return (q != NULL);
    }

    generate_worklist ();

Intel TBB, parallel work and data decomposition:

    #include "tbb/parallelfor.h"
    #include "tbb/blockedrange2d.h"

    class parallel_task {
    public:
      void operator() (const TBB::BlockedRange2D<int> &r) const {
        for (int y = r.rows().begin(); y != r.rows().end(); ++y) {
          for (int x = r.cols().begin(); x != r.cols().end(); x++) {
            render_one_pixel (x, y);
          }
        }
        if (scene.displaymode == RT_DISPLAY_ENABLED) {
          TBB::SpinMutex::scoped_lock lock (MyMutex2);
          for (int y = r.rows().begin(); y != r.rows().end(); ++y) {
            GraphicsDrawRow(startx-1, y-1, totalx,
                            (unsigned char *) &global_buffer[(y-starty)*totalx*3]);
          }
        }
      }
      parallel_task () {}
    };

    ParallelFor (TBB::BlockedRange2D<int> (starty, stopy + 1, grain_size,
                                           startx, stopx + 1, grain_size),
                 parallel_task ());

45 Performance Libraries

46 Intel Math Kernel Library 8.1 (MKL). Multi-core ready: thread safe, with excellent scaling on multiprocessor systems. Automatic runtime processor detection. Support for C and Fortran interfaces. Support for all Intel processors in one package. Royalty-free distribution rights. Contents: BLAS, LAPACK, ScaLAPACK, sparse solvers, fast Fourier transforms, vector math. Supports Windows*, Linux*, Mac OS*; 64-bit; multicore; AMD*. Testimonial: "By adopting the Intel MKL DGEMM libraries, our standard benchmarks timing improved between 43% and 71%, which is very impressive." Matt Dunbar, Software Developer, ABAQUS, Inc.

47 Intel Integrated Performance Primitives (IPP). Figure: application source code calls the Intel IPP library through its C/C++ API; the library is linked statically or dynamically to Intel IPP processor-optimized binaries. Intel IPP usage code samples (rapid application development): sample video/audio/speech codecs, image processing and JPEG, signal processing, data compression, .NET and Java* integration. Intel IPP library C/C++ API (cross-platform compatibility and code re-use): video coding, audio coding, speech coding, speech recognition, data compression, cryptography, matrix maths, signal processing, image processing, JPEG coding, computer vision, image colour conversion, string processing, vector maths. Intel IPP processor-optimized binaries (outstanding performance): Intel Core Duo and Core Solo processors, Intel Pentium D dual-core processors, Intel Pentium M processors, Intel Pentium 4 processors, Intel Xeon processors, Intel Itanium 2 processors, Intel XScale technology-based processors. Supports Windows*, Linux*, Mac*; 64-bit; multicore; AMD*. Testimonial: "The Intel IPP [Intel Integrated Performance Primitives] is the fastest image processing library we've found, resulting in much greater interactivity and creative freedom for our users." Bruce Rady, President, RadTIME, Inc.

48 Intel Cluster Tools: optimize MPI-based applications

49 What Are the Biggest Bottlenecks Today in Creating Parallel Applications? Source: Developing Custom Parallel Computing Applications, Simon Management Group, September

50 Intel Cluster Toolkit: boost development and performance of cluster applications. Universal MPI library runs cluster applications on all networks. Leading cluster development environment to efficiently create, analyze, optimize, and deploy parallel applications. Ready to support dual-core and multi-core clusters. Intel Cluster Toolkit 3.0, a full-featured MPI tools environment: Intel MPI Library 3.0; Intel Trace Analyzer and Collector 7.0; Intel Math Kernel Library Cluster Edition 9.0; Intel MPI Benchmarks.

51 Summary. Intel Software Development Products lead the way with support for the latest operating systems and multi-core processors. Intel VTune Analyzer v9.0: Intel Core 2 processor event support and hotspot navigator. Intel Thread Profiler and Checker v3.1: speed and usability improvements. Intel Threading Building Blocks (Intel TBB) v1.1: automatic grain sizes. Intel Performance Libraries: speed improvements. Intel Cluster Tools v3.0: build and optimize MPI-based applications.

52 Any Questions?


More information

Intel(R) Threading Building Blocks

Intel(R) Threading Building Blocks Getting Started Guide Intel Threading Building Blocks is a runtime-based parallel programming model for C++ code that uses threads. It consists of a template-based runtime library to help you harness the

More information

Microarchitectural Analysis with Intel VTune Amplifier XE

Microarchitectural Analysis with Intel VTune Amplifier XE Microarchitectural Analysis with Intel VTune Amplifier XE Michael Klemm Software & Services Group Developer Relations Division 1 Legal Disclaimer INFORMATION IN THIS DOCUMENT IS PROVIDED IN CONNECTION

More information

PROGRAMOVÁNÍ V C++ CVIČENÍ. Michal Brabec

PROGRAMOVÁNÍ V C++ CVIČENÍ. Michal Brabec PROGRAMOVÁNÍ V C++ CVIČENÍ Michal Brabec PARALLELISM CATEGORIES CPU? SSE Multiprocessor SIMT - GPU 2 / 17 PARALLELISM V C++ Weak support in the language itself, powerful libraries Many different parallelization

More information

Parallel Programming Models

Parallel Programming Models Parallel Programming Models Intel Cilk Plus Tasking Intel Threading Building Blocks, Copyright 2009, Intel Corporation. All rights reserved. Copyright 2015, 2011, Intel Corporation. All rights reserved.

More information

Memory & Thread Debugger

Memory & Thread Debugger Memory & Thread Debugger Here is What Will Be Covered Overview Memory/Thread analysis New Features Deep dive into debugger integrations Demo Call to action Intel Confidential 2 Analysis Tools for Diagnosis

More information

Table of Contents. Cilk

Table of Contents. Cilk Table of Contents 212 Introduction to Parallelism Introduction to Programming Models Shared Memory Programming Message Passing Programming Shared Memory Models Cilk TBB HPF Chapel Fortress Stapl PGAS Languages

More information

Overview of Intel Parallel Studio XE

Overview of Intel Parallel Studio XE Overview of Intel Parallel Studio XE Stephen Blair-Chappell 1 30-second pitch Intel Parallel Studio XE 2011 Advanced Application Performance What Is It? Suite of tools to develop high performing, robust

More information

Parallel Programming. Jin-Soo Kim Computer Systems Laboratory Sungkyunkwan University

Parallel Programming. Jin-Soo Kim Computer Systems Laboratory Sungkyunkwan University Parallel Programming Jin-Soo Kim (jinsookim@skku.edu) Computer Systems Laboratory Sungkyunkwan University http://csl.skku.edu Challenges Difficult to write parallel programs Most programmers think sequentially

More information

Intel Threading Building Blocks (Intel TBB) 2.1. In-Depth

Intel Threading Building Blocks (Intel TBB) 2.1. In-Depth Intel Threading Building Blocks (Intel TBB) 2.1 In-Depth Contents Intel Threading Building Blocks (Intel TBB) 2.1........... 3 Features................................................ 3 New in this Release.....................................

More information

Concurrency, Thread. Dongkun Shin, SKKU

Concurrency, Thread. Dongkun Shin, SKKU Concurrency, Thread 1 Thread Classic view a single point of execution within a program a single PC where instructions are being fetched from and executed), Multi-threaded program Has more than one point

More information

Eliminate Memory Errors to Improve Program Stability

Eliminate Memory Errors to Improve Program Stability Introduction INTEL PARALLEL STUDIO XE EVALUATION GUIDE This guide will illustrate how Intel Parallel Studio XE memory checking capabilities can find crucial memory defects early in the development cycle.

More information

Oracle Developer Studio Performance Analyzer

Oracle Developer Studio Performance Analyzer Oracle Developer Studio Performance Analyzer The Oracle Developer Studio Performance Analyzer provides unparalleled insight into the behavior of your application, allowing you to identify bottlenecks and

More information

Intel Visual Fortran Compiler Professional Edition 11.0 for Windows* In-Depth

Intel Visual Fortran Compiler Professional Edition 11.0 for Windows* In-Depth Intel Visual Fortran Compiler Professional Edition 11.0 for Windows* In-Depth Contents Intel Visual Fortran Compiler Professional Edition for Windows*........................ 3 Features...3 New in This

More information

Chapter 4: Threads. Chapter 4: Threads

Chapter 4: Threads. Chapter 4: Threads Chapter 4: Threads Silberschatz, Galvin and Gagne 2013 Chapter 4: Threads Overview Multicore Programming Multithreading Models Thread Libraries Implicit Threading Threading Issues Operating System Examples

More information

Graphics Performance Analyzer for Android

Graphics Performance Analyzer for Android Graphics Performance Analyzer for Android 1 What you will learn from this slide deck Detailed optimization workflow of Graphics Performance Analyzer Android* System Analysis Only Please see subsequent

More information

CSE 4/521 Introduction to Operating Systems

CSE 4/521 Introduction to Operating Systems CSE 4/521 Introduction to Operating Systems Lecture 5 Threads (Overview, Multicore Programming, Multithreading Models, Thread Libraries, Implicit Threading, Operating- System Examples) Summer 2018 Overview

More information

HPC Tools on Windows. Christian Terboven Center for Computing and Communication RWTH Aachen University.

HPC Tools on Windows. Christian Terboven Center for Computing and Communication RWTH Aachen University. - Excerpt - Christian Terboven terboven@rz.rwth-aachen.de Center for Computing and Communication RWTH Aachen University PPCES March 25th, RWTH Aachen University Agenda o Intel Trace Analyzer and Collector

More information

Intel PerfMon Performance Monitoring Hardware

Intel PerfMon Performance Monitoring Hardware Intel PerfMon Performance Monitoring Hardware Overview PerfMon Basics PerfMon is hardware throughout the silicon available through registers to tools to facilitate several system/application usages: compiler

More information

OPERATING SYSTEM. Chapter 4: Threads

OPERATING SYSTEM. Chapter 4: Threads OPERATING SYSTEM Chapter 4: Threads Chapter 4: Threads Overview Multicore Programming Multithreading Models Thread Libraries Implicit Threading Threading Issues Operating System Examples Objectives To

More information

Intel VTune Amplifier XE for Tuning of HPC Applications Intel Software Developer Conference Frankfurt, 2017 Klaus-Dieter Oertel, Intel

Intel VTune Amplifier XE for Tuning of HPC Applications Intel Software Developer Conference Frankfurt, 2017 Klaus-Dieter Oertel, Intel Intel VTune Amplifier XE for Tuning of HPC Applications Intel Software Developer Conference Frankfurt, 2017 Klaus-Dieter Oertel, Intel Agenda Which performance analysis tool should I use first? Intel Application

More information

Intel(R) Threading Building Blocks

Intel(R) Threading Building Blocks Getting Started Guide Intel Threading Building Blocks is a runtime-based parallel programming model for C++ code that uses threads. It consists of a template-based runtime library to help you harness the

More information

Threaded Programming. Lecture 9: Alternatives to OpenMP

Threaded Programming. Lecture 9: Alternatives to OpenMP Threaded Programming Lecture 9: Alternatives to OpenMP What s wrong with OpenMP? OpenMP is designed for programs where you want a fixed number of threads, and you always want the threads to be consuming

More information

Maximizing performance and scalability using Intel performance libraries

Maximizing performance and scalability using Intel performance libraries Maximizing performance and scalability using Intel performance libraries Roger Philp Intel HPC Software Workshop Series 2016 HPC Code Modernization for Intel Xeon and Xeon Phi February 17 th 2016, Barcelona

More information

Thread Profiler 2.0 Release Notes

Thread Profiler 2.0 Release Notes Thread Profiler 2.0 Release Notes Contents 1. Overview 2. Package 3. New Features 4. Requirements 5. Installation 6. Usage 7. Supported C Run-Time and Windows* APIs 8. Technical Support and Feedback 1.

More information

Introduction to Parallel Performance Engineering

Introduction to Parallel Performance Engineering Introduction to Parallel Performance Engineering Markus Geimer, Brian Wylie Jülich Supercomputing Centre (with content used with permission from tutorials by Bernd Mohr/JSC and Luiz DeRose/Cray) Performance:

More information

Intel Parallel Amplifier Sample Code Guide

Intel Parallel Amplifier Sample Code Guide The analyzes the performance of your application and provides information on the performance bottlenecks in your code. It enables you to focus your tuning efforts on the most critical sections of your

More information

Performance analysis tools: Intel VTuneTM Amplifier and Advisor. Dr. Luigi Iapichino

Performance analysis tools: Intel VTuneTM Amplifier and Advisor. Dr. Luigi Iapichino Performance analysis tools: Intel VTuneTM Amplifier and Advisor Dr. Luigi Iapichino luigi.iapichino@lrz.de Which tool do I use in my project? A roadmap to optimisation After having considered the MPI layer,

More information

Eliminate Memory Errors to Improve Program Stability

Eliminate Memory Errors to Improve Program Stability Eliminate Memory Errors to Improve Program Stability This guide will illustrate how Parallel Studio memory checking capabilities can find crucial memory defects early in the development cycle. It provides

More information

Chapter 4: Threads. Chapter 4: Threads. Overview Multicore Programming Multithreading Models Thread Libraries Implicit Threading Threading Issues

Chapter 4: Threads. Chapter 4: Threads. Overview Multicore Programming Multithreading Models Thread Libraries Implicit Threading Threading Issues Chapter 4: Threads Silberschatz, Galvin and Gagne 2013 Chapter 4: Threads Overview Multicore Programming Multithreading Models Thread Libraries Implicit Threading Threading Issues 4.2 Silberschatz, Galvin

More information

Using Intel Inspector XE 2011 with Fortran Applications

Using Intel Inspector XE 2011 with Fortran Applications Using Intel Inspector XE 2011 with Fortran Applications Jackson Marusarz Intel Corporation Legal Disclaimer INFORMATION IN THIS DOCUMENT IS PROVIDED IN CONNECTION WITH INTEL PRODUCTS. NO LICENSE, EXPRESS

More information

Intel Thread Building Blocks, Part II

Intel Thread Building Blocks, Part II Intel Thread Building Blocks, Part II SPD course 2013-14 Massimo Coppola 25/03, 16/05/2014 1 TBB Recap Portable environment Based on C++11 standard compilers Extensive use of templates No vectorization

More information

Introduction to Intel Xeon Phi programming techniques. Fabio Affinito Vittorio Ruggiero

Introduction to Intel Xeon Phi programming techniques. Fabio Affinito Vittorio Ruggiero Introduction to Intel Xeon Phi programming techniques Fabio Affinito Vittorio Ruggiero Outline High level overview of the Intel Xeon Phi hardware and software stack Intel Xeon Phi programming paradigms:

More information

EI 338: Computer Systems Engineering (Operating Systems & Computer Architecture)

EI 338: Computer Systems Engineering (Operating Systems & Computer Architecture) EI 338: Computer Systems Engineering (Operating Systems & Computer Architecture) Dept. of Computer Science & Engineering Chentao Wu wuct@cs.sjtu.edu.cn Download lectures ftp://public.sjtu.edu.cn User:

More information

Intel Thread Checker 3.1 for Windows* Release Notes

Intel Thread Checker 3.1 for Windows* Release Notes Page 1 of 6 Intel Thread Checker 3.1 for Windows* Release Notes Contents Overview Product Contents What's New System Requirements Known Issues and Limitations Technical Support Related Products Overview

More information

Intel Parallel Amplifier 2011

Intel Parallel Amplifier 2011 THREADING AND PERFORMANCE PROFILER Intel Parallel Amplifier 2011 Product Brief Intel Parallel Amplifier 2011 Optimize Performance and Scalability Intel Parallel Amplifier 2011 makes it simple to quickly

More information

CUDA GPGPU Workshop 2012

CUDA GPGPU Workshop 2012 CUDA GPGPU Workshop 2012 Parallel Programming: C thread, Open MP, and Open MPI Presenter: Nasrin Sultana Wichita State University 07/10/2012 Parallel Programming: Open MP, MPI, Open MPI & CUDA Outline

More information

Getting Started with Intel SDK for OpenCL Applications

Getting Started with Intel SDK for OpenCL Applications Getting Started with Intel SDK for OpenCL Applications Webinar #1 in the Three-part OpenCL Webinar Series July 11, 2012 Register Now for All Webinars in the Series Welcome to Getting Started with Intel

More information

Computer Systems A Programmer s Perspective 1 (Beta Draft)

Computer Systems A Programmer s Perspective 1 (Beta Draft) Computer Systems A Programmer s Perspective 1 (Beta Draft) Randal E. Bryant David R. O Hallaron August 1, 2001 1 Copyright c 2001, R. E. Bryant, D. R. O Hallaron. All rights reserved. 2 Contents Preface

More information

Profiling: Understand Your Application

Profiling: Understand Your Application Profiling: Understand Your Application Michal Merta michal.merta@vsb.cz 1st of March 2018 Agenda Hardware events based sampling Some fundamental bottlenecks Overview of profiling tools perf tools Intel

More information

Shared memory programming model OpenMP TMA4280 Introduction to Supercomputing

Shared memory programming model OpenMP TMA4280 Introduction to Supercomputing Shared memory programming model OpenMP TMA4280 Introduction to Supercomputing NTNU, IMF February 16. 2018 1 Recap: Distributed memory programming model Parallelism with MPI. An MPI execution is started

More information

Scientific Programming in C XIV. Parallel programming

Scientific Programming in C XIV. Parallel programming Scientific Programming in C XIV. Parallel programming Susi Lehtola 11 December 2012 Introduction The development of microchips will soon reach the fundamental physical limits of operation quantum coherence

More information

Introduction to parallel computers and parallel programming. Introduction to parallel computersand parallel programming p. 1

Introduction to parallel computers and parallel programming. Introduction to parallel computersand parallel programming p. 1 Introduction to parallel computers and parallel programming Introduction to parallel computersand parallel programming p. 1 Content A quick overview of morden parallel hardware Parallelism within a chip

More information

Performance Analysis of Parallel Scientific Applications In Eclipse

Performance Analysis of Parallel Scientific Applications In Eclipse Performance Analysis of Parallel Scientific Applications In Eclipse EclipseCon 2015 Wyatt Spear, University of Oregon wspear@cs.uoregon.edu Supercomputing Big systems solving big problems Performance gains

More information

Intel Performance Libraries

Intel Performance Libraries Intel Performance Libraries Powerful Mathematical Library Intel Math Kernel Library (Intel MKL) Energy Science & Research Engineering Design Financial Analytics Signal Processing Digital Content Creation

More information

Intel profiling tools and roofline model. Dr. Luigi Iapichino

Intel profiling tools and roofline model. Dr. Luigi Iapichino Intel profiling tools and roofline model Dr. Luigi Iapichino luigi.iapichino@lrz.de Which tool do I use in my project? A roadmap to optimization (and to the next hour) We will focus on tools developed

More information

KNL tools. Dr. Fabio Baruffa

KNL tools. Dr. Fabio Baruffa KNL tools Dr. Fabio Baruffa fabio.baruffa@lrz.de 2 Which tool do I use? A roadmap to optimization We will focus on tools developed by Intel, available to users of the LRZ systems. Again, we will skip the

More information

Pablo Halpern Parallel Programming Languages Architect Intel Corporation

Pablo Halpern Parallel Programming Languages Architect Intel Corporation Pablo Halpern Parallel Programming Languages Architect Intel Corporation CppCon, 8 September 2014 This work by Pablo Halpern is licensed under a Creative Commons Attribution

More information

Tutorial: Analyzing MPI Applications. Intel Trace Analyzer and Collector Intel VTune Amplifier XE

Tutorial: Analyzing MPI Applications. Intel Trace Analyzer and Collector Intel VTune Amplifier XE Tutorial: Analyzing MPI Applications Intel Trace Analyzer and Collector Intel VTune Amplifier XE Contents Legal Information... 3 1. Overview... 4 1.1. Prerequisites... 5 1.1.1. Required Software... 5 1.1.2.

More information

Agenda Process Concept Process Scheduling Operations on Processes Interprocess Communication 3.2

Agenda Process Concept Process Scheduling Operations on Processes Interprocess Communication 3.2 Lecture 3: Processes Agenda Process Concept Process Scheduling Operations on Processes Interprocess Communication 3.2 Process in General 3.3 Process Concept Process is an active program in execution; process

More information

Maximize Performance and Scalability of RADIOSS* Structural Analysis Software on Intel Xeon Processor E7 v2 Family-Based Platforms

Maximize Performance and Scalability of RADIOSS* Structural Analysis Software on Intel Xeon Processor E7 v2 Family-Based Platforms Maximize Performance and Scalability of RADIOSS* Structural Analysis Software on Family-Based Platforms Executive Summary Complex simulations of structural and systems performance, such as car crash simulations,

More information

Che-Wei Chang Department of Computer Science and Information Engineering, Chang Gung University

Che-Wei Chang Department of Computer Science and Information Engineering, Chang Gung University Che-Wei Chang chewei@mail.cgu.edu.tw Department of Computer Science and Information Engineering, Chang Gung University 1. Introduction 2. System Structures 3. Process Concept 4. Multithreaded Programming

More information

Intel Threading Building Blocks (TBB)

Intel Threading Building Blocks (TBB) Intel Threading Building Blocks (TBB) SDSC Summer Institute 2012 Pietro Cicotti Computational Scientist Gordon Applications Team Performance Modeling and Characterization Lab Parallelism and Decomposition

More information

Chapter 4: Multithreaded Programming

Chapter 4: Multithreaded Programming Chapter 4: Multithreaded Programming Silberschatz, Galvin and Gagne 2013 Chapter 4: Multithreaded Programming Overview Multicore Programming Multithreading Models Thread Libraries Implicit Threading Threading

More information

AUTOMATIC SMT THREADING

AUTOMATIC SMT THREADING AUTOMATIC SMT THREADING FOR OPENMP APPLICATIONS ON THE INTEL XEON PHI CO-PROCESSOR WIM HEIRMAN 1,2 TREVOR E. CARLSON 1 KENZO VAN CRAEYNEST 1 IBRAHIM HUR 2 AAMER JALEEL 2 LIEVEN EECKHOUT 1 1 GHENT UNIVERSITY

More information

Jackson Marusarz Intel Corporation

Jackson Marusarz Intel Corporation Jackson Marusarz Intel Corporation Intel VTune Amplifier Quick Introduction Get the Data You Need Hotspot (Statistical call tree), Call counts (Statistical) Thread Profiling Concurrency and Lock & Waits

More information

Chapter 4: Threads. Operating System Concepts 9 th Edition

Chapter 4: Threads. Operating System Concepts 9 th Edition Chapter 4: Threads Silberschatz, Galvin and Gagne 2013 Chapter 4: Threads Overview Multicore Programming Multithreading Models Thread Libraries Implicit Threading Threading Issues Operating System Examples

More information

Moore s Law. Multicore Programming. Vendor Solution. Power Density. Parallelism and Performance MIT Lecture 11 1.

Moore s Law. Multicore Programming. Vendor Solution. Power Density. Parallelism and Performance MIT Lecture 11 1. Moore s Law 1000000 Intel CPU Introductions 6.172 Performance Engineering of Software Systems Lecture 11 Multicore Programming Charles E. Leiserson 100000 10000 1000 100 10 Clock Speed (MHz) Transistors

More information

Tools for Intel Xeon Phi: VTune & Advisor Dr. Fabio Baruffa - LRZ,

Tools for Intel Xeon Phi: VTune & Advisor Dr. Fabio Baruffa - LRZ, Tools for Intel Xeon Phi: VTune & Advisor Dr. Fabio Baruffa - fabio.baruffa@lrz.de LRZ, 27.6.- 29.6.2016 Architecture Overview Intel Xeon Processor Intel Xeon Phi Coprocessor, 1st generation Intel Xeon

More information

Shared memory parallel computing. Intel Threading Building Blocks

Shared memory parallel computing. Intel Threading Building Blocks Shared memory parallel computing Intel Threading Building Blocks Introduction & history Threading Building Blocks (TBB) cross platform C++ template lib for task-based shared memory parallel programming

More information