Intel Tools for Parallel Programming, Windows HPC, RWTH Aachen 2007
1 Intel Tools for Parallel Programming. Windows HPC, RWTH Aachen 2007. Dr. Mario Deilmann, Intel Compiler Group
2 Processor Evolution (x86)
- Single core: Intel Xeon processor
- Dual core: Dual-Core Intel Xeon processor 5100 series
- Quad core: new Quad-Core Intel Xeon 5300 for 2006
3 Parallel computing is omnipresent
- Over the next few years, all computers will be parallel computers in some form: servers, laptops, cell phones.
- What about software?
- Herb Sutter of Microsoft wrote in Dr. Dobb's Journal ("The Free Lunch Is Over: A Fundamental Turn Toward Concurrency in Software") that software performance will no longer increase from one hardware generation to the next unless the software is parallel.
4 Intel Processor and Platform Evolution for the Next Decade
5 Independent programs vs. parallel programs
[Figure: independent programs (one application per process, separated by process switches) versus a parallel program (one application with several threads, separated by thread switches)]
Threads allow one application to utilize the power of multiple processors.
6 How do we make use of the additional cores and CPUs on the software side?
7 What is Parallelism?
- Two or more processes or threads execute at the same time.
- Multiple processes communicate through an inter-process protocol.
- A single process with multiple threads communicates through shared memory.
8 Process model
- Process: the OS creates a process for each program loaded.
- Master thread: additional threads can be created within the process.
- Threads share code and data (code segment, data segment).
- Each thread has its own registers and stack.
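The process model above can be illustrated with a small sketch. It assumes C++11 `std::thread`, which is not one of the APIs named in the slides, and the names `g_partial_sums`, `sum_range`, and `parallel_sum` are hypothetical. The global vector is visible to every thread, while `local_sum` lives on the stack of whichever thread is running the function.

```cpp
#include <cassert>
#include <thread>
#include <vector>

// Shared by every thread in the process (a global, not on any thread's stack).
static std::vector<long> g_partial_sums;

// Each thread runs this function on its own stack, so `local_sum` and `i`
// are private to the thread that is executing it.
void sum_range(int thread_id, long first, long last) {
    long local_sum = 0;
    for (long i = first; i < last; ++i) local_sum += i;
    // Each thread writes only to its own slot, so no lock is needed here.
    g_partial_sums[thread_id] = local_sum;
}

// Hypothetical helper: sums 0..n-1 using `num_threads` additional threads.
long parallel_sum(long n, int num_threads) {
    assert(num_threads > 0);
    g_partial_sums.assign(num_threads, 0);
    std::vector<std::thread> workers;
    const long chunk = n / num_threads;
    for (int t = 0; t < num_threads; ++t) {
        const long first = t * chunk;
        const long last = (t == num_threads - 1) ? n : first + chunk;
        workers.emplace_back(sum_range, t, first, last);
    }
    for (std::thread& w : workers) w.join();  // master thread waits for the others
    long total = 0;
    for (long s : g_partial_sums) total += s;
    return total;
}
```

Calling `parallel_sum(1000, 4)` from the master thread creates four threads that each fill one slot of the shared vector.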
9 Parallel Programming Models: Message Passing
- Create/fork multiple processes, typically one per node/core
- Explicit communication by sending messages: send(tid, tag, message), receive(tid, tag, message)
- Synchronization: block on messages, barriers
[Figure: nodes, each with a processor (P) and its own memory (M), connected by an interconnect]
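To make "explicit communication" and "block on messages" concrete, here is a minimal in-process sketch. It is not MPI or PVM: `Message` and `Mailbox` are hypothetical names, and the mailbox object stands in for the destination that the slide's `tid` argument would select.

```cpp
#include <cassert>
#include <condition_variable>
#include <deque>
#include <mutex>
#include <string>

// Hypothetical message: a tag plus a payload, mirroring send(tid, tag, message).
struct Message {
    int tag;
    std::string payload;
};

// Hypothetical in-process stand-in for a message-passing endpoint.
class Mailbox {
public:
    // send: enqueue a message and wake one blocked receiver.
    void send(int tag, const std::string& payload) {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            queue_.push_back(Message{tag, payload});
        }
        ready_.notify_one();
    }

    // receive: block until a message is available, then return the oldest one.
    Message receive() {
        std::unique_lock<std::mutex> lock(mutex_);
        ready_.wait(lock, [this] { return !queue_.empty(); });
        assert(!queue_.empty());
        Message m = queue_.front();
        queue_.pop_front();
        return m;
    }

private:
    std::mutex mutex_;
    std::condition_variable ready_;
    std::deque<Message> queue_;
};
```

The receiver never touches the sender's data directly; everything it sees arrives through `receive()`, which is the property that distinguishes this model from shared memory.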
10 Parallel Programming Models: Shared Memory
- Create one process with multiple threads, typically one per node/core
- Implicit communication using the shared address space (loads and stores)
- Synchronization: locks, atomic memory operators, barriers
[Figure: threads (T) of one process (P) connected to memory over a bus]
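Two of the synchronization mechanisms listed above, locks and atomic memory operations, can be compared in a short sketch. It assumes C++11 `std::atomic`, `std::mutex`, and `std::thread` rather than any API from the slides; `Counters`, `worker`, and `run_counters` are hypothetical names.

```cpp
#include <atomic>
#include <cassert>
#include <functional>
#include <mutex>
#include <thread>
#include <vector>

struct Counters {
    std::atomic<long> atomic_hits{0};  // updated with an atomic memory operation
    long locked_hits = 0;              // updated only while holding `lock`
    std::mutex lock;
};

// Every thread increments both counters in the shared address space.
void worker(Counters& c, int iterations) {
    for (int i = 0; i < iterations; ++i) {
        c.atomic_hits.fetch_add(1, std::memory_order_relaxed);
        std::lock_guard<std::mutex> guard(c.lock);
        ++c.locked_hits;
    }
}

// Hypothetical driver: returns true if no increment was lost.
bool run_counters(int num_threads, int iterations) {
    assert(num_threads > 0 && iterations >= 0);
    Counters c;
    std::vector<std::thread> threads;
    for (int t = 0; t < num_threads; ++t)
        threads.emplace_back(worker, std::ref(c), iterations);
    // After the joins every thread has finished, so the totals can be read safely.
    for (std::thread& t : threads) t.join();
    const long expected = static_cast<long>(num_threads) * iterations;
    return c.atomic_hits.load() == expected && c.locked_hits == expected;
}
```

Replacing either counter with a plain unprotected `long` would reintroduce lost updates, which is the kind of data race the later Thread Checker slides are about.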
11 Parallel programming approaches
- API / library: threads (Pthreads (POSIX), Win32* threading API), Intel Threading Building Blocks (C++), MPI, PVM
- Programming language mechanisms: Java*, C#, Erlang
- Programming language extensions: OpenMP (C, C++, Fortran), UPC (Unified Parallel C), Cilk (extension to GCC)
12 Multithreading introduces a new class of problems
- Developing threaded or MPI applications is hard, but new and advanced Intel architectures and software tools help to support these approaches.
- The interaction between threads introduces a new class of problems that are complicated and hard to find:
  - Correctness problems (data races)
  - Performance problems (contention)
  - Runtime problems (crashes)
- Pandora's box has been opened.
13 Performance versus effort
[Figure: code performance plotted against development effort (code restructuring); labels on the plot: serial optimization, OpenMP, threads, Cluster OpenMP, MPI]
- Within a core: theoretical speedup limited by the core
- Within a CPU: theoretical speedup limited by the number of cores per CPU
- Within a cluster: theoretical speedup limited by the number of CPUs per cluster
14 Intel Software Development Tools for Parallel Programming
15 How can we support parallel development?
- Analysis (serial / parallel): VTune Performance Analyzer, Intel Trace Analyzer
- Design (introduce or extend parallelism): Intel Performance Libraries (IPP and MKL), OpenMP* (Intel Compiler), Intel MPI
- Debug for correctness (data races, locks): Intel Thread Checker, Intel MPI Correctness Checker
- Tune for performance (bottlenecks): Intel Thread Profiler, VTune Performance Analyzer, Intel Trace Collector
16 Intel Threading Tools
17 Intel VTune Analyzer 9.0: identifies hard-to-find performance bottlenecks
- Features: tune process- or thread-parallel code; low-overhead sampling; graphical call graph; view results on source or assembly
- What's new: new tuning methodology; stall cycle accounting for Core 2 Duo and Core 2 Quad processors; Windows: Microsoft Vista* support; Linux: connection to Intel compiler analysis and intuitive hotspot navigator
- Platforms: Windows*, Linux*, Mac*, IA32, Intel64, IA64, multicore
- "The Intel VTune Performance Analyzer took a multi-day task and turned it into a subday task." Randy Camp, VP, Software R&D, MUSICMATCH Inc.
18 VTune Performance Analyzer
Helps you identify and characterize performance issues by:
- Collecting performance data from the system running your application: sampling (event-based or time-based), call graph, counter monitor
- Organizing and displaying the data in a variety of views, from system-wide down to the source code or processor instruction perspective
- GUI and CLI
- VTune Analyzer Driver Kit: rebuild the VTune Analyzer Linux driver for non-standard kernels; Red Hat, SuSE, and Red Flag distributions supported
19 What Can You Profile with VTune?
- Windows/Linux applications
- Stand-alone Win* DLLs
- Stand-alone COM+ DLLs
- Java applications
- .NET* applications
- ASP.NET applications
20 Performance Analysis Technologies
- Sampling: identify performance bottlenecks
  - Interrupt-based sampling using CPU registers (PMU)
  - Lower overhead
- Call graph: examine the flow of control through the application
  - Which functions took the longest
  - Which functions were blocked the longest
  - Calling sequence and critical path
  - Higher overhead, more data
21 Performance Hotspots: Sampling
- Samples the CPU's execution context by periodically interrupting the processor
  - Time-based: triggered at a certain time interval
  - Event-based: triggered by the occurrence of a certain event
- Collects the execution context
  - Execution address in memory (CS:IP)
  - Operating system process and thread ID
  - Executable module loaded at that address
- If you have symbols for the module, post-processing can identify the function or method at the memory address. Line numbers from the symbol file can direct you to the relevant line of source code.
- Can measure performance-sensitive CPU events: cache misses, branch mispredictions
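The post-processing step described above, mapping a sampled address back to a function, can be sketched as a lookup in a sorted symbol table. This is only an illustration of the idea, not how VTune is implemented; `Symbol` and `attribute_samples` are hypothetical names, and the table holds start addresses only, so a sample past the last function is attributed to that last function.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <iterator>
#include <map>
#include <string>
#include <vector>

// Hypothetical symbol-table entry: a function name and its start address.
struct Symbol {
    std::uint64_t start;
    std::string name;
};

// Attributes each sampled instruction address to the function whose start
// address is the largest one not exceeding the sample. `symbols` must be
// sorted by start address. Samples below the first symbol are counted
// under "<unknown>".
std::map<std::string, int> attribute_samples(
        const std::vector<Symbol>& symbols,
        const std::vector<std::uint64_t>& samples) {
    assert(std::is_sorted(symbols.begin(), symbols.end(),
        [](const Symbol& a, const Symbol& b) { return a.start < b.start; }));
    std::map<std::string, int> hits;
    for (std::uint64_t address : samples) {
        // First symbol that starts strictly after the sampled address.
        auto it = std::upper_bound(
            symbols.begin(), symbols.end(), address,
            [](std::uint64_t addr, const Symbol& s) { return addr < s.start; });
        if (it == symbols.begin()) {
            ++hits["<unknown>"];
        } else {
            ++hits[std::prev(it)->name];
        }
    }
    return hits;
}
```

The function with the highest count is the hotspot candidate; a real profiler would go one step further and use line-number information to point at a source line.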
22 Sampling: Module of Interest
23 Sampling over time
24 Sampling: Source View
25 Programmatic Flow of Control: Call Graph
- Instrumented technology, with some performance degradation
- The binary is instrumented; no special build is needed
- Identifies function-to-function calling sequences
- Reports statistics for each called function: execution time, blocked time, calling sequences and frequency of occurrence
26 Call graph: Application workflow
- Filter view by self time.
- The red lines show the critical path, which is the most time-consuming call path. It is based on self time.
- Bright orange nodes indicate functions with the highest self time.
Intel, VTune, and the Intel logo are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States or other countries. *Other brands and names are the property of their respective owners.
27 Graph Navigation Window: use the graph navigation window for an overview of the entire call graph.
28 VTune Tuning Assistant: for more detail, click the hyperlink.
29 ICC optimization report
- VTune displays optimization reports generated with ICC 9.0 and later.
- This simplifies performance optimization work when ICC and VTune are used together, and it handles all optimization phases supported by ICC.
30 Intel Thread Checker v3.1: confidently pinpoint threading errors
- Features: detects challenging data races and deadlocks; pinpoints errors to the source code line; command line interface for Windows and Linux; works on standard debug builds without recompiling; batch script integration for regression test runs
- What's new: faster analysis through performance optimizations; Microsoft Vista* support
- Platforms: Windows*, Linux*, Mac*, IA32, Intel64, IA64, multicore
- "We couldn't have gotten the networking up and running as quickly and as efficiently without Thread Checker. Thread Checker is simply an awesome tool and we are not going to develop multi-threaded code without it." Doug Service, Dir. of Tech. Dev., and Chris Stark, Software Engineer, Ritual Entertainment
31 Example: Not Quite Right

    #include <stdio.h>

    const long N = ;   /* value missing in this transcript */
    long primes[N], number_of_primes = 0;

    int main()
    {
        printf( "Determining primes from 1-%ld \n", N );
        primes[ number_of_primes++ ] = 2;   // special case
        #pragma omp parallel for
        for ( long number = 3; number <= N; number += 2 )
        {
            long factor = 3;
            while ( number % factor ) factor += 2;
            if ( factor == number )
                primes[ number_of_primes++ ] = number;   // shared index updated without synchronization
        }
        printf( "Found %ld primes\n", number_of_primes );
    }
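The slide's example is "not quite right" because several threads increment the shared `number_of_primes` without synchronization. One possible repair, sketched below, guards the update with an OpenMP `critical` section. The slides do not say which fix the presenter used, so treat this as an assumption; `find_primes` and its `limit` parameter are hypothetical and replace the slide's constant `N`, whose value is missing from the transcript. Compiled without OpenMP support, the pragmas are ignored and the code runs serially.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Finds the primes in [2, limit] with the slide's trial-division loop.
std::vector<long> find_primes(long limit) {
    assert(limit >= 2);
    std::vector<long> primes(limit);  // generous upper bound on storage
    long number_of_primes = 0;
    primes[number_of_primes++] = 2;   // special case

    #pragma omp parallel for
    for (long number = 3; number <= limit; number += 2) {
        long factor = 3;
        while (number % factor) factor += 2;
        if (factor == number) {
            // Only one thread at a time may read and advance the shared index.
            #pragma omp critical
            primes[number_of_primes++] = number;
        }
    }
    primes.resize(number_of_primes);
    // Threads finish in an unspecified order, so sort for a stable result.
    std::sort(primes.begin(), primes.end());
    return primes;
}
```

A critical section serializes only the rare store, so the trial division itself still runs in parallel. Other repairs, such as collecting per-thread results and merging them afterwards, avoid the lock entirely at the cost of more code.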
32 Intel Thread Checker: Key Benefits
- Detects challenging data races and deadlocks
- Pinpoints errors to the source code line
- Works on standard debug builds without recompiling
- Recommends modules to instrument by usage (minimizes instrumentation overhead)
- Scriptable interface for test environment integration (enables batch file runs)
- Supports 32-bit and 64-bit applications
33 Intel Thread Checker
- Primes.exe passes through binary instrumentation, producing Primes.exe (instrumented) plus instrumented DLLs.
- The runtime data collector writes the results to threadchecker.thr.
- Supports Win32* threads, POSIX* threads, and OpenMP*.
34 Intel Thread Checker
[Screenshot callout: pinpoints source code]
35 Intel Thread Profiler v3.1: pinpoints threading inefficiencies
- Features: view the application concurrency level to ensure full core utilization; identify where thread-related overhead impacts performance; find out which created threads are active and which are inactive; included with VTune Analyzer for Windows*
- Support for the Threading Building Blocks API
- What's new: easier to use (recall custom configuration settings); faster to use (user-selectable stack walking); Microsoft Vista* support
- Platforms: Windows*, Linux*, Mac*, IA32, Intel64, IA64, multicore
- "Intel Thread Profiler was very useful for analyzing bottlenecks in our threaded code. Thread Profiler quickly pinpointed problem areas and showed us the reasons for the slowdown, so we were able to restructure the code for better threaded performance." Martin Watt, Software Architect, Alias
36 Performance Profile: Recap
[Figure: speedup plotted against number of threads]
Possible causes for this scalability profile:
1. Insufficient parallel work
2. Memory bandwidth limitations
3. Synchronization overhead
4. Load imbalance
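The first cause, insufficient parallel work, can be quantified with Amdahl's law. The slides do not mention it by name, so this is offered only as a standard model for why a speedup curve flattens; `amdahl_speedup` is a hypothetical helper.

```cpp
#include <cassert>
#include <cmath>

// Amdahl's law: speedup(p) = 1 / ((1 - f) + f / p)
// f = fraction of the serial runtime that can run in parallel,
// p = number of threads.
double amdahl_speedup(double parallel_fraction, int threads) {
    assert(parallel_fraction >= 0.0 && parallel_fraction <= 1.0);
    assert(threads >= 1);
    return 1.0 / ((1.0 - parallel_fraction) + parallel_fraction / threads);
}
```

With 90% of the runtime parallelized, eight threads give a speedup of about 4.7 rather than 8, and no number of threads can exceed 10. The model ignores the other three causes on the slide, which can only make the measured curve worse.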
37 Intel Thread Profiler: Key Benefits
- Shows how much of your application is not optimally parallel, and where
- Identifies where thread-specific overhead impacts performance
- Highlights thread workload imbalances and thread activity
- Shows the number of cores utilized
- Pinpoints issues to the source code line
38 Intel Thread Profiler
[Screenshot callouts: pinpoints inefficiencies]
39 Intel Threading Building Blocks: a kind of STL for parallel C++ programming
- You specify task patterns instead of threads
  - The library maps user-defined logical tasks onto physical threads, using the cache efficiently and balancing load
  - Full support for nested parallelism
- Targets threading for robust performance
  - Designed to provide scalable, portable performance for computationally intense portions of shrink-wrapped applications
- Compatible with other threading packages
  - Designed to work well for CPU-bound computation, not I/O-bound or real-time work
  - The library can be used in concert with other threading packages such as native threads and OpenMP
- Emphasizes scalable, data-parallel programming
  - Solutions based on functional decomposition usually do not scale
40 An Example using ParallelFor: independent iterations and fixed/known bounds

    const int N = ;   // value missing in this transcript

    void change_array(float *array, int M) {
        for (int i = 0; i < M; i++) {
            array[i] *= 2;
        }
    }

    int main() {
        float A[N];
        initialize_array(A);
        change_array(A, N);
        return 0;
    }
41 An Example using ParallelFor: include and initialize the library
(Slide color key: blue = original code, red = provided by TBB, black = boilerplate for library)

    // Include library headers
    #include <tbb/taskschedulerinit.h>
    #include <tbb/blockedrange.h>
    #include <tbb/parallelfor.h>

    // Use namespace
    using namespace ThreadingBuildingBlocks;

    int main() {
        TaskSchedulerInit init;   // Initialize scheduler
        float A[N];
        initialize_array(A);
        parallel_change_array(A, N);
        return 0;
    }
42 An Example using ParallelFor: use the ParallelFor pattern
(Slide color key: blue = original code, red = provided by TBB, black = boilerplate for library)

    // Define task
    class ChangeArrayBody {
        float *array;
    public:
        ChangeArrayBody (float *a): array(a) {}
        void operator()( const BlockedRange <int>& r ) const {
            for (int i = r.begin(); i != r.end(); i++ ) {
                array[i] *= 2;
            }
        }
    };

    void parallel_change_array(float *array, int M) {
        // Use pattern; IdealGrainSize establishes the grain size
        ParallelFor (BlockedRange <int>(0, M, IdealGrainSize), ChangeArrayBody(array));
    }
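To show what the grain size in the ParallelFor call controls, here is a library-free sketch of recursive range splitting. It is not TBB code: `IndexRange`, `DoubleElements`, and `split_and_apply` are hypothetical names loosely modelled on the slide's BlockedRange and ChangeArrayBody, and the pieces are processed serially where a task scheduler would hand them to worker threads.

```cpp
#include <cassert>

// Hypothetical half-open index range [begin, end) with a grain size.
struct IndexRange {
    int begin;
    int end;
    int grain_size;
    bool is_divisible() const { return end - begin > grain_size; }
};

// Same shape as the slide's ChangeArrayBody: a function object that is
// applied to one subrange at a time.
struct DoubleElements {
    float* array;
    void operator()(const IndexRange& r) const {
        for (int i = r.begin; i != r.end; ++i) array[i] *= 2;
    }
};

// Splits the range in half until each piece is no larger than the grain
// size, then runs the body on each piece. Returns the number of pieces
// the body was applied to, purely for illustration.
template <typename Body>
int split_and_apply(const IndexRange& range, const Body& body) {
    assert(range.grain_size > 0);
    if (!range.is_divisible()) {
        if (range.begin == range.end) return 0;  // nothing to do
        body(range);
        return 1;
    }
    const int middle = range.begin + (range.end - range.begin) / 2;
    return split_and_apply(IndexRange{range.begin, middle, range.grain_size}, body) +
           split_and_apply(IndexRange{middle, range.end, range.grain_size}, body);
}
```

A smaller grain size yields more pieces and therefore more scheduling opportunities, but also more per-piece overhead; that trade-off is why the slide treats the grain size as something to establish deliberately.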
43 Intel Threading Building Blocks overview
- Generic parallel algorithms: parallel_for, parallel_while, parallel_reduce, pipeline, parallel_sort, parallel_scan
- Concurrent containers: concurrent_hash_map, concurrent_queue, concurrent_vector
- Task scheduler
- Low-level synchronization primitives: atomic, spin_mutex, queuing_mutex, spin_rw_mutex, mutex
- Memory allocation: cache_aligned_allocator, scalable_allocator
- Timing: tick_count
44 Intel Threading Building Blocks: programming with Intel TBB vs. OS threads
Intel TBB offers cleaner design and competitive performance.

Parallel work, POSIX* threads:

    void parallel_thread (void *arg)
    {
        int y1, y2;
        while (schedule_thread_work (y1, y2)) {
            for (int y = y1; y <= y2; y++) {
                for (int x = startx; x <= stopx; x++) {
                    render_one_pixel (x, y);
                }
            }
            if (scene.displaymode == RT_DISPLAY_ENABLED) {
                pthread_mutex_lock (&MyMutex3);
                for (int y = y1; y <= y2; y++) {
                    GraphicsDrawRow(startx-1, y-1, totalx,
                        (unsigned char *) &global_buffer[(y-starty)*totalx*3]);
                }
                pthread_mutex_unlock (&MyMutex3);
            }
        }
    }

Parallel work, Intel TBB:

    #include "tbb/parallelfor.h"
    #include "tbb/blockedrange2d.h"

    class parallel_task {
    public:
        void operator() (const TBB::BlockedRange2D<int> &r) const {
            for (int y = r.rows().begin(); y != r.rows().end(); ++y) {
                for (int x = r.cols().begin(); x != r.cols().end(); x++) {
                    render_one_pixel (x, y);
                }
            }
            if (scene.displaymode == RT_DISPLAY_ENABLED) {
                TBB::SpinMutex::scoped_lock lock (MyMutex2);
                for (int y = r.rows().begin(); y != r.rows().end(); ++y) {
                    GraphicsDrawRow(startx-1, y-1, totalx,
                        (unsigned char *) &global_buffer[(y-starty)*totalx*3]);
                }
            }
        }
        parallel_task () {}
    };

Data decomposition, Intel TBB:

    ParallelFor (TBB::BlockedRange2D<int> (starty, stopy + 1, grain_size,
                                           startx, stopx + 1, grain_size),
                 parallel_task ());

Data decomposition, POSIX* threads:

    const int MINPATCH = 150;
    const int DIVFACTOR = 2;

    typedef struct work_queue_entry_s {
        patch pch;
        struct work_queue_entry_s *next;
    } work_queue_entry_t;

    work_queue_entry_t *work_queue_head = NULL;
    work_queue_entry_t *work_queue_tail = NULL;

    void generate_work (patch* pchin)
    {
        int startx, stopx, starty, stopy;
        int xs, ys;
        startx = pchin->startx;
        stopx = pchin->stopx;
        starty = pchin->starty;
        stopy = pchin->stopy;
        if (((stopx-startx) >= MINPATCH) || ((stopy-starty) >= MINPATCH)) {
            /* patch is still too large: split it and recurse */
            int xpatchsize = (stopx-startx)/DIVFACTOR + 1;
            int ypatchsize = (stopy-starty)/DIVFACTOR + 1;
            for (ys = starty; ys <= stopy; ys += ypatchsize)
                for (xs = startx; xs <= stopx; xs += xpatchsize) {
                    patch pch;
                    pch.startx = xs;
                    pch.starty = ys;
                    pch.stopx = MIN(xs+xpatchsize-1, stopx);
                    pch.stopy = MIN(ys+ypatchsize-1, stopy);
                    generate_work (&pch);
                }
        } else {
            /* just trace this patch */
            work_queue_entry_t *q = (work_queue_entry_t *) malloc (sizeof (work_queue_entry_t));
            q->pch.starty = starty;
            q->pch.stopy = stopy;
            q->pch.startx = startx;
            q->pch.stopx = stopx;
            q->next = NULL;
            if (work_queue_head == NULL) {
                work_queue_head = q;
            } else {
                work_queue_tail->next = q;
            }
            work_queue_tail = q;
        }
    }

    void generate_worklist (void)
    {
        patch pch;
        pch.startx = startx;
        pch.stopx = stopx;
        pch.starty = starty;
        pch.stopy = stopy;
        generate_work (&pch);
    }

    bool schedule_thread_work (patch &pch)
    {
        pthread_mutex_lock (&MyMutex3);
        work_queue_entry_t *q = work_queue_head;
        if (q != NULL) {
            pch = q->pch;
            work_queue_head = work_queue_head->next;
        }
        pthread_mutex_unlock (&MyMutex3);
        return (q != NULL);
    }

    generate_worklist ();
45 Performance Libraries
46 Intel Math Kernel Library 8.1 (MKL)
- Multi-core ready: thread safe, with excellent scaling on multiprocessor systems
- Automatic runtime processor detection
- Support for C and Fortran interfaces
- Support for all Intel processors in one package
- Royalty-free distribution rights
- Functionality: BLAS, LAPACK, ScaLAPACK, sparse solvers, fast Fourier transforms, vector math
- Supports: Windows*, Linux*, Mac OS*, 64-bit, multicore, AMD*
- "By adopting the Intel MKL DGEMM libraries, our standard benchmarks timing improved between 43% and 71%, which is very impressive." Matt Dunbar, Software Developer, ABAQUS, Inc.
47 Intel Integrated Performance Primitives (IPP)
- Application source code
- Intel IPP usage code samples (rapid application development): sample video/audio/speech codecs, image processing and JPEG, signal processing, data compression, .NET and Java* integration
- Intel IPP library C/C++ API, reached through API calls (cross-platform compatibility and code re-use): video coding, audio coding, speech coding, speech recognition, data compression, cryptography, matrix maths, signal processing, image processing, JPEG coding, computer vision, image colour conversion, string processing, vector maths
- Intel IPP processor-optimized binaries, linked statically or dynamically (outstanding performance): Intel Core Duo and Core Solo processors, Intel Pentium D dual-core processors, Intel Pentium M processors, Intel Pentium 4 processors, Intel Xeon processors, Intel Itanium 2 processors, Intel XScale technology-based processors
- Supports: Windows*, Linux*, Mac*, 64-bit, multicore, AMD*
- "The Intel IPP [Intel Integrated Performance Primitives] is the fastest image processing library we've found, resulting in much greater interactivity and creative freedom for our users." Bruce Rady, President, RadTIME, Inc.
48 Intel Cluster tools Optimize MPI based applications
49 What Are the Biggest Bottlenecks Today in Creating Parallel Applications? Source: Developing Custom Parallel Computing Applications, Simon Management Group, September
50 Intel Cluster Toolkit: boost development and performance of cluster applications
- Universal MPI library runs cluster applications on all networks
- Leading cluster development environment to efficiently create, analyze, optimize, and deploy parallel applications
- Ready to support dual-core and multi-core clusters
- Intel Cluster Toolkit 3.0, a full-featured MPI tools environment: Intel MPI Library 3.0, Intel Trace Analyzer and Collector 7.0, Intel Math Kernel Library Cluster Edition 9.0, Intel MPI Benchmarks
51 Summary
Intel Software Development Products lead the way with support for the latest operating systems and multi-core processors:
- Intel VTune Analyzer v9.0: Intel Core 2 processor event support and hotspot navigator
- Intel Thread Profiler and Checker v3.1: speed and usability improvements
- Intel Threading Building Blocks (Intel TBB) v1.1: automatic grain sizes
- Intel Performance Libraries: speed improvements
- Intel Cluster Tools v3.0: build and optimize MPI-based applications
52 Any Questions?
More informationEliminate Memory Errors to Improve Program Stability
Introduction INTEL PARALLEL STUDIO XE EVALUATION GUIDE This guide will illustrate how Intel Parallel Studio XE memory checking capabilities can find crucial memory defects early in the development cycle.
More informationOracle Developer Studio Performance Analyzer
Oracle Developer Studio Performance Analyzer The Oracle Developer Studio Performance Analyzer provides unparalleled insight into the behavior of your application, allowing you to identify bottlenecks and
More informationIntel Visual Fortran Compiler Professional Edition 11.0 for Windows* In-Depth
Intel Visual Fortran Compiler Professional Edition 11.0 for Windows* In-Depth Contents Intel Visual Fortran Compiler Professional Edition for Windows*........................ 3 Features...3 New in This
More informationChapter 4: Threads. Chapter 4: Threads
Chapter 4: Threads Silberschatz, Galvin and Gagne 2013 Chapter 4: Threads Overview Multicore Programming Multithreading Models Thread Libraries Implicit Threading Threading Issues Operating System Examples
More informationGraphics Performance Analyzer for Android
Graphics Performance Analyzer for Android 1 What you will learn from this slide deck Detailed optimization workflow of Graphics Performance Analyzer Android* System Analysis Only Please see subsequent
More informationCSE 4/521 Introduction to Operating Systems
CSE 4/521 Introduction to Operating Systems Lecture 5 Threads (Overview, Multicore Programming, Multithreading Models, Thread Libraries, Implicit Threading, Operating- System Examples) Summer 2018 Overview
More informationHPC Tools on Windows. Christian Terboven Center for Computing and Communication RWTH Aachen University.
- Excerpt - Christian Terboven terboven@rz.rwth-aachen.de Center for Computing and Communication RWTH Aachen University PPCES March 25th, RWTH Aachen University Agenda o Intel Trace Analyzer and Collector
More informationIntel PerfMon Performance Monitoring Hardware
Intel PerfMon Performance Monitoring Hardware Overview PerfMon Basics PerfMon is hardware throughout the silicon available through registers to tools to facilitate several system/application usages: compiler
More informationOPERATING SYSTEM. Chapter 4: Threads
OPERATING SYSTEM Chapter 4: Threads Chapter 4: Threads Overview Multicore Programming Multithreading Models Thread Libraries Implicit Threading Threading Issues Operating System Examples Objectives To
More informationIntel VTune Amplifier XE for Tuning of HPC Applications Intel Software Developer Conference Frankfurt, 2017 Klaus-Dieter Oertel, Intel
Intel VTune Amplifier XE for Tuning of HPC Applications Intel Software Developer Conference Frankfurt, 2017 Klaus-Dieter Oertel, Intel Agenda Which performance analysis tool should I use first? Intel Application
More informationIntel(R) Threading Building Blocks
Getting Started Guide Intel Threading Building Blocks is a runtime-based parallel programming model for C++ code that uses threads. It consists of a template-based runtime library to help you harness the
More informationThreaded Programming. Lecture 9: Alternatives to OpenMP
Threaded Programming Lecture 9: Alternatives to OpenMP What s wrong with OpenMP? OpenMP is designed for programs where you want a fixed number of threads, and you always want the threads to be consuming
More informationMaximizing performance and scalability using Intel performance libraries
Maximizing performance and scalability using Intel performance libraries Roger Philp Intel HPC Software Workshop Series 2016 HPC Code Modernization for Intel Xeon and Xeon Phi February 17 th 2016, Barcelona
More informationThread Profiler 2.0 Release Notes
Thread Profiler 2.0 Release Notes Contents 1. Overview 2. Package 3. New Features 4. Requirements 5. Installation 6. Usage 7. Supported C Run-Time and Windows* APIs 8. Technical Support and Feedback 1.
More informationIntroduction to Parallel Performance Engineering
Introduction to Parallel Performance Engineering Markus Geimer, Brian Wylie Jülich Supercomputing Centre (with content used with permission from tutorials by Bernd Mohr/JSC and Luiz DeRose/Cray) Performance:
More informationIntel Parallel Amplifier Sample Code Guide
The analyzes the performance of your application and provides information on the performance bottlenecks in your code. It enables you to focus your tuning efforts on the most critical sections of your
More informationPerformance analysis tools: Intel VTuneTM Amplifier and Advisor. Dr. Luigi Iapichino
Performance analysis tools: Intel VTuneTM Amplifier and Advisor Dr. Luigi Iapichino luigi.iapichino@lrz.de Which tool do I use in my project? A roadmap to optimisation After having considered the MPI layer,
More informationEliminate Memory Errors to Improve Program Stability
Eliminate Memory Errors to Improve Program Stability This guide will illustrate how Parallel Studio memory checking capabilities can find crucial memory defects early in the development cycle. It provides
More informationChapter 4: Threads. Chapter 4: Threads. Overview Multicore Programming Multithreading Models Thread Libraries Implicit Threading Threading Issues
Chapter 4: Threads Silberschatz, Galvin and Gagne 2013 Chapter 4: Threads Overview Multicore Programming Multithreading Models Thread Libraries Implicit Threading Threading Issues 4.2 Silberschatz, Galvin
More informationUsing Intel Inspector XE 2011 with Fortran Applications
Using Intel Inspector XE 2011 with Fortran Applications Jackson Marusarz Intel Corporation Legal Disclaimer INFORMATION IN THIS DOCUMENT IS PROVIDED IN CONNECTION WITH INTEL PRODUCTS. NO LICENSE, EXPRESS
More informationIntel Thread Building Blocks, Part II
Intel Thread Building Blocks, Part II SPD course 2013-14 Massimo Coppola 25/03, 16/05/2014 1 TBB Recap Portable environment Based on C++11 standard compilers Extensive use of templates No vectorization
More informationIntroduction to Intel Xeon Phi programming techniques. Fabio Affinito Vittorio Ruggiero
Introduction to Intel Xeon Phi programming techniques Fabio Affinito Vittorio Ruggiero Outline High level overview of the Intel Xeon Phi hardware and software stack Intel Xeon Phi programming paradigms:
More informationEI 338: Computer Systems Engineering (Operating Systems & Computer Architecture)
EI 338: Computer Systems Engineering (Operating Systems & Computer Architecture) Dept. of Computer Science & Engineering Chentao Wu wuct@cs.sjtu.edu.cn Download lectures ftp://public.sjtu.edu.cn User:
More informationIntel Thread Checker 3.1 for Windows* Release Notes
Page 1 of 6 Intel Thread Checker 3.1 for Windows* Release Notes Contents Overview Product Contents What's New System Requirements Known Issues and Limitations Technical Support Related Products Overview
More informationIntel Parallel Amplifier 2011
THREADING AND PERFORMANCE PROFILER Intel Parallel Amplifier 2011 Product Brief Intel Parallel Amplifier 2011 Optimize Performance and Scalability Intel Parallel Amplifier 2011 makes it simple to quickly
More informationCUDA GPGPU Workshop 2012
CUDA GPGPU Workshop 2012 Parallel Programming: C thread, Open MP, and Open MPI Presenter: Nasrin Sultana Wichita State University 07/10/2012 Parallel Programming: Open MP, MPI, Open MPI & CUDA Outline
More informationGetting Started with Intel SDK for OpenCL Applications
Getting Started with Intel SDK for OpenCL Applications Webinar #1 in the Three-part OpenCL Webinar Series July 11, 2012 Register Now for All Webinars in the Series Welcome to Getting Started with Intel
More informationComputer Systems A Programmer s Perspective 1 (Beta Draft)
Computer Systems A Programmer s Perspective 1 (Beta Draft) Randal E. Bryant David R. O Hallaron August 1, 2001 1 Copyright c 2001, R. E. Bryant, D. R. O Hallaron. All rights reserved. 2 Contents Preface
More informationProfiling: Understand Your Application
Profiling: Understand Your Application Michal Merta michal.merta@vsb.cz 1st of March 2018 Agenda Hardware events based sampling Some fundamental bottlenecks Overview of profiling tools perf tools Intel
More informationShared memory programming model OpenMP TMA4280 Introduction to Supercomputing
Shared memory programming model OpenMP TMA4280 Introduction to Supercomputing NTNU, IMF February 16. 2018 1 Recap: Distributed memory programming model Parallelism with MPI. An MPI execution is started
More informationScientific Programming in C XIV. Parallel programming
Scientific Programming in C XIV. Parallel programming Susi Lehtola 11 December 2012 Introduction The development of microchips will soon reach the fundamental physical limits of operation quantum coherence
More informationIntroduction to parallel computers and parallel programming. Introduction to parallel computersand parallel programming p. 1
Introduction to parallel computers and parallel programming Introduction to parallel computersand parallel programming p. 1 Content A quick overview of morden parallel hardware Parallelism within a chip
More informationPerformance Analysis of Parallel Scientific Applications In Eclipse
Performance Analysis of Parallel Scientific Applications In Eclipse EclipseCon 2015 Wyatt Spear, University of Oregon wspear@cs.uoregon.edu Supercomputing Big systems solving big problems Performance gains
More informationIntel Performance Libraries
Intel Performance Libraries Powerful Mathematical Library Intel Math Kernel Library (Intel MKL) Energy Science & Research Engineering Design Financial Analytics Signal Processing Digital Content Creation
More informationIntel profiling tools and roofline model. Dr. Luigi Iapichino
Intel profiling tools and roofline model Dr. Luigi Iapichino luigi.iapichino@lrz.de Which tool do I use in my project? A roadmap to optimization (and to the next hour) We will focus on tools developed
More informationKNL tools. Dr. Fabio Baruffa
KNL tools Dr. Fabio Baruffa fabio.baruffa@lrz.de 2 Which tool do I use? A roadmap to optimization We will focus on tools developed by Intel, available to users of the LRZ systems. Again, we will skip the
More informationPablo Halpern Parallel Programming Languages Architect Intel Corporation
Pablo Halpern Parallel Programming Languages Architect Intel Corporation CppCon, 8 September 2014 This work by Pablo Halpern is licensed under a Creative Commons Attribution
More informationTutorial: Analyzing MPI Applications. Intel Trace Analyzer and Collector Intel VTune Amplifier XE
Tutorial: Analyzing MPI Applications Intel Trace Analyzer and Collector Intel VTune Amplifier XE Contents Legal Information... 3 1. Overview... 4 1.1. Prerequisites... 5 1.1.1. Required Software... 5 1.1.2.
More informationAgenda Process Concept Process Scheduling Operations on Processes Interprocess Communication 3.2
Lecture 3: Processes Agenda Process Concept Process Scheduling Operations on Processes Interprocess Communication 3.2 Process in General 3.3 Process Concept Process is an active program in execution; process
More informationMaximize Performance and Scalability of RADIOSS* Structural Analysis Software on Intel Xeon Processor E7 v2 Family-Based Platforms
Maximize Performance and Scalability of RADIOSS* Structural Analysis Software on Family-Based Platforms Executive Summary Complex simulations of structural and systems performance, such as car crash simulations,
More informationChe-Wei Chang Department of Computer Science and Information Engineering, Chang Gung University
Che-Wei Chang chewei@mail.cgu.edu.tw Department of Computer Science and Information Engineering, Chang Gung University 1. Introduction 2. System Structures 3. Process Concept 4. Multithreaded Programming
More informationIntel Threading Building Blocks (TBB)
Intel Threading Building Blocks (TBB) SDSC Summer Institute 2012 Pietro Cicotti Computational Scientist Gordon Applications Team Performance Modeling and Characterization Lab Parallelism and Decomposition
More informationChapter 4: Multithreaded Programming
Chapter 4: Multithreaded Programming Silberschatz, Galvin and Gagne 2013 Chapter 4: Multithreaded Programming Overview Multicore Programming Multithreading Models Thread Libraries Implicit Threading Threading
More informationAUTOMATIC SMT THREADING
AUTOMATIC SMT THREADING FOR OPENMP APPLICATIONS ON THE INTEL XEON PHI CO-PROCESSOR WIM HEIRMAN 1,2 TREVOR E. CARLSON 1 KENZO VAN CRAEYNEST 1 IBRAHIM HUR 2 AAMER JALEEL 2 LIEVEN EECKHOUT 1 1 GHENT UNIVERSITY
More informationJackson Marusarz Intel Corporation
Jackson Marusarz Intel Corporation Intel VTune Amplifier Quick Introduction Get the Data You Need Hotspot (Statistical call tree), Call counts (Statistical) Thread Profiling Concurrency and Lock & Waits
More informationChapter 4: Threads. Operating System Concepts 9 th Edition
Chapter 4: Threads Silberschatz, Galvin and Gagne 2013 Chapter 4: Threads Overview Multicore Programming Multithreading Models Thread Libraries Implicit Threading Threading Issues Operating System Examples
More informationMoore s Law. Multicore Programming. Vendor Solution. Power Density. Parallelism and Performance MIT Lecture 11 1.
Moore s Law 1000000 Intel CPU Introductions 6.172 Performance Engineering of Software Systems Lecture 11 Multicore Programming Charles E. Leiserson 100000 10000 1000 100 10 Clock Speed (MHz) Transistors
More informationTools for Intel Xeon Phi: VTune & Advisor Dr. Fabio Baruffa - LRZ,
Tools for Intel Xeon Phi: VTune & Advisor Dr. Fabio Baruffa - fabio.baruffa@lrz.de LRZ, 27.6.- 29.6.2016 Architecture Overview Intel Xeon Processor Intel Xeon Phi Coprocessor, 1st generation Intel Xeon
More informationShared memory parallel computing. Intel Threading Building Blocks
Shared memory parallel computing Intel Threading Building Blocks Introduction & history Threading Building Blocks (TBB) cross platform C++ template lib for task-based shared memory parallel programming
More information