
Multicore Application Programming
For Windows, Linux, and Oracle Solaris

Darryl Gove

Addison-Wesley
Upper Saddle River, NJ; Boston; Indianapolis; San Francisco; New York; Toronto; Montreal; London; Munich; Paris; Madrid; Capetown; Sydney; Tokyo; Singapore; Mexico City

Contents

Preface
Acknowledgments
About the Author

1 Hardware, Processes, and Threads
  Examining the Insides of a Computer
  The Motivation for Multicore Processors
  Supporting Multiple Threads on a Single Chip
  Increasing Instruction Issue Rate with Pipelined Processor Cores
  Using Caches to Hold Recently Used Data
  Using Virtual Memory to Store Data
  Translating from Virtual Addresses to Physical Addresses
  The Characteristics of Multiprocessor Systems
  How Latency and Bandwidth Impact Performance
  The Translation of Source Code to Assembly Language
  The Performance of 32-Bit versus 64-Bit Code
  Ensuring the Correct Order of Memory Operations
  The Differences Between Processes and Threads
  Summary

2 Coding for Performance
  Defining Performance
  Understanding Algorithmic Complexity
  Examples of Algorithmic Complexity
  Why Algorithmic Complexity Is Important
  Using Algorithmic Complexity with Care
  How Structure Impacts Performance
  Performance and Convenience Trade-Offs in Source Code and Build Structures
  Using Libraries to Structure Applications
  The Impact of Data Structures on Performance
  The Role of the Compiler
  The Two Types of Compiler Optimization
  Selecting Appropriate Compiler Options
  How Cross-File Optimization Can Be Used to Improve Performance
  Using Profile Feedback
  How Potential Pointer Aliasing Can Inhibit Compiler Optimizations
  Identifying Where Time Is Spent Using Profiling
  Commonly Available Profiling Tools
  How Not to Optimize
  Performance by Design
  Summary

3 Identifying Opportunities for Parallelism
  Using Multiple Processes to Improve System Productivity
  Multiple Users Utilizing a Single System
  Improving Machine Efficiency Through Consolidation
  Using Containers to Isolate Applications Sharing a Single System
  Hosting Multiple Operating Systems Using Hypervisors
  Using Parallelism to Improve the Performance of a Single Task
  One Approach to Visualizing Parallel Applications
  How Parallelism Can Change the Choice of Algorithms
  Amdahl's Law
  Determining the Maximum Practical Threads
  How Synchronization Costs Reduce Scaling
  Parallelization Patterns
  Data Parallelism Using SIMD Instructions
  Parallelization Using Processes or Threads
  Multiple Independent Tasks
  Multiple Loosely Coupled Tasks
  Multiple Copies of the Same Task
  Single Task Split Over Multiple Threads
  Using a Pipeline of Tasks to Work on a Single Item
  Division of Work into a Client and a Server
  Splitting Responsibility into a Producer and a Consumer
  Combining Parallelization Strategies
  How Dependencies Influence the Ability to Run Code in Parallel
  Antidependencies and Output Dependencies
  Using Speculation to Break Dependencies
  Critical Paths
  Identifying Parallelization Opportunities
  Summary

4 Synchronization and Data Sharing
  Data Races
  Using Tools to Detect Data Races
  Avoiding Data Races
  Synchronization Primitives
  Mutexes and Critical Regions
  Spin Locks
  Semaphores
  Readers-Writer Locks
  Barriers
  Atomic Operations and Lock-Free Code
  Deadlocks and Livelocks
  Communication Between Threads and Processes
  Memory, Shared Memory, and Memory-Mapped Files
  Condition Variables
  Signals and Events
  Message Queues
  Named Pipes
  Communication Through the Network Stack
  Other Approaches to Sharing Data Between Threads
  Storing Thread-Private Data
  Summary

5 Using POSIX Threads
  Creating Threads
  Thread Termination
  Passing Data to and from Child Threads
  Detached Threads
  Setting the Attributes for Pthreads
  Compiling Multithreaded Code
  Process Termination
  Sharing Data Between Threads
  Protecting Access Using Mutex Locks
  Mutex Attributes
  Using Spin Locks
  Read-Write Locks
  Barriers
  Semaphores
  Condition Variables
  Variables and Memory
  Multiprocess Programming
  Sharing Memory Between Processes
  Sharing Semaphores Between Processes
  Message Queues
  Pipes and Named Pipes
  Using Signals to Communicate with a Process
  Sockets
  Reentrant Code and Compiler Flags
  Summary

6 Windows Threading
  Creating Native Windows Threads
  Terminating Threads
  Creating and Resuming Suspended Threads
  Using Handles to Kernel Resources
  Methods of Synchronization and Resource Sharing
  An Example of Requiring Synchronization Between Threads
  Protecting Access to Code with Critical Sections
  Protecting Regions of Code with Mutexes
  Slim Reader/Writer Locks
  Semaphores
  Condition Variables
  Signaling Event Completion to Other Threads or Processes
  Wide String Handling in Windows
  Creating Processes
  Sharing Memory Between Processes
  Inheriting Handles in Child Processes
  Naming Mutexes and Sharing Them Between Processes
  Communicating with Pipes
  Communicating Using Sockets
  Atomic Updates of Variables
  Allocating Thread-Local Storage
  Setting Thread Priority
  Summary

7 Using Automatic Parallelization and OpenMP
  Using Automatic Parallelization to Produce a Parallel Application
  Identifying and Parallelizing Reductions
  Automatic Parallelization of Codes Containing Calls
  Assisting the Compiler in Automatically Parallelizing Code
  Using OpenMP to Produce a Parallel Application
  Using OpenMP to Parallelize Loops
  Runtime Behavior of an OpenMP Application
  Variable Scoping Inside OpenMP Parallel Regions
  Parallelizing Reductions Using OpenMP
  Accessing Private Data Outside the Parallel Region
  Improving Work Distribution Using Scheduling
  Using Parallel Sections to Perform Independent Work
  Nested Parallelism
  Using OpenMP for Dynamically Defined Parallel Tasks
  Keeping Data Private to Threads
  Controlling the OpenMP Runtime Environment
  Waiting for Work to Complete
  Restricting the Threads That Execute a Region of Code
  Ensuring That Code in a Parallel Region Is Executed in Order
  Collapsing Loops to Improve Workload Balance
  Enforcing Memory Consistency
  An Example of Parallelization
  Summary

8 Hand-Coded Synchronization and Sharing
  Atomic Operations
  Using Compare and Swap Instructions to Form More Complex Atomic Operations
  Enforcing Memory Ordering to Ensure Correct Operation
  Compiler Support of Memory-Ordering Directives
  Reordering of Operations by the Compiler
  Volatile Variables
  Operating System-Provided Atomics
  Lockless Algorithms
  Dekker's Algorithm
  Producer-Consumer with a Circular Buffer
  Scaling to Multiple Consumers or Producers
  Scaling the Producer-Consumer to Multiple Threads
  Modifying the Producer-Consumer Code to Use Atomics
  The ABA Problem
  Summary

9 Scaling with Multicore Processors
  Constraints to Application Scaling
  Performance Limited by Serial Code
  Superlinear Scaling
  Workload Imbalance
  Hot Locks
  Scaling of Library Code
  Insufficient Work
  Algorithmic Limit
  Hardware Constraints to Scaling
  Bandwidth Sharing Between Cores
  False Sharing
  Cache Conflict and Capacity
  Pipeline Resource Starvation
  Operating System Constraints to Scaling
  Oversubscription
  Using Processor Binding to Improve Memory Locality
  Priority Inversion
  Multicore Processors and Scaling
  Summary

10 Other Parallelization Technologies
  GPU-Based Computing
  Language Extensions
  Threading Building Blocks
  Cilk++
  Grand Central Dispatch
  Features Proposed for the Next C and C++ Standards
  Microsoft's C++/CLI
  Alternative Languages
  Clustering Technologies
  MPI
  MapReduce as a Strategy for Scaling
  Grids
  Transactional Memory
  Vectorization
  Summary

11 Concluding Remarks
  Writing Parallel Applications
  Identifying Tasks
  Estimating Performance Gains
  Determining Dependencies
  Data Races and the Scaling Limitations of Mutex Locks
  Locking Granularity
  Parallel Code on Multicore Processors
  Optimizing Programs for Multicore Processors
  The Future

Bibliography
  Books
  POSIX Threads
  Windows
  Algorithmic Complexity
  Computer Architecture
  Parallel Programming
  OpenMP
  Online Resources
  Hardware
  Developer Tools
  Parallelization Approaches

Index