Multicore Application Programming: For Windows, Linux, and Oracle Solaris
Darryl Gove
Addison-Wesley
Upper Saddle River, NJ • Boston • Indianapolis • San Francisco • New York • Toronto • Montreal • London • Munich • Paris • Madrid • Cape Town • Sydney • Tokyo • Singapore • Mexico City
Preface xv
Acknowledgments xix
About the Author xxi

1 Hardware, Processes, and Threads 1
    Examining the Insides of a Computer 1
    The Motivation for Multicore Processors 3
    Supporting Multiple Threads on a Single Chip 4
    Increasing Instruction Issue Rate with Pipelined Processor Cores 9
    Using Caches to Hold Recently Used Data 12
    Using Virtual Memory to Store Data 15
    Translating from Virtual Addresses to Physical Addresses 16
    The Characteristics of Multiprocessor Systems 18
    How Latency and Bandwidth Impact Performance 20
    The Translation of Source Code to Assembly Language 21
    The Performance of 32-Bit versus 64-Bit Code 23
    Ensuring the Correct Order of Memory Operations 24
    The Differences Between Processes and Threads 26
    Summary 29

2 Coding for Performance 31
    Defining Performance 31
    Understanding Algorithmic Complexity 33
    Examples of Algorithmic Complexity 33
    Why Algorithmic Complexity Is Important 37
    Using Algorithmic Complexity with Care 38
    How Structure Impacts Performance 39
    Performance and Convenience Trade-Offs in Source Code and Build Structures 39
    Using Libraries to Structure Applications 42
    The Impact of Data Structures on Performance 53
    The Role of the Compiler 60
    The Two Types of Compiler Optimization 62
    Selecting Appropriate Compiler Options 64
    How Cross-File Optimization Can Be Used to Improve Performance 65
    Using Profile Feedback 68
    How Potential Pointer Aliasing Can Inhibit Compiler Optimizations 70
    Identifying Where Time Is Spent Using Profiling 74
    Commonly Available Profiling Tools 75
    How Not to Optimize 80
    Performance by Design 82
    Summary 83

3 Identifying Opportunities for Parallelism 85
    Using Multiple Processes to Improve System Productivity 85
    Multiple Users Utilizing a Single System 87
    Improving Machine Efficiency Through Consolidation 88
    Using Containers to Isolate Applications Sharing a Single System 89
    Hosting Multiple Operating Systems Using Hypervisors 89
    Using Parallelism to Improve the Performance of a Single Task 92
    One Approach to Visualizing Parallel Applications 92
    How Parallelism Can Change the Choice of Algorithms 93
    Amdahl's Law 94
    Determining the Maximum Practical Threads 97
    How Synchronization Costs Reduce Scaling 98
    Parallelization Patterns 100
    Data Parallelism Using SIMD Instructions 101
    Parallelization Using Processes or Threads 102
    Multiple Independent Tasks 102
    Multiple Loosely Coupled Tasks 103
    Multiple Copies of the Same Task 105
    Single Task Split Over Multiple Threads 106
    Using a Pipeline of Tasks to Work on a Single Item 106
    Division of Work into a Client and a Server 108
    Splitting Responsibility into a Producer and a Consumer 109
    Combining Parallelization Strategies 109
    How Dependencies Influence the Ability to Run Code in Parallel 110
    Antidependencies and Output Dependencies 111
    Using Speculation to Break Dependencies 113
    Critical Paths 117
    Identifying Parallelization Opportunities 118
    Summary 119

4 Synchronization and Data Sharing 121
    Data Races 121
    Using Tools to Detect Data Races 123
    Avoiding Data Races 126
    Synchronization Primitives 126
    Mutexes and Critical Regions 126
    Spin Locks 128
    Semaphores 128
    Readers-Writer Locks 129
    Barriers 130
    Atomic Operations and Lock-Free Code 130
    Deadlocks and Livelocks 132
    Communication Between Threads and Processes 133
    Memory, Shared Memory, and Memory-Mapped Files 134
    Condition Variables 135
    Signals and Events 137
    Message Queues 138
    Named Pipes 139
    Communication Through the Network Stack 139
    Other Approaches to Sharing Data Between Threads 140
    Storing Thread-Private Data 141
    Summary 142
5 Using POSIX Threads 143
    Creating Threads 143
    Thread Termination 144
    Passing Data to and from Child Threads 145
    Detached Threads 147
    Setting the Attributes for Pthreads 148
    Compiling Multithreaded Code 151
    Process Termination 153
    Sharing Data Between Threads 154
    Protecting Access Using Mutex Locks 154
    Mutex Attributes 156
    Using Spin Locks 157
    Read-Write Locks 159
    Barriers 162
    Semaphores 163
    Condition Variables 170
    Variables and Memory 175
    Multiprocess Programming 179
    Sharing Memory Between Processes 180
    Sharing Semaphores Between Processes 183
    Message Queues 184
    Pipes and Named Pipes 186
    Using Signals to Communicate with a Process 188
    Sockets 193
    Reentrant Code and Compiler Flags 197
    Summary 198

6 Windows Threading 199
    Creating Native Windows Threads 199
    Terminating Threads 204
    Creating and Resuming Suspended Threads 207
    Using Handles to Kernel Resources 207
    Methods of Synchronization and Resource Sharing 208
    An Example of Requiring Synchronization Between Threads 209
    Protecting Access to Code with Critical Sections 210
    Protecting Regions of Code with Mutexes 213
    Slim Reader/Writer Locks 214
    Semaphores 216
    Condition Variables 218
    Signaling Event Completion to Other Threads or Processes 219
    Wide String Handling in Windows 221
    Creating Processes 222
    Sharing Memory Between Processes 225
    Inheriting Handles in Child Processes 228
    Naming Mutexes and Sharing Them Between Processes 229
    Communicating with Pipes 231
    Communicating Using Sockets 234
    Atomic Updates of Variables 238
    Allocating Thread-Local Storage 240
    Setting Thread Priority 242
    Summary 244

7 Using Automatic Parallelization and OpenMP 245
    Using Automatic Parallelization to Produce a Parallel Application 245
    Identifying and Parallelizing Reductions 250
    Automatic Parallelization of Codes Containing Calls 251
    Assisting Compiler in Automatically Parallelizing Code 254
    Using OpenMP to Produce a Parallel Application 256
    Using OpenMP to Parallelize Loops 258
    Runtime Behavior of an OpenMP Application 258
    Variable Scoping Inside OpenMP Parallel Regions 259
    Parallelizing Reductions Using OpenMP 260
    Accessing Private Data Outside the Parallel Region 261
    Improving Work Distribution Using Scheduling 263
    Using Parallel Sections to Perform Independent Work 267
    Nested Parallelism 268
    Using OpenMP for Dynamically Defined Parallel Tasks 269
    Keeping Data Private to Threads 274
    Controlling the OpenMP Runtime Environment 276
    Waiting for Work to Complete 278
    Restricting the Threads That Execute a Region of Code 281
    Ensuring That Code in a Parallel Region Is Executed in Order 285
    Collapsing Loops to Improve Workload Balance 286
    Enforcing Memory Consistency 287
    An Example of Parallelization 288
    Summary 293

8 Hand-Coded Synchronization and Sharing 295
    Atomic Operations 295
    Using Compare and Swap Instructions to Form More Complex Atomic Operations 297
    Enforcing Memory Ordering to Ensure Correct Operation 301
    Compiler Support of Memory-Ordering Directives 303
    Reordering of Operations by the Compiler 304
    Volatile Variables 308
    Operating System-Provided Atomics 309
    Lockless Algorithms 312
    Dekker's Algorithm 312
    Producer-Consumer with a Circular Buffer 315
    Scaling to Multiple Consumers or Producers 318
    Scaling the Producer-Consumer to Multiple Threads 319
    Modifying the Producer-Consumer Code to Use Atomics 326
    The ABA Problem 329
    Summary 332

9 Scaling with Multicore Processors 333
    Constraints to Application Scaling 333
    Performance Limited by Serial Code 334
    Superlinear Scaling 336
    Workload Imbalance 338
    Hot Locks 340
    Scaling of Library Code 345
    Insufficient Work 347
    Algorithmic Limit 350
    Hardware Constraints to Scaling 352
    Bandwidth Sharing Between Cores 353
    False Sharing 355
    Cache Conflict and Capacity 359
    Pipeline Resource Starvation 363
    Operating System Constraints to Scaling 369
    Oversubscription 369
    Using Processor Binding to Improve Memory Locality 371
    Priority Inversion 379
    Multicore Processors and Scaling 380
    Summary 381

10 Other Parallelization Technologies 383
    GPU-Based Computing 383
    Language Extensions 386
    Threading Building Blocks 386
    Cilk++ 389
    Grand Central Dispatch 392
    Features Proposed for the Next C and C++ Standards 394
    Microsoft's C++/CLI 397
    Alternative Languages 399
    Clustering Technologies 402
    MPI 402
    MapReduce as a Strategy for Scaling 406
    Grids 407
    Transactional Memory 407
    Vectorization 408
    Summary 409
11 Concluding Remarks 411
    Writing Parallel Applications 411
    Identifying Tasks 411
    Estimating Performance Gains 412
    Determining Dependencies 413
    Data Races and the Scaling Limitations of Mutex Locks 413
    Locking Granularity 413
    Parallel Code on Multicore Processors 414
    Optimizing Programs for Multicore Processors 415
    The Future 416

Bibliography 417
    Books 417
    POSIX Threads 417
    Windows 417
    Algorithmic Complexity 417
    Computer Architecture 417
    Parallel Programming 417
    OpenMP 418
    Online Resources 418
    Hardware 418
    Developer Tools 418
    Parallelization Approaches 418

Index 419