Thread Tailor: Dynamically Weaving Threads Together for Efficient, Adaptive Parallel Applications


Janghaeng Lee, Haicheng Wu, Madhumitha Ravichandran, Nathan Clark

Motivation
Hardware trend (2009 → 201X): put more and more cores in a single chip. CPU-intensive programs exploit thread-level parallelism, but do more threads always win? NO!

Optimal Number of Threads
Too many threads: more synchronization, more contention for system resources.
Too few threads: resource underutilization.
Who can decide the right number? Not the programmer.

Why Not the Programmer?
The input changes: working-set sizes vary.
The system changes: the available resources vary.
The hardware changes: L2/L3 cache structure, size, etc. vary.
So the decision must be made at runtime.

Proposal
The programmer's job stays simple: "OK, I will create lots of threads" (> 128). The binary is compiled and distributed as usual; Thread Tailor then combines threads (e.g., down to 16) to produce a new binary.
Combining threads means grouping several threads into a single thread: threads in the same group are executed serially, on the SAME core.

Details
Development: the profiler instruments the code; running the instrumented binary with > 128 threads produces profile information in the form of graphs.
Distribution: on the user's machine, Thread Tailor collects system information, runs the combining algorithm on those graphs, and its code generator emits the combined code.

Graph Construction
Each thread becomes a node annotated with its profiled cycle count and working-set size (e.g., Thread 1: cycles = 10M, working set = 10K). An edge between two threads carries their synchronization cost (in cycles) and their communication cost.
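As a concrete illustration, a minimal C sketch of such a graph follows; the type and field names are assumptions for illustration, not the authors' actual data structures.

    /* Hypothetical node/edge types for the thread graph: each thread is
     * a node annotated with profiled cycles and working-set size; each
     * edge carries synchronization and communication costs in cycles. */
    typedef struct {
        long long cycles;       /* e.g., Thread 1: 10M cycles          */
        long long working_set;  /* e.g., Thread 1: 10K bytes touched   */
    } ThreadNode;

    typedef struct {
        int src, dst;           /* thread ids of the endpoints         */
        long long sync_cost;    /* synchronization cost (cycles)       */
        long long comm_cost;    /* communication cost (next slide)     */
    } ThreadEdge;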

Communication Cost
Intuition: a STORE instruction in one thread causes a coherence miss in the other thread's cache, so memory accesses are logged per thread.

Thread 1:
Address      LD Count   ST Count
0x00001234   5          10
0x00001338   4          9
0x00004000   7          7

Thread 2:
Address      LD Count   ST Count
0x00001234   0          7
0x00002000   4          4
0x00004000   3          8

For each address both threads touch, the cost is MIN(LD1, ST2) + MIN(ST1, LD2) + MIN(ST1, ST2):
0x00001234: MIN(5, 7) + MIN(10, 0) + MIN(10, 7) = 12
0x00004000: MIN(7, 8) + MIN(7, 3) + MIN(7, 8) = 17
Total communication cost, the weight of the edge between threads 1 and 2 in the graph: 12 + 17 = 29
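The arithmetic above is mechanical, so here is a small C sketch that reproduces it; the Access type and helper names are assumptions for illustration.

    #include <stdio.h>

    /* Per-thread load/store counts for one logged address. */
    typedef struct { unsigned long addr; int ld, st; } Access;

    static int min(int a, int b) { return a < b ? a : b; }

    /* Cost contributed by one address accessed by both threads:
     * MIN(LD1, ST2) + MIN(ST1, LD2) + MIN(ST1, ST2). */
    static int addr_cost(Access a, Access b) {
        return min(a.ld, b.st) + min(a.st, b.ld) + min(a.st, b.st);
    }

    int main(void) {
        /* The two addresses from the slide that both threads touch. */
        Access t1_1234 = {0x1234, 5, 10}, t2_1234 = {0x1234, 0, 7};
        Access t1_4000 = {0x4000, 7, 7},  t2_4000 = {0x4000, 3, 8};

        int total = addr_cost(t1_1234, t2_1234)   /* 5 + 0 + 7 = 12 */
                  + addr_cost(t1_4000, t2_4000);  /* 7 + 3 + 7 = 17 */
        printf("edge weight = %d\n", total);      /* prints 29 */
        return 0;
    }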

Combining Algorithm
Kernighan-Lin (KL) graph-partitioning heuristic.
Goal: minimize execution cycles.
Precondition: # combined threads ≥ # cores.
Example: eight thread nodes A through H (100 cycles each in the example), connected mostly by edges of weight 60 plus one light edge of weight 10, are partitioned across 2 cores. In each step KL estimates the per-partition cycle counts that each candidate move would produce, then applies the best move; in the slide's example the estimates drop from 210/220 to 130/120 and finally to a balanced 40/40.
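KL proper tentatively swaps node pairs and can accept temporarily worse states; the following C sketch is a greedy single-move simplification under an assumed cost model (partition cycles = node cycles plus cut-edge costs), meant only to convey the flavor of the search, not the paper's exact algorithm.

    #include <stdio.h>

    #define N 8
    static long long node_cycles[N];   /* profiled cycles per thread   */
    static long long edge[N][N];       /* sync + communication costs   */
    static int part[N];                /* 0 or 1: assigned partition   */

    /* Assumed cost model: each partition pays for its nodes' cycles
     * plus every edge crossing the cut; we minimize the larger side. */
    static long long makespan(void) {
        long long cost[2] = {0, 0};
        for (int i = 0; i < N; i++) {
            cost[part[i]] += node_cycles[i];
            for (int j = i + 1; j < N; j++)
                if (edge[i][j] && part[i] != part[j]) {
                    cost[0] += edge[i][j];   /* both sides pay for the  */
                    cost[1] += edge[i][j];   /* cross-core traffic      */
                }
        }
        return cost[0] > cost[1] ? cost[0] : cost[1];
    }

    /* Greedily apply the best single-node move until nothing improves. */
    static void kl_pass(void) {
        for (;;) {
            long long best = makespan();
            int best_node = -1;
            for (int i = 0; i < N; i++) {
                part[i] ^= 1;                 /* tentative move */
                long long m = makespan();
                if (m < best) { best = m; best_node = i; }
                part[i] ^= 1;                 /* undo           */
            }
            if (best_node < 0) break;
            part[best_node] ^= 1;             /* commit best move */
        }
    }

    int main(void) {
        /* Toy data echoing the slide: 100-cycle nodes, 60-weight edges
         * plus one light edge of weight 10 (placement illustrative). */
        for (int i = 0; i < N; i++) { node_cycles[i] = 100; part[i] = i & 1; }
        edge[0][1] = edge[1][0] = 60;
        edge[2][3] = edge[3][2] = 60;
        edge[4][5] = edge[5][4] = 60;
        edge[6][7] = edge[7][6] = 10;
        kl_pass();
        for (int i = 0; i < N; i++)
            printf("thread %c -> partition %d\n", 'A' + i, part[i]);
        return 0;
    }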

Thread Combining
The thread APIs are replaced with wrapper functions. When the application calls the wrapper for thread creation (vm_thread_create()), the dynamic compiler checks: is this thread a target to combine?
No: create a normal thread.
Yes: create a user thread.
User threads are context-switched by the dynamic compiler and executed serially inside one real thread; the translated code is kept in a code cache.
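A minimal sketch of such a wrapper follows, assuming pthreads underneath; should_combine() and enqueue_user_thread() are hypothetical stand-ins for the dynamic compiler's bookkeeping, not Thread Tailor's real interface.

    #include <pthread.h>
    #include <stdbool.h>

    /* Hypothetical hooks into the dynamic compiler. */
    static bool should_combine(void *(*fn)(void *)) {
        (void)fn;     /* would consult the combining algorithm's result */
        return false;
    }
    static void enqueue_user_thread(void *(*fn)(void *), void *arg) {
        (void)fn; (void)arg;
        /* would register fn/arg with the user-level scheduler that
         * context-switches user threads serially on one real thread */
    }

    /* Wrapper that replaces the normal thread-creation API. */
    int vm_thread_create(pthread_t *tid, void *(*fn)(void *), void *arg) {
        if (!should_combine(fn))
            return pthread_create(tid, NULL, fn, arg); /* normal thread */
        enqueue_user_thread(fn, arg);  /* target to combine: user thread */
        return 0;
    }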

Experimental Setup
2 cores: Intel Core 2 Duo E6600 (2.4 GHz)
4 cores: Intel Core 2 Quad Q6600 (2.4 GHz)
8 cores: two quad-core Intel Xeon E5520 CPUs (2.26 GHz)
16 cores (logical): two quad-core Intel Xeon E5520 CPUs (2.26 GHz) with SMT (HyperThreading) enabled

Speedup Results
[Bar chart: speedup on fluidanimate, transpose, blackscholes, twister, water_n^2, and swaptions at 2, 4, 8, and 16 cores; y-axis runs 0.9 to 1.15, with taller bars labeled 1.2, 1.31, 1.66, 2.36, and 1.83.]

Result Analysis: Transpose
The benchmark transposes an m * n matrix to n * m. In the parallel version, thread 1 and thread 2 work 128 columns apart in the input matrix and 128 rows apart in the output matrix.

The analysis machine is an Intel Nehalem with a private 32K L1, a private 256K L2, a shared 8M L3, and 64-byte cache blocks. The two threads' accesses therefore land 512 bytes apart in the input matrix and 128 rows apart in the output matrix.

Each thread iterates 128 times per pass, touching 8KB (128 * 64 bytes) of the input matrix and 8KB of the output matrix. When Thread Tailor combines the two threads onto core 0, the second thread's stores hit cache blocks the first thread has already brought into L1 (WRITE HIT!), and the combined working set fits into the L1 cache, so there are no capacity misses.
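For reference, a minimal pthreads sketch of this transpose pattern follows; the strip width (128) comes from the slides, while the element type (float), matrix size, and exact loop structure are assumptions.

    #include <pthread.h>

    /* Two threads transpose strips 128 columns apart, as on the slides.
     * Each strided access lands in a different 64-byte cache block, so
     * a 128-iteration strip walks 128 * 64 B = 8 KB of blocks; two
     * combined strips (16 KB) still fit in a 32 KB L1. */
    #define N 1024
    #define STRIP 128

    static float in[N][N], out[N][N];

    static void *transpose_strip(void *arg) {
        int first = *(int *)arg;           /* first column of this strip */
        for (int j = first; j < first + STRIP; j++)
            for (int i = 0; i < N; i++)
                out[j][i] = in[i][j];      /* strided column reads of in */
        return 0;
    }

    int main(void) {
        pthread_t t1, t2;
        int c1 = 0, c2 = STRIP;            /* threads 128 columns apart */
        pthread_create(&t1, 0, transpose_strip, &c1);
        pthread_create(&t2, 0, transpose_strip, &c2);
        pthread_join(t1, 0);
        pthread_join(t2, 0);
        return 0;
    }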

Summary
Choosing the optimal number of threads is hard. Thread Tailor eases the pain: it represents the application's threads as a graph and combines them at runtime.

Thank you