Shared Memory Architectures. Programming and Synchronization.


Architectures, Programming and Synchronization
  Discuss paper on Cosmic Cube (message passing)

Today's Outline
  Message passing review
  Cosmic Cube discussion
    > Message passing machine
  Shared memory model
    > Communication
    > Synchronization
  Ultracomputer/RP3 discussion
    > Shared memory machine
  Shared memory programming
  Fine grain versus coarse grain parallelism
  How do caches change things
    > Improve and complicate!
    > Beehive

6.173 Fall 2010 L07 Agarwal

Review: Message Passing Parallel Programming Model
  [Figure: Process A and Process B, each with private memory, exchanging a message]
  E.g., MPI
  Communication: via messages
  Synchronization: via messages

How to Receive a Message
  [Figure: Core 0 and Core 1, each with a local cache; processor P uses stio and ldio to exchange a message]
  Beehive uses polling
  Wait in loop if no msg

The Cosmic Cube: The Earliest Message Passing Machine
  64 nodes (remember, multicores on a single chip arrived circa 2000)
  Direct network: hypercube (details later in course)
  Private memories
  Message sends by calling into OS
  Routing in software
  Sequential programming on each processor & message send/receive (much like Beehive)
  Hide comm latency by switching processes
  Simple hardware
  Discuss paper

The Cosmic Cube
  [Figure: nodes 0, 1, 2, 3, 4, ... 7, each with a memory M, a processor P, and a message interface; messages travel Dist=1, Dist=3, and Dist=7; process A is waiting on a reply, so the node switches to process B to hide latency]

Next, Recall, Parallel Programming Model
  [Figure: designers sharing a blackboard; the blackboard captures state]
  Shared memory
  Threads, e.g., pthreads
  Communication: via shared memory
  Synchronization: shared memory locks

Ultracomputer Design
  [Figure: processors connected to memories M0, M1 (with memory lock) through a network. This is a butterfly network, a variant of the Omega network]
  Indirect network: Omega network (details later in course)
  Shared memory machine
  Communication/synchronization through shared memory
  Hardware routing of memory requests
  No latency hiding: wait for memory request
  Concept built as the IBM RP3 machine (we will see this later)

Popularized SPMD Programming (Single-program multiple-data)
  Annotate sequential programs
  [Figure: a program divided into parallel, replicate, and serial sections, separated by barrier synchronization; sample statements Glob_C=5 and Glob_Z=Glob_Z+1]
  You will do this in lab 4

Quick Detour: Barrier Synchronization
  [Figure: processors P executing instructions (add, sub, or, xor) along a time axis; each waits at the barrier until all have arrived, then it is OK to proceed]
  Barrier synchronization applies to a set of processes
  A process that executes a barrier must wait until all other processes have executed their barrier
  Discuss how to do a barrier on Beehive using message passing
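The barrier rule above can be sketched on a shared memory machine as a counter protected by a lock. This is only a minimal sketch using POSIX threads, not the Beehive message-passing barrier the slide asks about. The names `barrier_t`, `barrier_wait`, `worker`, `run_barrier_demo`, and the value of `NPROC` are assumptions made for illustration.

```c
#include <assert.h>
#include <pthread.h>
#include <stdint.h>

#define NPROC 4                        /* assumed number of processes */

/* Hypothetical counter barrier; all names are illustrative. */
typedef struct {
    pthread_mutex_t lock;
    pthread_cond_t  all_here;
    int waiting;                       /* processes that have arrived */
    int generation;                    /* number of completed barriers */
} barrier_t;

static barrier_t bar = { PTHREAD_MUTEX_INITIALIZER, PTHREAD_COND_INITIALIZER, 0, 0 };
static int phase1[NPROC];              /* written before the barrier */
static int phase2[NPROC];              /* written after the barrier */

static void barrier_wait(barrier_t *b)
{
    pthread_mutex_lock(&b->lock);
    int gen = b->generation;
    if (++b->waiting == NPROC) {       /* last arrival releases everyone */
        b->waiting = 0;
        b->generation++;
        pthread_cond_broadcast(&b->all_here);
    } else {
        while (gen == b->generation)   /* wait until the barrier opens */
            pthread_cond_wait(&b->all_here, &b->lock);
    }
    pthread_mutex_unlock(&b->lock);
}

static void *worker(void *arg)
{
    int mypid = (int)(intptr_t)arg;
    phase1[mypid] = mypid + 1;                     /* work before the barrier */
    barrier_wait(&bar);                            /* everyone has written phase1 */
    phase2[mypid] = phase1[(mypid + 1) % NPROC];   /* now safe to read a neighbor */
    return NULL;
}

/* Runs NPROC threads and returns the sum of the phase2 values. */
int run_barrier_demo(void)
{
    pthread_t t[NPROC];
    int sum = 0;
    for (int p = 0; p < NPROC; p++) {
        int rc = pthread_create(&t[p], NULL, worker, (void *)(intptr_t)p);
        assert(rc == 0);
        (void)rc;
    }
    for (int p = 0; p < NPROC; p++)
        pthread_join(t[p], NULL);
    for (int p = 0; p < NPROC; p++)
        sum += phase2[p];
    return sum;
}
```

Without the barrier, a process could read its neighbor's `phase1` entry before it was written. The `generation` field lets the same barrier be reused on the next iteration.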

SPMD Programming Approach: Single Program Multiple Data
  You should learn this!
  Most parallel programs written for commodity multicores use this style (all commodity multicores happen to be shared memory machines!)*
  All processors run a copy of the same program (commonly a slightly modified version of the sequential program)
  Processor-specific behavior created using unique processor IDs
  Also need to introduce synchronization as necessary
  Let's do a simple example to build intuition
  *Note that, in general, the SPMD style of programming can be applied to either shared memory or message passing machines

Adding a Pair of Vectors: A Sequential Program

    #define LENGTH 1000000
    int a[LENGTH], b[LENGTH], c[LENGTH];
    int i = 0;

    int main(void)
    {
        /* Initializations */
        ...
        /* read in the two vectors */
        ...
        i = 0;
        while (i < LENGTH) {
            c[i] = a[i] + b[i];
            i = i + 1;
        }
        /* output the answer */
        ...
    }

  Sequential addition of two vectors

Parallel SPMD Version
  Assume Ultracomputer model. Assume no caches, single word memory access

    #define LENGTH 1000000
    int a[LENGTH], b[LENGTH], c[LENGTH];
    int i = 0;
    int L = 0;

    int main(void)
    {
        /* create parallel processes */
        ...
        /* Assume each process runs the rest of the same program */

        /* Initializations: only process 0 runs this */
        if (mypid == 0) ...
        /* read in the two vectors */
        if (mypid == 0) ...

        int myi;
        myi = getwork();              /* get an index on which to work */
        while (myi < LENGTH) {
            c[myi] = a[myi] + b[myi];
            myi = getwork();          /* example of self scheduling */
        }

        /* output the answer */
        if (mypid == 0) ...
    }

    int getwork()
    {
        int mine;
        getlock();
        mine = i;                     /* claim the current index while holding the lock */
        i = i + 1;                    /* increment is atomic */
        releaselock();
        return(mine);
    }

Pure
  [Figure: shared memory holding i=4]
  No caches, single word reads/writes
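The self-scheduling loop above can be tried on an ordinary multicore with POSIX threads. This is a sketch under assumptions, not the Ultracomputer code. A `pthread_mutex_t` stands in for `getlock()`/`releaselock()`, `LENGTH` is reduced, and `NPROC`, `process`, and `run_vector_add` are names made up here. `getwork()` copies the index while it still holds the lock, so no two processes can claim the same element.

```c
#include <assert.h>
#include <pthread.h>

#define LENGTH 100000                  /* smaller than the slide's 1000000 */
#define NPROC  4                       /* assumed number of processes */

static int a[LENGTH], b[LENGTH], c[LENGTH];
static int i = 0;                      /* shared next-index counter */
static pthread_mutex_t L = PTHREAD_MUTEX_INITIALIZER;  /* stands in for getlock/releaselock */

/* Hand out the next unclaimed index; the copy is taken while holding the lock. */
static int getwork(void)
{
    pthread_mutex_lock(&L);
    int mine = i;
    i = i + 1;
    pthread_mutex_unlock(&L);
    return mine;
}

static void *process(void *unused)
{
    (void)unused;
    int myi = getwork();
    while (myi < LENGTH) {             /* self scheduling: keep asking for work */
        c[myi] = a[myi] + b[myi];
        myi = getwork();
    }
    return NULL;
}

/* Fills a and b, runs NPROC threads, and returns how many c[k] are correct. */
int run_vector_add(void)
{
    pthread_t t[NPROC];
    int correct = 0;
    i = 0;
    for (int k = 0; k < LENGTH; k++) {
        a[k] = k;
        b[k] = 2 * k;
        c[k] = -1;                     /* marks "not yet computed" */
    }
    for (int p = 0; p < NPROC; p++) {
        int rc = pthread_create(&t[p], NULL, process, NULL);
        assert(rc == 0);
        (void)rc;
    }
    for (int p = 0; p < NPROC; p++)
        pthread_join(t[p], NULL);
    for (int k = 0; k < LENGTH; k++)
        correct += (c[k] == 3 * k);
    return correct;
}
```

Every element costs one lock acquisition, which is the overhead the later slides set out to reduce.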

Pure
Lock: Using Test-and-Set Instruction
  Example of spin lock
  [Figure: shared memory holding i=4 and the lock L; T&S returns the old value and writes 1]

    void getlock()
    {
        while (T&S(L) == 1) ;   /* loop till you get the lock */
    }

    void releaselock()
    {
        L = 0;                  /* release the lock */
    }

  T&S(L): atomic read-write [Return old value; Write 1]
  How to implement T&S in HW? In SW?

Pure
Test-and-Set Instruction Implementation
  T&S(L): atomic read-write [Return old value; Write 1]
  How to implement T&S in SW? Dekker's Alg.
  Problem: Lock is held for the load-store cycle! Locks out even the lock releaser.
  Can we do better? Ideas?
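On current hardware the T&S primitive is available directly from C11: `atomic_flag_test_and_set` returns the old value and sets the flag in one atomic step. The sketch below builds the spin lock from the slide on top of it and uses it to protect a shared counter. The counter, `NPROC`, `ITERS`, `process`, and `run_spinlock_demo` are assumptions added for the demonstration.

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>

#define NPROC 4                        /* assumed number of processes */
#define ITERS 20000                    /* increments per process */

static atomic_flag L = ATOMIC_FLAG_INIT;   /* clear = free, set = held */
static long counter = 0;                   /* protected by L */

static void getlock(void)
{
    /* atomic_flag_test_and_set returns the old value and writes "set" */
    while (atomic_flag_test_and_set(&L))
        ;                              /* loop till you get the lock */
}

static void releaselock(void)
{
    atomic_flag_clear(&L);             /* release the lock */
}

static void *process(void *unused)
{
    (void)unused;
    for (int k = 0; k < ITERS; k++) {
        getlock();
        counter = counter + 1;         /* critical section */
        releaselock();
    }
    return NULL;
}

/* Runs NPROC threads and returns the final counter value. */
long run_spinlock_demo(void)
{
    pthread_t t[NPROC];
    counter = 0;
    for (int p = 0; p < NPROC; p++) {
        int rc = pthread_create(&t[p], NULL, process, NULL);
        assert(rc == 0);
        (void)rc;
    }
    for (int p = 0; p < NPROC; p++)
        pthread_join(t[p], NULL);
    return counter;
}
```

If the lock works, no increment is lost and the result is exactly `NPROC * ITERS`. Every failed attempt still performs an atomic write, which is the traffic problem the slide raises.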

Pure
Test & Test & Set
  [Figure: shared memory holding i=4 and the lock L]

    void getlock()
    {
        while (L == 1) ;
        while (T&S(L) == 1) ;   /* loop till you get the lock */
    }

    void releaselock()
    {
        L = 0;                  /* release the lock */
    }

  T&S(L): atomic read-write [Read old value; Write 1]
  Any other problems?

Pure
Backoff concept

    void getlock()
    {
        while (L == 1) ;        /* introduce backoff here */
        while (T&S(L) == 1) ;   /* loop till you get the lock */
    }

    void releaselock()
    {
        L = 0;                  /* release the lock */
    }

  T&S(L): atomic read-write [Read old value; Write 1]
  Can do exponential backoff, quadratic backoff, random backoff, etc.
  We engineers love to optimize!
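One common way to combine test-and-test-and-set with exponential backoff is sketched below in C11. It is an illustration under assumptions, not the slide's exact code. `atomic_exchange` plays the role of T&S, a failed attempt returns to the read-only test, and the backoff is a bounded number of `sched_yield()` calls that doubles after each failure. `MAX_DELAY`, `NPROC`, `ITERS`, `process`, and `run_backoff_demo` are invented for the example.

```c
#include <assert.h>
#include <pthread.h>
#include <sched.h>
#include <stdatomic.h>

#define NPROC     4                    /* assumed number of processes */
#define ITERS     20000                /* increments per process */
#define MAX_DELAY 64                   /* assumed cap on the backoff */

static atomic_int L = 0;               /* 0 = free, 1 = held */
static long counter = 0;               /* protected by L */

static void getlock(void)
{
    unsigned delay = 1;
    for (;;) {
        while (atomic_load(&L) == 1)
            ;                          /* test: read-only spin */
        if (atomic_exchange(&L, 1) == 0)
            return;                    /* test-and-set: old value 0 means we own the lock */
        for (unsigned k = 0; k < delay; k++)
            sched_yield();             /* back off before trying again */
        if (delay < MAX_DELAY)
            delay *= 2;                /* exponential backoff */
    }
}

static void releaselock(void)
{
    atomic_store(&L, 0);               /* release the lock */
}

static void *process(void *unused)
{
    (void)unused;
    for (int k = 0; k < ITERS; k++) {
        getlock();
        counter = counter + 1;         /* critical section */
        releaselock();
    }
    return NULL;
}

/* Runs NPROC threads and returns the final counter value. */
long run_backoff_demo(void)
{
    pthread_t t[NPROC];
    counter = 0;
    for (int p = 0; p < NPROC; p++) {
        int rc = pthread_create(&t[p], NULL, process, NULL);
        assert(rc == 0);
        (void)rc;
    }
    for (int p = 0; p < NPROC; p++)
        pthread_join(t[p], NULL);
    return counter;
}
```

The inner read-only loop generates no writes while the lock is held, and the doubling delay spreads out the retries when many processes see the lock become free at once.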

Pure
  So, getting a work item is not so cheap after all, is it? Any ideas?
  [Figure: shared memory holding i and L; one call to getwork() claims i=4,5,6,7]
  Coarse grain parallelism (versus fine grain parallelism): get a block of 4 or 16 or more indices each time to amortize the overhead of locking

Jacobi, Same Basic Concept
  Getwork() grabs an index to a row (e.g.)
  Synchronization as before
  Loads and stores to shared array
  Finish row. How do I know when to start the next Jacobi iteration?
  Use a barrier after you finish your row
  Lots of communication over the network
  And very energy inefficient
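The coarse grain idea can be sketched by letting `getwork()` claim a block of `CHUNK` indices per lock acquisition. As before, this is an assumption-laden illustration: a pthread mutex stands in for the lock, `LENGTH` is reduced, and `CHUNK`, `NPROC`, `lock_acquisitions`, `process`, and `run_chunked_add` are names introduced here.

```c
#include <assert.h>
#include <pthread.h>

#define LENGTH 100000                  /* smaller than the slide's 1000000 */
#define NPROC  4                       /* assumed number of processes */
#define CHUNK  16                      /* indices claimed per lock acquisition */

static int a[LENGTH], b[LENGTH], c[LENGTH];
static int i = 0;                      /* first unclaimed index */
static int lock_acquisitions = 0;      /* protected by L; counts calls to getwork */
static pthread_mutex_t L = PTHREAD_MUTEX_INITIALIZER;

/* Claim the block [start, start + CHUNK) and return start. */
static int getwork(void)
{
    pthread_mutex_lock(&L);
    int start = i;
    i = i + CHUNK;
    lock_acquisitions++;
    pthread_mutex_unlock(&L);
    return start;
}

static void *process(void *unused)
{
    (void)unused;
    int start = getwork();
    while (start < LENGTH) {
        int end = (start + CHUNK < LENGTH) ? start + CHUNK : LENGTH;
        for (int k = start; k < end; k++)   /* no locking inside the block */
            c[k] = a[k] + b[k];
        start = getwork();
    }
    return NULL;
}

/* Runs NPROC threads and returns how many c[k] are correct. */
int run_chunked_add(void)
{
    pthread_t t[NPROC];
    int correct = 0;
    i = 0;
    lock_acquisitions = 0;
    for (int k = 0; k < LENGTH; k++) {
        a[k] = k;
        b[k] = 2 * k;
        c[k] = -1;                     /* marks "not yet computed" */
    }
    for (int p = 0; p < NPROC; p++) {
        int rc = pthread_create(&t[p], NULL, process, NULL);
        assert(rc == 0);
        (void)rc;
    }
    for (int p = 0; p < NPROC; p++)
        pthread_join(t[p], NULL);
    for (int k = 0; k < LENGTH; k++)
        correct += (c[k] == 3 * k);
    return correct;
}
```

With these values the lock is taken `LENGTH / CHUNK + NPROC` times (one extra call per process to discover that the work is finished) instead of roughly once per element. A larger `CHUNK` lowers locking overhead but makes load balancing coarser.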

Pure
32-bit energy costs in 40nm
  DRAM read: ~1000pJ
  Cache read (small L1): ~10pJ
  Send 1mm distance: ~10pJ
  Register read: ~1pJ
  Add: ~1pJ
  Ideas? Caches!
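The figures above can be combined into a rough per-element estimate for c[i] = a[i] + b[i]. This is a back-of-the-envelope sketch that counts only the two operand reads and the add. It ignores the store, the lock traffic, and the network distance, and it assumes both operands come either entirely from DRAM or entirely from a small L1 cache.

```latex
\begin{align*}
E_{\text{DRAM}} &\approx 2 \times 1000\,\text{pJ} + 1\,\text{pJ} = 2001\,\text{pJ} \\
E_{\text{L1}}   &\approx 2 \times 10\,\text{pJ} + 1\,\text{pJ} = 21\,\text{pJ} \\
\frac{E_{\text{DRAM}}}{E_{\text{L1}}} &\approx \frac{2001}{21} \approx 95
\end{align*}
```

Under these assumptions the arithmetic itself is almost free, and where the operands come from sets the energy cost, which is the motivation for adding caches.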