CPU Architecture (HPCE / dt10 / 2013, Lecture 10)


What is computation? Input and Output = Communication

A computation repeatedly applies a transform F to its current state s and an input i, producing a new state and an output: F(s, i) = (s', o).

There are many different types of IO (Input/Output), and what constitutes IO is context dependent.
- Obvious forms of external IO: user input devices (mouse, keyboard, joystick); sensors (cameras, microphones, Kinect); networks (Ethernet, wifi, GSM)
- Obvious forms of internal IO: read from memory, write to memory; read from disk, write to disk
- Less obvious forms of IO: interrupts from external devices; transfers from cache to memory; registers to cache

Transforming state

In what ways does a computer transform state?
- Data manipulation: add, multiply, compare, xor, and, ...
- Control flow: sequencing, loops, conditionals
- Transfer of control: function calls, OS calls
These map naturally to our idea of instructions. But what constitutes the state?
- Contents of the registers (PC, general registers, flags, stack pointer)
- Contents of registers + stack
- Contents of registers + stack + heap; all of the above + page table; all of the above + disk contents
When does manipulation of state become IO?

Hardware thread (CPU core)

An active thread of computation executing on a core (a small executable sketch of the state-transform view follows below).
- State: PC, stack pointer, flags, registers
- IO: read and write to memory, transfer control to the supervisor
Provides very simple processing of data and control flow:
- Transforming data contained in registers
- Simple control flow: sequencing, branching
- Nested control flow: function calls/returns
Should we treat push/pop as transformation or communication? Does the stack constitute state of the hardware thread? When does a thread of execution begin and end? What do cores do if there are no threads?
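To make the F(s, i) = (s', o) view above concrete, here is a minimal C++ sketch; the State fields and the step function are hypothetical names invented purely for illustration:

    #include <cstdint>
    #include <tuple>
    #include <utility>

    // Hypothetical machine state: the registers a hardware thread carries.
    struct State {
        uint32_t pc;    // program counter
        uint32_t acc;   // a single accumulator register
    };

    // One step of computation in the F(s, i) = (s', o) view:
    // consume an input, transform the state, and emit an output.
    std::pair<State, uint32_t> step(State s, uint32_t input) {
        State next = s;
        next.acc += input;                 // data manipulation (state transform)
        next.pc  += 1;                     // control flow: sequencing
        uint32_t output = next.acc;        // communication back to the outside world
        return {next, output};
    }

    int main() {
        State s{0, 0};
        uint32_t out = 0;
        for (uint32_t i : {1u, 2u, 3u}) {  // inputs arriving over time
            std::tie(s, out) = step(s, i); // (s', o) = F(s, i)
        }
        return static_cast<int>(out);      // 6
    }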

OS thread (kernel thread)

A potentially active thread of computation.
- May be currently assigned to a hardware thread; might be paused
- State: PC, registers, stack pointer + operating-system meta-data
- IO: access memory, call OS functions, synchronise with other threads
Operating-system threads have a more complicated lifecycle:
- Active: currently assigned to a hardware thread and running
- Inactive: ready to run, but not assigned to a hardware thread
- Blocked: unable to run until some condition is met
The number of OS threads is limited by storage, not by core count. If there are more OS threads than hardware threads, time-slicing will occur.

Process

A collection of OS threads and allocated OS resources.
- State: state of the logical threads, page table, resources
- IO: inter-process communication, shared memory, networks, files
A process is a unit of isolation:
- Threads within a process share a single memory space
- Threads in different processes cannot communicate via memory
- Page table: maps process-relative addresses to physical addresses
OS threads and processes are created via the OS:
- New OS thread: CreateThread, pthread_create (see the sketch below)
- New process: CreateProcess, exec, spawn [1]

Scheduling OS threads to cores

What causes a hardware thread to switch OS threads?
- Interrupts: an external factor causes an interrupt handler to run
- Pre-emption: the OS decides another thread should have a chance; usually originates in some form of recurring interrupt timer
- Exceptions: the thread tries an operation that can't be handled. Benign exception: page fault (need to swap a page in from disk). Error: invalid page fault, invalid instruction, divide by zero
- IO: the thread asks the OS to perform some service, e.g. read/write to a file or socket
- Synchronisation: the thread waits for a signal from another thread
The process of changing threads is a context switch.

[1] This refers to a POSIX OS spawn, not a cilk spawn.
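A minimal C++ sketch of creating an OS thread with pthread_create, as mentioned above (error handling kept to a minimum; compile with -pthread):

    #include <pthread.h>
    #include <cstdio>

    void* worker(void* arg) {
        int id = *static_cast<int*>(arg);
        std::printf("OS thread %d running; may be time-sliced onto any core\n", id);
        return nullptr;   // the thread ends when its start routine returns
    }

    int main() {
        pthread_t tid;
        int id = 0;
        // Ask the OS for a new schedulable thread within this process.
        if (pthread_create(&tid, nullptr, worker, &id) != 0)
            return 1;
        // Block (synchronisation IO) until the worker finishes.
        pthread_join(tid, nullptr);
        return 0;
    }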

The context switch: explicit steps

To schedule a different OS thread onto a hardware thread:
1. Suspend execution of the current hardware thread
2. Copy the hardware thread state into the OS thread state
3. Retrieve the next OS thread
4. Copy the new OS thread's state to the hardware thread
5. Resume execution of the hardware thread
The cost differs when switching between processes:
- Thread context switch: save and restore registers
- Process context switch: save and restore page-table information too

The context switch: hidden costs

Threads collect some amount of local data while executing:
- Instruction cache, data cache
- Branch predictor, branch-target buffer
- Translation lookaside buffer: virtual pages mapped to physical memory
Data may be shared within a process or specific to an OS thread:
- Data cache: threads have different stacks
- Instruction cache: threads often execute the same code
The local data a thread needs in cache is its working set.

The advance of caches

Memory is an abstraction layer: read from this address, write to this address. It used to be physical reality. The cache hierarchy creates zones:
- Inner levels (registers, L1, L2): low latency, high bandwidth
- Outer levels: high capacity, cheap
Actual memory is very far away: very big, but very slow. We are hitting the memory wall.

Parallel CPUs, parallel caches

We now have parallel CPUs, each with local caches, sharing the underlying memory. How does tbb::atomic work? (A sketch of the idea follows below.)
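The lecture poses the tbb::atomic question; as a rough illustration, here is the same idea sketched in C++ with std::atomic, on the assumption that both expose the same kind of read-modify-write operation that the coherence protocol serialises across caches:

    #include <atomic>
    #include <thread>
    #include <vector>
    #include <cstdio>

    std::atomic<int> counter{0};   // lives in memory, cached near whichever CPU touches it

    void worker() {
        for (int i = 0; i < 100000; ++i)
            counter.fetch_add(1, std::memory_order_relaxed);  // typically one locked RMW instruction
    }

    int main() {
        std::vector<std::thread> pool;
        for (int i = 0; i < 4; ++i) pool.emplace_back(worker);
        for (auto& t : pool) t.join();
        std::printf("%d\n", counter.load());   // always 400000: coherence serialises the updates
        return 0;
    }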

Parallel CPUs connect via their caches, because the memory itself is too dumb. Identical data may be found in multiple caches, e.g. shared read-only data. What about writes?
- Lazy consistency for ordinary data
- Atomics rely on cache coherence: only one CPU can modify atomic data at a time
(The slide shows several CPUs, CPU-0 to CPU-3, each with its own cache $-0 to $-3, above a shared memory.)

Atomic operations expand into a sequence of smaller operations: lock, update, release. Locking ensures the data is present in only one cache. For example, executing y = x++ when x = 3: the owning CPU issues lock(x) while the copies of x in the other caches are invalidated. (A software sketch of this expansion follows below.)
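The lock/update/release expansion of y = x++ can be mimicked in software with a compare-and-swap retry loop; this C++ sketch shows the pattern, and is not a claim about the exact micro-operations the hardware emits:

    #include <atomic>

    std::atomic<int> x{3};

    // y = x++ written out as an explicit retry loop: read, compute the update
    // locally, then publish it only if no other CPU changed x in the meantime.
    // The cache-coherence protocol plays the role of the lock/release steps.
    int atomic_post_increment(std::atomic<int>& v) {
        int old = v.load();
        while (!v.compare_exchange_weak(old, old + 1)) {
            // old has been reloaded with the current value; retry the update
        }
        return old;   // the value observed before the increment, as with x++
    }

    int main() {
        int y = atomic_post_increment(x);   // y == 3, x == 4 afterwards
        return y;
    }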

Atomic operations expand to: lock, update, release. Locking ensures the data is only present in one cache, and while locked the data can be manipulated locally. Other CPUs can read the data once it is released: after y = x++ completes, the updated value x = 4 is released and becomes visible to the other caches.

A problem occurs if two CPUs want to modify the same memory: only one lock will succeed, and the other CPU will block until it can acquire the lock. (In the slide, both CPUs issue lock(x) on x = 4; one wins and performs its y = x++, the other waits.)
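A software analogue of this acquire/block/release behaviour is a spinlock; the C++ sketch below uses std::atomic_flag and is only meant to illustrate the blocking-until-acquired pattern described above:

    #include <atomic>
    #include <thread>

    std::atomic_flag locked = ATOMIC_FLAG_INIT;
    int x = 0;   // plain data protected by the flag

    void increment() {
        // lock(x): keep trying until no other CPU holds the lock
        while (locked.test_and_set(std::memory_order_acquire)) {
            // spin: the other CPU's lock succeeded first, so we wait
        }
        ++x;                                        // update, local to this CPU's cache
        locked.clear(std::memory_order_release);    // release: others may now read/lock
    }

    int main() {
        std::thread a(increment), b(increment);
        a.join(); b.join();
        return x;   // always 2: the two updates were serialised
    }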

Potential problems with atomic operations

Lock contention: CPUs fight to lock the same location.
- Assume each CPU performs an atomic with probability p per cycle. Given n processors, the probability of a conflict per cycle is roughly 1 - (1 - p)^n.
- But eventually progress will be made.
Cache thrashing: locking a variable evicts the entire cache line.
- Memory traffic increases even if conflicts don't occur: data still has to move from cache to cache.
General guidelines for use: atomic operations should be a low percentage of total instructions, and try to ensure that each atomic lives in only one cache (see the padding sketch below).

Cache warmth

Warm cache: the working set for a thread is currently in the caches.
- The thread has been running for a while and has fetched its working set
- The thread was recently scheduled and its data is still in the cache
- The previous thread shared part of the working set
Cold cache: the working set is not in the caches.
- The process/thread is starting up, so data has not been fetched yet
- The thread has not been scheduled in a long time
- A previously scheduled thread has evicted the working set
A hot cache is possible, with all requests serviced from cache. This is difficult to achieve; the thread must have a tiny working set. Sometimes possible with compute-bound work.

Under-subscribed

Fewer OS threads than hardware threads (CPU cores): OS threads start executing and can warm up their caches. Good throughput per processor, but poor utilisation of the available CPUs.
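One common way to keep each atomic in a single cache is to pad per-thread counters out to a full cache line; in the C++ sketch below the 64-byte line size is an assumption (typical of current x86 parts), and results are only merged once at the end:

    #include <atomic>
    #include <thread>
    #include <vector>

    // Cache-line thrashing: even independent counters cause coherence traffic if
    // they share a line. Padding each counter out to a full (assumed 64-byte)
    // line keeps each atomic resident in one cache only.
    struct alignas(64) PaddedCounter {
        std::atomic<long> value{0};
    };

    int main() {
        constexpr int n = 4;
        static PaddedCounter counters[n];       // one counter per thread, one line each
        std::vector<std::thread> pool;
        for (int t = 0; t < n; ++t)
            pool.emplace_back([t] {
                for (int i = 0; i < 1000000; ++i)
                    counters[t].value.fetch_add(1, std::memory_order_relaxed);
            });
        for (auto& th : pool) th.join();

        long total = 0;
        for (const auto& c : counters) total += c.value.load();  // merge at the end
        return total == n * 1000000L ? 0 : 1;
    }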

Over-subscribed

More OS threads than hardware threads: OS threads are scheduled and start warming up the cache, with lower throughput while warming up. Once warmed up there is good throughput on all CPUs. Eventually the OS decides to pre-emptively schedule: the newly scheduled threads are completely cold, and the previous threads' cached data starts to cool down.

The newly scheduled threads start to warm the cache but have lower throughput. Eventually the OS decides to reschedule again, and will evict the longest-running threads, which have the hottest caches.

Well-subscribed

The same number of OS threads as hardware threads (a sketch of sizing the thread pool this way follows below).
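A minimal C++ sketch of well-subscription: ask the standard library how many hardware threads exist and create exactly that many OS threads (std::thread::hardware_concurrency may return 0 if the count is unknown, hence the fallback):

    #include <thread>
    #include <vector>
    #include <cstdio>

    int main() {
        unsigned hw = std::thread::hardware_concurrency();
        if (hw == 0) hw = 1;                    // the call may return 0 if the count is unknown
        std::printf("hardware threads: %u\n", hw);

        std::vector<std::thread> pool;
        for (unsigned i = 0; i < hw; ++i)
            pool.emplace_back([] {
                volatile long sum = 0;
                for (long j = 0; j < 10000000; ++j) sum += j;   // stand-in for real work
            });
        for (auto& t : pool) t.join();
        return 0;
    }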

With the same number of OS threads as hardware threads there is good throughput per processor and good utilisation of all processors.

Thread affinity

What happens if the OS decides to shuffle threads around? It could move a thread away from its warm cache. OS threads can be given affinity to a specific hardware thread: the OS will only schedule the OS thread onto the given CPU, but if the specified CPU is not free the OS thread will block. (A Linux pinning sketch follows below.)

Managing threads

The ideal situation is one OS thread per hardware thread, with fixed affinity, but this is difficult to manage by directly controlling threads: managing affinity is tricky, and it is easy to deadlock or under-utilise. The solution is a task-based scheduling interface:
- Create a workload of small tasks which are not bound to a thread
- The task scheduler gives tasks to OS threads as they become idle
- It manages affinity so that tasks tend to stay on the same thread
Why does work-stealing work so well?

Data movement and communication

Moving data between cache levels is implicit communication; we will shortly see explicit communication with GPUs. Communication within a CPU has significant costs:
- Latency: how long a CPU must wait before getting data
- Bandwidth: the sustained transfer rate between memory and CPU
- Energy: moving bits around a chip takes lots of energy
Organise computation to minimise communication: avoid shared read-write memory regions and atomic operations; work in task-private variables and only merge results at the end.
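A Linux-specific C++ sketch of giving the current OS thread affinity to one hardware thread; pthread_setaffinity_np is non-portable (hence the _np suffix), and in practice a task scheduler such as TBB would normally manage placement for you:

    #ifndef _GNU_SOURCE
    #define _GNU_SOURCE          // for cpu_set_t, sched_getcpu on glibc
    #endif
    #include <pthread.h>
    #include <sched.h>
    #include <cstdio>

    int main() {
        cpu_set_t set;
        CPU_ZERO(&set);
        CPU_SET(0, &set);        // permit only hardware thread 0
        // Pin the calling thread; the OS will now only schedule it onto CPU 0.
        if (pthread_setaffinity_np(pthread_self(), sizeof(set), &set) != 0)
            return 1;
        std::printf("pinned; now running on CPU %d\n", sched_getcpu());
        return 0;
    }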

What about actual computation?

Sometimes moving data around is the computation: graph analysis algorithms involve very little actual calculation. More often, communication simply enables calculation:
- Numeric computation: move data to the ALUs to do maths
- Compilation/analysis: move data to the ALUs to control branching
Assume data is already close to the ALUs: what metrics should we be trying to optimise? Often the metrics are instruction throughput and cost of work. Throughput: instructions per second per what? Cost of work: cost of what per unit of work? (Source: Borkar et al.)

Evolution of metrics

- 60s-70s: very little silicon, so let's just make it fit. Metric: operations / sec / area. Re-use components for maximum functionality within a fixed area.
- 80s-90s: Moore's Law is king; burn area and power for speed. Metric: operations / sec / dollar. How much does a chip cost versus the throughput it achieves? Huge amounts of area are spent on instruction scheduling.
- 00s: Moore's Law still working; cooling is an issue. Metric: operations / sec / (joule or dollar). Silicon is getting cheap, power is getting expensive.
- 10s+: Moore's Law looking dubious; power is a massive issue. Metric: operations / sec / joule.

Optimising for cost of work

Cost of work applies to both sequential and parallel code: for sequential code, reduce the overhead per sequential operation; parallel code eventually reduces to sequential code.
- Super-scalar: optimise for cost of work in cycles. A standard CPU issues roughly one instruction per cycle; a super-scalar CPU uses fancy logic to issue many instructions per cycle, which requires large amounts of area and power that are always active.
- SIMD and VLIW: optimise for cost of work in both cycles and power. SIMD: Single Instruction Multiple Data; VLIW: Very Long Instruction Word. They add additional functional units which only cost power when they are used (see the SIMD sketch below).
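A small C++ sketch of the SIMD idea using x86 SSE intrinsics (a platform-specific assumption): one instruction drives four floating-point adds, so the extra functional units only consume power on the cycles where they do useful work:

    #include <immintrin.h>   // x86 SSE intrinsics
    #include <cstdio>

    int main() {
        alignas(16) float a[4] = {1, 2, 3, 4};
        alignas(16) float b[4] = {10, 20, 30, 40};
        alignas(16) float c[4];

        __m128 va = _mm_load_ps(a);         // load 4 floats
        __m128 vb = _mm_load_ps(b);
        __m128 vc = _mm_add_ps(va, vb);     // 4 additions in a single instruction
        _mm_store_ps(c, vc);

        std::printf("%g %g %g %g\n", c[0], c[1], c[2], c[3]);   // 11 22 33 44
        return 0;
    }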
