The Vector-Thread Architecture Ronny Krashinsky, Christopher Batten, Mark Hampton, Steve Gerding, Brian Pharris, Jared Casper, Krste Asanovic
1 The Vector-Thread Architecture Ronny Krashinsky, Christopher Batten, Mark Hampton, Steve Gerding, Brian Pharris, Jared Casper, Krste Asanović MIT Computer Science and Artificial Intelligence Laboratory, Cambridge, MA, USA ISCA 2004
2 Goals for the Vector-Thread Architecture
- Primary goal is efficiency: high performance with low energy and small area
- Take advantage of whatever parallelism and locality is available: DLP, TLP, ILP; allow intermixing of multiple levels of parallelism
- The programming model is key: encode parallelism and locality in a way that enables a complexity-effective implementation, and provide clean abstractions to simplify coding and compilation
3 Vector and Multithreaded Architectures
(figures: a control processor issuing vector control to PE0..PEN over memory, vs. PEs with independent thread control)
- Vector processors provide efficient DLP execution: they amortize instruction control, amortize loop bookkeeping overhead, and exploit structured memory accesses
- But they are unable to execute loops with loop-carried dependencies or complex internal control flow
- Multithreaded processors can flexibly exploit TLP
- But they are unable to amortize common control overhead across threads, unable to exploit structured memory accesses across threads, and require costly memory-based synchronization and communication between threads
4 Vector-Thread Architecture
- VT unifies the vector and multithreaded compute models
- A control processor interacts with a vector of virtual processors (VPs)
- Vector-fetch: the control processor fetches instructions for all VPs in parallel
- Thread-fetch: a VP fetches its own instructions
- VT allows a seamless intermixing of vector and thread control
(figure: control processor issuing vector-fetches to VP0..VPN over memory, with each VP able to thread-fetch)
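The vector-fetch/thread-fetch split above can be sketched as a small software model. This is an illustrative encoding only, not the VT ISA: the class names are my own, and an AIB is modeled as a function over a VP's registers that returns the next AIB to thread-fetch, or None to stop.

```python
# Illustrative model of VT control flow (names and encoding are assumptions).

class VirtualProcessor:
    def __init__(self):
        self.regs = {}

    def execute(self, aib):
        # A VP has no automatic PC: it runs AIBs only as long as each
        # executed AIB thread-fetches a successor.
        while aib is not None:
            aib = aib(self.regs)

class ControlProcessor:
    def __init__(self, num_vps):
        self.vps = [VirtualProcessor() for _ in range(num_vps)]

    def vector_fetch(self, aib):
        # Vector-fetch: every VP in the vector executes the same AIB.
        for vp in self.vps:
            vp.execute(aib)

def double_aib(regs):
    regs["r0"] = regs.get("r0", 1) * 2
    return None  # no fetch executed, so the VP thread stops

cp = ControlProcessor(4)
cp.vector_fetch(double_aib)
```

A conditional branch would be modeled by `double_aib` returning a different AIB function depending on the register contents, mirroring a predicated fetch instruction.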
5 Outline
- Vector-Thread Architectural Paradigm: abstract model, physical model
- SCALE VT Processor
- Evaluation
- Related Work
6 Virtual Processor Abstraction
- VPs contain a set of registers
- VPs execute RISC-like instructions grouped into atomic instruction blocks (AIBs)
- VPs have no automatic program counter; AIBs must be explicitly fetched
- VPs contain pending vector-fetch and thread-fetch addresses
- A fetch instruction allows a VP to fetch its own AIB, and may be predicated for a conditional branch
- If an AIB does not execute a fetch, the VP thread stops
(figure: a VP's registers fed by vector-fetch and thread-fetch; thread execution as a chain of AIB fetches)
7 Virtual Processor Vector
- A VT architecture includes a control processor and a virtual processor vector: two interacting instruction sets
- A vector-fetch command allows the control processor to fetch an AIB for all the VPs in parallel
- Vector-load and vector-store commands transfer blocks of data between memory and the VP registers
(figure: control processor issuing vector-fetch, vector-load, and vector-store commands to VP0..VPN through a vector memory unit)
8 Cross-VP Data Transfers
- Cross-VP connections provide fine-grain data operand communication and synchronization
- VP instructions may target nextVP as a destination or use prevVP as a source
- The crossVP queue holds wrap-around data; the control processor can push and pop
- The restricted ring communication pattern is cheap to implement, scalable, and matches the software usage model for VPs
(figure: control processor crossVP-push and crossVP-pop at the ends of a ring through VP0..VPN and the crossVP queue)
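The ring pattern described above can be sketched as a software model of a running sum, a classic loop-carried dependence. The queue-based encoding is my own illustration, not SCALE hardware: each VP receives from prevVP, adds its element, and sends to nextVP, while the control processor seeds and drains the crossVP queue.

```python
from collections import deque

def crossvp_scan(inputs, seed):
    """Model of a vector-fetched AIB chain using prevVP/nextVP transfers.
    One VP per loop iteration; the control processor pushes the seed and
    pops the final wrapped-around value."""
    queue = deque([seed])               # control processor push
    results = []
    for x in inputs:                    # VPs execute in ring order
        acc = queue.popleft() + x       # prevvp receive
        results.append(acc)
        queue.append(acc)               # nextvp send
    return results, queue.popleft()     # control processor pop

partials, final = crossvp_scan([1, 2, 3, 4], 0)
```

In hardware the sends and receives of different iterations overlap (slide 16's dynamic scheduling); this sequential model only shows the dataflow, not the timing.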
9 Mapping Loops to VT
- A broad class of loops map naturally to VT: vectorizable loops, loops with loop-carried dependencies, and loops with internal control flow
- Each VP executes one loop iteration; the control processor manages the execution
- Stripmining enables implementation-dependent vector lengths
- The programmer or compiler only schedules one loop iteration on one VP: no cross-iteration scheduling
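Stripmining as described above can be sketched as follows. This is a behavioral model, not SCALE code: `hw_vlen` stands in for the implementation-dependent number of VPs, and the per-element kernel stands in for the AIB mapped onto one VP.

```python
def stripmine(data, hw_vlen, vp_kernel):
    # Control-processor loop: each pass vector-loads one strip of up to
    # hw_vlen elements, vector-fetches vp_kernel onto one VP per element,
    # and collects the vector-stored results. Software never hard-codes
    # the hardware vector length, so the same loop runs on any lane count.
    out = []
    for base in range(0, len(data), hw_vlen):
        strip = data[base:base + hw_vlen]        # vector-load
        out.extend(vp_kernel(x) for x in strip)  # vector-fetch
    return out                                   # vector-store results

result = stripmine(list(range(10)), 4, lambda x: 2 * x)
```

Note the final strip may be shorter than `hw_vlen`; in VT this corresponds to the control processor setting a smaller vector length for the last pass.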
10 Vectorizable Loops
- Data-parallel loops with no internal control flow are mapped using vector commands, with predication for small conditionals
(figure: a loop-iteration DAG of shift, multiply, and add operations replicated across VP0..VPN, driven by vector-loads, a vector-fetch, and a vector-store)
11 Loop-Carried Dependencies
- Loops with cross-iteration dependencies are mapped using vector commands with cross-VP data transfers
- A vector-fetch introduces a chain of prevVP receives and nextVP sends
- Vector-memory commands can still be used with non-vectorizable compute
(figure: the shift/multiply/add DAG chained across VP0..VPN through cross-VP transfers)
12 Loops with Internal Control Flow
- Data-parallel loops with large conditionals or inner loops are mapped using thread-fetches
- Vector commands and thread-fetches are freely intermixed
- Once launched, the VP threads execute to completion before the next control processor command
(figure: compare-and-branch blocks thread-fetched a varying number of times on each of VP0..VPN)
13 VT Physical Model
- A vector-thread unit contains an array of lanes with physical register files and execution units
- VPs map to lanes and share physical resources; VP execution is time-multiplexed on the lanes
- Independent parallel lanes exploit parallelism across VPs and data operand locality within VPs
(figure: control processor and vector memory unit feeding lanes 0-3, with VP0,4,8,12 on lane 0, VP1,5,9,13 on lane 1, and so on)
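The VP-to-lane striping shown in the figure can be captured in a few lines. The round-robin mapping matches the slide's figure; the in-lane order is only illustrative, since the actual interleaving within a lane is scheduled dynamically (slide 15).

```python
def map_vps_to_lanes(num_vps, num_lanes=4):
    # VPs are striped across lanes (VP number modulo lane count) and
    # time-multiplexed within each lane. Listing them in VP order here
    # is a simplification of the dynamic hardware schedule.
    lanes = [[] for _ in range(num_lanes)]
    for vp in range(num_vps):
        lanes[vp % num_lanes].append(vp)
    return lanes

lanes = map_vps_to_lanes(16)   # the 16-VP configuration in the figure
```

Striping keeps a VP's operands local to one lane, which is what lets each lane exploit data operand locality independently.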
14 Lane Execution
- Lanes execute decoupled from each other
- A command management unit handles vector-fetch and thread-fetch commands
- The execution cluster executes instructions in order from a small AIB cache (e.g. 32 instructions)
- AIB caches exploit locality to reduce instruction fetch energy (on par with a register read)
- Execute directives point to AIBs and indicate which VP(s) the AIB should be executed for: for a thread-fetch command, the lane executes the AIB for the requesting VP; for a vector-fetch command, the lane executes the AIB for every VP
- AIBs and vector-fetch commands reduce control overhead: 10s to 100s of instructions execute per fetch address tag-check, even for non-vectorizable loops
(figure: lane 0's AIB cache with tags, fill unit, and execute directives serving VP0,4,8,12)
15 VP Execution Interleaving
- Hardware provides the benefits of loop unrolling by interleaving VPs
- Time-multiplexing can hide thread-fetch, memory, and functional unit latencies
(figure: AIBs from vector-fetches and thread-fetches time-multiplexed across the VPs on lanes 0-3 over time)
16 VP Execution Interleaving
- Dynamic scheduling of cross-VP data transfers automatically adapts to the software critical path (in contrast to static software pipelining)
- No static cross-iteration scheduling; tolerant of variable dynamic latencies
(figure: cross-VP chains overlapping across lanes 0-3 over time)
17 SCALE Vector-Thread Processor
- SCALE is designed to be a complexity-effective all-purpose embedded processor
- Exploit all available forms of parallelism and locality to achieve high performance and low energy
- Constrained to small area (estimated 10 mm² in 0.18 µm); reduce wire delay and complexity; support tiling of multiple SCALE processors for increased throughput
- Careful balance between software and hardware for code mapping and scheduling
- Optimize runtime energy, area efficiency, and performance while maintaining a clean, scalable programming model
18 SCALE Clusters
- VPs are partitioned into four clusters to exploit ILP and allow lane implementations to optimize area, energy, and circuit delay
- Clusters are heterogeneous: c0 can execute loads and stores, c1 can execute fetches, c3 has an integer multiplier/divider
- Clusters execute decoupled from each other
(figure: control processor, AIB fill unit, and L1 cache serving four lanes, each with clusters c0-c3)
19 SCALE Registers and VP Configuration
- Atomic instruction blocks allow VPs to share temporary state that is only valid within the AIB
- VP general registers are divided into private and shared
- Chain registers at cluster inputs avoid reading and writing the general register file, to save energy
- The number of VP registers in each cluster is configurable: the hardware can support more VPs when they each have fewer private registers
- Low overhead: a control processor instruction configures the VPs before entering a stripmine loop; VP state is undefined across reconfigurations
- Example configurations: 4 VPs with 0 shared and 8 private registers; 7 VPs with 4 shared and 4 private registers; 25 VPs with 7 shared and 1 private register
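The three example configurations above are consistent with the 32-register-per-cluster file listed on slide 22. The formula below is my inference from the slide's numbers, not something the slides state explicitly:

```python
def vps_per_lane(private_regs, shared_regs, regfile_size=32):
    """VP count one cluster's register file can support for a given
    private/shared split: the shared registers are allocated once,
    and the remainder is divided into per-VP private sets.
    regfile_size=32 matches SCALE's 32 registers/cluster."""
    return (regfile_size - shared_regs) // private_regs

# Reproduce the slide's three configurations:
configs = [vps_per_lane(8, 0), vps_per_lane(4, 4), vps_per_lane(1, 7)]
```

With 4 lanes, the 1-private-register case also explains the 128-VP maximum on slide 22 (4 lanes x 32 VPs with no shared registers).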
20 SCALE Micro-Ops
- The assembler translates the portable software ISA into hardware micro-ops
- Per-cluster micro-op bundles access local registers only
- Inter-cluster data transfers are broken into transports and writebacks
- Software VP code (cluster: operation -> destinations):
    c0: xor pr0, pr1 -> c1/cr0, c2/cr0
    c1: sll cr0, 2 -> c2/cr1
    c2: add cr0, cr1 -> pr0
- Hardware micro-op bundles (writeback / compute / transport; cluster 3 not shown):
    Cluster 0: compute xor pr0,pr1; transport -> c1, c2
    Cluster 1: writeback c0 -> cr0; compute sll cr0,2; transport -> c2
    Cluster 2: writeback c0 -> cr0; writeback c1 -> cr1; compute add cr0,cr1 -> pr0
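The translation above follows a mechanical pattern that can be sketched in a few lines. This is a simplified model of the idea, not the SCALE assembler: it ignores cluster 3, chain-register allocation, and micro-op scheduling, and the tuple encoding is my own.

```python
def to_micro_ops(vp_code):
    """Split software VP instructions into per-cluster bundles: each
    instruction becomes a compute micro-op on its own cluster, plus a
    transport on the sender and a writeback on each remote destination
    cluster, so every bundle touches only local registers."""
    bundles = {}
    for cluster, op, dests in vp_code:
        bundles.setdefault(cluster, []).append(("compute", op))
        remote = [(dc, dr) for dc, dr in dests if dc != cluster]
        if remote:
            bundles[cluster].append(("transport", [dc for dc, _ in remote]))
        for dc, dr in remote:
            bundles.setdefault(dc, []).append(("writeback", (cluster, dr)))
    return bundles

# The example from this slide:
bundles = to_micro_ops([
    (0, "xor pr0,pr1", [(1, "cr0"), (2, "cr0")]),
    (1, "sll cr0,2", [(2, "cr1")]),
    (2, "add cr0,cr1 -> pr0", [(2, "pr0")]),
])
```

Run on the slide's example, this yields the same three bundles shown above: cluster 0 gets compute + transport, cluster 1 gets writeback + compute + transport, and cluster 2 gets two writebacks + compute.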
21 SCALE Cluster Decoupling
- Cluster execution is decoupled: cluster AIB caches hold micro-op bundles, and each cluster has its own execute-directive queue and local control
- Inter-cluster data transfers synchronize with handshake signals
- Memory access decoupling (see paper): a load-data queue enables continued execution after a cache miss, and a decoupled store queue enables loads to slip ahead of stores
(figure: per-cluster AIB caches with compute, writeback, and transport stages)
22 SCALE Prototype and Simulator
- A prototype SCALE processor is in development
- Control processor: MIPS, 1 instruction/cycle
- VTU: 4 lanes, 4 clusters/lane, 32 registers/cluster, 128 VPs max
- Primary I/D cache: 32 KB, 4x128b per cycle, non-blocking
- DRAM: 64b, 200 MHz DDR2 (64b at 400 Mb/s: 3.2 GB/s)
- Estimated 10 mm² in 0.18 µm at 400 MHz (25 FO4)
- Cycle-level execution-driven microarchitectural simulator with a detailed VTU and memory system model
(figure: 2.5 mm x 4 mm floorplan with four 8 KB cache banks, crossbar, memory unit/cache control, control processor, and the lanes with load-data queues and multiply/divide unit)
23 Benchmarks
- A diverse set of 22 benchmarks chosen to evaluate a range of applications with different types of parallelism: 16 from EEMBC, 6 from MiBench, Mediabench, and SpecInt
- Hand-written VP assembly code linked with C code compiled for the control processor using gcc; this reflects the typical usage model for embedded processors
- EEMBC enables comparison to other processors running hand-optimized code, but it is not an ideal comparison: performance depends greatly on programmer effort; algorithmic changes are allowed for some benchmarks and are often unpublished; performance depends greatly on special compute instructions or sub-word SIMD operations (for the current evaluation, SCALE does not provide these); and processors use unequal silicon area, process technologies, and circuit styles
- Overall results: SCALE is competitive with larger and more complex processors on a wide range of codes from different domains (see paper for detailed results)
- Results are a snapshot; the SCALE microarchitecture and benchmark mappings continue to improve
24 Mapping Parallelism to SCALE
- Data-parallel loops with no complex control flow: use vector-fetch and vector-memory commands; EEMBC rgbcmy, rgbyiq, and hpg execute 6-10 ops/cycle for 12x-32x speedup over the control processor, and performance scales with the number of lanes
- Loops with loop-carried dependencies: use vector-fetched AIBs with cross-VP data transfers; in Mediabench adpcm.dec, two loop-carried dependencies propagate in parallel and on average 7 loop iterations execute in parallel, for an 8x speedup; MiBench sha has 5 loop-carried dependencies and exploits ILP
(figure: speedup vs. the control processor for rgbcmy, rgbyiq, hpg, adpcm.dec, and sha with 1, 2, 4, and 8 lanes, plus the prototype configuration)
25 Mapping Parallelism to SCALE
- Data-parallel loops with large conditionals: use vector-fetched AIBs with conditional thread-fetches; in EEMBC dither, a special case for white pixels takes 18 ops vs. 49
- Data-parallel loops with inner loops: use vector-fetched AIBs with thread-fetches for the inner loop; in EEMBC lookup, VPs execute pointer-chasing IP address lookups in a routing table
- Free-running threads: no control processor interaction; VP worker threads get tasks from a shared queue using atomic memory operations; EEMBC pntrch and MiBench qsort achieve significant speedups
(figure: speedup vs. the control processor for dither, lookup, pntrch, and qsort with 1, 2, 4, and 8 lanes)
26 Comparison to Related Work
- TRIPS and Smart Memories can also exploit multiple types of parallelism, but must explicitly switch modes
- Raw's tiled processors provide much lower compute density than SCALE's clusters, which factor out instruction and data overheads and use direct communication instead of programmed switches
- Multiscalar passes loop-carried register dependencies around a ring network, but it focuses on speculative execution and memory, whereas VT uses simple logic to support common loop types
- Imagine organizes computation into kernels to improve register file locality, but it exposes machine details with a low-level VLIW ISA, in contrast to VT's VP abstraction and AIBs
- CODE uses register renaming to hide its decoupled clusters from software, whereas SCALE simplifies hardware by exposing clusters and statically partitioning inter-cluster transport and writeback ops
- Jesshope's micro-threading is similar in spirit to VT, but its threads are dynamically scheduled and cross-iteration synchronization uses full/empty bits on global registers
27 Summary
- The vector-thread architectural paradigm unifies the vector and multithreaded compute models
- The VT abstraction introduces a small set of primitives that allow software to succinctly encode parallelism and locality and seamlessly intermingle DLP, TLP, and ILP: virtual processors, AIBs, vector-fetch and vector-memory commands, thread-fetches, and cross-VP data transfers
- The SCALE VT processor efficiently achieves high performance on a wide range of embedded applications
The Vector-Thread Architecture: the vector-thread (VT) architecture supports a seamless intermingling of vector and multithreaded computation to flexibly and compactly encode application parallelism and locality, with low power and small area.
CS 152 Computer Architecture and Engineering Lecture 18: Multithreading Krste Asanovic Electrical Engineering and Computer Sciences University of California, Berkeley http://www.eecs.berkeley.edu/~krste
More informationThe Processor: Instruction-Level Parallelism
The Processor: Instruction-Level Parallelism Computer Organization Architectures for Embedded Computing Tuesday 21 October 14 Many slides adapted from: Computer Organization and Design, Patterson & Hennessy
More informationCore Fusion: Accommodating Software Diversity in Chip Multiprocessors
Core Fusion: Accommodating Software Diversity in Chip Multiprocessors Authors: Engin Ipek, Meyrem Kırman, Nevin Kırman, and Jose F. Martinez Navreet Virk Dept of Computer & Information Sciences University
More informationCS 351 Final Exam Solutions
CS 351 Final Exam Solutions Notes: You must explain your answers to receive partial credit. You will lose points for incorrect extraneous information, even if the answer is otherwise correct. Question
More informationCPI < 1? How? What if dynamic branch prediction is wrong? Multiple issue processors: Speculative Tomasulo Processor
1 CPI < 1? How? From Single-Issue to: AKS Scalar Processors Multiple issue processors: VLIW (Very Long Instruction Word) Superscalar processors No ISA Support Needed ISA Support Needed 2 What if dynamic
More informationDynamic Control Hazard Avoidance
Dynamic Control Hazard Avoidance Consider Effects of Increasing the ILP Control dependencies rapidly become the limiting factor they tend to not get optimized by the compiler more instructions/sec ==>
More informationAdvanced Computer Architecture
ECE 563 Advanced Computer Architecture Fall 2010 Lecture 6: VLIW 563 L06.1 Fall 2010 Little s Law Number of Instructions in the pipeline (parallelism) = Throughput * Latency or N T L Throughput per Cycle
More informationProblem M5.1: VLIW Programming
Problem M5.1: VLIW Programming Last updated: Ben Bitdiddle and Louis Reasoner have started a new company called Transbeta and are designing a new processor named Titanium. The Titanium processor is a single-issue
More informationVLIW/EPIC: Statically Scheduled ILP
6.823, L21-1 VLIW/EPIC: Statically Scheduled ILP Computer Science & Artificial Intelligence Laboratory Massachusetts Institute of Technology Based on the material prepared by Krste Asanovic and Arvind
More informationTDT Coarse-Grained Multithreading. Review on ILP. Multi-threaded execution. Contents. Fine-Grained Multithreading
Review on ILP TDT 4260 Chap 5 TLP & Hierarchy What is ILP? Let the compiler find the ILP Advantages? Disadvantages? Let the HW find the ILP Advantages? Disadvantages? Contents Multi-threading Chap 3.5
More informationCS 152 Computer Architecture and Engineering. Lecture 12 - Advanced Out-of-Order Superscalars
CS 152 Computer Architecture and Engineering Lecture 12 - Advanced Out-of-Order Superscalars Dr. George Michelogiannakis EECS, University of California at Berkeley CRD, Lawrence Berkeley National Laboratory
More informationThe Bifrost GPU architecture and the ARM Mali-G71 GPU
The Bifrost GPU architecture and the ARM Mali-G71 GPU Jem Davies ARM Fellow and VP of Technology Hot Chips 28 Aug 2016 Introduction to ARM Soft IP ARM licenses Soft IP cores (amongst other things) to our
More informationLecture 14: Multithreading
CS 152 Computer Architecture and Engineering Lecture 14: Multithreading John Wawrzynek Electrical Engineering and Computer Sciences University of California, Berkeley http://www.eecs.berkeley.edu/~johnw
More informationData-Level Parallelism in SIMD and Vector Architectures. Advanced Computer Architectures, Laura Pozzi & Cristina Silvano
Data-Level Parallelism in SIMD and Vector Architectures Advanced Computer Architectures, Laura Pozzi & Cristina Silvano 1 Current Trends in Architecture Cannot continue to leverage Instruction-Level parallelism
More informationMultithreaded Processors. Department of Electrical Engineering Stanford University
Lecture 12: Multithreaded Processors Department of Electrical Engineering Stanford University http://eeclass.stanford.edu/ee382a Lecture 12-1 The Big Picture Previous lectures: Core design for single-thread
More informationSimultaneous Multithreading: a Platform for Next Generation Processors
Simultaneous Multithreading: a Platform for Next Generation Processors Paulo Alexandre Vilarinho Assis Departamento de Informática, Universidade do Minho 4710 057 Braga, Portugal paulo.assis@bragatel.pt
More informationParallel Processing SIMD, Vector and GPU s cont.
Parallel Processing SIMD, Vector and GPU s cont. EECS4201 Fall 2016 York University 1 Multithreading First, we start with multithreading Multithreading is used in GPU s 2 1 Thread Level Parallelism ILP
More informationModern Processor Architectures. L25: Modern Compiler Design
Modern Processor Architectures L25: Modern Compiler Design The 1960s - 1970s Instructions took multiple cycles Only one instruction in flight at once Optimisation meant minimising the number of instructions
More informationInstruction-Level Parallelism and Its Exploitation (Part III) ECE 154B Dmitri Strukov
Instruction-Level Parallelism and Its Exploitation (Part III) ECE 154B Dmitri Strukov Dealing With Control Hazards Simplest solution to stall pipeline until branch is resolved and target address is calculated
More informationCOMPUTER ORGANIZATION AND DESI
COMPUTER ORGANIZATION AND DESIGN 5 Edition th The Hardware/Software Interface Chapter 4 The Processor 4.1 Introduction Introduction CPU performance factors Instruction count Determined by ISA and compiler
More informationProcessor Architectures At A Glance: M.I.T. Raw vs. UC Davis AsAP
Processor Architectures At A Glance: M.I.T. Raw vs. UC Davis AsAP Presenter: Course: EEC 289Q: Reconfigurable Computing Course Instructor: Professor Soheil Ghiasi Outline Overview of M.I.T. Raw processor
More informationComputer Architecture Computer Science & Engineering. Chapter 4. The Processor BK TP.HCM
Computer Architecture Computer Science & Engineering Chapter 4 The Processor Introduction CPU performance factors Instruction count Determined by ISA and compiler CPI and Cycle time Determined by CPU hardware
More informationECE 4750 Computer Architecture, Fall 2018 T15 Advanced Processors: VLIW Processors
ECE 4750 Computer Architecture, Fall 2018 T15 Advanced Processors: VLIW Processors School of Electrical and Computer Engineering Cornell University revision: 2018-11-28-13-01 1 Motivating VLIW Processors
More informationCPI IPC. 1 - One At Best 1 - One At best. Multiple issue processors: VLIW (Very Long Instruction Word) Speculative Tomasulo Processor
Single-Issue Processor (AKA Scalar Processor) CPI IPC 1 - One At Best 1 - One At best 1 From Single-Issue to: AKS Scalar Processors CPI < 1? How? Multiple issue processors: VLIW (Very Long Instruction
More informationUNIT 8 1. Explain in detail the hardware support for preserving exception behavior during Speculation.
UNIT 8 1. Explain in detail the hardware support for preserving exception behavior during Speculation. July 14) (June 2013) (June 2015)(Jan 2016)(June 2016) H/W Support : Conditional Execution Also known
More informationChapter 4 The Processor (Part 4)
Department of Electr rical Eng ineering, Chapter 4 The Processor (Part 4) 王振傑 (Chen-Chieh Wang) ccwang@mail.ee.ncku.edu.tw ncku edu Depar rtment of Electr rical Engineering, Feng-Chia Unive ersity Outline
More informationCS 654 Computer Architecture Summary. Peter Kemper
CS 654 Computer Architecture Summary Peter Kemper Chapters in Hennessy & Patterson Ch 1: Fundamentals Ch 2: Instruction Level Parallelism Ch 3: Limits on ILP Ch 4: Multiprocessors & TLP Ap A: Pipelining
More informationEITF20: Computer Architecture Part2.1.1: Instruction Set Architecture
EITF20: Computer Architecture Part2.1.1: Instruction Set Architecture Liang Liu liang.liu@eit.lth.se 1 Outline Reiteration Instruction Set Principles The Role of Compilers MIPS 2 Main Content Computer
More informationMultithreading: Exploiting Thread-Level Parallelism within a Processor
Multithreading: Exploiting Thread-Level Parallelism within a Processor Instruction-Level Parallelism (ILP): What we ve seen so far Wrap-up on multiple issue machines Beyond ILP Multithreading Advanced
More informationReal Processors. Lecture for CPSC 5155 Edward Bosworth, Ph.D. Computer Science Department Columbus State University
Real Processors Lecture for CPSC 5155 Edward Bosworth, Ph.D. Computer Science Department Columbus State University Instruction-Level Parallelism (ILP) Pipelining: executing multiple instructions in parallel
More information