Iuliana Bacivarov, Wolfgang Haid, Kai Huang, Lars Schor, and Lothar Thiele
ETH Zurich, Switzerland

Efficient Execution of KPN on MPSoC
- Efficiency regarding
  - speed-up
  - small memory footprint
  - portability
- Distributed Operation Layer (DOL): efficient MPSoC design flow based on KPN
- Highlight: software synthesis on CELL BE

Outline
- Introduction to Distributed Operation Layer
- Efficient MPSoC design in DOL
- Runtime environment
- Software synthesis
- Some experimental results on CELL BE

DOL Vision: Write Once, Run Anywhere
Single-processor systems
- Instruction-level parallelism, sequential
- Use C/C++ and a platform-dependent compiler
(Heterogeneous) multi-processor systems
- Task-level parallelism
- Mapping: binding and scheduling
- Execution: distributed computation and communication

Streaming Applications
Application domains
- Consumer electronics
- Communication systems
- Medical systems, etc.
Real-time applications
- (Array) signal processing
- Audio & video (de)coding
- High-performance computing

Applications Specified as KPN
- Dataflow semantics
- Matches structure of many streaming applications
- Separates computation and communication
- Enables design automation
- Untimed model of computation
- Facilitates implementation on MPSoCs

Iuliana Bacivarov, Efficient Execution of KPNs on CELL

Distributed Memory Architectures
- IBM/Sony/Toshiba CELL BE: PowerPC and 8 SPEs connected via ring bus
- Problem: efficient programming, no parallelism on SPEs

Goal: Efficient Execution of KPNs on MPSoCs

DOL Software Design Flow
Goals
- Efficiency
- Predictability
- Portability
Challenges
- Scalable specification
- Automated synthesis
- Design space exploration
- System-level performance analysis
Strengths
- Abstraction
- Automation

DOL Vision: Write Once, Run Anywhere

DOL Synthesis on Cell
Goal
- Efficient execution on MPSoC
Key issues
- Mapping optimization
- Efficient runtime environment
- Automatic software synthesis

Outline
- Introduction to Distributed Operation Layer
- Efficient MPSoC design in DOL
- Runtime environment
- Software synthesis
- Some experimental results on Cell BE

Runtime Environment Requirements
Required
- Quasi-parallelism on each processor: protothreads
- Intra- and inter-processor communication: windowed FIFOs
Not needed
- Global scheduler (apply local, data-driven execution)
- Full preemption (apply cooperative scheduling)

FIFO Communication

Standard FIFO

    float i;
    read(port_in, &i, sizeof(float));
    i = i * i;
    write(port_out, &i, sizeof(float));

Windowed FIFO

    float *i, *j;
    capture(port_in, &i, sizeof(float));
    reserve(port_out, &j, sizeof(float));
    *j = *i * *i;
    consume(port_in);
    release(port_out);

Remarks
- Windowed FIFOs preserve Kahn semantics [Huang, ASAP07]
- Inter-processor windowed FIFO (using DMA) is the only platform-dependent part of the runtime environment

Windowed FIFO Implementation
[Figure: intra-processor WFIFO and inter-processor WFIFO, showing the write window and the read window]
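To make the window idea concrete, here is a minimal sketch of an intra-processor windowed FIFO in C. It is not the DOL implementation. The type `wfifo_t`, the `wfifo_*` function names, and the pointer-returning signatures are assumptions chosen for illustration (the slides pass the window pointer as an output argument instead). The sketch also assumes that the capacity is a multiple of the single window size in use, so a window never wraps around the end of the buffer.

```c
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical intra-processor windowed FIFO (illustrative, not DOL code). */
typedef struct {
    unsigned char *buf;   /* heap storage, suitably aligned by malloc */
    size_t capacity;      /* assumed to be a multiple of the window size */
    size_t head, tail;    /* read and write offsets in bytes */
    size_t used;          /* committed bytes visible to the reader */
    size_t rd_win, wr_win;/* sizes of the open windows, 0 if closed */
} wfifo_t;

static int wfifo_init(wfifo_t *f, size_t capacity) {
    f->buf = malloc(capacity);
    f->capacity = capacity;
    f->head = f->tail = f->used = f->rd_win = f->wr_win = 0;
    return f->buf != NULL;
}

/* Open a read window; NULL means "not enough data yet".
   Calling again while a window is open returns the same window. */
static void *wfifo_capture(wfifo_t *f, size_t len) {
    if (f->rd_win == 0) {
        if (f->used < len) return NULL;
        f->rd_win = len;
    }
    return f->buf + f->head;
}

/* Close the read window and free its bytes for the writer. */
static void wfifo_consume(wfifo_t *f) {
    f->head = (f->head + f->rd_win) % f->capacity;
    f->used -= f->rd_win;
    f->rd_win = 0;
}

/* Open a write window; NULL means "not enough free space yet". */
static void *wfifo_reserve(wfifo_t *f, size_t len) {
    if (f->wr_win == 0) {
        if (f->capacity - f->used < len) return NULL;
        f->wr_win = len;
    }
    return f->buf + f->tail;
}

/* Close the write window and publish its bytes to the reader. */
static void wfifo_release(wfifo_t *f) {
    f->tail = (f->tail + f->wr_win) % f->capacity;
    f->used += f->wr_win;
    f->wr_win = 0;
}

/* Square one float in place: no copy into or out of a local variable. */
static int square_step(wfifo_t *in, wfifo_t *out) {
    float *i = wfifo_capture(in, sizeof(float));
    if (i == NULL) return 0;
    float *j = wfifo_reserve(out, sizeof(float));
    if (j == NULL) return 0;          /* read window stays open for the retry */
    *j = *i * *i;
    wfifo_consume(in);
    wfifo_release(out);
    return 1;
}

/* Push x through the square process and return the result. */
float wfifo_demo(float x) {
    wfifo_t in, out;
    float y = -1.0f;
    int ok_in = wfifo_init(&in, 4 * sizeof(float));
    int ok_out = wfifo_init(&out, 4 * sizeof(float));
    if (ok_in && ok_out) {
        float *w = wfifo_reserve(&in, sizeof(float));
        *w = x;                        /* producer writes directly into the FIFO */
        wfifo_release(&in);
        square_step(&in, &out);
        float *r = wfifo_capture(&out, sizeof(float));
        y = *r;                        /* consumer reads directly from the FIFO */
        wfifo_consume(&out);
    }
    free(in.buf);
    free(out.buf);
    return y;
}
```

Calling `wfifo_demo(3.0f)` from a `main` function returns `9.0f`. Because capture and reserve hand out pointers into the buffer, a large token is produced and consumed in place, which is one plausible reading of why windowed FIFOs pay off for large accesses.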

Quasi-Parallelism Using Threads
Components
- Program code & data
- Context: registers, PC, SP, etc.
- Stack
Limitations
- High context switch overhead
  - example: because of the SPE's register file, 4 kB must be copied to switch context
- Multiple stacks consume memory
- Implementation using assembler code

Protothreads [Adam Dunkels 2005]

    struct pt { unsigned short lc; };
    #define PT_BEGIN(pt)            switch(pt->lc){ case 0:
    #define PT_WAIT_UNTIL(pt, cond) pt->lc = __LINE__; case __LINE__: if(!(cond)) return 0
    #define PT_END(pt)              } pt->lc = 0; return 1

Source code (PT_WAIT_UNTIL sits on source line 4):

    int protothread(struct pt *pt) {
      PT_BEGIN(pt);
      PT_WAIT_UNTIL(pt, wfifo->capture(...));
      PT_END(pt);
    }

After the C preprocessor:

    int protothread(struct pt *pt) {
      switch(pt->lc){ case 0:
        pt->lc = 4; case 4:
        if(!wfifo->capture(...)) return 0;
      } pt->lc = 0; return 1;
    }
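The macros on this slide can be tried out in isolation. The following is a minimal sketch, assuming a plain integer flag `ready` in place of `wfifo->capture(...)`. The names `ready`, `resumed`, `waiter`, and `protothread_demo` are hypothetical. The macro arguments are parenthesized here so that any pointer expression can be passed; apart from that, the macros follow the slide.

```c
struct pt { unsigned short lc; };

#define PT_BEGIN(pt)            switch ((pt)->lc) { case 0:
#define PT_WAIT_UNTIL(pt, cond) (pt)->lc = __LINE__; case __LINE__: \
                                if (!(cond)) return 0
#define PT_END(pt)              } (pt)->lc = 0; return 1

static int ready = 0;    /* hypothetical condition, stands in for a FIFO call */
static int resumed = 0;  /* counts how often the code after the wait has run */

/* Returns 0 while blocked and 1 once it has run to completion. */
static int waiter(struct pt *pt) {
    PT_BEGIN(pt);
    PT_WAIT_UNTIL(pt, ready);  /* re-entered here on every later call */
    resumed++;
    PT_END(pt);
}

/* Polls the protothread and makes the condition true after three blocked calls.
   Returns the number of calls that found the protothread blocked. */
int protothread_demo(void) {
    struct pt pt = { 0 };
    int blocked_calls = 0;
    ready = 0;
    resumed = 0;
    while (!waiter(&pt)) {
        blocked_calls++;
        if (blocked_calls == 3) ready = 1;
    }
    return blocked_calls;
}
```

Each blocked call returns from the function, so the only state kept between calls is the two-byte `lc` field. There is no per-thread stack and no register save, which is the property the slides rely on. The flip side is that local variables do not survive a `PT_WAIT_UNTIL`, so persistent state has to live outside the function.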

Automated Software Synthesis

Process code written by the programmer:

    square_fire(localdata p){
      READ(PORT_IN,&(p->i),4,p);
      p->i = p->i * p->i;
      WRITE(PORT_OUT,&(p->i),4,p);
    }

Code generated by software synthesis:

    int main(){
      //init process network
      while(1){
        producer_fire(p_data);
        square_fire(s_data);
        consumer_fire(c_data);
      }
    }

    square_fire(localdata p){
      PT_BEGIN(p);
      PT_WAIT_UNTIL(p, p->fifo_in->read(&(p->i),4));
      p->i = p->i * p->i;
      PT_WAIT_UNTIL(p, p->fifo_out->write(&(p->i),4));
      PT_END(p);
    }
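The synthesized structure can be sketched as a small compilable program. This is an illustration under assumptions, not generated DOL output: `fifo_t`, `fifo_read`, `fifo_write`, the layout of `localdata`, and `run_network` are hypothetical, and a simple copying FIFO of floats stands in for the windowed FIFO. The bounded loop in `run_network` replaces the `while(1)` of the slide so that the example terminates.

```c
struct pt { unsigned short lc; };

#define PT_BEGIN(pt)            switch ((pt)->lc) { case 0:
#define PT_WAIT_UNTIL(pt, cond) (pt)->lc = __LINE__; case __LINE__: \
                                if (!(cond)) return 0
#define PT_END(pt)              } (pt)->lc = 0; return 1

#define FIFO_SLOTS 2

/* Hypothetical non-blocking FIFO of floats: calls return 1 on success, 0 otherwise. */
typedef struct { float data[FIFO_SLOTS]; int head, count; } fifo_t;

static int fifo_write(fifo_t *f, const float *src) {
    if (f->count == FIFO_SLOTS) return 0;              /* full: caller must retry */
    f->data[(f->head + f->count) % FIFO_SLOTS] = *src;
    f->count++;
    return 1;
}

static int fifo_read(fifo_t *f, float *dst) {
    if (f->count == 0) return 0;                       /* empty: caller must retry */
    *dst = f->data[f->head];
    f->head = (f->head + 1) % FIFO_SLOTS;
    f->count--;
    return 1;
}

/* Per-process state; it must live here because locals do not survive a wait. */
typedef struct {
    struct pt pt;
    fifo_t *fifo_in, *fifo_out;
    float i;
    float sum;      /* used by the consumer only */
    int received;   /* used by the consumer only */
} localdata;

static int producer_fire(localdata *p) {               /* emits 1, 2, 3, ... */
    PT_BEGIN(&p->pt);
    PT_WAIT_UNTIL(&p->pt, fifo_write(p->fifo_out, &p->i));
    p->i += 1.0f;
    PT_END(&p->pt);
}

static int square_fire(localdata *p) {
    PT_BEGIN(&p->pt);
    PT_WAIT_UNTIL(&p->pt, fifo_read(p->fifo_in, &p->i));
    p->i = p->i * p->i;
    PT_WAIT_UNTIL(&p->pt, fifo_write(p->fifo_out, &p->i));
    PT_END(&p->pt);
}

static int consumer_fire(localdata *p) {
    PT_BEGIN(&p->pt);
    PT_WAIT_UNTIL(&p->pt, fifo_read(p->fifo_in, &p->i));
    p->sum += p->i;
    p->received++;
    PT_END(&p->pt);
}

/* Cooperative, data-driven round-robin: a blocked process simply returns. */
float run_network(int n) {
    fifo_t a = { {0}, 0, 0 }, b = { {0}, 0, 0 };
    localdata p_data = {0}, s_data = {0}, c_data = {0};
    p_data.fifo_out = &a;  p_data.i = 1.0f;
    s_data.fifo_in = &a;   s_data.fifo_out = &b;
    c_data.fifo_in = &b;
    while (c_data.received < n) {
        producer_fire(&p_data);
        square_fire(&s_data);
        consumer_fire(&c_data);
    }
    return c_data.sum;     /* sum of the first n squares */
}
```

Calling `run_network(3)` returns `14.0f` (1 + 4 + 9). The loop needs no global scheduler and no preemption: each `*_fire` call either makes progress or returns at its wait point, which matches the local, data-driven execution named in the runtime requirements.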

Outline
- Introduction to Distributed Operation Layer
- Efficient MPSoC design in DOL
- Runtime environment
- Software synthesis
- Some experimental results on Cell BE

Different Thread/FIFO Implementations
[Figure: execution time [clock cycles] of context switch, WFIFO access, FIFO access (4 bytes), and FIFO access (4096 bytes) for stack-less threads (protothreads), user-space threads (YAPI), user-space threads (SystemC), and kernel-space threads (pthreads); one entry is marked not applicable]
- Protothreads introduce the smallest context switching overhead
  - 8x - 18x faster w.r.t. user-space threads
  - 200x faster w.r.t. kernel-space threads
- Windowed FIFO is considerably more efficient for large accesses
- Protothreads are efficient: context switch duration ~ 300 cycles and wfifo access ~ 150 cycles

Context-Switch/FIFOs on CELL

MJPEG Decoder Mapping on Cell

MJPEG Decoder with Different Granularities
[Figure: time to decode 3100 JPEG frames (320x240 pixels). Axes: execution time [s] and speed-up [1]. Series: coarse-grained version, fine-grained version, speedup (coarse-grained version). Configurations: PPE only, 1 SPE, 2 SPEs, 3 SPEs, 4 SPEs, 5 SPEs, 6 SPEs]

Summary
- Efficient execution of KPN on MPSoC
- Run-time environment based on protothreads and windowed FIFOs
  - Low run-time overhead
  - Small memory footprint
  - Easily portable
- Automated software synthesis
- Implementation transparent to programmer
- Available online

Demo: Finding Nemo on Cell BE
- PowerPC + 6 SPEs
- PowerPC only
