Ascenium: A Continuously Reconfigurable Architecture. Robert Mykland Founder/CTO August, 2005


1 Ascenium: A Continuously Reconfigurable Architecture Robert Mykland Founder/CTO robert@ascenium.com August, 2005

2 Ascenium: A Continuously Reconfigurable Processor
Continuously reconfigurable approach provides:
-- The computational efficiency of direct logic implementations in ASICs
-- All the flexibility of microprocessors (pure software)
Unique architecture enables the array to be effectively targeted by an ANSI Standard C compiler
-- CPU logic is continuously redefined to match the compiler's needs for each instruction ( 20 cycles)
-- Massive parallelism even in programs with little or no inherent instruction-level parallelism
-- Dozens of lines of sequential C code are executed every instruction

3 Ascenium Block Diagram
-- No instruction set
-- RLU is combinatorial logic that settles
-- Memory is accessed in statically scheduled block operations
-- Two instruction streams: 1) Memory 2) Configuration
-- Whole loops are often executed as one RLU instruction
[Block diagram: Reconfigurable Logic Unit (RLU) with In and Out paths, Config Control, Memory Control, Memory]

4 Ascenium Basic Operation
[Diagram: a conventional processor (ALU, Register File, Control, Memory) beside Ascenium (Reconfigurable Logic Unit (RLU), Config Control, In, Out, Memory Control, Memory). Labels: "12 RISC Instructions", "Clock = Memory"]

5 A128 Rough Layout
[Floorplan: Pad Ring, RLU, Inst. Pipe, Data Pipe, Onchip Memory]
-- 130 nm process
-- 128 -bit logic elements
-- 64 MACs/clock cycle
-- 200 MHz DDR II = 3.2 GB/s external memory bandwidth
-- 256 KB onchip memory
-- 19.2 GB/s peak onchip memory bandwidth
-- MHz
-- 52 mm²
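The external bandwidth figure can be checked with a small calculation. This is a sketch under one assumption the slide does not state: a 64-bit wide DDR II interface. With two transfers per clock, 200 MHz x 2 x 8 bytes gives the quoted 3.2 GB/s. The function name and parameters are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Peak bandwidth in bytes per second for a bus `bus_bits` wide, clocked at
 * `clock_hz`, moving `transfers_per_clock` words per clock (2 for DDR).
 * The 64-bit bus width used in the usage note is an assumption; the slide
 * gives only the clock rate and the resulting bandwidth. */
uint64_t peak_bandwidth_bytes(uint64_t clock_hz,
                              unsigned transfers_per_clock,
                              unsigned bus_bits)
{
    assert(bus_bits % 8 == 0);
    return clock_hz * transfers_per_clock * (bus_bits / 8);
}
```

Calling peak_bandwidth_bytes(200000000, 2, 64) returns 3200000000, i.e. 3.2 GB/s. The onchip figure is not reproduced here because its clock rate is missing from the slide.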

6 Logic Element
[Diagram blocks: Input A, Input B, Input C; Full Multiplier-Accumulator; Fast Carry Add; LUT, LUT Y; Wide Logic; Iterator, Iterator Y, Iterator Z; Output, Output Y]

7 Array Connectivity

8 Compiler Block Diagram
Source code (C, C++, Java) -> GCC Based Compiler (open source) -> Bytecode
Bytecode + Library (open source) Bytecode files -> Bytecode Linker (open source) -> Bytecode
Bytecode -> Proprietary Ascenium Code Generator (develop) -> Ascenium Executable

9 Code Generator Diagram
[Diagram stages: LLVM PARSER; GENERIC INT. REP. (GIR); GLOBAL THREADING; GIR OPTS.; ASCENIUM INT. REP. (AIR); FUNC. ASSIGNMENT & SCORING; ARRAY INST. PACKING & COMP.; DATA DIRECTIVES; FETCH ORDER ASSY.; AIR OPTS.; OUTPUT. A marker in the figure indicates where optimizations occur.]

10 Generic Intermediate Representation (GIR)
[Graph of GIR nodes: PARAM A, PARAM B, LOADI, CALL, *, STORE, RET]

11 FFT Inner Loop C Code
    i1 = i0 + n2;
    i2 = i1 + n2;
    i3 = i2 + n2;
    r1 = x[2 * i0] + x[2 * i2];
    r2 = x[2 * i0] - x[2 * i2];
    t = x[2 * i1] + x[2 * i3];
    x[2 * i0] = r1 + t;
    r1 = r1 - t;
    s1 = x[2 * i0 + 1] + x[2 * i2 + 1];
    s2 = x[2 * i0 + 1] - x[2 * i2 + 1];
    t = x[2 * i1 + 1] + x[2 * i3 + 1];
    x[2 * i0 + 1] = s1 + t;
    s1 = s1 - t;
    x[2 * i2] = (r1 * co2 + s1 * si2) >> 15;
    x[2 * i2 + 1] = (s1 * co2 - r1 * si2) >> 15;
    t = x[2 * i1 + 1] - x[2 * i3 + 1];
    r1 = r2 + t;
    r2 = r2 - t;
    t = x[2 * i1] - x[2 * i3];
    s1 = s2 - t;
    s2 = s2 + t;
    x[2 * i1] = (r1 * co1 + s1 * si1) >> 15;
    x[2 * i1 + 1] = (s1 * co1 - r1 * si1) >> 15;
    x[2 * i3] = (r2 * co3 + s2 * si3) >> 15;
    x[2 * i3 + 1] = (s2 * co3 - r2 * si3) >> 15;
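To try the inner loop body in isolation, the slide's statements can be wrapped in a function. The sketch below is not from the deck: the function name, the parameter list, the int16_t data type, and the int temporaries are all assumptions chosen so the fragment compiles on its own. The data layout (interleaved real and imaginary parts) and the Q15 scaling by >> 15 follow the slide.

```c
#include <assert.h>
#include <stdint.h>

/* One radix-4 butterfly on interleaved (re, im) Q15 data, following the
 * statements on the slide. Name, signature, and int temporaries are
 * assumptions made for this sketch.
 * Note: >> on a negative int is implementation-defined in C; mainstream
 * compilers use an arithmetic shift, which this code relies on. */
void radix4_butterfly(int16_t x[], int i0, int n2,
                      int co1, int si1, int co2, int si2, int co3, int si3)
{
    int i1, i2, i3, r1, r2, s1, s2, t;

    assert(i0 >= 0 && n2 > 0);

    /* The four points of the butterfly are n2 complex samples apart. */
    i1 = i0 + n2;
    i2 = i1 + n2;
    i3 = i2 + n2;

    /* Real parts: sums and differences, first output needs no twiddle. */
    r1 = x[2 * i0] + x[2 * i2];
    r2 = x[2 * i0] - x[2 * i2];
    t  = x[2 * i1] + x[2 * i3];
    x[2 * i0] = (int16_t)(r1 + t);
    r1 = r1 - t;

    /* Imaginary parts, same pattern. */
    s1 = x[2 * i0 + 1] + x[2 * i2 + 1];
    s2 = x[2 * i0 + 1] - x[2 * i2 + 1];
    t  = x[2 * i1 + 1] + x[2 * i3 + 1];
    x[2 * i0 + 1] = (int16_t)(s1 + t);
    s1 = s1 - t;

    /* Twiddle multiply (co2, si2), scaled back from Q15. */
    x[2 * i2]     = (int16_t)((r1 * co2 + s1 * si2) >> 15);
    x[2 * i2 + 1] = (int16_t)((s1 * co2 - r1 * si2) >> 15);

    t  = x[2 * i1 + 1] - x[2 * i3 + 1];
    r1 = r2 + t;
    r2 = r2 - t;
    t  = x[2 * i1] - x[2 * i3];
    s1 = s2 - t;
    s2 = s2 + t;

    /* Remaining two outputs with twiddles (co1, si1) and (co3, si3). */
    x[2 * i1]     = (int16_t)((r1 * co1 + s1 * si1) >> 15);
    x[2 * i1 + 1] = (int16_t)((s1 * co1 - r1 * si1) >> 15);
    x[2 * i3]     = (int16_t)((r2 * co3 + s2 * si3) >> 15);
    x[2 * i3 + 1] = (int16_t)((s2 * co3 - r2 * si3) >> 15);
}
```

With four complex points (n2 = 1, i0 = 0), the first output is simply the sum of the four inputs, which makes a convenient sanity check. Every operation in the body depends only on values loaded at the top, which is the property that lets the whole fragment be laid out as one combinatorial circuit.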

12 FFT Inner Loop Virtual Circuit
[Dataflow graph of the inner loop: add, subtract, multiply, and >>15 shift nodes labeled A through NN, with twiddle constants C1-C3 and S1-S3 and intermediate values r1-r5, s1-s5, t1-t4]

13 FFT Loop Array Instruction
[Figure: placement of the virtual-circuit nodes A through NN, the data inputs and outputs, and the twiddle constants C1-C3 and S1-S3 on the array]

14 FFT Loop Data Directives
1. Fetch the array instruction and data directives
2. Load the array instruction
3. Read x[t] in four 64-bit wide banks 32 times
4. Write x[t] in one 256-bit wide bank 32 times, changing banks every 8 operations
5. Repeat steps 3 and 4 three more times
6. At the same time as steps 3-5, read in w coefficients
7. Pass 1: w every clock; 2: w every 4 clocks, then every, then every
Stop
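The numbers in these directives are consistent with a 256-point radix-4 FFT on 16-bit complex data, which the following sketch checks. It rests on assumptions not stated on the slide: one block operation per clock, reads and writes fully overlapped, and the 250 MHz clock listed on the benchmark slide. All function names are hypothetical.

```c
#include <assert.h>

/* Number of radix-4 passes for an n-point FFT (n must be a power of 4). */
int radix4_passes(int n)
{
    int passes = 0;

    assert(n >= 4);
    while (n > 1) {
        assert(n % 4 == 0);
        n /= 4;
        passes++;
    }
    return passes;
}

/* Bits moved in one pass: `ops` block operations, each `banks` x `bank_bits`. */
long bits_per_pass(int ops, int banks, int bank_bits)
{
    return (long)ops * banks * bank_bits;
}

/* Time in ns for `passes` passes of `ops` block operations each.
 * ASSUMPTION: one block operation per clock, reads and writes overlapped. */
double fft_time_ns(int passes, int ops, double clock_mhz)
{
    assert(clock_mhz > 0.0);
    return passes * ops * 1000.0 / clock_mhz;
}
```

For 256 points, radix4_passes gives 4, matching "repeat steps 3 and 4 three more times". Each pass reads 32 x 4 x 64 = 8,192 bits, or 32 bits per point, which fits a 16-bit real and 16-bit imaginary part and agrees with the Q15 arithmetic in the inner loop. Under the stated assumptions, 4 passes x 32 clocks at 250 MHz is 512 ns.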

15 DSP Benchmark Results (Simulated, Hand-Coded)

Benchmark                   A128 core,                TMS320C64x core,
                            130nm, 250MHz, 250mW      130nm, 300MHz*, 250mW
256-Point Radix-4 FFT       512 ns                    8,920 ns
Real Block FIR Filter       512 ns                    6,827 ns
Complex Block FIR Filter    512 ns                    6,827 ns
DCT x 24                    512 ns                    6,080 ns
IIR Filter NR = 2,048       292 ns                    6,827 ns

* The 130nm C64x core will operate up to 720 MHz, but burns more power at higher frequencies.
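The speedups implied by these times can be computed directly. This is a minimal sketch; the helper name is hypothetical, and the pairing of each time with its benchmark follows the order in which the values appear on the slide.

```c
#include <assert.h>

/* Speedup of a candidate over a baseline: baseline time / candidate time. */
double speedup(double baseline_ns, double candidate_ns)
{
    assert(candidate_ns > 0.0);
    return baseline_ns / candidate_ns;
}
```

Applied to the slide's figures, speedup(8920, 512) is roughly 17.4x for the FFT, speedup(6827, 512) roughly 13.3x for both FIR filters, speedup(6080, 512) roughly 11.9x for the DCT, and speedup(6827, 292) roughly 23.4x for the IIR filter. Note that the C64x times are quoted at 300 MHz and the A128 times at 250 MHz, both at 250 mW.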

16 Ascenium Status
-- Patents filed
-- Chip architecture complete, polishing ongoing
-- Prototype compiler complete & demo-able
-- Development plan in place
-- Raising seed round

17 Summary
Great performance and performance/mW:
-- >10x the performance of programmable DSPs
-- >4x the performance/mW of FPGAs
Great time-to-market:
-- Normal GPP software development cycle
Great ease-of-use:
-- Programmers writing standard, non-optimized code in C/C++ get all the performance
