Ascenium: A Continuously Reconfigurable Architecture Robert Mykland Founder/CTO robert@ascenium.com August, 2005
Ascenium: A Continuously Reconfigurable Processor Continuously reconfigurable approach provides: The computational efficiency of direct logic implementations in ASICs All the flexibility of microprocessors (pure software) Unique architecture enables the array to be effectively targeted by an ANSI Standard C compiler CPU logic is continuously redefined to match the compiler s needs for each instruction ( 20 cycles) Massive parallelism even in programs with little or no inherent instruction-level parallelism Dozens of lines of sequential C code are executed every instruction
Ascenium Block Diagram No instruction set RLU is combinatorial logic that settles Memory is accessed in statically scheduled block operations Two instruction streams 1) Memory 2) Configuration Whole loops are often executed as one RLU instruction Reconfigurable Logic Unit (RLU) Config Control In Out Memory Ascenium Memory Control
Ascenium Basic Operation 12 RISC Instructions 5 10 11 12 1 3 4 5 6 7 8 9 265 145 155 5 10 20 25 30 35 40 ALU Register File Control Reconfigurable Logic Unit (RLU) Config Control In Out Memory Control Memory Conventional Clock = 24 25 10 11 12 13 14 15 17 18 19 20 21 22 23 24 Memory Ascenium
A128 Rough Layout Pad Ring RLU Inst. Pipe Data Pipe Onchip Memory 130 nm process 128 -bit logic elements 64 MACs/clock cycle 200 MHz DDR II = 3.2 GB/s external memory bandwidth 256KB onchip memory 19.2 GB/s peak onchip memory bandwidth 500 mw @ 200 MHz 52 mm 2
Logic Element Input A ( bits) Input B ( bits) Input C ( bits) ½ -Bit Full Multiplier - Accumulator Fast Carry -Bit Add LUT ( bits) LUT Y ( bits) Wide Logic Iterator ( bits) Iterator Y ( bits) Iterator Z ( bits) Output ( bits) Output Y ( bits)
Array Connectivity
Compiler Block Diagram Library (open source) Bytecode files Source code C, C, Java GCC Based Compiler (open source) Bytecode Bytecode Linker (open source) Bytecode Proprietary Ascenium Code Generator (develop) Ascenium Executable
Code Generator Diagram LLVM PARSER GENERIC INT. REP. (GIR) GLOBAL THREADING GIR OPTS. ASCENIUM INT. REP. (AIR) FUNC. ASSIGNMENT & SCORING ARRAY INST. PACKING & COMP. DATA DIRECTIVES FETCH ORDER ASSY. AIR OPTS. OUTPUT indicates where optimizations occur
Generic Intermediate Representation (GIR) PARAM A PARAM B PARAM A LOADI PARAM A CALL RET * LOADI STORE RET RET
FFT Inner Loop C Code i1 = i0 n2; i2 = i1 n2; i3 = i2 n2; r1 = x[2 * i0] x[2 * i2]; r2 = x[2 * i0] - x[2 * i2]; t = x[2 * i1] x[2 * i3]; x[2 * i0] = r1 t; r1 = r1 - t; s1 = x[2 * i0 1] x[2 * i2 1]; s2 = x[2 * i0 1] - x[2 * i2 1]; t = x[2 * i1 1] x[2 * i3 1]; x[2 * i0 1] = s1 t; s1 = s1 - t; x[2 * i2] = (r1 * co2 s1 * si2) >>15; x[2 * i2 1] = (s1 * co2-r1*si2)>>15; t = x[2 * i1 1] - x[2 * i3 1]; r1 = r2 t; r2 = r2 - t; t = x[2 * i1] - x[2 * i3]; s1 = s2 - t; s2 = s2 t; x[2 * i1] = (r1 * co1 s1 * si1)>>15; x[2 * i1 1] = (s1 * co1-r1 * si1)>>15; x[2 * i3] = (r2 * co3 s2 * si3)>>15; x[2 * i3 1] = (s2 * co3-r2 *si3)>>15;
FFT Inner Loop Virtual Circuit 0 2 0 2 1 3 Y0 Y2 Y0 Y2 Y1 Y3 Y1 Y3 1 3 A r1 - B C r2 t1 s1 s2 D - E F - G - H t2 t3 t4 - I - J K C2 S2 C2 S2 C1 S1 C1 S1 C3 S3 C3 r3 s3 r4 r5 s4 s5 - L - M N S3 O P Q R S T U V W Y Z 32 AA 32 BB 32 CC 32 DD 32 EE 32 FF GG HH >>15 II >>15 JJ >>15 KK >>15 LL >>15 MM >>15 NN 0o Y0o 2o Y2o 1o Y1o 3o Y3o
FFT Loop Array Instruction AA BB CC DD O P R Q S T V U B A E D 0 Y0 2 Y2 C2 S2 W H C EE G F 1 Y1 3 Y3 C3 S3 NN Z Y K I C1 Y3 o MM KK FF M J S1 1 3 o o LL L N GG HH 0 Y1 o o JJ II Y0 2 o Y2 o o A 0 Y0 2 Y2 C2 S2 D C F I J GG HH B E H G K M L O 1 Y1 3 Y3 C3 S3 C1 S1 Y3 1 o 3 o o 0 Y1 o o N P Q T U AA R BB S CC V DD W Y LL II Z EE FF JJ NN KK MM Y0 o2 oy2 o
FFT Loop Data Directives 1. Fetch the array instruction and data directives 2. Load the array instruction 3. Read x[t] in four 64-bit wide banks 32 times 4. Write x[t] in one 256-bit wide bank 32 times, changing banks every 8 operations 5. Repeat steps 3 and 4 three more times 6. At the same time as 3 5, read in w coefficients 7. Pass 1: w every clock; 2: w every 4 clocks, then every, then every 64 8. Stop
DSP Benchmark Results (Simulated, Hand-Coded) Benchmark 256-Point Radix-4 FFT Real Block FIR Filter Complex Block FIR Filter DCT x 24 IIR Filter NR = 2,048 A128 core, 130nm, 250MHz, 250mW 512 ns 512 ns 512 ns 512 ns 292 ns TMS320C64x core, 130nm, 300MHz*, 250mW 8,920 ns 6,827 ns 6,827 ns 6,080 ns 6,827 ns * The 130nm C64x core will operate up to 720 MHz, but burns more power at higher frequencies.
Ascenium Status Patents filed Chip architecture complete, polishing ongoing Prototype compiler complete & demo-able Development plan in place Raising seed round
Summary Great performance and performance/mw: >10x programmable DSPs >4x performance/mw of FPGAs Great time-to-market: -- Normal GPP software development cycle Great ease-of-use: -- Programmers writing standard, non-optimized code in C/C get all the performance