SDSoC: Session 1

1 SDSoC: Session 1 ADAM@ADIUVOENGINEERING.COM

2 What is SDSoC
SDSoC is a system-optimising compiler which allows us to optimise designs for:
- Zynq PS / PL
- Zynq MPSoC PS / PL
- MicroBlaze
What does this mean? Following the creation of a platform, we can develop our application in C or C++ and move functions from executing in the processor to being implemented in programmable logic.

3 Heterogeneous SoC

4 Softcore Designs

5 Benefits of PL Acceleration

6 Benefits of PL Acceleration
Table columns: Function, PS (Clocks), PL (Clocks), Reduction (%)
Table rows:
- AES, Linux
- AES, bare metal
- AES, FreeRTOS
- FIR filter, bare metal
- Matrix multiply, bare metal

7 Traditional Development Flow
- Generate high-level system model
- Segment system between PS and PL
- Develop PS solution
- Develop PL solution
- Integrate PS and PL solution

8 SDSoC Development Flow

9 Under the hood
- Accelerates the function using Vivado HLS
- Analyses communication
- Establishes AXI communications
- Generates software stub

10 Who has used HLS before?
- HLS came of age over the last 5 years
- HLS is excellent for data flow acceleration, e.g. signal processing, image processing, artificial intelligence and machine learning
- HLS stages: scheduling, binding, control extraction

11 Example of HLS

12 Looking a little deeper

13 Terminology

14 C to RTL
HLS synthesizes the C code in different ways:
- Top-level function arguments synthesize into RTL I/O ports
- Loops in the C functions are kept rolled by default
- C functions synthesize into blocks in the RTL hierarchy
- Arrays in the C code synthesize into block RAM in the final design
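A minimal sketch of how these rules map onto source code. The function names `scale` and `scale_add` and the size `N_SAMPLES` are hypothetical, chosen only for illustration; the comments describe what the rules above say HLS would do by default, not the output of a specific tool run.

```cpp
#include <cstddef>

constexpr std::size_t N_SAMPLES = 8;  // hypothetical size, for illustration only

// Sub-function: by the rules above, this becomes its own block in the RTL hierarchy.
static int scale(int x, int gain) {
    return x * gain;
}

// Top-level function: each argument becomes an RTL I/O port.
void scale_add(const int in[N_SAMPLES], int gain, int offset, int out[N_SAMPLES]) {
    int tmp[N_SAMPLES];  // local array: maps to block RAM by default

    // Loops stay rolled by default, so the loop body hardware is reused
    // across all N_SAMPLES iterations.
    for (std::size_t i = 0; i < N_SAMPLES; ++i) {
        tmp[i] = scale(in[i], gain);
    }
    for (std::size_t i = 0; i < N_SAMPLES; ++i) {
        out[i] = tmp[i] + offset;
    }
}
```

The same code compiles and runs unchanged on the processor, which is what makes it practical to test a function in software before moving it into logic.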

15 Interfacing

16 So how does it work
- Accelerator function is called
- The call configures the DMA to move data
- DMA runs: data is sourced from DDR, OCM, or L1/L2 and moved to an input buffer
- The buffer might be local memory, BRAM, or a FIFO
- Accelerator runs
- The accelerator loads the output buffer
- Once the transfer is complete, the DMA signals the processor
- Accelerator function completes; user code continues
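The sequence above is normally generated for us, but we can steer it with SDS data pragmas on the hardware function. The sketch below is an illustration under assumptions: the function `invert`, the helper `run_invert` and the length `BUF_LEN` are hypothetical, and the pragma spellings should be checked against the SDSoC pragma reference for your tool version. A host compiler ignores the pragmas, so the code also runs as plain software.

```cpp
#include <array>
#include <cstddef>

constexpr std::size_t BUF_LEN = 16;  // hypothetical buffer length

// Data-mover hints for the hardware function (ignored by a host compiler):
// sequential access, explicit copy of both buffers, simple AXI DMA.
#pragma SDS data access_pattern(in:SEQUENTIAL, out:SEQUENTIAL)
#pragma SDS data copy(in[0:BUF_LEN], out[0:BUF_LEN])
#pragma SDS data data_mover(in:AXIDMA_SIMPLE, out:AXIDMA_SIMPLE)
void invert(const unsigned char in[BUF_LEN], unsigned char out[BUF_LEN]) {
    for (std::size_t i = 0; i < BUF_LEN; ++i) {
        out[i] = static_cast<unsigned char>(255 - in[i]);
    }
}

// Caller side: to user code this is an ordinary function call. On target the
// buffers would normally come from a contiguous allocator such as sds_alloc();
// std::array stands in here so the sketch builds on a host machine.
unsigned char run_invert() {
    std::array<unsigned char, BUF_LEN> src{};
    std::array<unsigned char, BUF_LEN> dst{};
    for (std::size_t i = 0; i < BUF_LEN; ++i) {
        src[i] = static_cast<unsigned char>(i);
    }
    invert(src.data(), dst.data());
    return dst[BUF_LEN - 1];
}
```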

17 Supporting Libraries
- Math
- Vivado HLS IP Library
- Linear Algebra Library
- Arbitrary Precision Data Types
- reVISION Acceleration stack
- HLS Video Libraries

18 Deeper Dive on the Zynq
- L1 cache for each processor
- L2 cache shared by processors
- Snoop control unit regulates memory flow, providing cache coherency
Cache coherency:
- Full coherency: masters participating in full coherency can access each other's caches
- I/O coherency: intended for I/O devices that generally have no cache but can access shared memory in the caches of fully coherent masters

19 Interfacing PS to PL
- Memory interconnect
  - Enables data transfer between PL and PS memory resources
  - PS-PL interface provided by four AXI ports
- Master interconnect (PS master)
  - Enables data transfer between PS master and PL slave endpoints
  - PS-PL interface provided by two AXI ports
- Slave interconnect (PS slave)
  - Enables data transfer between PS slave and PL master endpoints
  - PS-PL interface provided by two AXI ports

20 Importance of ACP
- The accelerator coherency port (ACP) bolts directly into the snoop control unit (SCU)
- ACP is a PS slave AXI port (that is, the accelerator is a master)
- Enables the accelerator on the ACP to write directly into the L1 and L2 caches, and indirectly to the DDR/OCM
- Data movement is limited by the size of the cache target
  - Typically L2 is the target due to its larger size
  - Cache misses result in moving data to the DDR
  - Frequent misses indicate that the HP port (not the ACP) should have been used
- When the ACP is not used, the caches are used only in the standard way

21 Deeper Dive on the MPSoC
- Much more complex system
- SDSoC accelerates APU functions
- Quad/dual-core ARM Cortex-A53 processor

22 PS / PL Interfacing MPSoC
- Accelerator coherency port (ACP)
  - 128/64-bit configurable
- AXI coherency extension (ACE)
- Two high-performance coherent interfaces (HPC)
- Four high-performance slave ports
  - Configurable 32/64/128-bit data width
- Two high-performance master ports
  - Can be accessed from APU or RPU
  - Configurable 32/64/128-bit data width
- PL to low-power domain (PL_LPD)
  - Configurable 32/64/128-bit data width
- One RPU low-latency port (LPD_PL)
  - Configurable 32/64-bit data width

23 MPSoC: greater cache complexity
- System-wide coherency
  - Cache coherent interconnect block (CCI) with coherency management logic
  - System memory management unit (SMMU) supports multiple memory tables and virtualization
- Next-generation interface coherency support
  - ACE: AXI coherency extensions
  - HPC: high-performance one-way coherency for memories and peripherals
- Accelerator coherency port (ACP) to the APU
  - 128/64-bit configurable
  - Extends the APU snoop control unit to external processor caches

24 MPSoC Cache
- ACP: coherent access for external masters
- Snoop control unit: APU coherency
- CCI: cache coherent interconnect
- ACE: masters are fully coherent
- ACE-Lite: I/O coherent

25 SDSoC Tool Introduction
- Device support: Zynq device, Zynq UltraScale+ MPSoC and MicroBlaze
- ARM compiler tool chain support: Linaro-based gcc compiler tool chains
- Target OS support: Linux (kernel 4.x, Xilinx branch), bare metal, and FreeRTOS
- QEMU and RTL co-simulation: Linux and Windows 64-bit host support
- OpenCL compilation flow support

26 How to get best results
- Getting the best performance takes a few iterations
- It is not unusual for your first acceleration of a function to perform worse than the same function in the PS
- We need to consider the movement of large elements of data, otherwise transfer time dominates and masks the acceleration
- Vivado HLS skills are very important to achieve optimal performance
- Understanding the SoC architecture is also necessary

27 What is best to accelerate
- Some obvious things cannot be accelerated: pre-compiled libraries, OS calls, etc.
- Compute-intensive algorithms are good candidates
- We need to consider the time it takes for data movement to and from the PL
- Use Amdahl's law

28 Amdahl's law
- S: overall performance improvement
- Alpha: fraction of the algorithm that can be sped up with hardware acceleration
- 1 - alpha: fraction of the algorithm that cannot be improved
- p: the speedup of the accelerated portion
- Set alpha to 0.1 and select a speedup: even with a large p, the overall speedup is close to 1
- Set alpha to 0.5 and select the same speedup: the overall improvement is close to a factor of two
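Using the symbols defined on this slide, the standard form of Amdahl's law is S = 1 / ((1 - alpha) + alpha / p). The short sketch below evaluates it; the function name `amdahl_speedup` is hypothetical and the value p = 100 is just an example of a "large" acceleration.

```cpp
// Amdahl's law.
// alpha: fraction of the run time that can be accelerated (0.0 to 1.0)
// p:     speedup factor achieved on that accelerated fraction
// Returns the overall speedup S of the whole algorithm.
double amdahl_speedup(double alpha, double p) {
    return 1.0 / ((1.0 - alpha) + alpha / p);
}
```

With p = 100, alpha = 0.1 gives an overall speedup of about 1.11, while alpha = 0.5 gives about 1.98, which matches the two cases described on the slide.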

29 Getting the best from HLS
- Functions we accelerate into logic often need optimising
  - Loops need unrolling
  - Memory structures need optimising
  - Resource allocation
- HLS is controlled via #pragma in the accelerated function

30 #pragma HLS PIPELINE
- Improves throughput (the initiation interval)
- Without pipelining:
  - Three clock cycles before operation RD can occur again: throughput = three cycles
  - Three cycles before the first output is written: latency = three cycles
  - For the loop, six cycles
- With pipelining:
  - Latency is the same
  - Throughput is better: fewer cycles, higher throughput
  - Loop latency has been improved
- Can be applied to functions too, not just loops
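A minimal sketch of the pragma in use, assuming a hypothetical function `accumulate` and size `TAPS`. The `II=1` option requests a new loop iteration every clock cycle; whether the tool achieves it depends on the design. A host compiler ignores the pragma, so the function behaves identically in software.

```cpp
constexpr int TAPS = 32;  // hypothetical array length

// Sums an array. The pragma asks HLS to start one iteration per clock
// (initiation interval of 1) instead of waiting for each to finish.
int accumulate(const int data[TAPS]) {
    int sum = 0;
    for (int i = 0; i < TAPS; ++i) {
#pragma HLS PIPELINE II=1
        sum += data[i];
    }
    return sum;
}
```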

31 Considerations for Pipelining
- HLS will unroll all loops nested below the pipeline directive
- Pipelining the innermost loop will result in the best performance for area

32 Coding style that prevents pipelining
- To be unrolled, a loop must have fixed bounds
- Feedback within the pipeline
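One common way to meet the fixed-bounds requirement is to loop to a compile-time maximum and guard the body, rather than looping to a run-time length. The sketch below shows both forms; the names `sum_variable`, `sum_fixed` and `MAX_LEN` are hypothetical, and the caller is assumed to keep `len` within `MAX_LEN`.

```cpp
constexpr int MAX_LEN = 64;  // hypothetical compile-time upper bound

// Variable trip count: the bound is only known at run time,
// so the loop cannot be fully unrolled.
int sum_variable(const int *data, int len) {
    int sum = 0;
    for (int i = 0; i < len; ++i) {
        sum += data[i];
    }
    return sum;
}

// Fixed trip count: the loop always runs MAX_LEN times and the
// run-time length only gates the body, so it can be unrolled.
int sum_fixed(const int data[MAX_LEN], int len) {
    int sum = 0;
    for (int i = 0; i < MAX_LEN; ++i) {
#pragma HLS UNROLL
        if (i < len) {
            sum += data[i];
        }
    }
    return sum;
}
```

The cost of the fixed-bound form is that hardware is built for the worst case, so `MAX_LEN` should be kept as small as the application allows.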

33 #pragma HLS DATAFLOW
- Works like pipelining, but at the top level

34 Considerations for dataflow
- The data must flow through the design from one task to the next
- We cannot have code structures with:
  - Single-producer-consumer violations
  - Bypassing of tasks
  - Feedback between tasks
  - Conditional execution of tasks
  - Loops with multiple exit conditions
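A minimal sketch of a structure that respects these rules: three tasks in a straight chain, where each intermediate array is written by exactly one task and read by exactly one task. The names `load`, `square`, `store`, `process_frame` and `FRAME` are hypothetical.

```cpp
constexpr int FRAME = 16;  // hypothetical frame size

// Task 1: bring the input into the design.
static void load(const int in[FRAME], int a[FRAME]) {
    for (int i = 0; i < FRAME; ++i) { a[i] = in[i]; }
}

// Task 2: the processing step.
static void square(const int a[FRAME], int b[FRAME]) {
    for (int i = 0; i < FRAME; ++i) { b[i] = a[i] * a[i]; }
}

// Task 3: write the result out.
static void store(const int b[FRAME], int out[FRAME]) {
    for (int i = 0; i < FRAME; ++i) { out[i] = b[i]; }
}

// Top level: data flows load -> square -> store with no bypass, no feedback
// and no conditional task, so the tasks are allowed to overlap in time.
void process_frame(const int in[FRAME], int out[FRAME]) {
#pragma HLS DATAFLOW
    int a[FRAME];  // single producer (load), single consumer (square)
    int b[FRAME];  // single producer (square), single consumer (store)
    load(in, a);
    square(a, b);
    store(b, out);
}
```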

35 HLS Pipeline vs HLS Dataflow
- HLS Dataflow is coarse-grain pipelining
  - Works at the function and loop level
- HLS Pipeline is fine-grain pipelining
  - Works at the operator level

36 Arrays
- Arrays are intuitive and useful software constructs
- When accelerated they synthesise into block RAM by default
- An array can be targeted to any memory resource in the library
- This can create bottlenecks in the accelerator, because every access has to go through the memory's ports
- What we want to be able to do is reshape memory configurations to enable more efficient accesses

37 #pragma HLS ARRAY_PARTITION
- Array partitioning allows higher bandwidth
- We can restructure the block RAM implementation for more optimal results
- As always, there is a trade-off between performance and area
- We use the pragma HLS ARRAY_PARTITION
- If limited block RAMs are available, look at HLS ARRAY_RESHAPE
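A minimal sketch of partitioning in use, assuming a hypothetical dot-product function `dot_partitioned` with sizes `VEC_LEN` and `FACTOR`. Cyclic partitioning by 4 spreads consecutive elements over four memories, so the four reads in the inner loop can be served in parallel; the actual resource use and timing depend on the tool and device.

```cpp
constexpr int VEC_LEN = 16;  // hypothetical vector length
constexpr int FACTOR = 4;    // hypothetical partition factor

int dot_partitioned(const int a_in[VEC_LEN], const int b_in[VEC_LEN]) {
    int a[VEC_LEN];
    int b[VEC_LEN];
    // Split each local array into 4 memories; element i lands in memory i % 4.
#pragma HLS ARRAY_PARTITION variable=a cyclic factor=4 dim=1
#pragma HLS ARRAY_PARTITION variable=b cyclic factor=4 dim=1

    // Copy the inputs into the partitioned local arrays.
    for (int i = 0; i < VEC_LEN; ++i) {
        a[i] = a_in[i];
        b[i] = b_in[i];
    }

    int sum = 0;
    for (int i = 0; i < VEC_LEN; i += FACTOR) {
#pragma HLS PIPELINE II=1
        // Pipelining the outer loop unrolls this inner loop, so all
        // FACTOR products are requested in the same cycle.
        for (int j = 0; j < FACTOR; ++j) {
            sum += a[i + j] * b[i + j];
        }
    }
    return sum;
}
```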

38 What does partitioning look like

39 Understanding the dimensions
- In the pragma we define which dimension of the array is to be partitioned
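For a multi-dimensional array, the `dim` option selects the dimension: `dim=1` is the first index, `dim=2` the second, and `dim=0` partitions all dimensions. The sketch below, with the hypothetical names `row_sums`, `ROWS` and `COLS`, partitions the column dimension completely so that a whole row can be read at once.

```cpp
constexpr int ROWS = 4;  // hypothetical matrix height
constexpr int COLS = 4;  // hypothetical matrix width

void row_sums(const int m_in[ROWS][COLS], int sums[ROWS]) {
    int m[ROWS][COLS];
    // dim=2 with "complete" splits the second index (columns) fully:
    // every column gets its own storage.
#pragma HLS ARRAY_PARTITION variable=m complete dim=2

    // Copy the input into the partitioned local array.
    for (int r = 0; r < ROWS; ++r) {
        for (int c = 0; c < COLS; ++c) {
            m[r][c] = m_in[r][c];
        }
    }

    for (int r = 0; r < ROWS; ++r) {
#pragma HLS PIPELINE II=1
        // All COLS elements of row r sit in separate memories,
        // so they can be read in the same cycle.
        int acc = 0;
        for (int c = 0; c < COLS; ++c) {
            acc += m[r][c];
        }
        sums[r] = acc;
    }
}
```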

40 Questions?
