Exercise: RISC Programming


Increasing efficiency of a RISC core with simple instruction extensions
Michael Gautschi


Introduction

Today's exercises will be performed on the Pulpino platform:
- Open-source platform [www.pulp-platform.
- OpenRISC / RISC-V core
- 32 kB instruction memory
- 32 kB data memory
- SPI (load/unload data)
- UART (for printf)
- Small event unit

Exercise Overview

1. Introduction example
   Compile & execute Helloworld
2. RTL simulator basics
   Run the motion_detection application [perf counters, traces, read]
3. Benchmarking
   Analyze performance improvements of the new instructions
4. Efficient matrix multiplications and convolutions with dot product
   Program a convolution and show the benefit of the dot product
5. Motion detection with efficient convolution
   Plug the optimized convolution into the application and observe the speedup
6. Compressed instructions on RISC-V
   Coremark analysis

Getting Started 1/2

Copy data from the master account:
    $ mkdir 2_OpenRISC
    $ cp /home/soc_master/2_openrisc/pulpino.tar.gz 2_OpenRISC/.
    $ tar xzf pulpino.tar.gz

2_OpenRISC directory (we will be working in the software (sw) and build directories):
- rtl/ips-dir: contains HDL source code
- sw-dir: contains application source code (in apps)
- build-dir: contains compiler and simulator outputs; RTL simulations will be run here
- vsim-dir: contains all scripts for RTL compilation

Getting Started 2/2

We will be working on the scratch because we are going to generate some data.

1. Create a build directory and set up the compiler
    $ mkdir /scratch/soc_xx/build_or10n
2. Configure the build directory
    $ cd /scratch/soc_xx/build_or10n
    $ cp ~soc_master/2_openrisc/pulpino/sw/cmake_configure.or1k.gcc.sh .
   In the configure script, set the path to your exercise folder:
    PULP_GIT_DIRECTORY= /home/soc_xx/2_openrisc/pulpino
    $ or1k -g /cmake_configure.or1k.gcc.sh
   You have successfully set up the build directory!
3. Compile the RTL code
    $ make vcompile

Let's get started with exercise 1!

Exercise 1 Introduction

a) The build directory is created, the compiler is configured, and the RTL is compiled. We are ready to start with a simple helloworld.
b) Compile helloworld
   helloworld.c is located in sw/apps/helloworld/. To compile the application, enter the build folder and run the makefile:
    $ cd /scratch/soc_xx/build_or10n
    $ make helloworld.read      : to generate the assembly
    $ make helloworld.slm.cmd   : to generate input data for RTL simulations
c) Compile & run helloworld
   The application can be run in modelsim (gui) or in batch mode:
    $ make helloworld.vsim      : to start modelsim (+type run -all)
    $ make helloworld.vsimc     : to run in batch mode
   The console should output helloworld. The output is also written to the file:
    apps/helloworld/stdout/uart

Exercise 2 Basic Tests 1/4

We are now looking at a more complicated application: the motion_detection application.

To compile & run the application:
    $ make motion_detection.vsimc

- A timer tracks how many cycles were required to compute the image
- The printf output is sent over UART, and the testbench dumps the received data to the file:
    build_or10n/apps/sequential_tests/motion_detection/stdout/uart
- The testbench also outputs a trace file which shows the sequence in which the instructions were executed:
    build_or10n/apps/sequential_tests/motion_detection/trace_core00.log

Exercise 2 Basic Tests 2/4

To better understand what the compiler generated, you can have a look at the disassembled code:
    $ make motion_detection.read

The disassembly shows:
- PC
- Disassembled instructions
- Absolute and relative jump/branch targets

The trace file shows:
- Time
- Cycle
- PC
- Instruction encoding
- ALU register update; load data to register; write to memory

Exercise 2 Basic Tests 3/4

Performance counters:
- In order to profile an application, the core supports several performance counters
- Only one counter exists in the micro-architecture to keep the area overhead small
- To count multiple events, the program has to be run several times in sequence with a different event configured each time

The following events are of interest:

    Name                 ID   Counts
    SPR_PCER_CYCLES       0   # cycles
    SPR_PCER_INSTR        1   # instructions
    SPR_PCER_LD_STALL     2   # load hazards
    SPR_PCER_LD           7   # load insn.
    SPR_PCER_ST           8   # store insn.
    SPR_PCER_JUMP         9   # jumps
    SPR_PCER_BRANCH      10   # branches
    SPR_PCER_DELAY_NOP   11   # delay nops

Functions to set up performance counters:

    perf_reset()         : to reset counters
    perf_enable_id(id)   : start count event ID
    perf_stop()          : stop counting
    cpu_perf_get(id)     : read counter

Exercise 2 Basic Tests 4/4

Tasks:
- How many kB is the binary?
- How big is the convolution_rect function?
- Profile the motion_detection algorithm:
    - How many instructions are executed?
    - How many loads/stores were used?
    - How many cycles were counted?
    - What is the IPC (# instructions per cycle)?
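The IPC asked for above is simply the ratio of two counter readings. A minimal sketch, assuming the values have already been read back with cpu_perf_get() in separate runs; the struct and function names below are hypothetical and not part of the Pulpino software:

```c
/* Hypothetical container for raw counter values collected over several runs
 * (one event per run, since only one hardware counter exists). */
typedef struct {
    unsigned int cycles;        /* SPR_PCER_CYCLES */
    unsigned int instructions;  /* SPR_PCER_INSTR  */
    unsigned int loads;         /* SPR_PCER_LD     */
    unsigned int stores;        /* SPR_PCER_ST     */
} perf_sample_t;

/* Instructions per cycle; returns 0.0 if no cycles were counted. */
double perf_ipc(const perf_sample_t *s)
{
    if (s->cycles == 0) {
        return 0.0;
    }
    return (double)s->instructions / (double)s->cycles;
}

/* Fraction of executed instructions that are loads or stores. */
double perf_mem_ratio(const perf_sample_t *s)
{
    if (s->instructions == 0) {
        return 0.0;
    }
    return (double)(s->loads + s->stores) / (double)s->instructions;
}
```

An IPC close to 1 means the core rarely stalls; load hazards, jumps and branches push it below 1.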

Exercise 3 Benchmarking 1/6

We will benchmark a simple matrix multiplication:
    sw/apps/sequential_tests/matrixmul8/matrixmul.c
    sw/apps/sequential_tests/matrixmul8/matmul_kernels.c

To get quick cycle count feedback, the timer is used. Include timer.h and use the functions:
- reset_timer()
- start_timer()
- stop_timer()
- get_time()

Hardware loops
- Hardware loops are enabled by default
- The flag -mnohwloop prevents the use of hardware loops in an application. To enable them for this exercise, open ../matrixMul8/CMakeLists.txt and remove the flag -mnohwloop
- When you recompile the application, the flags from CMakeLists.txt are applied automatically
- The compiler generates the following hwloop instructions to produce efficient loops:
    - lp.start
    - lp.end
    - lp.count
    - lp.counti
    - lp.setup
    - lp.setupi

Exercise 3 Benchmarking 2/6

Tasks:
- Check if hardware loops are generated (in the matrixmul8.read file)
- What speedup do you expect when enabling hardware loops?
- How many instructions are actually saved? Compare the matrixmul8.read with and w/o hwloops.
- Do your measurements match your estimations?
- How do your results change if you set N, M to a constant? (in matmul8())
    int M = SIZE;
    int N = SIZE;

                                     Execution time (# cycles / % improvement)   Codesize [B]
    Baseline                                                     / -
    Hardware loop (2 register set)


Exercise 3 Benchmarking 3/6

Post increment:
- Activated by default! Deactivate with -mnopostmod
- Post increment immediate
- Post increment register: from a hardware perspective, what is the drawback of this instruction?

Multiply-accumulate instruction:
- Old architecture:
    - Accumulation register stored in a special register
    - Accumulation result can be accessed in two cycles
- New architecture:
    - Enabled by default!
    - Accumulates directly on the register file
    - Disable with -mnomac

[Figure: instruction examples for post increment immediate, post increment register, old MAC and new MAC]
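To see where these two extensions can apply, it helps to look at a loop whose body is exactly "load, load, multiply, accumulate". The following is a plain C sketch (the function name is illustrative); whether the compiler actually emits post-increment loads and a MAC for it depends on the toolchain and the flags above, so check the .read file:

```c
/* Dot product written with pointer post-increment.
 * Each iteration reads *a and *b and then advances both pointers, which is
 * the access pattern a post-increment load can cover in one instruction.
 * The accumulation acc += x * y is the pattern a MAC instruction covers. */
int dot_product(const int *a, const int *b, int n)
{
    int acc = 0;
    for (int i = 0; i < n; i++) {
        acc += *a++ * *b++;
    }
    return acc;
}
```

Without the extensions, each iteration additionally needs two pointer additions and a separate multiply and add.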

Exercise 3 Benchmarking 4/6

Vector instructions:
- Add, sub, and comparisons are all supported in vector mode
- It is possible to process in parallel:
    - One word
    - Two halfwords, or
    - Four bytes
- Check in the matrixmul.read if vector code is generated. Vector instructions have the format: lv.{sub,add,dotp,...}

Tasks:
Run the matrixmul application with the different compiler options:
1. no extensions: -mnohwloop -mnopostmod -mnomac
2. with hardware loops: -mnopostmod -mnomac
3. with post increment: -mnomac
4. with register mac: (no disable flags)

Summarize your results in the first table on the next page.
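The four-bytes-in-parallel mode can be emulated in portable C to make the semantics concrete. This is only a host-side model under the assumption of signed 8-bit lanes with wrap-around addition; the helper names are hypothetical and do not correspond to compiler builtins:

```c
#include <stdint.h>

/* Pack four signed bytes into one 32-bit word, lane 0 in the low byte. */
uint32_t pack4(int8_t b0, int8_t b1, int8_t b2, int8_t b3)
{
    return (uint32_t)(uint8_t)b0
         | ((uint32_t)(uint8_t)b1 << 8)
         | ((uint32_t)(uint8_t)b2 << 16)
         | ((uint32_t)(uint8_t)b3 << 24);
}

/* Extract lane i (0..3) as a signed 8-bit value. */
static int8_t lane(uint32_t v, int i)
{
    return (int8_t)(uint8_t)(v >> (8 * i));
}

/* Lane-wise addition of four bytes (each lane wraps modulo 256). */
uint32_t vadd4(uint32_t a, uint32_t b)
{
    uint32_t r = 0;
    for (int i = 0; i < 4; i++) {
        uint8_t s = (uint8_t)(lane(a, i) + lane(b, i));
        r |= (uint32_t)s << (8 * i);
    }
    return r;
}

/* Dot product: 4 multiplications and 3 additions on the byte lanes. */
int32_t dotp4(uint32_t a, uint32_t b)
{
    int32_t acc = 0;
    for (int i = 0; i < 4; i++) {
        acc += (int32_t)lane(a, i) * (int32_t)lane(b, i);
    }
    return acc;
}

/* Sum of dot product: dot product accumulated onto an existing value. */
int32_t sdotp4(int32_t acc, uint32_t a, uint32_t b)
{
    return acc + dotp4(a, b);
}
```

On the core, each of these helpers would correspond to a single vector instruction instead of a loop over lanes.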

Exercise 3 Benchmarking 5/6

                       Instructions              Cycles                  Codesize
                       Total   Reduction [%]     Total   Speedup [%]     Total [B]
    Baseline                   -                         -
    +Hardware-loop
    +Post increment
    +mac
    +Dot product

- Use constant values for N, M to get a fair comparison
- What can be done better?
- Try to improve the matrix multiplication by using dot product operations (see next slide)
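The percentage columns can be filled in with two small helpers. The definitions below are assumptions (reduction relative to the baseline, speedup as a ratio of cycle counts); if the exercise expects a different convention, adapt them accordingly:

```c
/* Reduction in percent relative to the baseline, e.g. for instruction
 * counts or cycles. Returns 0.0 for an empty baseline. */
double reduction_percent(unsigned long baseline, unsigned long value)
{
    if (baseline == 0) {
        return 0.0;
    }
    return 100.0 * ((double)baseline - (double)value) / (double)baseline;
}

/* Speedup as a factor: baseline cycles divided by measured cycles. */
double speedup_factor(unsigned long baseline_cycles, unsigned long cycles)
{
    if (cycles == 0) {
        return 0.0;
    }
    return (double)baseline_cycles / (double)cycles;
}
```

For example, going from 1000 to 750 instructions is a 25 % reduction, and going from 1000 to 500 cycles is a speedup factor of 2.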


Exercise 3 Benchmarking 6/6

- To speed up the multiplication with dot products, we first transpose matrix B (this leads to more efficient access patterns when loading vectors in the multiplication)
- In the second step we can load vectors of 4 chars and use the Dot-product and Sum-of-Dot-product instructions to compute one output pixel
- How many cycles are required to compute one output pixel?
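The two steps can be sketched in portable C as follows. This is a host-side model, not the code from matmul_kernels.c: the function names are hypothetical, matrices are stored row-major in flat arrays, n is assumed to be a multiple of 4, and dotp4 stands in for one dot product instruction:

```c
#include <stdint.h>

/* Stand-in for one 4-byte dot product instruction. */
static int32_t dotp4(const int8_t *x, const int8_t *y)
{
    return x[0] * y[0] + x[1] * y[1] + x[2] * y[2] + x[3] * y[3];
}

/* Step 1: transpose B (n x n) into Bt so that columns of B become rows. */
void transpose(const int8_t *B, int8_t *Bt, int n)
{
    for (int i = 0; i < n; i++) {
        for (int j = 0; j < n; j++) {
            Bt[j * n + i] = B[i * n + j];
        }
    }
}

/* Step 2: C = A * B using contiguous 4-element chunks of A and Bt.
 * Bt is caller-provided scratch space of n*n bytes. */
void matmul_dotp(const int8_t *A, const int8_t *B, int32_t *C,
                 int8_t *Bt, int n)
{
    transpose(B, Bt, n);
    for (int i = 0; i < n; i++) {
        for (int j = 0; j < n; j++) {
            int32_t acc = 0;
            for (int k = 0; k < n; k += 4) {
                /* Row i of A and row j of Bt are both contiguous in memory. */
                acc += dotp4(&A[i * n + k], &Bt[j * n + k]);
            }
            C[i * n + j] = acc;
        }
    }
}
```

In this model, one output element takes n/4 dot product operations plus the loads that feed them, which is a useful starting point for the cycle estimate.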

Exercise 4: Efficient Convolutions (1/4)

- Convolutions are important kernels in image processing
- Convolutions are defined as: [equation]
- Let us consider a 5x5 window to compute the convolution
- For each output pixel we need 25 multiplications and 24 additions, or 1 multiplication and 24 mac operations
- The Dot product instruction can do 4 multiplications and 3 additions in a single cycle
- Hence, 1 Dot Product and 6 Sum of Dot Product instructions are sufficient
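The operation counts above can be checked against a scalar reference. The sketch below computes one output pixel as a weighted sum over a 5x5 window (kernel flipping is omitted, which is a simplifying assumption) and counts how many 4-wide dot product operations cover a given number of taps. The names are hypothetical and unrelated to conv_kernels.c:

```c
#include <stdint.h>

#define KSIZE 5

/* One output pixel: weighted sum over the KSIZE x KSIZE window whose
 * top-left corner is at (row, col). 25 multiplications, 24 additions. */
int32_t conv5x5_pixel_ref(const int8_t *img, int width,
                          const int8_t *kernel, int row, int col)
{
    int32_t acc = 0;
    for (int i = 0; i < KSIZE; i++) {
        for (int j = 0; j < KSIZE; j++) {
            acc += img[(row + i) * width + (col + j)] * kernel[i * KSIZE + j];
        }
    }
    return acc;
}

/* Number of 4-wide dot product operations needed to cover `taps`
 * multiplications (rounded up, unused lanes are padded with zeros). */
int dotp_ops(int taps)
{
    return (taps + 3) / 4;
}
```

For 25 taps this gives 7 operations, matching 1 Dot Product plus 6 Sum of Dot Product instructions.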

Exercise 4: Efficient Convolutions (2/4)

- Look at the code given in (appname = convolution):
    apps/sequential_tests/convolution/conv_kernels.c
- The 5x5 convolution exists in two versions: conv5x5_byte() and conv5x5_scalar()
- Check the difference in execution time
- In order to keep the complexity under control, we will now look at a 3x3 kernel
    - The scalar version conv3x3_scalar() is already functional
    - The vector version conv3x3_byte() needs to be completed

Task:
- Compare the two 5x5 convolution kernels
- Complete the 3x3 convolution kernel (see also next slide)


Exercise 4: Efficient Convolutions (3/4)

The idea of the vector 3x3 convolution is:
1. Load vectors instead of bytes
2. Process one output pixel in each iteration
3. Use Dotp to maximize the throughput

For each vertical column of the image:
- Initialize the vectors V1, V2
- One iteration:
    - Move V1 -> V0
    - Move V2 -> V1
    - Load V2 (fresh data)
    - Compute the convolution with three dot product instructions
    - Move kernel 1 pixel down
- Switch to next vertical column
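The column-wise sliding window can be modelled on the host as shown below. This is a sketch under several assumptions and not the solution for conv3x3_byte(): the v4s type and all function names are hypothetical, the fourth lane of every vector is padded with zero, only the (w-2) x (h-2) fully covered output pixels are produced, and sdotp stands in for one dot product instruction:

```c
#include <stdint.h>

/* Hypothetical 4-byte vector type; lane 3 is unused padding. */
typedef struct { int8_t b[4]; } v4s;

/* Load three neighbouring bytes into a vector, padding lane 3 with 0. */
static v4s load3(const int8_t *p)
{
    v4s v = {{p[0], p[1], p[2], 0}};
    return v;
}

/* Stand-in for a sum-of-dot-product instruction. */
static int32_t sdotp(int32_t acc, v4s x, v4s y)
{
    for (int i = 0; i < 4; i++) {
        acc += x.b[i] * y.b[i];
    }
    return acc;
}

/* 3x3 window sum over a w x h image; out has (w-2)*(h-2) elements. */
void conv3x3_vec_sketch(const int8_t *img, int w, int h,
                        const int8_t *k, int32_t *out)
{
    const v4s K0 = load3(&k[0]), K1 = load3(&k[3]), K2 = load3(&k[6]);
    const int ow = w - 2, oh = h - 2;

    for (int c = 0; c < ow; c++) {          /* one vertical column */
        v4s V0;
        v4s V1 = load3(&img[0 * w + c]);    /* initialize V1, V2 */
        v4s V2 = load3(&img[1 * w + c]);
        for (int r = 0; r < oh; r++) {      /* one iteration per pixel */
            V0 = V1;                        /* shift the window down */
            V1 = V2;
            V2 = load3(&img[(r + 2) * w + c]);  /* fresh data */
            int32_t acc = sdotp(0, V0, K0);
            acc = sdotp(acc, V1, K1);
            acc = sdotp(acc, V2, K2);
            out[r * ow + c] = acc;
        }
    }
}
```

Each iteration needs only one new load because two of the three image rows are reused from the previous output pixel.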

Exercise 4: Efficient Convolutions (4/4)

Tasks:
- What speedup do you expect?
- Complete the table using the performance counters
- How many cycles are required to compute one output pixel?

                              Total instructions        Cycles                 Load operations
                              Total   Reduction [x]     Total   Speedup [x]    Total   Reduction [x]
    5x5: w/o dot product
    5x5: With dot product
    3x3: w/o dot product
    3x3: With dot product

Discuss your results with an assistant.

Exercise 5: Motion detection with fast convolution (1/3)

- In this exercise we will focus again on the motion detection algorithm:
    apps/sequential_tests/motion_detection/motion_detection.c
- The algorithm performs several image processing steps:
    - Dilatation
    - Erosion
    - Convolution
    - Etc.
- The computationally heaviest part is the convolution
    - It uses a 3x3 convolution with a Sobel filter
    - Datatypes are shorts (not bytes!)


Exercise 5: Motion detection with fast convolution (2/3)

Tasks:
- Modify the convolution of exercise 4 in order to work with shorts
- See conv_fast.c

Hints:
- Define 5 vectors V0-V4
- Initialize V1-V4 in the beginning of a new column
- Use the shuffle instruction to combine V3 and V4 into V
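With shorts, a 32-bit vector holds only two elements, so one row of a 3x3 kernel no longer fits into a single dot product. The sketch below shows one possible arrangement for a single row: two 2-lane operations with the unused lane padded with zero. It is not the scheme from conv_fast.c (which uses five vectors and a shuffle), and the type and function names are hypothetical:

```c
#include <stdint.h>

/* Hypothetical vector of two halfwords. */
typedef struct { int16_t h[2]; } v2s;

/* Stand-in for a sum-of-dot-product instruction on two halfwords. */
static int32_t sdotp2(int32_t acc, v2s x, v2s y)
{
    return acc + x.h[0] * y.h[0] + x.h[1] * y.h[1];
}

/* One 3-tap row: pixels px[0..2] weighted by k[0..2]. */
int32_t row3_shorts(const int16_t *px, const int16_t *k)
{
    v2s p_lo = {{px[0], px[1]}};
    v2s p_hi = {{px[2], 0}};      /* second lane is padding */
    v2s k_lo = {{k[0], k[1]}};
    v2s k_hi = {{k[2], 0}};       /* zero weight for the padding lane */

    int32_t acc = sdotp2(0, p_lo, k_lo);
    return sdotp2(acc, p_hi, k_hi);
}
```

In this arrangement half of every second operation is wasted on padding, which is the kind of inefficiency that combining data from two vectors with a shuffle is meant to reduce.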

Exercise 5: Motion detection with fast convolution (3/3)

Tasks:
- Complete the table below (use performance counters to get the instructions/load operations)
- How do you expect your performance to change if you increase the image size?
    - You can include the header img_40_40.h to see the difference
    - Runtime will increase! Make sure debug outputs are deactivated!

                                Total instructions        Cycles                 Load operations
                                Total   Reduction [%]     Total   Speedup [%]    Total   Reduction [%]
    10x10: w/o dot product
    10x10: With dot product
    40x40: w/o dot product
    40x40: With dot product

Exercise 6 RISC-V compressed Instructions (1/2)

- In this exercise we are going to use the new RISC-V core
    - Not all instructions have been ported yet
    - The core supports 32-bit and compressed 16-bit instructions
- Create a build folder for RISC-V:
    $ cd /scratch/soc_xx/build_riscv
- Configure the build folder:
    $ cp ~soc_master/2_openrisc/pulpino/sw/cmake_configure.riscv.gcc.sh .
  In the configure script, set the path to your exercise folder:
    PULP_GIT_DIRECTORY= /home/soc_xx/2_openrisc/pulpino
    $ riscv -g2.2.8./cmake_configure.riscv.gcc.sh
- To switch between compressed and uncompressed instructions, set the RVC flag:
    - Set RVC=1 in cmake_configure.riscv.gcc.sh to enable compressed instructions
    - Source the configure script again
- Compile the RTL:
    $ make vcompile   : compiles Pulpino with the RISC-V core

Exercise 6 RISC-V compressed Instructions (2/2)

Coremark is a core comparison benchmark:
- Independent of frequency
- Coremark/MHz score = 10^6 / (#ticks)
- The higher the better

Tasks:
- Run coremark on RISC-V and compute the score (make coremark.vsimc)
- Run coremark with compressed instructions
- Go to the ARM homepage and compare it to your results

             RISC-V          RISC-V (Compressed)    Cortex M0    Cortex M4
             Score   Size    Score   Size           Score        Score
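The score formula can be wrapped in a small helper. The sketch below generalizes it to a run of several benchmark iterations, on the assumption that #ticks in the formula above refers to the cycles of a single iteration; the function name is hypothetical:

```c
/* Coremark/MHz: iterations per second divided by the clock in MHz.
 * Since the clock frequency cancels out, only the iteration count and
 * the total number of clock ticks are needed.
 * With iterations == 1 this is the formula 10^6 / #ticks. */
double coremark_per_mhz(unsigned long iterations, unsigned long total_ticks)
{
    if (total_ticks == 0) {
        return 0.0;
    }
    return (double)iterations * 1.0e6 / (double)total_ticks;
}
```

For example, a single iteration that takes 500000 cycles corresponds to a score of 2.0 Coremark/MHz.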

Questions & Answers

You have successfully completed the exercise.

You can find sample solutions under (after the exercise):
    ~soc_master/2_openrisc/solutions

If you are interested in a mini-project we can offer you:
- Implement a program on Pulpino (e.g. a game)
    - Use the LCD display of the Zedboard
- Implementation and optimization of a benchmark using the multicore pulp environment
    - See last exercise about the pulp architecture
- RISC-V core architecture development. Analysis of:
    - Mini core
    - VLIW architecture
- We are open to your own ideas!
