Hands-On Workshop: Memory Configuration and Throughput

1 Hands-On Workshop: Memory Configuration and Throughput
FTF-AUT-F0343
Ioseph Martinez, Senior Applications Engineer
External Use

2 Session Introduction
This session reviews the challenges of working with the latest MCUs for automotive instrument cluster and graphics systems.
- Interconnect complexity and throughput requirements have increased for systems that run graphical applications.
- Understanding the memory system configuration is important because it helps you select the right part for your project.
- The requirements of multimedia and graphics projects can be underestimated or overestimated if the memory system of the part is not correctly understood.

3 Session Objectives
After completing this session, you will be able to:
- Calculate bandwidth requirements for different systems
- Differentiate between the types of masters and slaves in a system, and describe how they access or are accessed
- Perform memory bandwidth stress tests to achieve peak bandwidth in Freescale Vybrid controllers

4 Agenda
- Introduction to Vybrid Controllers and Next-generation Cluster Systems
- QuadSPI Memory Theory and Practice
- DDR DRAM Theory and Practice
- Internal SRAM Theory and Practice
- Session Closure

5 Vybrid R Series System

6 Next-generation Cluster System

7 Key Differences
- Next-generation cluster systems have internal flash memory, while Vybrid processors don't.
- Next-generation cluster DDR2 can be 32 bits wide, while the Vybrid processor's is 16 bits wide.
- The Vybrid processor has an L2 cache controller.
- The Vybrid processor's internal SRAM ports are all AXI. In next-generation clusters, some are AXI and others are AHB.
- The Vybrid processor operates the core and some other masters at 400 MHz; the next-generation cluster system operates them at 320 MHz. The system frequency is 133 MHz (Vybrid R Series) and 160 MHz, respectively.

8 About the Masters
- Masters initiate and drive accesses to the slaves.
- The A5 and M4 cores can consume some of the bandwidth, but caches relieve the system of most of that load.
- Most opcodes require more than one cycle to execute. The load is reduced depending on the type of encoding used.
- Masters may operate at different frequencies, depending on whether they are clocked at the system frequency or a multiple of it.
- The latency and peak bandwidth of each master also depend on the slave being accessed.

9 2D-ACE: Display Controller
- Bandwidth per layer = pixel clock * bytes per pixel
- A maximum of 6 layers can be blended in a single pixel
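The per-layer formula above can be tried out with a few lines of Python. This is only a minimal sketch: the function name and the example pixel clocks and colour depths are illustrative choices, and MB/s is taken to mean 10^6 bytes per second.

```python
def layer_bandwidth_mb_s(pixel_clock_mhz: float, bits_per_pixel: int) -> float:
    """Bandwidth one display layer needs, in MB/s.

    Implements: bandwidth per layer = pixel clock * bytes per pixel.
    A pixel clock in MHz times bytes per pixel gives MB/s directly.
    """
    bytes_per_pixel = bits_per_pixel / 8
    return pixel_clock_mhz * bytes_per_pixel


def worst_case_blend_mb_s(pixel_clock_mhz: float, bits_per_pixel: int,
                          layers: int = 6) -> float:
    """Total fetch bandwidth if `layers` layers of equal depth overlap
    on every pixel (6 is the blending limit quoted on the slide)."""
    return layers * layer_bandwidth_mb_s(pixel_clock_mhz, bits_per_pixel)


if __name__ == "__main__":
    # Example inputs only: an 8 bpp and a 16 bpp layer at two pixel clocks.
    print(layer_bandwidth_mb_s(9, 8))
    print(layer_bandwidth_mb_s(32, 16))
    print(worst_case_blend_mb_s(32, 16))
```

The worst-case helper assumes every layer has the same colour depth and covers the full screen, which is the pessimistic bound rather than a typical cluster layout.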

10 Graphics Processing Unit: OpenVG 1.1
- Full fixed-function hardware vector graphics GPU
- Hardware tessellation: minimum CPU involvement
- 16x FSAA: photorealistic quality
- Multiformat rendering
- High-quality vector font rendering
- Standard API: OpenVG 1.1
- Output bandwidth = sysfreq * pixels
  - 200 Mpixels for Vybrid processor
  - 160 Mpixels for next-generation cluster
- Input bandwidth = 4 x output bandwidth
[Figure: GC355 GPU core block diagram. Blocks: AHB and AXI host interface, memory controller, graphics pipeline front end, vector graphics engine, imaging engine, VG pixel engine.]
r0: 23-Sep-13

11 About the Slaves
- Slaves are passive elements accessed by the masters in the system. They stand by until a master accesses them.
- Some slaves are read-only, while others are read/write. Read/write can double the bandwidth.
- Some slaves have higher latency for random accesses than others (e.g. external DRAM and QSPI).
- Some slaves have more than one instance of the same module.

12 Agenda
- Introduction to Vybrid Controllers and Next-generation Cluster Systems
- QuadSPI Memory Theory and Practice
- DDR DRAM Theory and Practice
- Internal SRAM Theory and Practice
- Session Closure

13 QSPI Features
- Dual QuadSPI architecture supports:
  - Two external serial flashes per QuadSPI module
  - Programmable sequence engine compatible with any serial flash
  - Up to 4 chip selects
- QuadSPI can control 2 x 4-bit serial flashes:
  - Individual flash mode
  - Parallel mode, enabling octal flash with data recombination internally in QuadSPI (READING ONLY)
- Flexible receive (Rx) buffering scheme:
  - Sub-buffers allocated to specific masters
  - Master prioritisation
  - Pre-fetch capability
  - Suspend and resume for lower-priority masters
- Up to 100 MHz clock (200 MByte/s peak bandwidth) in next-gen cluster

14 QuadSPI Bandwidth
Serial interface bandwidth (b/w):
- Peak b/w = [66 MHz (sclk) * 4 (quad) * 2 (parallel mode) * 2 (DDR)] / [8 bits/byte] = 132 MByte/sec
- Effective b/w: less than peak b/w, because of the overhead of the flash command. The impact depends on the size of the data transferred.
[Figure: sclk timing from the AXI read request to the first data beat available on AHB. First access phases: Pre, Command, Addr, Mode, Dummy, Data, Post. Subsequent access phases: Pre, Addr, Mode, Dummy, Data, Post. A 64-bit data beat takes 4 sclk cycles. The per-phase cycle counts did not survive extraction.]
Effective bandwidth: Access 18-4 (same command in subsequent access)
- Effective b/w (128 byte access, XIP, 24add): value not recoverable
- Effective b/w (128 byte access, XIP, 32add) = 98.2 MByte/sec
- Effective b/w (256 byte access, XIP, 32add): value not recoverable
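As a rough illustration of the peak and effective figures on this slide, the sketch below reproduces the peak formula and models effective bandwidth as peak scaled by the share of sclk cycles that carry data. The function names are hypothetical, and the overhead of 22 cycles in the example is an assumed value picked because it lands close to the 98.2 MByte/sec figure; the real per-phase cycle counts are not recoverable from the slide.

```python
def qspi_peak_mb_s(sclk_mhz: float, data_lines: int = 4,
                   parallel: bool = True, ddr: bool = True) -> float:
    """Peak serial-interface bandwidth in MB/s.

    Mirrors the slide: sclk * 4 (quad) * 2 (parallel mode) * 2 (DDR) / 8 bits.
    """
    bits_per_sclk = data_lines * (2 if parallel else 1) * (2 if ddr else 1)
    return sclk_mhz * bits_per_sclk / 8


def qspi_effective_mb_s(peak_mb_s: float, data_cycles: int,
                        overhead_cycles: int) -> float:
    """Effective bandwidth when each access spends `overhead_cycles`
    sclk cycles on command/address/mode/dummy phases before the data.
    Simplified model: ignores buffering and prefetch effects."""
    return peak_mb_s * data_cycles / (data_cycles + overhead_cycles)


if __name__ == "__main__":
    peak = qspi_peak_mb_s(66)            # Vybrid case from the slide
    print(peak)
    # 128 bytes at 2 bytes per sclk (64-bit beat in 4 cycles) = 64 data cycles.
    # overhead_cycles=22 is an ASSUMED value, not taken from the slide.
    print(round(qspi_effective_mb_s(peak, data_cycles=64, overhead_cycles=22), 1))
```

The model makes the slide's qualitative point visible: for a fixed command overhead, longer bursts push the effective figure closer to the peak.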

15 Laboratory 1, Part 1: QSPI: Flashing memory
Step 1: Open Lab1.eww by double-clicking on it
Step 2: Build the project (F7)
Step 3: Download the project (Ctrl+D)
Step 4: Run the project in the debugger
Step 5: Wait for the program to erase and program the memories (this may take more than 30 seconds)
Step 6: Some colors will appear on the screen
How does it look?

16 Laboratory 1, Part 1: QSPI: Flashing memory
Step 7: Break the code to debug: menu Debug>Break
Step 8: Go to menu View>Register and select DCU0
Step 9: Select DCU0_DIV_RATIO:DIV_RATIO
Step 10: Lower the pixel clock frequency (by incrementing the divider value) until the image looks correct on the screen
What is the divider value at which the image looks correct?

17 Laboratory 1, Part 2: QSPI: Bits per pixel
Step 1: Stop debugging. Open the image.c file.
Step 2: Comment out the following line: #define PROGRAM_GRAPHICS
Step 3: Select a different image with lower resolution by selecting (3) on the following line: #define IMGNUMBER (3)
Step 4: Rebuild, debug and run again
Step 5: If the image does not look correct, try to find a DCU0 clock divider at which the image looks correct.
What is the value at which the image looks correct?
Now try with: #define IMGNUMBER (1)

18 Screen Pixel Clock & QSPI Throughput
Screen pixel clock at 60 fps:
- WQVGA (480 x 272): ~9 MHz
- WVGA (800 x 480): ~32 MHz
QSPI clock max throughput:
- DDR, 8 bits at 100 MHz: 200 MB/s max in next-gen cluster
- DDR, 8 bits at 66 MHz: 132 MB/s max in Vybrid processor
Per-layer 2D-ACE required throughput:
- 8 bpp at 9 MHz: 9 MB/s max. 22 layers can be blended in next-gen cluster, 14 layers in Vybrid processor
- 16 bpp at 9 MHz: 18 MB/s max. 11 layers can be blended in next-gen cluster, 7 layers in Vybrid processor
- 8 bpp at 32 MHz: 32 MB/s max. 6 layers can be blended in next-gen cluster, 4 layers in Vybrid processor
- 16 bpp at 32 MHz: 64 MB/s max. 3 layers can be blended in next-gen cluster, 2 layers in Vybrid processor
(Theoretical/ideal use cases)
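The layer counts on this slide follow from dividing the peak memory throughput by the per-layer requirement and rounding down. A minimal sketch of that arithmetic, with a hypothetical function name and the slide's ideal-case assumption that the display controller is the only master using the memory:

```python
def max_blendable_layers(peak_mb_s: float, pixel_clock_mhz: float,
                         bits_per_pixel: int) -> int:
    """Number of full-screen layers a memory could feed in the ideal case.

    per-layer need = pixel clock (MHz) * bytes per pixel, in MB/s;
    the layer count is the peak throughput divided by that, rounded down.
    """
    per_layer_mb_s = pixel_clock_mhz * bits_per_pixel / 8
    return int(peak_mb_s // per_layer_mb_s)


if __name__ == "__main__":
    # QSPI peaks quoted on the slide: 200 MB/s (next-gen), 132 MB/s (Vybrid).
    for clock_mhz, bpp in [(9, 8), (9, 16), (32, 8), (32, 16)]:
        print(clock_mhz, bpp,
              max_blendable_layers(200, clock_mhz, bpp),
              max_blendable_layers(132, clock_mhz, bpp))
```

Because this uses peak rather than effective bandwidth, real systems should expect fewer layers than the function returns.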

19 Laboratory 1, Part 3: QSPI: 2D-ACE Blending
Step 1: Start over and open the image.c file
Step 2: Uncomment the following line: #define EXTRALAYER8BPP
Step 3: Rebuild, debug and run again
Does the image look correct?
Step 4: If the image does not look correct, try to find a DCU0 clock divider at which the image looks correct.

20 Laboratory 1, Part 3: QSPI: 2D-ACE Blending
Step 5: Stop debugging and open the image.c file
Step 6: Uncomment the following line: #define QUADREADS
Step 7: Select a different image with higher resolution by selecting (0) on the following line: #define IMGNUMBER (0)
Step 8: Rebuild, debug and run again
Does the image look correct?
Step 9: Stop debugging and uncomment the following line: #define EXTRALAYER16BPP
Step 10: Rebuild, debug and run again

21 Laboratory 1, Part 4: QSPI: Parallel Mode
Step 1: Start over and open the image.c file
Step 2: Uncomment the following line: #define PARALLELREADS
Step 3: Rebuild, debug and run again
Does the image look correct?

22 QuadSPI Memory Map: Serial and Parallel
Region   Start Address   End Address   Size (MB)
QSPI0    0x2000_0000     0x2FFF_FFFF   256
QSPI1    0x5000_0000     0x5FFF_FFFF   256
[Figure: flash address maps for serial mode and parallel mode. Address labels: AMBA_BASE, SFA1AD, SFA2AD, SFB1AD, SFB2AD. Serial mode shows flashes A1, A2, B1 and B2; parallel mode shows the combined regions A1 + B1 and A2 + B2.]
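The region table can be captured in a small lookup, which is handy when checking whether a buffer address used in the labs falls inside a QuadSPI window. This is an illustrative sketch only: the dictionary and function names are made up here, and just the two address ranges come from the table.

```python
# Address windows copied from the memory-map table (inclusive bounds).
QSPI_REGIONS = {
    "QSPI0": (0x2000_0000, 0x2FFF_FFFF),
    "QSPI1": (0x5000_0000, 0x5FFF_FFFF),
}


def qspi_region(address: int):
    """Return the name of the QuadSPI window containing `address`, or None."""
    for name, (start, end) in QSPI_REGIONS.items():
        if start <= address <= end:
            return name
    return None


def region_size_mb(name: str) -> int:
    """Size of a window in MB (here 1 MB = 1024 * 1024 bytes)."""
    start, end = QSPI_REGIONS[name]
    return (end - start + 1) // (1024 * 1024)


if __name__ == "__main__":
    print(qspi_region(0x2000_1000), region_size_mb("QSPI0"))
    print(qspi_region(0x3000_0000))
```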

23 Laboratory 1, Part 4: QSPI: Parallel Mode
Step 4: Start over and open the image.c file
Step 5: Uncomment the following line: #define PROGRAM_GRAPHICS
Step 6: Rebuild, debug and run again
Step 7: Wait until something shows on the screen. This takes a while because the memory is being re-flashed.
Step 8: Close the debug session and comment out again: #define PROGRAM_GRAPHICS
Step 9: Rebuild, debug and run again
Does the image look correct?

24 Agenda
- Introduction to Vybrid Controllers and Next-generation Cluster Systems
- QuadSPI Memory Theory and Practice
- DRAM Theory and Practice
- Internal SRAM Theory and Practice
- Session Closure

25 DRAM Controller
- Next-generation cluster devices and Vybrid processors have different types of DRAM controllers:
  - Next-gen: supports SDR (16-bit) and DDR (16/32-bit)
  - Vybrid: supports LPDDR2 and DDR
- In the next-gen cluster devices, the A5, GPU and 2D-ACE have direct access to the DRAM for more efficient access.
- Different data access methods carry different penalties. The most efficient method is linear access.
- Peak bandwidth is calculated as: Peak BW = Freq * BusWidth * Mode, where Mode = 2 for DDR and Mode = 1 otherwise.
- Effective BW is complex to calculate, but it is acceptable to approximate it with an assumed efficiency level.
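The peak formula and the "assumed efficiency" shortcut can be sketched as below. The clock and bus-width combinations in the example are assumptions chosen because they reproduce the 2560, 1600 and 320 MB/s totals quoted on the following slide; the slide text itself does not state them. The 70% efficiency is likewise a placeholder, not a measured value.

```python
def dram_peak_mb_s(freq_mhz: float, bus_width_bits: int, ddr: bool) -> float:
    """Peak BW = Freq * BusWidth * Mode, with Mode = 2 for DDR, else 1.

    Bus width is converted to bytes so the result is in MB/s.
    """
    mode = 2 if ddr else 1
    return freq_mhz * (bus_width_bits / 8) * mode


def dram_effective_mb_s(peak_mb_s: float, efficiency: float) -> float:
    """Generalised effective bandwidth: peak scaled by an assumed
    efficiency factor between 0 and 1."""
    if not 0.0 <= efficiency <= 1.0:
        raise ValueError("efficiency must be between 0 and 1")
    return peak_mb_s * efficiency


if __name__ == "__main__":
    # ASSUMED configurations (not stated in the source text):
    print(dram_peak_mb_s(320, 32, ddr=True))    # 32-bit DDR at 320 MHz
    print(dram_peak_mb_s(400, 16, ddr=True))    # 16-bit DDR at 400 MHz
    print(dram_peak_mb_s(160, 16, ddr=False))   # 16-bit SDR at 160 MHz
    # Placeholder efficiency of 70% for a mostly linear access pattern.
    print(dram_effective_mb_s(dram_peak_mb_s(400, 16, ddr=True), 0.7))
```

In practice the efficiency factor would come from measurements such as the Laboratory 2 stress tests rather than from a fixed guess.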

26 Screen Pixel Clock & DRAM Throughput
Screen pixel clock at 60 fps:
- WQVGA (480 x 272): ~9 MHz
- WVGA (800 x 480): ~32 MHz
DRAM clock max throughput:
- DDR: 2560 MB/s max in next-gen cluster
- DDR: 1600 MB/s max in Vybrid processor
- SDR: 320 MB/s max in next-gen cluster
Per-layer 2D-ACE required throughput:
- 24 bpp at 9 MHz: 27 MB/s max. 94 layers can be blended in next-gen cluster, 59 layers in Vybrid processor, 11 with SDR memory
- 32 bpp at 9 MHz: 32 MB/s max. 80 layers can be blended in next-gen cluster, 50 layers in Vybrid processor, 10 with SDR memory
- 24 bpp at 32 MHz: 96 MB/s max. 26 layers can be blended in next-gen cluster, 16 layers in Vybrid processor, 3 with SDR memory
- 32 bpp at 32 MHz: 128 MB/s max. 20 layers can be blended in next-gen cluster, 12 layers in Vybrid processor, 2 with SDR memory
(Theoretical/ideal use cases)

27 Laboratory 2: DRAM, Overhead
Step 1: Open Lab2.eww
Step 2: Build the project (F7)
Step 3: Download the project (Ctrl+D)
Step 4: Run the project in the debugger
Step 5: Look at the serial console. What is the time spent on that function?
Step 6: Modify the size of the buffer: set BUFFERSMALLHEIGHT = 4
Step 7: Rebuild, debug and run again
What is the time spent on that function?

28 Laboratory 2: DRAM, GPU Write
Step 1: Start over and open the image.c file
Step 2: Comment out #define TESTOVERHEAD
Step 3: Uncomment #define TESTCLEAR
Step 4: Rebuild, debug and run again
What is the time spent on each of the two operations? Do the numbers make sense? What is the achieved BW?
Actual time = measured time - overhead
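To turn the console timings into an achieved-bandwidth figure, subtract the overhead measured in the previous lab step and divide the bytes moved by the remaining time. The sketch below shows that arithmetic; the buffer size and both timings in the example are hypothetical placeholders, not results from the lab.

```python
def achieved_bandwidth_mb_s(bytes_moved: int, measured_s: float,
                            overhead_s: float) -> float:
    """Achieved bandwidth in MB/s (1 MB = 10**6 bytes).

    Uses: actual time = measured time - overhead.
    """
    actual_s = measured_s - overhead_s
    if actual_s <= 0:
        raise ValueError("overhead must be smaller than the measured time")
    return bytes_moved / actual_s / 1e6


if __name__ == "__main__":
    # HYPOTHETICAL numbers: clearing one 800 x 480 buffer at 16 bpp,
    # measured at 1.2 ms with 0.2 ms of call overhead.
    buffer_bytes = 800 * 480 * 2
    print(achieved_bandwidth_mb_s(buffer_bytes, 1.2e-3, 0.2e-3))
```

For copy or blend operations, count the bytes read as well as the bytes written when filling in `bytes_moved`, otherwise the result understates the traffic seen by the memory.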

29 Laboratory 2: DRAM, GPU Write
Step 5: Start over and open the image.c file
Step 6: Uncomment #define TESTCLEAR
Step 7: Rebuild, debug and run again
What is the achieved BW for the 32bpp operations? How does it compare with the 16bpp operations?

30 Laboratory 2: DRAM, GPU Copy
Step 1: Start over and open the image.c file
Step 2: Uncomment #define TESTCOPY
Step 3: Rebuild, debug and run again
What is the achieved BW for the operations?

31 Laboratory 2: DRAM, GPU Blend
Step 1: Start over and open the image.c file
Step 2: Uncomment #define TESTBLEND
Step 3: Rebuild, debug and run again
What is the achieved BW for the operations?

32 Laboratory 2: DRAM, GPU Rotate
Step 1: Start over and open the image.c file
Step 2: Uncomment #define TESTROTATE
Step 3: Rebuild, debug and run again
What is the achieved BW for the operations?

33 Laboratory 2: DRAM, GPU QSPI
Step 1: Start over and open the image.c file
Step 2: Uncomment #define TESTQSPI
Step 3: Rebuild, debug and run again
What is the achieved BW for the operations?

34 Agenda
- Introduction to Vybrid Controllers and Next-generation Cluster Systems
- QuadSPI Memory Theory and Practice
- DRAM Theory and Practice
- Internal SRAM Theory and Practice
- Session Closure

35 RAM Controller
- Next-generation cluster devices have two types of internal RAM:
  - System RAM: uses an AHB port
  - Graphics RAM: uses an AXI port
- Peak bandwidth = Freq * BusWidth
- Some features of the next-gen internal RAM controller:
  - The 1.3 MByte graphics SRAM block does not natively support ECC
  - FlexECC enables conversion of non-ECC SRAM into ECC SRAM
  - 1.3 MBytes of non-ECC SRAM converts to 1 MByte of ECC SRAM
  - 320 kbytes are sacrificed as a syndrome array: 128 kbytes contain the packed ECC syndromes and 192 kbytes become inaccessible
  - A separate path from the RAM controller to the syndrome array allows parallel fetch of data and syndrome
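The two pieces of arithmetic on this slide, the SRAM peak bandwidth and the FlexECC capacity split, can be checked with a short sketch. Two assumptions are made here and are not stated on the slide: the SRAM port is taken to be 64 bits wide (which reproduces the 1280 and 1064 MB/s totals on the next slide), and 1 MByte is taken as 1024 kbytes so that "1.3 MByte" corresponds to 1344 kbytes.

```python
def sram_peak_mb_s(freq_mhz: float, bus_width_bits: int = 64) -> float:
    """Peak bandwidth = Freq * BusWidth, in MB/s.

    bus_width_bits=64 is an ASSUMPTION inferred from the quoted totals.
    """
    return freq_mhz * bus_width_bits / 8


def flexecc_split_kb(ecc_data_kb: int = 1024, packed_syndrome_kb: int = 128,
                     inaccessible_kb: int = 192) -> dict:
    """Break the graphics SRAM down the way the slide describes it:
    usable ECC-protected data plus a syndrome array, part of which
    holds packed syndromes while the rest becomes inaccessible."""
    syndrome_array_kb = packed_syndrome_kb + inaccessible_kb
    total_kb = ecc_data_kb + syndrome_array_kb
    return {
        "total_kb": total_kb,
        "total_mb": total_kb / 1024,
        "ecc_data_kb": ecc_data_kb,
        "syndrome_array_kb": syndrome_array_kb,
    }


if __name__ == "__main__":
    print(sram_peak_mb_s(160), sram_peak_mb_s(133))
    print(flexecc_split_kb())
```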

36 Screen Pixel Clock & SRAM Throughput
Screen pixel clock at 60 fps:
- WQVGA (480 x 272): ~9 MHz
- WVGA (800 x 480): ~32 MHz
SRAM clock max throughput:
- 160 MHz: 1280 MB/s max in next-gen cluster
- 133 MHz: 1064 MB/s max in Vybrid processor
Per-layer 2D-ACE required throughput:
- 24 bpp at 9 MHz: 27 MB/s max. 47 layers can be blended in next-gen cluster, 39 layers in Vybrid processor
- 32 bpp at 9 MHz: 32 MB/s max. 40 layers can be blended in next-gen cluster, 33 layers in Vybrid processor
- 24 bpp at 32 MHz: 96 MB/s max. 13 layers can be blended in next-gen cluster, 11 layers in Vybrid processor
- 32 bpp at 32 MHz: 128 MB/s max. 10 layers can be blended in next-gen cluster, 8 layers in Vybrid processor
(Theoretical/ideal use cases)

37 Laboratory 3: RAM, GPU Operations
Step 1: Open Lab3.eww
Step 2: Rebuild, debug and run again
Step 3: Compare the results with those for DRAM (Lab 2 vs. Lab 3) in terms of BW
Parameters to be tested and measured:
#define TESTOVERHEAD
#define TESTCLEAR
#define TEST32BPP
#define TESTCOPY
#define TESTBLEND
#define TESTROTATE
#define TESTQSPI

38 Agenda
- Introduction to Vybrid Controllers and Next-generation Cluster Systems
- QuadSPI Memory Theory and Practice
- DDR DRAM Theory and Practice
- Internal SRAM Theory and Practice
- Session Closure

39 Session Summary
- Graphics systems require full awareness of maximum limits, latencies and effective bandwidth for optimal usage.
- Each memory has different limitations and different scenarios in which it is most efficient. The application has to be designed with this in mind.
- Distributing utilization and bandwidth across the different memories for the different masters is an important requirement for graphics systems, because it typically offloads each slave and allows the other masters to perform efficiently.

40 For Further Information

41 Session Closing
By now, you should be able to:
- Describe the general bandwidth requirements of a graphical application based on the system configuration
- Use this knowledge to decide which type of platform best fits your designs
- Avoid the common problem of running out of bandwidth in a graphics application by using the different memories on a Freescale automotive microcontroller

42 Freescale Semiconductor, Inc. External Use
