Engineering Degree Thesis 15 credits. Energy monitoring of the Cortex-M4 core, embedded in the Atmel SAM G55 microcontroller. Zeid Bekli William Ouda


Faculty of Technology and Society
Computer Engineering

Engineering Degree Thesis, 15 credits

Energy monitoring of the Cortex-M4 core, embedded in the Atmel SAM G55 microcontroller

Zeid Bekli William Ouda

Exam: Bachelor of Science in Engineering
Examiner: Olle Lindeberg
Subject Area: Computer Engineering
Supervisor: Tommy Andersson
Date of final seminar:

Abstract

The technology in cellular phones, portable computing systems, and intelligent and connected devices is evolving at a high pace, and in many cases these devices are required to operate in a low-power environment. A problem that continues to emerge is the power consumption of microcontrollers and DSP devices, which has over time become important to solve in order to maximize battery life. To ease the choice of a power-efficient microcontroller, controlled experiments were performed with the Cortex-M4. This microcontroller was chosen because of its upgraded hardware, which has led to an appreciable change in both power and speed efficiency compared to its predecessors. The conclusion presents important points, along with advantages and difficulties to consider when implementing a DSP application. By comparing different optimizations with the Floating Point Unit (FPU), Fixed-point and software Floating-point, the results show that there are major differences in power consumption between these three options. Depending on the option and optimization used, the power consumption can be more than 70% higher than with the other available options.

Acknowledgements

We would like to show our appreciation to our supervisor Tommy Andersson for taking the time to guide and support us during this thesis. We would also like to thank Magnus Krampell for all the support and encouragement during our three years of studies at Malmö University.

Contents

1. Introduction
   1.1 Problem domain
   1.2 Research Questions
   1.3 Limitations
2. Theoretical background
   2.1 Digital Signal Processing
       2.1.1 The FIR filter
       2.1.2 The IIR filter
   2.2 GNU Compiler Collection (GCC)
   2.3 Fixed-point
   2.4 Floating-point
   2.5 Energy monitoring in MCU
   2.6 DSP device
   2.7 The Cortex-M4
       2.7.1 The SAM G55 DSC
   2.8 Atmel power debugger
   2.9 Interrupt latency
3. Related work
   3.1 Martin Trevor - The designers guide to the cortex-m processor family
   3.2 Li Tan, Jean Jiang - Digital Signal Processing
   3.3 Joseph Yiu - The Definitive Guide to ARM Cortex-M3 and Cortex-M4 Processors
   3.4 Savita Rani - Area and Speed Efficient Floating-Point Unit
   3.5 Alexandre Aminot, et al. - Floating Point Units Efficiency in Multi-Core Processor
4. Method
   Literature study
   Research problem
   Controlled experiments
   SAMG55 with FIR- and IIR- filter
   In-system debugging (DWT)
   Power measuring system
5. Results and Analysis
   Algorithms and filter design
   Energy monitoring
   Enabling the FPU
   Results of the controlled experiments
   The power consumption when using optimization -O0 and -O1 with Software Floating-Point, Fixed-Point and FPU
   The power consumption in FIR filter
   The power consumption in IIR filter
   Comprehensive analysis
6. Discussion
   Method discussion
   Discussion of the data measurement
   The power consumption when executing the FIR filter
   The power consumption when executing the IIR filter
7. Conclusion
   Answering the research questions
   How does power consumption vary when using the same algorithms with and without hardware floating point unit? (Floating-point operation)
   How does power consumption vary when using an algorithm with the same functionality as RQ1? (Fixed-point operation)
   How is the dependency between speed and power consumption?
   Further work
   Contribution of this thesis

Acronyms

ADC          Analog to Digital Converter
ADP          Atmel Data Protocol
CMSIS-DAP    Cortex Microcontroller Software Interface Standard-Debug Access Port
CMSIS_DSP    Cortex Microcontroller Software Interface Standard-Digital Signal Processing
DMIPS        Dhrystone Million Instructions Per Second
DP           Double Precision
DSC          Digital Signal Controllers
DSP          Digital Signal Processing
DSP device   Digital Signal Processor
DWT          Data Watchpoint and trace
EDBG         Atmel Embedded Debugger
FFT          Fast Fourier Transform
FIR          Finite Impulse Response
FPO          Floating Point Operations
FPU          Floating Point Unit
HP           Half Precision
GCC          GNU Compiler Collection
IIR          Infinite Impulse Response
IoT          Internet of Things
JTAG         Joint Test Action Group
MAC          Multiplier Accumulator unit
MCU          Microcontroller Unit
MSB          Most Significant Bit
Opt          Optimization
SIMD         Single Instruction Multiple Data
SP           Single Precision
SWD          Serial Wire Debugger

1 Introduction

Signal processing is used in almost every technology that we rely on today, such as cellphones, computers, smart watches and automotive control systems, to name a few [1]. Signal processing is at the heart of our modern world, powering today's entertainment and tomorrow's technology [1]. One part of signal processing is Digital Signal Processing (DSP), which refers to a set of algorithms that are used to process digital signals. Some of these algorithms improve the signal by using techniques such as Finite Impulse Response (FIR) and Infinite Impulse Response (IIR) filtering. Another widely used algorithm is the Fast Fourier Transform (FFT) [2].

A DSP device¹ is a microprocessor specialized in processing real-time signals and DSP algorithms. The advantage of a DSP device compared to a general-purpose processor is that it processes the same algorithms more efficiently, because its main task is signal processing [3][4]. One of the most common issues is that the increased complexity of DSP blocks has led to an ever-increasing power consumption challenge. The DSP blocks consist of a network of adders and multipliers, and it is typically these networks that have a major influence on the power consumption of a DSP device [5][6].

There are processors that are manufactured with DSP architecture layers that help the general-purpose processor perform DSP algorithms more efficiently, and these can be characterized as Digital Signal Controllers (DSC). This means that there might not be any reason to buy a separate DSP device and a general-purpose processor to get the desired results. One could save significantly on the cost of the products being constructed by replacing two processors with one high-performance processor with a DSP extension [7]. One of these processors is the Cortex-M4, a member of the ARM Cortex-M processor family.

The Cortex-M4 is a powerful member of the Cortex-M family and is used worldwide in a range of digital signal control embedded market segments. Robotics is one of these segments and has a critical role in healthcare, for example in precision surgery and assisted living. Other important segments are automotive control systems, smartwatches and medical instruments [8].

¹ Both Digital signal processing and Digital signal processor use the acronym DSP. Therefore, from here onwards we will refer to a Digital signal processor as a DSP device.

1.1 Problem domain

The technology in cellular phones, portable computing systems, and intelligent and connected devices is evolving at a high pace, and in many cases these devices are required to operate in a low-power environment [7]. A problem that continues to emerge is the power consumption of microprocessors and DSP devices. This issue has over time become important to solve in order to maximize battery life [7][9].

For mobile devices, where achieving higher performance and better power efficiency is crucial to success, one option to consider is the Cortex-A72 processor. It uses the power-optimized ARM big.LITTLE™ processing technology, which combines high-performance ARM CPU cores with more power-efficient ARM CPU cores to give good performance at a significantly lower average power [10].

The Cortex-M4 processor differs from the Cortex-A72 processor in that it is built on one high-performance core [11], which means that it cannot use the big.LITTLE processing technology. Yet the Cortex-M4 processor is used in areas such as robotics and healthcare. This is because of the upgraded hardware, which has led to an appreciable change in both power and speed efficiency compared to its predecessors. New technologies and hardware accelerators were implemented in the Cortex-M4 CPU, such as single-cycle multiply, hardware division, bit-field instructions and the added DSP functions, and these have been an important factor in making the Cortex-M4 a high-performance processor [12]. How does power consumption relate to these new hardware technologies, and are there any significant changes in power consumption when using DSP algorithms?

1.2 Research Questions

The aim of this research is to investigate the power consumption of a Cortex-M4 DSC, and to compare the execution of DSP algorithms with the hardware Floating Point Unit (FPU) featured in the Cortex-M4 DSC enabled and disabled. When implementing DSP algorithms such as FIR and IIR filters, will there be any noticeable tradeoff between speed and power consumption? This will give better insight into the advantages and disadvantages of the Cortex-M4 DSC.

Main question

How does power efficiency vary in the Cortex-M4 DSC when the FPU is enabled compared to when it is disabled, and is speed related in any way?

Sub questions

RQ1: How does power consumption vary when using the same algorithms with and without hardware floating point unit? (Floating-point operation)
RQ2: How does power consumption vary when using an algorithm with the same functionality as RQ1? (Fixed-point operation)
RQ3: How is the dependency between speed and power consumption?

1.3 Limitations

The aim of this thesis is to measure the power consumption of the Cortex-M4 DSC along with the embedded FPU, which can be enabled optionally. The research done on the sub questions in section 1.2 will be based on the points below.

- DSP execution computed with FPU, and GNU compiler optimization -O0 and -O1
- DSP execution computed with software Floating-point, and GNU compiler optimization -O0 and -O1
- DSP execution computed with Fixed-point, and GNU compiler optimization -O0 and -O1

The device used in this thesis is the SAM G55 with the Cortex-M4 core. The wake-up time will not be included in the measurements, because it differs between the MCUs that use the Cortex-M4 core. The power consumption of the ADC and DAC will also not be included in the measurements, because the main focus is on the power consumption of the Cortex-M4 DSC when executing the DSP algorithms. Optimization levels -O2 and -O3 make debugging harder and give incorrect DWT cycle values, and will therefore not be used.

2. Theoretical background

The aim of this chapter is to review the areas that are important to understand in order to follow the chapters ahead in this thesis. Each subsection below should give a sufficient understanding of each term.

2.1 Digital Signal Processing

Signals are patterns of variations that represent information. There are all kinds of signals, such as speech signals, audio signals, video or image signals and radar signals, to name a few [4][13]. Many signals originate as continuous-time signals, and speech signals are one of these. It can sometimes be desirable to obtain a discrete-time representation of the signal, and one way to do this is by sampling the signal at equally spaced points in time. The result is a discrete-time representation of the signal that can be processed digitally [16].

2.1.1 The FIR filter

In FIR filters each value in the output sequence is a weighted sum of a finite number of samples of the input sequence, which is basically a feed-forward difference equation. A general FIR filter is specified by the following equation [4][13].

$$y[n] = \sum_{k=0}^{M} b_k \, x[n-k] \qquad \text{(Eq. 1)}$$

The output signal (y[n]) depends on the input signal (x[n]), the filter order (M) and the values of the impulse response (b_k). It can be illustrated with a block diagram. A third-order FIR filter can be seen in figure 1 below, and its equation is

$$y[n] = b_0 \, x[n] + b_1 \, x[n-1] + b_2 \, x[n-2] + b_3 \, x[n-3] \qquad \text{(Eq. 2)}$$

Figure 1. Third-order FIR filter

In a third-order FIR filter (figure 1) the input signal passes through three signal delays (unit delays). The input and its delayed versions are multiplied with the filter coefficients (b_0, b_1, b_2, b_3), and the products are then added to generate the output (y[n]) [13].

2.1.2 The IIR filter

IIR filters are feedback systems, in the sense that the output value of the system is reused. The difference between an FIR filter and an IIR filter is the latter's ability to combine previous output values with the input signal to compute a new output. The general IIR difference equation is [4][13]

$$y[n] = \sum_{i=1}^{N} a_i \, y[n-i] + \sum_{k=0}^{M} b_k \, x[n-k] \qquad \text{(Eq. 3)}$$

The terms of the equation are the feedback coefficients (a_i), the feedback filter order (N), the feedforward coefficients (b_k), the feedforward filter order (M), the input signal (x[n]) and the output signal (y[n]). A closer look at the equation shows that if all coefficients a_i were zero, the equation would reduce to that of an FIR filter (Eq. 1) [4][13].

A block diagram of a first-order IIR filter with its corresponding equation can be seen in figure 2 and equation 4.

$$y[n] = a_1 \, y[n-1] + b_0 \, x[n] + b_1 \, x[n-1] \qquad \text{(Eq. 4)}$$
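The difference equations Eq. 1 and Eq. 3 can be sketched directly in code. The following is a minimal illustration in Python, not the implementation used in this thesis: the function names `fir_filter` and `iir_filter`, the coefficient values and the impulse test signal are assumptions chosen for the example, and samples before n = 0 are assumed to be zero.

```python
def fir_filter(b, x):
    """Eq. 1: y[n] = sum_{k=0}^{M} b[k] * x[n-k], assuming x[n] = 0 for n < 0."""
    y = []
    for n in range(len(x)):
        acc = 0.0
        for k in range(len(b)):
            if n - k >= 0:
                acc += b[k] * x[n - k]  # one multiply-accumulate per coefficient
        y.append(acc)
    return y


def iir_filter(a, b, x):
    """Eq. 3: y[n] = sum_{i=1}^{N} a_i * y[n-i] + sum_{k=0}^{M} b_k * x[n-k].

    a[0] holds a_1, a[1] holds a_2, and so on (hypothetical storage layout).
    """
    y = []
    for n in range(len(x)):
        acc = 0.0
        for k in range(len(b)):          # feedforward part, identical to the FIR sum
            if n - k >= 0:
                acc += b[k] * x[n - k]
        for i in range(1, len(a) + 1):   # feedback part, reuses earlier outputs
            if n - i >= 0:
                acc += a[i - 1] * y[n - i]
        y.append(acc)
    return y


# Hypothetical example data: a unit impulse and simple coefficients.
impulse = [1.0, 0.0, 0.0, 0.0, 0.0, 0.0]
b_example = [0.25, 0.25, 0.25, 0.25]          # third-order FIR (Eq. 2)
fir_out = fir_filter(b_example, impulse)      # response ends after M + 1 samples
iir_out = iir_filter([0.5], [1.0], impulse)   # response decays but never ends
```

With an impulse as input, the FIR output is simply the coefficient sequence followed by zeros, while the IIR output keeps halving, which illustrates the "finite" and "infinite" in the filter names. Calling `iir_filter` with an empty feedback list gives the same result as `fir_filter`, matching the remark that Eq. 3 reduces to Eq. 1 when the feedback coefficients are zero.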


Figure 2A. First-order IIR filter in Direct Form I

Figure 2B. First-order IIR filter in Direct Form II

The IIR filter in figure 2A is in Direct Form I; there is also a Direct Form II, which can be seen in figure 2B. The difference between these two forms is that the unit delays in Direct Form II can be combined, because the signal fed to the unit delays in figure 2B is the same [4][13]. IIR filters become unstable if any pole lies outside the unit circle [4].
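The relationship between the two forms can be checked numerically. The sketch below implements the first-order filter of Eq. 4 in both forms, using the sign convention of Eq. 4; the function names, coefficient values and test signal are hypothetical and only serve the illustration.

```python
def iir1_direct_form_1(a1, b0, b1, x):
    """First-order IIR, Direct Form I: separate delays for input and output."""
    x_prev = 0.0  # unit delay on the input side
    y_prev = 0.0  # unit delay on the output side
    y = []
    for xn in x:
        yn = a1 * y_prev + b0 * xn + b1 * x_prev  # Eq. 4
        x_prev, y_prev = xn, yn
        y.append(yn)
    return y


def iir1_direct_form_2(a1, b0, b1, x):
    """First-order IIR, Direct Form II: one shared unit delay."""
    w_prev = 0.0  # the single combined unit delay
    y = []
    for xn in x:
        wn = xn + a1 * w_prev          # feedback section first
        yn = b0 * wn + b1 * w_prev     # feedforward section second
        w_prev = wn
        y.append(yn)
    return y


# Hypothetical test signal and coefficients.
signal = [1.0, 0.5, -0.25, 0.0, 0.75, -1.0, 0.0, 0.0]
df1 = iir1_direct_form_1(0.5, 1.0, 0.5, signal)
df2 = iir1_direct_form_2(0.5, 1.0, 0.5, signal)

# With the sign convention of Eq. 4 the pole of this filter is at z = a1,
# so |a1| > 1 places it outside the unit circle and the output grows.
impulse = [1.0] + [0.0] * 49
unstable = iir1_direct_form_1(1.1, 1.0, 0.0, impulse)
```

Both forms produce the same output for the same coefficients, but Direct Form II keeps one state variable instead of two, which is the memory saving that the combined unit delay in figure 2B represents.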

2.2 GNU Compiler Collection (GCC)

GCC is one of the most used compilers today. It is free software, and volunteers can contribute to improving its functionality. Basically, GCC is an optimizing compiler from the GNU project. The GNU project also provides object file tools such as the assembler and linker [14].

There are five options for code optimization with GCC [15][16].

-Os - Optimizes for space usage (size) rather than speed.
-O0 - Optimization is disabled, which makes debugging easier. The code will be slower than with the three options below.
-O1 - This optimization is also suitable for debugging. It improves both speed and space usage (size).
-O2 - Full optimization, but skips any optimization that can lead to an increase in space usage (size).
-O3 - Does the same as -O2, but also applies optimizations that increase space usage (size) in exchange for speed.

Optimization levels -O2 and -O3 give the fastest code, but make debugging harder [16].

2.3 Fixed-point

Fixed-point arithmetic means that there is a fixed number of digits before and after the decimal point. This implies that the resolution depends on the number of bits. For example, if 8 bits are used for the fraction, then the resolution will be 2^-8 [4][17][18].

The maximum number in Fixed-point arithmetic is limited by the number of bits available. With 32 bits, for example, it is possible to use anything from 0 up to 32 bits as fraction bits, and the scaling is predetermined by the user. Fixed-point arithmetic is commonly used when an FPU is not available in the hardware [2][17].

There are four ways to store and represent integers (converting decimal numbers into binary): Unsigned integer, Offset binary, Sign and Magnitude, and Two's complement [2][17].

Unsigned integer is quite straightforward compared to the other three formats. It can go from zero up to the maximum positive number, depending on the number of bits. One noticeable disadvantage of unsigned integers is that there is no representation of negative numbers.

Offset binary format works in a similar way to the unsigned integer format. The difference lies in the shifted offset, which allows both positive and negative numbers to be represented.

Sign and magnitude format is another way to represent negative and positive integers. The most significant bit (MSB) is zero for positive numbers and one for negative numbers, and is called the Sign bit. The following bits function as a standard binary format. This in turn means that there are two ways to represent zero, which is a waste of a bit pattern [2][17].

Two's complement format is the one most used by engineers, because it is less complex to implement in hardware than the other three formats [2][17]. This is illustrated in the table below.

[Table 1. Illustration of Two's complement format (columns: Bits, Decimal)]

To represent fraction numbers with sign and magnitude format, it is possible to trick the CPU into thinking that it is dealing with integers. As seen in figure 3, the Sign (S) bit is in the 16th position, while the decimal point is put between the 5th and 6th bit [17][18].

Figure 3. Fraction representation with sign and magnitude format

The equation below is used to calculate the intended value for figure 3 [17].

$$S \in \{+1, -1\}, \qquad \text{Sum} = S \cdot \left(\text{integer}_{(\text{decimal})} + \text{fraction}_{(\text{decimal})}\right) \cdot 2^{-5} \qquad \text{(Eq. 5)}$$
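The fixed-point ideas in this section can be illustrated with a short sketch. It assumes a signed format with 8 fraction bits, matching the 2^-8 resolution example above; the helper names `to_fixed`, `to_float`, `fixed_mul` and `twos_complement` are hypothetical and are not taken from any library used in this thesis.

```python
FRAC_BITS = 8  # assumed number of fraction bits, giving a resolution of 2^-8


def to_fixed(value, frac_bits=FRAC_BITS):
    """Scale a real number to an integer with `frac_bits` fraction bits."""
    return int(round(value * (1 << frac_bits)))


def to_float(fixed, frac_bits=FRAC_BITS):
    """Convert the scaled integer back to a real number."""
    return fixed / (1 << frac_bits)


def fixed_mul(a, b, frac_bits=FRAC_BITS):
    """Multiply two fixed-point numbers; the product carries 2 * frac_bits
    fraction bits, so it is shifted back down by frac_bits."""
    return (a * b) >> frac_bits


def twos_complement(value, bits):
    """Bit pattern (as an unsigned integer) of `value` in two's complement."""
    return value & ((1 << bits) - 1)


resolution = to_float(1)                                     # smallest step
product = to_float(fixed_mul(to_fixed(1.5), to_fixed(-0.25)))
quantized = to_float(to_fixed(0.1))                          # 0.1 is not exact
pattern = format(twos_complement(-3, 4), "04b")              # 4-bit pattern of -3
```

The multiplication uses only integer operations, which is why fixed-point arithmetic runs on hardware without an FPU. The price is quantization: a value such as 0.1 has to be rounded to the nearest multiple of the resolution, which matters for filter coefficients.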

2.4 Floating-point

In Floating-point representation the position of the decimal point varies ("floats") with the value, unlike the Fixed-point representation where the decimal point always stays in the same place. This makes Floating-point representations more dynamic and efficient. There are three precisions used with Floating-point: Half Precision (HP) uses 16 bits, Single Precision (SP) is the more common one and uses 32 bits, and Double Precision (DP) uses 64 bits [4][19].

The IEEE standard representation for Floating-point is divided into three parts [4][19].

- The sign bit
- The exponent
- The mantissa

Precision     Sign bit   Exponent   Mantissa
HP floating   1 bit      5 bits     10 bits
SP floating   1 bit      8 bits     23 bits
DP floating   1 bit      11 bits    52 bits

Table 2 - Basic floating formats

The sign bit decides the polarity of the value, where 0 represents positive numbers and 1 represents negative numbers [4][19]. The mantissa is the fraction part (the part after the separator). As an example, the number 7 can be represented as 1.75 × 4 = (1 + 1/2 + 1/4) × 2^2, where the mantissa is the fraction part (0.75) and the stored exponent is 2 + bias [4][19].

For SP floating point the exponent of a normal number is in the range of 1 up to 254. This is best explained by studying the mathematical equation of the SP value, seen in Eq. 6.

$$\text{value} = (-1)^{\text{sign}} \times 2^{(\text{exponent} - 127)} \times \left(1 + \left(\tfrac{1}{2}\,\text{Fraction}[22]\right) + \left(\tfrac{1}{4}\,\text{Fraction}[21]\right) + \dots + \left(\tfrac{1}{2^{23}}\,\text{Fraction}[0]\right)\right) \qquad \text{(Eq. 6)}$$

The exponent is stored in the Offset binary format (explained in section 2.3), that is, shifted by a bias. The bias shown in the equation above is 127 [4][19].
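The SP layout of Table 2 and Eq. 6 can be verified with a small sketch that splits a 32-bit value into its three fields and rebuilds the number from them. The helper names `decode_single` and `rebuild_single` are hypothetical, and the rebuild step assumes a normal number (stored exponent between 1 and 254), as in Eq. 6.

```python
import struct


def decode_single(value):
    """Split an IEEE 754 single-precision number into (sign, exponent, fraction)."""
    bits = struct.unpack(">I", struct.pack(">f", value))[0]  # raw 32-bit pattern
    sign = bits >> 31                # 1 bit
    exponent = (bits >> 23) & 0xFF   # 8 bits, stored with a bias of 127
    fraction = bits & 0x7FFFFF       # 23 bits
    return sign, exponent, fraction


def rebuild_single(sign, exponent, fraction):
    """Eq. 6 for normal numbers: (-1)^sign * 2^(exponent - 127) * (1 + fraction / 2^23)."""
    mantissa = 1.0 + fraction / (1 << 23)
    return (-1) ** sign * 2.0 ** (exponent - 127) * mantissa


# The example from the text: 7 = 1.75 * 2^2.
sign, exponent, fraction = decode_single(7.0)
fraction_value = fraction / (1 << 23)   # fraction part of the mantissa
rebuilt = rebuild_single(sign, exponent, fraction)
```

For the value 7 the stored exponent is 2 + 127 = 129 and the fraction field evaluates to 0.75, in line with the worked example above.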

An FPU is a hardware unit that can be added to processors to perform Floating-point arithmetic operations in fewer cycles than software Floating-point. Most FPUs support the IEEE standard [4][19].

2.5 Energy monitoring in MCU

CMOS technology is used in MCUs, and there are two kinds of power dissipation, static and dynamic. Static dissipation is the power leakage that occurs during steady state, while dynamic dissipation occurs when switching states [20].

There are different ways to monitor energy efficiency and power characteristics. One way to measure energy efficiency is to look at the work that is done with a limited amount of energy. The measurement unit can then be Dhrystone Million Instructions Per Second (DMIPS)/μW or CoreMark (benchmark score)/μW. Both are based on benchmarks for embedded systems. Power measurement is on the other hand based on three factors: the active current, which is measured in μA/MHz, the sleep mode current, which is measured in μA since the clock should be stopped, and energy efficiency. For energy efficiency it is the execution time that is taken into consideration: if the MCU has a long execution time, then the overall power consumption will suffer [19].

The Cortex-M3 and Cortex-M4 have a number of power features such as sleep mode, wait mode and backup mode. The Cortex-M4 should be able to run at under 200 μA/MHz, while some other Cortex-M processors are able to run at under 100 μA/MHz [19].

2.6 DSP device

A DSP device is a processor that is specialized in DSP algorithms, which leads to fast arithmetic calculations [3][21]. The Harvard structure or the improved Harvard structure is generally used in a DSP device, which means that data and instructions/program are kept in separate memories. There are at least four buses in the DSP device: the program data bus, the program address bus, the data bus and the data address bus. This separation allows faster and independent access during a cycle [3].

A DSP device usually possesses several processing units, whose main purpose is to enhance the speed of the device [3]. One of these units is the FPU. Approximately one third of the DSP devices on the market have an FPU, which is attributed to the high cost of the hardware, and over one half of the users without an FPU are planning to change [2][21]. The pipelines are structured differently than in general-purpose processors, which allows the DSP device to execute multiple instructions simultaneously [3].
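The role of execution time in energy efficiency, described in section 2.5, can be made concrete with a small calculation. The sketch below is only an illustration under stated assumptions: the 3.3 V supply voltage, the cycle counts and the 10% higher current of the second configuration are hypothetical values and not measurements from this thesis. The 200 μA/MHz figure is the upper bound quoted above, and 120 MHz is the clock frequency listed for the SAM G55.

```python
def active_current_ma(ua_per_mhz, clock_mhz):
    """Active current in mA from a uA/MHz figure and the clock frequency."""
    return ua_per_mhz * clock_mhz / 1000.0


def energy_uj(current_ma, supply_v, cycles, clock_mhz):
    """Energy in microjoules for executing `cycles` clock cycles."""
    power_mw = current_ma * supply_v     # P = I * V
    time_us = cycles / clock_mhz         # t = cycles / f
    return power_mw * time_us / 1000.0   # mW * us = nJ, divided by 1000 gives uJ


SUPPLY_V = 3.3      # assumed supply voltage (hypothetical)
CLOCK_MHZ = 120.0   # clock frequency listed for the SAM G55

current = active_current_ma(200.0, CLOCK_MHZ)  # upper bound from the text

# Two hypothetical configurations of the same task: B draws 10% more current
# but needs ten times fewer cycles than A.
energy_a = energy_uj(current, SUPPLY_V, cycles=1_200_000, clock_mhz=CLOCK_MHZ)
energy_b = energy_uj(current * 1.1, SUPPLY_V, cycles=120_000, clock_mhz=CLOCK_MHZ)
```

In this made-up case the faster configuration uses far less energy despite the higher current, because energy is the product of power and time. This is why active current alone is not a sufficient measure.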


2.7 The Cortex-M4

The Cortex-M microcontrollers were presented in 2004. The Cortex-M4 and Cortex-M7 processors support DSP instructions. The Cortex-M4 can be used in demanding areas where memory protection and Floating-point for SP and HP calculations are mandatory. The key features of the Cortex-M4 are DSP, SIMD (Single Instruction Multiple Data), a MAC (Multiply-Accumulate) unit, debug, Harvard architecture, 32-bit performance and an optional FPU [22].

2.7.1 The SAM G55 DSC

The SAM G55 is a microcontroller based on the Cortex-M4 core and is intended for low-power applications [8]. The SAM G55 DSC is a development board with this controller.

Figure 4A. SAM G55 features

Figure 4B. The SAM G55 DSC

The key features of the SAM G55 DSC are the Atmel Embedded Debugger (EDBG), the Atmel Data Protocol (ADP), a current measurement header, a 120 MHz clock, an Analog to Digital Converter (ADC) module, Serial Wire Debugger (SWD) and Joint Test Action Group (JTAG) interfaces, and Data Watchpoint and Trace (DWT) [23][24].

The EDBG is intended for onboard debugging, and one of its functions is to stream data from the MCU to the host PC. The EDBG makes use of the ADP when streaming the data. The DWT is a debugging unit that provides data tracing and counters for the processor. SWD is an alternative to the JTAG interface for debugging [23][24].


2.8 Atmel power debugger

The Atmel power debugger (figure 5) is a development tool intended for debugging and programming ARM Cortex-M based Atmel SAM and Atmel AVR microcontrollers. The controllers need to have a JTAG or SWD interface [25]. JTAG, also referred to as boundary-scan, is defined by IEEE as a method for testing functionality on circuit boards [26], while the SWD interface is a subset of the JTAG interface. The SWD interface makes use of the TCK and TMS pins for connection, and these two pins can also be found on the JTAG 10-pin connector [27].

The power debugger has two separate channels for measuring current and is ARM CMSIS-DAP (Cortex Microcontroller Software Interface Standard-Debug Access Port) compatible, which means it will work with Atmel Studio 7.0 or later [25]. CMSIS-DAP is an interface that provides access for debugging [28]. A key benefit of the debugger is that it streams measurements and data to the Atmel Data Visualizer for real-time analysis [25].

Figure 5. The Power Debugger

Channel A in the power debugger provides high-accuracy measurements of low currents in the range of 500 μA to 100 mA. The resolution is around 3 μA and the accuracy is no worse than 3% [25].

2.9 Interrupt latency

Interrupt latency is the number of clock cycles a processor requires to react to an interrupt signal, on entry and on exit. The interrupt latency is around twelve cycles on entry and ten cycles on exit. If the FPU is enabled, an increase of seventeen cycles is possible on entry and on exit [29].
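To relate the cycle counts in section 2.9 to time, they can be converted using the core clock. The sketch below assumes that the core runs at the 120 MHz listed for the SAM G55 and reads "an increase of seventeen cycles" as up to seventeen additional cycles on both entry and exit. Both are assumptions made for this illustration, and the helper name `cycles_to_ns` is hypothetical.

```python
CLOCK_HZ = 120_000_000  # assumed core clock, 120 MHz


def cycles_to_ns(cycles, clock_hz=CLOCK_HZ):
    """Convert a number of clock cycles to nanoseconds."""
    return cycles / clock_hz * 1e9


ENTRY_CYCLES = 12       # interrupt entry, from the text
EXIT_CYCLES = 10        # interrupt exit, from the text
FPU_EXTRA_CYCLES = 17   # possible increase with the FPU enabled (assumed upper bound)

entry_ns = cycles_to_ns(ENTRY_CYCLES)
exit_ns = cycles_to_ns(EXIT_CYCLES)
entry_fpu_ns = cycles_to_ns(ENTRY_CYCLES + FPU_EXTRA_CYCLES)
exit_fpu_ns = cycles_to_ns(EXIT_CYCLES + FPU_EXTRA_CYCLES)
```

Under these assumptions the entry latency is about 100 ns without the extra FPU cycles and about 240 ns with them, which is small compared to the execution time of a filter but adds up for high interrupt rates.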

3. Related work

This section presents relevant information from previous work that is closely related to this thesis. The subsections below point out important features of the Cortex-M4 and the mathematics of DSP, and also cover the speed-efficient Floating-point unit.

3.1 Martin Trevor - The designers guide to the cortex-m processor family - Chapter 8

The main focus in Trevor's [12] book is to understand the DSP functions that are embedded within the Cortex-M4 and the Cortex-M7. The combination of these functions with a traditional MCU can be referred to as a DSC. Martin Trevor explains the key features that are added to the M4 and M7 to support DSP usage. The enhancements are SIMD instructions, an FPU and an improved MAC unit compared to the M3. Trevor then uses the ARM CMSIS-DSP (Cortex Microcontroller Software Interface Standard - Digital Signal Processing) software library to show how to access the functions that are added in the M4 and M7.

Through experiments he explains the difference between the FPU and software Floating-point, and he also explains how to enable and disable the FPU. The SIMD instructions are explained as well, with code examples that show how efficient SIMD is with DSP algorithms, and with exercises on how to optimize DSP algorithms, presented in chronological order. Further, the CMSIS-DSP software library is explained in more detail. A part of this is about the conversion functions and their ability to convert between Floating-point and Fixed-point.

The most relevant points that are brought up concern SIMD, the FPU and the MAC unit embedded in the Cortex-M4. All these functions are relevant to this thesis, since they will be encountered when solving the research questions in section 1.2. Studying this book has given a better understanding of what a DSC represents, but also of how speed and power efficiency play a significant role in modern processors such as the M4.

3.2 Li Tan, Jean Jiang - Digital Signal Processing - chapter 7, chapter 8, and chapter 9

Digital Signal Processing offers electrical engineers and computer engineers an introduction to the use of mathematics in the subject of DSP. Tan, et al. [4] take advantage of the availability of powerful computers and software environments such as MATLAB to perform extensive computation and create laboratories. This in turn gives engineering students a broader perspective on the effects that can be gained from filtering signals.

In chapter 7 Tan, et al. illustrate the concept of FIR filters with figures, mathematical equations and block diagrams. The intention of these basic illustrations is to give engineers an understanding of how FIR filters can be implemented in projects or laboratories.

Chapter 8 in Digital Signal Processing is much like chapter 7. It is about IIR filters and how they can be implemented in projects and laboratories, and it is explained through block diagrams, mathematical representations and figures. To keep it simple, Tan, et al. explain a simple first-order IIR filter and a second-order filter, and to sum up all the points presented in the subsections, they present a few examples of IIR filters.

In chapter 9 Tan, et al. bring up hardware and software for DSP devices. They explain the architectural differences that exist between a DSP device and a traditional MCU, such as the Harvard and Von Neumann architectures, followed by the hardware units that exist in most common DSP devices, such as the MAC unit. They show how important a MAC unit is with a visual representation of how the execution of the MAC function works. Fixed-point and Floating-point are both covered in much detail in this book. Tan, et al. bring up the differences between the two and how they are implemented in DSP devices.

The FPU and MATLAB are both essential to this thesis. To solve RQ1 and RQ2, MATLAB is required for generating filter coefficients and a signal with noise, and following this guide has made it less complex to understand the workflow of MATLAB. The examples given by Tan, et al. on FIR and IIR filters are implemented with both Fixed-point and Floating-point, which is an important part to understand for this thesis.

3.3 Joseph Yiu - The Definitive Guide to ARM Cortex-M3 and Cortex-M4 Processors - chapter 9, 13, 21 and 22

Joseph Yiu [19] sheds light on the Cortex-M3 and the Cortex-M4 through examples and guidelines. The chapters of focus are 9, 13, 21 and 22, because they are closely related to this thesis.

Chapter 9 is divided into two major sections. The first section, which is the focus here, is about low-power systems and low-power features in the Cortex-M family. Joseph Yiu brings up important questions such as what low power means in microcontrollers, and then explains that one typical way to measure energy efficiency is in the form of DMIPS/μW or CoreMark/μW, which is basically how much processing is done with limited energy. Yiu later states that power has traditionally been measured in μA/MHz, since it is based on active current and sleep mode current, but that this is now inadequate because energy efficiency is equally important. The end of the section is about how to utilize the low-power features in application software, which is illustrated through charts and tables.

The second important chapter that needs to be reviewed is chapter 13, which is about Floating Point Operations (FPO). Yiu introduces software Floating-point, the FPU and their usage in the Cortex-M4, with examples such as how to convert a value to SP in the IEEE-754 standard, along with HP and DP. He later points out that for MCUs without an FPU, the arithmetic calculations are carried out by run-time library functions.

This brings us to chapter 21 (ARM Cortex-M4 and DSP Applications) and chapter 22 (Using the ARM CMSIS-DSP Library), which are about the DSP functions in the Cortex-M4 processor and how it compares to DSP devices. Yiu starts by explaining the term DSP and its use on an MCU, which is the key feature that makes the Cortex-M4 a DSC. This is illustrated by showing the architecture layers added to the Cortex-M4. Yiu even states that using the Cortex-M4, which is a DSC, removes the limitation of having an MCU and a DSP device separately, which leads to lower power consumption and lower overall system cost. The signal processing algorithms in the CMSIS-DSP library are optimized for the Cortex-M4, and Yiu gives examples and guides through common algorithms from the CMSIS-DSP library such as the FIR filter, the IIR filter and the FFT.

The guidelines that Yiu introduces in the book, such as FPO, the CMSIS-DSP library, SIMD and the Cortex-M4 as a DSC, are important to this thesis. Understanding in which form energy and power efficiency are measured sets the foundation for the benchmark that will be applied to solve the research questions in section

3.4 Savita Rani - Area and Speed Efficient Floating Point Unit

Savita Rani [30] explains what the FPU is and the advantages of using an FPU compared to Fixed-point arithmetic. Rani states that the FPU is a key element in areas where real-time computations are required, such as signal processing, and mentions that for numbers that are very large or very small the FPU is required, even though Fixed-point arithmetic can be faster. Rani also points out that multiplication is less common than addition but is essential for MCUs and DSP devices that run DSP applications. Rani goes into more detail about multiplication techniques and methods such as Integer Multiplication Methods, Truncated Multipliers and Logarithmic Multipliers, and investigates the performance of these three methods by using simulations to analyze the output of each technique with a full FPU. It is then discussed which multiplication technique provides better results.

In this thesis both the FPU and Fixed-point arithmetic are used, so it is important to consider Rani's point that very large and very small numbers can be an issue with Fixed-point arithmetic, especially since very small numbers are used in both the FIR- and IIR- filter. For the IIR-filter, small changes in the coefficients can make it very unstable, and this is probably one of the reasons that Rani recommends the FPU over Fixed-point arithmetic when dealing with very small numbers.
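To make the point about very small numbers concrete, the sketch below quantizes a few coefficients to a 16-bit Fixed-point format and reports the relative error. It is a minimal illustration and not the implementation used in this thesis: the Q15 format, the helper names `to_q15`, `from_q15` and `relative_error`, and the coefficient values are all assumptions chosen for the example.

```python
Q15_SCALE = 1 << 15  # assumed format: 1 sign bit, 15 fractional bits


def to_q15(x: float) -> int:
    """Round x to the nearest Q15 integer and saturate to the 16-bit range."""
    q = round(x * Q15_SCALE)
    return max(-Q15_SCALE, min(Q15_SCALE - 1, q))


def from_q15(q: int) -> float:
    """Convert a Q15 integer back to a floating-point value."""
    return q / Q15_SCALE


def relative_error(x: float) -> float:
    """Relative error introduced by storing x in Q15."""
    return abs(from_q15(to_q15(x)) - x) / abs(x)


if __name__ == "__main__":
    # Hypothetical coefficient magnitudes, not values from the filter design.
    for c in (0.5, 0.01, 0.0001, 0.00001):
        print(f"{c:>8}: q15={to_q15(c):6d}  relative error={relative_error(c):.2%}")
```

In this sketch a coefficient of 0.5 is stored exactly, while a coefficient of 0.00001 rounds to zero and is lost completely, which is the kind of precision loss that can move the poles of an IIR-filter.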

3.5 Alexandre Aminot, et al. - Floating Point Units Efficiency in Multi-Core Processor

Alexandre Aminot, et al. [31] explain speed-up extensions in multi-core processors. Multi-core processors with an FPU in every core are called SMP by Aminot, et al., while processors that only have an FPU in some of the cores are called FAMP. The paper's research question is how to efficiently exploit floating point units in multi-core processors, and whether there are any advantages to having an FPU in all the cores.

The method used by Aminot, et al. is based on controlled experiments. Three energy management systems are compared on FAMP processors, and the results are then compared with SMP processors, which only use one energy management system. The three energy management systems are the application level, the scheduler event level, and the hardware level, which is the one SMP uses. It is stated that the code is not modified for the experiments and that different benchmarks are used to estimate the power consumption and the performance.

The results for the first energy management system show that for applications that mainly use integers, or make minimal use of floating point, the FAMP processor is better because the speed-up does not balance the power cost. Aminot, et al. also mention an application that mostly uses Floating-point, and with this application the SMP consumes less energy and is faster than the FAMP.

The second energy management system is the scheduler level, where the system switches cores depending on the event in the application. This lowered the energy consumption but increased the execution time, because more time is spent in the core without an FPU. The energy management system did not decrease the energy consumption for applications that depend more on the FPU, because more time is spent switching cores and less time is spent in the cores without an FPU. Aminot, et al. recommend that applications that need floating point should be executed completely on a core that has an FPU.

The third energy management system, which is tested with both an SMP and a FAMP processor, is the hardware (instruction) level. The hardware level is an aggressive technique that quickly powers up the FPU in the core(s). This technique is application dependent, and the power-up time is 1000 cycles. The energy consumption at the hardware level is reduced for longer applications compared to running the same applications at the scheduler level.

The conclusion of Aminot, et al. is that an FPU is not necessary in each core, because of the power leakage that occurs in the FAMP processors because of the FPU.

4. Method

The workflow used in this thesis is presented in this section. It can be seen as a top-down framework that consists of two main categories: literature study and controlled experiments. Controlled experiments are then divided into three subcategories: SAMG55 with FIR- and IIR- filter, Power measuring system and In-system debugging. This structure is presented in figure 6 [32].

Figure 6. Research workflow

4.1 Literature study

Several studies were reviewed during this thesis, and the most relevant reviews can be found in section 3. This resulted in an overview of the domain problem stated in section 1.1 and of the techniques (observe, formulate and evaluate) used to identify the sub questions.

4.2 Research problem

The questions in section 1.2 are acquired through studies. To reach a conclusion on the research problem, three steps have to be followed in chronological order to ease the workflow:

- Evaluating the background
- Deciding the problem domain
- Setting the limitations

Evaluating the background is the first step, and its advantage is that it gives an overview of the area. The following step is to decide the problem domain found in the area that was evaluated in the first step. The last step is to set the limitations in order to focus on the specific problem at hand. These steps are done through iteration of the literature study, as can be seen in figure 6. The most related studies in this thesis can be seen in section 3.

4.3 Controlled experiments

"Science classifies knowledge. Experimental science classifies knowledge derived from observation", Denning P.J [33]. To get a basis in an area it is important to acquire an understanding of the fundamental components and relationships in that area. Experiments provide the data necessary to better evaluate, predict, understand, control and improve a development process and product [32]. This is a well-known concept, where basically everything is held constant except for one variable [32][33]. The DSP functionality can be seen as the variable in the Cortex-M4 DSC.

In this thesis experiments are performed with an apparatus. The data (section 5.4) given by the apparatus is then analyzed and used to answer the problems in question. Apparatus can be divided into two categories, system apparatus and simulator apparatus [33]. The apparatus in this case is the system apparatus: SAMG55, In-system debugging, and Power measuring system.

4.3.1 SAMG55 with FIR- and IIR- filter

The system apparatus from Atmel is the SAMG55 development board. The SAMG55 is used to execute DSP algorithms such as the FIR- and IIR-filter. During the execution of the algorithms the current measurement header and the Cortex debugger header are connected to a Power measuring system (section 4.3.3). The schematics for this setup can be seen in figure

4.3.2 In-system debugging (DWT)

Measuring speed with in-system debugging is efficient. It is done by marking a section of code with a start- and stop- counter, which lets the user see the number of cycles needed to execute the marked code. In this thesis in-system debugging is performed with Atmel Studio to achieve a result for sub question three in section 1.3 [34].

4.3.3 Power measuring system

The apparatus used in this thesis is the Atmel power debugger (section 2.8), a device used to measure power consumption. The power debugger allows the user to follow the power consumption of the FIR- and IIR- filter in a real-time application and to analyze the efficiency of the Cortex-M4 device in its present state [25]. Atmel data visualizer is a program that is compatible with the Atmel power debugger and offers a graph plotter, an oscilloscope and other indicators that help in interpreting the data [25]. The power debugger and the data visualizer are used to achieve a result for sub questions one, two and three in section 1.3. The measurements are done using common FIR and IIR algorithms.

5. Results and Analysis

The data and results presented in this section are based on the Atmel SAM G55 DSC with the Cortex-M4 core and FPU.

5.1 Algorithms and filter design

The algorithms used to accomplish RQ1, RQ2 and RQ3 are the FIR-filter and the IIR-filter. With such filters, there are some key parameters that need to be considered:

- The sampling rate
- The number of taps
- The pass-band
- The stop-band

In this thesis, the FIR- and IIR- filter coefficients are calculated with the Filter Designer tool, a GUI from MATLAB for designing and analyzing filters. The benefit of this tool is its easy-to-use GUI, which enables the user to design digital FIR- and IIR- filters by setting the specific parameters (sampling rate, pass-band and stop-band) listed above. The two filters mentioned above have been designed as low pass filters with the following parameters.

Filter  Apass  Astop  Fpass    Fstop    Sampling rate  Number of taps received
FIR     1 dB   40 dB  1100 Hz  2000 Hz  Hz             23
IIR     1 dB   40 dB  1100 Hz  2000 Hz  Hz             5

Table 3. The parameters used in the FIR- and IIR- filter.

The Filter Designer tool gives a magnitude response overview of the design that was created, which makes it possible to evaluate whether the design meets the specifications that are sought. In this case the requirements were met.
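As an illustration of what the number of taps means in practice, the following is a minimal sketch of a direct-form FIR filter, where each output sample is the weighted sum of the most recent inputs. It is not the implementation used on the SAM G55: the function name `fir_filter` is hypothetical, and the coefficients are a placeholder moving average rather than the coefficients produced by the Filter Designer tool. Only the tap count of 23 is taken from table 3.

```python
def fir_filter(samples, taps):
    """Direct-form FIR: y[n] = sum over k of taps[k] * x[n - k]."""
    history = [0.0] * len(taps)  # delay line, newest sample first
    output = []
    for x in samples:
        history = [x] + history[:-1]  # shift the new sample into the delay line
        output.append(sum(h * s for h, s in zip(taps, history)))
    return output


NUM_TAPS = 23  # tap count from table 3
# Placeholder coefficients (moving average), NOT the MATLAB design.
taps = [1.0 / NUM_TAPS] * NUM_TAPS

if __name__ == "__main__":
    step = fir_filter([1.0] * 30, taps)
    print(f"first output: {step[0]:.4f}, settled output: {step[-1]:.4f}")
```

Each output sample costs one multiply-accumulate per tap, so the execution time of the Active mode grows with the number of taps.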

Figure 7A. FIR filter magnitude response

Figure 7B. IIR filter magnitude response

At this point, the FIR- and IIR- filter coefficients have been created, and they are then implemented in Atmel Studio. The last step is to generate a sinusoidal signal with some interference. This signal is also created in MATLAB and is the input from which the IIR- and FIR- filters remove the interference. The interference signals are sinusoids, and their frequencies can be seen in the FFT spectrum in figure 8.

Figure 8. FFT generated by MATLAB with Frequency 800, 2500, 4500 Hz

The implementation of the IIR-filter with Fixed-point was a special case. It was created as a double section filter, which in turn means that the number of taps is 3. A double section filter is intended to function in the same manner as a single section filter. The difference is that the output of the first section becomes the input for the second section, as illustrated in the figure below.
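The double section structure can be sketched as two second-order sections in series, where the second section consumes the output of the first. The sketch below is a floating-point illustration under assumptions, not the Fixed-point code used in this thesis: the function names `biquad` and `cascade` are hypothetical, a direct form I structure with a[0] = 1 is assumed, and the coefficients in the example are placeholders rather than the designed filter.

```python
def biquad(samples, b, a):
    """One second-order section (direct form I); a[0] is assumed to be 1."""
    x1 = x2 = y1 = y2 = 0.0  # previous inputs and outputs
    output = []
    for x in samples:
        y = b[0] * x + b[1] * x1 + b[2] * x2 - a[1] * y1 - a[2] * y2
        x2, x1 = x1, x
        y2, y1 = y1, y
        output.append(y)
    return output


def cascade(samples, sections):
    """Feed the output of each section into the next one."""
    for b, a in sections:
        samples = biquad(samples, b, a)
    return samples


if __name__ == "__main__":
    # Placeholder coefficients: each section has 3 feed-forward taps.
    sections = [((1.0, 1.0, 0.0), (1.0, 0.0, 0.0)),
                ((1.0, 1.0, 0.0), (1.0, 0.0, 0.0))]
    print(cascade([1.0, 0.0, 0.0, 0.0], sections))
```

Splitting a higher-order filter into low-order sections keeps each section's coefficients in a range that a Fixed-point format can represent more accurately.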


Figure 9. Double section IIR filter

The reason for using this implementation is the low accuracy of the Fixed-point coefficients, which makes the IIR filter unstable. This was mentioned by Savita Rani [30]. The filter coefficients are tested on another development board that is based on the Cortex-M4 core, because the SAMG55 lacks a Digital-Analog Converter (DAC).

5.2 Energy monitoring

To monitor the energy in the SAM G55 DSC, two tools were used: the Power debugger and the Data visualizer. The first tool is the Power debugger, which can use both the JTAG- and SWD- interface to target the SAM G55 DSC. The main focus is on the SWD (programming and debugging) interface together with the two current sensing channels (power measurement) on the Atmel debugger (figure 10).

Figure 10. Logical Construction of the Power Debugger [25]

A benefit of the Cortex-M4 is its capability to collect data at cycle-by-cycle resolution with the data watchpoint and trace unit (DWT), which is then shown in Atmel Studio. This has led to the identification of some particularly energy-consuming spots in the embedded system.

The second tool is the Data visualizer, which is based on the ADP. The intent of the ADP protocol is to transfer data from a target MCU to the user's PC. This is done through the Cortex debug header that can be found on the SAM G55. In this project, the data was transferred from the DSC to the PC through the Power debugger. Figure 11 shows the wiring for the SAMG55 MCU.


Figure 11. The wiring diagram [25]

A great benefit is that the Data visualizer can be integrated with the GNU C/C++ compiler and debugger, which makes it easier to monitor the embedded system. It is also important for the monitoring tools to have the same interface as the embedded system, otherwise they are not compatible with each other unless a new interface is implemented. This was not an issue in this project since the SAM G55 has the same interface as the Atmel Studio Data visualizer, which is the ADP mentioned above. In short, the ADP protocol is very important in this project because a large set of data is transferred from the DSC to the host PC.

To measure the power consumption accurately with the Data visualizer, the following three areas are important to monitor:

- The Active mode
- The Standby mode
- The Sample area (Active mode + Standby mode)

The Active mode is the part of the Sample area where the interrupt code is executed, while the Standby mode is the time where no code is executed. The Sample area is important because it makes it possible to monitor the overall power consumption in a sample. These three areas are illustrated in figure 12 below.

Figure 12. The power measuring areas

The approach below for measuring the three areas (Active mode, Standby mode and Sample area) is chosen in order to disregard the capacitors that sit between the MCU voltage supply headers and the current measurement headers.

The monitoring of the Active mode is done in five steps:

1. Increase the sample frequency so that almost no time is spent in Standby mode (the interrupt latency time, section 5.4).
2. Record the average current and the average power from the data visualizer.
3. Use in-system debugging to record the number of cycles it takes to execute the Active mode.
4. Convert the number of cycles into time (Eq. 7).
5. Multiply the time in Active mode with the average power to get the energy spent in Active mode (Eq. 8).

$$\frac{1}{\text{MCU clock frequency}} \times \text{DWT cycles} = \text{Time in Active mode} \qquad \text{(Eq. 7)}$$

$$\text{Time in Active mode} \times \text{Average power in Active mode} = \text{Total energy in Active mode} \qquad \text{(Eq. 8)}$$

Monitoring the power consumption in the Standby mode area is done in four steps:

1. Set the device in sleep mode.
2. Record the average current and the average power from the data visualizer.
3. Get the time spent in Standby mode (Eq. 9).
4. Multiply the time in Standby mode with the average power to get the energy spent in Standby mode (Eq. 10).
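Steps four and five of the Active mode procedure amount to two small calculations, sketched below. The function names are hypothetical, and the cycle count, clock frequency and average power in the example are assumed values for illustration only, not measurements from this thesis.

```python
def time_in_active_mode(dwt_cycles: int, mcu_clock_hz: float) -> float:
    """Eq. 7: convert a DWT cycle count into seconds."""
    return dwt_cycles / mcu_clock_hz


def energy_in_active_mode(time_s: float, average_power_w: float) -> float:
    """Eq. 8: energy in joules spent in Active mode."""
    return time_s * average_power_w


if __name__ == "__main__":
    cycles = 1200        # hypothetical DWT reading
    clock_hz = 120e6     # assumed MCU clock frequency
    avg_power_w = 0.050  # hypothetical average power in Active mode (50 mW)

    t = time_in_active_mode(cycles, clock_hz)
    e = energy_in_active_mode(t, avg_power_w)
    print(f"time = {t * 1e6:.2f} us, energy = {e * 1e6:.3f} uJ")
```

Keeping the units in seconds, watts and joules inside the functions, and converting only when printing, avoids mixing μs, mW and μJ in the intermediate steps.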

$$\frac{1}{\text{Sample frequency}} - \text{Time in Active mode} = \text{Time in Standby mode} \qquad \text{(Eq. 9)}$$

$$\text{Time in Standby mode} \times \text{Average power in Standby mode} = \text{Total energy in Standby mode} \qquad \text{(Eq. 10)}$$

The total energy in the Sample area is computed by adding the energy from the Standby mode and the Active mode (Eq. 11). The calculation of the average power in the Sample area is shown in equation 12.

$$\text{Total energy in Active mode} + \text{Total energy in Standby mode} = \text{Total energy in Sample area} \qquad \text{(Eq. 11)}$$

$$\frac{\text{Total energy in Sample area}}{\text{Sample time}} = \text{Average power} \qquad \text{(Eq. 12)}$$

5.3 Enabling the FPU

There are different ways to enable the FPU depending on the MCU. The SAM G55 is an Atmel MCU built around the ARM Cortex-M4 core, and these steps were necessary:

1. Make sure that the symbol ARM_MATH_CM4 = true is defined in the compiler settings.
2. Add two flags to both the compiler and the linker: -mfloat-abi=hard -mfpu=fpv4-sp-d16
3. Include the arm_math header in main.c.
4. Call the fpu_enable() function in main.
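Equations 9 to 12 of section 5.2 can be combined into one calculation for the whole Sample area, as sketched below. This is a minimal sketch under assumptions: the function name `sample_area` and the returned keys are hypothetical, the sample time is taken to be one period of the sample frequency, and the numbers in the example are made up for illustration and are not results from this thesis.

```python
def sample_area(sample_frequency_hz, time_active_s, power_active_w, power_standby_w):
    """Energy and average power over one sample period (Eq. 8 to Eq. 12)."""
    sample_time_s = 1.0 / sample_frequency_hz
    time_standby_s = sample_time_s - time_active_s          # Eq. 9
    if time_standby_s < 0:
        raise ValueError("Active mode does not fit in one sample period")
    energy_active_j = time_active_s * power_active_w         # Eq. 8
    energy_standby_j = time_standby_s * power_standby_w      # Eq. 10
    energy_total_j = energy_active_j + energy_standby_j      # Eq. 11
    average_power_w = energy_total_j / sample_time_s         # Eq. 12
    return {
        "time_standby_s": time_standby_s,
        "energy_total_j": energy_total_j,
        "average_power_w": average_power_w,
    }


if __name__ == "__main__":
    # Hypothetical inputs: 10 kHz sampling, 10 us active at 100 mW, standby at 40 mW.
    result = sample_area(10_000, 10e-6, 0.100, 0.040)
    print(f"average power = {result['average_power_w'] * 1e3:.1f} mW")
```

The sketch shows why a shorter Active mode lowers the average power: the time removed from the Active mode is added to the Standby mode, which is weighted with the lower standby power.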

5.4 Results of the controlled experiments

The conclusive results obtained through experimentation in a controlled environment can be seen below. The Active current, the Sleep mode and the energy efficiency that are mentioned in section 2.5 can also be seen in the subsections below.

The Standby mode average current:

- 11.18 mA with the FPU disabled
- 11.46 mA with the FPU enabled

For the results in tables 4A-4C, it is important to consider the accuracy of the power debugger that is explained in chapter 2.8 and the interrupt latency in chapter 2.9. The latency time (entry plus exit) was measured, and is around:

- optimization -O0, FPU disabled: 407 ns
- optimization -O0, FPU enabled: 407 ns
- optimization -O1, FPU disabled: 266 ns
- optimization -O1, FPU enabled: 340 ns

During the latency time the MCU is not put into sleep mode, which makes the measurements of the Active mode more accurate. The current measured during the latency time is between (depending on the optimization and if the FPU is enabled or disabled) 24.48mA mA. The effect of the latency is at worst ~0.88%, which can be calculated with the equations below.

$$\frac{(\text{active mode time} + \text{latency time}) \times \text{average current} - \text{latency current} \times \text{latency time}}{\text{active mode time}} = X_{\text{current}} \qquad \text{(Eq. 13)}$$

$$\left(1 - \frac{\text{average current}}{X_{\text{current}}}\right) \times 100 = \text{error in \%} \qquad \text{(Eq. 14)}$$
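The latency error calculation can be sketched as follows, assuming that Eq. 13 solves for the latency-corrected Active mode current (X current) and that Eq. 14 expresses the deviation of the measured average current from it in percent. The function names are hypothetical. In the example only the latency time of 407 ns and the latency current of 24.48 mA are taken from the text above; the active mode time and the average current are made-up values.

```python
def corrected_active_current(active_time, latency_time, average_current, latency_current):
    """Eq. 13: remove the latency contribution from the measured average current."""
    measured = (active_time + latency_time) * average_current
    return (measured - latency_current * latency_time) / active_time


def latency_error_percent(average_current, x_current):
    """Eq. 14: deviation of the measured average from the corrected current."""
    return (1 - average_current / x_current) * 100


if __name__ == "__main__":
    active_time = 10e-6       # hypothetical Active mode time (s)
    latency_time = 407e-9     # latency at -O0 from the list above (s)
    average_current = 30.0    # hypothetical measured average (mA)
    latency_current = 24.48   # latency current from the text (mA)

    x = corrected_active_current(active_time, latency_time,
                                 average_current, latency_current)
    print(f"X current = {x:.3f} mA, "
          f"error = {latency_error_percent(average_current, x):.2f} %")
```

The error grows when the Active mode is short compared to the latency time, or when the latency current differs strongly from the Active mode current.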

Table 4A. Complex composition of the conclusive results for software Floating-point
Columns: Filter, Opt., Area, DWT cycles, Avg. Current (mA), Avg. Power (mW), Time (μs), Energy (μJ)
Rows: FIR and IIR, each with -O0 and -O1, each with an Active and a Sample area (the Sample rows contain N/A entries)

Table 4B. Complex composition of the conclusive results for FPU
Columns and rows as in table 4A.

2 This value is received from the debugger

Table 4C. Complex composition of the conclusive results for Fixed-point
Columns: Filter, Opt., Area, DWT cycles, Avg. Current (mA), Avg. Power (mW), Time (μs), Energy (μJ)
Rows: FIR and IIR, each with -O0 and -O1, each with an Active and a Sample area

5.5 The power consumption when using optimization -O0 and -O1 with Software Floating-Point, Fixed-Point and FPU

The measurements presented in this project are based on the energy consumption and the number of cycles executed. These measurements are for well-documented algorithms, namely the FIR filter and the IIR filter. The results shown in tables 4A-C are obtained with the Atmel data visualizer, the Atmel power debugger and the DWT unit.

A benefit of the SAM G55 DSC is that it has three low power modes: backup, wait and sleep. When sleep mode is used correctly, the core clock is stopped while all the other functions are able to keep on running [35]. In this experiment, sleep mode is implemented and used to reduce the power consumption. In the subsections below the main focus is on the average power in the Sample area and the execution time in the Active mode with optimization -O0 and -O1.

5.5.1 The power consumption in FIR filter

The power consumption varies between the two optimizations -O0 and -O1. The values are based on the FIR filter with Software Floating-Point, FPU and Fixed-Point.

Software Floating-Point:
- Active mode time difference: ~32.6%
- Sample area average power difference: ~26.7%

FPU enabled:
- Active mode time difference: ~17.6%
- Sample area average power difference: ~11.1%

Fixed-Point:
- Active mode time difference: ~20.4%
- Sample area average power difference: ~11.9%

For the three options (Fixed-point, FPU and Software Floating-Point) mentioned above, it is clear that with optimization -O1 the power consumption is reduced by 10% to 25% compared to -O0 and the execution time is reduced by 16% to 29%.

5.5.2 The power consumption in IIR filter

The power consumption also varies for the IIR filter depending on the optimization (-O0 or -O1) chosen.

Software Floating-Point:
- Active mode time difference: ~46%
- Sample area average power difference: ~17.5%

FPU enabled:
- Active mode time difference: ~38%
- Sample area average power difference: ~6.9%

Fixed-Point:
- Active mode time difference: ~41.4%
- Sample area average power difference: ~11.1%

Much like in section 5.5.1, it was more beneficial to use optimization -O1, where the power consumption is reduced by 6% to 17% compared to -O0 and the execution time is reduced by 30% to 38%.

5.6 Comprehensive analysis

When following the workflow of chapter 5, a pattern has been noticed in the FIR- and IIR-filter values for -O0 and -O1, which can be traced back to subsections 5.5.1 and 5.5.2. This pattern can be seen in the Active mode time and the Sample area power consumption, where the measured values are lower with -O1 than with -O0. Based on the charts below, one can see that the execution time and the power consumption are tightly related to each other. The FPU ensures faster Floating-point calculations, which in turn decreases the time in the Active mode and increases the time in the Standby mode. Since the DSC is not executing any code in the Standby mode, this results in an overall reduced power consumption.

Chart 1. FIR Sample area power consumption(y-axis), execution time in Active mode(x-axis)
Axes: Energy (nJ) in Sample area, FIR time (μs) in Active mode

Chart 1 shows the total energy for the FIR filter. An interesting point is that the FIR filter with optimization -O1 executes in less time and with less total energy consumption in the Sample area. When comparing the three options (FPU, software Floating-point and Fixed-point) in chart 1, it becomes clear that when using Floating-point it is less energy expensive to enable the FPU. When comparing Fixed-point with the FPU, there are no major differences in the total accumulated energy consumption in the Sample area once the accuracy of the power debugger and the interrupt latency are taken into consideration. When relating the energy consumption to the execution time in the chart above, a longer execution time uses more energy. However, this does not apply when comparing the FPU with Fixed-point at the same optimization.

Fixed-point takes more time to execute the code than the FPU, yet the energy consumption is approximately the same.

Chart 2. IIR Sample area power consumption(y-axis), execution time in Active mode(x-axis)
Axes: Energy (nJ) in Sample area, IIR time (μs) in Active mode

In the results for the IIR filter measurements seen in chart 2, it is obvious that the FPU, the software Floating-point and the Fixed-point options all execute faster and consume less energy with -O1. When comparing these three options (FPU, software Floating-point and Fixed-point) with each other at -O0, it is clear that the FPU consumes less energy than the other two options and also executes faster. At -O1 the FPU executes faster than the other two options, but the difference in total energy consumption between Fixed-point and FPU is indistinguishable when the measurement error is taken into account.

FIR (mW)                  Opt -O0   Opt -O1
Software Floating-point   92,6      70,8
FPU                       54,2      48,5
Fixed-point               54,2      48,1

Chart 3. FIR Sample area, average power consumption

Chart 3 shows that at -O0 the power consumption with software Floating-point is 71% higher than with the FPU, while at -O1 it is 46% higher. Optimization -O1 has been more beneficial for software Floating-point than for the FPU and Fixed-point, reducing its power consumption by around 15 mW more, but software Floating-point is still inferior to the FPU and Fixed-point.

IIR (mW)                  Opt -O0   Opt -O1
Software Floating-point   …,7       50,1
FPU                       46,2      43,1
Fixed-point               49,3      44,1

Chart 4. IIR Sample area, average power consumption

For the IIR filter it has been a challenge to distinguish which of the two options (FPU and Fixed-point) has the highest power consumption with -O1, because of the small measurement error that exists. However, even in this case software Floating-point has proven to be inferior to the other two options.
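The percentages quoted for chart 3 can be reproduced from the data labels of the chart, as the short sketch below shows. The dictionary layout and the function name `percent_more` are hypothetical; the power values are the chart 3 labels in mW.

```python
# Chart 3 data labels (mW), FIR Sample area average power.
fir_power_mw = {
    "-O0": {"Software Floating-point": 92.6, "FPU": 54.2, "Fixed-point": 54.2},
    "-O1": {"Software Floating-point": 70.8, "FPU": 48.5, "Fixed-point": 48.1},
}


def percent_more(value, reference):
    """How many percent 'value' lies above 'reference'."""
    return (value / reference - 1) * 100


# Reduction in mW when going from -O0 to -O1, per option.
saving_mw = {name: fir_power_mw["-O0"][name] - fir_power_mw["-O1"][name]
             for name in fir_power_mw["-O0"]}

if __name__ == "__main__":
    for opt, row in fir_power_mw.items():
        extra = percent_more(row["Software Floating-point"], row["FPU"])
        print(f"{opt}: software Floating-point uses {extra:.0f}% more than FPU")
    for name, saving in saving_mw.items():
        print(f"{name}: -O1 saves {saving:.1f} mW")
```

The savings show that -O1 reduces software Floating-point by roughly 16 mW more than it reduces the FPU or Fixed-point, which matches the "around 15 mW more" stated above.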
