
Surveillance Camera
MPEG-2 Implementation on an FPGA

8th semester project, AAU, Applied Signal Processing and Implementation, Spring 2008

Group 842
Clément Borlot
Martin Daniel
Jesper Kjeldsen
Søren Reinholt Søndergaard
Martin Brinch Sørensen


Title: Surveillance Camera - MPEG-2 Implementation on an FPGA
Theme: Applied Signal Processing and Implementation
Project period: ASPI2, Spring term 2008
Project group: 842
Participants: Clément Borlot, Martin Daniel, Jesper Kjeldsen, Søren Reinholt Søndergaard, Martin Brinch Sørensen
Supervisor: Yannick Le Moullec
Copies: 7
Number of pages: 129
Appendices hereof: 29
Attachment: 1 CD-ROM

Completed at:
Aalborg University
Institute for Electronic Systems
Fredrik Bajers Vej 7
DK-9100 Aalborg
Phone: (+45)

Abstract: This project deals with the design challenges of implementing MPEG-2 compression on an FPGA with an IP camera for a surveillance system. The MPEG-2 compression profile was investigated to establish how this compression is performed. Through profiling it was found that the function nothxnothy() was the most time consuming. Metric calculations were used to show that the function is potentially parallelizable and therefore suitable for hardware implementation. Hardware/software partitioning was done to split the main modules of the system - camera, MPEG-2 and Ethernet - into submodules for either hardware or software implementation and to establish hardware/software interfaces. It was decided to implement RGB to YCbCr conversion and chroma subsampling as well as nothxnothy() in hardware. The Terasic DE2 development board was selected as platform and hardware implementation was done in Verilog code. uClinux was installed on the Nios II softcore processor to implement a TCP/IP protocol and the software modules were executed on the softcore processor. Tests of the submodules showed that these work as expected and that implementation of the MPEG-2 compression on an FPGA is possible. A useful frame rate could, however, not be obtained without improvements to execution time.

The content of this report is freely accessible, though publication (with reference) may only occur after permission from the authors.


Preface

This report is the documentation for the theoretical and practical work for Surveillance Camera - MPEG-2 Implementation on an FPGA. It represents the 2nd semester of the Applied Signal Processing and Implementation master specialization at the Department of Electronic Systems at Aalborg University, Denmark. The report is composed of two parts: the main report and an enclosed CD-ROM. The CD-ROM contains the implemented code, material used for testing and a digital copy of this report.

The report consists of three layers: chapters, sections and subsections. The chapters and sections are stated in the main table of contents. Figures, tables and listings are numbered by two numbers separated by a dot. The first number indicates the chapter number and the second number denotes the number of the figure, table or listing in the chapter. Equations are numbered in brackets with the same convention as the figures and tables, e.g. (2.1). The listing environment contains code examples or terminal commands with each line numbered for easier referencing.

The following notation is used throughout the report: functions and variables are written in italic, lower case bold letters are vectors and upper case bold letters are matrices. The words frame and image are used interchangeably. Citations are written in square brackets with a number, e.g. [3]. The citations are listed in the bibliography on page 100.

Aalborg University, June 3rd 2008


Contents

1 Introduction
  1.1 Scope of project
  1.2 Delimitation
  1.3 Development Model
  1.4 MPEG-2 Compression Profile
  1.5 Requirement Specification

2 Analysis of the MPEG-2 Algorithm
  2.1 Choice of MPEG-2 Algorithm
  2.2 Profiling of the Algorithm
  2.3 The dist1() Function
  2.4 MPEG-2 Algorithm Blocks
  2.5 Applying metrics to code pieces
  2.6 Conclusion

3 Hardware/Software Partitioning

4 Hardware Design
  4.1 Selection of Development Board
  4.2 Design Tools
  4.3 Hardware Partitioning
  4.4 The Avalon Bus
  4.5 General Purpose Processor
  4.6 Random Access Memory
  4.7 Camera Hardware Module
  4.8 MPEG-2 Hardware Accelerator
  4.9 Ethernet Hardware Module

5 Software Design
  5.1 Program Structure
  5.2 Operating System
  5.3 Camera Module
  5.4 Ethernet Module
  5.5 MPEG-2 Module

6 Conclusion and Future Work
  6.1 Conclusion
  6.2 Future Work

Bibliography

A Appendix for uClinux
  A.1 Installation Procedure for uClinux onto PC
  A.2 Configuration Procedure for uClinux
  A.3 Installation Procedure for uClinux onto Development Board
  A.4 Customizing the Kernel
  A.5 How to Compile C-code for the uClinux

B MPEG-2 Overview
  B.1 I-, P- and B-images
  B.2 Slices
  B.3 Subsampling of Chromacity levels
  B.4 Motion Prediction and Transform Domain Coding
  B.5 Macroblocks
  B.6 Discrete Cosine Transform (DCT)

C ModelSim Results
  C.1 RGB to YCbCr and Chroma Subsampling
  C.2 Results for Test of Hardware Accelerated MPEG-2 Algorithm

Chapter 1
Introduction

In today's world video cameras are becoming more and more common. Video cameras come in many sizes and shapes, varying from professional film equipment to web cameras and cell phone cameras. The availability of the latter and online services such as YouTube have made video recordings a common way of capturing memorable occasions, but video cameras also have uses in more serious applications such as surveillance. With the transition to digital electronics, surveillance applications are often based on web cameras connected to a recording device or Internet Protocol (IP) cameras connected to a LAN device and through this to a recording device.

A typical problem using a digital camera is the limitation of bandwidth from the camera to the recording device. A video sequence consists of a very large amount of data, and to transfer this data in real time a correspondingly large bandwidth is needed. In most cases such bandwidths are not available, and the most common solution to this problem is to compress the video sequence before transfer. A variety of compression algorithms exist, such as DivX, MPEG, H.264 etc. These all offer different ranges of compression and loss of quality. At the time of writing one of the most popular compression algorithms is the MPEG-2 algorithm, as this is used for DVD videos and digital video broadcasts.

It is investigated if it is possible to implement a system for encoding and transferring video data from a camera on an FPGA. The flexibility of FPGAs makes them interesting in surveillance equipment. Since 9-11 and the British terror acts, there has been extensive development and research in surveillance, resulting in a vast array of algorithms for different kinds of surveillance features. The FPGA provides a platform where it is possible to implement new features as the research and development continue. Another way to make use of the flexibility of an FPGA is to develop user defined systems that fit the users' requirements. Here the user will be able to choose algorithms such as face recognition, motion detection etc. and upgrade the system when new needs are discovered without changing the entire system.

Figure 1.1: Overview of the entire system and dataflow (Camera, MPEG-2 Encoder and TCP/IP Protocol on the FPGA; MPEG-2 Decoder, Storage and Viewing on a PC/Server, connected via LAN/WAN).

1.1 Scope of project

The scope of this project is to see if a basic surveillance system, doing compression of the camera data and transmitting it to a server, can be implemented on an FPGA with the use of an IP camera and the MPEG-2 compression profile. The MPEG-2 algorithm is chosen because of its availability, wide field of use and relative simplicity compared to H.264. The processing part of the system is based on an FPGA platform and the project is mainly concerned with the mapping of the MPEG-2 algorithm onto hardware. Other tasks in the project involve finding a suitable MPEG-2 algorithm for implementation, adapting Intellectual Property (IP) cores for communicating with a camera device and transferring the MPEG-2 output to a recording device via the TCP/IP protocol, or writing this software from scratch. The tasks of the project are listed below:

MPEG-2 algorithm (an overview of the algorithm can be found in appendix B):
- Determine which MPEG-2 algorithm to use.
- Analyse the algorithm with respect to execution time and implementation on an FPGA platform.
- Accelerate time critical parts of the algorithm, specifically with respect to execution time, by implementing these in hardware.
- Adapt interfaces of the algorithm to work with the FPGA platform.

FPGA platform:
- Determine which FPGA platform to use.
- Determine IP cores to use and adapt these, or write software from scratch.

1.2 Delimitation

Many features such as motion detection and face recognition can be implemented in the project. The main focus is, however, compressing a video sequence using an MPEG-2 encoder algorithm.

For the FPGA platform there are also several things to focus on, such as area and power usage, but the main focus is execution time. Therefore the items below will not be examined in this project:

- Optional camera functionality, e.g. zoom control, change of resolution on the fly etc.
- Optional algorithm functionality, e.g. motion detection, face recognition etc.
- Optimizations on the FPGA platform with regard to power and area (unless the area needed exceeds the area available).
- Optimization of the MPEG-2 algorithm in its original programming language; only hardware optimizations are considered.

1.3 Development Model

To outline the progress needed in this project to establish the goals stated above, a model based on the V-model is used in the system development, [20, p. 37]. By using the V-model it is easy to divide the project into a number of steps needed through a project life cycle, and it should ensure an outcome (documentation and product) from each step. The V-model consists of two branches, where the left branch concerns different kinds of design and analysis based on the requirements set for the product. The right branch concerns test and implementation at different levels of the designed product. In figure 1.2 the V-model is sketched. As seen in this figure, the reason for the V is to illustrate the interaction between the different levels in the project development, as for example the coherence between module design and module testing, where the module testing of course needs to match the specific module being designed. The right branch of the V-model ensures that tests are undertaken during the development of the product and thereby makes it easier to discover where problems occur.

Figure 1.2: The V-model establishes a work flow through the project (Requirement Specification, Program Design, Process Design and Module Design in the left branch; Module Coding at the bottom; Module Testing, Module Integration, Process Integration and Acceptance Testing in the right branch). Through the left branch the overall objective, goals and tasks are determined. In the right branch it is investigated if these things are obtained. [20, p. 37]

The reason why the V-model is chosen for this project is the vast array of algorithms and functions featured in the MPEG-2 compression profile and the many ways of implementing the algorithms contained in this profile on an FPGA. As an existing MPEG-2 algorithm is used, meaning the source code is already at our disposal, the design phase of this project will deal with different ways of optimizing the algorithm to utilize the features of the FPGA.

1.4 MPEG-2 Compression Profile

This section provides a small overview of the compression profile analyzed and synthesized in this project, for the sake of understanding the concepts investigated in the following chapters. The parts described in this section are more thoroughly described in appendix B and the reader is encouraged to read this appendix for further knowledge about the subject. This section and the appendix are mainly based on [22].

Choice of Compression Profile

MPEG-2 is a compression profile that standardizes the way of doing lossy compression on a video and audio stream. As it is only the video stream that is of interest in this project, the way of doing audio compression is neglected. By using a combination of compression methods,

MPEG-2 tries to reduce the quantity of video data so it is possible to abide by a target bitrate for storage and transmission. With MPEG-2 it is possible to choose parameters and profile. The parameters set upper bounds for resolution and frame rate at a given target bitrate, all the way up to a resolution of 1920x1152 at 60 fps. This means that MPEG-2 is able to support compression of high definition TV. The profiles set the functionalities of MPEG-2, meaning that a change in profile will change some of the functions used to compress the video stream, for instance whether or not chroma subsampling and/or B-images should be used - functions that are described in the following. This flexibility of MPEG-2 is the main reason why it was chosen as the compression method in this project, as it fits the flexibility of an FPGA.

MPEG-2 Functions

Lossy compression, as used in MPEG-2, means that some information is lost and never restored again. This will in some way imply a loss of quality, but the smart thing about MPEG-2 is that it removes spatial and temporal redundancies that - to some degree - do not affect the Human Visual System (HVS). One way of removing spatial redundancy is by chroma subsampling, where color information is removed. The color resolution (chroma) for the HVS is lower than the brightness resolution (luma), meaning that the human eye detects changes in luma more easily than changes in chroma. MPEG-2 benefits from this by mapping RGB values to YCbCr - which represents luma by Y and chroma by Cb and Cr - and then subsampling the chroma values. This preserves the brightness but removes the redundancies of the color. The converted and subsampled pels are grouped together in luma and chroma macroblocks, which are used throughout the encoder and decoder illustrated in figure 1.3.

The usage of this scheme changes depending on the kind of image being encoded. MPEG-2 provides three different kinds of images: I-, P- and B-images. I-images provide random access to the video stream. P-images are forward predicted from I- or other P-images and act as reference images as well as provide higher compression than I-images. B-images use both forward and backward prediction and provide the highest compression rate, but they can not be used as reference for further prediction. This means that the Motion Estimator, which provides the temporal prediction, is skipped for I-images. Instead a Discrete Cosine Transform (DCT) and quantization is performed and a reference image resembling a decoded version of the image is stored in the encoder.

The Motion Estimator is used for P- and B-images, where matching of macroblocks between the current image and a reference image leads to motion vectors, m, and a predicted macroblock, p. The color difference between pels in the input macroblock and the predicted macroblock is regarded as the prediction error e, which is DCT transformed and quantized. Zig zag scan and variable length coding are used to introduce some final lossless compression of the transformed data, before it is sent to the decoder through a given channel. Zig zag scan transforms the 2-dimensional macroblock of quantized DCT coefficients into a 1-dimensional bitstream, and Huffman coding is used in the variable length encoder.

Figure 1.3: MPEG-2 coding scheme of the encoder (input image blocks pass through DCT, quantization, variable length coding and the video buffer to the channel (broadcast/DVD/etc.); inverse quantization, IDCT, the reference picture, the Motion Estimator and the Form Prediction block provide the prediction p and motion vectors m, while a control block adjusts the quantizer step size). The coding scheme including the decoder can be found in appendix B. With inspiration from [15, p. 9] and [22, p. 12].

Through these processes some spatial redundancies (DCT) and temporal redundancies (motion estimation) are removed and a compression ratio of approximately [22, p. 41] is obtainable. MPEG-2 is also one of the most widespread digital video formats as it is used in television broadcasting, making the MPEG-2 standard well defined and its encoder and decoder easy to access.
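To make the color conversion and subsampling step concrete, the sketch below shows one common way of computing it in C. It is a minimal illustration only, assuming 8 bit samples, BT.601-style coefficients and simple 4:2:0 averaging; it is not taken from the reference encoder used later in this report.

    #include <stdint.h>

    /* One common approximation of the RGB -> YCbCr mapping
     * (BT.601-style coefficients, 8 bit samples). Illustrative only. */
    static void rgb_to_ycbcr(uint8_t r, uint8_t g, uint8_t b,
                             uint8_t *y, uint8_t *cb, uint8_t *cr)
    {
        *y  = (uint8_t)( 0.299 * r + 0.587 * g + 0.114 * b);
        *cb = (uint8_t)(-0.169 * r - 0.331 * g + 0.500 * b + 128);
        *cr = (uint8_t)( 0.500 * r - 0.419 * g - 0.081 * b + 128);
    }

    /* 4:2:0 subsampling keeps one chroma value per 2x2 block of pels,
     * here by simple averaging; w and h are assumed to be even.
     * The same routine can be used for the Cb and the Cr plane. */
    static void subsample_420(const uint8_t *c_full, uint8_t *c_sub, int w, int h)
    {
        for (int y = 0; y < h; y += 2)
            for (int x = 0; x < w; x += 2)
                c_sub[(y / 2) * (w / 2) + x / 2] =
                    (c_full[y * w + x]       + c_full[y * w + x + 1] +
                     c_full[(y + 1) * w + x] + c_full[(y + 1) * w + x + 1] + 2) / 4;
    }

With 8 bit samples, 4:2:0 subsampling reduces the data from 24 bit per pel (RGB) to an average of 12 bit per pel (one Y per pel plus one Cb and one Cr per four pels), which is part of the motivation for moving this step into hardware in chapter 3.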

1.5 Requirement Specification

As this project concerns research on implementing an MPEG-2 compression profile on an FPGA, the requirement specification is more a statement of goals set for this project than a specification for the product that is being developed. This means that the major concern of the project is to evaluate the challenges when implementing MPEG-2 on an FPGA regarding mapping software to hardware and execution time, but there are still some requirements that the end product should meet for it to be applicable.

General Requirements
- A working MPEG-2 encoder on an FPGA platform.
- Record video stream from a camera on the FPGA platform.
- Ability to establish an Ethernet connection between the FPGA platform and a PC for streaming of output video.
- Output video stream quality high enough to identify people.

MPEG-2 Requirements
- Compression rate of around times [22, p. 41].
- MPEG-2 functions to be implemented (based on the info from appendix B):
  - RGB to YCbCr conversion.
  - Chroma subsampling.
  - Discrete Cosine Transform (DCT).
  - Quantization of luma and chroma DCT values.
  - I-, P- and B-pictures.
  - Motion prediction.
  - Variable length codeword (Huffman coding).

FPGA Platform
- Ethernet interface.

  - The Ethernet interface should be able to support a stream of 4 Mbit/s due to the minimum requirements for the MPEG-2 standard [11, p. 41].
- Camera interface.
  - Frame rate: 25 frames per second.
  - Resolution: the initial starting point is 160x128 and the target resolution is 640x480, which is above normal TV resolution.
  - Color camera.

In the next chapter the MPEG-2 compression profile is analyzed to establish which parts should be hardware implemented and which should be kept in software.
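As a rough sanity check of the camera and Ethernet requirements above (an illustrative calculation, assuming 24 bit RGB pels before any conversion or subsampling), the raw data rate at the target resolution is

    640 x 480 pels/frame x 24 bit/pel x 25 frames/s ≈ 184 Mbit/s

so meeting the 4 Mbit/s Ethernet requirement corresponds to a compression ratio in the order of 184/4 ≈ 46:1, which is why the compression step is central to the system.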

Chapter 2
Analysis of the MPEG-2 Algorithm

In this chapter an MPEG-2 encoder algorithm is selected and analyzed with regard to later hardware/software partitioning. The algorithm is profiled to find out which parts and functions of the algorithm take up most of the execution time. After profiling, the rest of the code is examined to create block diagrams describing the overall structure and to determine how the time consuming functions work together with the rest of the algorithm. The results from the profiling and block diagrams are used to choose one or more code pieces to analyze further. A set of metrics is then applied to these code pieces to determine whether or not they are suitable for a hardware implementation.

2.1 Choice of MPEG-2 Algorithm

The MPEG-2 encoder algorithm chosen for the project is developed by the MPEG-2 committee and is available free of charge from [23]. The last release of the encoder was July 19th 1996, meaning not all new features are included in this encoder. It is, however, able to encode using Simple and Main profiles and all levels (Low to High) as well as chroma subsampling 4:2:0 and 4:2:2. The profile specifies the quality of the compressed sequence while the level specifies the parameters for resolution and frame rate. The source code is chosen because of its simplicity and the fact it is developed by the MPEG-2 committee rather than a 3rd party software developer. Because of its simplicity, i.e. it is a console based encoder, it is easier to analyze the code, as there will be less excess code and fewer modules to identify. The source code for the encoder algorithm can be found on the enclosed CD in the source code/unmodified code folder.

2.2 Profiling of the Algorithm

The profiling is conducted in Bloodshed Dev-C++ while running compressions of three different video sequences:

- Uncompressed pictures generated in MATLAB, large movement in one direction (MATLAB sliding).
- Uncompressed pictures generated in MATLAB, random information, i.e. no movement directions (MATLAB random).
- Decompressed pictures from a Wallace & Gromit clay movie clip (W & G).

Three different video sequences have been chosen because different video input might require different encoding and thus different workloads. The video sequences can be found on the enclosed CD in the MATLAB sequences folder. The profiling is conducted on a PC running Windows XP SP2 using an AMD Athlon 64 X2, an nForce 590 SLI based motherboard and 2x512 MB DDR2 PC-8500 memory modules. Other hardware setups may produce different results and the number of background processes running may also affect the results.

Changes to the Algorithm

To profile the algorithm it is necessary to run the compiled algorithm from within the compiler. This means a few changes to the initialization of the algorithm have to be made. Also, part of the function calls are changed to allow for a more detailed profile of the algorithm. Only the overall changes will be documented in the report, while comments for specific changes will be added in the source files. The modified and original algorithm can be viewed on the enclosed CD in the source code folder.

Initialization

The unmodified algorithm is made in such a way that it can be executed from the command line with a set of arguments specifying the path of a parameter file and the destination of the MPEG-2 output file. Instead of reading the arguments from the command line, the initialization is changed such that these are specified at compile time. The parameter file is still located and loaded from a separate file on the hard drive.
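As an illustration of the kind of change described above, the snippet below sketches how command-line argument parsing can be replaced by compile-time constants so the encoder can be launched directly from the IDE for profiling. The file names and the readparmfile() prototype shown here are placeholders; the actual modifications are documented in the source files on the enclosed CD.

    /* Hypothetical sketch of hardcoding the encoder's input arguments for
     * profiling; file names are placeholders, not the project's actual paths. */
    #define PARAM_FILE  "encoder.par"   /* path to the parameter file           */
    #define OUTPUT_FILE "output.m2v"    /* destination of the MPEG-2 bitstream  */

    void readparmfile(const char *path);   /* provided by the encoder sources   */

    int main(void)                      /* was: int main(int argc, char *argv[]) */
    {
        readparmfile(PARAM_FILE);       /* parameters are still read from disk  */
        /* ... open OUTPUT_FILE, then call init() and putseq() as before ...    */
        return 0;
    }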

(a)
Function name \ video sequence             MLAB sliding   MLAB random   W & G
dist1()                                    %              %             %
fdct()                                     7.71 %         3.87 %        6.31 %
fullsearch()                               3.41 %         2.18 %        3.03 %
quant_non_intra()                          3.41 %         < 2 %         2.31 %
Functions with less than 2 % total time    %              7.34 %        9.14 %

(b)
Function name \ video sequence             MLAB sliding   MLAB random   W & G
nothxnothy()                               %              %             %
fdct()                                     7.74 %         4.49 %        6.55 %
hxhy()                                     7.12 %         7.49 %        8.54 %
nothxhy()                                  4.24 %         3.84 %        3.46 %
hxnothy()                                  3.37 %         3.65 %        3.43 %
fullsearch()                               2.50 %         2.06 %        3.51 %
Functions with less than 2 % total time    %              9.93 %        %

Table 2.1: Profiling results of the MPEG-2 algorithm using three different video sequences. (a) shows the results from the algorithm with changes to the initialization only and (b) shows results from the algorithm where the dist1() function has been divided into four subfunctions: hxhy(), nothxhy(), hxnothy(), nothxnothy().

Function calls

After the changes to initialization had been made, a preliminary profiling was conducted. The results from the profiling showed that the function called dist1() takes up approximately 79 % of execution time, see table 2.1(a). The dist1() function consists of four main for loops, but only one of the loops is executed in each call to dist1() depending on a set of if statements. To determine the distribution of execution time in each of the four loops, these have been converted to separate function calls (named hxhy(), nothxhy(), hxnothy(), nothxnothy()) and called depending on the same if statements. A detailed description of the dist1() function is presented in section 2.3.

Results

The results from the profiling can be seen in tables 2.1 (a) and (b). From the tables it appears that the dist1() function is the most time consuming part of the compression. When it is split into the four subfunctions hxhy(), nothxhy(), hxnothy() and nothxnothy() it is seen that almost all the time spent in dist1() is used in the subfunction nothxnothy(). Any other function in the algorithm uses less than 10 % of the total execution time. Summing the execution time of the subfunctions does not yield exactly the execution time of dist1(). This is the case because the results will vary even with the same input pictures and settings, due to processes running in the background while profiling.

Profiling with Visual Studio

A profiling of the code has also been conducted in Microsoft Visual Studio 2008 on another computer to see if the results produced are different. The MATLAB sequences, a still picture sequence and an already compressed sequence have been encoded. The numbers vary little from those obtained in Bloodshed Dev-C++ and the dist1() function still takes up an average of 80 % of the execution time. For the nothxnothy() part the execution time is on average 70 %.

2.3 The dist1() Function

dist1() is a function that calculates the absolute distance, as a numerical value, between two macroblocks in different images. This value corresponds to the p value in the MPEG-2 encoder scheme. It calculates this by comparing corresponding pels in the macroblocks and summing the absolute values of their differences. The distance calculated is used to determine if the two macroblocks contain similar image content and thus if they can be compressed by removing temporal redundancies. The dist1() function consists of four for loops, but only one of the loops is executed in each call to dist1() depending on two input parameters, hx and hy. These specify whether or not interpolation is to be used in the vertical and/or horizontal direction. For profiling purposes each of the four sections was rewritten to subfunctions that are called depending on the same set of if statements that are used in the dist1() function. The if statements are based on whether hx and hy are different from zero or not, and the subfunctions have been named to reflect this, i.e. the for loop where hx != 0 and hy = 0 has been named hxnothy() and similarly for the others. As the nothxnothy() part of the dist1() function takes up most of the execution time, this is analyzed further. The code piece and flow diagram of the code can be seen in listing 2.1 and figure 2.1:

Figure 2.1: Flow diagram of the nothxnothy() part of the dist1() function (the loop over j from 0 to h computes v = p1[i] - p2[i] for the 16 pels in a line, negates v if it is negative, accumulates it in s, advances p1 and p2 by lx, and breaks out early when s >= distlim).

1  if (!hx && !hy)
2    for (j=0; j<h; j++)
3    {
4      if ((v = p1[0]  - p2[0])<0)  v = -v; s+= v;
5      if ((v = p1[1]  - p2[1])<0)  v = -v; s+= v;
6      if ((v = p1[2]  - p2[2])<0)  v = -v; s+= v;
7      if ((v = p1[3]  - p2[3])<0)  v = -v; s+= v;
8      if ((v = p1[4]  - p2[4])<0)  v = -v; s+= v;
9      if ((v = p1[5]  - p2[5])<0)  v = -v; s+= v;
10     if ((v = p1[6]  - p2[6])<0)  v = -v; s+= v;
11     if ((v = p1[7]  - p2[7])<0)  v = -v; s+= v;
12     if ((v = p1[8]  - p2[8])<0)  v = -v; s+= v;
13     if ((v = p1[9]  - p2[9])<0)  v = -v; s+= v;
14     if ((v = p1[10] - p2[10])<0) v = -v; s+= v;
15     if ((v = p1[11] - p2[11])<0) v = -v; s+= v;
16     if ((v = p1[12] - p2[12])<0) v = -v; s+= v;
17     if ((v = p1[13] - p2[13])<0) v = -v; s+= v;
18     if ((v = p1[14] - p2[14])<0) v = -v; s+= v;
19     if ((v = p1[15] - p2[15])<0) v = -v; s+= v;
20
21     if (s >= distlim)
22       break;
23
24     p1+= lx;
25     p2+= lx;
26   }

Listing 2.1: The code used for the nothxnothy() part of the dist1() function.

Because of the high use of if statements the for loop may seem to be very control dependent, but further inspection reveals this is not necessarily the case.

First of all, the loop is repeated h times, where h (integer) is the height of the macroblocks (in pels), p1 and p2 (unsigned chars) are the addresses of the top left pels in each macroblock and lx (integer) is the distance (bytes in memory) between vertically adjacent pels in the macroblock. v (integer) is a temporary variable, s (integer) is the sum of differences (reset to 0 every time dist1() is called) and distlim (integer) is a predefined limit for s.

Inspecting one of the if statements it is seen that the difference (v) between two corresponding pels in each macroblock is calculated. The if statement then checks if v is less than zero, and if true the sign is changed (otherwise not). Regardless of the outcome of the if statement, the calculated value of v is added to s. Because the if statement only acts upon negative values of v and the action is a sign change, this is equal to taking the absolute value of v, i.e. the if statements can be viewed as absolute value operators rather than control elements.

Each macroblock is 16 pels wide; this means that the 16 if statements will calculate the absolute differences between every two corresponding pels in a line and sum these in s. For each loop the values of p1 and p2 will be incremented by lx, i.e. the byte distance to the next line, and the loop starts over. The addresses of p1 and p2 will then be equal to the pels in the following line in the macroblock. This will be repeated h times, i.e. until the differences of all pels in the macroblock have been calculated and summed in s, or s exceeds the limit defined by distlim.

By further inspecting the code it is seen that in one loop there are only four lines of code containing dependencies; s must be calculated before the if statement/break in lines 21-22, and all differences must be calculated before adding lx to the addresses p1 and p2 in lines 24-25. This means that the absolute differences calculated in lines 4-19 can be executed in parallel and then summed to find s. Since p1 and p2 are used for addressing in C, it is not necessarily needed to convert lines 24-25 in a hardware implementation, as addressing may be done in a smarter way. A flow diagram of a parallelized solution can be seen in figure 2.2. Lines 24-25 are not included and the adders used to calculate s have been arranged for the shortest critical path.

2.4 MPEG-2 Algorithm Blocks

In this section the basic blocks of the algorithm are described. Only the most important functions are described in separate subsections and only the major methods and function calls are shown in the block diagrams. The algorithm will be explored to no further extent than to identify the functions corresponding to the encoder blocks described in appendix B and the functions making calls to dist1().
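Before walking through the individual blocks, the parallelism argument from section 2.3 can be made concrete in software terms. The sketch below is a hypothetical, branch-free rewrite of one line of the nothxnothy() loop, not code from the reference encoder; it merely shows that the 16 absolute differences are independent and can be computed separately before being summed, which is the structure figure 2.2 maps onto hardware.

    #include <stdlib.h>   /* abs() */

    /* Hypothetical branch-free formulation of one line of the SAD in listing 2.1:
     * each |p1[i] - p2[i]| is independent of the others, so in hardware the 16
     * subtract/absolute units can run in parallel and feed an adder tree.        */
    static int line_sad(const unsigned char *p1, const unsigned char *p2)
    {
        int d[16];
        for (int i = 0; i < 16; i++)          /* independent: parallelizable       */
            d[i] = abs((int)p1[i] - (int)p2[i]);

        int s = 0;
        for (int i = 0; i < 16; i++)          /* in hardware: a 4-level adder tree */
            s += d[i];
        return s;
    }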

Figure 2.2: Flow diagram of a possible hardware implementation of dist1(). The if statements have been parallelized into 16 separate paths (each path loads p1[i] and p2[i], computes the absolute difference and feeds an adder tree that accumulates s, followed by the if (s >= distlim) break check, while p1 and p2 are advanced by lx). The diagram is repeated h times for a complete macroblock unless broken by the s >= distlim statement. Code used for addressing in C has not been included as this may differ in a hardware implementation, and the adders used to calculate s have been arranged for the shortest critical path.

main()

Main function of the algorithm. Most functionality is placed in subfunctions, meaning only the function calls and a few initializing and closing methods are placed in main(). main() reads in arguments from the command line which specify the path to the parameter file and the output file. A file stream to the output file is created, the parameters and quantization matrices are read in and the compression is initialized and started with init() and putseq(), respectively. When the compression is finished the file streams to the output and statistics files are closed. A block diagram of main() can be seen in figure 2.3.

Figure 2.3: Block diagram of the main() function (readparmfile(), readquantmat(), open file streams, init(), putseq(), close file streams). Only the most important methods and functions are shown in the diagram. main() calls the initializing functions, opens and closes file streams and starts the compression.

readparmfile()

readparmfile() reads in the parameters from the file specified on the command line. The parameters are then checked to prevent errors in the compression caused by illegal parameter settings.

putseq()

putseq() is the top function of the actual compression. It sets up the file header and rate control, determines which image number and image type is to be encoded (I/P/B) and calls the subfunctions for the specific compression tasks. The functions called do not change depending on image type, only the parameters with which they are called. putseq() also calculates and writes statistics to the statistics file as well as the encoded images to the output file and a file tail before returning to main(). All the blocks in between the first and last are repeated for each image until all images have been encoded. A block diagram of putseq() can be seen in figure 2.4. Comparing the block diagram to the MPEG-2 encoder coding scheme shown in figure B.1 page 110, one may notice there are no arrows showing a backward flow of data. This is intentional from the MPEG committee to make it easier to replace blocks. Data passed between the encoding of different images is handled by the use of buffers and data structures containing the needed information.

Figure 2.4: Block diagram of the putseq() function (puthdr(), putseqext(), putseqdispext(), determine frame number and type, set up parameters according to frame type, write frame header to statistics file, readframe(), motion_estimation(), predict(), dct_type_estimation(), transform(), putpict(), iquant_intra/non_intra(), itransform(), calcsnr(), stats(), putseqend()). Only the most important methods and functions are shown in the diagram. putseq() calls the subfunctions which perform the compression image by image. All the blocks in between the first and last are repeated for each image.

Image number and type

Because images are not encoded sequentially it is necessary for the algorithm to determine which image is to be encoded. This is determined by the number of images in a group of pictures (GOP) and the number of images in between either an I-image or a P-image. If for instance there are 12 images in a GOP and 2 images in between either an I- or P-image, the sequence of images would look like IBBPBBPBBPBB. The encoding order would then be image 1 (I), 4 (P), 2 (B), 3 (B), 7 (P), 5 (B), 6 (B), etc. This means the type of image to encode is also determined without considering the content of the image. More advanced algorithms may estimate which type of image is best to encode depending on the content of the image.

Set up parameters according to image type

The encoding functions of the algorithm are made in such a way that only the input parameters change when the type of image changes. This means it is necessary to set up these parameters before calling the encoding functions.

readframe()

readframe() is the function that reads the images from the hard drive. Depending on the image format (i.e. YCbCr, PPM) different read methods are used. For the YCbCr format a file stream to each file is opened and the pel data is copied from the hard drive to a destination in memory. Access to the images read is done by the use of pointers.

motion_estimation(), predict(), transform(), iquant_intra/non_intra(), itransform()

These correspond to the Motion Estimator, Form Prediction, DCT, Inverse Quantization and IDCT blocks shown in figure B.1 page 110, respectively. The only call to dist1() is located in the function fullsearch(), which is part of motion_estimation().

putpict()

Corresponds to the Quantization, Variable Length Coding and Video Buffer blocks shown in figure B.1 page 110. Furthermore putpict() writes the encoded image to the output file on a hard drive.
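As a side note to the image ordering described under "Image number and type" above, the following minimal sketch reproduces the encoding order for the example GOP (12 images, 2 B-images between anchor images). It is purely illustrative and hypothetical; the reference encoder derives the order from its parameter file rather than from code like this.

    #include <stdio.h>

    /* Hypothetical illustration of the encoding order for a 12-image GOP with
     * 2 B-images between anchor images (display order IBBPBBPBBPBB).
     * Expected output: 1 (I), 4 (P), 2 (B), 3 (B), 7 (P), 5 (B), 6 (B), ...   */
    int main(void)
    {
        const int gop_length = 12;   /* images per GOP                            */
        const int m = 3;             /* distance between consecutive anchor images */

        printf("%d (I)\n", 1);                       /* the GOP opens with an I    */
        for (int anchor = 1 + m; anchor <= gop_length; anchor += m) {
            printf("%d (P)\n", anchor);              /* the anchor is encoded first */
            for (int b = anchor - m + 1; b < anchor; b++)
                printf("%d (B)\n", b);               /* then the B-images before it */
        }
        /* The trailing B-images of the GOP (11 and 12 here) need the next GOP's
         * I-image as reference and are left out of this simple sketch.           */
        return 0;
    }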

calcsnr()

Calculates the signal-to-noise ratio between the original input image and the reconstructed image. This is only used for statistics.

stats()

Calculates the remaining statistics and writes all of these to the statistics file on the hard drive.

putseqend()

Writes the sequence and file tail to the output file on the hard drive.

fullsearch()

fullsearch() is the function that performs the search for matching blocks as described in section B.4 page 117. All macroblocks in the current image will, one at a time, be passed to the fullsearch() function along with a reference image. The reference image can be either an I-image or a P-image. motion_estimation() will determine what type of prediction is needed and pass the correct reference image (preceding or subsequent) to fullsearch(). fullsearch() will then determine which macroblocks in the reference image to compare the macroblock from the current image with, by means of a search window. The size of the search window is defined as a number of pels in the parameter file and may differ for P-images and B-images. The search will be initiated at the corresponding macroblock in the reference image and then spiral outwards until all macroblocks within the search window have been compared with the macroblock from the current image. The macroblock in the reference image will be moved only one pel at a time rather than the size of a macroblock. The macroblock in the current image will always be centered within the search window and any macroblocks in the reference image not within the search window will be skipped.

The first time the dist1() function is called within fullsearch() the value of distlim will be 65,536. The use of distlim is to tell the dist1() function what the smallest preliminary value of the distance between two macroblocks was, and break out of the function if the sum of differences, s, exceeds distlim, i.e. stop calculating the remaining differences as there is already a macroblock with a better fit. Since there are 256 pels in a macroblock with min/max values of 0/255 (unsigned char), s will never exceed (255 - 0) x 256 pels = 65,280. The actual value of s will be used as distlim for the subsequent call to dist1(). For any other subsequent calls to dist1() the value of distlim will be set equal to s if s is less than the previous value of distlim.

If it is greater than the previous value of distlim, s will simply be discarded. This means that, from an algorithmic point of view, there is no difference between including or not including lines 21-22 in listing 2.1. Taking this into account, and the fact that it may be possible to remove lines 24-25 as these are used for addressing in C, all dependencies in nothxnothy() can be removed. This makes it possible to calculate the differences of all pels in the two macroblocks in parallel as opposed to 16 pels for each loop.

2.5 Applying metrics to code pieces

Metrics are design tools used in the process of hardware/software co-design. These will aid the designer in matching the algorithm and architecture to optimize different parameters such as silicon area, speed and energy consumption. Metrics are applied to algorithms to determine different aspects such as control/processing/memory elements and potential parallelism. The algorithm or architecture can then be modified according to the results of the metrics. The metrics applied in this project are described in [25] and are a part of the Design Trotter framework. Three metrics have been applied:

- The criticity (γ) metric, which expresses how potentially parallelizable the code is. This metric has been chosen because parallel execution is a strength of FPGAs and therefore it is logical to examine if the code is parallelizable.
- The memory orientation metric (MOM), which expresses to how large a degree the code is dominated by memory accesses. This metric is chosen because it will show if special attention should be put into optimizing memory throughput.
- The control orientation metric (COM), which expresses to how large a degree the code is dominated by control elements. This metric is chosen because code with many control elements is usually better suited for a general purpose processor (GPP), and as such it will show if special attention should be put into optimizing control elements in the FPGA solution.

Several other metrics exist, e.g. the affinity metric [19] which describes what type of hardware platform the code is best suited for (GPP, digital signal processor, FPGA). Because the results of metrics may vary depending on the way an algorithm is implemented in a high level language, the results should be used as indicators only. Common sense still applies. Only the nothxnothy() loop has been chosen to apply the metrics to, because it accounts for an average of 65 % of the total execution time while any other function accounts for less than 10 %.

General notes

Because the code is written in a high level language, some of the functionality may be considered redundant or not easily defined as a particular operation for a hardware implementation. There are two cases in the nothxnothy() code piece:

1. Calculation of absolute value:
   - Can be considered as an operation, which makes the if statement redundant.
   - Can be considered as it is written in C, i.e. a control element with an outcome that may or may not require another operation.
2. Addressing pels using array indexes:
   - The address values p1 and p2 can be calculated as in C and thus require operations.
   - Can be considered as a part of the addressing system and as such not counted as separate operations.

It is chosen by the project group that calculating the absolute value is considered as an operation and the addressing of pels is part of the addressing system. Counters used in the for loop are not considered as separate operations.

Criticity (γ) Metric

The criticity metric expresses the average parallelism of the code it is applied to. It does this by calculating the ratio between the total number of operations and the number of operations in the critical path:

    γ = (# of operations) / (critical path)                        (2.1)

The closer γ is to the total number of operations, the more potentially parallelizable the code is. To calculate a value that is more intuitive the normalized γ metric is used [19]:

    γ_n = 1 - (critical path) / (# of operations)                  (2.2)

Using figure 2.2, the critical path contains 9 operations: a memory access, a subtraction, an absolute value, 5 additions and an if statement using the s value. The total number of operations is 81. The normalized γ value is then:

    γ_n = 1 - 9/81 = 0.89                                          (2.3)

The calculated γ is for one loop only, but will remain the same for h loops. As expected, γ_n is close to 1, indicating that the code piece is potentially parallelizable.
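For reference, the operation count of 81 used above can be broken down from figure 2.2 for one line of the loop (this tally is a reconstruction consistent with the totals used in the text):

- Memory accesses: 16 pels read from each of the two macroblocks = 32.
- Processing: 16 subtractions + 16 absolute values + 16 additions (15 in the adder tree plus the final accumulation into s) = 48.
- Control: 1 (the s >= distlim check).
- Total: 32 + 48 + 1 = 81; critical path: 1 memory access + 1 subtraction + 1 absolute value + 5 additions + 1 control = 9.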

Memory Orientation Metric (MOM)

The MOM expresses to how large a degree a function is dominated by memory accesses. It does this by calculating the ratio between the number of memory accesses and the total number of memory accesses, processing elements and control elements:

    MOM = (# memory accesses) / (# processing + # memory accesses + # controls)          (2.4)

For each loop there are a total of 32 memory accesses to the pels in two different macroblocks. The total number of elements used in the γ metric was found to be 81. Of these 48 are processing, 1 is control and the remaining 32 elements are the memory accesses. The MOM is then:

    MOM = 32/81 = 0.40                                                                   (2.5)

The calculated MOM is for one loop only, but will remain the same for h loops. With a MOM of 0.4 the nothxnothy() loop contains a considerable amount of memory accesses, but is not dominated by these. However, remembering the results from the profiling, the loop takes up more than 60 % of the total execution time. Because of this the loop will, in total, still have a very large number of memory accesses. Thus it may be possible to gain speedups in the execution if the memory interface is optimized.

Control Orientation Metric (COM)

The COM expresses to how large a degree a function is dominated by control elements. It does this by calculating the ratio between the number of control elements and the total number of memory accesses, processing elements and control elements:

    COM = (# controls) / (# processing + # memory accesses + # controls)                 (2.6)

As stated for the MOM, there is only 1 control element in the loop, 48 processing elements and 32 memory accesses. The COM is then:

    COM = 1/81 = 0.012                                                                   (2.7)

The calculated COM is for one loop only, but will remain the same for h loops. With the COM value being close to 0 there are very few control elements in the function and there will not be any significant speedups to gain from optimizing this part of the code.

2.6 Conclusion

In this chapter an MPEG-2 encoder algorithm has been selected and analyzed with regard to later hardware/software partitioning. The algorithm has been profiled and the results showed that on average 80 % of the total execution time was spent in a function called dist1(). This function consists of four loops of which only one is executed in each call to dist1(). The loop which takes most of the execution time, nothxnothy(), uses on average 65 % of the total execution time, while the remaining loops and any other function use less than 10 %.

Following the profiling, the dist1() function, specifically the nothxnothy() part, was examined in depth and it was found that part of it could be parallelized. Block diagrams showing the most important functions and methods were then made and the use of the dist1() function was identified. This showed that it may be possible to remove dependencies inside the nothxnothy() loop and thus make it possible to parallelize the nothxnothy() loop to a further extent.

The chapter is concluded with the calculation of the three metrics, γ, MOM and COM, applied to the nothxnothy() part of dist1(). These confirmed the observations made in the previous sections, such as a high potential to parallelize the code, very little control used and a moderate amount of memory accesses. These observations are used in the hardware/software partitioning as a basis for deciding which parts should be hardware implemented and which should be software implemented.

Chapter 3
Hardware/Software Partitioning

In this chapter the system is divided into modules to decide which parts of the system should be implemented in hardware or software. The original system contains three main modules: Camera, MPEG encoding and Ethernet (handling TCP/IP), as illustrated in figure 3.1. To establish which parts of the MPEG-2 encoding should be implemented in hardware, profiling and metric calculations are used.

Profiling of the MPEG-2 algorithm was done to find the parts of the code that had the longest execution time. Here it is important to note that the longest execution time is not necessarily where there are most operations, but also depends on the time these operations take, i.e. a multiplication often takes longer on a GPP than an addition. The parts of the code that take up most of the execution time are of interest, because a hardware implementation of these may reduce the total execution time, especially if parallelization of the code is possible. To establish whether or not the code could be parallelized, metric calculations were done. These also gave knowledge of how control and memory oriented the code was, to help clarify if the module should be implemented in hardware on an FPGA or executed as software by a GPP.

As explained in the beginning of the chapter, the system is divided into three basic modules based on functionality used to create the desired system. The modules are Camera, MPEG-2 and Ethernet as illustrated in figure 3.1.

Figure 3.1: The three basic modules of the original system (Camera, MPEG-2, Ethernet).

The next step is to explode these modules into submodules for hardware and software to define the partitioning. New submodules called interfaces are added between the hardware submodules and software submodules.

This is mainly done as these are rather complicated interfaces compared with an interface between two software submodules. The Ethernet module is divided into three submodules: the Ethernet hardware part, the interface to the software and the TCP/IP submodule. The TCP/IP submodule is based on an existing software implementation running on a GPP, and is therefore added as a software submodule. The Camera module can be divided into two submodules, a hardware part and an interface to the software. This is illustrated in figure 3.2.

Figure 3.2: Dividing the basic modules into submodules and hardware/software interfaces (Physical Camera (Hw), Camera interface (Hw/Sw), MPEG-2 (Sw), TCP/IP Protocol (Sw), Ethernet interface (Sw/Hw), Physical Ethernet (Hw)).

Attention must be paid to minimize communication over these hardware/software interfaces. Decreasing the amount of data moved into the software modules would increase the amount of time available to process the data, as all the modules would most likely share the same GPP. To investigate if this is possible the MPEG-2 algorithm is exploded in figure 3.3. The explosion is based on appendix B where the MPEG-2 algorithm is explored.

Figure 3.3: Explosion of the MPEG-2 algorithm (Physical Camera, Camera interface, MPEG-2 with RGB to YCbCr, Chroma Subsampling, Motion estimation, Prediction, DCT, Quantization, Inverse Quantization, Inverse DCT, Variable length coding and Video Buffer, TCP/IP Protocol, Ethernet interface, Physical Ethernet).

The chroma subsampling described in appendix B performs a compression of the picture data. Therefore it is moved to hardware, which decreases the amount of data moved over the hardware/software interface for the camera. This means that the conversion to YCbCr must be moved into hardware as well. In chapter 2 it was found that the subfunction nothxnothy() of dist1(), used in the Motion Estimation in the MPEG-2 algorithm, was suitable for hardware acceleration. Moving this to hardware creates another hardware/software interface in the MPEG-2 module. Figure 3.4 illustrates the final hardware/software partitioning. From this diagram the three original modules still exist, but some of their submodules have been moved around to make a more efficient system.

It is also now defined where each submodule belongs, whether it be software or hardware.

Figure 3.4: The final hardware/software partitioning of the system (Camera Module, MPEG-2 Module and Ethernet Module split across hardware and software: the Physical Camera, the Camera Hardware Interface, RGB to YCbCr, Chroma Subsampling, nothxnothy(), the Physical Ethernet and the Ethernet Hardware Interface are placed in hardware, while motion estimation, prediction, DCT, inverse DCT, quantization, inverse quantization, variable length coding, the video buffer and the TCP/IP protocol remain in software, connected through the Camera, MPEG-2 and Ethernet Hw/Sw interfaces).

Each of these three modules is to be implemented and tested as described by the V-model. This is also true for their submodules, though the testing of each submodule is only performed when it is deemed necessary. After the hardware/software partitioning the modules are designed, implemented and tested.


Chapter 4
Hardware Design

After identifying the modules and their submodules in the hardware/software partitioning, different tools such as Quartus and Verilog code are used to design and implement the hardware modules and submodules. As this is a development project where the focus should be on the analysis and implementation of the MPEG-2 encoder onto some type of platform, rather than on the selection of different hardware components and making them interact, it is decided to use an existing platform for developing and testing the implementation of the specific algorithm. The first step in the hardware design is therefore to select which development board to use for this project.

4.1 Selection of Development Board

In this section the development board for the project is selected. There are currently two different boards with a camera available at the university. One system is called RC203 from Celoxica and it utilizes a Xilinx Spartan FPGA. The other board, the DE2, is from Terasic and contains an Altera Cyclone II FPGA. Both kits contain a camera and some example programs showing how to utilize it.

Descriptions of the Available Boards

To make a selection of which board to use, the two boards are compared to each other in table 4.1.

                             Terasic DE2                 Celoxica RC203
FPGA
  Type                       Altera Cyclone II EP2C35    Virtex 2 V
  Logic Elements             33,216                      -
  Slices                     -                           14,336
  Speed                      100 MHz                     160 MHz
  SoftCore Processor         Nios II                     MicroBlaze
  Internal RAM               Kibit                       448 Kibit
  I/O
  Embedded Multipliers
SRAM
  Type                       -                           -
  Amount                     512 Kibyte                  -
  Speed                      -                           -
  Buswidth                   16 bit                      -
SDRAM
  Amount                     8 Mibyte                    4 Mibyte
  Speed                      50 MHz                      -
  Buswidth                   16 bit                      -
Camera
  Type                       CMOS                        CCD
  Resolution                 1.3 MPixels (1280x1024)     200 lines CCD (320x200)
  Frame rate at max res.     15 fps                      25 fps
Ethernet
  Speed                      10/100 Mbit/s               10/100 Mbit/s
Flash RAM
  Amount                     1 Mibyte                    -
External Storage
  Type                       SD Card                     SD Card
  Size                       -                           -

Table 4.1: Comparison table between the development boards that are available to this project. The information is provided by [7] and [21].

Conclusion

There are no noticeable differences between the FPGAs presented in table 4.1, though it is not possible to directly compare the number of slices with the number of logic elements. Therefore, with these boards the only differences are in the camera and the amount of external RAM available. Here the priority of having a high resolution camera goes in favor of the Terasic board. Therefore the choice of which development board to use falls on the DE2 board from Terasic.

4.2 Design Tools

The main design tools used for the project are listed below; these are all part of the Altera Quartus II Design Suite. [1]

Quartus II is the main software tool for developing applications for FPGAs. It is produced by Altera, and allows developers to go through all the steps of an FPGA design: analysis and synthesis of HDL designs (written either in VHDL or Verilog HDL), compiling, simulation, and finally programming of the device. Quartus II is used to transfer the compiled code onto the FPGA, via a USB connection.

SOPC Builder is a system generation tool included in Quartus II which allows the components required by a system to be integrated automatically. It has an intuitive graphical user interface for easily building and modifying a system: it makes it possible to easily choose and customize components, then select connections between these and the Nios II softcore processor, generate a system including interconnects and automatically generate memory mapped header files. It also automatically integrates SOPC Builder components, like Altera Intellectual Property (IP) or Altera Megacore functions.

The Nios II Integrated Development Environment (IDE) is used for developing sequential C-code for the Nios II family of embedded processors. It includes a project manager, a source editor with a C/C++ compiler, a software debugger (which can connect to the FPGA hardware via a JTAG cable) and a flash programmer.

ModelSim is used for testing the different Verilog modules of the project. It provides a simulation and debug environment for FPGA designs, and supports several languages such as Verilog, SystemVerilog, VHDL and SystemC. It allows performing everything from integration to emulation, Hw/Sw co-verification and mixed-signal simulation, and also provides an integrated debug environment.

4.3 Hardware Partitioning

In addition to the modules from the hardware/software partitioning, the physical system needs to contain more hardware modules than the ones listed. These are a general purpose processor (GPP), RAM and a bus to connect these hardware modules. To make development easier the hardware is divided into modules and processed individually in the following sections.

As a GPP, Altera provides a softcore processor called Nios II which can be implemented and run on Altera FPGAs. This removes the need for an external GPP to execute the software submodules. From Terasic, the vendor of the development board, there are several demonstration projects that include the Verilog HDL code for all the hardware components that are available on the development board, like the VGA controller, audio and other functions. The existing Verilog code from these demonstration projects is to be modified to fit this project, and new hardware submodules for the camera and the hardware accelerated part of the MPEG-2 algorithm are implemented. On module level the hardware system looks as illustrated in figure 4.1.

Figure 4.1: The hardware modules in the system (Camera, MPEG and Ethernet modules connected to the GPP and RAM via a bus).

In addition to the hardware components shown in figure 4.1, the chosen demonstration project also includes a VGA controller, audio chip, UART, SD card reader, 8x 7-segment displays, LCD display, USB controller, switches and buttons. These are removed from the demonstration project to decrease the resource usage on the FPGA.

4.4 The Avalon Bus

The Avalon bus is the standard bus architecture developed by Altera to make the interconnection between all the hardware components connected to the FPGA as well as the hardware components inside the FPGA. The bus is created by SOPC Builder. This section is based on information from [4].

The Avalon bus contains three types of signals: address, data and control signals. The hardware components work with the concept of a master and a slave. The master controls the Avalon bus; it communicates with a slave by writing/reading data to/from a specific address space which is dedicated to that particular slave. In order to check whether a slave has data available for the master, the master can either continuously poll the slave or the slave can send an interrupt to the master to indicate that data is available. This is a very common design for communication between hardware components.

An advantage is that the Avalon bus handles differences in bus widths, so connecting a 16-bit component to a 32-bit bus does not present a problem since the Avalon bus handles this internally. When there are differences in bus widths, the Avalon bus handles them by one of two methods: Native bus addressing, where one 32-bit master address is assigned for every 8- or 16-bit slave address, or Dynamic bus sizing, where one 32-bit master address is translated into two 16-bit slave addresses, etc.

A master can only initiate transfers and a slave can only receive transfers. The Avalon bus acts as the real master as it decides who gets the bus. This means that more than one master peripheral can be connected to the Avalon bus, which makes it possible to use more than one softcore processor in the system; but since a processor is a rather large component that consumes a lot of resources, the number of processors should be kept at a minimum.

Without any wait states the bus runs at full speed, and the transfer rate for the system therefore depends only on the bus width and the system clock frequency. As the system clock runs at a frequency of 100 MHz, the transfer rate for a 16-bit bus is:

TR_avalon = 16 bit * 100 MHz = 1.6 Gbit/s  (4.1)

For a 32-bit bus at 100 MHz the transfer rate would be [4]:

TR_avalon = 32 bit * 100 MHz = 3.2 Gbit/s  (4.2)

The Avalon bus also handles interrupts from external devices such as Ethernet and UART. These are given individual interrupt numbers inside the SOPC Builder, and the interrupt lines are then passed on to a softcore processor (or another device) which handles the interrupt in software. When assigning interrupts to external devices it is important to note that interrupt 0 is not available and that two devices cannot share the same interrupt line.

To connect internal and external devices that need to run at a different clock speed, or to reduce the number of pins out, the Avalon bus has three different types of bridges to

choose from: Pipeline, Clock and Tristate. It is possible to use more than one type of bridge in a system. Each type of bridge is explained in the following sections.

The Clock Bridge

This bridge is normally used when components with different clock frequencies need to communicate. It basically consists of two FIFO buffers, one for each direction of the bridge. Data is moved to and from the FIFOs at the clock rate of the individual buses, thereby eliminating the need for the clocks to be synchronized. The speed of this bridge is limited by the component with the slowest clock.

The Pipeline Bridge

The pipeline bridge increases the transfer rate when a large amount of data needs to be read from or written to an external device that has a delay from when the system asserts the address until the data is ready. The pipeline works by changing the address to write to or read from each cycle. In order for this to work the external device must support pipelining. After the number of delay cycles required by the external device, data is then written or read every cycle with no further delay. A read or a write operation takes one clock cycle, and the transfer rates of the pipeline bridge for a 16-bit and a 32-bit bus at 100 MHz are therefore the same as for the Avalon bus. The pipeline bridge can only be used if the peripheral or the internal component supports it, and it is only useful if the component has some sort of wait state between the address being asserted and the data being ready to read or write. [4]

The Tristate Bridge

The tristate bridge is normally used for off-chip interconnection such as an external bus where more than one peripheral is attached. It is also used to interface asynchronous peripherals. The maximum transfer speed is half the speed of the pipeline bridge, as the address is asserted on the rising edge of the clock and data is read on the next rising edge, meaning it takes two clock cycles to complete a transfer. If the component connected to the Avalon bus requires wait states, the bandwidth is further lowered by the number of wait states inserted. The maximum transfer rate that the tristate bridge can achieve in normal operation with no wait states thus amounts to half the transfer rate of the Avalon bus. [4]

Burst Transfer

All the types of bridges and the Avalon bus support burst transfer. Burst transfer makes it possible to transfer data every cycle to and from the Avalon bus. The master indicates how many consecutive data reads/writes are to be performed and indicates the address of the first packet. The Avalon bus then increases the address based on whether it uses the Dynamic or the Native addressing scheme. For Native addressing the address value from the master remains constant; if the Dynamic scheme is used the address is incremented by one after each clock cycle.

Conclusion

The demonstration project that is used as a basis for the development and implementation only uses one tristate bridge, which connects to the flash memory. There is no immediate reason to make any change to the implemented Avalon bus as it gives the functionality needed for this project.

4.5 General Purpose Processor

As mentioned in section 4.3 a Nios II softcore processor is added to the hardware to function as a GPP. Different options for the GPP can be configured through the SOPC Builder, removing actual code development for implementing a GPP on the FPGA. The following specifications are given for the Nios II processor:

Full 32-bit instruction set, data path, and address space.
32 general-purpose registers.
32 external interrupt sources.
Single-instruction 32 x 32 multiply and divide producing a 32-bit result.
Dedicated instructions for computing 64-bit and 128-bit products of multiplication.
Floating-point instructions for single-precision floating-point operations.
Single-instruction barrel shifter.

Altera provides 3 different versions of their Nios II processor. The economy version has the smallest FPGA usage but also the slowest performance. The standard version is a compromise of

size vs. performance, giving the smallest size possible while maintaining performance. The last version is the fast version; it has the highest performance as well as the most features for increasing performance, but it also consumes the largest amount of resources on the FPGA.

It is possible to implement some of the instructions of the processor in hardware. The multiply instruction can be implemented in hardware in different ways; the best solution is to use a DSP block, but it can also be implemented in the Logic Element blocks (slower). It is also possible to make a hardware implementation of the divide instruction, which gives a performance improvement. As standard there is also a floating-point custom instruction available for the Nios II processor. It supports single-precision addition, subtraction and multiplication; division can be added as a custom block. A custom instruction decreases the execution time for a given instruction. The cost of the improved performance provided by custom instructions is the area consumed by the Nios II core on the FPGA.

The Nios II processor supports an instruction cache from 512 bytes up to 64 kilobytes. To further improve performance the instruction cache supports burst transfer, which increases the transfer rate to the cache. The Nios II also has a data cache that supports the same sizes and burst transfer as the instruction cache. Burst transfer is described in section 4.4.

Implementation

The GPP is already implemented in the demonstration project and the processor is the fast version, so there is no computational performance increase to be gained by changing it to another version. The only change made to the processor is an increase of the data and instruction caches. This should give an increase in performance, but it is not something that can be quantified in general, as it depends on the code being executed on the processor. An increased cache should lower the number of RAM reads and writes as more data is available in the cache, thereby lowering the time it takes to access data. The total amount of internal RAM on the FPGA amounts to around 60 Kbytes. As other hardware modules also use this memory the softcore should not consume too much of it. In the demonstration project the data cache and instruction cache consume 4 Kbytes each. This value is increased to 8 Kbytes as a tradeoff between memory consumption and performance. No other changes were made to the existing project.

Conclusion

The softcore GPP processor is already implemented in the demonstration project and there is no reason to make a new one. The only change made to the existing system is an increase of the

instruction and data caches.

4.6 Random Access Memory

The DE2 board has 8 Mibytes of SDRAM available running at 50 MHz. As the uclinux operating system (see section 5.2) requires a minimum of 8 Mibytes to function, the existing RAM block on the DE2 board is replaced by a 16 Mibyte block of SDRAM. The two modules are pin compatible, so the only necessary change to the existing Verilog code is an increase in the address space for the RAM. This is done rather easily in the SOPC Builder. The new SDRAM module has better timing specifications than the 8 Mibyte module. To investigate whether it is possible to change the default settings, the following performance calculations are done. For 50 MHz operation, a Column Address Strobe (CAS) latency - the delay in clock cycles from when the address is set until the data is moved to the output - of 3 or 2, and pipelining, the transfer rate can be specified as:

TR = (bus width * operating frequency) / (CAS latency + address ready delay)  (4.3)

TR_CAS3 = (16 bit * 50 MHz) / (3 + 1) = 200 Mbit/s  (4.4)

TR_CAS2 = (16 bit * 50 MHz) / (2 + 1) = 266 Mbit/s  (4.5)

TR_Pipeline = 16 bit * 50 MHz = 800 Mbit/s  (4.6)

As the Avalon bus operates at a frequency of 100 MHz, the transfer rates for the RAM at this frequency are also calculated:

TR_CAS3 = (16 bit * 100 MHz) / (3 + 1) = 400 Mbit/s  (4.7)

TR_CAS2 = (16 bit * 100 MHz) / (2 + 1) = 533 Mbit/s  (4.8)

TR_Pipeline = 16 bit * 100 MHz = 1.6 Gbit/s  (4.9)

The maximum speed of a 16-bit Avalon bus is 1.6 Gbit/s, whereas the maximum speed of the SDRAM, obtained by setting its frequency to 100 MHz and its CAS latency to 2, would only provide a bandwidth of 533 Mbit/s. Therefore the Avalon bus is not going to present itself as a bottleneck for the SDRAM.

The default setting of the old RAM is a CAS latency of 3 cycles and an operating frequency of 50 MHz. Decreasing the CAS latency to 2 cycles results in an increase in bandwidth of 33 %. Doubling the operating frequency of the RAM would double the bandwidth to the RAM. The best solution would therefore be to make the RAM block run at the maximum speed of the Avalon bus, 100 MHz. There are, however, other concerns: the manufacturer of the DE2 board may not have designed the printed circuit board to operate at this frequency, and the electrical signals could degrade to a point where it is no longer possible to get the RAM to function properly.

Implementation of the RAM module

Changing the frequency involves changing the phase locked loop (PLL) that generates the clock for the SDRAM to 100 MHz instead of 50 MHz, and changing the SDRAM configuration in the SOPC Builder to also support 100 MHz. Changing the CAS latency and the address space for the RAM module is also done in the SOPC Builder, by changing the configuration of the RAM module. As there are only four possible configurations, the changes can be tested one by one to find the fastest configuration.

Testing of the RAM module

The changes were added to the system and uclinux was started as described in appendix A.3 to verify that it could boot with the new configuration. The result was that the RAM could run at 100 MHz with a CAS latency of 3, which was also confirmed to be possible by different uclinux news groups. The increase of the physical size was tested by starting uclinux and checking that the amount of RAM available was indeed increased; note that uclinux would crash if the RAM was configured wrongly.

Conclusion

The standard configuration of the RAM module was changed to increase the throughput of the system; an increase of the clock frequency from 50 MHz to 100 MHz was found possible at a CAS latency of 3. Another change made to the RAM was increasing the address space to accommodate the larger RAM component installed on the development board.

4.7 Camera Hardware Module

Technical specifications

The camera used for the project is based on the CMOS digital image sensor MT9M011, manufactured by Micron Imaging. It is a small digital image sensor with a resolution of 1,280 x 1,024 (WxH) (SXGA resolution) and a frame rate of up to 150 fps (@352x288). Some signal control and processing blocks are embedded with the sensor itself, such as a 10-bit ADC, timing and control registers and a simple two-wire serial programming interface. The sensor can be programmed by the user to change parameters such as resolution, frame rate and exposure time. [16]

Figure 4.2: Block diagram of the MT9M011 sensor: control registers, timing and control, serial I/O, the SXGA (1,316H x 1,048V) active pixel sensor (APS) array, analog processing and the ADC driving the data output. [16]

The MT9M011 sensor has been integrated by Terasic into a small board called TRDB-DC2, which is connected to the DE2 board with a simple IDE cable. Thus there is no need to spend time on the hardware interface between the camera and the DE2 board. Terasic also provides some example design files written in Verilog HDL, which can be used to get started with the camera and the DE2 board. Table 4.2 summarizes the technical specifications of the TRDB-DC2 module.

Verilog HDL Code for Camera module

To improve the execution time of the MPEG-2 compression algorithm on the FPGA, modules of the algorithm are implemented as integrated circuits using the hardware description language Verilog HDL. Verilog is chosen because of its similarity with the well-known C programming language and its hierarchy of modules, making it easy to develop multiple parts of the system at the same time. Verilog also qualifies as a dataflow language, meaning that it models a program as a directed graph of data between operations, which provides a clear overview of how data is moved and manipulated in the program.

Features of the CMOS sensor core:
Optical format: 1/3-inch (5:4)
Focal length: 4.8 mm
Optical aperture: F/2.8
Focusing range: 50 cm to infinity
Sensor array format: 1280 (H) x 1024 (V)
Pixel size: 3.6 µm x 3.6 µm
Active imager size: 4.6 mm (H) x 3.7 mm (V), 5.9 mm (diagonal)
Color filter array: RGB Bayer pattern
Shutter type: Electronic Rolling Shutter (ERS)
Maximum data rate / master clock: 25 MPS / 25 MHz
Resolution / programmable frame rate: SXGA (1280 x 1024) up to 15 FPS, VGA (640 x 480) up to 60 FPS, CIF (352 x 288) up to 150 FPS
Sensor ADC resolution: 10-bit, on-chip
Responsivity: 1.0 V/lux-sec (550 nm)
Dynamic range: >71 dB
SNR_MAX: 44 dB
Features of the image signal processor:
Programmable control I/F: two-wire serial interface
Output format: raw data output
Major electrical characteristics:
Power consumption: 129 mW (full resolution mode), 70 mW (preview mode)
Operating voltage: Digital 2.5 V - 3.1 V (2.8 V nominal), Analog 2.5 V - 3.1 V (2.8 V nominal), I/O 1.7 V - 3.6 V
Operating temperature: -30°C to +70°C
Table 4.2: Parameters and specifications of the camera module. [16]

This chapter deals with the parts of the system that the project group has decided to implement in Verilog HDL code. The goal of implementing modules in Verilog code is to remove workload from the softcore processor implemented on the FPGA and to make use of the parallel execution of modules available on an FPGA. The code implemented is provided by Terasic (manufacturer of the Altera DE2 board) and Xilinx (manufacturer of Spartan and Virtex FPGAs) and is modified to suit the tasks needed in the system. As the interface between the camera and the rest of the system is given as Verilog code from Terasic, it is obvious that this part should be kept in Verilog. The output from the camera is the most data intensive part of the whole system, as no compression is utilized at this point. It is desirable to remove some of this data before interfacing with the Nios II softcore processor, as

data exchange between hardware and the Nios II core takes up clock cycles. Possible ways of reducing data are reducing resolution and frame rate, and after looking through the modules of MPEG-2 it becomes clear that one of the first things that needs to be done is chroma subsampling. This is a relatively simple task and it reduces the amount of data by half if 4:2:0 subsampling is used. Doing this in hardware implies that the first couple of steps of the system also need to be implemented in hardware. This includes raw camera data to RGB conversion, RGB to Y'CbCr conversion and finally the chroma subsampling, before data is moved to the Nios II core for further compression.

Code for Physical Camera Submodule

The module that handles capturing data from the camera is called CCD_Capture.v. This module takes data from the camera and puts it into registers where it is provided for the next module. Besides data capturing, it also creates 11-bit X- and Y-coordinates for each camera pixel, covering all 1,280 x 1,024 pixel positions, and lastly it counts the number of frames. The camera settings are handled by the module I2C_CCD_Config.v, which configures the camera through an I2C bus. Here it is possible to reduce the resolution (by decreasing the number of rows and columns in the image to be read out and thus increasing the blanking area), change the gain of red, green and blue, and set the exposure time. Figure 4.3a shows the pixel array of the camera with a black frame around the edges and a boundary frame around the image itself. In I2C_CCD_Config.v it is possible to set the horizontal and vertical boundaries, thereby changing the number of rows and columns that should be black and in this way changing the resolution. Exposure can be changed by 16 of the switches on the FPGA board, where a higher value gives a longer exposure time and thereby a brighter image.

Raw to RGB Submodule

The raw data from the camera is divided into three registers, red, green and blue, by the module RAW2RGB. Here the X- and Y-coordinates are used to extract the right RGB values, which are located in the camera as shown in figure 4.3b. The camera uses a Bayer color pattern [16] where even-numbered rows and odd-numbered columns contain red and green color pixels, and odd-numbered rows and even-numbered columns contain blue and green color pixels. Knowing that this structure applies throughout the picture, it is possible to extract the right colors by looking at the LSBs of the X- and Y-coordinates.
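As an illustration of this addressing scheme, the sketch below classifies an incoming raw pixel from the coordinate LSBs. It is not the RAW2RGB code from Terasic; module and signal names are chosen for illustration only, the row/column parity follows the Bayer description above, and a full demosaic would additionally need neighboring pixels to interpolate the two missing components, which this sketch omits.

// Hypothetical sketch: classify a raw Bayer pixel from the coordinate LSBs.
module bayer_decode (
    input  wire [10:0] iX_Cont,   // pixel X-coordinate from CCD_Capture.v
    input  wire [10:0] iY_Cont,   // pixel Y-coordinate from CCD_Capture.v
    input  wire [9:0]  iRaw,      // 10-bit raw pixel value from the sensor
    output reg  [9:0]  oRed,
    output reg  [9:0]  oGreen,
    output reg  [9:0]  oBlue
);
    always @* begin
        oRed = 10'd0; oGreen = 10'd0; oBlue = 10'd0;
        case ({iY_Cont[0], iX_Cont[0]})
            2'b00: oGreen = iRaw;   // even row, even column
            2'b01: oRed   = iRaw;   // even row, odd column
            2'b10: oBlue  = iRaw;   // odd row, even column
            2'b11: oGreen = iRaw;   // odd row, odd column
        endcase
    end
endmodule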

Figure 4.3: (a) The resolution is changed by blanking a large area; the highest resolution also has a blanking area used for internal black level adjustment. (b) The Bayer color pattern used in the camera. [16]

RGB to Y'CbCr and Chroma Subsampling

A color space conversion from RGB to Y'CbCr is needed, as Y'CbCr is the color space for digital component video used in television. RGB, or its gamma-corrected version R'G'B', consists of the three primary colors red, green and blue, which displayed together give a specific color. Figure 4.4 illustrates the RGB and Y'CbCr color spaces. Conversion of R'G'B' to Y'CbCr is done due to bandwidth and storage issues. As the eye is more sensitive to changes in brightness than to changes in color, it is valid to reduce the color information while keeping the luma information, by use of chroma subsampling. 60% to 70% of the brightness is in the green color; therefore color is represented by a blue chroma level (Cb) and a red chroma level (Cr), leaving out the green color and thereby the brightness/luma from CbCr. If RGB were used to represent a single color, equal bandwidth would have to be spent on all three colors. By using Y'CbCr, chroma subsampling can be applied, maintaining the bandwidth of the brightness but reducing the bandwidth of CbCr with minor perceptual degradation.

Figure 4.4: RGB color space and Y'CbCr color space with the RGB color space mapped onto it. [5]

In the modules csc_top.v and csc.v, RGB is converted to Y'CbCr. Due to later storage of the

Y'CbCr values, the word length of the output is set to 8 bits. Thus four luma and/or chroma values can be saved in 32 bits. The reference design on which the RGB to Y'CbCr conversion is based is modified to do 4:2:0 subsampling, as explained in appendix B.3. This is done by use of the X- and Y-coordinates from the CCD_Capture.v module. Figure 4.5 shows how four luma values are concatenated with one pair of chroma values with respect to the X- and Y-coordinates for a picture with resolution [NxM]. csc_top.v takes care of all the data repositioning, so the data is stored as four luma values to one blue chroma and one red chroma.

Figure 4.5: The chroma values Cb and Cr are calculated together with the first luma value in each 2x2 square. The four luma values in each square share these two chroma values to provide 4:2:0 subsampling.

In memory the luma and chroma values are stored as in figure 4.6. The pattern shown is repeated for every two lines of a frame.

Figure 4.6: In memory the luma values are stored with the top left pel first and the bottom right pel last. In even lines a pair of chroma values is placed after every second luma value; odd lines contain only luma values. The pattern shown in the figure is repeated for every two lines of a frame.

The csc.v module does the conversion from RGB to Y'CbCr as described in section B.3, where Y'CbCr was calculated by equation B.2. These equations, however, are based on the RGB values ranging from 0 to 1. In the Verilog code the RGB values are represented by 8 bits, giving them

a range from 0 to 255, which leads to three new equations [5]:

Y' = 16 + (0.257 R' + 0.504 G' + 0.098 B')  [-]  (4.10)

Cb = 128 + (-0.148 R' - 0.291 G' + 0.439 B')  [-]  (4.11)

Cr = 128 + (0.439 R' - 0.368 G' - 0.071 B')  [-]  (4.12)

where:
Y' is the luma level [-]
Cb is the blue chroma level [-]
Cr is the red chroma level [-]
R', G' and B' are the gamma-corrected red, green and blue color levels [-]

Figure 4.7 shows the direct mapping of these three equations as they would be implemented.

Figure 4.7: Mapping of equations 4.10 to 4.12: for each of Y', Cb and Cr, three constant multipliers feed an adder, followed by rounding and limiting. [5]

As rounding implies an operation for comparing decimal values against 0.5, a more efficient way of rounding is used: by adding 0.5 to the constants (16, 128 and 128) and truncating the decimal value, an equally applicable rounding is obtained. To avoid a large tree of adders and internal negative numbers, and to ease future pipelining, equations 4.10 to 4.12 are rearranged:

Y' = truncate{ (0.504 G' + 16.5) + (0.257 R' + 0.098 B') }  [-]  (4.13)

Cb = truncate{ (0.439 B' + 128.5) - (0.148 R' + 0.291 G') }  [-]  (4.14)

Cr = truncate{ (0.439 R' + 128.5) - (0.368 G' + 0.071 B') }  [-]  (4.15)

With this implementation the conversion can be done in only three clock cycles, and by avoiding internal negative numbers no special considerations are needed, as would otherwise be the case, since Verilog has no native understanding of signed numbers. [5]
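To make the three-cycle structure concrete, the following sketch shows how the luma path of equation 4.13 could be written with two pipeline registers before the final truncation. This is not the csc.v source from [5]; the 16-bit fixed-point scaling, register widths and names are illustrative assumptions.

// Hypothetical luma-only sketch of equation 4.13 in three register stages.
// Coefficients are scaled by 2^16 and rounded: 0.257 ~ 16843/65536,
// 0.504 ~ 33030/65536, 0.098 ~ 6423/65536, 16.5 = 1081344/65536.
// Limiting to the nominal 16-235 range is omitted, since valid 0-255 inputs
// cannot leave it.
module y_from_rgb (
    input  wire       clk,
    input  wire [7:0] iR, iG, iB,   // gamma-corrected R'G'B', 0..255
    output reg  [7:0] oY            // luma
);
    reg [24:0] gc;    // 1st cycle: 0.504*G' + 16.5
    reg [24:0] rb;    // 1st cycle: 0.257*R' + 0.098*B'
    reg [25:0] ysum;  // 2nd cycle: sum of the two partial terms

    always @(posedge clk) begin
        gc   <= 16'd33030 * iG + 25'd1081344;    // 1st clock cycle
        rb   <= 16'd16843 * iR + 16'd6423 * iB;  // 1st clock cycle
        ysum <= gc + rb;                         // 2nd clock cycle
        oY   <= ysum[23:16];                     // 3rd clock cycle: truncate
    end
endmodule

The Cb and Cr paths of equations 4.14 and 4.15 would follow the same pattern, with the subtraction arranged so that the intermediate results stay non-negative.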

Figure 4.8: Mapping of equations 4.13 to 4.15, with truncation, shown for the Y' path: the partial terms are formed in the 1st clock cycle, added in the 2nd and truncated in the 3rd. The values for R'G'B' range between 0 and 255.

In csc.v it is possible to configure the internal precision of the Y'CbCr values between 8 and 13 bits. By setting these precisions individually, the designer can choose between lower execution time (higher frequency of execution) and a smaller area of the device (number of slices) at the cost of lower precision (8 bits of internal precision), or better precision at the cost of longer execution time and more area consumption (13 bits of internal precision). To satisfy the RGB color bars provided with the RGB to Y'CbCr reference design from Xilinx [5], the internal precision should be set to 13 bits for Y', 11 bits for Cb and 10 bits for Cr.

Test of Raw Data to RGB

As no modifications of this module were undertaken, no tests have been conducted for it.

Test of RGB to Y'CbCr and Chroma Subsampling

This paragraph contains the submodule tests for the RGB to Y'CbCr and Chroma Subsampling submodules. All tests of the implemented Verilog code have been performed in ModelSim, a Windows program included in the Altera program package that also contains Quartus. Two types of test have been performed: the first is an execution time test, where the purpose is to verify what the delay from the input to the output is, and whether or not the delay is the same for all output lines. The second test is a functional test to verify that the Y'CbCr conversion is done correctly by the Verilog code.

Test of Execution Time of RGB to Y'CbCr and Chroma Subsampling

This test illustrates the delay from when data is put into the module to when its output is ready. The test is done by inputting an RGB test vector and then observing when the data reaches the output and whether it reaches all the output lines at the same time. The input vectors are as follows:

The module's clock is assigned a clock cycle of 40 ns, resembling the 25 MHz clock from the camera.
ClockEnable and Reset are set to always high and always low respectively, so the system is active and will not reset during simulation.
The Red, Green and Blue inputs are set to match the 75 % simple color vector shown in table 4.4. This input is not that important for the sake of execution time, as long as its frequency does not exceed the frequency of the clock.
iX_Cont, which indicates the location of the RGB values on a given line, is set to count from 0 to 7 and then repeat itself.
iY_Cont, which indicates on which line the given RGB values are located, is set to count from 0 to 3 and then repeat itself.

The settings of iX_Cont and iY_Cont resemble a frame of resolution [NxM] = [8x4]. The outputs to be verified are:

Luma (Y'): check the delay in relation to the input and the duration of the pulse.
Blue chroma (Cb): check the delay in relation to the input and the duration of the pulse.
Red chroma (Cr): check the delay in relation to the input and the duration of the pulse.
oY_Cont and oX_Cont are used by the system to tell the following module that converted data is ready and what kind of data it is (subsampling). It is checked whether these are concurrent with the output data.
oDataOut and oDataReady are used to check whether oY_Cont and oX_Cont are synchronized with the time the output is ready.

Figure 4.9 illustrates a waveform produced in ModelSim. This specific figure shows the input/output relations for testing the execution time of the conversion and subsampling module. Several of these waveforms have been produced with ModelSim to verify that the tested Verilog modules work as desired. The other waveforms are found under ModelSim Results in appendix C.
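A stimulus block along the lines sketched below (a hypothetical sketch, not the test fixture from the enclosed CD) could generate the inputs listed above: a 25 MHz clock, constant ClockEnable/Reset levels, one 75 % color vector and coordinate counters emulating an 8x4 frame.

// Hypothetical stimulus sketch for the csc_top execution-time test.
`timescale 1ns/1ps
module csc_top_tb;
    reg        Clock       = 1'b0;
    reg        ClockEnable = 1'b1;      // always enabled
    reg        Reset       = 1'b0;      // never asserted during the simulation
    reg [10:0] iX_Cont     = 11'd0;     // column counter, 0..7
    reg [10:0] iY_Cont     = 11'd0;     // line counter, 0..3
    reg [7:0]  Red   = 8'd191,          // one 75 % color vector (white)
               Green = 8'd191,
               Blue  = 8'd191;

    always #20 Clock = ~Clock;          // 40 ns period = 25 MHz camera clock

    // Emulate an [8x4] frame by wrapping the coordinate counters.
    always @(posedge Clock) begin
        if (iX_Cont == 11'd7) begin
            iX_Cont <= 11'd0;
            iY_Cont <= (iY_Cont == 11'd3) ? 11'd0 : iY_Cont + 11'd1;
        end else begin
            iX_Cont <= iX_Cont + 11'd1;
        end
    end

    // Device under test; the port list is abbreviated here, and the outputs
    // (Y', Cb, Cr, oDataOut, oDataReady, oX_Cont, oY_Cont) are inspected as
    // waveforms in ModelSim rather than checked automatically.
    // csc_top dut (.Clock(Clock), .ClockEnable(ClockEnable), .Reset(Reset),
    //              .iX_Cont(iX_Cont), .iY_Cont(iY_Cont),
    //              .Red(Red), .Green(Green), .Blue(Blue) /* , ... */);
endmodule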

Figure 4.9: Simulations performed in ModelSim produce waveforms like the one shown here: the clock, enable and reset signals, the coordinate counters, the RGB inputs and the Y', Cb, Cr, oDataOut, oDataReady, oY_Cont and oX_Cont outputs over roughly 700 ns. This simulation illustrates the input/output relations for the test of execution time discussed above. A discussion of the results is provided in appendix C.

Functional test of RGB to Y'CbCr conversion

This test is performed to verify that the RGB data sent to the submodule is converted into Y'CbCr correctly. In the Verilog file csc_top.v, simple color test vectors are given for 100 %, 75 %, 50 % and 25 % RGB color values, where a 100 % color value means that the highest possible value for red, green and blue is used. For the 8-bit input word length of this system a 100 % color value would be 255, 75 % would be 191, etc. The color test vectors applied in ModelSim are shown in tables 4.3 through 4.6. These vectors are input to Red, Green and Blue, and the outputs Y', Cb and Cr are checked against the corresponding values in the color test vectors. For this functional test the input settings are the same as for the execution time test, but instead of only testing the 75 % color values, the RGB input is interchanged between the 100 %, 75 %, 50 % and 25 % test vectors. This is done to see whether the conversion is done correctly for a wide array of inputs, giving a good indication that the code converts as it is supposed to without checking all possible inputs. For the output there is only a need to investigate Y', Cb and Cr.

Chroma subsampling

The chroma subsampling is verified by checking whether the oDataOut register contains the right values from the Y', Cb and Cr registers. For even frame lines the oDataOut register should contain two luma values and two chroma values at a time, leaving every second pair of chroma values

out. For odd lines every chroma value is discarded and the oDataOut register should contain the last four calculated luma values when oDataReady goes high. The settings for this test are the same as for the execution time test.

Test of Internal precision

Because area usage could become a factor of concern when implementing multiple algorithms for hardware acceleration on an FPGA, it is investigated what effect a reduction of the internal precision in the RGB to Y'CbCr conversion has on the precision of the output. As suggested in [5], the internal precision should be set to 13 bits for Y', 11 bits for Cb and 10 bits for Cr; these settings should uphold the values of the color test vectors below. In this test the csc.v module is rewritten to calculate the Y'CbCr values with an internal precision of 8 bits.

The following tables contain the color test vectors that were used in the tests described above. The outputs Y', Cb and Cr should be equal to the values contained in these tables.

Table 4.3: Test vectors (100 % RGB Color Bars): the R'G'B' inputs and the expected Y', Cb and Cr outputs for the colors white, yellow, cyan, green, magenta, red, blue and black.

Table 4.4: Test vectors (75 % RGB Color Bars): the R'G'B' inputs and the expected Y', Cb and Cr outputs for the same eight colors.

Table 4.5: Test vectors (50 % RGB Color Bars): the R'G'B' inputs and the expected Y', Cb and Cr outputs for the colors white, yellow, cyan, green, magenta, red, blue and black.

Table 4.6: Test vectors (25 % RGB Color Bars): the R'G'B' inputs and the expected Y', Cb and Cr outputs for the same eight colors.

Conclusion

In this subsection the modules for the RGB to Y'CbCr conversion and the chroma subsampling were developed and tested. The results from the tests indicated that the functionality of the system works as expected, with a delay of six clock cycles from the input being set to the output being ready. The functional test of the conversion showed that only two clock cycles actually elapsed from RGB input to Y'CbCr output. This is one clock cycle faster than expected from figure 4.8. This reduction of clock cycles is considered an effect of the optimization done by ModelSim itself before simulation. Due to time limitations this result is not investigated any further, but the optimization could stem from further pipelining, conversion of registers into signals, etc.

The system managed to uphold the Y'CbCr values for an internal precision of 13 bits for Y', 11 bits for Cb and 10 bits for Cr. When the internal precision was set to 8 bits, deviations of ±1 were seen in numerous locations. This is a variation of 0.4 % in color value in exchange for lower area usage, as the 8-bit version only uses 4 DSP elements compared to the 14 DSP elements of the version with higher internal precision. Although the 8-bit version uses more logic cells (179 compared to 167) and more dedicated logic cells (230 compared to 163), it is intuitively assumed that the area usage of the DSP elements in the high precision version exceeds the area usage of

the 8-bit version. The synthesis tools used in this project did not provide an AND-gate equivalent measure with which to make a more specific comparison. In this project the internal precision is kept at 13, 11 and 10 bits, as area usage is not a problem and because data received at a speed of 25 MHz from the camera does not pose any problem either. Should area usage at some point become a problem, a lower internal precision could be implemented with little effect on quality.

Camera Data Buffer

The camera data buffer (CCDbuffer) contains two lines of a frame, because this takes up the least amount of memory while still making it possible to synchronize frames. Two lines are needed for synchronization because the pattern of how the Y', Cb and Cr values are stored is repeated every second line. These two lines are transferred to the hardware/software interface; to make sure that the data is not overwritten by new incoming data, a ping-pong structure is used. This means that 2 times two lines are stored in two data buffers. The process runs as described in figure 4.10.

Figure 4.10: Flowchart for the CCDbuffer: write two lines of data to the current buffer, change buffer, set the interrupt, wait for CS from the hardware/software interface and remove the interrupt again.

In order to access the buffers where the data from the RGB to Y'CbCr and Chroma Subsampling submodule should be saved, a counter is used. This counter should be set to zero each time an even line occurs and the X-coordinate is zero, and increment by one for each dataready received

from the RGB to Y'CbCr and Chroma Subsampling submodule. In the software part of the hardware/software interface the current frame number needs to be known in order to synchronize the frames, so this functionality also has to be implemented in the system.

Implementation

The system contains three parts: a buffer, a frame counter and an interrupt handler. Their implementations are handled individually in the following.

Buffer

In order to create a buffer, a MegaFunction from Quartus is used that creates a RAM module. A dual-port RAM function is used, making the implementation simpler as the data lines do not have to be shared by the input and the output of the module. To optimize further, only one RAM module is used to implement both buffers. The distinction is made with the second LSB of the Y-coordinate. This bit changes every second line and can thereby distinguish which buffer is being worked on. It is used as the LSB of the address input to the RAM module for writes, and the inverted value is used for reads. This makes it impossible to read from the same buffer that is being written to.

As the data being written into the buffer needs to be placed at the correct address, it is necessary to create an address generator. The address generator is implemented as a simple counter that is reset each time the X-coordinate is zero and the Y-coordinate is an even number (its LSB is equal to zero), and advances by one each time a dataready signal is set. The address for the buffer is 9 bits, meaning the counter needs to be 8 bits long. The second LSB of the Y-coordinate is used as the LSB of the address and the remaining 8 bits are copied from the counter.

The interface to the hardware/software interface consists of a chip select (CS) and data lines. These are connected to the readenable and readdata lines respectively on the MegaFunction RAM module. The address to read from the RAM module is constructed from the 8 LSBs of the address from the hardware/software interface and the inverted buffer-select bit of the Y-coordinate; the inverted bit is used as the LSB and the address from the hardware/software interface as the MSBs. Figure 4.11 illustrates the connections for the buffer.

Figure 4.11: Connections for the MegaFunction buffer: the chroma subsampling submodule drives the write port (32-bit data, clock, write enable and write address [8:1] from the counter, write address [0] from the buffer-select bit of the Y-coordinate), while the hardware/software interface drives the read port (clock, read enable from CS, read address [8:1] from its address [7:0] and read address [0] from the inverted buffer-select bit); the 32-bit read data goes back to the hardware/software interface.

Frame counter

Each time the X- and Y-coordinates are zero a new frame is started and the frame counter should advance by one. A 32-bit value is used for this, and when overflow occurs it is reset to zero. Because there is a delay of two image lines created by the ping-pong buffer structure, the frame count should first advance when the Y-coordinate equals 2. In order to avoid using a shared data line for transferring the image data and the frame counter, which would create additional delay in the data transfer as well as require a complicated address decoder, an additional hardware/software interface is created that handles this communication only. The implementation of this interface can be found in the Hardware/Software Interface subsection below.

The Interrupt Handler

Interrupts should be set each time two lines have been moved into the buffer. The simplest way to implement this is to look for a transition on the second LSB of the Y-coordinate; both positive and negative transitions must be detected. The interrupt is removed again when a Chip Select (CS) is made from the hardware/software interface.

Test of Camera Data Buffer

The testing of the data buffer consists of a delay test and a functional test. The tests are performed in ModelSim. It should be noted that this is not a test performed on the board, but a simulation performed by software that verifies that the Verilog code performs as intended.
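The write-address generation and interrupt handling described above could look roughly like the sketch below. This is illustrative only: register and port names are assumptions, the dual-port RAM itself would be the Quartus MegaFunction instance rather than part of this module, and the buffer-select bit is taken to be the second LSB of the Y-coordinate as in the text.

// Hypothetical sketch of the CCDbuffer write-address generator and interrupt flag.
module ccd_buffer_ctrl (
    input  wire        clk,
    input  wire        dataready,   // one pulse per 32-bit word from csc_top
    input  wire [10:0] x_cord,
    input  wire [10:0] y_cord,
    input  wire        cs,          // chip select from the HW/SW interface
    output wire [8:0]  wr_addr,     // 9-bit write address into the RAM block
    output reg         irq          // "two lines ready" interrupt
);
    reg [7:0] word_cnt;             // word counter within the current two lines
    reg       bank_d;               // delayed buffer-select bit

    always @(posedge clk) begin
        // Reset the counter at the start of every even line, advance per word.
        if (x_cord == 11'd0 && y_cord[0] == 1'b0)
            word_cnt <= 8'd0;
        else if (dataready)
            word_cnt <= word_cnt + 8'd1;

        // Any transition (positive or negative) on the buffer-select bit means
        // two lines are complete: raise the interrupt, clear it again on CS.
        bank_d <= y_cord[1];
        if (y_cord[1] != bank_d)
            irq <= 1'b1;
        else if (cs)
            irq <= 1'b0;
    end

    // Buffer-select bit as the address LSB, the counter in the upper 8 bits.
    assign wr_addr = {word_cnt, y_cord[1]};
endmodule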

Execution Time Test

The delay test verifies how much delay there is between the input and the output. The test is run over six lines (six changes of the Y-coordinate) but only five changes of the X-coordinate, cheating the system and minimizing the amount of redundant test results. The input values are as follows:

Inputs:
Data input from the chroma subsampler: 0 for the first 3 clock cycles, 1 for the 3 following clock cycles and zero for the rest.
Clk input from the chroma subsampler: 10 MHz clock.
X_cord from the chroma subsampler: counter, 10 MHz.
Y_cord from the chroma subsampler: counter, 2 MHz.
Reset: set low (active high).
Address from the hardware/software interface: counter, 10 MHz (two lines).
Clk input from the hardware/software interface: 10 MHz clock.
CS input from the hardware/software interface: high.

The expected output from the test is as follows:
The data output to the hardware/software interface should have a delay of 2 lines from the data input to the output, equal to 10 clock cycles.
The Irq output to the hardware/software interface should go high after 2 lines and stay high, as the CS input from the hardware/software interface does not change.
The FrameCounter should advance each time Y_cord = 2 and X_cord = 0.

It was not possible to perform this test because the MegaFunction used for the RAM block would not function within ModelSim.

Functional Test

This test is basically the same as the delay test, with the following differences: the input data from the chroma subsampler is changed to give a varying input by using a counter, and the connection to the hardware/software interface is simulated to resemble the way the data exchange works. The inputs are as follows:

Data input from the chroma subsampler: a counter counting up from 0.

Clk input from the chroma subsampler: 10 MHz clock.
X_cord from the chroma subsampler: counter, 10 MHz.
Y_cord from the chroma subsampler: counter, 2 MHz.
Reset: set low (active high).
Address from the hardware/software interface: counter, 20 MHz (two lines), starting with the value 8.
Clk input from the hardware/software interface: 20 MHz clock.
CS input from the hardware/software interface: 2 MHz clock that stays at 0 for two cycles and then runs continuously from that point.

The output expected from this test is as follows:
The data input from the chroma subsampler should appear, with the delay specified in the previous test, on the data output to the hardware/software interface.
The interrupt should go high when the buffers change and go low again when the CS goes high.
The FrameCounter should advance each time Y_cord = 2 and X_cord = 0.

For the same reasons as in the execution time test it was not possible to perform the functional test. The functionality of the CCDbuffer was therefore not verified by this test.

Conclusion

It was not possible to verify that the submodule functions as intended. The reason is that the MegaFunction used for the RAM block could not be simulated in ModelSim. Because of time limitations it was not possible to find a solution to the problem, and this module can therefore not be verified as functioning.

Hardware/Software Interface

To transfer picture data into software, the camera must be interfaced with the Avalon bus to enable communication with the Nios II processor. This interface must handle interrupts when data is ready, as well as reads/writes to specific addresses. The module works as a wrapper around the already written CCDbuffer submodule; it simply forwards the control line, the address lines and the data lines from the CCDbuffer to the Avalon bus. The module is a dummy module intended to make debugging easier.
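A thin forwarding wrapper of this kind might look like the following sketch. The names and widths are assumptions, the Avalon-side signal roles follow the usual Avalon slave conventions, and the real module additionally registers and masks the interrupt as described in the implementation below.

// Hypothetical sketch of the dummy wrapper between the Avalon bus and the CCDbuffer.
module ccd_avalon_wrapper (
    // Avalon-MM slave side
    input  wire        chipselect,
    input  wire        read,
    input  wire [7:0]  address,
    output wire [31:0] readdata,
    output wire        irq,
    // CCDbuffer side
    output wire        ccd_cs,
    output wire [7:0]  ccd_addr,
    input  wire [31:0] ccd_data,
    input  wire        ccd_irq
);
    // Pure forwarding: control, address and data lines pass straight through.
    assign ccd_cs   = chipselect & read;
    assign ccd_addr = address;
    assign readdata = ccd_data;
    assign irq      = ccd_irq;
endmodule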

The connections to the CCDbuffer are CS and the address for the frame data, as well as a 32-bit data line. Additionally an extra interface is made to connect to the frame count, which is also a 32-bit data line. The hardware/software interface connected to the Avalon bus must contain standardized control and data lines so that the Avalon bus can communicate with it; the most important of these are CS, read/write enable, read/write address, data lines, interrupt and reset. Identifying which data lines in the interface the Avalon bus should interact with can be done either by using a standardized naming scheme for the lines or by setting them manually when creating the module in the SOPC Builder.

The procedure for creating an Avalon interface in the SOPC Builder is as follows: first the Verilog file that needs the interface is specified; then, for the lines going out of the module, it must be specified whether they belong to the Avalon bus (e.g. data lines, address lines) or whether they are externally connected to another hardware module. The SOPC Builder then creates a wrapper around this module. In order to make the Avalon interface communicate with other hardware modules, e.g. the CCDbuffer, the connection must be made to the wrapper instead of to the actual interface. The wrapper is located inside the system_0 module that contains all the elements created in the SOPC Builder.

Implementation of Camera Hardware/Software Interface

The Avalon interface could be made directly to the CCDbuffer, but to make error detection and testing easier a dummy module is written in Verilog that forwards the data and control lines from the Avalon bus to the CCDbuffer, where the address decoding and data handling are processed. This is, however, not the case for the interrupt line from the CCDbuffer. Instead, a small function inside the dummy module makes it possible to mask any interrupt coming from the CCDbuffer. When a positive edge is detected on this line, an internal register (IRQreg) in the dummy module is set. IRQreg is logically ANDed with the register IRQmask, which masks the interrupt, generating the interrupt control wire that is fed to the Avalon bus. The IRQreg register is set low again when a read of the interface is performed from the Avalon bus, which means that the interrupt is being handled. The IRQmask used by the software must be set to 0 when a reset occurs, meaning that interrupts only get noticed by the system once IRQmask has been set by the device driver; if no software is present to handle the interrupts, no interrupts are sent to the system. In order to make this work, a write function is also implemented in the dummy module; it is not forwarded and is only used to write data to the IRQmask register. The source code can be found on the enclosed CD in

the DE2_Net folder.

Test of Camera Hardware/Software Interface

The Avalon interface is tested together with the device driver in section 5.3. Some small changes have been made to the module to do this. Interrupts are generated by adding a counter module from the MegaFunction blocks in the Quartus development tool; this counter is used as a frequency divider, dividing the external 50 MHz clock on the DE2 board down to 10 Hz. Also, the data read by the device driver is just the address it is reading from, which has been fed back to the data lines; the data being read should therefore be the same as the address being read from.

Conclusion

In this section the hardware/software interface for the camera was developed. It was chosen to make a thin wrapper module that contains nothing more than some interrupt handling. All the control lines and data lines are forwarded to make this module as simple as possible. A test was performed using the device driver created in chapter 5.3, and the module was found to function as intended.

4.8 MPEG-2 Hardware Accelerator

Hardware acceleration can be used to accelerate sequential code, especially if the code can be parallelized. For example, an accelerated matrix rotation can be up to 73x faster and a Fast Fourier Transform (FFT) around 15x faster than on the Nios II processor [2]. The speed-up of the accelerated part can be about 10x to 100x or even more, but it depends on the specific situation and the specific code.

The procedure for making the hardware acceleration is as follows. The algorithm is first analyzed to identify pieces of code that could benefit from acceleration; this is done for the MPEG-2 algorithm in chapter 2. The next step is to consider how the code can be accelerated with regard to parameters such as size, power and speed. For this project only the speed parameter is considered. The last step is to make the implementation in an HDL and test it.

For the development board running the Nios II softcore processor there exists a tool called C2H [2], which can transform parts of C code directly into VHDL and make all the necessary changes to the system. Unfortunately this tool is only available for the Nios II IDE development tool, which is not compatible with the uclinux operating system being used. This means that the hardware

accelerator and the interface must be written manually in the implementation phase.

Design

From the profiling and metric calculations, the C code in listing 4.1 was found to be suitable for hardware acceleration. The function calculates the absolute difference between two macroblocks, where the macroblocks can be passed either as whole macroblocks (16x16 pels) or as two half macroblocks (16x8 pels).

1  if (!hx && !hy)
2    for (j=0; j<h; j++)
3    {
4      if ((v = p1[0]  - p2[0])<0)  v = -v; s+= v;
5      if ((v = p1[1]  - p2[1])<0)  v = -v; s+= v;
6      if ((v = p1[2]  - p2[2])<0)  v = -v; s+= v;
7      if ((v = p1[3]  - p2[3])<0)  v = -v; s+= v;
8      if ((v = p1[4]  - p2[4])<0)  v = -v; s+= v;
9      if ((v = p1[5]  - p2[5])<0)  v = -v; s+= v;
10     if ((v = p1[6]  - p2[6])<0)  v = -v; s+= v;
11     if ((v = p1[7]  - p2[7])<0)  v = -v; s+= v;
12     if ((v = p1[8]  - p2[8])<0)  v = -v; s+= v;
13     if ((v = p1[9]  - p2[9])<0)  v = -v; s+= v;
14     if ((v = p1[10] - p2[10])<0) v = -v; s+= v;
15     if ((v = p1[11] - p2[11])<0) v = -v; s+= v;
16     if ((v = p1[12] - p2[12])<0) v = -v; s+= v;
17     if ((v = p1[13] - p2[13])<0) v = -v; s+= v;
18     if ((v = p1[14] - p2[14])<0) v = -v; s+= v;
19     if ((v = p1[15] - p2[15])<0) v = -v; s+= v;
20
21     if (s >= distlim)
22       break;
23
24     p1+= lx;
25     p2+= lx;
26   }
27   return s;

Listing 4.1: C code found suitable for hardware acceleration.

The code contains a number of if statements, which could suggest a control structure, but closer inspection in section 2.3 revealed that these are used to calculate the absolute value of the subtractions. It was also found that the dependency on s and distlim in lines 21-22 could be removed without any algorithmic changes. Because lines 24-25 are only used for addressing in C, these will not be needed for a hardware implementation either.

The 16 subtractions and absolute values can be done in parallel, and the sum of differences, s, can also be parallelized to some degree. The loop is repeated h times as stated by the for loop, where the variable h can be either 8 or 16 depending on whether half macroblocks or whole macroblocks are used. The distance is then calculated by taking the absolute difference between the 256 (16x16) or 128 (16x8) pels and summing these in s. When parallelizing the sum value s it is possible to do this in several ways, as illustrated in figure 4.12.

Figure 4.12: Data flow graphs for calculation of a sum in parallel (a), sequentially (b) and partially in parallel (c).

The choice of how to calculate the sum depends on how many logic elements one is willing to use and on the required speed. If size is the most important parameter a sequential approach should be used, as the same adder and register can be used for all the calculations; but if one addition takes one clock cycle it

would take 255 clock cycles to calculate the sum. If speed is the most important parameter a parallel approach should be used: here 128 additions are made in parallel, then 64 additions, 32 additions, etc. The result can then be achieved in 8 levels of parallel additions using several adders, instead of 255 sequential additions using one adder. The parallelized solution would be almost a factor 32 faster. It is also possible to make a combination of the sequential and the parallel methods to achieve a compromise between speed and size. Assuming that each addition takes one time unit (tu), figure 4.13 illustrates the execution time vs. the number of adders used.

Figure 4.13: Dependency between the number of adders and the time used for the calculations (execution time in tu versus number of adders).

Because speed is considered the most important parameter, the parallel solution is used. It is important to note that when an addition is performed, the output register word length must be increased by one bit compared to the input registers. This is because the sum of differences can double in each addition, and overflow is prevented by increasing the word length.

Figure 4.12 also holds true for the design of how to calculate the absolute value of the difference, i.e. either one absolute value element can be used to calculate all absolute values sequentially, or 256 absolute value elements can be used to calculate all of them in parallel. The latter is chosen because of the focus on the speed parameter.

Because the macroblocks can be handed over as whole 16x16 macroblocks or as two 16x8 macroblocks, it is necessary to take this into consideration for the hardware implementation. This is done

by dividing the hardware accelerator into two identical blocks, each capable of calculating the distance for a 16x8 macroblock. In each block there is one subtraction step, one absolute value step and seven addition steps. If the distance for a 16x8 macroblock needs to be calculated, data is only transferred to one of the blocks and the result is calculated there. For 16x16 macroblocks, data is transferred to both blocks and the results from these are added together to give the final sum. The selection of whether a 16x8 or a 16x16 macroblock is used is set from the hardware/software interface. The steps that the hardware accelerator goes through are as follows:

Receive macroblock data from the hardware/software interface - only the amount of data is specified at this point.
Receive the Dataformat (16x8 or 16x16 macroblocks) from the hardware/software interface and route the macroblock data to the correct accelerator block(s).
Receive the start processing command from the hardware/software interface, i.e. the accelerator does not start by itself.
Perform the subtractions and absolute values in parallel.
Calculate the sum of absolute values in parallel.
Return the sum to the hardware/software interface.

Implementation

The implementation is made in Verilog and can be found on the CD as DE2_Net/Hwa.v. The hardware/software interface consists of the Avalon bus and a device driver described in section 5.3. The return value cannot be pushed to the hardware/software interface; it must be retrieved either by interrupt or by polling. Because the return value is ready after only a few clock cycles it is not necessary to use interrupts - there would not be time for the softcore processor to do other tasks meanwhile - so polling is used instead. The polling function consists of a register (Dataready) that the hardware/software interface reads until the value changes from zero to one, indicating that the result is done and can now be read by the hardware/software interface.

On the Avalon bus there are three data interfaces: two interfaces for the data used by each accelerator block and one for control, e.g. setting the Dataformat, starting the calculations, polling for Dataready and reading the result of the calculation. The addresses for the data interfaces are 8 bits for

The addresses for the data interfaces are 8 bits wide, giving 256 values per interface, and the data bus width is 8 bits because the maximum pel value is 255. For the control bus there are 2 addresses, which are shared for both reading and writing: in the read process they are used for Dataready and the return value, and in the write process they are used for Dataformat and start processing. This data bus must be 16 bits wide, as the return value is between 0 and 65,280 (255 x 256).

The hardware accelerator is started by setting the Start process register to 1. The accelerator then checks the value stored in the Data Format register: if it is one, the distance for a 16x8 macroblock is calculated, and if it is two, the distance for a 16x16 macroblock is calculated.

The first thing to do in the calculation is to subtract the corresponding pel values and calculate the absolute value of the difference. There are several ways of implementing the absolute value function; two solutions are considered for the implementation. The first is an if-else statement where the two input values are compared and the smaller value is subtracted from the larger value. Tests in Quartus showed that this solution consumes 32 logic elements. The second solution is to use signed variables, subtract the two inputs and check the sign bit afterwards. If the result is negative, i.e. the sign bit is one, a two's complement conversion is made to convert the output into a positive value. The sign bit is then discarded and the sum of differences is calculated. This implementation consumes 18 logic elements. It is worth noting that no attention needs to be paid to negative input values from the macroblock data, as these are unsigned chars, meaning their range of values is 0 to 255. The latter implementation is chosen because it uses fewer logic elements.

The outputs from the absolute value elements are put in two arrays, d1 and d2. d1 contains the outputs from the first accelerator block and d2 contains the outputs from the second accelerator block, meaning each array contains 128 values. The next step is to sum the contents of these registers. This is performed by adding dx[0] and dx[1] together and putting the result in dx[0]. In parallel with this, the addition of dx[2] and dx[3] is performed and the result is placed in dx[1], etc. The results from the first step of additions are then contained in dx[0] to dx[63]. This procedure is repeated until only one result is contained in dx[0]. Following this the Dataformat is checked: if it is 1, d1[0] is the final result, and if it is 2 the final result is the sum of d1[0] and d2[0]. The last step is to set DataReady high so that the hardware/software interface can read the result. A DFG of the hardware accelerator with the final result being the sum of d1[0] and d2[0] is illustrated in figure 4.14.
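The calculation can also be expressed as a small C reference model, shown below as a cross-check of the procedure just described. This is only a software sketch of the hardware behaviour (the function names and the assumption that the two 16x8 halves of a macroblock are stored consecutively are illustrative); the actual implementation is the Verilog code on the CD.

#include <stdlib.h>

/* Reference model of one accelerator block: sum of absolute differences
 * between two 16x8 macroblock halves (128 pels each). */
unsigned int block_distance(const unsigned char *mb1, const unsigned char *mb2)
{
    unsigned int d[128];
    int i, n;

    /* Subtraction and absolute value - done for all 128 pels in parallel in hardware */
    for (i = 0; i < 128; i++)
        d[i] = (unsigned int)abs((int)mb1[i] - (int)mb2[i]);

    /* Pairwise addition tree: 128 -> 64 -> 32 -> ... -> 1 partial sums.
     * Each pass of the outer loop models one of the seven addition steps. */
    for (n = 128; n > 1; n /= 2)
        for (i = 0; i < n / 2; i++)
            d[i] = d[2 * i] + d[2 * i + 1];

    return d[0];   /* at most 128 x 255 = 32,640 */
}

/* For a 16x16 macroblock the final result is the sum of the two block results. */
unsigned int macroblock_distance(const unsigned char *mb1, const unsigned char *mb2)
{
    return block_distance(mb1, mb2) + block_distance(mb1 + 128, mb2 + 128);
}

With one macroblock containing all zeroes and the other the values 0 to 255, this model returns 8,128 for a single block and 32,640 for a full macroblock, matching the expected results used in the tests below.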

Figure 4.14: Data flow graph of the hardware accelerated part of the MPEG-2 algorithm.

4.8.3 Test of the MPEG-2 Hardware Acceleration

The hardware accelerator code is simulated and tested in ModelSim to check the functionality and execution time of the developed Verilog code. The only difference between the tests is the test vector. It should be noted that this test only verifies the code, not the hardware implementation on the FPGA. A test verifying the hardware implementation is performed when the hardware/software interface has been implemented in section 5.5. The procedure for the two tests is as follows:

1. Set up test vectors in the macroblock buffers.
2. Set the DataFormat to two 16x8 macroblocks.
3. Start processing the test vectors.
4. Wait for the DataReady signal.
5. Read the result.

6. Set the DataFormat to two 16x16 macroblocks.
7. Start processing the test vectors.
8. Wait for the DataReady signal.
9. Read the result.

The test vectors used for the two tests are stated in table 4.7. Each buffer for the macroblocks contains 256 values.

Test 1: Macroblock 1 = 0 (all positions), Macroblock 2 = 0 to 255
Test 2: Macroblock 1 = 0 to 255, Macroblock 2 = 0 (all positions)

Table 4.7: Test vectors for the MPEG-2 hardware accelerator.

The verification is done by monitoring the ReadData output, the internal registers d1[0] and d2[0] and the data buffers for macroblocks 1 and 2. The ReadData output should provide the value for DataReady as well as the result of the distance calculation between the macroblocks. The result should be 8,128 for the 16x8 calculation and 32,640 for the 16x16 calculation. The data buffers provide an indication that the test vectors are copied into the data buffers; this is only for debugging purposes, as the result of the calculations also verifies this. The internal registers d1[0] and d2[0] contain the intermediate results of the additions. These should contain the results in table 4.8, based on the description of the calculations in the previous section. For the 16x8 macroblock calculation only the intermediate result in d1 matters, as the output from the d2 array is not used.

Add. step:  1    2    3      4      5      6       7
d1[0]:      1    6    28     120    496    2,016   8,128
d2[0]:      257  518  1,052  2,168  4,592  10,208  24,512

Table 4.8: Intermediate results in the internal registers of the distance calculations.

The result of the first test can be found in figure C.10 page 127. The result from the second test is in figure C.11 page 128. Test results with the hardware/software interface can be found in section 5.5.

Conclusion

In this section the MPEG-2 hardware accelerator was designed, implemented and tested. The results from the tests performed here, as well as the test performed for the hardware/software interface in section 5.5, verify the functionality. It has also been shown that the execution time of the hardware accelerator is 14 clock cycles.

4.9 Ethernet Hardware Module

This section contains the implementation and testing of the Ethernet hardware module. As this is not the main focus of the project, as described in the initial problem in chapter 1, the implementation is based on existing implementations with minor changes. The Verilog code for connecting with the network chip on the DE2 development board is already written in the demonstration project, and these files are used more or less unchanged in this project.

Implementation of the Communication on the FPGA

Initial trials with the board indicated a problem with the existing Verilog code. Investigations into the problem revealed that it was the read wait state of the Avalon bus in the SOPC builder that was set wrong. Compared to the demonstration project, this has to be increased from 40 ns to 80 ns to comply with the datasheet for the network chip.

Test

There are no tests performed for the hardware part of the Ethernet module; its functionality is implicitly verified by the test of the software module for the Ethernet in section 5.4.1.

Conclusion

In this section the hardware module was implemented, and the test performed for the software submodule in section 5.4.1 verified that this module functions as intended. The implementation was based on an existing implementation with some small changes to make it function properly.

This concludes the design, implementation and testing of the hardware modules and their submodules. The next step is to move on to the design, implementation and testing of the software modules and their submodules.


Chapter 5

Software Design

This chapter contains the implementation and test of the software modules and submodules described in the hardware/software partitioning in chapter 3. An additional software module is added to the system, namely an operating system; it is mainly added because existing implementations should be used wherever possible, and using an operating system makes it easier to find and utilize existing implementations such as a TCP/IP stack. All communication has to take place through the operating system, and the conceptual drawing of the system therefore looks like figure 5.1.

Figure 5.1: The modules in the software part of the system (Camera, MPEG, Ethernet and the Operating System).

5.1 Program Structure

A server-client connection over Ethernet is established to transfer the encoded MPEG-2 output from the development board. For the MPEG-2 output it does not matter whether the development board is the server or the client, but from a practical standpoint it makes sense to make the development board the server. This means that the MPEG-2 stream of data is sent after the PC makes a connection to the development board, i.e. the PC asks for data instead of getting it sent to it.

For the development board to act as a server it is necessary to modify the overall program structure of the MPEG-2 algorithm. Furthermore, it is chosen to allow a user to start and stop the MPEG-2 encoding from the PC. The program for the server side works as illustrated in figure 5.2.

Figure 5.2: Flowchart for the socket server. Details on creating and binding sockets, establishing connections, reading data and writing data are described in section 5.4.

The server sets up a socket and then listens for a client wishing to get an MPEG-2 data stream. If a client connects, the server accepts the connection and reads any data transferred from the client. The data transferred from the client can indicate either that an MPEG-2 stream should be initiated or that it should be stopped. If the client indicates that an MPEG-2 stream should be initiated, the algorithm will proceed and transfer the file headers, read a frame from the camera, encode it and transfer it to the PC. After this it will read from the network again, and if the client did not send new data indicating that the MPEG-2 data stream should be stopped, the algorithm will keep reading new frames, encoding these and transferring them to the PC. When the client indicates that the MPEG-2 stream should be stopped, the server will transfer the file tail and close the connection.

To transfer the MPEG data to a PC, the PC must be running a socket client implementation. A flowchart for how this should function is shown in figure 5.3.

Figure 5.3: Flowchart for the client side. Details on creating sockets, establishing connections, reading data and writing data are described in section 5.4.

The client sets up the socket for communicating with the server and then connects to the server. The client then sends a data packet indicating that an MPEG-2 stream should be initiated, reads the frames sent back from the server and stores them on the hard disk. This continues until the client program sends a packet to the server indicating that the MPEG-2 stream should be stopped. The client will then receive the file tail from the server, write it to the hard disk and close the connection. The received MPEG-2 frames should then be ready for playback.

5.2 Operating System

Altera suggests the uclinux distribution as an operating system. It is an open source Linux core for their Nios II processors. Behind uclinux there is a community that develops and maintains the distribution, and it has been ported to many other processors, e.g. ARM processors. There are multiple tutorials available and an online forum to get support from if needed. This makes it a suitable solution as an operating system for the Nios II softcore processor. uclinux already supports all the needed features on the DE2-board except the camera. It implements a TCP/IP stack for communication and drivers for the Ethernet module. In appendix A.1 and A.3 there is a small guide to how the uclinux distribution is compiled on a PC and how it is installed on the DE2-board.

Because uclinux is an operating system, command line interaction with the system is also possible. uclinux comes as standard with a very basic command line interface called Sash. Sash allows for telnet interaction; running programs and changing settings is therefore possible over a network connection rather than using the USB-Blaster connection, which is the default. Sash is, however, a rather basic command line tool, and if a more powerful command line tool is needed, other options are available for uclinux. The procedure for compiling the kernel with the right drivers and programs for the DE2-board is described in appendix A.5.

5.3 Camera Module

The software needed for the hardware/software interface for the camera consists mainly of a device driver. A device driver is used because it enables the usage of hardware interrupts, which is preferred over polling for data ready. The construction of the device driver is based on the Avalon interface module (see section 4.7.7). The interface module has a mask interrupt register, which is used mainly to ensure that no interrupts occur while the device driver is not able to process them.

A device driver runs in kernel space under uclinux, whereas other programs run in user space. Programs running in user space and programs running in kernel space can not share memory; therefore the data must be transferred from the kernel space program to the user space program. This process resembles reading from and writing to a hard drive, except that it is done to a device located in /dev instead of a file. The device driver contains 3 different elements:

1. Initialize and close the device driver.
2. Communication with the user space programs.
3. Interrupt handling.

The image data from the camera is read out two lines at a time from the Avalon hardware module. This poses a problem, as it is not possible to know where in a frame these two lines are located. Instead the device driver collects enough lines for a whole image and keeps count of the number of images processed. This value is then compared with the record of how many images the camera has made. If the image counts of the device driver and the camera do not match, the camera has started a new image and the current lines from the camera are the first in a new image.

All previous data is therefore discarded and the image in the device driver is started anew. This way of handling the synchronization of lines means that an entire image is collected by the device driver before it is transferred to user space.

If an interrupt were to occur during the transfer of this image to user space, the interrupt handler would move the new lines into the image that is being transferred to user space and thereby corrupt it. To avoid this situation two image buffers are used, which the interrupt handler alternates between. The interrupt handler must first remove the interrupt mask so that the interrupt goes low; after this, it must check whether the current frame is the same for the camera and the device driver. Then it moves the lines into the image buffer. If a whole image has been transferred to the device driver, it should wake up the read handler and change image buffer. The last thing the interrupt handler does is to re-enable the interrupts. The function of the interrupt handler is also illustrated in a flowchart in figure 5.4.

Figure 5.4: Flowchart for the interrupt handler.

The device driver can not push data to a user space program; the user space program must read from the device driver to obtain the newest image. To ensure that it does not get the same image twice, the read command from the user space program is set to sleep until the next whole image is ready. The flowchart for the read handler is illustrated in figure 5.5.
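Before turning to the read handler, the interrupt handler just described can be sketched in C as below. This is only an illustrative sketch: the line and buffer sizes and the cam_* helper functions (wrappers around the Avalon interface module registers) are assumptions, not the actual driver on the CD.

#include <linux/interrupt.h>
#include <linux/wait.h>

#define LINES_PER_IRQ   2                 /* the camera delivers two lines per interrupt */
#define LINES_PER_FRAME 128               /* assumed from the 160x128 resolution */
#define LINE_BYTES      240               /* illustrative size of one line in bytes */

unsigned char frame_buf[2][LINES_PER_FRAME * LINE_BYTES];   /* the two image buffers */
int cur_buf, cur_line, local_count, frame_ready;
DECLARE_WAIT_QUEUE_HEAD(read_queue);

/* Hypothetical wrappers around the Avalon interface module registers */
extern void cam_clear_irq(void);
extern void cam_enable_irq(void);
extern int  cam_framecount(void);
extern void cam_read_lines(unsigned char *dst, int lines);

static irqreturn_t cam_interrupt(int irq, void *dev_id)
{
    cam_clear_irq();                              /* remove the interrupt mask */

    if (cam_framecount() != local_count) {        /* camera has started a new image */
        cur_line = 0;                             /* reset the buffer */
        local_count = cam_framecount();
    }

    /* read two lines of data and copy them to the current buffer */
    cam_read_lines(&frame_buf[cur_buf][cur_line * LINE_BYTES], LINES_PER_IRQ);
    cur_line += LINES_PER_IRQ;

    if (cur_line == LINES_PER_FRAME) {            /* buffer full: switch and wake reader */
        cur_buf ^= 1;
        cur_line = 0;
        frame_ready = 1;
        wake_up_interruptible(&read_queue);
    }

    cam_enable_irq();                             /* re-enable interrupts */
    return IRQ_HANDLED;
}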

Figure 5.5: Flowchart for the read handler.

The initialize and close handler registers the device driver in the kernel and the kinds of operations the device driver supports. In this implementation the read function is used, as well as the open and release functions; these are all needed to enable communication with a user space program. When initializing the device driver, the major and minor numbers of the device driver need to be specified. These numbers are used to identify the driver and are needed when registering the driver with the kernel. The numbers are unique, so two drivers are not allowed to share the same numbers. It is possible to make dynamic allocation of the numbers, but a fixed value is used for this project. The initialize and close handler also registers the interrupt handler with the kernel in order to specify which code to execute when a specific interrupt occurs. It also enables interrupts from the Avalon interface module by enabling the interrupt mask, which is disabled by default. When the device driver is closed, the handler unregisters everything that was registered in the kernel and disables interrupts from the Avalon module by unmasking them in the Avalon interface module.

The user space handler enables the usage of the device driver by a user space program. It contains the functions that handle user space programs opening and releasing the device.

The implementation and development for this submodule is done with inspiration from [13] and [12].
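The corresponding read handler can be sketched in the same way, reusing the illustrative globals from the interrupt handler sketch above; the actual driver on the CD may differ in details such as partial reads and error handling.

#include <linux/errno.h>
#include <linux/fs.h>
#include <linux/uaccess.h>
#include <linux/wait.h>

extern unsigned char frame_buf[2][128 * 240];   /* same illustrative buffers as above */
extern int cur_buf, frame_ready;
extern wait_queue_head_t read_queue;

/* Read handler: sleep until the interrupt handler signals a complete frame,
 * then copy the finished buffer (the one not currently being filled) to the
 * user space program. */
static ssize_t cam_read(struct file *filp, char __user *buf, size_t count, loff_t *ppos)
{
    if (wait_event_interruptible(read_queue, frame_ready))
        return -ERESTARTSYS;
    frame_ready = 0;

    if (copy_to_user(buf, frame_buf[cur_buf ^ 1], count))
        return -EFAULT;
    return count;
}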

Implementation of the Hardware/Software Interface

The program is written in C and the source code can be found on the enclosed CD in /Sourcecode/devicedriver/ccddriver.c. When determining the physical addresses to read from and write to when communicating with the Avalon interface, it should be noted that there is a difference between the address space used in the SOPC builder and the one used by uclinux. The BASE address for the Avalon module can be found in /include/nios2_system.h, and it is better to include this header file and link directly to the address generated by the system. The procedure for compiling the device driver into the kernel as well as the user space test programs can be found in appendix A.5.2. The major number for the driver is 240, the minor number is 0, and the driver is registered to the interrupt number assigned to the Avalon interface module.

Testing of the Hardware/Software Interface for the Camera

Data Integrity

The initial test is a user space program that prints the output from the device driver to the console for one image. A clock generator is attached to the Avalon module to generate interrupts at a rate of 12 interrupts per second; other rates would not have any effect on the result of the test. The hardware and software images to upload to the DE2-board for this test are available on the enclosed CD, as well as the compiled version of the user space test program and the source code for the device driver, the user space test program and the Verilog code for the Avalon interface.

Test Procedure

1. Copy the hardware and the software image onto the FPGA as described in appendix A.
2. Copy the compiled version of the user space test program onto the board via an FTP transfer from a PC.
3. Start the nios2-terminal and type the commands listed in listing 5.1.

1 />msh
2 #modprobe ccddriver
3 #mknod -m 666 /dev/camdev c 240 0
4 #cd /home/ftp/
5 #chmod 777 testccd
6 #./testccd

Listing 5.1: Commands used to run the Data Integrity test.

The test program has read one image and displayed the content from the first two lines, two lines from the middle of the frame and the last two lines. In the screen dump created from the test program, I specifies the index in the frame and d is the data located at this index. As the Avalon module just reads the address back to the data lines, the first data value for every odd line should be 0 and the last value should be 120. The result from this test was that the data contained the correct address, and therefore the functionality of the module must be correct.

Transfer Speed

This test is performed to estimate the transfer speed from the Verilog code to a user space program. This is interesting as it determines whether or not the hardware/software interface presents a bottleneck for the rest of the system. The original program would have 25 frames per second and each frame would generate 64 interrupts. This requires a clock generator with a frequency of 1600 Hz to emulate. The frequency divider used in the previous test can generate a frequency of 1525 Hz when using bit number 14, which is not high enough. Increasing the frequency by a factor of 2 using bit number 13 generates interrupts at a rate of 3051 interrupts per second, ensuring that the interface is tested at a higher frequency than it actually runs at.

As before, the source and compiled versions of the test programs are included on the CD, as well as the software and hardware images of the system. The user space test program just reads 1500 frames to a buffer and does nothing else. The device driver dumps the current image count to the screen when a user space program mounts or demounts the device driver. The procedure is the same as for the previous test; only the last part differs. Instead of line 6 in listing 5.1, do the following:

1. Make a telnet connection to the board to read out interrupt counts.
2. In the telnet client type listing 5.2 to get the interrupt count.
3. Note the interrupt count for the camera.
4. In the nios2-terminal type line 6 in listing 5.1 to start the test program.
5. Note the time it takes for the program to finish and note the two CURRENTPIC (frame count) values.
6. In the telnet client retype line 3 in listing 5.2 and note the interrupt count for the camera.

7. Wait 60 seconds and in the telnet client retype line 3 in listing 5.2. Note the interrupt count for the camera.
8. Wait 60 seconds and in the telnet client retype line 3 in listing 5.2. Note the interrupt count for the camera.
9. Wait 60 seconds and in the telnet client retype line 3 in listing 5.2. Note the interrupt count for the camera.

1 />msh
2 #cd /proc
3 #cat interrupts

Listing 5.2: Commands used to run the speed test.

As this test also depends on the speed of the person performing it, it is wise to wait with the notations until the test is done, as they are saved in the terminal.

Table 5.1: Test results for the interrupt count in the transfer speed test.

The results for the interrupt counts are listed in table 5.1; the frame counts were 7857 and 9356 and the time it took to transfer 1500 frames to user space was 31 seconds. To verify that 1500 frames are transferred to the user space program, the two frame counts are subtracted. This gives a value of 1499, and adding 1 because the first frame should also count gives 1500. This means that no frames were lost during the data transfer to the user space program.

The program took 31 seconds to finish. During this time 96,000 interrupts should have occurred (1500 frames times 64 interrupts per frame). Dividing this by the frequency of the interrupt generator gives an execution time of 31.5 seconds. It is therefore assumed that no interrupts were lost, and since the real system runs at 1600 Hz, which is lower, it is safe to assume that no interrupts are going to be lost there either.

The last thing to verify is that all interrupts generated in hardware are actually handled by the device driver. In order to verify this, the number of interrupts per second is calculated. This is done from the interrupt counts noted 60 seconds apart. In table 5.2 the results for calculating the number of interrupts generated in each 60, 120 and 180 second interval are listed. The values in table 5.2 should be the same as the frequency generator running at 3051 interrupts per second. The calculated frequencies are close to this value, with the largest deviation being around 75 interrupts, which gives an error of around 3 %.

Table 5.2: The calculated frequency of the external interrupt generator.

This is close, considering the human interaction in the test. Because the test runs at almost twice the transfer rate that the program is going to run at, it is assumed that the interface is not going to present a bottleneck for the system.

5.4 Ethernet Module

uclinux contains the drivers and the TCP/IP stack necessary to communicate with a PC over an Ethernet connection, and these are added by default to the core when compiling the kernel. To transfer the MPEG-2 data stream from the board to a PC, a very simple communication procedure is developed using a socket transfer. When using a socket it is not necessary to know anything about the layers below the socket interface; only some basic knowledge about how to configure the socket interface is needed, like what kind of protocol to use, as well as the IP address and the port the socket is running on at the server. Sockets work for both TCP and UDP packets. For handling a data stream, TCP is normally used; UDP is used more for packet transfer where it is not important that packets arrive in the correct order. Communication over a socket uses the server-client paradigm; here it is important to notice that a server can not contact a client, but a client can contact a server. The board is the server and the PC is the client, as described in section 5.1. This section is made with inspiration from [6].

Implementation of the Server Side

To implement the flowchart for the server side shown in figure 5.2, each of the function calls to the socket is described. The implementation of the control structure is not described here, but can be seen in the source code included on the CD.

Create Socket and Bind it

The first functions used create the socket and bind it to a port. To open a socket the following function is called:

1 sd = socket(PF_INET, SOCK_STREAM, 6);

PF_INET tells the socket that a protocol family communication is used; e.g. TCP/IP is a member of this family. SOCK_STREAM makes the socket a stream that the data is put into. For the last parameter there is a portability issue, as uclinux does not support the function getprotobyname(). This function returns a numerical value for a given protocol like UDP or TCP. Reading this value out for TCP with a printf() command on a Linux PC gives the value 6. It is assumed that the same values are used on the uclinux core, as it is also based on the Linux kernel; it is also hinted by some authors that it can be set to 0 without any problems. Opening a socket is a generic function, so it is used for both the server and the client. To bind the socket to a port the following command is used:

1 bind(sd, (struct sockaddr *)&sad, sizeof(sad));

sad, or the socket address, is a structure that contains the following:

1 struct sad{
2 short sin_family;
3 u_short sin_port;
4 struct in_addr sin_addr;};

sin_family is the value of the PF_INET used when creating the socket, sin_port is the port being listened on and sin_addr is the IP address of the server.

Listen for incoming connections

To listen for incoming connections the following command is used:

1 listen(sd, QLEN);

sd is the socket descriptor and QLEN is the length of the request queue; incoming connection requests are stored in this queue until the connection is accepted.

Accept incoming connection

In order to accept an incoming connection the following commands are used:

1 alen = sizeof(cad);
2 sd2 = accept(sd, (struct sockaddr *)&cad, &alen);

accept takes the existing information about the socket (sd) and creates sd2, which is a new socket that is used exclusively for communication with the client. This is a standard way of handling the communication over a socket.
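Putting the calls described so far together with the recv(), send() and close() calls described next, a minimal version of the server loop from figure 5.2 could look as sketched below. The port number, buffer sizes and command handling are illustrative assumptions, and the MPEG-2 encoder calls are left out; the actual implementation is in the source code on the CD.

#include <string.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <unistd.h>

#define PORT 5001   /* illustrative port number */
#define QLEN 5      /* length of the request queue used with listen() */

int main(void)
{
    struct sockaddr_in sad, cad;
    socklen_t alen;
    int sd, sd2, n;
    char resbuf[16];

    /* Create socket and bind it to a port (protocol 6 = TCP) */
    sd = socket(PF_INET, SOCK_STREAM, 6);
    memset(&sad, 0, sizeof(sad));
    sad.sin_family = AF_INET;
    sad.sin_port = htons(PORT);
    sad.sin_addr.s_addr = htonl(INADDR_ANY);
    bind(sd, (struct sockaddr *)&sad, sizeof(sad));

    /* Listen for incoming connections */
    listen(sd, QLEN);

    for (;;) {
        /* Accept an incoming connection */
        alen = sizeof(cad);
        sd2 = accept(sd, (struct sockaddr *)&cad, &alen);

        /* Read start/stop packets from the client; while the stream is
         * running, read a frame from the camera, encode it and send() the
         * encoded bytes back (encoder calls omitted in this sketch). */
        while ((n = recv(sd2, resbuf, sizeof(resbuf), 0)) > 0) {
            /* ... encode the frame and send(sd2, sendbuf, n, 0) ... */
        }

        /* Close the connection and return to listening */
        close(sd2);
    }
    return 0;
}

Note that the fields filled in before bind() match the sad structure shown above; sockaddr_in is the standard name of that structure in the socket API.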

Read from client

The function to receive data from a client is described below:

1 n = recv(sd2, resbuf, sizeof(resbuf), 0);

The first argument is the socket used for the communication. The second and third arguments are the buffer where the received data should be copied to and the size of this buffer. The last argument is not used here, but flags can be passed in it to control how the data should be copied to the buffer, e.g. wait for the buffer to be full. The function returns zero if no data has been received; otherwise it returns the size of the data copied to the buffer.

Write to client

The function to send data to a client is as follows:

1 send(sd2, sendbuf, n, 0);

The input parameters are the same as for the recv function, where n is the amount of data in the send buffer (sendbuf) that should be sent to the client.

Closing the Connection

When the server is done transferring data to the client, the socket controlling the communication should be closed with the following command:

1 close(sd2);

The server then returns to listening for incoming connections. The commands and the structure are made with inspiration from [6].

Implementation for the Client Side

To transfer the MPEG-2 data to a client, a program is implemented that sends a request to the server; the server then sends an MPEG-2 stream to the client, which the client stores as a file. This file can then be played back to view the MPEG-2 stream. The procedure for communicating with the server is described in figure 5.3. For the implementation, only the function calls to the socket are described. Only one function call has not already been described in the server implementation, and this is therefore the only function described here.

Connect to Server

The socket call needed to make a connection to a server is as follows:

1 int connect(int sockfd, struct sockaddr *serv_addr, int addrlen);
2 connect(sd, (struct sockaddr *)&sad, sizeof(sad));

In order to make this call, the IP address and port of the server need to be known. sd is the file descriptor of the socket opened on the client side by the open socket command. The second argument, (struct sockaddr *)&sad, contains the information about the server side, i.e. the IP address and the port contained in a struct constructed the same way as in the server implementation. The third argument, sizeof(sad), is the size of the second argument.

Ethernet Test

Transfer Rate Test

This is a small test to give an indication of what kind of transfer speeds are obtainable on the development board. The test is basically an FTP transfer test.

1. An FTP server is started on the development board.
2. An FTP client is started on a PC, which can display the transfer rate.
3. A file is copied to the FTP directory on the server.
4. The transfer rate is noted down.

To get the FTP server up and running, the Ethernet settings described in appendix A.2 are made. The file that is copied to the FTP directory must not be larger than the memory available in uclinux; if the file is larger, it is going to overwrite the operating system and the system is going to crash. For our test a random file of 4 Mbyte is used, and we got a transfer rate of around 1.7 Mbyte/s. This result is only meant to give an idea of what kind of transfer speeds are obtainable on the board. It must be noted that only the FTP server is running on the development board; the actual transfer rate for the whole system is therefore going to be lower than the result obtained, as the MPEG encoder is going to take up a lot of processing time.

Communication Test

This Ethernet test consists of a text file sent from a client (PC) to a server (development board). The server then returns the same text file to the client and the client outputs it in a new file. The code is based on a client-server example written by [24]. Alterations made to make this test run in uclinux are mentioned in the code. Please note that the code used for the implementation is a stripped version of this test code, as it contains additional functionality that is not needed.

The code running on the client and the server side can be found on the CD in the /testcode/ethernet/ folder. The code was compiled as described in appendix A.5 and transferred to the board via the FTP transfer protocol. The program is started on the board using the nios2-terminal with the following commands:

1 />msh
2 #cd /home/ftp
3 #chmod 777 server
4 #./server

Start the client program on a PC with the following command:

1 #./client [ host [port]] [input] [output]

The file named input is now sent to the board and returned in output. The result was that the sent text file and the received text file were identical, and the feedback from the terminal running on the development board indicated that the server received and returned the text file.

Conclusion

In this section the software for the Ethernet module was developed using sockets and tested. The communication test indicated that the module functions correctly. The transfer rate test indicated that a transfer rate of up to 13.6 Mbit/s is obtainable, which satisfies the required speed of 4 Mbit/s from section 1.5. It should however be noted that this value could go down if the Nios II core is doing other operations, e.g. the MPEG-2 encoding.

5.5 MPEG-2 Module

MPEG-2 Hardware/Software Interface

This interface is made the same way as the hardware/software interface for the camera, i.e. by making a device driver for the hardware accelerated part of the MPEG-2 algorithm (MPEG-HWA). There is no need for interrupts to be generated in this device driver because of the relatively fast computational time: when the data has been copied to the MPEG-HWA it takes around 14 clock cycles for it to finish the calculations, so the device driver only polls the MPEG-HWA until data is ready and then reads the result of the computation.
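The polling sequence in the device driver can be sketched as below. The base address and register offsets are illustrative assumptions (in practice the base address would be taken from the generated system header, as described for the camera driver); the control interface itself, with Dataready and the result on the read side and Dataformat and start processing on the write side, is the one described in section 4.8.

/* Hypothetical register layout of the MPEG-HWA control interface:
 * two shared addresses, read = {Dataready, result}, write = {Dataformat, start}. */
#define HWA_CTRL_BASE   ((volatile unsigned short *)0x80000000)  /* illustrative address */
#define REG_DATAREADY   0   /* read: 1 when the result is valid */
#define REG_RESULT      1   /* read: 16-bit distance */
#define REG_DATAFORMAT  0   /* write: 1 = 16x8, 2 = 16x16 */
#define REG_START       1   /* write: 1 starts the calculation */

/* Called by the device driver after the macroblock data has been written to
 * the two data interfaces of the accelerator. */
static unsigned short hwa_calculate(int is_16x16)
{
    HWA_CTRL_BASE[REG_DATAFORMAT] = is_16x16 ? 2 : 1;
    HWA_CTRL_BASE[REG_START] = 1;

    /* The result is ready after roughly 14 clock cycles, so busy-waiting is
     * cheaper than setting up an interrupt. */
    while (HWA_CTRL_BASE[REG_DATAREADY] == 0)
        ;

    return HWA_CTRL_BASE[REG_RESULT];
}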

Because of the way user space and kernel space work, data must be copied to the device driver in one buffer. This means the two macroblocks must be collected into one buffer and then moved to the device driver. The MPEG-HWA allows both 16x8 and 16x16 macroblocks to be processed. When doing the fullsearch() one of the macroblocks does not change, so it would be prudent to have the ability to only update one of the macroblocks and then do the calculations. This allows for four different scenarios:

- Two 16x16 macroblocks.
- Two 16x8 macroblocks.
- Update one 16x16 macroblock.
- Update one 16x8 macroblock.

The device driver could determine what to do with the incoming data from the amount of data given by the user space program, but scenarios 2 and 3 give the same buffer size to transfer to the device driver. Therefore it has been decided to add a control byte as the first entry in the buffer transferred to the device driver. The control byte is as follows for the scenarios:

- Two 16x16 macroblocks: control byte 0.
- Two 16x8 macroblocks: control byte 1.
- Update one 16x16 macroblock: control byte 2.
- Update one 16x8 macroblock: control byte 3.

The data copied to the device driver has to be organized in one single buffer, which should be organized as illustrated in figure 5.6. When doing a macroblock update, the data stored in macroblock 2 is not changed; only the data stored in macroblock 1 is changed.
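From user space this packing could be done as sketched below. The function name, the assumption that the result is read back with read(), and the exact ordering of the two macroblocks in the buffer are illustrative; the byte counts follow figure 5.6.

#include <string.h>
#include <unistd.h>

#define HALF_MB 128   /* one 16x8 half of a macroblock in bytes */

/* Pack the control byte and macroblock data into one buffer, write it to the
 * MPEG-HWA device driver and read back the 16-bit distance.
 * mb2 is NULL for the update scenarios (control bytes 2 and 3). */
static unsigned short hwa_distance(int fdmpeg, unsigned char control,
                                   const unsigned char *mb1,
                                   const unsigned char *mb2)
{
    unsigned char buf[1 + 4 * HALF_MB];
    unsigned short result;
    int mbsize, n = 1;

    buf[0] = control;
    /* control bytes 0 and 2 refer to 16x16 macroblocks (256 bytes each),
     * control bytes 1 and 3 to 16x8 macroblocks (128 bytes each) */
    mbsize = (control == 0 || control == 2) ? 2 * HALF_MB : HALF_MB;

    memcpy(buf + n, mb1, mbsize);
    n += mbsize;
    if (mb2 != NULL) {                 /* only for control bytes 0 and 1 */
        memcpy(buf + n, mb2, mbsize);
        n += mbsize;
    }

    write(fdmpeg, buf, n);             /* driver polls the accelerator for the result */
    read(fdmpeg, &result, sizeof(result));
    return result;
}

In fullsearch() the first call would then use control byte 0 and all subsequent calls control byte 2, so that the macroblock that stays the same is not retransmitted; this is the source of the almost 50 % reduction in data movement mentioned later in this chapter.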

Figure 5.6: Illustration of how the data should be organized when transferring data with the four different control bytes. The shaded areas of the buffers are the data that is transferred in sequence.

Implementation

The implementation is written in C, and the device driver can be found in /Sourcecode/devicedriver/mpeghwa.c. The device driver has to be compiled into the kernel following the procedure described in appendix A.5.2. The major number for the implementation is chosen to be 241 and the minor number is 0.

Test

Two tests are performed. The first test is a functional test where different test vectors are sent to the device driver, and it is verified that the correct data is returned. The second test is an execution time test where the execution time for the original distance measurement is compared to the accelerated version.

Functional Test

Here eight different test vectors are sent to the hardware algorithm to verify the functionality of the interface. The return values are checked to see if they are correct. This test does not verify the calculations of the MPEG-HWA, as this was done in section 4.8.3, but it uses the results to verify that the interface is sending and receiving data as intended. As the number of test vectors used in the test of the MPEG-HWA was rather limited, this test also further verifies that the MPEG-HWA is functioning correctly. The test program, the kernel software image and the FPGA hardware image for this test are located on the enclosed CD in the folder /testcode/mpeg/, together with the source code for the device driver and the test program. Please note that this version of the device driver includes additional debug feedback that has been removed in the final version. The test vectors are as follows; the organisation of the macroblocks and the control byte is illustrated in figure 5.6.

Test 1: The first macroblock contains 256 zeroes, the second macroblock contains the values from 0 to 255, and the device driver is told to process two 16x16 macroblocks by setting the control byte to 0. The return value should be the sum of the values from 0 to 255, i.e. 32,640.

Test 2: The first macroblock contains the values from 0 to 255, the second macroblock contains 256 zeroes, and the device driver is told to process two 16x16 macroblocks by setting the control byte to 0. The return value should be 32,640.

Test 3: The first macroblock contains the values from 0 to 255, the second macroblock contains the 256 zeroes from the previous test, and the device driver is told to process one 16x16 macroblock by setting the control byte to 2. The return value should be 32,640.

Test 4: The first macroblock contains the values from 0 to 127, the second macroblock contains 128 zeroes, and the device driver is told to process one 16x8 macroblock by setting the control byte to 3. The return value should be 8,128.

Test 5: The first macroblock contains 128 zeroes, the second macroblock contains the values from 0 to 127 from the previous test, and the device driver is told to process two 16x8 macroblocks by setting the control byte to 1. The return value should be 8,128.

Test 6: The first macroblock contains 256 zeroes, the second macroblock contains 256 ones, and the device driver is told to process two 16x16 macroblocks by setting the control byte to 0. The return value should be 256.

Test 7: The first macroblock contains 256 zeroes, the second macroblock contains the value 255 in all 256 positions, and the device driver is told to process two 16x16 macroblocks by setting the control byte to 0. The return value should be 65,280.

Test 8: The first macroblock contains 256 ones, the second macroblock contains 256 zeroes, and the device driver is told to process two 16x16 macroblocks by setting the control byte to 0. The return value should be 256.

The test procedure is as follows:

1. Copy the hardware and the software image onto the FPGA as described in appendix A.
2. Copy the compiled version of the user space test program onto the board via an FTP transfer from a PC.
3. Start the nios2-terminal and type the commands illustrated in listing 5.3.

1 />msh
2 #modprobe mpeghwa
3 #mknod -m 666 /dev/mpegdev c 241 0
4 #cd /home/ftp/
5 #chmod 777 testmpeg
6 #./testmpeg

Listing 5.3: Commands used to run the functional test.

The test program now returns the results from the test. The return value from each test is displayed as well as what this value should have been. The result of the test was that there were no deviations from the expected results for the test vectors.

Execution Time

This test is performed by running the hardware accelerated distance measurement and calculating the CPU time spent by the function using the internal function clock(). This is then repeated for the original algorithm and the two time measurements are compared. For this test two frames are used, and the distance between one macroblock from the previous frame and 257 macroblocks in the current frame is calculated; 16x16 macroblocks are used. This process is then repeated 100 times to get a higher execution time. The number of comparisons between macroblocks can change, and this has an impact on the amount of data being transferred when the MPEG-HWA is used. Therefore the test is also conducted for 65 and 17 macroblock comparisons; these are repeated 400 and 1,600 times, respectively, to get higher execution times as well. In the first calculation two macroblocks are transferred to the MPEG-HWA and the distance is calculated; the following transfers only contain one macroblock from the previous frame.
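The time measurement itself follows the usual clock() pattern, sketched below; run_comparisons() is a placeholder for either the original dist1() loop or the MPEG-HWA calls, and the printout format is illustrative.

#include <stdio.h>
#include <time.h>

/* Placeholder for the code under test: performs the given number of
 * macroblock comparisons once. */
extern void run_comparisons(int n_macroblocks);

static double time_version(int n_macroblocks, int repetitions)
{
    clock_t start = clock();
    int i;

    for (i = 0; i < repetitions; i++)
        run_comparisons(n_macroblocks);

    return (double)(clock() - start) / CLOCKS_PER_SEC;   /* CPU time in seconds */
}

int main(void)
{
    /* The three cases used in the test: 257 x 100, 65 x 400 and 17 x 1,600 */
    printf("257 comparisons x 100:  %.2f s\n", time_version(257, 100));
    printf(" 65 comparisons x 400:  %.2f s\n", time_version(65, 400));
    printf(" 17 comparisons x 1600: %.2f s\n", time_version(17, 1600));
    return 0;
}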

The same hardware and software images are used as in the previous test, and the source code and the compiled version can be found in the same directory as for the previous test. Continue from where the last test stopped and copy the testmpegexe program to the board via FTP. Then type the commands in listing 5.4 in the nios2-terminal window:

1 #chmod 777 testmpegexe
2 #./testmpegexe

Listing 5.4: Commands used to run the execution speed test.

The results from the test are listed in table 5.3.

# Comparisons   # times   Time used (new)   Time used (old)   Performance gain
257             100       -                 -                 2.30
65              400       -                 -                 2.23
17              1,600     -                 -                 2.08

Table 5.3: Test of the difference in execution time between the hardware accelerated version and the old version.

MPEG-2 Algorithm Source Code

Before reading this section it is advised to briefly review chapter 2, because function and variable names are reused. Most of the MPEG-2 algorithm can be recompiled to run under uclinux without changes, but some modifications have been made to make it possible to run the algorithm on the development board, such as the camera and network interfaces. These changes are described in this section.

Input parameters

The input parameters are specified in a file located on the hard drive, and the first thing the algorithm does is read in these parameters from the file. To avoid using a parameter file, the parameters are specified at compile time instead. The disadvantage is that the algorithm is less flexible, as it is required to recompile the code rather than just use another parameter file. All changes with regard to the parameter file have been made in mpeg2enc.c. Parameters that affect other changes made to the algorithm are listed below:

- Framerate: 25 FPS.

- Resolution: 160x128.
- Chroma format: 4:2:0.
- # of frames in GOP: 12.
- # of B frames between reference frames: 2 - resulting GOP: IBBPBBPBBPBB.

Statistics

The unmodified algorithm is capable of outputting a statistics file if this is specified in the parameter file. No statistics will be used in the FPGA implementation, and any code associated with statistics is therefore redundant. To avoid unnecessary lines of code, all of these have been removed, and the stats.c source file is no longer included in the algorithm as it is used for statistics only.

Reading from camera

The unmodified algorithm uses a file stream to read frame data from the hard drive. This is replaced by a file descriptor, which is used to communicate with the camera device driver under uclinux. In readpic.c all code lines used to open and close the file stream are removed, and instead the following two lines of code are added in mpeg2enc.c to open and close the file descriptor:

1 fdcam = open("/dev/camdev", O_RDWR);
2 close(fdcam);

fdcam is the camera file descriptor, /dev/camdev is the camera device driver and O_RDWR defines the rights to use with the file descriptor. To read from the camera the read function is used:

1 read(fdcam, buffer, n);

n is the number of bytes read from the camera and buffer is where the bytes are placed after reading. To be able to use these functions under uclinux it is necessary to include the following libraries in mpeg2enc.c:

1 #include <sys/types.h>
2 #include <fcntl.h>
3 #include <sys/ioctl.h>
4 #include <sys/stat.h>

When reading frames from the camera, the luma and chroma values are ordered as shown in figure 4.6 page 41. Before the frames are encoded, the luma and chroma values need to be rearranged such that all luma values are placed first, followed by the chroma values, CB before CR.

The pels in each block of luma or chroma values are ordered with the top left pel first, then going through each line of the frame from left to right, meaning the bottom right pel will be the last in each block. The rearranging of the luma and chroma values is performed in readpic.c.

Reading from and writing to MPEG-HWA

The MPEG hardware accelerator replaces the nothxnothy part of the dist1 function described in section 2.3. It is possible to only replace the code inside the nothxnothy loop with one function writing the two macroblocks to the hardware accelerator and one function reading the result, but better performance can be obtained if these functions are moved into the fullsearch function. Each time fullsearch is executed it will make a number of calls to dist1, depending on the size of the search window. Two pointers to the macroblocks will be passed, but in hardware the macroblocks need to be written to the accelerator. One of the macroblocks is, however, static for that one execution of fullsearch, meaning it is not necessary to update this data every time dist1 is called. By doing so almost 50 % less data needs to be moved, as it is only necessary to write both macroblocks to the accelerator the first time dist1 is called.

To transfer data to and from the hardware accelerator, a file descriptor is used in conjunction with a device driver in the same manner as for the camera, except that the file descriptor is called fdmpeg and the device driver is /dev/mpegdev. Writing data to the device driver is done using the write function with the same parameters as the read function:

1 write(fdmpeg, buffer, n);

Because macroblocks span several lines in the vertical direction, but not whole lines in the horizontal direction, the macroblock data will be scattered in memory. Before a macroblock can be written to the accelerator, it is necessary to rearrange the data such that it forms one continuous block. Each time a write to the accelerator is performed, one byte is added in front of the data block to tell the device driver how much data should be transferred. Opening and closing file descriptors is handled in mpeg2enc.c, while rearranging, writing and reading data is placed in the fullsearch function in motion.c. The nothxnothy loop is removed from dist1.

Writing to network

The unmodified code stores the encoded frames in a file on the hard drive. Data is written to the hard drive using a file stream, on a byte level, in the putbits.c file. Instead of using a file stream, a file descriptor is used to make a connection to the network layer. The bytes are then transferred over the network through the file descriptor in putbits.c. The Ethernet connection is established and data is sent to and read from it using the commands described in section 5.4.

Creating and binding sockets as well as setting up the listening server is performed as the first thing in mpeg2enc.c, and when a client wishes to connect, the connection is established. When a packet indicating that an MPEG-2 stream should be initiated is received, the encoding is started with the blocks described in section 2.4, opening and closing file streams excluded; these are replaced by opening and closing Ethernet connections. Instead of terminating the program when the MPEG-2 encoding is stopped, mpeg2enc.c will return to listening for new client connections. In putseq.c a change is made such that the encoding is looped until a packet indicating that the MPEG-2 stream should be stopped is received from the client, rather than looping for a fixed number of frames. After each frame has been encoded and transmitted, putseq.c will read data from the network to see if it should continue or stop the MPEG-2 stream. If a stop packet has been received, the file tail is transmitted to the client and the connection is closed; otherwise another loop is performed. In putbits.c the writing of bytes to the file stream is replaced by a function writing bytes to the file descriptor used for communicating with the network layer.

Frame encoding sequence

When encoding an MPEG-2 sequence, frames will not be encoded in sequential order if the sequence contains B frames. If the GOP defined in the input parameters above is used, the encoding order will be frame 1 (I), 4 (P), 2 (B), 3 (B), 7 (P), 5 (B), 6 (B), etc. Because the original algorithm encodes files from the hard drive this is not a problem, but when encoding on the FPGA, frames will be read from the camera in sequential order. This means it is necessary to store some of the frames in buffers until they are needed for encoding.

The unmodified algorithm uses three pairs of buffers: a pair for old frames, a pair for new frames and a pair for auxiliary frames. In each pair there is one original frame, which is read from the hard drive, and one reference frame, which is reconstructed from a compressed frame. All buffers are accessed through pointers. When the first I frame is read in, it is stored in the buffer pointed to by the new frames pointer. This is done indirectly by setting the input pointer equal to the new frames pointer and then using the input pointer in the frame read function. The old frames pointer and the auxiliary frames pointer will be pointing to the empty buffers, respectively. When a new I or P frame is read, the new frames and old frames pointers are swapped and the input pointer is again set equal to the new frames pointer. The previously read I frame will then be pointed to by the old frames pointer, and the new frames pointer will be pointing to the current frame. This swapping of pointers is repeated each time an I or P frame is read. When a B frame is read, the input pointer is set equal to the auxiliary frames pointer, and the new frames and old frames pointers are unchanged. By doing so, the old frames and new frames pointers can be used in the forward and backward prediction of the encoding of the B frame, respectively.

If more B frames are read, the pointers will not be changed, as the same frames are used for forward and backward prediction. This is illustrated in figure 5.7.

Figure 5.7: Pointers and buffers used for frames in the unmodified algorithm. When the first I frame is read (a) the old frames and auxiliary buffers will be empty. The input pointer will be pointing to the new frames buffer. When a subsequent I or P frame is read (b) the new frames and old frames pointers will be swapped and the input pointer will be moved to the new frames buffer. When a B frame is read (c) the new frames and old frames pointers will remain unchanged while the input pointer is set to point to the auxiliary buffer. If more B frames are read the pointers will not be modified.

To make it possible to encode the sequentially incoming frames, two temporary frame buffers are used. With the used GOP there are two B frames in between each I or P frame, and these are stored in the buffers until needed. When this happens, the needed B frame is copied from its temporary buffer into the buffer pointed to by the input pointer, and the temporary buffer can then be used for a new B frame. To fill up the buffers, three consecutive frames are read in to begin with: one I frame to the input buffer and two B frames to the temporary buffers. After this, a pattern for reading and copying frames is repeated for every three frames, as illustrated in figure 5.8. The buffer handling is implemented in motion.c, while declaration of pointers and memory allocation for the buffers are handled in global.h and mpeg2enc.c, respectively.
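The pointer handling of the unmodified algorithm can be summarised by the small sketch below; the variable names are illustrative and only the original-frame pointers are shown (the reconstructed reference frames are handled the same way).

enum { I_TYPE, P_TYPE, B_TYPE };                  /* picture types used in the encoder */

static unsigned char *neworg, *oldorg, *auxorg;   /* the three buffer pointers */
static unsigned char *inputptr;                   /* where the next frame is read to */

/* Select the buffer the next frame should be read into, as in figure 5.7. */
static void select_input_buffer(int frame_type)
{
    if (frame_type == B_TYPE) {
        /* B frame: old/new frames keep serving forward/backward prediction */
        inputptr = auxorg;
    } else {
        /* I or P frame: swap new and old, then read into the new frames buffer */
        unsigned char *tmp = oldorg;
        oldorg = neworg;
        neworg = tmp;
        inputptr = neworg;
    }
}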

Figure 5.8: Use of two temporary frame buffers to obtain the right encoding order with frames being input sequentially to the FPGA. Step (1): three frames are read in, the I frame directly to the input buffer and a B frame in each temporary buffer. Step (2): the fourth frame is passed directly to the input buffer as this is a P frame; the temporary buffers remain unchanged. Step (3): the temporary buffer containing the first B frame is copied to the input buffer and then overwritten with the first new B frame from the camera. Step (4): the second B frame is copied to the input buffer and overwritten with the second new B frame from the camera. After step (4), steps (2-4) are repeated as shown in steps (5-7) until the encoding is finished.

Test of the MPEG-2 Algorithm Source Code

Each of the different changes made to the algorithm has been tested. Code that has been tested previously is reused, but not tested for errors again; only new lines of code are tested. Most of the tests are verified by running the code on the development board and checking whether the encoded output file size matches that of an output file encoded on a PC. The output files are also inspected for visual changes. A total of 10 frames are encoded. Five different versions of the program, with an increasing number of changes, can be found on the CD in the testcode/mpeg_algorithm folder. Each version folder contains a compiled MPEG2ENC.exe file for execution on a PC with Windows, a compiled mpeg2encode file for execution on the development board and the *.c source files. Below is a brief description of each version:

V1: Unmodified version. Uses command line arguments to specify the parameter file and the output file.

V2: Parameters have been made static and included at compile time. The output file has been made static. No need for arguments on the command line.

V3: Buffers to allow for sequential encoding have been added to the code.

V4: Code to rearrange macroblocks to be located in one block in memory has been added. The frame distance is still calculated using dist1.

V5: Code to use the MPEG-HWA to calculate the macroblock distances has been added. Can not be compiled for or run on other platforms than the development board.

Further changes have not been implemented because of time limitations.

Test procedure for Windows

1. V1: Run MPEG2ENC.exe from the command line with the arguments pars.pc out.mpeg. This will read the parameters from pars.pc and save the encoded output in a file called out.mpeg. V2-V4: Run MPEG2ENC.exe from the command line, no arguments needed. The encoded output will be saved in a file called out.mpeg. V5: Can not be run on other platforms than the development board.
2. Determine the file size of out.mpeg. This should be exactly the same for all versions.
3. Visually inspect out.mpeg. This should look exactly the same for all versions.

Test procedure for development board

1. Copy the hardware and the software image onto the development board as described in appendix A.
2. Copy the compiled mpeg2encode to the development board using FTP transfer.
3. Copy the frame files from video/matlabslide to the development board using FTP transfer. Place the frame files in the root, not in a subfolder.
4. V1 only: Copy pars.linux to the development board using FTP transfer.
5. V1: In the command line type in lines 1-3 and 7 from listing 5.5. V2-V4: In the command line type in lines 1-3 and 6 from listing 5.5. V5: In the command line type in lines 1-6 from listing 5.5.
6. If the execution time is printed in the command line, note this in the results.
7. Transfer the encoded out.mpeg to a PC using FTP transfer as shown in listing 5.6.
8. Determine the file size of out.mpeg. This should be exactly the same for all versions.
9. Visually inspect out.mpeg. This should look exactly the same for all versions.

1 /> msh
2 # cd /home/ftp/
3 # chmod 777 mpeg2encode
4 # modprobe mpeghwa
5 # mknod -m 666 /dev/mpegdev c <Major Number> <Minor Number>
6 # ./mpeg2encode
7 # ./mpeg2encode pars.linux out.mpeg

Listing 5.5: Commands used to run the functional tests for the MPEG algorithm changes.

1 # ftp [ip address]
2 ftp> [username]
3 ftp> [password]
4 ftp> binary
5 ftp> put out.mpeg
6 ftp> exit

Listing 5.6: Commands used to transfer a file from the development board to a PC using FTP transfer. Values in square brackets should be replaced with the correct info.

Test results

The test results are listed in table 5.4.

                 Execution time    File size    Visual difference?
V1   PC                                B        No
     Dev. board                        B        No
V2   PC                                B        No
     Dev. board                        B        No
V3   PC                                B        No
     Dev. board         s              B        No
V4   PC                                B        No
     Dev. board         s              B        No
V5   Dev. board         s              B        No

Table 5.4: Test results for the MPEG-2 algorithm test.

The lack of visual differences and equal file sizes shows that there is no difference between any of the versions from an algorithmic point of view. From a processing point of view there are noticeable changes in version 4 and version 5. There is a 66 % increase in execution time from version 3 to version 4, where the rearranging of the macroblock data is introduced but no hardware acceleration is used. When the hardware acceleration is introduced in version 5 there is a 12 % decrease in execution time, despite the interface test results showing performance gains of at least a factor of 2 in execution time.

The number of frames encoded per second can be calculated as the 10 encoded frames divided by the measured execution time t from table 5.4:

V2 : 10 / t = FPS (5.1)
V3 : 10 / t = FPS (5.2)
V4 : 10 / t = 0.08 FPS (5.3)
V5 : 10 / t = 0.09 FPS (5.4)

Conclusion

In this section the hardware/software interface for the MPEG hardware accelerator and changes to the MPEG algorithm source code were implemented. The interface was implemented as a device driver for uclinux and tested. The test verified that the device driver works as intended and that the performance gain in terms of execution time was at least a factor of 2. Because of time limitations the communication with the camera and network was not implemented in the algorithm source code, but encoding tests could still be performed on static data uploaded to the development board. These tests showed that the implemented changes worked as intended and also measured their execution times. The MPEG-HWA yielded only a 12 % decrease in execution time despite the test results for the interface showing performance gains of up to a factor of 2. The tests also showed an increase in execution time of 50 % when rearranging of macroblock data, but no MPEG-HWA, was introduced to the code. Because of this it is possible to conclude that even though the MPEG-HWA works as intended, only a minor performance gain can be obtained, because it is not the calculations that take up most of the execution time, but memory moves. The MPEG-HWA is indeed faster than the Nios II core on its own, but because of the associated memory moves the overall performance is better using the Nios II core without the MPEG-HWA.

This concludes the software design chapter and thereby the work of this project. The modules and their submodules from hardware/software partitioning have been designed, implemented and tested and their individual results were extracted in the many subconclusions. To compare the work achieved through this project with the goals set in the introduction, a final conclusion is written in the following chapter. A discussion of future work is also located in this chapter to sum up what could have been achieved if given another time frame.


Chapter 6 Conclusion and Future Work

Throughout this report numerous subconclusions have been drawn. In this chapter the essential parts of these conclusions are summarized. Furthermore, a general conclusion is established upon the goals reached in the subconclusions and the goals stated in the problem statement. For the prototype of this project to be completed, some short term future work is considered. Besides these necessary implementations, future improvements are also investigated with regard to implementation of new algorithms and thereby adding more features to the system.

6.1 Conclusion

The problem in this project was to design a surveillance system for a TCP/IP-network environment on an FPGA. As video surveillance introduces a huge amount of data transfer from the camera to the operator, it was necessary to implement compression of this data in the given environment. With the flexibility of an FPGA - providing the opportunity for implementing modules in hardware without fabricating new ICs - it was interesting to investigate algorithms used in the surveillance system for potential hardware implementation to increase performance.

The problem stated above made it possible to divide the project into three main modules: a camera to provide the video stream, MPEG-2 encoding as a method of data reduction and Ethernet as the interface between the system and the operator.

Profiling was used to investigate which submodules of the MPEG-2 encoding algorithm take up the most execution time. The profiling showed that the function nothxnothy for calculating the distance between macroblocks used on average 65 % of the total execution time for the MPEG-2 module. Calculating the characterization metrics for this function indicated that it was potentially parallelizable, meaning it might be suited for an FPGA.

Therefore a hardware implementation of this function was carried out.

After the modules had been identified, these were divided into submodules in order to perform the hardware/software partitioning of the system, i.e. to specify which parts of the modules should be implemented in hardware and which should be implemented in software. This resulted in moving the chroma subsampling and RGB to Y C B C R submodules from the MPEG-2 software module into the camera hardware module. By doing this, the data sent through the hardware/software interface between the camera module and the MPEG-2 module is reduced by 50 %. Also the nothxnothy function of the MPEG-2 algorithm was hardware accelerated by writing its functions in Verilog code. This hardware accelerated part is capable of executing nothxnothy two times faster than the existing sequential software implementation. Due to the memory moves associated with the hardware accelerated part, the obtained performance boost was a 1.5x faster execution of the new hardware accelerated implementation compared to the old software implementation.

Throughout the project, design reference code provided by manufacturers and standards organizations has been modified to fit the requirements for the designed system. This includes the camera interface, configuration and establishment of an Ethernet connection and the MPEG-2 encoding algorithm. The DE2 board from Terasic was selected for implementing the hardware and software partitioned modules, as this board provided the best environment among the boards available. The Ethernet connection was implemented by installing a uclinux operating system on the Nios II softcore processor and modifying the Ethernet implementation for the DE2 board to fit this operating system. The camera module was implemented by modifying existing Verilog code from Terasic so it fits the registers of the conversion and subsampling modules. Further modifications to the camera code were done to extract control signals used throughout the camera module. To implement the two hardware submodules, RGB to Y C B C R conversion and chroma subsampling, a reference design in Verilog code from Xilinx needed to be modified to fit Altera's FPGA and the DE2 board. Input registers and control signals were written to fit those from the camera. Subsampling was written from scratch and implemented as a part of the Verilog conversion code. To investigate the area usage of the conversion code, a modification was done on the internal precision of its calculations. The hardware accelerated part was also written from scratch. The remaining parts of the MPEG-2 encoding algorithm were implemented on the Nios II softcore processor on the FPGA.

Through several tests the implemented code was investigated for proper functionality. As an entire system was not produced and not all module tests could be conducted in ModelSim, some functionality of the camera module could not be verified. Module tests did, however, show that some submodules that could not be tested individually functioned as expected. The following contains the most significant tests for the success of the overall project:

Camera:

1. RGB to Y C B C R conversion: Using the RGB color bars as test vectors, the test gave the correct conversion values and showed that the delay introduced from input to output was six clock cycles. Configuring the internal precision of this module from 13 bit (Y), 11 bit (C B) and 10 bit (C R) to 8 bit for all of them showed a reduction in area usage, but lacking precision, as it introduced a variation of 0.4 % from the correct values.
2. Chroma subsampling: The test showed that the data was organized correctly in the assigned registers.
3. CCDbuffer: There were problems testing this submodule, as it uses a megafunction from Altera that could not be tested in ModelSim; therefore the functionality of the module could not be verified.
4. Interface with software: The device driver for the uclinux and the Avalon interface to the softcore processor were tested and performed as expected.

MPEG:

1. Software implementation: The changes needed for the MPEG algorithm to function on the FPGA were partly implemented. Changes to avoid a parameter file, buffers to allow encoding of the sequentially incoming images, code to rearrange macroblock data before being hardware accelerated and an interface to the hardware accelerator have been implemented. The changes that were not implemented include rearranging of data read from the camera module and the server side of the Ethernet connection. Tests showed that the implemented changes worked as expected and did not make any changes to the output from an algorithmic point of view.
2. Hardware acceleration: A functional test was performed in ModelSim with different test vectors and the result showed that the calculations as well as the function were executed correctly.
3. Hardware/software interface: A functional test was performed in uclinux, where different test vectors were sent to the hardware accelerated part of the MPEG-2 and the results returned were verified. This implied that the interface relayed data to and from the hardware accelerated part of the MPEG-2 as expected.

Ethernet:

1. A data transfer test was performed as well as a performance test. The results from these tests showed that data transfer was possible between the Ethernet and the operator and that the transfer speed is high enough so that it is not a bottleneck, i.e. higher than 4 Mbit/s.

The objectives set in Scope of Project 1.1 were all achieved. The MPEG-2 algorithm was analyzed with respect to execution time. The most time consuming part of the algorithm was hardware accelerated, and hardware/software interfaces were constructed to link the hardware implemented modules together with the software executed on the Nios II softcore processor. The Intellectual Property (IP) cores of the chosen FPGA were used through the Altera design tools and through the developed Verilog code, but presented a problem during functionality testing.

The goals set in Requirement Specification 1.5 concerning General Requirements were partially met. A working MPEG-2 encoder was implemented on the FPGA and an Ethernet connection was established from this FPGA to a PC, but the surveillance system was not able to output a video stream. The reason for this should be found in the test of the CCDbuffer that was not conducted, and in the fact that no final compilation of the whole system was finished due to time limitations.

Though the final system could not be tested, it is most likely possible to draw the following conclusions: generally speaking it is possible to implement a working MPEG-2 compression profile on a Cyclone II FPGA. Tests of the MPEG-2 algorithm showed that the implemented algorithm was capable of encoding around 0.09 frames per second (fps) at a resolution of 160x128 pels, equation (5.4). In order to reach the wanted frame rate of 25 fps the system needs to be sped up by a minimum of 280 times. Note that this speed-up does not take into account that a higher resolution than 160x128 would further increase the execution time. In the future work section possible solutions for reaching this goal, if at all possible, are investigated. One of the solutions discussed is the effect of uclinux. This operating system was implemented on the Nios II softcore processor, reducing the performance and thereby increasing execution time by introducing an overhead on the Nios II softcore processor.

6.2 Future Work

This section is divided into short term and long term improvements as well as new features that could be added to the surveillance camera.

Short Term Improvements

Due to time limitations a working prototype was not finished during this project. Therefore some attention to the missing links in the project is needed. This applies to testing the CCDbuffer and compilation of the system's modules as explained in section 6.1. As discussed in the previous section, there were problems regarding the speed of the Nios II softcore; therefore a short term improvement would be an investigation of the speed performance when the uclinux operating system is discarded from the system.

Other improvements of execution speed could be investigated by using Direct Memory Access (DMA) for all the modules interacting with the software. Hardware acceleration of more blocks of the MPEG-2 algorithm could also reduce execution time, as would a faster FPGA. After these speed improvements, it would be of interest to see how high a video quality could be obtained with the DE2 board. This mainly concerns the resolution of the camera, which could be changed to 1280x1024, but other things like subsampling and frame rate would also be of interest.

Long Term Improvements

When talking about long term improvements, it is with a perspective of 6 months. This is to give an idea of what could be of interest in a follow-up project.

Development on an ASIC: One of the possibilities to improve the performance would be to design the system for an Application Specific Integrated Circuit (ASIC) rather than an FPGA. This will, however, reduce the flexibility of the system, as an ASIC is an integrated circuit customized for a particular use and can not be reprogrammed. The advantages of the ASIC are a 1.67 to 3.33 times higher speed and a power consumption of 0.20 to 0.50 times that of an FPGA [3]. However, an ASIC design is an expensive implementation and would, because of this, not be realistic for a university project.

Different strategies of implementation: Other implementation strategies could also be investigated. This includes implementing the whole MPEG-2 encoder in hardware and not only the three submodules as has been done in this project. The use of a General Purpose Processor (GPP) besides the FPGA instead of the Nios II softcore processor would give a significant performance boost, as a GPP can run at higher clock frequencies and enables the use of faster memory.

Network protocols: As a future task it could also be interesting to investigate a more suitable communication protocol for collecting the data. The protocol used could change depending on the type of surveillance: active surveillance, i.e. somebody is looking continuously at the stream, or passive surveillance, i.e. the stream is stored on a server for future processing.

New Features

Face recognition: For surveillance reasons, face recognition would be a very powerful tool to incorporate in a system like the one developed in this project. A further improvement for face recognition would be to establish a database of registered faces to make surveillance of individual persons possible.

Motion detection: Motion detection could be used as a power saving feature, as when no movement is detected, no data is processed.

Ethernet over power: By using the power line supplying the system to also work as an Ethernet connection to the operator, a lot of wiring could be saved, thereby easing the installation.

Improve security: Giving the surveillance camera the ability to be used on an insecure network like wireless LAN would require some considerations about which kind of cryptography to use as well as how to implement it effectively in the system.

The four topics discussed above are all related to surveillance, but if considerations are made about other applications that use cameras in an intelligent way, the list of optional implementations becomes even longer. The most important feature of the system developed in this project is the flexibility of the FPGA, which makes it possible to tailor the system so it fits the application. Furthermore it is possible to keep developing the application as new algorithms become available.

Bibliography

[1] Altera, Quartus II Design Suite, Internet.
[2] Altera, Nios II C-to-Hardware Acceleration Compiler, Internet.
[3] Altera, FPGAs and Structured ASICs, Overview & Research Challenges, Internet.
[4] Altera [May 2007], Avalon Memory-Mapped Interface Specification, Internet.
[5] Benoit Payette [2002], Color Space Converter: R'G'B' to Y'CbCr, Internet.
[6] Brian Beej Hall [October 8, 2001], Beej's Guide to Network Programming Using Internet Sockets, Internet book.
[7] Celoxica [2005], Platform Developer's Kit RC200/203 Manual, Internet.
[8] Charles A. Poynton [1996], A Technical Introduction to Digital Video, Internet.
[9] Charles A. Poynton, Frequently Asked Questions about Color, Internet.
[10] Chroma subsampling, Internet.
[11] Joan L. Mitchell, William B. Pennebaker, Chad E. Fogg and Didier J. LeGall [2002], MPEG Video Compression Standard, ebook.

[12] Jonathan Corbet, Alessandro Rubini, Greg Kroah-Hartman [February 2005], Linux Device Drivers, O'Reilly.
[13] Philipp Lutz [February 9th 2008], Devicedrivers and Testapplications for a SOPC solution with Nios II softcore processor and uclinux, University of Applied Sciences Augsburg.
[14] Members [n.d.], Nios Community Wiki, Internet.
[15] Michael Smed Kristensen and Søren Birk Sørensen [2006], Suppression of MPEG Artifacts, Master Thesis Work in Applied Signal Processing and Implementation.
[16] Micron Technology, Inc [2004], 1/3-Inch Megapixel CMOS Active-Pixel Digital Image Sensor, Internet.
[17] Mohammed Ghanbari [1999], Video Coding: An Introduction to Standard Codecs, Book.
[18] Philip Lutz, Devicedrivers and Testapplications for a SOPC solution with Nios II softcore processor and uclinux, Internet.
[19] Rasmus Abildgren, Aleksandras Saramentovas, Paulius Ruzgys, Peter Koch, Yannick Le Moullec [2007], Algorithm-Architecture Affinity - Parallelism Changes the Picture.
[20] S. Biering-Sørensen, F. O. Hansen, S. Klim, P. T. Madsen [2002], Håndbog i struktureret programudvikling, 1.10 edn, Ingeniøren bøger.
[21] Terasic [2008], Manuals for DE-2 Development Board and TRDB-DC2 Camera, Internet.
[22] Thomas Sikora [n.d.], MPEG-1 and MPEG-2 Digital Video Coding Standards, Internet.
[23] Tristan Savatier [March 2008], Moving Pictures Experts Group, Internet.
[24] Vincent Chu [Winter 2004], Client Server Example Using Sockets, Internet.
[25] Yannick Le Moullec, Jean-Philippe Diguet, Nader Ben Amor, Thierry Gourdeaux, Jean-Luc Philippe [2005], Algorithmic-level Specification and Characterization of Embedded Multimedia Applications with Design Trotter.

Appendix A Appendix for uclinux

This appendix is made with inspiration from [18] and the Nios Community Wiki [14].

A.1 Installation Procedure for uclinux onto PC

To install uclinux, the first step is to download a distribution of uclinux. For this project, version was used. All work with uclinux must be done on some kind of Linux platform. If a Windows system is preferred, it is possible to use an emulator program like Cygwin to do the compilation and the configuration of the uclinux. The following programs need to be available in Linux:

gcc
make
ncurses-dev
bison
flex
gawk
lynx
bash
git-core

Most of these can be installed on a Ubuntu system by:

1 sudo apt-get install <package>

The default shell must be set to bash. Then make sure that cc is symbolically linked to gcc by typing:

1 cd /usr/bin
2 ln -s gcc cc

The next step is to set up the Git server that is used for revision control of the uclinux kernel, and to select which kernel version to use:

1 tar xf <uclinux-dist archive>
2 cd uclinux-dist
3 git branch -a

Now select the uclinux version by typing:

1 git checkout <uclinux version tag>

The uclinux should now be ready to be configured, but first the Nios II embedded design suite cross compiler must be installed. It contains the compiler used when compiling the uclinux kernel as well as programs for compiling programs for the uclinux. The cross compiler can be downloaded from the Nios Community Wiki [14] or from Altera's webpage. It is unpacked and installed onto the system by line 1 in listing A.1 in the /opt/nios2 directory. Line 2 must be added to the bash profile in Linux. Type echo $PATH in the terminal and it should return the directory where the cross compiler was installed.

1 tar jxf nios2gcc.tar.bz2 -C /
2 PATH="${PATH}":/opt/nios2/bin

Listing A.1: Installing the Nios II cross compiler.

Now the Quartus suite should be installed. This is necessary because the uclinux needs to know the hardware configuration of the FPGA, e.g. addresses and the amount of RAM and ROM. Quartus running on Windows can be used to create the files needed to make the configuration of uclinux; they just need to be copied to the Linux system.

A.2 Configuration Procedure for uclinux

In order to do the configuration procedure the *.ptf file from SOPC Builder is needed, which contains the physical addresses of all the components connected to the Nios core. The *.ptf file is generated when building the hardware configuration for the FPGA in SOPC Builder.

To set up the uclinux for the configuration of the system it is supposed to run on, type the following command in the folder where the uclinux distribution is installed:

1 $ make menuconfig

First select Vendor/Product. Select the vendor to be Altera and the product to be nios2nommu. Exit this menu and go into the Kernel/Library/Default selection. Here select only the Default all settings. Exit two times and, when asked if the changes should be saved, select yes. The next thing is to set up the memory map from the ptf file. Type the following command, where <> is the path to the PTF file including the name of the ptf file, like /etc/myproject/proc.ptf:

1 $ make vendor_hwselect SYSPTF=<absolute path to PTF file>

The system now asks which CPU to build the kernel against; normally there is only one selection here, so press the number for the Nios processor (most likely 1). The next selection is where to upload the kernel to. Here select the cfi flash. The last selection is where to execute the kernel; select SDRAM. To prepare the configuration files for potential user applications, run the following command:

1 $ make romfs

The last step is to compile the kernel and create the image to upload to the board:

1 $ make
2 $ make linux image

A.3 Installation Procedure for uclinux onto Development Board

After the creation of the kernel image, the next step is to move it onto the board. The first thing to do is to configure the FPGA. This can either be done from Quartus by programming the board, or by using the following command in the terminal, using the sof image that contains the hardware configuration of the FPGA:

1 $ nios2-configure-sof <absolute path to the sof file for the project>

The next step is to upload the kernel image to the board. The image is located in the /image folder inside the folder for the uclinux distribution:

1 $ nios2-download -g <absolute path to the uclinux image file>

Now start the Nios II terminal:

1 $ nios2-terminal

A.4 Customizing the Kernel

The standard kernel installed in the previous section does not contain the network drivers for the Ethernet or tools for developing device drivers. Therefore the following changes are made when compiling the kernel. Note that some of these changes, like the network drivers, are specific to the DE2 board. Run the following command again:

1 $ make menuconfig

Select Kernel/Library/Default selection. Here mark Customize Kernel Settings and Customize Vendor/User Settings. Press exit two times and yes when asked to save. A menu for Linux Kernel Configuration should appear. Do the following:

1 Select Loadable module support
2 Select Enable loadable module support and module unloading
3
4 Select Processor type and features
5 Select Platform
6 Set this to the DE2 development board.
7 Return to main menu
8
9 Select Networking
10 Mark Networking support
11 Select Networking options
12 Mark Packet socket, Unix domain sockets, TCP/IP networking and IP multicasting.
13 Return to main menu
14
15 Select Device Drivers
16 Select Network device support
17 Select Ethernet 10 or 100 Mbit
18 Mark Ethernet 10 or 100 Mbit and DM9000 with checksum offloading.
19 Return to main menu. Press exit and yes when asked to save the configuration.

A menu for selecting user applications should now appear:

1 Select Busybox
2 The following applications must be marked:
3 cat, chmod, dmsg, echo, ftpget, ftpput, hostname, ifconfig,
4 ifconfig: status reporting, ifconfig: extra options, insmod,
5 insmod: lsmod, insmod: modprobe, insmod: rmmod, kill, killall,
6 ls, mkdir, mknod, mv, ping, ps, pwd, rm, route, shell,
7 msh: Minix shell, MSH is /bin/sh, sh: command editing,
8 sh: tab completion, sh: stand alone, sh: applets first, telnet.

Exit twice and select yes when asked to save the configuration. Now run the following command:

1 $ make romfs

Go into /uclinux-dist/vendor/altera/nios2nommu, open the file rc and append the following to the end of the file. The second ifconfig command sets up the MAC address of the Ethernet; this value can be changed if necessary:

1 hostname uclinux
2 mount -t proc proc /proc
3 mount -t sysfs sysfs /sys
4 mount -t usbfs none /proc/bus/usb
5 mkdir /var/tmp
6 mkdir /var/log
7 mkdir /var/run
8 mkdir /var/lock
9 ifconfig lo
10 route add -net netmask lo
11 ifconfig eth0 hw ether 00:07:ed:0a:03:30
12 ifconfig eth0 up
13 dhcpcd &
14 inetd &

The last thing to do is to implement the changes into the kernel. Run the following commands to recompile the kernel and make a new image for the DE2 board:

1 $ make
2 $ make romfs
3 $ make
4 $ make linux image
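Once the board has booted with the new image and this rc script, the network configuration can be checked directly from the nios2-terminal with the busybox tools selected above; a small example, where the address in square brackets is a placeholder for the IP address of the PC:

1 # ifconfig eth0
2 # ping [ip address of the PC]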

A.5 How to Compile C-code for the uclinux

This section is a short guide to compiling code for the uclinux and how to execute it on the board. In this section two different kinds of programs are explored: user space programs and device drivers. The user space programs are C programs like the MPEG-2 encoder or a socket server, and device drivers are code that makes it possible to communicate with hardware from user space.

A.5.1 User Space Programs

Compile the user space program with the following line on a Linux PC with the cross compiler installed:

1 nios2-linux-uclibc-gcc <Name of c file> -o <Name of output file> -elf2flt -s 1600 -Wall

Copy the output file via FTP to the development board, change the permissions for the file and then run it:

1 # cd /home/ftp
2 # chmod 777 <name of file>
3 # ./<name of file>

A.5.2 Device Drivers

Compiling device drivers is rather complicated. The procedure used here was recommended by the Nios Wiki forum, but is not the fastest; other options are available but have not been explored. The device driver is compiled together with the kernel. The procedure is as follows: first copy the source file of the device driver into the /linux-2.6.x/drivers/misc/ directory. Append the following to the file Kconfig. Text in <> depends on the device driver:

1 config <NAMEFORDEVICEDRIVER>
2 tristate <This module is... >
3 depends on DE_2_BOARD
4 help
5 <data about the device driver i.e.
6 MAJOR number and other things that should
7 be in the help function>

Append the following to the file Makefile. The name for the device driver must be the same as in the Kconfig file:

1 obj-$(CONFIG_<NAMEFORDEVICEDRIVER>) += <NameOfSourceFile (do not type .c)>.o

Now the device driver must be selected to be included into the kernel by doing the following steps:

1 $ make menuconfig

Select Kernel/Library/Default selection. Here mark Customize Kernel Settings. Press exit two times and yes when asked to save. A menu for Linux Kernel Configuration should appear. Do the following:

1 Select Device Drivers
2 Select Misc devices
3 Mark the device driver with an <M>.
4 Return to main menu, press exit and answer yes to save.

Now the kernel must be recompiled by typing the following in the main folder of the uclinux distribution:

1 $ make
2 $ make romfs
3 $ make
4 $ make linux image

The linux software image can now be loaded onto the development board by the procedure described earlier. The device driver is now available inside the kernel, where it can be activated by the following commands typed in the nios2-terminal:

1 modprobe <name of the source file (do not type .c)>
2 mknod -m 666 /dev/<name of the device driver> c <Major Number> <Minor Number>

The major and minor numbers are selected inside the source file for the device driver; it is important that no two device drivers have the same numbers. <name of the device driver> can differ from the name of the actual device driver; this name is only important for the user space programs that want to communicate with the device driver.
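To tie the steps above together, the following is a minimal sketch of what such a misc character device driver can look like for a 2.6 kernel. It is not the project's actual MPEG-HWA driver: the major number (240, from the local/experimental range), the device name, the buffer size and the idea of buffering one block of data in the write handler are assumptions made for illustration, and the point where data would be handed to the Avalon mapped accelerator is only marked with comments.

#include <linux/module.h>
#include <linux/init.h>
#include <linux/fs.h>
#include <linux/uaccess.h>

#define EXAMPLE_MAJOR 240          /* example major number, local/experimental range */
#define EXAMPLE_NAME  "mpegdev"    /* example name, matching the node made with mknod */

static unsigned char block[256];   /* example buffer: one 16x16 macroblock of data */
static int result;                 /* value the accelerator would return */

static ssize_t example_write(struct file *filp, const char __user *buf,
                             size_t count, loff_t *ppos)
{
    if (count > sizeof(block))
        count = sizeof(block);
    if (copy_from_user(block, buf, count))
        return -EFAULT;
    /* here the real driver would write the data to the accelerator's
     * Avalon mapped registers and start the calculation */
    return count;
}

static ssize_t example_read(struct file *filp, char __user *buf,
                            size_t count, loff_t *ppos)
{
    if (count > sizeof(result))
        count = sizeof(result);
    /* here the real driver would read the accelerator's result register */
    if (copy_to_user(buf, &result, count))
        return -EFAULT;
    return count;
}

static const struct file_operations example_fops = {
    .owner = THIS_MODULE,
    .read  = example_read,
    .write = example_write,
};

static int __init example_init(void)
{
    /* register the character device under the chosen major number */
    return register_chrdev(EXAMPLE_MAJOR, EXAMPLE_NAME, &example_fops);
}

static void __exit example_exit(void)
{
    unregister_chrdev(EXAMPLE_MAJOR, EXAMPLE_NAME);
}

module_init(example_init);
module_exit(example_exit);
MODULE_LICENSE("GPL");

A user space program compiled as in section A.5.1 can then simply open /dev/<name of the device driver>, created with mknod above, and use read and write on it like on any other file.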


Appendix B MPEG-2 Overview

Since 1988 the Moving Picture Experts Group (MPEG), consisting of the International Organization for Standardization (ISO) and the International Electrotechnical Commission (IEC), has set standards for video and audio encoding, where MPEG-2 is a standard for Digital Video Broadcasting Terrestrial (DVB-T), Super Video Compact Disc (SVCD) and DVD. More specifically MPEG-2 is a compression framework which compresses video data or images by use of lossy encoding. This means that image quality is lost, but the algorithms implemented in the MPEG-2 framework use spatial and temporal redundancies as well as perceptual coding to hide this loss in quality.

The compression is necessary to meet the requirements set for storage and transmission rate, a so-called target bit-rate. This target bit-rate is decided by the capacity of the system storing the compressed data and/or the bandwidth of the transmitting channel. This coding technique induces a loss in quality when the compressed data is decompressed (or decoded), but ideally this will only be an objective loss in quality (e.g. a mean-squared error between the original data and the decoded data). MPEG-2 tries to remove data from an image or video so the loss in quality is not perceptible to the human eye, but if the target bit-rate is too low then the compression becomes high and some coding artifacts may become visible.

Simple compression techniques can be sufficient for simple textures in images and low movement in video, but as images and video get more sophisticated, the compression algorithm must get more complex to obtain a sufficient compression. MPEG-2 has a lot of different parameters that are interchangeable for different kinds of compression. These parameters are explained later on in this appendix.

Figure B.1: Coding scheme of how encoding and decoding is performed. The encoder consists of motion estimation, prediction, DCT, quantization and variable length coding, with a video buffer controlling the quantizer step size and a local decoder (inverse quantization and IDCT) maintaining the reference picture; the decoder performs the reverse operations. With inspiration from [15, p. 9] and [22, p. 12].

In figure B.1 a coding scheme for MPEG compression is illustrated. A short step by step exposition of this scheme is presented in this paragraph, and a more thorough description of selected blocks is given in sections B.1 to B.6 of this appendix. MPEG-2 codes three different kinds of images, I-, P- and B-images, where I is intra coded and P and B are coded based on motion prediction. The scheme above is able to encode/decode these three kinds of images.

The input to the encoder is image blocks of 8x8 pels (pixels). As will be described later on, MPEG works with macroblocks consisting of four 8x8 pel luma blocks and two 8x8 pel chroma blocks (for 4:2:0 subsampling), see figure B.7. The input is fed into a Motion Estimation block, which for P- and B-images calculates motion vectors by macroblock-wise comparing the input image with a reference I- or P-image (for P comparing image N with N-1 and for B with both N-1 and N+1). The reference image is reconstructed as it would be in the decoder, such that the reference image is the same for both encoder and decoder. The motion vectors m are variable length coded and sent to the decoder. A motion vector describes the spatial location of the macroblock in the reference image. Based on the motion vectors a predicted macroblock, p, is formed; the pels in this predicted macroblock are subtracted from the pels in the input macroblock giving the residual, e, which is DCT transformed, quantized, variable length coded and sent to the decoder. To maintain a reference image equal to the one in the decoder, the predicted macroblock is added to the quantized residual e q in the local decoder.

A Discrete Cosine Transform (DCT) is used to decorrelate the pels in the residual, outputting a block of 8x8 DCT coefficients where low frequency components are located in the upper left corner and high frequency components in the lower right corner. These coefficients are quantized, which removes high frequency components of the DCT and thereby reduces the amount of data for transmission. As the low frequency components of the DCT are the key elements, a zigzag scan is done through the array of DCT coefficients (fig. B.2). This puts the non-zero DCT coefficients into a 1-dimensional bit stream which is then entropy encoded (Variable Length Coding (VLC)), where Huffman coding gives the optimal integer code assignment [11, p. 84]. The VLC encoded bit stream is finally sent to a Video Buffer (VB), which prevents underflow or overflow of the buffer by changing the step size of the quantizer. By changing the step size it can remove DCT coefficients and thereby maintain the target bit rate.

The channel which the data is sent through is actually the one thing that sets the requirement for the compression rate, as the channel is limited to a specific bit rate, the target bit rate for the whole system. This bit rate differs as the channel could be DVD, an Internet stream or analog/digital TV broadcasting.

In the decoder the reverse process of the encoder is performed. The stored bit stream in the decoder video buffer is variable length decoded, then an inverse quantization restores the DCT coefficients and by inverse DCT the quantized residual is restored. The motion vectors are also variable length decoded, and from these vectors and the reference image each predicted macroblock is decoded and added to the quantized residual to establish the output image.

Figure B.2: Zigzag scan of DCT coefficients. Only the non-zero DCT coefficients are put into a 1-dimensional bit stream after entropy coding (VLC). The zigzag scan is used to locate the low frequency DCT coefficients before the high frequency DCT coefficients. [22, p. 13]

B.1 I-, P- and B-images

The MPEG coding techniques for video are based on correlation between the pels in the same image (i.e. it is possible to predict the magnitude of a pel from nearby pels), which is called Intra-frame coding, or on redundancies between consecutive images (i.e. it is possible to predict the magnitude of a pel from a nearby image), which is called Inter-frame coding.

Inter-frame coding (P- and B-images) uses temporal prediction to compress the data of consecutive images. This means that the Inter-frame coding technique can be used when consecutive images have similar or identical content. Intra-frame coding (I-images) on the other hand uses spatial prediction to compress the data of a single image.

This technique should be used when the similarity between consecutive images is low or nonexistent, so instead of predicting the magnitude of a pel based on previous images, it predicts the magnitude of a pel by looking at the magnitude of its surrounding pels.

To efficiently explore whether there is spatial correlation (correlation between nearby pels in the same image) the MPEG framework uses the discrete cosine transformation (DCT) on blocks of 8x8 pels. In the case of temporal correlation (correlation between pels in nearby images) MPEG uses differential pulse code modulation (DPCM) instead, see section B.4. MPEG video coding uses a hybrid of DPCM and DCT coding, where it uses temporal prediction first and subsequently spatial prediction on the remaining data to achieve the highest compression. An illustration of the correlation of nearby pels in an image could look like what is shown in figure B.3. The spatial correlation illustrated is typical for pels in an image, where the correlation decreases as the distance from one pel to another increases.

Figure B.3: Spatial correlation between pels from a typical image, where the x and y axes describe the horizontal and vertical distance between pels. It shows high correlation between pels with a small distance and a decrease in correlation as the distance increases. [22, p. 4]

Besides I- and P-images, the MPEG framework also uses bi-directional prediction to generate B-images. These images use both past and future images to predict the content of the current image. B-images use I- and P-images as reference, but not themselves. B-images provide the best compression of the three. The need for reference images used by B-images makes it impossible to only use B-images. Furthermore only I-images give access points for functionalities such as random access, fast forward and fast reverse. These functionalities are of course of great importance when accessing video from a storage medium. The hurdle is to find the sequence of images that gives good compression, but still gives a reasonable amount of access points.

E.g. a video sequence with only I-images provides the highest degree of accessibility, but the lowest degree of compression. As for a sequence of I- and P-images (IPPPPPIPPPPPI...), both compression and accessibility are moderate, but to achieve high compression and moderate accessibility all three image types have to be incorporated (IBBPBBPBBIBBP...). This composition however also increases the coding delay and may turn out to be useless in, for instance, video telephony applications.

A sequence of images such as IBBPBBPBBPBBPBBP is called a group of pictures (GOP). A GOP always starts with an I-image, since it has no reference image, and is followed by either P- or B-images which are encoded with reference to I- and/or P-images. As B-images use future images as reference, these reference images have to be encoded before the B-images using them. This gives the following encoding sequence: IPBBPBBPBBPBBPBB. In figure B.4 the difference between the display sequence of a default GOP and its encoding sequence is illustrated.

Figure B.4: (a) The display sequence of a default GOP where the arrows indicate how data is interchanged between images. (b) Encoding sequence where P-images are moved in front of the B-images that they provide data for; it is seen that for the encoding sequence all arrows point forward.

B.2 Slices

Each image in a GOP consists of a number of slices, varying from one to the number of macroblocks in an image. These slices provide an overhead that tells the decoder where the incoming macroblocks should be placed in the image. This is done to prevent a noisy channel from corrupting a whole image, providing the ability to restart reconstruction of an image at the start of each slice. The slice overhead consists of a minimum of 32 bit (start-code), which adds a large amount of data that has to be sent. An image could for instance consist of 10 slices, and with a frame rate of 25 Hz the overhead per second would be: 10 * 32 * 25 = 8000 bit/s. That means a balance between error robustness and bandwidth usage should be considered.
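As a small illustration of the reordering shown in figure B.4, the C sketch below converts a GOP given in display order into encoding order by moving each I- or P-image in front of the B-images that reference it. It is only an illustration and assumes a closed GOP, i.e. that the trailing B-images do not reference the first image of the next GOP.

#include <stdio.h>
#include <string.h>

/* Reorder a GOP from display order to encoding order: every anchor (I or P)
 * image must be encoded before the B images that reference it, so it is moved
 * in front of the B images it closes (cf. figure B.4). */
static void display_to_encode_order(const char *display, char *encode)
{
    size_t n = strlen(display);
    size_t out = 0, pending = 0;   /* pending = first B image not yet emitted */

    for (size_t i = 0; i < n; i++) {
        if (display[i] == 'B')
            continue;                      /* hold B images back */
        encode[out++] = display[i];        /* emit the I/P image first */
        for (size_t j = pending; j < i; j++)
            if (display[j] == 'B')
                encode[out++] = 'B';       /* then the B images it anchors */
        pending = i + 1;
    }
    for (size_t j = pending; j < n; j++)   /* any trailing B images */
        encode[out++] = display[j];
    encode[out] = '\0';
}

int main(void)
{
    char enc[64];
    display_to_encode_order("IBBPBBPBBPBBPBBP", enc);
    printf("%s\n", enc);   /* prints IPBBPBBPBBPBBPBB */
    return 0;
}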

B.3 Subsampling of Chromaticity Levels

The MPEG framework employs extensive use of subsampling and interpolation to make further compression of image data. A simple use of subsampling is to reduce the horizontal and/or vertical dimensions of the input video prior to encoding and by that reduce the number of pels for transmission. Subsampling is also used in the temporal direction, where intermediate images are discarded at the encoder and then reconstructed at the decoder by use of interpolation.

Redundancy in the aspect of the human eye is also removed by use of the subsampling technique. The human eye is more sensitive to brightness (luminance level) than to chromaticity (chrominance level or color level), and in the MPEG coding schemes these levels are divided into a color space described by three components Y C B C R. Here Y is luma (a nonlinear transform of luminance), and C B and C R are the color difference or chroma components, where C B is blue minus luma (B - Y) and C R is red minus luma (R - Y). The Y C B C R components are deduced from the RGB components as shown in equation (B.1), where RGB is the Red, Green, Blue color space. This transformation of the RGB color space is done because Y C B C R is less correlated than RGB and therefore Y C B C R can be coded more efficiently [9]. These chroma components can be subsampled (chroma subsampling) relative to the luma component without any subjective loss in quality from the perception of the human eye.

Y = 16 + (65.481 R' + 128.553 G' + 24.966 B') [-]
C B = 128 + (-37.797 R' - 74.203 G' + 112.0 B') [-]        (B.1)
C R = 128 + (112.0 R' - 93.786 G' - 18.214 B') [-]

where:
Y is the luma level [-]
C B is the blue chroma level [-]
C R is the red chroma level [-]
R', G' and B' are the gamma corrected red, green and blue color levels [-]

From this equation it can be derived that an all black image (meaning that R', G' and B' are all zero) would result in Y = 16 and an all white image (meaning that R', G' and B' are all one) would result in Y = 235. The equation is specified for eight bit coding, meaning that the extremes 1-15 and 236-254 (0 and 255 are reserved for synchronization in Rec. 601) are regarded as foot room and head room for signal processing. As for the chroma levels it is derived that they have a foot room from 1-15, a head room above 240 and an offset value of 128, meaning that if the R', G' and B' values are all zero or all one (black or white) C B and C R equal 128.

In figure B.5 it is possible to see an image separated into Y, C B and C R components. It is easy to see how the detail level is much higher for the Y component compared to the chroma components, which is the reason why the chroma levels are subsampled with regard to the luma level.

Figure B.5: (a) Original image and Y component below. (b) C B and C R components.

Chroma subsampling is done with a Y:C B:C R ratio; examples of subsampling ratios for MPEG-2 could be 4:2:2 or 4:2:0. 4:2:2 indicates that the chroma levels run at half the horizontal sample rate of the luma level. 4:2:0 runs at the same horizontal sampling rate as 4:2:2, but in addition the vertical sampling rate also runs at half the sampling rate of the luma level. Figure B.6 illustrates the effect of chroma subsampling: the top shows the luma levels, which are the same for all four examples as they have the same sampling rate, while a distinct change in chroma levels is seen for each of the four different sampling ratios. The lowest square is the decoded image where luma and chroma are put together.

Figure B.6: The effects of three different chroma subsamplings compared to an image with no subsampling (4:4:4). [10]
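To illustrate equation (B.1) together with chroma subsampling, the C sketch below converts one gamma corrected RGB pel (normalized to [0,1]) to 8 bit Y C B C R and reduces the chroma of a 2x2 pel block to a single averaged value as in 4:2:0. It is only a software sketch under these assumptions; the project's actual conversion and subsampling are implemented in Verilog, and a plain 2x2 average is just one simple way of producing the 4:2:0 chroma samples.

#include <stdint.h>

/* Clamp to the coded 8 bit range (foot room / head room) and round. */
static uint8_t clamp(double v, double lo, double hi)
{
    if (v < lo) v = lo;
    if (v > hi) v = hi;
    return (uint8_t)(v + 0.5);
}

/* Equation (B.1): gamma corrected R', G', B' in [0,1] to 8 bit Y, CB, CR. */
static void rgb_to_ycbcr(double r, double g, double b,
                         uint8_t *y, uint8_t *cb, uint8_t *cr)
{
    *y  = clamp( 16.0 +  65.481 * r + 128.553 * g +  24.966 * b,  16.0, 235.0);
    *cb = clamp(128.0 -  37.797 * r -  74.203 * g + 112.000 * b,  16.0, 240.0);
    *cr = clamp(128.0 + 112.000 * r -  93.786 * g -  18.214 * b,  16.0, 240.0);
}

/* 4:2:0 subsampling of one 2x2 pel block: the four Y values are kept,
 * while CB and CR are each reduced to a single (averaged) value. */
static void subsample_420(const uint8_t cb[4], const uint8_t cr[4],
                          uint8_t *cb_out, uint8_t *cr_out)
{
    *cb_out = (uint8_t)((cb[0] + cb[1] + cb[2] + cb[3] + 2) / 4);
    *cr_out = (uint8_t)((cr[0] + cr[1] + cr[2] + cr[3] + 2) / 4);
}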

B.4 Motion Prediction and Transform Domain Coding

To remove temporal redundancies in consecutive images, motion compensation is introduced, where estimated motion vectors are used to describe the movement of pels in the current image. As mentioned under I-images, the correlation between nearby pels is high, and this property is used to divide one image into disjoint blocks consisting of 16x16 pels. Now instead of calculating one motion vector for each pel, a motion vector is calculated for each block. This will lead to some motion vectors being zero, for the blocks of the image that do not move, and some being offset by a number of pels. The motion vectors close to zero are neglected and only the offset motion vectors are used for further prediction.

The motion vector for each block is calculated by dividing the current image into so called macroblocks and then searching a reference image for matches to the macroblocks in the current image. When matches are found, the motion compensated prediction error is calculated by subtracting from each pel in a block in the current image its matching pel in the previous image. This is done because the matching of macroblocks between current and reference image, and thereby of the pels, is not perfect at all times. The motion compensated prediction error is then DCT transformed and sent to the decoder together with its motion vector.

B.5 Macroblocks

As explained for motion prediction, the MPEG framework divides the image into non overlapping blocks of 16x16 pels. These blocks are further divided into four luma blocks each of size 8x8, which are co-sited with the chroma blocks. The number of chroma blocks is decided by which chroma subsampling is run. For instance if 4:4:4 (no subsampling) is chosen, then both of the chroma levels will consist of four blocks, as was the case for luma. If instead the subsampling ratio is 4:1:1, then only one block will represent each chroma level, as illustrated in figure B.7.

Figure B.7: From R'G'B' values different macroblocks can be extracted supporting different formats. Subsampling is done by keeping the resolution of luma (Y) and removing chroma C B and C R information. [8, p. 25]
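To make the block matching concrete, the C sketch below computes the sum of absolute differences (SAD) between a 16x16 macroblock in the current image and a candidate macroblock in the reference image. This is the kind of distance the profiled dist1/nothxnothy function evaluates for every candidate position in the search window (judging by its name, nothxnothy presumably covers the full-pel case without half-pel offsets). The function and variable names are illustrative only.

#include <stdint.h>

#define MB 16   /* macroblock dimension in pels */

/* Sum of absolute differences between a macroblock in the current image and a
 * candidate macroblock in the reference image; 'stride' is the image width. */
static unsigned int mb_sad(const uint8_t *cur, const uint8_t *ref, int stride)
{
    unsigned int sad = 0;
    for (int y = 0; y < MB; y++) {
        for (int x = 0; x < MB; x++) {
            int d = cur[y * stride + x] - ref[y * stride + x];
            sad += (d < 0) ? -d : d;
        }
    }
    return sad;
}

The 256 absolute differences are independent of each other, which is what makes this computation well suited for a parallel hardware implementation, as indicated by the characterization metrics earlier in the report.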

B.6 Discrete Cosine Transform (DCT)

DCT is transform domain coding, where blocks of pels are transformed into decorrelated coefficients. These coefficients are then transmitted instead of the values of the pels. DCT on small blocks of 8x8 pels is used, as DCT has a high decorrelation performance and fast algorithms are available for real time implementations. The use of DCT coefficients instead of pels makes it possible to use perceptual quantization to remove subjective redundancies. This is due to the statistical properties of the DCT that are illustrated in figure B.8(a), which shows that the variance - which is what needs to be sent - is located around the low DCT coefficients. As the DCT coefficients are closely related to the Discrete Fourier Transform (DFT), these low DCT coefficients can be looked upon as low frequencies, and since the human eye is more sensitive to low frequencies, it is important that the low coefficients are sent while the high coefficients can be neglected. By neglecting the high coefficients it is possible to improve the quantization of the low coefficients and thereby reduce the quantization errors that may have been caused by poor quantization.

Figure B.8: (a) Variance of the DCT coefficients calculated from an image with the spatial correlation of figure B.3. It is only the quantized variance between image blocks that is sent through the channel. The variance is quantized with one of the matrices of figure B.9, removing high frequency DCT coefficients and greatly reducing the amount of data. [22, p. 8] (b) The basis images of the DCT transform. The DCT coefficients tell the decoder how much of each basis image is needed to reconstruct one 8x8 image block. By looking at the image to the left, it is possible to derive that for this particular example only a few images around DC (upper left corner) are needed to reconstruct the image block. [11, p. 57]

The DCT transform is done in MPEG-2 on chroma and luma blocks of 8x8 pels. The following shows the NxN DCT giving a matrix of DCT coefficients B from the matrix A, which consists of the N x N pel values of the block:
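The formula below is a sketch of the commonly used definition of the N x N forward DCT (with N = 8 in MPEG-2); the exact notation and normalization in the report's original equation may differ:

B(u,v) = \frac{2}{N} C(u) C(v) \sum_{x=0}^{N-1} \sum_{y=0}^{N-1} A(x,y) \cos\left(\frac{(2x+1)u\pi}{2N}\right) \cos\left(\frac{(2y+1)v\pi}{2N}\right), \quad u, v = 0, \dots, N-1

where C(0) = 1/\sqrt{2} and C(k) = 1 for k > 0. The inverse DCT (IDCT), used in the decoder and in the encoder's local decoder, applies the same basis functions to reconstruct A from B.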


Index. 1. Motivation 2. Background 3. JPEG Compression The Discrete Cosine Transformation Quantization Coding 4. MPEG 5. Index 1. Motivation 2. Background 3. JPEG Compression The Discrete Cosine Transformation Quantization Coding 4. MPEG 5. Literature Lossy Compression Motivation To meet a given target bit-rate for storage

More information

Advanced Video Coding: The new H.264 video compression standard

Advanced Video Coding: The new H.264 video compression standard Advanced Video Coding: The new H.264 video compression standard August 2003 1. Introduction Video compression ( video coding ), the process of compressing moving images to save storage space and transmission

More information

In the name of Allah. the compassionate, the merciful

In the name of Allah. the compassionate, the merciful In the name of Allah the compassionate, the merciful Digital Video Systems S. Kasaei Room: CE 315 Department of Computer Engineering Sharif University of Technology E-Mail: skasaei@sharif.edu Webpage:

More information

Multimedia Systems Image III (Image Compression, JPEG) Mahdi Amiri April 2011 Sharif University of Technology

Multimedia Systems Image III (Image Compression, JPEG) Mahdi Amiri April 2011 Sharif University of Technology Course Presentation Multimedia Systems Image III (Image Compression, JPEG) Mahdi Amiri April 2011 Sharif University of Technology Image Compression Basics Large amount of data in digital images File size

More information

CMPT 365 Multimedia Systems. Media Compression - Video

CMPT 365 Multimedia Systems. Media Compression - Video CMPT 365 Multimedia Systems Media Compression - Video Spring 2017 Edited from slides by Dr. Jiangchuan Liu CMPT365 Multimedia Systems 1 Introduction What s video? a time-ordered sequence of frames, i.e.,

More information

Professor Laurence S. Dooley. School of Computing and Communications Milton Keynes, UK

Professor Laurence S. Dooley. School of Computing and Communications Milton Keynes, UK Professor Laurence S. Dooley School of Computing and Communications Milton Keynes, UK How many bits required? 2.4Mbytes 84Kbytes 9.8Kbytes 50Kbytes Data Information Data and information are NOT the same!

More information

Interactive Progressive Encoding System For Transmission of Complex Images

Interactive Progressive Encoding System For Transmission of Complex Images Interactive Progressive Encoding System For Transmission of Complex Images Borko Furht 1, Yingli Wang 1, and Joe Celli 2 1 NSF Multimedia Laboratory Florida Atlantic University, Boca Raton, Florida 33431

More information

VIDEO AND IMAGE PROCESSING USING DSP AND PFGA. Chapter 3: Video Processing

VIDEO AND IMAGE PROCESSING USING DSP AND PFGA. Chapter 3: Video Processing ĐẠI HỌC QUỐC GIA TP.HỒ CHÍ MINH TRƯỜNG ĐẠI HỌC BÁCH KHOA KHOA ĐIỆN-ĐIỆN TỬ BỘ MÔN KỸ THUẬT ĐIỆN TỬ VIDEO AND IMAGE PROCESSING USING DSP AND PFGA Chapter 3: Video Processing 3.1 Video Formats 3.2 Video

More information

Fundamentals of Video Compression. Video Compression

Fundamentals of Video Compression. Video Compression Fundamentals of Video Compression Introduction to Digital Video Basic Compression Techniques Still Image Compression Techniques - JPEG Video Compression Introduction to Digital Video Video is a stream

More information

Redundant Data Elimination for Image Compression and Internet Transmission using MATLAB

Redundant Data Elimination for Image Compression and Internet Transmission using MATLAB Redundant Data Elimination for Image Compression and Internet Transmission using MATLAB R. Challoo, I.P. Thota, and L. Challoo Texas A&M University-Kingsville Kingsville, Texas 78363-8202, U.S.A. ABSTRACT

More information

PREFACE...XIII ACKNOWLEDGEMENTS...XV

PREFACE...XIII ACKNOWLEDGEMENTS...XV Contents PREFACE...XIII ACKNOWLEDGEMENTS...XV 1. MULTIMEDIA SYSTEMS...1 1.1 OVERVIEW OF MPEG-2 SYSTEMS...1 SYSTEMS AND SYNCHRONIZATION...1 TRANSPORT SYNCHRONIZATION...2 INTER-MEDIA SYNCHRONIZATION WITH

More information

Comparative Study of Partial Closed-loop Versus Open-loop Motion Estimation for Coding of HDTV

Comparative Study of Partial Closed-loop Versus Open-loop Motion Estimation for Coding of HDTV Comparative Study of Partial Closed-loop Versus Open-loop Motion Estimation for Coding of HDTV Jeffrey S. McVeigh 1 and Siu-Wai Wu 2 1 Carnegie Mellon University Department of Electrical and Computer Engineering

More information

Digital video coding systems MPEG-1/2 Video

Digital video coding systems MPEG-1/2 Video Digital video coding systems MPEG-1/2 Video Introduction What is MPEG? Moving Picture Experts Group Standard body for delivery of video and audio. Part of ISO/IEC/JTC1/SC29/WG11 150 companies & research

More information

Computer and Machine Vision

Computer and Machine Vision Computer and Machine Vision Deeper Dive into MPEG Digital Video Encoding January 22, 2014 Sam Siewert Reminders CV and MV Use UNCOMPRESSED FRAMES Remote Cameras (E.g. Security) May Need to Transport Frames

More information

Compression Part 2 Lossy Image Compression (JPEG) Norm Zeck

Compression Part 2 Lossy Image Compression (JPEG) Norm Zeck Compression Part 2 Lossy Image Compression (JPEG) General Compression Design Elements 2 Application Application Model Encoder Model Decoder Compression Decompression Models observe that the sensors (image

More information

Multimedia Standards

Multimedia Standards Multimedia Standards SS 2017 Lecture 5 Prof. Dr.-Ing. Karlheinz Brandenburg Karlheinz.Brandenburg@tu-ilmenau.de Contact: Dipl.-Inf. Thomas Köllmer thomas.koellmer@tu-ilmenau.de 1 Organisational issues

More information

Multimedia Systems Video II (Video Coding) Mahdi Amiri April 2012 Sharif University of Technology

Multimedia Systems Video II (Video Coding) Mahdi Amiri April 2012 Sharif University of Technology Course Presentation Multimedia Systems Video II (Video Coding) Mahdi Amiri April 2012 Sharif University of Technology Video Coding Correlation in Video Sequence Spatial correlation Similar pixels seem

More information

Upcoming Video Standards. Madhukar Budagavi, Ph.D. DSPS R&D Center, Dallas Texas Instruments Inc.

Upcoming Video Standards. Madhukar Budagavi, Ph.D. DSPS R&D Center, Dallas Texas Instruments Inc. Upcoming Video Standards Madhukar Budagavi, Ph.D. DSPS R&D Center, Dallas Texas Instruments Inc. Outline Brief history of Video Coding standards Scalable Video Coding (SVC) standard Multiview Video Coding

More information

Image Compression Algorithm and JPEG Standard

Image Compression Algorithm and JPEG Standard International Journal of Scientific and Research Publications, Volume 7, Issue 12, December 2017 150 Image Compression Algorithm and JPEG Standard Suman Kunwar sumn2u@gmail.com Summary. The interest in

More information

Module 7 VIDEO CODING AND MOTION ESTIMATION

Module 7 VIDEO CODING AND MOTION ESTIMATION Module 7 VIDEO CODING AND MOTION ESTIMATION Lesson 20 Basic Building Blocks & Temporal Redundancy Instructional Objectives At the end of this lesson, the students should be able to: 1. Name at least five

More information

MPEG-4: Simple Profile (SP)

MPEG-4: Simple Profile (SP) MPEG-4: Simple Profile (SP) I-VOP (Intra-coded rectangular VOP, progressive video format) P-VOP (Inter-coded rectangular VOP, progressive video format) Short Header mode (compatibility with H.263 codec)

More information

Video Compression Standards (II) A/Prof. Jian Zhang

Video Compression Standards (II) A/Prof. Jian Zhang Video Compression Standards (II) A/Prof. Jian Zhang NICTA & CSE UNSW COMP9519 Multimedia Systems S2 2009 jzhang@cse.unsw.edu.au Tutorial 2 : Image/video Coding Techniques Basic Transform coding Tutorial

More information

VIDEO COMPRESSION STANDARDS

VIDEO COMPRESSION STANDARDS VIDEO COMPRESSION STANDARDS Family of standards: the evolution of the coding model state of the art (and implementation technology support): H.261: videoconference x64 (1988) MPEG-1: CD storage (up to

More information

CISC 7610 Lecture 3 Multimedia data and data formats

CISC 7610 Lecture 3 Multimedia data and data formats CISC 7610 Lecture 3 Multimedia data and data formats Topics: Perceptual limits of multimedia data JPEG encoding of images MPEG encoding of audio MPEG and H.264 encoding of video Multimedia data: Perceptual

More information

Lecture 6: Compression II. This Week s Schedule

Lecture 6: Compression II. This Week s Schedule Lecture 6: Compression II Reading: book chapter 8, Section 1, 2, 3, 4 Monday This Week s Schedule The concept behind compression Rate distortion theory Image compression via DCT Today Speech compression

More information

Perceptual Coding. Lossless vs. lossy compression Perceptual models Selecting info to eliminate Quantization and entropy encoding

Perceptual Coding. Lossless vs. lossy compression Perceptual models Selecting info to eliminate Quantization and entropy encoding Perceptual Coding Lossless vs. lossy compression Perceptual models Selecting info to eliminate Quantization and entropy encoding Part II wrap up 6.082 Fall 2006 Perceptual Coding, Slide 1 Lossless vs.

More information

Image Compression for Mobile Devices using Prediction and Direct Coding Approach

Image Compression for Mobile Devices using Prediction and Direct Coding Approach Image Compression for Mobile Devices using Prediction and Direct Coding Approach Joshua Rajah Devadason M.E. scholar, CIT Coimbatore, India Mr. T. Ramraj Assistant Professor, CIT Coimbatore, India Abstract

More information

CS 260: Seminar in Computer Science: Multimedia Networking

CS 260: Seminar in Computer Science: Multimedia Networking CS 260: Seminar in Computer Science: Multimedia Networking Jiasi Chen Lectures: MWF 4:10-5pm in CHASS http://www.cs.ucr.edu/~jiasi/teaching/cs260_spring17/ Multimedia is User perception Content creation

More information

High Efficiency Video Coding. Li Li 2016/10/18

High Efficiency Video Coding. Li Li 2016/10/18 High Efficiency Video Coding Li Li 2016/10/18 Email: lili90th@gmail.com Outline Video coding basics High Efficiency Video Coding Conclusion Digital Video A video is nothing but a number of frames Attributes

More information

Wireless Communication

Wireless Communication Wireless Communication Systems @CS.NCTU Lecture 6: Image Instructor: Kate Ching-Ju Lin ( 林靖茹 ) Chap. 9 of Fundamentals of Multimedia Some reference from http://media.ee.ntu.edu.tw/courses/dvt/15f/ 1 Outline

More information

Digital Communication Prof. Bikash Kumar Dey Department of Electrical Engineering Indian Institute of Technology, Bombay

Digital Communication Prof. Bikash Kumar Dey Department of Electrical Engineering Indian Institute of Technology, Bombay Digital Communication Prof. Bikash Kumar Dey Department of Electrical Engineering Indian Institute of Technology, Bombay Lecture - 26 Source Coding (Part 1) Hello everyone, we will start a new module today

More information

Video Codecs. National Chiao Tung University Chun-Jen Tsai 1/5/2015

Video Codecs. National Chiao Tung University Chun-Jen Tsai 1/5/2015 Video Codecs National Chiao Tung University Chun-Jen Tsai 1/5/2015 Video Systems A complete end-to-end video system: A/D color conversion encoder decoder color conversion D/A bitstream YC B C R format

More information

ISSN (ONLINE): , VOLUME-3, ISSUE-1,

ISSN (ONLINE): , VOLUME-3, ISSUE-1, PERFORMANCE ANALYSIS OF LOSSLESS COMPRESSION TECHNIQUES TO INVESTIGATE THE OPTIMUM IMAGE COMPRESSION TECHNIQUE Dr. S. Swapna Rani Associate Professor, ECE Department M.V.S.R Engineering College, Nadergul,

More information

Image and Video Coding I: Fundamentals

Image and Video Coding I: Fundamentals Image and Video Coding I: Fundamentals Heiko Schwarz Freie Universität Berlin Fachbereich Mathematik und Informatik H. Schwarz (FU Berlin) Image and Video Coding Organization Vorlesung: Montag 14:15-15:45

More information

Introduction to Video Compression

Introduction to Video Compression Insight, Analysis, and Advice on Signal Processing Technology Introduction to Video Compression Jeff Bier Berkeley Design Technology, Inc. info@bdti.com http://www.bdti.com Outline Motivation and scope

More information

The VC-1 and H.264 Video Compression Standards for Broadband Video Services

The VC-1 and H.264 Video Compression Standards for Broadband Video Services The VC-1 and H.264 Video Compression Standards for Broadband Video Services by Jae-Beom Lee Sarnoff Corporation USA Hari Kalva Florida Atlantic University USA 4y Sprin ger Contents PREFACE ACKNOWLEDGEMENTS

More information

A real-time SNR scalable transcoder for MPEG-2 video streams

A real-time SNR scalable transcoder for MPEG-2 video streams EINDHOVEN UNIVERSITY OF TECHNOLOGY Department of Mathematics and Computer Science A real-time SNR scalable transcoder for MPEG-2 video streams by Mohammad Al-khrayshah Supervisors: Prof. J.J. Lukkien Eindhoven

More information

Digital Image Processing

Digital Image Processing Lecture 9+10 Image Compression Lecturer: Ha Dai Duong Faculty of Information Technology 1. Introduction Image compression To Solve the problem of reduncing the amount of data required to represent a digital

More information

Digital Image Representation Image Compression

Digital Image Representation Image Compression Digital Image Representation Image Compression 1 Image Representation Standards Need for compression Compression types Lossless compression Lossy compression Image Compression Basics Redundancy/redundancy

More information

The Basics of Video Compression

The Basics of Video Compression The Basics of Video Compression Marko Slyz February 18, 2003 (Sourcecoders talk) 1/18 Outline 1. Non-technical Survey of Video Compressors 2. Basic Description of MPEG 1 3. Discussion of Other Compressors

More information

The Scope of Picture and Video Coding Standardization

The Scope of Picture and Video Coding Standardization H.120 H.261 Video Coding Standards MPEG-1 and MPEG-2/H.262 H.263 MPEG-4 H.264 / MPEG-4 AVC Thomas Wiegand: Digital Image Communication Video Coding Standards 1 The Scope of Picture and Video Coding Standardization

More information

5LSE0 - Mod 10 Part 1. MPEG Motion Compensation and Video Coding. MPEG Video / Temporal Prediction (1)

5LSE0 - Mod 10 Part 1. MPEG Motion Compensation and Video Coding. MPEG Video / Temporal Prediction (1) 1 Multimedia Video Coding & Architectures (5LSE), Module 1 MPEG-1/ Standards: Motioncompensated video coding 5LSE - Mod 1 Part 1 MPEG Motion Compensation and Video Coding Peter H.N. de With (p.h.n.de.with@tue.nl

More information

MULTIMEDIA SYSTEMS

MULTIMEDIA SYSTEMS 1 Department of Computer Engineering, Faculty of Engineering King Mongkut s Institute of Technology Ladkrabang 01076531 MULTIMEDIA SYSTEMS Pk Pakorn Watanachaturaporn, Wt ht Ph.D. PhD pakorn@live.kmitl.ac.th,

More information

Image, video and audio coding concepts. Roadmap. Rationale. Stefan Alfredsson. (based on material by Johan Garcia)

Image, video and audio coding concepts. Roadmap. Rationale. Stefan Alfredsson. (based on material by Johan Garcia) Image, video and audio coding concepts Stefan Alfredsson (based on material by Johan Garcia) Roadmap XML Data structuring Loss-less compression (huffman, LZ77,...) Lossy compression Rationale Compression

More information

ROI Based Image Compression in Baseline JPEG

ROI Based Image Compression in Baseline JPEG 168-173 RESEARCH ARTICLE OPEN ACCESS ROI Based Image Compression in Baseline JPEG M M M Kumar Varma #1, Madhuri. Bagadi #2 Associate professor 1, M.Tech Student 2 Sri Sivani College of Engineering, Department

More information

MpegRepair Software Encoding and Repair Utility

MpegRepair Software Encoding and Repair Utility PixelTools MpegRepair Software Encoding and Repair Utility MpegRepair integrates fully featured encoding, analysis, decoding, demuxing, transcoding and stream manipulations into one powerful application.

More information

IMAGE COMPRESSION USING FOURIER TRANSFORMS

IMAGE COMPRESSION USING FOURIER TRANSFORMS IMAGE COMPRESSION USING FOURIER TRANSFORMS Kevin Cherry May 2, 2008 Math 4325 Compression is a technique for storing files in less space than would normally be required. This in general, has two major

More information

EFFICIENT DEISGN OF LOW AREA BASED H.264 COMPRESSOR AND DECOMPRESSOR WITH H.264 INTEGER TRANSFORM

EFFICIENT DEISGN OF LOW AREA BASED H.264 COMPRESSOR AND DECOMPRESSOR WITH H.264 INTEGER TRANSFORM EFFICIENT DEISGN OF LOW AREA BASED H.264 COMPRESSOR AND DECOMPRESSOR WITH H.264 INTEGER TRANSFORM 1 KALIKI SRI HARSHA REDDY, 2 R.SARAVANAN 1 M.Tech VLSI Design, SASTRA University, Thanjavur, Tamilnadu,

More information

Image and video processing

Image and video processing Image and video processing Digital video Dr. Pengwei Hao Agenda Digital video Video compression Video formats and codecs MPEG Other codecs Web video - 2 - Digital Video Until the arrival of the Pentium

More information

About MPEG Compression. More About Long-GOP Video

About MPEG Compression. More About Long-GOP Video About MPEG Compression HD video requires significantly more data than SD video. A single HD video frame can require up to six times more data than an SD frame. To record such large images with such a low

More information

Updates in MPEG-4 AVC (H.264) Standard to Improve Picture Quality and Usability

Updates in MPEG-4 AVC (H.264) Standard to Improve Picture Quality and Usability Updates in MPEG-4 AVC (H.264) Standard to Improve Picture Quality and Usability Panasonic Hollywood Laboratory Jiuhuai Lu January 2005 Overview of MPEG-4 AVC End of 2001: ISO/IEC and ITU-T started joint

More information

Image and Video Coding I: Fundamentals

Image and Video Coding I: Fundamentals Image and Video Coding I: Fundamentals Thomas Wiegand Technische Universität Berlin T. Wiegand (TU Berlin) Image and Video Coding Organization Vorlesung: Donnerstag 10:15-11:45 Raum EN-368 Material: http://www.ic.tu-berlin.de/menue/studium_und_lehre/

More information

Encoding Video for the Highest Quality and Performance

Encoding Video for the Highest Quality and Performance Encoding Video for the Highest Quality and Performance Fabio Sonnati 2 December 2008 Milan, MaxEurope 2008 Introduction Encoding Video for the Highest Quality and Performance Fabio Sonnati media applications

More information

INF5063: Programming heterogeneous multi-core processors. September 17, 2010

INF5063: Programming heterogeneous multi-core processors. September 17, 2010 INF5063: Programming heterogeneous multi-core processors September 17, 2010 High data volumes: Need for compression PAL video sequence 25 images per second 3 bytes per pixel RGB (red-green-blue values)

More information

Multi-level Design Methodology using SystemC and VHDL for JPEG Encoder

Multi-level Design Methodology using SystemC and VHDL for JPEG Encoder THE INSTITUTE OF ELECTRONICS, IEICE ICDV 2011 INFORMATION AND COMMUNICATION ENGINEERS Multi-level Design Methodology using SystemC and VHDL for JPEG Encoder Duy-Hieu Bui, Xuan-Tu Tran SIS Laboratory, University

More information

06/12/2017. Image compression. Image compression. Image compression. Image compression. Coding redundancy: image 1 has four gray levels

06/12/2017. Image compression. Image compression. Image compression. Image compression. Coding redundancy: image 1 has four gray levels Theoretical size of a file representing a 5k x 4k colour photograph: 5000 x 4000 x 3 = 60 MB 1 min of UHD tv movie: 3840 x 2160 x 3 x 24 x 60 = 36 GB 1. Exploit coding redundancy 2. Exploit spatial and

More information

Compression of Stereo Images using a Huffman-Zip Scheme

Compression of Stereo Images using a Huffman-Zip Scheme Compression of Stereo Images using a Huffman-Zip Scheme John Hamann, Vickey Yeh Department of Electrical Engineering, Stanford University Stanford, CA 94304 jhamann@stanford.edu, vickey@stanford.edu Abstract

More information

MPEG-2. ISO/IEC (or ITU-T H.262)

MPEG-2. ISO/IEC (or ITU-T H.262) MPEG-2 1 MPEG-2 ISO/IEC 13818-2 (or ITU-T H.262) High quality encoding of interlaced video at 4-15 Mbps for digital video broadcast TV and digital storage media Applications Broadcast TV, Satellite TV,

More information

HYBRID TRANSFORMATION TECHNIQUE FOR IMAGE COMPRESSION

HYBRID TRANSFORMATION TECHNIQUE FOR IMAGE COMPRESSION 31 st July 01. Vol. 41 No. 005-01 JATIT & LLS. All rights reserved. ISSN: 199-8645 www.jatit.org E-ISSN: 1817-3195 HYBRID TRANSFORMATION TECHNIQUE FOR IMAGE COMPRESSION 1 SRIRAM.B, THIYAGARAJAN.S 1, Student,

More information

Module 6 STILL IMAGE COMPRESSION STANDARDS

Module 6 STILL IMAGE COMPRESSION STANDARDS Module 6 STILL IMAGE COMPRESSION STANDARDS Lesson 19 JPEG-2000 Error Resiliency Instructional Objectives At the end of this lesson, the students should be able to: 1. Name two different types of lossy

More information

Data Representation and Networking

Data Representation and Networking Data Representation and Networking Instructor: Dmitri A. Gusev Spring 2007 CSC 120.02: Introduction to Computer Science Lecture 3, January 30, 2007 Data Representation Topics Covered in Lecture 2 (recap+)

More information

Week 14. Video Compression. Ref: Fundamentals of Multimedia

Week 14. Video Compression. Ref: Fundamentals of Multimedia Week 14 Video Compression Ref: Fundamentals of Multimedia Last lecture review Prediction from the previous frame is called forward prediction Prediction from the next frame is called forward prediction

More information

MRT based Fixed Block size Transform Coding

MRT based Fixed Block size Transform Coding 3 MRT based Fixed Block size Transform Coding Contents 3.1 Transform Coding..64 3.1.1 Transform Selection...65 3.1.2 Sub-image size selection... 66 3.1.3 Bit Allocation.....67 3.2 Transform coding using

More information

Compression and File Formats

Compression and File Formats Compression and File Formats 1 Compressing Moving Images Methods: Motion JPEG, Cinepak, Indeo, MPEG Known as CODECs compression / decompression algorithms hardware and software implementations symmetrical

More information

Information technology Coding of moving pictures and associated audio for digital storage media at up to about 1,5 Mbit/s

Information technology Coding of moving pictures and associated audio for digital storage media at up to about 1,5 Mbit/s INTERNATIONAL STANDARD ISO/IEC 72-2:993 TECHNICAL CORRIGENDUM 3 Published 2003--0 INTERNATIONAL ORGANIZATION FOR STANDARDIZATION МЕЖДУНАРОДНАЯ ОРГАНИЗАЦИЯ ПО СТАНДАРТИЗАЦИИ ORGANISATION INTERNATIONALE

More information

7.5 Dictionary-based Coding

7.5 Dictionary-based Coding 7.5 Dictionary-based Coding LZW uses fixed-length code words to represent variable-length strings of symbols/characters that commonly occur together, e.g., words in English text LZW encoder and decoder

More information

Implementation and analysis of Directional DCT in H.264

Implementation and analysis of Directional DCT in H.264 Implementation and analysis of Directional DCT in H.264 EE 5359 Multimedia Processing Guidance: Dr K R Rao Priyadarshini Anjanappa UTA ID: 1000730236 priyadarshini.anjanappa@mavs.uta.edu Introduction A

More information

The Power and Bandwidth Advantage of an H.264 IP Core with 8-16:1 Compressed Reference Frame Store

The Power and Bandwidth Advantage of an H.264 IP Core with 8-16:1 Compressed Reference Frame Store The Power and Bandwidth Advantage of an H.264 IP Core with 8-16:1 Compressed Reference Frame Store Building a new class of H.264 devices without external DRAM Power is an increasingly important consideration

More information

Part 1 of 4. MARCH

Part 1 of 4. MARCH Presented by Brought to You by Part 1 of 4 MARCH 2004 www.securitysales.com A1 Part1of 4 Essentials of DIGITAL VIDEO COMPRESSION By Bob Wimmer Video Security Consultants cctvbob@aol.com AT A GLANCE Compression

More information

Tutorial T5. Video Over IP. Magda El-Zarki (University of California at Irvine) Monday, 23 April, Morning

Tutorial T5. Video Over IP. Magda El-Zarki (University of California at Irvine) Monday, 23 April, Morning Tutorial T5 Video Over IP Magda El-Zarki (University of California at Irvine) Monday, 23 April, 2001 - Morning Infocom 2001 VIP - Magda El Zarki I.1 MPEG-4 over IP - Part 1 Magda El Zarki Dept. of ICS

More information

Fast Implementation of VC-1 with Modified Motion Estimation and Adaptive Block Transform

Fast Implementation of VC-1 with Modified Motion Estimation and Adaptive Block Transform Circuits and Systems, 2010, 1, 12-17 doi:10.4236/cs.2010.11003 Published Online July 2010 (http://www.scirp.org/journal/cs) Fast Implementation of VC-1 with Modified Motion Estimation and Adaptive Block

More information

What is multimedia? Multimedia. Continuous media. Most common media types. Continuous media processing. Interactivity. What is multimedia?

What is multimedia? Multimedia. Continuous media. Most common media types. Continuous media processing. Interactivity. What is multimedia? Multimedia What is multimedia? Media types +Text + Graphics + Audio +Image +Video Interchange formats What is multimedia? Multimedia = many media User interaction = interactivity Script = time 1 2 Most

More information

Lecture 5: Compression I. This Week s Schedule

Lecture 5: Compression I. This Week s Schedule Lecture 5: Compression I Reading: book chapter 6, section 3 &5 chapter 7, section 1, 2, 3, 4, 8 Today: This Week s Schedule The concept behind compression Rate distortion theory Image compression via DCT

More information

Standard Codecs. Image compression to advanced video coding. Mohammed Ghanbari. 3rd Edition. The Institution of Engineering and Technology

Standard Codecs. Image compression to advanced video coding. Mohammed Ghanbari. 3rd Edition. The Institution of Engineering and Technology Standard Codecs Image compression to advanced video coding 3rd Edition Mohammed Ghanbari The Institution of Engineering and Technology Contents Preface to first edition Preface to second edition Preface

More information

In the first part of our project report, published

In the first part of our project report, published Editor: Harrick Vin University of Texas at Austin Multimedia Broadcasting over the Internet: Part II Video Compression Borko Furht Florida Atlantic University Raymond Westwater Future Ware Jeffrey Ice

More information

Compression; Error detection & correction

Compression; Error detection & correction Compression; Error detection & correction compression: squeeze out redundancy to use less memory or use less network bandwidth encode the same information in fewer bits some bits carry no information some

More information