Design and Efficient FPGA Implementation of an RGB to YCrCb Color Space Converter Using Distributed Arithmetic


Faycal Bensaali and Abbes Amira
School of Computer Science, Queen's University of Belfast, Belfast BT7 1NN
{f.bensaali, a.amira}@qub.ac.uk

Abstract. Processing an image in the RGB color space, with a set of RGB values for each pixel, is not the most efficient method. To speed up some processing steps, many video compression and communication techniques use luminance/chrominance color spaces, such as YCrCb, making a mechanism for converting between formats necessary. Therefore, techniques which efficiently implement this conversion are desired. This paper presents two novel architectures for efficient implementation of a Color Space Converter (CSC) suitable for Field Programmable Gate Arrays (FPGAs) and VLSI. The proposed architectures are based on Distributed Arithmetic (DA) ROM accumulator principles. The architectures have been implemented and verified using the Celoxica RC1000-PP FPGA development board. In addition, they are platform independent and have a low latency (8 cycles). The first architecture has a throughput of eight (one conversion every eight clock cycles), while the second one is fully pipelined, has a throughput of one and is capable of a sustained data rate of over 234 mega-conversions/second.

1 Introduction

Color is a visual sensation produced by the light in the visible region of the spectrum incident on the retina. Since the human visual system has three types of color photoreceptor cone cells, three components are necessary and sufficient to describe a color [1]. Color spaces (also called color models or color systems) provide a standard method of defining and representing colors. There are many existing color spaces and most of them represent each color as a point in a 3D coordinate system. Each color space is optimized for a well-defined application area [2]. The three most popular color models are RGB (used in computer graphics); YIQ, YUV and YCrCb (used in video systems); and CMYK (used in color printing). All of the color spaces can be derived from the RGB information supplied by devices such as cameras and scanners.

Processing an image in the RGB color space, with a set of RGB values for each pixel, is not the most efficient method. To speed up some processing steps, many broadcast, video and imaging standards use luminance and color difference video signals, such as YCrCb, making a mechanism for converting between formats necessary. Several cores for RGB to YCrCb conversion can be found in the market, which have been designed for FPGA implementation, such as the cores proposed by Amphion Ltd [3], CAST Inc. [4] and ALMA Tech [5].

As part of an ongoing research project to develop a hardware accelerator for image and signal processing algorithms based on matrix computations at Queen's University of Belfast [6, 7], this paper proposes the use of an FPGA as a low cost accelerator for two RGB to YCrCb color space conversion architectures based on DA ROM accumulator principles. The two proposed architectures are based on serial and parallel manipulation of pixels. The target hardware for the implementation and verification of the proposed architectures is the Celoxica RC1000-PP PCI based FPGA development board equipped with a Xilinx XCV2000E Virtex FPGA [8, 9].

The rest of the paper is organised as follows. A review of the conversion from R'G'B' to Y'CrCb is given in Section 2. Sections 3 and 4 are concerned with the mathematical backgrounds and the descriptions of the two proposed architectures. The results and analysis for the hardware implementations are then presented in Section 5. Finally, concluding remarks are given in Section 6. In the rest of this paper, the gamma-corrected RGB values are noted R'G'B'.

2 Converting From R'G'B' to Y'CrCb

Decomposing an R'G'B' color image into one luminance image and two chrominance images is the method that has been used in most commercial applications such as face detection, as well as the JPEG and MPEG imaging standards.

Fig. 1. Baseline JPEG encoder (Input image -> RGB to YCbCr -> DCT -> Quantisation -> Entropy coder -> Compressed data)

The calculation of R'G'B' color components from Y'CrCb components consumes up to 40% of the processing power in a highly optimised decoder [10]. Accelerating this operation would be useful for the acceleration of the whole process. A color in the R'G'B' color space is converted to the Y'CrCb color space using the following equation:

$$\begin{bmatrix} Y' \\ Cr \\ Cb \end{bmatrix} = [\text{constant conversion matrix}] \begin{bmatrix} R' \\ G' \\ B' \end{bmatrix} \qquad (1)$$

while the inverse conversion can be carried out using the following equation:

$$\begin{bmatrix} R' \\ G' \\ B' \end{bmatrix} = [\text{constant inverse conversion matrix}] \begin{bmatrix} Y' \\ Cr \\ Cb \end{bmatrix} \qquad (2)$$

3 Proposed Architecture Based Serial Manipulation Approach

3.1 Mathematical Background

Since color space conversion can be expressed as a Matrix-Vector (MV) multiplication, a novel algorithm based on DA is presented in this section. Consider the matrix-vector product given by the following equation:

$$C_i = \sum_{k=0}^{N-1} A_{ik} B_k \qquad (3)$$

where the $\{A_{ik}\}$ are L-bit constants and the $\{B_k\}$ are written in the unsigned binary representation as shown in equation 4:

$$B_k = \sum_{m=0}^{W-1} b_{k,m} 2^m \qquad (4)$$

where $b_{k,m}$ is the $m$-th bit of $B_k$ (zero or one) and $W$ is the word-length. Substituting (4) in (3):

$$C_i = \sum_{k=0}^{N-1} A_{ik} \sum_{m=0}^{W-1} b_{k,m} 2^m = \sum_{m=0}^{W-1} \left( \sum_{k=0}^{N-1} A_{ik} b_{k,m} \right) 2^m \qquad (5)$$

Define:

$$Z_m = \sum_{k=0}^{N-1} A_{ik} b_{k,m} \qquad (6)$$

Therefore, $C_i$ can be computed as:

$$C_i = \sum_{m=0}^{W-1} Z_m 2^m \qquad (7)$$

The idea is that, since the term $Z_m$ depends only on the $b_{k,m}$ values and has only $2^N$ possible values, it is possible to precompute these values and store them in ROMs. An input set of N bits $(b_{0,m}, b_{1,m}, \ldots, b_{(N-1),m})$ is used as an address to retrieve the corresponding $Z_m$ value. The content of each ROM depends on the coefficients of the matrix A. These intermediate results are accumulated in W clock cycles to produce the $C_i$ coefficients.

3.2 Case Study: Converting From R'G'B' to Y'CrCb

The CSC core implements the following mathematical formula to convert from one space to another:

$$\begin{bmatrix} C_0 \\ C_1 \\ C_2 \end{bmatrix} = \begin{bmatrix} A_{00} & A_{01} & A_{02} & A_{03} \\ A_{10} & A_{11} & A_{12} & A_{13} \\ A_{20} & A_{21} & A_{22} & A_{23} \end{bmatrix} \begin{bmatrix} B_0 \\ B_1 \\ B_2 \\ 1 \end{bmatrix} \qquad (8)$$

where $B_i$ and $C_i$ ($0 \le i \le 2$) represent the input and output color components respectively. Since all the components are in the range of 0 to 255, 8 bits are enough to represent them. In our application (N = 4 and W = 8), $C_i$ can be computed as:

$$C_i = \sum_{m=0}^{7} Z_m 2^m \qquad (9)$$

where:

$$Z_m = \sum_{k=0}^{3} A_{ik} b_{k,m} \qquad (10)$$

ROMs (one for each row of the matrix A), each of size $2^N = 2^4 = 16$, are needed in order to store the $2^4$ possible precomputed partial product values. Since the last element of the vector B is equal to 1:

$$b_{3,m} = \begin{cases} 1 & \text{for } m = 0 \\ 0 & \text{for } m \neq 0 \end{cases} \qquad (11)$$

Equation (9) can be rewritten as:

$$C_i = \sum_{m=0}^{7} Z^*_m 2^m + A_{i3} \qquad (12)$$

where:

$$Z^*_m = \sum_{k=0}^{2} A_{ik} b_{k,m} \qquad (13)$$

It is worth mentioning that the size of the ROMs has been reduced to $2^3$. Table 1 gives the content of each ROM.

Table 1. Content of the ROM i

b0,m  b1,m  b2,m   The content of the ROM i
0     0     0      0
0     0     1      A_i2
0     1     0      A_i1
0     1     1      A_i1 + A_i2
1     0     0      A_i0
1     0     1      A_i0 + A_i2
1     1     0      A_i0 + A_i1
1     1     1      A_i0 + A_i1 + A_i2
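The ROM look-up of equations (12) and (13) can be illustrated with a short software model. The sketch below is only an illustration under stated assumptions: the function names `build_rom` and `da_output` are hypothetical, the coefficients are treated as plain integers rather than the fixed-point values used in hardware, and b0,m is assumed to be the most significant address bit, as in Table 1.

```python
from itertools import product


def build_rom(row):
    """Precompute the 2**3 partial sums of Table 1 for one matrix row.

    row = (A_i0, A_i1, A_i2). The ROM address is the bit triple
    (b0, b1, b2), with b0 taken as the most significant address bit
    (an assumption made for this sketch).
    """
    a0, a1, a2 = row
    return [b0 * a0 + b1 * a1 + b2 * a2
            for b0, b1, b2 in product((0, 1), repeat=3)]


def da_output(row, offset, inputs, width=8):
    """Evaluate C_i = sum_m Z*_m 2^m + A_i3 with one ROM look-up per bit."""
    rom = build_rom(row)
    acc = offset  # A_i3 is added once, outside the ROM (equation 12)
    for m in range(width):
        # Bit m of each of the three input components forms the address.
        b0, b1, b2 = ((x >> m) & 1 for x in inputs)
        address = (b0 << 2) | (b1 << 1) | b2
        acc += rom[address] << m  # weight the partial sum by 2^m
    return acc


# Hypothetical integer coefficients, chosen only to exercise the code.
print(da_output((3, -5, 7), 16, (200, 17, 255)))
```

The result equals the direct dot product 3*200 - 5*17 + 7*255 + 16, which is the point of the technique: eight look-ups and additions replace the multiplications.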

3.3 Proposed Architecture

Since our objective is to implement a core which performs two different color conversions (R'G'B' to Y'CrCb and Y'CrCb to R'G'B'), 6 ROMs are needed (3 for each conversion). Figures 2 and 3 show the proposed core pins and its internal architecture respectively. The pins description is given in Table 2.

Fig. 2. Symbol of the CSC core (inputs B0, B1, B2, S and CE; outputs C0[0:7], C1[0:7] and C2[0:7])

Fig. 3. Serial CSC based DA architecture (two ROMs blocks, RGB to YCrCb and YCrCb to RGB, addressed by b0,m, b1,m, b2,m and selected by S; three PEs, each with a << m shifter, producing C0, C1 and C2)

The proposed architecture consists of three identical Processing Elements (PEs) and two ROMs blocks. Each PE comprises a parallel ACCumulator (ACC) and a right shifter, and each ROMs block consists of three ROMs of size $2^3$ each (Figure 4). The content of the ROMs depends on the matrix A coefficients, which in turn depend on the conversion type.

Fig. 4. ROMs block structure (ROM 1, ROM 2, ROM 3)

It is worth mentioning that our architecture is scalable: it can be used to perform n conversions by adding 3 × n ROMs to store the conversion matrix coefficients, while always keeping the same PEs. An N × M image can be converted using the proposed architecture by setting the inputs every 8 clock cycles to the R'G'B' components of a new pixel (Y'CrCb for the inverse conversion).
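A cycle-level software model can make the behaviour of the serial core easier to follow. The following is a minimal sketch, not the authors' VHDL: `MATRICES`, `make_rom_block`, `serial_csc` and `convert_image` are hypothetical names, the two 3 × 4 matrices hold placeholder integer coefficients rather than real conversion coefficients, and the selector S is modelled as a dictionary key.

```python
# Placeholder 3x4 integer matrices (NOT real conversion coefficients);
# key 0 stands for one conversion direction, key 1 for the other.
MATRICES = {
    0: ((1, 2, 1, 0), (2, -1, -1, 128), (-1, -1, 2, 128)),
    1: ((1, 0, 1, -128), (1, -1, 0, 64), (1, 1, 0, -128)),
}


def make_rom_block(matrix):
    """One ROMs block: three 8-entry ROMs, one per row of the matrix."""
    return [[((a >> 2) & 1) * r[0] + ((a >> 1) & 1) * r[1] + (a & 1) * r[2]
             for a in range(8)]
            for r in matrix]


def serial_csc(pixel, matrices, select):
    """Convert one pixel in 8 clock cycles with three accumulating PEs."""
    matrix = matrices[select]            # S chooses the ROMs block
    rom_block = make_rom_block(matrix)   # fixed in hardware; rebuilt here for brevity
    acc = [row[3] for row in matrix]     # each ACC starts from its constant A_i3
    for m in range(8):                   # one clock cycle per input bit
        address = 0
        for component in pixel:          # bits b0,m b1,m b2,m form the address
            address = (address << 1) | ((component >> m) & 1)
        for i in range(3):               # the three PEs work in parallel
            acc[i] += rom_block[i][address] << m
    return tuple(acc)


def convert_image(image, matrices, select):
    """Convert a list of pixel rows; returns the result and the cycle count."""
    cycles = 0
    converted = []
    for row in image:
        out_row = []
        for pixel in row:
            out_row.append(serial_csc(pixel, matrices, select))
            cycles += 8                  # a new pixel is loaded every 8 cycles
        converted.append(out_row)
    return converted, cycles


print(serial_csc((10, 20, 30), MATRICES, 0))
```

Supporting a further conversion in this model only means adding another entry to `MATRICES`, which mirrors the scalability argument above: more ROMs, same PEs.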

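4 Proposed Architecture Based Parallel Manipulation Approach

4.1 Mathematical Background

Consider an N × M image (N: image height, M: image width). Let each image pixel be represented by $b_{ijk}$ ($0 \le i \le N-1$, $0 \le j \le M-1$, $0 \le k \le 2$), where:

$b_{ij0} = R'_{ij}$: the red component of the pixel in row i and column j
$b_{ij1} = G'_{ij}$: the green component of the pixel in row i and column j
$b_{ij2} = B'_{ij}$: the blue component of the pixel in row i and column j

The image can be converted using the following mathematical formula:

$$C = A \otimes B \qquad (14)$$

$$\begin{bmatrix} \mathbf{c}_{00} & \mathbf{c}_{01} & \cdots & \mathbf{c}_{0(M-1)} \\ \mathbf{c}_{10} & \mathbf{c}_{11} & \cdots & \mathbf{c}_{1(M-1)} \\ \vdots & \vdots & & \vdots \\ \mathbf{c}_{(N-1)0} & \mathbf{c}_{(N-1)1} & \cdots & \mathbf{c}_{(N-1)(M-1)} \end{bmatrix} = \begin{bmatrix} a_{00} & a_{01} & a_{02} & a_{03} \\ a_{10} & a_{11} & a_{12} & a_{13} \\ a_{20} & a_{21} & a_{22} & a_{23} \end{bmatrix} \otimes \begin{bmatrix} \mathbf{b}_{00} & \mathbf{b}_{01} & \cdots & \mathbf{b}_{0(M-1)} \\ \mathbf{b}_{10} & \mathbf{b}_{11} & \cdots & \mathbf{b}_{1(M-1)} \\ \vdots & \vdots & & \vdots \\ \mathbf{b}_{(N-1)0} & \mathbf{b}_{(N-1)1} & \cdots & \mathbf{b}_{(N-1)(M-1)} \end{bmatrix} \qquad (15)$$

with $\mathbf{c}_{ij} = [c_{ij0}, c_{ij1}, c_{ij2}]^T$ and $\mathbf{b}_{ij} = [b_{ij0}, b_{ij1}, b_{ij2}, 1]^T$, where the operation $\otimes$ can be defined as follows: each vector $\mathbf{c}_{ij}$ is the result of the product

$$\begin{bmatrix} c_{ij0} \\ c_{ij1} \\ c_{ij2} \end{bmatrix} = \begin{bmatrix} a_{00} & a_{01} & a_{02} & a_{03} \\ a_{10} & a_{11} & a_{12} & a_{13} \\ a_{20} & a_{21} & a_{22} & a_{23} \end{bmatrix} \begin{bmatrix} b_{ij0} \\ b_{ij1} \\ b_{ij2} \\ 1 \end{bmatrix}$$

where $c_{ijk}$ represents the output image color space components and A represents one of the constant matrices in equations 1 and 2. The $c_{ijk}$ elements can be computed using the following equation, with $b_{ij3} = 1$:

$$c_{ijk} = \sum_{m=0}^{3} a_{km} b_{ijm} \qquad (0 \le i \le N-1,\ 0 \le j \le M-1,\ 0 \le k \le 2) \qquad (16)$$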

where the $\{a_{km}\}$ are L-bit constants and the $\{b_{ijm}\}$ are written in the unsigned binary representation as shown in equation 17:

$$b_{ijm} = \sum_{l=0}^{W-1} b_{ijm,l} 2^l \qquad (0 \le i \le N-1,\ 0 \le j \le M-1,\ 0 \le m \le 2) \qquad (17)$$

Using the same development as in the previous section, equation (16) can be rewritten as:

$$c_{ijk} = \sum_{l=0}^{7} Z^*_l 2^l + a_{k3} \qquad (18)$$

where:

$$Z^*_l = \sum_{m=0}^{2} a_{km} b_{ijm,l} \qquad (19)$$

As in the first proposed architecture, the content of the ROMs depends on the matrix A coefficients, which in turn depend on the conversion type.

4.2 Proposed Architecture

Equation 18 can be mapped into the proposed architecture as shown in Figure 5. The architecture consists of 8 identical PE_n's ($0 \le n \le 7$). Each PE comprises three parallel signed integer adders, three n right shifters and one ROMs block, which has the structure shown in Figure 4. It is worth noting that the architecture has a latency of W and a throughput rate equal to 1. The entire image conversion can be carried out in (Latency + (N × M) × Throughput) = 8 + (N × M) clock cycles, while using the standard algorithm the conversion can be carried out in (3 × 4 × N × M) = 12 × N × M clock cycles, where (3 × 4) is the size of the constant matrix A.

Fig. 5. Proposed parallel architecture based on DA principles (PE: Processor Element). PE_l receives the bits b_ij0,l, b_ij1,l and b_ij2,l through input delays; the partial sums are shifted by << 1 to << 7 and added along the chain to produce c_ij0, c_ij1 and c_ij2.

5 Hardware Implementation

The two proposed architectures based on the DA technique have been implemented and verified using the Celoxica RC1000-PP FPGA development board.
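When verifying a pipelined design such as the parallel architecture of Section 4.2, a software reference model is a convenient cross-check. The sketch below is one possible model, not the authors' implementation: `pe` and `pipeline_convert` are hypothetical names, `EXAMPLE_MATRIX` holds placeholder integer coefficients, and the cycle count it reports depends on how the first and last clock edges are counted, so it may differ by a small constant from the 8 + (N × M) figure quoted above.

```python
# Placeholder 3x4 integer matrix (NOT real conversion coefficients).
EXAMPLE_MATRIX = ((1, 2, 1, 0), (2, -1, -1, 128), (-1, -1, 2, 128))


def pe(l, state, matrix):
    """PE_l: look up the partial sums for bit plane l and add them, shifted by l."""
    pixel, sums = state
    b0, b1, b2 = ((component >> l) & 1 for component in pixel)
    new_sums = [s + ((b0 * row[0] + b1 * row[1] + b2 * row[2]) << l)
                for s, row in zip(sums, matrix)]
    return pixel, new_sums


def pipeline_convert(pixels, matrix, width=8):
    """Stream pixels through `width` PEs, accepting one new pixel per clock cycle."""
    stages = [None] * width          # stages[l]: state latched at the output of PE_l
    stream = iter(pixels)
    results, cycles = [], 0
    while len(results) < len(pixels):
        incoming = next(stream, None)
        shifted = [None] * width
        for l in range(1, width):    # every occupied stage moves one PE forward
            if stages[l - 1] is not None:
                shifted[l] = pe(l, stages[l - 1], matrix)
        if incoming is not None:     # the first adder starts from the constants a_k3
            shifted[0] = pe(0, (incoming, [row[3] for row in matrix]), matrix)
        stages = shifted
        cycles += 1
        if stages[-1] is not None:   # a finished conversion leaves the last PE
            results.append(tuple(stages[-1][1]))
    return results, cycles


results, cycles = pipeline_convert([(10, 20, 30), (255, 0, 128), (1, 2, 3)],
                                   EXAMPLE_MATRIX)
print(results, cycles)
```

Once the pipeline is full, one conversion completes on every cycle, so the total grows with the number of pixels plus a fixed latency term, against twelve multiply-accumulate steps per pixel for the direct 3 × 4 product.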

The RC1000-PP board used is a standard PCI bus card equipped with the Virtex-E 2000 FPGA chip (package: bg560, speed grade 6). Tables 2 and 3 give the content of the ROMs used for the R'G'B' to Y'CrCb and Y'CrCb to R'G'B' conversions respectively, for both architectures.

Table 2. Content of the ROMs (R'G'B' to Y'CrCb)
Address bits: R' (b0,m / b_ij0,l), G' (b1,m / b_ij1,l), B' (b2,m / b_ij2,l); contents: ROM1, ROM2, ROM3

Table 3. Content of the ROMs (Y'CrCb to R'G'B')
Address bits: Y' (b0,m / b_ij0,l), Cr (b1,m / b_ij1,l), Cb (b2,m / b_ij2,l); contents: ROM1, ROM2, ROM3

The second proposed architecture can be used for the inverse conversion (Y'CrCb to R'G'B') either by duplicating the ROMs, using the same implementation approach as for the first architecture (with a selector signal which allows the user to choose the appropriate converter), or by setting the contents of the ROMs in advance, depending on the desired conversion.

The precomputed partial products are stored in the ROMs using a 13-bit fixed-point representation (8 bits for the integer part and 5 bits for the fractional part), and 13-bit arithmetic is used inside the architecture. The inputs and outputs of the two architectures are represented using 8 bits, and the outputs are rounded. Rounding usually looks at the fractional part and, if it is greater than or equal to 0.5, increases the result by one. This implies a condition check followed by another arithmetic operation. A more efficient way to round a number is to add 0.5 to the result and truncate the fractional part. This technique has been applied in our proposed architecture.
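The equivalence of the two rounding methods can be checked in a few lines. This is a minimal sketch assuming values are held as integers counting steps of 1/32 (5 fractional bits, as in the 13-bit format described above); the names `to_fixed`, `round_with_test` and `round_by_offset` are hypothetical.

```python
FRAC_BITS = 5                  # 5 fractional bits, as in the 13-bit format
HALF = 1 << (FRAC_BITS - 1)    # the value 0.5 expressed in that format


def to_fixed(value):
    """Quantise a real number to an integer count of 1/32 steps."""
    return int(round(value * (1 << FRAC_BITS)))


def round_with_test(fixed):
    """Conventional rounding: inspect the fraction, then conditionally add one."""
    integer = fixed >> FRAC_BITS
    fraction = fixed & ((1 << FRAC_BITS) - 1)
    return integer + 1 if fraction >= HALF else integer


def round_by_offset(fixed):
    """Add 0.5 and truncate: a single addition, no comparison."""
    return (fixed + HALF) >> FRAC_BITS


print(round_by_offset(to_fixed(2.5)), round_by_offset(to_fixed(2.46875)))
```

Because the 0.5 is a constant, it can be folded into a value that is already being added, so the rounding costs no extra operation at run time.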

To implement this rounding, the initial value for each PE's ACC (for the serial architecture) and for the first PE's adder (for the parallel architecture) is set in advance to $(a_{i3} + 0.5)$, where $0 \le i \le 2$. The MACs and parallel signed adders have been implemented using Xilinx's CoreGen utility, which contains many efficient designs that can often save time for a programmer. The shifters and the ROMs initialisation have been implemented using VHDL.

6 Results and Analysis

In order to make a fair and consistent comparison with the existing FPGA based color space converters, the XCV50E-8 FPGA device has been targeted. Table 4 illustrates the performance obtained for the proposed architectures in terms of the area consumed and the speed which can be achieved.

Table 4. Performance comparison with existing CSC cores

Design                      Slices   Speed (MHz)   Throughput (vector/clock cycle)
Proposed architecture (1)   7        128           8
Proposed architecture (2)   193      234           1
CAST Inc. [4]               222      112           1
ALMA Tech [5]               222      15            1
Amphion Ltd [3]

Table 5. Software/hardware implementations for RGB to YCrCb CSC comparisons
(the Original image, Software implementation and Hardware implementation columns contain pictures)

Image     RMS error (Y / Cr / Cb)   Computation time (ms), software   Computation time (ms), hardware
Image 1   487 / 63 / 461            126                               12
Image 2   684 / 83 / 396            43                                28

The proposed architecture based on the serial manipulation approach shows significant improvements in comparison with the existing implementations [3, 4, 5], which perform the R'G'B' to Y'CrCb conversion, in terms of the area consumed and the maximum running clock frequency, while the second architecture outperforms the existing ones in terms of the number of conversions per second.

Table 5 illustrates the software/hardware implementations comparison, when using the second proposed DA architecture, in terms of the computation time and of the RMS error (due to the use of different data representations in the two implementations):

$$\text{RMS Error} = \sqrt{\frac{1}{N M} \sum_{i=0}^{N-1} \sum_{j=0}^{M-1} \left( I_{soft}(i,j) - I_{hard}(i,j) \right)^2}$$

for two different images (the Baboon image (512) and the Pepper image). It can be seen that the same converted image can be obtained faster when using the FPGA implementation, with a minimum error.

7 Conclusion

Processing an image in the RGB color space, with a set of RGB values for each pixel, is not the most efficient method. To speed up some processing steps, many broadcast, video and imaging standards use luminance and color difference video signals, such as YCrCb, making a mechanism for converting between formats necessary. Conversions between R'G'B' and Y'CrCb require enormous computing power. However, novel, scalable and efficient architectures based on DA principles have been reported in this paper. The implementation results show the effectiveness of the DA approach. The performance of the proposed architectures, in terms of the area used and the maximum running frequency, has been assessed and shows that the proposed systems require less area and can be run at a higher frequency when compared with existing systems.

References

1. B. Payette, "Color Space Converter: R'G'B' to Y'CrCb", Xilinx Application Note, XAPP637 (V1.0), September (2002)
2. R. C. Gonzalez and R. E. Woods, "Digital Image Processing", Second Edition, Prentice Hall Inc., (2002)
3. Datasheet (www.amphion.com): "Color Space Converters", Amphion Semiconductor Ltd, DS64 V11, (2002)
4. Application Note (www.cast-inc.com): "CSC Color Space Converter", CAST, Inc., April 15, (2002)
5. Datasheet (www.alma-tech.com): "High Performance Color Space Converter", ALMA Technologies, (2002)
6. F. Bensaali, A. Amira, I. S. Uzun and A. Ahmedsaid, "An FPGA Implementation of 3D Affine Transformations", The 10th IEEE International Conference on Electronics, Circuits and Systems (ICECS 2003), Sharjah, United Arab Emirates, December 14-17, (2003)
7. F. Bensaali, A. Amira, A. Ahmedsaid and I. S. Uzun, "Efficient Implementation of Large Parallel Matrix Product for DOTs", The International Conference on Computer, Communication and Control Technologies (CCCT 2003), Orlando, Florida, USA, July 31 - August 2, (2003)
8. Datasheet: "RC1000 Reconfigurable hardware development platform", Celoxica Ltd, (2001)
9. URL: www.xilinx.com
10. M. Bartkowiak, "Optimisations of Color Transformation for Real Time Video Decoding", Digital Signal Processing for Multimedia Communications and Services, EURASIP ECMCS 2001, Budapest, September (2001)
