Optimisation of Statistical Feature Extraction Algorithms

C. Lombard*, W.A. Smit and J.P. Maré
Kentron Dynamics Machine Vision Group

* P.O. Box 742, Centurion, 0046, South-Africa. cecilia.lombard@kentron.co.za
All images courtesy of Kentron's SIMIS environment.

Keywords: statistical feature extraction, real-time implementation, optimisation.

Abstract

This paper describes the process of optimising statistical feature extraction algorithms for use in object recognition. The focus is on real-time implementation of these algorithms on applicable processors. Different processors were evaluated, of which the TigerSharc was chosen to be discussed in this paper. One single-window and two double-window features are discussed for object-of-interest recognition. It is demonstrated here that a large improvement in the execution time can be obtained by implementing several optimisation techniques in C, some seemingly inconsequential. Also demonstrated is the improvement that the use of assembly language can make.

1. Introduction

Before object recognition on an image can be implemented in a system, the algorithm must be real-time implementable. In [1] and [2] possible features are discussed for detecting point objects in simulated images, and example IR (infrared) images are shown in Figure 1. From the original features tested, only three will be taken as examples for the purpose of this paper. Ways of optimising the code used for feature extraction, and benchmarking of the old and new code, are also discussed. The three example features selected are: (1) Maximum Grey Level [2] (a single-window feature), (2) Average Gradient Strength [2] (a double-window feature) and (3) Variance Ratio (a double-window feature). They are reviewed in section 2 to provide a basis for the optimisation discussion that follows in section 3.

2. Features

Two classes of features were used, namely single-window and double-window features. Double-window features are calculated using parameters derived from both an inner (target) and an outer (local background) window, while single-window features are calculated by operating only on the target window. Please note that the outer window is "donut"-shaped, i.e. it excludes the region of the inner window.

Figure 1: Simulated IR images illustrating the objects with low and high cluttered backgrounds.

Figure 2: The feature extraction procedure, showing the direction of movement of the sliding window(s) across the image. (Labels: Outer Window, Inner Window, Image.)

The feature extraction procedure is shown in Figure 2. The sliding window(s) moves across a grey-scale image from pixel to pixel, from left to right and from top to bottom. At each new pixel position the three features are calculated over the window(s).

2.1. Maximum Grey Level

This feature searches through the inner window for the highest grey level value. Thus, in the IR example, if there is a part of the object that is significantly warmer than the rest of the object and the background, the value at that point will be the value assigned to this feature. An example image, and the features obtained from that image, are shown in Figure 3.
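The direct form of this feature can be sketched in C as follows. This is a minimal illustration and not the authors' code: the function name `max_grey_level_direct`, the 8-bit row-major image layout and the square window of side `win` are all assumptions made for the example. Every window position rescans all of its pixels, which is the baseline that the optimisation discussion starts from.

```c
#include <stddef.h>
#include <stdint.h>

/* Direct maximum grey level (hypothetical helper, not from the paper).
 * img : row-major 8-bit image, width x height pixels (assumed layout)
 * win : side of the square inner window (assumed shape)
 * out : (height - win + 1) * (width - win + 1) results, one per position
 *       at which the window fits entirely inside the image. */
void max_grey_level_direct(const uint8_t *img, size_t width, size_t height,
                           size_t win, uint8_t *out)
{
    if (win == 0 || win > width || win > height)
        return;

    size_t out_w = width - win + 1;
    size_t out_h = height - win + 1;

    for (size_t r = 0; r < out_h; r++) {          /* top to bottom */
        for (size_t c = 0; c < out_w; c++) {      /* left to right */
            uint8_t best = 0;
            /* Scan every pixel of the window at this position. */
            for (size_t wr = 0; wr < win; wr++) {
                const uint8_t *row = img + (r + wr) * width + c;
                for (size_t wc = 0; wc < win; wc++) {
                    if (row[wc] > best)
                        best = row[wc];
                }
            }
            out[r * out_w + c] = best;
        }
    }
}
```

The cost is proportional to the number of window positions times the number of pixels per window, because nothing is carried over from one position to the next.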

Figure 3: An image and the three features obtained from it. (Panels: a. IR image, b. Maximum Grey Level, c. Average Gradient Strength, d. Variance Ratio.)

2.2. Average Gradient Strength

This feature, described in [2], relies on the occurrence of sharper internal detail in man-made objects when compared to natural objects, even if the average intensity of the man-made and natural objects is similar. The average gradient strength of the local background is subtracted from the average gradient strength of the object region to prevent large regions of background that exhibit a larger than normal variation from yielding a high value for this feature. In [2] the feature is calculated as

$$F_{ij} = \frac{1}{n_{in}} \sum_{(k,l) \in N_{in}(i,j)} G_{in}(k,l) - \frac{1}{n_{out}} \sum_{(k,l) \in N_{out}(i,j)} G_{out}(k,l) \qquad (1)$$

where

$$G_{in}(k,l) = |G_{in}^{h}(k,l)| + |G_{in}^{v}(k,l)|,$$
$$G_{in}^{h}(k,l) = f_{in}(k,l) - f_{in}(k,l+1),$$
$$G_{in}^{v}(k,l) = f_{in}(k,l) - f_{in}(k+1,l),$$

and $G_{out}(k,l)$ is defined similarly. Here $n_{out}$ is the number of pixels in $N_{out}(i,j)$ and $n_{in}$ is the number of pixels in $N_{in}(i,j)$, where $N_{in}$ and $N_{out}$ respectively denote the target and local background windows.

2.3. Variance Ratio

This simple feature is given by:

$$F_{ij} = \frac{\sigma_{in}}{\sigma_{out}} \qquad (2)$$

where $\sigma_{out}$ and $\sigma_{in}$ respectively denote the standard deviation values calculated for the local background and target windows.

3. Optimisation

3.1. Feature extraction

In the direct implementation for generating the features, every feature value calculated uses every pixel in the sliding window for the calculation; for double-window features every pixel in both windows is used. Since adjoining windows overlap completely except for one column or row, many of the calculations are repeated. If information from the calculation of the previous (adjoining) value of a feature were saved and transferred, it could be used for the new calculation, thus saving a large amount of processing time. It was decided to implement this by, for each row, doing the complete calculation for the first window, then calculating the next value in the row from that value, the third value from the second value, and so forth.
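The row-wise reuse described above can be illustrated with a plain window sum, the simplest quantity that updates this way. The sketch below is an assumption-laden example rather than the paper's implementation: `column_sum` and `window_sums_row` are hypothetical names, the image is taken to be 8-bit and row-major, and the window is square. The first window in the row is summed in full; each later window adds the column that enters on the right and subtracts the column that leaves on the left.

```c
#include <stddef.h>
#include <stdint.h>

/* Sum of one window column: `win` pixels going down from (top, col).
 * Hypothetical helper; assumes a row-major 8-bit image. */
static uint32_t column_sum(const uint8_t *img, size_t width,
                           size_t top, size_t col, size_t win)
{
    uint32_t s = 0;
    const uint8_t *p = img + top * width + col;
    for (size_t k = 0; k < win; k++, p += width)
        s += *p;
    return s;
}

/* Window sums for all window positions whose top row is `top`.
 * The caller must ensure top + win does not exceed the image height.
 * out receives width - win + 1 sums. */
void window_sums_row(const uint8_t *img, size_t width, size_t top,
                     size_t win, uint32_t *out)
{
    if (win == 0 || win > width)
        return;

    size_t n_out = width - win + 1;

    /* Complete calculation for the first window in the row. */
    uint32_t sum = 0;
    for (size_t c = 0; c < win; c++)
        sum += column_sum(img, width, top, c, win);
    out[0] = sum;

    /* Every later window reuses the previous sum. */
    for (size_t c = 1; c < n_out; c++) {
        sum += column_sum(img, width, top, c + win - 1, win); /* new column */
        sum -= column_sum(img, width, top, c - 1, win);       /* old column */
        out[c] = sum;
    }
}
```

Per window position the work drops from all pixels in the window to two columns, which is where the saving in processing time comes from.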
This means that when the window is shifted to the next pixel in a row, the only change to the window is that a new column of pixels, on the right of the window, needs to be taken into account, and that an old column of pixels on the left of the window needs to be removed. This overlap between adjoining windows is shown in Figure 4.

Figure 4: The new window that includes the new column and excludes the old column. (Labels: Old Column, Sliding Inner Window, New Column, Image.)

3.1.1. Maximum Grey Level

In the direct method, each time that this feature is calculated, every pixel inside the window is searched to check whether it is higher than the running maximum. In the less processor-intensive implementation the previous maximum and the position of that maximum are passed to the new calculation. If the old maximum lies in the overlap region, its value is compared to the values of the new column and the new maximum is found from these. If the old maximum lies in the discarded column of the previous window, the whole of the new window is searched.

3.1.2. Average Gradient Strength

This feature calculates the sums of the variations between consecutive pixels over both the inner and outer windows, in both the vertical and the horizontal directions for each. These sums are then used to calculate the feature value. Because the four sums are linear combinations of the values in the sliding windows, the ones obtained for the previous pixel can be used as a basis for calculating the value for the new pixel. By the same reasoning as in section 3.1.1, the gradients associated with the new column(s) need to be added to the previous gradient total and those of the old column(s) need to be subtracted. The outer window must have two columns added and two columns subtracted because of the "donut" shape of the window.

3.1.3. Variance ratio

The formula for calculating the standard deviation,

$$\sigma = \sqrt{\frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2}$$

with $n$ the number of values in the window, $x_i$ the grey level pixel values in the window and $\bar{x}$ the average of the values in the window, represents a problem. The non-linearity of the square in the formula, coupled with the fact that the average changes from window to window, makes the implementation of an optimisation method similar to the ones used for the other features impossible without an approximation. The approximations implemented for the total calculation were found to be too inaccurate (they also became more and more inaccurate the farther from the start point in a row). The only optimisation that could be used was the calculation of the average value of the new window using the previous average.

Another optimisation technique that was evaluated was to use the ratio of the variances, and not the ratio of the standard deviations, of the two windows. In other words, this would entail removing the calculation of two square roots for every feature value calculated.

3.2. Processor-specific optimisation

3.2.1. Number format

The TigerSharc processor is a native floating-point processor; in other words, non-floating-point numbers are simulated with floating-point numbers. This means that extra processing power is required to handle these numbers.
However, if assembly language optimisation is used, four 8-bit integers could be processed in parallel instead of one 32-bit floating-point number.

3.2.2. Indexing

When using numerous for-loops with memory indexing inside the loops, it makes sense to minimise any calculations needed to address a specific memory space. For example, using two indexes to address a value in a two-dimensional matrix (for example the image) seems natural, but the processor uses only one index; hence every time a double index is used it has to be converted to a single index, which uses unnecessary processing power.

Another place in for-loops (in C) where processing power could be saved is at the test for ending a for-loop. The syntax for a for-loop in C is as follows:

    for (x = a; x < last; x++)

where 'a' is the start value of the index 'x', the test is 'x < last', and each time the loop executes 'x' is incremented by one ('x++'). If 'last' was a calculation, for example '5*a-3', that calculation would be executed once for every time the loop executes, but if 'last' was a pre-calculated variable the calculation itself would only be executed once.

3.2.3. General functions

There are several math functions in C that were written for the general case. When calculating the variance ratio, for example, a square needs to be calculated. This was originally done with the power function in C's math library. The power function is a general function in the sense that it is able to handle any power, not just the power of two. Hence it needs added logic to handle that, increasing the processing overhead enormously.

3.3. Assembly language optimisation

From the benchmarks it was determined that the most processor-intensive feature to calculate is the variance ratio. For this reason it was decided to focus on the variance ratio when implementing the assembly language optimisation. The formula for the standard deviation, on which the variance ratio is based, is discussed in section 3.1.3. In terms of code the math then looks something like this:

a. For an area calculate the average:
    Avg = (sum of pixels / number of pixels)
b. Calculate the standard deviation of the window as follows:
    Pixel_std_dev = (pixel_value - Avg)^2
    Std_Dev = Sqrt( (sum of Pixel_std_dev's) / (number of pixels - 1))

The development of a decent assembly implementation of the variance ratio subroutine relies on the following steps:

- Find the assembly instructions required to implement the function.
- Optimize for multi-function instructions (i.e., in a CPU (Central Processing Unit) core, optimize for multiple arithmetic units, and for the use of SIMD (Single Instruction/Multiple Data) where possible).
- Add software pipelines where applicable.
- Exploit the CPU architecture to account for multiple cores, and optimize the use of memory and the I/O (Input/Output) subsystem.
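Steps a and b above, together with the C-level points on indexing, loop bounds and the general power function, can be sketched in one place. This is an illustrative example under stated assumptions, not the code that was benchmarked: the names `window_variance` and `variance_ratio_squared` are hypothetical, the window is assumed to be stored as `n` contiguous 8-bit pixels, and the ratio is taken as target over background.

```c
#include <stddef.h>
#include <stdint.h>

/* Sample variance (n - 1 denominator) of n contiguous pixels, n >= 2.
 * Hypothetical helper; the contiguous layout is an assumption. */
float window_variance(const uint8_t *pixels, size_t n)
{
    /* Loop bound computed once, and a single pointer used as the index. */
    const uint8_t *end = pixels + n;

    /* Step a: average of the window. */
    float sum = 0.0f;
    for (const uint8_t *p = pixels; p < end; p++)
        sum += (float)*p;
    float avg = sum / (float)n;

    /* Step b: sum of squared deviations from the average. */
    float sq_sum = 0.0f;
    for (const uint8_t *p = pixels; p < end; p++) {
        float d = (float)*p - avg;
        sq_sum += d * d;   /* direct multiply instead of pow(d, 2.0) */
    }
    return sq_sum / (float)(n - 1);
}

/* Ratio of variances: the form of the feature with both square roots
 * removed. var_out must be non-zero. The in/out direction is assumed. */
float variance_ratio_squared(float var_in, float var_out)
{
    return var_in / var_out;
}
```

Taking the square root of `variance_ratio_squared` gives the ratio of standard deviations with one square root instead of two, since sqrt(a / b) equals sqrt(a) / sqrt(b) for positive a and b.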

3.3.1. Assembly Instructions Required

For the purpose of this exercise the processing is divided into two subroutines, i.e. the average calculation and the standard deviation calculation. Both will code efficiently in assembler, although it will be required to pass over the window twice. Please note that the code assumes that the data is available in internal memory. It is not concerned with the availability of I/O resources to move that data; the CPU sequencer will sort that out.

a. Average

    // Average Subroutine
    // Author : WA Smit
    // Date   : 5 Sept 2003
    // Syntax : Avg(Pointer to offset in image, num_rows, Width)
    //          Returns sum of the rows; C has to divide by number of pixels.
    // Description : this routine sums the number of rows as assigned, and
    //          across the width as assigned. It returns the sum of a number
    //          of pixels equal to (Num_Rows x Width) pixels.

    // Save regs
    [J6+J1]=XR0;;
    [J6+J1]=XR1;;
    [J6+J1]=J2;;
    [J6+J1]=J4;;
    // Calculate number of pixels, set up loop
    XR0=XR8*XR2;;
    J4=XR4;;                      // Setup DAG
    J2=1;;
    XR4=0;;                       // Zero sum reg
    XR8=0;;                       // Zero data reg
    LC0=XR0;;
    AVG_LOOP:
    XR1=[J4,+J2]; XR4=XR4+XR1;;
    IF NLC0E, JUMP AVG_LOOP;;
    // Value is returned in XR4
    J2=[J6-J1];;
    J4=[J6-J1];;
    XR1=[J6-J1];;
    XR0=[J6-J1];;
    Return;;

The cycle budget is then as follows:

    Save registers - 4 cycles
    Set up DAGs (Data Address Generators) - 2-3 cycles
    Zero assembly variables - 2 cycles
    Set up loop - 1 cycle
    *** Inner loop start
    Fetch data word and add to window total - 1 cycle * number of pixels
    *** Inner loop end
    Restore registers - 4 cycles

Note: In [3] the DAG is called the IALU (Integer Arithmetic Logic Unit).

The total number of cycles required is then:

    Cycles overhead: 14 cycles
    Inner loop: Cycles required = 1 * number_pixels

(Please note that the inner loop by inference uses a software pipeline. Please refer to the std_dev description below for a description of a software pipeline.)

The above routine does not explicitly accommodate the optimization for calculating the average value described in section 3.1.3 and Figure 4 above.
The routine does lend itself to be used in that way, however, if the calling parameters are changed slightly. When the total number of cycles needed to complete the variance ratio subroutine was calculated for the results (section 4), the above optimisation was included.

b. Standard Deviation

This routine is in essence the same as the Average routine, with the difference that the calculation per pixel is more complex.

    // StdDev Subroutine
    // Author : WA Smit
    // Date   : 5 Sept 2003
    // Syntax : StdDev(Pointer to offset in image, number of pixels, Average)
    //          Returns the sum of squared deviations of the rows; C has to
    //          divide by number of pixels and get the square root.
    // Description : this routine calculates the standard deviation of the
    //          number of rows as assigned, and across the width as assigned.
    //          It returns the standard deviation of a number of pixels equal
    //          to (Num_Rows x Width) pixels.

    // Save regs
    [J6+J1]=XR0;;
    [J6+J1]=XR1;;
    [J6+J1]=XR2;;
    [J6+J1]=J2;;
    [J6+J1]=J4;;
    // Setup loop
    J4=XR1;;                      // Setup DAG
    J2=1;;
    XR4=0;;                       // Zero sum reg
    XR0=[J4+J2];;                 // Start the pipeline
    LC0=XR8;;
    STD_LOOP:
    XR0=[J4+J2]; XR1=XR0-XR2;;
    XR2=XR1*XR1; XR4=XR4+XR2;;
    IF NLC0E, JUMP STD_LOOP;;
    XR4=XR4+XR2;;                 // End the pipeline
    // Value is returned in XR4
    J2=[J6-J1];;
    J4=[J6-J1];;
    XR2=[J6-J1];;
    XR1=[J6-J1];;
    XR0=[J6-J1];;
    Return;;

The cycle budget is then as follows:

    Save registers - 5 cycles
    Set up DAGs - 2 cycles
    Set up assembly variables - 2 cycles
    Set up loop - 1 cycle
    *** Inner loop start
    Fetch data word
    Subtract window_average
    Multiply result with self and add to window total
    *** Inner loop end
    Return std_dev - 1 cycle
    Restore regs - 5 cycles

Number of cycles overhead: 16 cycles

Inner loop: It is clear that the inner loop requires some optimization. It is proposed that the inner loop use two steps. In the first step the data is fetched and the subtraction is done. The second step is then a multiply-add to complete the processing. It is further proposed that a software pipeline be used, thereby ensuring an average throughput of 2 cycles per pixel for the inner loop. A software pipeline is needed as the data that is fetched from memory only becomes available for processing in the next cycle. The pipeline then looks something like this:

    Fetch_n;   Sub_empty
    Fetch_n+1; Sub_n
    Mult_n;    Add_n
    Fetch_n+2; Sub_n+1
    Mult_n+1;  ...
    Etc.

The inner loop cycles then become: 2 * number of pixels

3.3.2. Multi-function Instructions

Already done.

3.3.3. Software Pipelines

Already done.

3.3.4. CPU Optimizations

The selected processor is a superscalar CPU with two independent cores. In theory the number of cycles required should be half of what is required for a single core. In practice there are doubts about the ability of the CPU's I/O subsystem to support all the data transfers required. When the issue is pursued, the following is found:

Required per cycle:

Average
    Data words: 2 * 32 bit (one in each core; the accumulated total is stored in a register in each core).
    Instructions: 2 * 128 bit. This, however, applies to the first iteration of each loop only, as the data thereafter resides in the instruction cache of each core, i.e. ignore.

Standard deviation
    Much the same situation as Average.
It appears then that the CPU efficiency depends only on the ability of the programmer to arrange the data in memory in such a way that each core has free access to its data. It is therefore advised that the internal memory blocks be arranged as follows:

    Block 0 : Program
    Block 1 : Image
    Block 2 : Image

It is further proposed that the image size be restricted to a size that can fit into a single memory block and that these two blocks be swapped between the DMA (Direct Memory Access) subsystem and the cores. The cores can share a single 128-bit bus to move their data. Under these conditions full efficiency can be achieved.

3.3.5. Cycle Calculations

When all the cycles that are required to execute the assembly portions of the variance ratio function are added, and the optimization for the average routine is included, it is found that the assembly version executes four times faster than the C version. If the data variables are reduced to 8-bit variables, and SIMD in the selected processor's CPU cores is exploited, it should be possible to achieve a speedup of 10 to 16 times, depending on the availability of data on the CPU internal busses.

4. Results

Several optimisation techniques were tested with the different functions and are indicated by number in the results table (Table 1):

1. Calculating feature values from previous values within a row.
2. Removing the square roots when calculating the variance ratio feature.
3. Using floating-point numbers for all important variables.
4. Minimizing indexing calculations.
5. Replacing general functions with simple, direct implementations.
6. Assembly language optimisation.

The main function calls three functions, each of which calculates one feature. A 250 MHz clock was assumed. It was found that when the square roots are removed from the variance ratio function the execution time decreases, but the differentiation between object and non-object points also decreases. The square roots were reimplemented because of this, but with one square root instead of two, and the times were found to increase very little.

5. Summary

A large improvement was obtained in the execution time of the feature extraction algorithm after implementing several optimisation techniques. The final execution time obtained for a 300 x 300 image is still fairly long, but a large improvement is expected if the assembly language optimisation were applied to the whole algorithm. Techniques to designate areas with high probabilities of containing objects, before calculating the features in those areas, could also be implemented and tested.

6. References

[1] Lombard C., van Wyk B.J. and Maré J.P., 2002, Detection of Infrared, Ground-Based Point Objects: A Case Study, Proceedings of the Thirteenth Annual Symposium of the Pattern Recognition Association of South Africa, Nov
[2] Kwon H., Der S.Z. and Nasrabadi N.M., 2002, Adaptive multisensor target detection using feature-based fusion, Society of Photo-Optical Instrumentation Engineers, Vol. 4, No., pp
[3] Analog Devices, ADSP-TS101 TigerSHARC Processor Programming Reference, Revision 1.0, Jan

Table 1: Benchmarking results obtained with different optimisation techniques.

Columns: Nr.; optimisation technique used (Main Function, Maximum Grey Level Function, Average Gradient Strength Function, Variance Ratio Function); image size (pixels); clock cycles; time (with 250 MHz clock).

    none none none none 75 x s
    x s
    3 3,3, x s
    4 3,4,3,4,3 3,4 75 x s
    5 3,4,3,4,3 3,4,5 75 x ms
    6 3,4,3,4,3,4 3,4,5 75 x ms
    7 3,4,3,4,3,4 2,3,4,5 75 x ms
    8 3,4,3,4,3,4 2,3,4,5 76 x ms
    9 3,4,3,4,3,4,2,3,4,5 75 x ms
    0 3,4,3,4,3,4,2,3,4,5 00 x ms
    3,4,3,4,3,4,2,3,4,5 300 x s
    2 3,4,3,4,3,4,3,4,5 75 x ms
    3 3,4,3,4,3,4,3,4,5 00 x ms
    4 3,4,3,4,3,4,3,4,5 300 x s
    5 3,4,3,4,3,4,2,3,6 300 x s
    6 n/a n/a n/a,2,3,4,5 300 x s
    7 n/a n/a n/a,2,3,6 300 x s
