A Speed-Area Optimization of Full Search Block Matching Hardware with Applications in High-Definition TVs (HDTV)


Avishek Saha and Santosh Ghosh
Department of Computer Science and Engineering, IIT Kharagpur, WB, India

Abstract. HDTV-based applications require FSBM to maintain their significantly higher resolution compared with traditional broadcasting formats (NTSC, SECAM, PAL). This paper proposes techniques to increase the speed and reduce the area requirements of FSBM hardware. These techniques are based on modifications of the Sum-of-Absolute-Differences (SAD) computation and the MacroBlock (MB) searching strategy. The design of an FSBM architecture based on the proposed approaches is also outlined. The highlight of the proposed architecture is its split pipelined design, which facilitates parallel processing of macroblocks (MBs) in the initial stages. The proposed hardware has high throughput and low silicon area, and compares favorably with other existing FPGA architectures.

1 Introduction

Rapid growth in digital video applications, accompanied by the demand for better video quality, has resulted in the increasing popularity of high-definition TVs (HDTV) in the consumer market. One aspect of this trend is the increased interest in designing portable devices capable of encoding HD-quality video data. However, typical HD-compatible video encoders are based on MPEG2 MP@HL, and an MPEG2 MP@HL encoder uses motion estimation (ME) based on the exhaustive Full Search Block Matching Algorithm (FSBMA). In this case, the power consumption of the encoder is prohibitively high, particularly for portable implementations. Moreover, in a typical video encoder, the ME module accounts for more than 80% of the computational complexity. Software-based methods are unable to meet the real-time constraints of FSBM-ME implementations [1]. Hence, a highly efficient ME processor core is required to realize portable HD video encoding applications.
FSBM architectures can be broadly classified into FPGA [2,3,4,5,6,7] and ASIC [8,9,10,11,12,13,14,15,16,17,18] implementations. This work focuses on using FPGA technology to implement high-performance ME hardware. A systolic array architecture for FSBM, implementing real-time video encoding on a single FPGA chip, was proposed in [3]. A novel OnLine Arithmetic (OLA) based design, where each bit is processed in successive clock cycles starting with the most significant bit (MSB), was proposed in [4]. Roma et al. [5] proposed low-power core-based architectures for real-time motion estimation on FPGAs that are customizable for different coding parameters and hardware resources. Some FSBM hardware architectures proposed in [8] were implemented and their performance evaluated in [6]. The results show that real-time motion estimation for CIF (352 × 288) sequences can be achieved with 2-D systolic arrays and a moderate-capacity (250 k gates) FPGA chip. Finally, [7] implements an adder-tree based 16 × 1 SAD operation in FPGAs and also extends the 16 × 1 SAD implementation to perform 16 × 16 SAD operations.

This paper proposes two approaches for speed-area optimization of Full Search Block Matching Algorithm (FSBMA) hardware. The novelty of this work lies in the combined optimization of two mutually conflicting design parameters: high throughput and low silicon area. The first approach modifies the SAD operation so as to reduce the overall computational complexity of the ME module. This modification reduces the number of operations that need to be performed for each SAD-based block matching within a pre-defined search window. Subsequently, an MB scan technique is proposed which takes advantage of the SAD modification to further enhance the performance of the hardware implementation. The proposed hardware design uses a pipelined architecture which reduces the processing cycle count for each MB and thus increases the overall throughput. The initial stages of the pipeline have been split to facilitate parallel processing of MBs.

The paper is organized as follows. The next section provides background on FSBM-based motion estimation. Section 3 describes in detail the SAD modifications and the MB search strategy. Based on the approaches proposed in Section 3, the design outline of an FSBM hardware is sketched in Section 4. The hardware implementation results and their comparison with existing FPGA designs are presented in Section 5. Finally, Section 6 concludes this paper.

S. Aluru et al. (Eds.): HiPC 2007, LNCS 4873. © Springer-Verlag Berlin Heidelberg 2007
2 Full Search Block Matching

In video compression, motion-compensated prediction assumes that the pixels within the current picture can be modeled as a translation of those within a previous picture. This motion information is represented by two-dimensional displacement vectors, or motion vectors. Due to the block-based picture representation, many ME algorithms employ block-matching techniques. In such techniques, the motion vector is obtained by minimizing a cost function measuring the mismatch between a current MB and the reference MB. Several cost measures are available to measure the amount of distortion between the block in the current frame and the candidate block in the reference frame, such as mean-of-absolute-differences (MAD), sum-of-absolute-differences (SAD) and mean-square-error (MSE). SAD, the most commonly used matching criterion, between the pixels of the current MB $x(i, j)$ and the search region $y(i, j)$ can be expressed as

$$\mathrm{SAD}(u, v) = \sum_{i=1}^{N} \sum_{j=1}^{N} \left| x(i, j) - y(i + u, j + v) \right| \qquad (1)$$

where $(u, v)$ is the displacement between these two blocks. Thus, each search requires $N^2$ absolute differences and $(N^2 - 1)$ additions.

To find the MB producing the minimum mismatch error, we need to calculate the SAD at several locations within a search window. The simplest but most computationally intensive search method, known as FSBM, evaluates the SAD at every possible pixel location in the search area. In FSBM-based motion estimation, each $N \times N$ macroblock of the current frame is compared with all candidate MBs in the $(N + 2p) \times (N + 2p)$ search window defined within the previously processed frame, where $p$ is the maximum displacement of the $N \times N$ MB in all four directions around its boundary. The motion vector is determined by identifying the best matching MB. The FSBMA exhaustively evaluates all possible search locations and hence is optimal [19] in terms of reconstructed video quality and compression ratio. High computational requirements, a regular processing scheme and simple control structures make hardware implementation of FSBM a preferred choice.

Fig. 1. Execution profile of a typical video encoder

Fig. 1 shows the execution profile of a standard video encoder, obtained using the GNU gprof tool. As can be seen, among the various modules of a typical video encoder, motion estimation is the most computationally expensive. Furthermore, the SAD computations are the most time consuming, due to the complex nature of the absolute-value operation and the subsequent multitude of additions.
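The exhaustive search described in this section can be illustrated with a small software model. The following is a minimal sketch, not the authors' implementation: the function names `sad` and `full_search`, the list-of-lists frame representation, and the convention that `(u, v)` is measured from the centre of the search window are assumptions made for illustration.

```python
def sad(cur, ref, top, left):
    """Sum of absolute differences (Eq. 1) between the N x N block `cur`
    and the N x N block of `ref` whose top-left corner is (top, left)."""
    n = len(cur)
    return sum(
        abs(cur[i][j] - ref[top + i][left + j])
        for i in range(n)
        for j in range(n)
    )


def full_search(cur, window):
    """Evaluate the SAD at all (2p+1)^2 locations of an (N+2p) x (N+2p)
    window and return (minimum SAD, (u, v)).

    (u, v) is the row/column displacement relative to the window centre,
    so each component lies in [-p, p] (an assumed convention).
    """
    n = len(cur)
    p = (len(window) - n) // 2
    best = None
    for row in range(2 * p + 1):
        for col in range(2 * p + 1):
            cost = sad(cur, window, row, col)
            # Strict comparison keeps the first location that attains the minimum.
            if best is None or cost < best[0]:
                best = (cost, (row - p, col - p))
    return best


# Toy example: N = 2, p = 2, with a window whose pixel values are all distinct.
window = [[i * 6 + j for j in range(6)] for i in range(6)]
cur = [row[1:3] for row in window[3:5]]  # block copied from row 3, column 1
print(full_search(cur, window))  # (0, (1, -1))
```

Every candidate costs N² absolute differences and N² − 1 additions in this model, which is the workload that the approaches in Section 3 aim to reduce.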

3 Proposed Approaches

This section gives a detailed description of the speed-optimized architecture. The first subsection explains the modification of the SAD equation. The MB searching technique adopted to facilitate the SAD sum derivation of Subsection 3.1 is presented in Subsection 3.2.

3.1 Modified SAD Based Fast Block Matching

In this section, we modify the SAD computation so as to constrain the computational complexity of the FSBM search process, while preserving the optimal solution for the motion vector. Let us again consider the SAD of Eq. 1,

$$\mathrm{SAD}(u, v) = \sum_{i=1}^{N} \sum_{j=1}^{N} \left| x(i, j) - y(i + u, j + v) \right| \qquad (2)$$

The above equation can be re-written as

$$\mathrm{SAD}(u, v) \ge \left| \sum_{i=1}^{N} \sum_{j=1}^{N} x(i, j) - \sum_{i=1}^{N} \sum_{j=1}^{N} y(i + u, j + v) \right| \qquad (3)$$

because it can be shown that

$$\sum_{i=1}^{N} \sum_{j=1}^{N} \left| x(i, j) - y(i + u, j + v) \right| \ge \left| \sum_{i=1}^{N} \sum_{j=1}^{N} x(i, j) - \sum_{i=1}^{N} \sum_{j=1}^{N} y(i + u, j + v) \right| \qquad (4)$$

The proof of Eq. 4 is presented in Appendix A. Let $\mathrm{SAD}_{min}$ denote the current minimum SAD value. Now we posit that, if

$$\left| \sum_{i=1}^{N} \sum_{j=1}^{N} x(i, j) - \sum_{i=1}^{N} \sum_{j=1}^{N} y(i + u, j + v) \right| \ge \mathrm{SAD}_{min} \qquad (5)$$

then

$$\mathrm{SAD}(u, v) \ge \mathrm{SAD}_{min} \quad \text{(by inequality 3)} \qquad (6)$$

So, if the condition in Eq. 5 holds (and hence Eq. 6 is satisfied), we may skip computing the SAD at the $(u, v)$-th location. Otherwise, we need to compute the OriginalSAD (ref. Eq. 1) at the $(u, v)$-th location and compare it with $\mathrm{SAD}_{min}$. The initial $\mathrm{SAD}_{min}$ can be obtained by calculating the OriginalSAD for the first search location only. Thereafter, this test can be used to decide whether or not to perform the OriginalSAD at a particular search location. If the OriginalSAD needs to be calculated for some search location and the newly obtained OriginalSAD is less than the existing $\mathrm{SAD}_{min}$, then the OriginalSAD is set as the new $\mathrm{SAD}_{min}$.

At this point, it is to be noted that this approach is not an approximation and always finds the minimum SAD without any compromise on compression ratio or video quality. This is because the algorithm only takes an initial decision on whether to compute the OriginalSAD, based on a comparison with $\mathrm{SAD}_{min}$, and all $\mathrm{SAD}_{min}$ values are obtained from OriginalSAD calculations only. Thus, no decisions are made based on approximate computations, and the video quality with this SAD modification is the same as that of FSBM with the OriginalSAD evaluated at all search locations.

The right-hand term of Eq. 4 can be expressed as

$$\left| \sum_{i=1}^{N} \sum_{j=1}^{N} x(i, j) - \sum_{i=1}^{N} \sum_{j=1}^{N} y(i + u, j + v) \right| = \left| X - Y(u, v) \right| \qquad (7)$$

where $X = \sum_{i} \sum_{j} x(i, j)$ and $Y(u, v) = \sum_{i} \sum_{j} y(i + u, j + v)$, i.e., $X$ is the sum of the intensity values of the pixels in the current MB of the current frame and $Y(u, v)$ is the sum of the pixel intensities at the $(u, v)$-th MB location in the search region of the previous frame. For an entire search region, the sum $X$ for the current MB has to be calculated only once, whereas the sum $Y(u, v)$ needs to be calculated for each search location. Moreover, the sum $Y(u, v)$ at the $(u, v)$-th location can be derived from its immediately previous value $Y(u - 1, v)$ by subtracting from $Y(u - 1, v)$ the sum of the pixel intensities of the first column of the $(u - 1, v)$-th MB location and adding to the result the sum of the pixel values of the last column of the $(u, v)$-th MB location.

3.2 Macro Block Searching Strategy

The FSBM algorithm searches for an $N \times N$ macroblock within the corresponding $(2p + 1) \times (2p + 1)$ search locations, where $p$ is the search range. The traditional FSBMA requires $N^2$ absolute differences and $(N^2 - 1)$ additions to compute every SAD value. Hence, the total number of operations required to find the best match of an MB within a search range is $(2p + 1)^2 (2N^2 - 1)$. However, our modified SAD equation requires only $(N^2 - 1)$ additions for the current MB, plus $(N^2 - 1)$ additions and 1 absolute difference for each of the $(2p + 1)^2$ search locations, i.e., $(N^2 - 1) + (N^2 - 1 + 1)(2p + 1)^2$, a total of $(N^2 - 1) + N^2 (2p + 1)^2$ operations. Let the search locations in the search region be scanned in the manner shown in Fig. 2.
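The skip test of Eq. 5 can be sketched in software as follows. This is a minimal illustrative model under stated assumptions, not the authors' hardware: the names `block_sum`, `sad` and `pruned_full_search` are hypothetical, frames are plain lists of lists, and `Y(u, v)` is recomputed from scratch at each location for clarity (the incremental derivation described in the text is deliberately left out here).

```python
def block_sum(img, top, left, n):
    """Sum of the n x n block of `img` with top-left corner (top, left)."""
    return sum(img[top + i][left + j] for i in range(n) for j in range(n))


def sad(cur, ref, top, left):
    """OriginalSAD (Eq. 1) at the candidate whose top-left corner is (top, left)."""
    n = len(cur)
    return sum(
        abs(cur[i][j] - ref[top + i][left + j])
        for i in range(n)
        for j in range(n)
    )


def pruned_full_search(cur, window):
    """Full search that skips a location when |X - Y(u, v)| >= SAD_min (Eq. 5).

    Returns (SAD_min, (u, v), number of OriginalSAD evaluations), with (u, v)
    measured from the window centre (an assumed convention).
    """
    n = len(cur)
    p = (len(window) - n) // 2
    x_sum = block_sum(cur, 0, 0, n)  # X: computed once per macroblock
    sad_min, best_uv, evaluated = None, None, 0
    for row in range(2 * p + 1):
        for col in range(2 * p + 1):
            y_sum = block_sum(window, row, col, n)  # Y(u, v)
            # The lower bound already reaches SAD_min, so the true SAD cannot beat it.
            if sad_min is not None and abs(x_sum - y_sum) >= sad_min:
                continue
            cost = sad(cur, window, row, col)
            evaluated += 1
            if sad_min is None or cost < sad_min:
                sad_min, best_uv = cost, (row - p, col - p)
    return sad_min, best_uv, evaluated


# Toy example: N = 2, p = 2; the block is copied from row 3, column 1.
window = [[i * 6 + j for j in range(6)] for i in range(6)]
cur = [row[1:3] for row in window[3:5]]
sad_min, uv, evaluated = pruned_full_search(cur, window)
print(sad_min, uv, evaluated, "of", 25, "locations evaluated")
```

Because a location is skipped only when its lower bound already reaches the current minimum, the minimum SAD returned is the same as that of the exhaustive search; only the number of OriginalSAD evaluations changes.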
As mentioned in Subsection 3.1, the sum of the pixel intensities at each search location can be derived from the pixel intensity sum at the previous search location. Compared to the traditional raster scan, the proposed scan technique facilitates this derivation of the SAD sums, particularly in situations where the search location moves to the row below the current row position. As shown in Fig. 2, the sum at search location $(2, 2p + 1)$ can be easily derived from the sum at search location $(1, 2p + 1)$. However, this derivation is not possible if we compute the sum at location $(2, 1)$ immediately after computing the sum at location $(1, 2p + 1)$.

Fig. 2. Movement of search locations in the search region

Let the $k$-th search location be represented by $SR_k$ and its right and bottom adjacent search locations by $SR_{k+r}$ and $SR_{k+b}$. Then $\mathrm{SAD}_{k+r}$ and $\mathrm{SAD}_{k+b}$ can be calculated by the following equations:

$$\mathrm{SAD}_{k+r} = \left| \left\{ SR_k + \sum_{i} SR_{i,j+n} - \sum_{i} SR_{i,j} \right\} - MB_c \right| \qquad (8)$$

$$\mathrm{SAD}_{k+b} = \left| \left\{ SR_k + \sum_{j} SR_{i+n,j} - \sum_{j} SR_{i,j} \right\} - MB_c \right| \qquad (9)$$

where $SR_k$ and $MB_c$ represent the sum of the pixel values of the $k$-th search location and of the current ($c$-th) MB respectively, $SR_{i,j}$ is the pixel at position $(i, j)$ of the search region, and $n = N$ is the macroblock dimension. Eq. 8 is used to derive the SAD sums when the scan control moves right or left in a column-wise manner, and Eq. 9 is used to derive the SAD sums when the scan control moves downward in a row-wise manner.

Example: The sum of the second search location ($SL_2$) can be derived from the sum of the first search location ($SL_1$) by subtracting from $SL_1$ the sum of the pixel values of the 1st column and then adding to it the sum of the pixel values of the 17th column. Again, to derive $SL_{34}$ (assuming $p = 16$) from the previous sum at the 33rd search location ($SL_{33}$), we need to subtract the sum of the pixel values of the 1st row from $SL_{33}$ and then add to it the sum of the pixel values of the 17th row.

Each derivation of the SAD sum requires $2(N - 1)$ additions [to find the sum of one old and one new column] plus 2 additions/subtractions and 1 absolute difference (AD) operation [Eq. 8 and Eq. 9], i.e., a total of $2(N - 1) + 2 = 2N$ additions/subtractions and 1 AD. Thus, an entire search region of size $(2p + 1) \times (2p + 1)$ requires $(N^2 - 1)$ operations and 1 AD for the first search location, plus $2N$ operations and 1 AD for each of the remaining $(2p + 1)^2 - 1$ search locations, i.e., $(N^2 - 1) + [(2p + 1)^2 - 1] \, 2N$ operations and $(2p + 1)^2$ ADs. For $N = 16$ and $p = 16$, the proposed technique therefore requires only $(N^2 - 1) + [(2p + 1)^2 - 1] \, 2N$ addition/subtraction operations and 1089 ADs, as against the traditional raster search scan, which requires $(2p + 1)^2 (N^2 - 1)$ addition/subtraction operations and $(2p + 1)^2 N^2$ AD operations.
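The scan order and the incremental sum updates of Eq. 8 and Eq. 9 can be modelled in a few lines. The sketch below is an illustration under assumptions, not the authors' design: `snake_order`, `sliding_sums` and `op_counts` are hypothetical names, indices are zero-based, and the operation counts simply evaluate the formulas as stated in the text above (they are not figures reported by the paper).

```python
def snake_order(p):
    """Yield (row, col) search locations: left to right on even rows,
    right to left on odd rows, moving down one row at each row end."""
    side = 2 * p + 1
    for row in range(side):
        cols = range(side) if row % 2 == 0 else range(side - 1, -1, -1)
        for col in cols:
            yield row, col


def sliding_sums(window, n):
    """Return {(row, col): block sum}, updating one row or column per step."""
    p = (len(window) - n) // 2
    sums, prev, total = {}, None, 0
    for row, col in snake_order(p):
        if prev is None:
            # First location: full n x n sum.
            total = sum(window[i][j] for i in range(n) for j in range(n))
        elif row == prev[0]:
            if col > prev[1]:   # moved right: drop old left column, add new right column
                old_c, new_c = prev[1], col + n - 1
            else:               # moved left: drop old right column, add new left column
                old_c, new_c = prev[1] + n - 1, col
            total += sum(window[row + i][new_c] - window[row + i][old_c] for i in range(n))
        else:                   # moved down: drop old top row, add new bottom row
            old_r, new_r = prev[0], row + n - 1
            total += sum(window[new_r][col + j] - window[old_r][col + j] for j in range(n))
        sums[(row, col)] = total
        prev = (row, col)
    return sums


def op_counts(n, p):
    """Evaluate the operation-count formulas quoted in the text."""
    locations = (2 * p + 1) ** 2
    return {
        "proposed_addsub": (n * n - 1) + (locations - 1) * 2 * n,
        "proposed_ad": locations,
        "traditional_addsub": locations * (n * n - 1),
        "traditional_ad": locations * n * n,
    }


print(list(snake_order(1)))
print(op_counts(16, 16))
```

With this ordering, consecutive locations always differ by exactly one row or one column, so every update touches only 2N pixels; a raster scan would lose that property at each row change.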

4 Hardware Design for FSBM

Fig. 3 shows the hardware architecture of the SAD calculation unit. The hardware unit consists of one pipelined datapath, two memory banks, a datapath controller, a memory controller and some registers. The modified SAD calculation for the FSBM algorithm is performed by the datapath in eight independent sequential steps.

Fig. 3. Architecture of the proposed SAD unit (blocks shown: memory controller, row memory bank, column memory bank, input/output interface, datapath controller, registers REG MEM, REG SR and REG MB, two 2:1 multiplexers, and a datapath of stages 1 to 8 producing the SAD)

The proposed hardware adopts the scanning technique shown in Fig. 2. A $p = 16$ search region consists of $48 \times 48$ pixels ($P_{i,j}$, where $0 < i, j \le 48$), which form $(2p + 1)^2 = 33^2 = 1089$ different search locations. The SAD unit first loads one macroblock and the respective search region into the on-chip memory. The memory controller is responsible for storing the pixels in the correct locations according to a specific organization. The pixels are organized in the memory banks in such a way that each consecutive SAD calculation can be performed with only one memory access. The pixels are stored in the Column Memory Bank (Fig. 3) in column-major format, so that one column of a search location (16 pixels) can be accessed in a single clock cycle.

The SAD unit first computes the sum of the macroblock and the sum of the first search location and stores the resulting values in the REG MB and REG SR registers respectively (Fig. 3). It computes the first SAD value by performing an absolute difference operation between REG SR and REG MB and stores it in the SAD register. As explained in the previous section, the next right-adjacent search location differs from the previous location by only one column. Hence, to compute the sum of the new search location from the previous REG SR value, we need to access all the pixels of the 1st and 17th columns ($P_{i,1}$ and $P_{i,17}$, where $0 < i \le 48$). Thus, our column memory bank has two $16 \times 8 = 128$-bit ports. The second SAD value is computed by the absolute difference operation between the new REG SR and the REG MB values. Then the smaller of the previous and the latest SAD values is stored back in the SAD register. This procedure is performed iteratively for every new right-adjacent as well as left-adjacent search location within the search region.

A difficulty arises when we need to move down from the previous search location. In these cases we need to access two sets of row pixels ($P_{i_1,j}$ and $P_{i_2,j}$, where $0 < j \le 48$). The column memory bank organized as above does not support accessing those pixels in a single clock cycle. Hence, we store the required row values in a separate Row Memory Bank (Fig. 3). This bank also has two 128-bit data access ports. The size of the row memory bank is only $(32 \times 128) + (16 \times 128)$ bits, which is equal to 768 bytes.

Different levels of data reuse, which primarily reduce the memory accesses in FSBM-based architectures, are discussed in [20]. The current SAD unit adopts the data reuse defined as Level A and Level B in [20]. Level A describes the locality of data within a candidate block strip as the search locations move within that strip. Level B describes the locality among candidate block strips that overlap vertically. The present design readily adopts these two levels of data reuse.
5 Results

The Verilog RTL of the proposed design has been synthesized on a Xilinx Virtex IV 4vlx100ff1513 FPGA and verified with RTL simulations using Mentor Graphics ModelSim SE. The synthesis results for a macroblock (MB) of size $16 \times 16$ and a search range of $p = 16$ show that the design can achieve a maximum frequency of about 221 MHz (a clock period of 4.52 ns). In addition, the design requires 333 CLB slices, 416 DFFs/latches and a total of 278 input/output pins. The area required by the implementation is 380 look-up tables (LUTs). Given the high operating frequency of our architecture, the area required by this design is substantially low. The modification of the SAD operation contributes to this high speed, small area and low hardware complexity. The use of memory banks has led to higher on-chip bandwidth. However, this has also led to the only drawback of our design, which is the high number of input/output pins.

The first SAD result is generated by the SAD unit after 23 clock cycles. Thereafter, every successive clock cycle generates one SAD value. For a search range of $p = 16$, which has $(2p + 1)^2 = 1089$ search locations, the number of cycles required by the proposed hardware to find the best matching block is 23 (for the first search location) + (1089 − 1) (for the remaining search locations) = 1111 cycles. Thus, our proposed FPGA implementation processes an MB in 1111 clock cycles per MB × 4.52 ns per clock cycle ≈ 5.02 µs. Similarly, a 720p HDTV frame of dimension $1280 \times 720$ can be processed in 3600 MBs per frame × 5.02 µs per MB ≈ 18.08 ms. At this speed, about 55 720p HDTV frames can be processed by the proposed hardware every second. Thus, the number of frames processed per second by our design is much higher than that of other existing architectures, which is evident from Table 1. The modification of the SAD computation, the proposed MB search strategy and the split-pipeline design contribute to this high speed and throughput.

Table 1. Comparison of hardware performance with N=16 and p=16
Columns: Design; Frequency (in MHz); CLBs (in slices); HDTV 720p (fps); Throughput (MBs/sec); Throughput/Area
Designs compared:
Loukil et al. [2] (Altera Stratix)
Mohammad et al. [3] (Xilinx Virtex II)
Olivares et al. [4] (Xilinx Spartan3)
Roma et al. [5] (Xilinx XCV3200E)
Ryszko (AB2) et al. [6] (Xilinx XC40250)
Wong et al. [7] (Altera Flex20KE)
Our (Xilinx Virtex IV)

Table 1 compares the performance of various existing architectures for a $16 \times 16$ MB with a search range of $p = 16$. This paper aims at the combined speed-area optimization of FSBM hardware. Hence, a new performance criterion, throughput/area, has been used to compare the speed-area optimized performance of different architectures. A high value of the throughput/area parameter denotes high speed-area optimization of an architecture. The architectures have been compared in terms of (a) operating frequency, (b) CLB slices, (c) number of HDTV 720p (1280x720) frames that can be processed per second, (d) throughput, or MBs processed per second, (e) throughput/area, and (f) the I/O bandwidth. As can be seen, the proposed design has a very high throughput and can process the maximum number of HDTV 720p frames per second (fps).
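The timing arithmetic of this section can be reproduced with a short calculation. The sketch below is only a worked check that uses the figures stated in the text (23-cycle initial latency, one SAD per cycle afterwards, a 4.52 ns clock period, 16 × 16 MBs, 1280 × 720 frames); the function name `throughput` and its parameter names are assumptions, and the printed values are derived from those inputs rather than quoted from the paper's tables.

```python
def throughput(first_latency_cycles, search_range, clock_ns,
               frame_width, frame_height, mb_size):
    """Derive per-MB time, per-frame time and frame rate from the stated
    cycle counts and clock period."""
    locations = (2 * search_range + 1) ** 2
    # First SAD after the pipeline latency, then one SAD per clock cycle.
    cycles_per_mb = first_latency_cycles + (locations - 1)
    mb_time_us = cycles_per_mb * clock_ns / 1000.0
    mbs_per_frame = (frame_width // mb_size) * (frame_height // mb_size)
    frame_time_ms = mbs_per_frame * mb_time_us / 1000.0
    return {
        "cycles_per_mb": cycles_per_mb,
        "mb_time_us": mb_time_us,
        "mbs_per_frame": mbs_per_frame,
        "frame_time_ms": frame_time_ms,
        "fps": 1000.0 / frame_time_ms,
        "mbs_per_sec": 1e6 / mb_time_us,
    }


result = throughput(first_latency_cycles=23, search_range=16, clock_ns=4.52,
                    frame_width=1280, frame_height=720, mb_size=16)
for name, value in result.items():
    print(f"{name}: {value:.3f}")
```

Under these inputs the model gives 1111 cycles per MB and 3600 MBs per frame, i.e., roughly 5.02 µs per MB, about 18.08 ms per frame and a little over 55 frames per second.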
The achieved fps value is close to 60 fps, which indicates that the proposed architecture can support both frame (25 fps or 30 fps) and field (50 fps or 60 fps) processing. This is a big advantage over other existing FPGA designs. Moreover, the superior speed-area optimization of the proposed design is exhibited by its substantially high throughput/area value.

As can be seen in Table 1, the different implementations have been carried out on different platforms with varying performance levels. Xilinx Virtex IV, designed for higher performance than earlier FPGAs, can inherently make implemented designs faster. To overcome this drawback in the comparison results of Table 1, Table 2 compares the proposed design on different Xilinx FPGA implementation platforms, namely Spartan3, Virtex II and Virtex IV.

Table 2. Performance comparison of our hardware on different FPGA platforms with N=16 and p=16
Columns: Platform; Frequency (in MHz); CLBs (in slices); HDTV 720p (fps); Throughput (MBs/sec); Throughput/Area
Platforms compared:
Xilinx Virtex II
Xilinx Spartan3
Xilinx Virtex IV

Table 2 shows that the area requirements of our design are similar on Spartan3, Virtex II and Virtex IV. However, the number of MBs processed per second differs across platforms, with Virtex IV resulting in the highest throughput. Hence, among the three compared platforms, Virtex IV yields the best throughput/area ratio. It is to be noted that the design of Mohammad et al. [3], which was also implemented on Virtex II, has a throughput/area value of 25.1, much lower than that of our Virtex II implementation. Similarly, the Spartan3 implementation by Olivares et al. [4] has a throughput/area value of only 5.8, substantially lower than that of our Spartan3 implementation.

6 Conclusions

This paper has presented some approaches toward throughput-area optimization of FSBM architectures. The first approach proposed a modification of the SAD computation. This modification reduced the total number of addition/subtraction operations involved in macroblock matching within a pre-defined search window. In addition, this approach has been utilized to derive the SAD sum at the current MB location from the already computed SAD sum at the previous MB location.
Finally, an FPGA hardware design to implement the proposed approaches has been outlined. The highlight of this design is the initial splitting of its pipeline to facilitate parallel processing of MBs. In addition, our hardware has used the proposed MB scan technique so as to take further advantage of the SAD modification. Experimental results demonstrate the higher throughput and smaller area requirements of our design when compared to other existing FPGA architectures.

References

1. Ghanbari, M.: Standard Codecs: Image Compression to Advanced Video Coding. IEE (2003)
2. Loukil, H., Ghozzi, F., Samet, A., Ben Ayed, M., Masmoudi, N.: Hardware implementation of block matching algorithm with FPGA technology. In: Proc. Intl. Conf. on Microelectronics (2004)
3. Mohammadzadeh, M., Eshghi, M., Azadfar, M.: Parameterizable implementation of full search block matching algorithm using FPGA for real-time applications. In: Proc. 5th IEEE Intl. Caracas Conf. on Dev., Circ. and Sys., Dominican Republic (2004)
4. Olivares, J., Hormigo, J., Villalba, J., Benavides, I., Zapata, E.: SAD computation based on online arithmetic for motion estimation. Jrnl. Microproc. and Microsys. 30 (2006)
5. Roma, N., Dias, T., Sousa, L.: Customisable core-based architectures for real-time motion estimation on FPGAs. In: Proc. of 3rd Intl. Conf. on Field Prog. Logic and Appl. (2003)
6. Ryszko, A., Wiatr, K.: An assessment of FPGA suitability for implementation of real-time motion estimation. In: Proc. IEEE Euromicro Symp. on DSD (2001)
7. Wong, S., Vassiliadis, S., Cotofana, S.: A sum of absolute differences implementation in FPGA hardware. In: Proc. 28th Euromicro Conf. (2002)
8. Komarek, T., Pirsch, P.: Array architectures for block matching algorithms. IEEE Circ. and Sys. 36(10) (1989)
9. Vos, L., Stegherr, M.: Parameterizable VLSI architectures for the full-search block-matching algorithm. IEEE Circ. and Sys. 36(10) (1989)
10. Yang, K., Sun, M., Wu, L.: A family of VLSI designs for the motion compensation block-matching algorithm. IEEE Circ. and Sys. 36(10) (1989)
11. Hsieh, C., Lin, T.: VLSI architecture for block-matching motion estimation algorithm. IEEE Tran. Circ. and Sys. Video Tech. 2(2) (1992)
12. Jehng, Y., Chen, L., Chiueh, T.: Efficient and simple VLSI tree architecture for motion estimation algorithms. IEEE Tran. Sig. Pro. 41(2) (1993)
13. Yeo, H., Hu, Y.: A novel modular systolic array architecture for full-search block-matching motion estimation. In: Proc. Intl. Conf. on Acou. Speech, and Sig. Proc., vol. 5 (1995)
14. Lai, Y., Chen, L.: A data-interlacing architecture with two-dimensional data-reuse for full-search block-matching algorithm. IEEE Tran. Circ. and Sys. Video Tech. 8(2) (1998)
15. Yeh, Y., Lee, C.: Cost-effective VLSI architectures and buffer size optimization for full-search block matching algorithms. IEEE Tran. VLSI Sys. 7(3) (1999)
16. Sousa, L., Roma, N.: Low-power array architectures for motion estimation. In: IEEE 3rd Workshop on Mult. Sig. Proc. (1999)
17. Do, V., Yun, K.: A low-power VLSI architecture for full-search block-matching. IEEE Tran. Circ. and Sys. Video Tech. 8(4) (1998)
18. Lin, S., Tseng, P., Chen, L.: Low-power parallel tree architecture for full search block-matching motion estimation. In: Proc. of Intl. Symp. Circ. and Sys., vol. 2 (2004)
19. Salomon, D.: Data Compression: The Complete Reference, 3rd edn. Springer, New York (2004)

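20. Tuan, J., Jen, C.: An architecture of full-search block matching for minimum memory bandwidth requirement. In: Proceedings of the IEEE GLSVLSI (1998)
21. Weblink: Famous equations and inequalities. (2006)
22. Efimov, A., Zolotarev, Y., Terpigoreva, V.: Mathematical Analysis (Advanced Topics). Mir Publishers, Moscow (1985)

A Proof of Eq. 4

Lemma 1. $\left| \, \|A\|_1 - \|B\|_1 \, \right| \le \|A - B\|_1$, where $\|A\|_1 = \sum_k |a_k|$.

Proof. We know that, by the triangle inequality [21] and the reverse triangle inequality [21],

$$|a + b| \le |a| + |b| \qquad (10)$$

$$\left| \, |a| - |b| \, \right| \le |a - b| \qquad (11)$$

Again, by Minkowski's inequality [22],

$$\|A + B\|_1 \le \|A\|_1 + \|B\|_1 \qquad (12)$$

Let $\|A\|_1 = \|a\| = \|b + (a - b)\| \le \|b\| + \|a - b\|$ [by Eq. 12],

or, $\|a\| \le \|b\| + \|a - b\|$,

or, $\|a\| - \|b\| \le \|a - b\|$,

which implies

$$\|A\|_1 - \|B\|_1 \le \|A - B\|_1 \qquad (13)$$

Analogously, we can show that

$$\|B\|_1 - \|A\|_1 \le \|A - B\|_1 \qquad (14)$$

Hence, from Eq. 13 and Eq. 14, we have $\|A\|_1 - \|B\|_1 \le \|A - B\|_1$ and $\|B\|_1 - \|A\|_1 \le \|A - B\|_1$, i.e., $-\left( \|A\|_1 - \|B\|_1 \right) \le \|A - B\|_1$, which gives

$$\left| \, \|A\|_1 - \|B\|_1 \, \right| \le \|A - B\|_1 \qquad (15)$$

Hence, the result follows. Since pixel intensities are non-negative, $\|x\|_1 = \sum_{i} \sum_{j} x(i, j)$ and $\|y\|_1 = \sum_{i} \sum_{j} y(i + u, j + v)$, so Eq. 4 is Lemma 1 applied to the vectors of pixel values of the current MB and of the candidate MB.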
