On-Chip CNN Accelerator for Image Super-Resolution

Size: px

Start display at page:

Download "On-Chip CNN Accelerator for Image Super-Resolution"

Roy Bruce
6 years ago
Views:

1 On-Chip CNN Acceerator for Image Super-Resoution Jung-Woo Chang and Suk-Ju Kang Dept. of Eectronic Engineering, Sogang University, Seou, South Korea {zwzang91, ABSTRACT To impement convoutiona neura networks (CNN) in hardware, the state-of-the-art CNN acceerators pipeine computation and data transfer stages using an off-chip memory and simutaneousy execute them on the same timeine. However, since a arge amount of feature maps generated during the operation shoud be transmitted to the off-chip memory, the pipeine stage ength is determined by the off-chip data transfer stage. Fusion architectures that can fuse mutipe ayers have been proposed to sove this probem, but appications such as super-resoution (SR) require a arge amount of an on-chip memory because of the high resoution of the feature maps. In this paper, we propose a nove on-chip CNN acceerator for SR to optimize the CNN datafow in the on-chip memory. First, the convoution oop optimization technique is proposed to prevent using a frame buffer. Second, we deveop a combined convoutiona ayer processor to reduce the buffer size used to store the feature maps. Third, we expore how to perform ow-cost mutipy-and-accumuate operations in the deconvoutiona ayer used in SR. Finay, we propose a two-stage quantization agorithm to seect the optimized hardware size for the imited number of DSPs to impement the on-chip CNN acceerator. We evauate our proposed acceerator with FSRCNN, which is most popuar as the CNN-based SR agorithm. Experimenta resuts show that the proposed acceerator requires 9.21ms to achieve an output image with the pixe resoution, which is 36 times faster than the conventiona method. In addition, we reduce the on-chip memory usage and DSP usage by 4 times and 1.44 times, respectivey, compared to the conventiona methods. 1 INTRODUCTION Image super-resoution (SR) and computer vision [1, 2], such as object detection and recognition, have attracted much attention due to the emergence of convoutiona neura networks (CNN). As a resut, CNN acceerators have been studied extensivey to appy various CNN agorithms to the rea-time embedded systems. Especiay, in hardware impementation, FPGA-based CNN acceerators are more energy efficient than those based on GPUs and can perform massive parae processing than those based on CPUs [3, 4, 5, 6, 7]. CNN agorithms are difficut to store a necessary data with an on-chip memory except binary feature maps [7] because of the 3-D feature maps generated at each time the convoutiona ayer is processed. Therefore, most FPGA-based CNN acceerators use an off-chip memory and perform off-chip data transfer and computation simutaneousy through ping-pong operations [3, 5, 6, 8]. As a resut, a mutipy-accumuate operation (MAC) is performed continuousy through a convoutiona ayer processor (CLP) with oop optimization techniques [3, 4, 5, 6, 8, 9]. Even if ping-pong operations are appied through doube buffers, CLP cannot perform the next operation unti a arge amount of output feature maps are stored in the off-chip memory. In this case, the performance of the acceerators is degraded [6, 8]. Previous studies propose CNN fusion architectures to reduce arge amounts of offchip data transfer [6, 8]. Fusion architectures are designed as muti- CLPs for processing mutipe convoutiona ayers within the CNN [6, 8]. Therefore, the data generated after CLP is executed is transferred to the next ayer processor using the on-chip memory. In addition, the off-chip data transfer occurs ony in the first ayer and the ast ayer of the fused ayers. However, in rea-time appications for the image processing such as SR converting the ow resoution (LR) images into high resoution (HR) images, the input and output buffers are difficut to be designed using ine buffers because feature maps require a arge amount of on-chip memory. On the other hand, a method for transforming deconvoutiona ayer into convoutiona ayer (TDC) is proposed to effectivey paraeize deconvoutiona neura networks (DCNN) that generate HR images from LR images [9]. In this method, severa pixes in the output feature maps are generated in parae by a convoution kerne which is smaer than a deconvoution kerne size. However, zero weights of converted convoution kerne are not considered, thereby performing unnecessary MAC operations. In this paper, we propose a nove on-chip memory based CNN acceerator for the efficient data transfer. The main contributions of this paper are as foows. We propose a nove convoution oop optimization technique to impement the CNN acceerator with the on-chip memory. We deveop the combined CLP architecture that combines consecutive CLPs to sove the probem of storing arge amounts of feature maps in the imited buffer size. We expore how to design the CLP designed with the TDC method in the deconvoutiona ayer into the imited hardware resources of FPGA. We propose a two-stage quantization agorithm to reduce the size of the CNN in order to impement the designed on-chip CNN acceerator through the target FPGA. 2 BACKGROUND 2.1 CNN-based SR Agorithms CNN-based SR agorithms are typicay performed in two cases. In the first case for methods incuding SRCNN [2] and SRCNN-Ex [11], an input image is processed by the bicubic interpoation [10]

Figure 1: Pseudo code of the acceerator with the conventiona oop optimization techniques to generate the HR image, and then CNN agorithms are performed.

In the second case for the methods incuding FSRCNN and FSRCNN-s [12], the HR image is directy generated from the LR image using a deconvoutiona ayer.

2 Figure 1: Pseudo code of the acceerator with the conventiona oop optimization techniques to generate the HR image, and then CNN agorithms are performed. However, since the tota execution cyces of the CNN acceerators are proportiona to the resoution of the input image, the performance of the CNN acceerator is degraded by a scae factor. In the second case for the methods incuding FSRCNN and FSRCNN-s [12], the HR image is directy generated from the LR image using a deconvoutiona ayer. However, the deconvoutiona ayer is sower than the convoutiona ayer because it has more oop iterations than the convoutiona ayer [9]. But, it can be converted to the convoutiona ayer through the TDC method to improve the performance. Generay, CNN-based agorithms require a arge amount of the mode size. The argest mode size of the CNN agorithms mentioned above is 230 KB, and the mode size of the CNN-based SR agorithms is sma enough to be stored in the on-chip memory. Therefore, it is not necessary to use a mode compression [13]. 2.2 Conventiona CNN Acceerators Figure 2: Required size of the ine buffers for each CNNbased SR agorithms Most CNN acceerators use oop optimization techniques to acceerate convoutiona ayers consisting of many stages of convoution oops [3, 4, 5, 6, 8, 9]. The pseudo code in Figure 1 shows an exampe of appying the oop optimization techniques. In computation(), the convoution oops are executed in parae with the size of tie parameters through the UNROLL directive. The intermediate data generated by the processor are too arge to be stored in the on-chip memory unti the ayer is competey processed. Therefore, conventiona CNN acceerators sove the memory shortage probem through the off-chip memory [3, 4, 5, 6, 8, 9]. In main(), computation and off-chip data transfer stages are performed simutaneousy on the same timeine through the DATAFLOW directive. However, since the pipeine stage ength is imited by the off-chip data transfer, the use of the off-chip memory is a major botteneck in the performance of the acceerators. Fusion architectures are proposed to reduce the off-chip data transfer [6, 8]. Awani et a. [6] propose a method of storing the ast feature maps into the off-chip memory after performing computation in fused convoutiona ayers. Specificay, the resoution of feature maps is decreased when passing through the convoutiona ayer, and the fina feature maps are stored in the offchip memory. However, since the on-chip data transfer between fused ayers is managed by the tie-based buffer approach, there is a burden that exception handing must be accompanied for various boundary conditions. To sove this probem, Xiao et a. [8] propose a method of managing feature maps as ine buffers. Even if there are boundaries that require exception handing, data can be easiy reused by using severa ine buffers. However, as the CNN-based SR agorithm has the high resoution of feature maps, the ine buffers are difficut to be expoited due to the burden on the on-chip memory. Unike the convoutiona ayer, the deconvoutiona ayer has the process of accumuating kerne-sized output bocks in the previousy obtained output feature maps. To impement this in hardware, it is necessary to carry out a compicated process of oading the output bock stored in the memory into the buffer, adding it, and storing it again in the memory. The TDC method soves this probem by generating an output bock with a fixed size that does not overap with other output bocks. In addition, there is an advantage that a the pixes in the output bock can be processed in parae by the convoution. However, when converting from the deconvoutiona ayer to the convoutiona ayer, the newy created convoution kerne contains a arge amount of weights with zero coefficients. Thus, we need to remove the operations for zero coefficients to enhance the FPGA resource utiization. 3 Proposed ARCHITECTURE DESIGN 3.1 Datafow Optimization In our system, the pixe data is sent to the FPGA through the dispay driver in the horizonta direction of the frame. When a the pixes in the frame have been transferred, the data for the next frame is sent to the FPGA. If a the convoutiona ayers are executed by the singe CLP, the pixe data coming into the FPGA must be stored in the on-chip memory unti the ast ayer is competey processed. Consequenty, severa frame buffers may be required depending on the execution time of the CLP. In order to sove this probem, we process input data coming into the FPGA by designing a the convoutiona ayers to run concurrenty through the mutipe CLPs such as the fusion architectures. In this case, the design of the muti- CLPs that can hande efficient on-chip datafow is required. Thus,

Figure 3: Proposed combined CLP architecture fused with the forward ayer and the shrinking ayer we must determine the oop tiing parameters of each CLPs.

The computation to transmission (CT) ratio is defined as the ratio of the cyces required to perform the th CLP to the cyces required to transmit the pixe data from the dispay driver or input buffers

3 Figure 3: Proposed combined CLP architecture fused with the forward ayer and the shrinking ayer we must determine the oop tiing parameters of each CLPs. To find the oop tiing parameters, we compare the computation speed of each CLP with the speed of the pixe data coming into the CLP. The computation to transmission (CT) ratio is defined as the ratio of the cyces required to perform the th CLP to the cyces required to transmit the pixe data from the dispay driver or input buffers into the th CLP. When the tie size combination of the th CLP is given as T y, T x, T m, T n, and T k, which denote the tie size of row, coumn, output feature maps, input feature maps, and kerne, the th CT, which is CT, is cacuated by Equation (1). In this case, we do not consider T y and T x when cacuating the CT because the input of each CLP is sequentiay transmitted in pixe data instead of tie-sized pixe bock. CT ratio = = tota number of execution cyces tota number of transmission cyces H W M T m N T n K T k K T k H W N T n = M T m K T k K T k where H and W are the height and width of feature maps, respectivey, N and M are the size of the input and output feature maps, respectivey, and K is the width of the kerne size in the th convoutiona ayer. If the CT is greater than 1, the transmission speed of the pixe data is faster than that of the CLP, and hence, the data transferred during the computation must be stored in the frame buffer. For exampe, if an output image with the (FHD) pixe resoution is generated using a SR agorithm with the scae factor of 2, a 4.15MB buffer memory is required to store an input image with the pixe resoution in 32-bit foating point data type. In addition, considering the size of input feature maps, it can exceed the aowabe on-chip memory of a typica FPGA. Therefore, we set the CTR of a ayers to 1 not to use the frame buffer. Thus, T m and T k become M and K, respectivey. Furthermore, the tiing factors of the input feature maps of the next (1) Figure 4: Proposed combined CLP architecture fused with the expanding ayer and the backward ayer ayer, T n +1 and T m shoud be same for feature maps to be transmitted without buffering to the next ayer. Since N +1 is equa to M, T n aso becomes equa to N. As a resut, a CLPs can fuy process the three convoution oops in parae. 3.2 Combined CLP Design A memory management technique is needed to efficienty store feature maps generated by muti-clps. Since the pixe data from the dispay driver is transmitted ine by ine, we use the ine buffer that can reuse the data without being restricted by boundary conditions [8]. The ine buffer is designed as a bock RAM (BRAM) that is a simpe dua-port mode [14] in which one read and one write are aowed concurrenty in order to fufi both input and output buffers of each CLPs. In addition, the capacity needed to impement the ine buffer shoud be same as the size of the 3-D data generated from the CLP. The number of ine buffers must be equa to the kerne size for the convoution. Therefore, the size of the input and output buffers for the th CLP, which are M in and M out respectivey, can be cacuated by Equation (2). M out M in = K W N pixe_datawidth = K +1 W +1 N +1 pixe_datawidth Figure 2 shows the size of the ine buffers required to generate output images with the output resoutions of FHD, (QHD), and (UHD) when using CNN-based SR agorithms [2, 10, 11]. FSRCNN-s, which requires the smaest amount of memory, shoud use 2.65MB memory to generate UHD image. These buffers are constructed with 588 numbers of 36K- BRAMs. Since it is arger than the number of the BRAMs avaiabe in genera FPGAs, the required ine buffers shoud be reduced. We can sove the memory shortage probem through the shrinking and expanding ayers, which are both 1 1 convoutiona ayers. The shrinking ayer compresses the argest input feature maps to a reduced size. Conversey, the expanding ayer restores the reduced size of the input feature maps back to the origina size. Figure 3 shows the combined CLP with the fusion of the forward ayer processor and the shrinking ayer processor. An activation function that contributes the noninear character-istics to the resut of the convoution is processed after performing the convoution oops on input feature maps and kernes. Since we competey unro the convoution oops for input feature maps, (2)

4 deconvoutiona ayer. However, since the TDC method maps the KD KD weights to a matrix of KC KC S D 2 size in each input and output feature maps, zero coefficients aways occurs in the matrix. The number of zero coefficients, numzero, in the transformed convoution kernes is derived by Equation (3). num Zero = K C 2 (M D S D 2 ) N D K D 2 M D N D (3) Therefore, when computing the convoution between input bocks and kernes in deconvoutiona ayer processor, we can reduce the use of FPGA resources by skipping the computations when weights are zero. Specificay, we reduce the DSP usage to 81% in FSRCNN and FSRCNN-s, compared to the TDC method when the kerne size of the deconvoutiona ayer is 9 9. output feature maps, and kernes, PReLU [15], the modues for activation function of FSRCNN and FSRCNN-s, are designed as many as the number of the output feature maps. Therefore, the output feature maps generated by the CLP of the forward ayer can be directy sent to the CLP of the shrinking ayer without being stored in the output buffer. Simiary, Figure 4 shows another combined CLP with the merging of the expanding ayer processor and the backward ayer processor. Large amounts of feature maps generated in the expanding ayer must be stored in the same number of ine buffers as the width of the kerne size in the backward ayer. Instead, we sove this probem by setting T y L 2 in the expanding ayer processor to the width of the kerne size in the backward ayer. Athough the number of ine buffers in the expanding ayer increases to the width of the kerne size in the backward ayer, the efficiency of the on-chip memory can be increased because it stores the smaest feature maps in FSRCNN and FSRCNN-s. Through this combined CLP architecture, FSRCNN and FSRCNN-s can reduce the required memory size by 3.62 times and 7.67 times compared with the fusion architecture [9]. 3.3 MAC Optimization of DCNN acceerator The state-of-the-art DCNN acceerator reduces the kerne size of the deconvoutiona ayer from KD KD to KC KC and increases the size of the output feature maps from MD to MD SD SD through the TDC method, where KD and KC denote the width of the kerne size of the deconvoutiona ayer and transformed convoutiona ayer, respectivey, and MD and SD represent the size of output feature maps and the stride of the deconvoutiona ayer, respectivey. As a resut, the tota number of weights in the transformed convoutiona ayer is KC KC SD SD MD ND, where ND denotes the size of the input feature maps in the 4 MODEL QUANTIZATION To optimize the utiization of the FPGA resources more compacty, pixes, weights and partia sums are expressed as 16-bit fixed points from the 32-bit singe-precision foating points using the bit-width quantization method [16]. The number of DSPs required for the mutipier and adder of the Xiinx Kintex-7 FPGA with the precision of a 16-bit fixed point is one each. If the mutipier and adder are impemented as ogic eements in FPGA, the required resources are 280 LUT6s and 16 LUTs+16 FFs, respectivey. When competey unroing severa convoution oops, we cannot use both mutipiers and adders as DSPs due to the imited number of DSPs. Therefore, we design an adder with LUTs and FFs that require the ower number of ogic eements than a mutipier, and a mutipier with a DSP. Therefore, the tota number of DSPs required to design the muti-clps is as foows. L 1 num DSP = T y M N K K num zero (4) =0 where tiing parameter T y used for the expanding ayer is K L-1 when is the order of the expanding ayer, L-2, and 1 for other cases. In addition, the deconvoutiona ayer processor performs the optimized MAC through the proposed method, and hence, numdsp is subtracted by the number of zero coefficients. Therefore, the number of avaiabe DSPs is increased. When the ayer configurations of FSRCNN and FSRCNN-s are substituted into Equation (4), the number of DSPs required is 16,216 and 5,185, respectivey, which are higher than the tota number of DSPs in the high-performance FPGAs. Therefore, the two-stage quantization agorithm is proposed to reduce the mode size of FSRCNN and FSRCNN-s. Agorithm 1 represents the pseu-do code for the twostage quantization agorithm. It is used to find the best CNN mode that can be impemented in the target FPGA. In the first stage, we cacuate the size of a receptive fied to figure how many surrounding pixes are used. The receptive fied R has the size of the input bock required to generate the pixe for the output feature maps of the ast ayer. Typicay, the performance of the SR is improved as R increases [2, 11, 12], thereby seecting the optima number considering the performance. R can be cacuated in Equation (5).

Figure 5: Bock diagram of the CNN acceeration system Figure 7: Comparison between the fusion architecture and the combined CLP architecture Figure 6: Convoutiona ayer pipeine L 1 R = K 0 + 2 K 2 By

However, as there are many configurations, we imit the size of the receptive fied using threshod1.

The constraints on the memory are not incuded because the combined CLP architecture soves them.

5 Figure 5: Bock diagram of the CNN acceeration system Figure 7: Comparison between the fusion architecture and the combined CLP architecture Figure 6: Convoutiona ayer pipeine L 1 R = K K 2 By Equation (5), the size of the receptive fied in FSRCNN and FSRCNN-s are and 11 11, respectivey. We need to find the best receptive fied among the various cases in the first stage. However, as there are many configurations, we imit the size of the receptive fied using threshod1. After that, we define the fundamenta probem for designing SR to maximize the performance and impement it in the target FPGA can be formuated as foows. =1 maximize PSNR i, s. t. num DSPi tota DSPs (6) where PSNRi is a peak signa-to-noise (PSNR) vaue for the i th combination of possibe configurations and numdspi is the number of DSPs needed at that time. The constraints on the memory are not incuded because the combined CLP architecture soves them. Since the object function PSNRi is cacuated by the CNN mode obtained through training, an exhaustive search must be performed for a configurations to find the optima quantized mode. In CNN-based SR agorithms, the output feature maps between ayers are nonineary connected and the performance increases as the number of connections increases [10, 11]. Therefore, we simpy change the object function to the function considering the tota number of connections in FSRCNN and FSRCNN-s. The probem is reformuated as foows. maximize M L 1 (5) M, s. t. num DSP tota DSPs (7) =0 There is a probem that the variabe M in the object function is as many as the number of ayers in the CNN. We sove the probem by grouping ayers with simiar characteristics. The tota number of connections depends on the size of the output feature maps in each ayers. Unike other ayers, the output feature maps of the deconvoutiona ayer are excuded from the quantization because each feature maps are merged into HR image. We describe the differentia of the connections by the feature maps as dc/dm. Except for the deconvoutiona ayer, there are an even number of ayers with equa dc/dm in the hourgass type FSRCNN and FSRCNN-s. We can represent a set of ayers with the same dc/dm and create vectors containing their configurations. For exampe, in FSRCNN-s, the first ayer and the fourth ayer have the same dc/dm but are smaer than the other ayers. So, the ayers beong to the set G[0] and the ayer configurations are stored in GM[0], GN[0], GK[0], and GTy[0], respectivey, representing vectors of incuding the tiing parameters, where a set with a higher index vaue has a arger vaue of dc/dm than a set with ower index vaue. Therefore, we sove the objective function separatey for each set. Thus the constraint probem for quantizing the output feature maps of the p th group can be expressed as beow: maximize M q (M q ) N p, s. t. ( α M q ) + β tota DSPs (8) where Mq is a variabe in which the output feature maps are quantized, and Np is the number of ayers having the same size of feature maps. α is the sum of the remaining mutipied configureations and β is the number of DSPs for the ayers in the other sets. Since Np is an even number, the object function becomes a convex downward. Therefore, the maximum integer satisfying the constraint becomes the soution. The resut of the quantized feature maps Mq is as foows. tota DSPs β M q = (9) α Consequenty, in the second stage, we first quantize the output feature maps GM[1] of the ayers beonging to the set G[1] without considering the output feature maps GM[0]. Then, the output feature maps of the ayers beonging to G[0] are strongy quantized. Since then we increments GM[0] by graduay decrementing the vaues of GM[1] unti the oop iterator is greater than the threshod2. We repeat this process with the first stage and find the best mode. As a resut, we reduce the combination of the feature maps in FSRCNN from <56, 12, 12, 12, 12, 12, 56, S D 2 > to <23, 3, 3, 3, 3, 3, 23, S D 2 >. And we reduce the kerne size of both the first ayer and

Figure 8: Comparison of DSP utiization between TDC method and proposed method in the FSRCNN Tabe 1: Comparison of between the proposed method and the conventiona method BRAM DSP FF LUT Cyces( 10 6 )

5 EXPERIMENTAL RESULTS We evauated the proposed on-chip CNN acceerator using the Xiinx Kintex-7 410T FPGA. We designed it with Veriog RTL through Vivado 2016.4 shown in Figure 5.

In addition, the expanding ayer and the fina ayer, which was the deconvoutiona ayer, were fused to another combined CLP.

6 Figure 8: Comparison of DSP utiization between TDC method and proposed method in the FSRCNN Tabe 1: Comparison of between the proposed method and the conventiona method BRAM DSP FF LUT Cyces( 10 6 ) [9] , , Ours 120 1,491 53,242 45, the ast ayer in FSRCNN from 5 5 to 3 3. The PSNR is 36.28dB, which is 0.72dB ower than the origina mode in a Set5 test dataset. 5 EXPERIMENTAL RESULTS We evauated the proposed on-chip CNN acceerator using the Xiinx Kintex-7 410T FPGA. We designed it with Veriog RTL through Vivado shown in Figure 5. The CLP of the first ayer and the CLP of the second ayer, the shrinking ayer, were fused to the combined CLP. In addition, the expanding ayer and the fina ayer, which was the deconvoutiona ayer, were fused to another combined CLP. Therefore, intermediate data that required ots of memory in the each combined CLPs were not stored. In addition, weight buffers were stored in ROM because it operated as read ony. Our system had the scae factor of 2 and rendered the QHD images from input image of pixe resoution. Tabe 1 shows the resources utiization of the proposed acceerator. Our acceerator used a sma amount of FPGA resources compared to [9]. We set working frequency of the FPGA as 100MHz. The execution times for generating FHD, QHD, and UHD images were 5.28ms, 9.21ms, and 20.73ms, respectivey. Thus its rendering speed was 36 times higher than of the conventiona method. Figure 6 shows that the ayers in the CNN are performed on the pipeine through the proposed acceerator. Thus, a ayer were performed concurrenty on the same timeine. For ayers with the width of the kerne size greater than 1, a ine deay was generated unti the ine buffers equa to the size of T k was fied. Exceptionay, the expanding ayer generated a ine deay for the combined CLP. In the PReLU, the output of the convoution was added to the bias, and when it was smaer than 0, the resut was mutipied by the PReLU coefficients. So, we impemented adders and mutipiers through ogic eements for every PReLU modue. Figure 7 shows the amount of memory required to impement the on-chip CNN acceerator using the fusion architecture and the combined CLP architecture for the quantized FSRCNN. The fusion architecture required about 1MB of the on-chip memory when generating UHD images, whie the proposed method required 250KB of memory, which was four times ess than that of the fusion architecture. Therefore, we coud sove the conventiona probem that required ots of memory to store feature maps. Figure 8 shows the comparison of DSP usage between the TDC method and the proposed method in the deconvoutiona ayer. In Figure 8 (a), if KD was 5, the difference in DSP usage between the two methods was about 500. We coud aso found that the arger the feature maps, the higher the difference in Figure 8 (b). This enormous difference was an important factor in designing the CNN acceerator. Because the CNN acceerators shoud process the convoutiona ayers and the deconvoutiona ayer as fast as possibe with a imited amount of DSPs in the FPGA. Therefore, the proposed method coud operate much more effectivey than the conventiona method in imited hardware resources. 6 CONCLUSIONS In this paper, we proposed a nove on-chip CNN acceerator to sove the imitation due to the off-chip data transfer. We impemented the FSRCNN, we known as CNN-based SR, using Xiinx Kintex-7 FPGA. We deveoped the 2-stage quantization technique to impement the imited FPGA resources. In experimenta resuts, our on-chip CNN acceerator generated HR images in rea time. In addition, we reduced the on-chip memory usage and DSP usage by 4 times and 1.44 times, respectivey, compared to conventiona methods. REFERENCES [1] A. Krizhevsky, I. Sutskever, and G. E. Hinton Imagenet cassification with deep convoutiona neura networks. In NIPS 12.. [2] C. Dong, C. C. Loy, K. He, and X. Tang Learning a deep convoutiona network for image super-resoution. In ECCV [3] C. Zhang, P. Li, G. Sun, Y. Guan, B. Xiao, and J. Cong Optimizing FPGA-based Acceerator Design for Deep Convoutiona Neura Networks. In FPGA [4] Y. Ma, Y. Cao, S. Vrudhua, and J.-S. Seo Optimizing Loop Operation and Datafow in FPGA Acceeration of Deep Convoutiona Neura Networks. In FPGA [5] Y. Shen, M. Ferdman, and P. Mider Maximizing CNN Acceerator Efficiency Through Resource Partitioning. In ISCA

7 [6] M. Awani, H. Chen, M. Ferdman, and P. Mider Fused-ayer CNN acceerators In MICRO [7] H. Yonekawa and H. Nakahara On-Chip Memory based Binarized Convoutiona Deep Neura Networks Appying Batch Normaization Free Technique on an FPGA. In IPDPSW [8] Q. Xiao, Y. Liang, L. Lu, S. Yan, and Y.-W. Tai Exporing Heterogeneous Agorithms for Acceerating Deep Convoutiona Neura Networks on FPGAs. In DAC [9] J.-W. Chang and S.-J. Kang Optimizing FPGA-based Convoutiona Neura Networks Acceerator for Image Super-Resoution. In ASP-DAC 17. [10] R. Keys Cubic convoution interpoation for digita image processing. IEEE Transactions on acoustics, speech, and signa processing, [11] C. Dong, C. C. Loy, and X. Tang Image Super-Resoution using Deep Convoutiona Networks. IEEE Trans. pattern anaysis and machine inteigence [12] C. Dong, C. C. Loy, and X. Tang Acceerating the super-resoution convoutiona neura networks. ECCV [13] S. Han, H. Mao, and W. J. Day Deep compression: Compressing deep neura networks with pruning, trained quantization and huffman coding. arxiv preprint arxiv: , [14] Xiinx, 7 series FPGAs memory resources user guide. [15] K. He, X. Zhang, S. Ren, and J. Sun Deving deep into retifiers: surpassing human-eve performance on imagenet cassification. In ICCV [16] D. Lin, S. Taathi, and S. Annapureddy Fixed point quantization of deep convoutiona networks. In ICML

Outline. Parallel Numerical Algorithms. Forward Substitution. Triangular Matrices. Solving Triangular Systems. Back Substitution. Parallel Algorithm

Outline. Parallel Numerical Algorithms. Forward Substitution. Triangular Matrices. Solving Triangular Systems. Back Substitution. Parallel Algorithm Outine Parae Numerica Agorithms Chapter 8 Prof. Michae T. Heath Department of Computer Science University of Iinois at Urbana-Champaign CS 554 / CSE 512 1 2 3 4 Trianguar Matrices Michae T. Heath Parae