Gist: Efficient Data Encoding for Deep Neural Network Training


Animesh Jain, Amar Phanishayee, Jason Mars, Lingjia Tang and Gennady Pekhimenko
Microsoft Research, University of Michigan, Ann Arbor, University of Toronto
(Work done during an internship at Microsoft Research, Redmond.)

Abstract: Modern deep neural network (DNN) training typically relies on GPUs to train complex hundred-layer-deep networks. A significant problem facing both researchers and industry practitioners is that, as the networks get deeper, the available GPU main memory becomes a primary bottleneck, limiting the size of the networks that can be trained. In this paper, we investigate widely used DNNs and find that the major contributors to memory footprint are intermediate layer outputs (feature maps). We then introduce a framework for DNN-layer-specific optimizations (e.g., convolution, ReLU, pool) that significantly reduce this source of main memory pressure on GPUs. We find that a feature map typically has two uses that are spread far apart temporally. Our key approach is to store an encoded representation of feature maps for this temporal gap and decode this data for use in the backward pass; the full-fidelity feature maps are used in the forward pass and relinquished immediately. Based on this approach, we present Gist, our system that employs two classes of layer-specific encoding schemes, lossless and lossy, to exploit existing value redundancy in DNN training and significantly reduce the memory consumption of targeted feature maps. For example, one insight is that by taking advantage of the computational nature of back propagation from the pool to the ReLU layer, we can store the intermediate feature map using just 1 bit instead of 32 bits per value. We deploy these mechanisms in a state-of-the-art DNN framework (CNTK) and observe that Gist reduces the memory footprint by up to 2x across 5 state-of-the-art image classification DNNs, with an average of 1.8x and only 4% performance overhead. We also show that further software (e.g., cuDNN) and hardware (e.g., dynamic allocation) optimizations can result in an even larger footprint reduction (up to 4.1x).

Keywords: DNN Training; Data Encodings; Compression

I. INTRODUCTION

The availability of large datasets and powerful computing resources has enabled a new breed of deep neural networks (DNNs) to solve hitherto hard problems such as image classification, translation, and speech processing [22], [24], [31], [48]. These DNNs are trained by repeatedly iterating over datasets. The DNN training process has large compute and memory requirements and primarily relies on modern GPUs as the compute platform. Unfortunately, as DNN models are getting larger and deeper, the size of available GPU main memory quickly becomes the primary bottleneck, limiting the size of the DNNs that a GPU can support [39], [40]. (For GPU main memory (GDDR5/GDDR5X), the first-order concern is bandwidth, as many GPU applications are bandwidth-bound; it is hard to get both high-bandwidth and high-density DRAM-based memory at low cost [33].) Modern DNNs are already facing this issue, prompting researchers to develop memory-efficient implementations of the networks [37]. Many researchers have recognized this shortcoming and proposed approaches to reduce the memory footprint of DNN training. However, prior approaches are not able to simultaneously achieve all of the following three desirable properties: (i) high memory footprint reduction, (ii) low performance overhead, and (iii) minimal effect on training accuracy.
Most prior works propose efficient techniques to reduce the memory footprint in DNN inference with an emphasis on reducing the model size (also referred to as weights) [18], [19], [20], [17]. However for DNN training, weights are only a small fraction of total memory footprint. In training, intermediate computed values (usually called feature maps) need to be stored/stashed in the forward pass so that they can be used later in the backward pass. These feature maps are the primary contributor to the significant increase in memory footprint in DNN training compared to inference. This important factor renders prior efforts, that target weights for memory footprint reduction, ineffective for training. State-of-the-art memory footprint reduction approaches for training transfer data structures back and forth between CPU and GPU memory but pay a performance cost in doing so [39]. Finally, approaches that explore lower precision computations for DNN training, primarily in the context of ASICs and FPGAs, either do not target feature maps (and thus unable to achieve high memory footprint reduction) or, when used aggressively, result in reduced training accuracy [11], [15], [8]. The key insight of this work is in acknowledging that a feature map typically has two uses in the computation timeline and that these uses are spread far apart temporally. Its first use is in the forward pass and second is much later in the backward pass. Despite these uses being spread far apart, the feature map is still stashed in single precision format (32-bits) when they are unused between these accesses. We find that we can store the feature map data with efficient encodings that result in a much smaller footprint between the two temporal uses. Furthermore, we propose that if we take layer types and interactions into account, we can enable highly efficient layer-specific encodings these opportunities are missed if we limit ourselves to a layer-agnostic view. Using these key insights, we design two

layer-specific lossless encodings and one lossy encoding that are fast, efficient in reducing memory footprint, and have minimal effect on training accuracy. Our first lossless encoding, Binarize, specifically targets ReLU layers followed by a pooling layer. Upon careful examination of ReLU's backward pass calculation, we observe that the ReLU output, which has to be stashed for the backward pass, can be encoded using just 1-bit values, leading to 32x compression for the ReLU outputs. Our second lossless encoding, Sparse Storage and Dense Compute (SSDC), which specifically targets a ReLU followed by a convolution layer, is based on the observation that ReLU outputs have high sparsity. SSDC facilitates storage in a memory-space-efficient sparse format but performs computation in dense format, retaining the performance benefits of highly optimized cuDNN dense computation while exploiting sparsity to achieve a high reduction in memory footprint. Finally, in the lossy domain, our key insight of representing the stashed feature maps in a smaller format only between the two temporal uses enables us to be very aggressive with precision reduction without any loss in accuracy compared to the baseline. This lossy encoding, Delayed Precision Reduction (DPR), delays precision reduction to the point where values are no longer needed in the forward pass and achieves significant bit savings (as small as 8 bits). Utilizing all these encodings, we present Gist, which specifically targets feature maps to reduce the training memory footprint. It performs a static analysis on the DNN execution graph, identifies the applicable encodings, and creates a new execution graph with the relevant encode and decode functions inserted. Gist also performs a static liveness analysis of the affected feature maps and newly generated encoded representations to assist the DNN framework's memory allocator, CNTK [43] in our case, in achieving an efficient memory allocation strategy. This paper makes the following contributions:

Systematic Memory Breakdown Analysis. We perform a systematic memory footprint analysis, revealing that feature maps are the major memory consumers in the DNN training process. We also make the new observation that feature maps have high data redundancy and can be stored in much more efficient formats between their forward and backward use.

Layer-specific Lossless Encodings. We present two layer-specific encodings: (1) Binarize, which achieves 32x compression for ReLU outputs for the layer combination of ReLU followed by pool, and (2) Sparse Storage and Dense Compute, which exploits the high sparsity exhibited in ReLU outputs for ReLU followed by convolution layers.

Aggressive Lossy Encoding. We present DPR, which applies precision reduction only for the backward use of the feature maps; values in the forward pass are kept in full precision, leading to aggressive bit savings without affecting accuracy.

Footprint Reduction on a Real System. We observe that Gist reduces the memory footprint by up to 2x across 5 state-of-the-art image classification DNNs, with an average of 1.8x and only 4% performance overhead. By reducing the memory footprint, Gist can fit larger minibatches in GPU memory, improving GPU utilization and speeding up training for very deep networks, e.g., a speedup of 22% for ResNet. We also show that further optimizations to existing DNN libraries and memory allocation can result in even larger memory footprint reductions (up to 4.1x).

II.
BACKGROUND AND MOTIVATION DNNs typically consist of an input and an output layer with multiple hidden layers in between. Recently, convolution neural networks (CNNs), a class of DNNs, have been shown to achieve significantly better accuracy compared to previous state-of-the-art algorithms for image classification [32], [45], [46], [22], [34]. CNNs have been growing deeper with every iteration, starting with few convolution layers in the beginning (AlexNet) to hundreds of convolution layers in recent ones (Inception). A. Training vs. Inference DNNs have two distinct modes of operation: (i) training, when a model is trained based on a set of inputs (training set) and a corresponding set of expected outputs, and (ii) inference, when an already trained network is used to generate predictions for new inputs. For this paper, we focus on two main differences between training and inference. First, training consists of two phases: forward and backward passes [41], [10]; inference only involves a forward pass. The goal of the backward pass in DNN training is to backpropagate error and find weight error gradients that can be applied to the weights to steer the parameters in the right direction. Training is performed in batches of input samples, commonly known as a minibatch [3]. Training on minibatches as opposed to training on an image-by-image basis has been shown to achieve better accuracy and better hardware utilization [12], [21], [9]. Second, in inference, the major part of storage overhead comes from weights. These weights are fixed after training, and hence many different optimizations can be applied to reduce their storage requirements [17], [2], [18], [5]. In contrast, training has many distinct data structures, e.g., weights that change over the course of training, weight gradients, feature maps (intermediate layer outputs) that need to be stashed in the forward pass for use in the backward pass, and backward gradient maps. 1) Why Memory Can Be a Problem in Training?: To understand the memory requirements of GPU-based training, we study the breakdown of memory footprint on five stateof-the-art CNNs in CNTK. While using FPGAs [50] and ASICs [28], [25] is also possible, most of these designs are either proprietary or in relatively early development stages.

Figure 1: Breakdown of the memory footprint in DNN training among different data structures (weights, stashed feature maps, weight gradients, immediately consumed data, and workspace), with allocated memory in GB for AlexNet (256), NiN (256), Overfeat (256), VGG16 (64), and Inception (64); minibatch sizes are in parentheses.

Hence, we conduct our study on a modern GPU (Maxwell GTX Titan X in our case). However, our approaches are applicable to optimized hardware as well (Section V-H). Figure 1 shows the breakdown of the total memory footprint across different data structures. Feature maps are the intermediate layer outputs that are passed on as an input to the following layer. Gradient maps are the intermediate gradients generated in the backward pass and passed as an input to the previous layer. In CNNs, not every feature map has to be saved for the backward pass. We thus distinguish stashed feature maps (generated in the forward pass and used in both the forward and backward passes) from immediately consumed feature maps (generated in the forward pass and consumed immediately in the forward pass) and gradient maps (generated in the backward pass and consumed immediately). Stashed feature maps are required in the backward pass and are thus stored for a long time during minibatch processing. In contrast, immediately consumed feature maps and gradient maps can be discarded as soon as they are used. Finally, workspace is the deep learning library's (cuDNN in this case) intra-layer storage to support layer computations [7]. cuDNN provides a choice between memory-optimal and performance-optimal implementations, translating to a tradeoff between algorithm performance and workspace storage requirements. In this work, we choose its memory-optimal implementation as an optimized baseline.

We draw two major conclusions from this figure. First, larger (deeper) DNNs consume a large amount of memory even with relatively small minibatch sizes (64). VGG16 and Inception can only fit in our GPU memory if the minibatch size is 64, and start exceeding the 12 GB GPU memory limit at higher minibatch sizes. A higher minibatch size is desirable as it leads to better GPU utilization [12], [51]. Second, the training memory footprint tends to be dominated primarily by stashed feature maps, followed by immediately consumed data structures. For example, in VGG16, 83% of memory is consumed by stashed feature maps and immediately consumed data; this number grows to 97% for Inception. This result stands in stark contrast to inference, where feature maps do not need to be stashed and memory consumption is dominated by weights. We conclude that the stashed feature maps and immediately consumed data structures (in that order of importance) are key for optimizing GPU memory consumption in CNN training.
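As a rough illustration of these magnitudes, the sketch below computes the FP32 size of a single stashed feature map for a hypothetical early-convolution output shape (64 channels at 224x224); the shape and the minibatch sizes are illustrative examples rather than the configurations measured in Figure 1. A deep network stashes tens of such maps per minibatch.

```cpp
// Back-of-the-envelope sizing of a single stashed FP32 feature map, to make
// the scaling with minibatch size concrete. The layer shape is a hypothetical
// early-convolution output (64 channels at 224x224), not a value measured in
// Figure 1.
#include <cstdint>
#include <cstdio>

// Bytes needed to stash one FP32 feature map of shape N x C x H x W.
uint64_t feature_map_bytes(uint64_t n, uint64_t c, uint64_t h, uint64_t w) {
    return n * c * h * w * sizeof(float);
}

int main() {
    const uint64_t C = 64, H = 224, W = 224;  // illustrative conv output shape
    for (uint64_t minibatch : {32, 64, 128, 256}) {
        double mib = feature_map_bytes(minibatch, C, H, W) / (1024.0 * 1024.0);
        std::printf("minibatch %3llu: %8.1f MB for one stashed feature map\n",
                    static_cast<unsigned long long>(minibatch), mib);
    }
    return 0;
}
```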
B. Limitations of Prior Work

In this paper, we develop techniques to reduce the DNN training memory footprint. Here, we briefly describe the limitations of existing approaches.

Prefetch and Swap-out. One potential approach is to move parts of the working set between CPU and GPU memory using PCIe links and smart prefetching analysis [39]. However, this approach still suffers from the significant overheads of data transfer with respect to power/energy and from an inability to completely mask the performance cost of swapping data in and out of GPU memory (up to 27% for Inception). In addition, it uses a shared resource, the PCIe links, which can sometimes be of critical importance in distributed DNN training [9].

Reducing Minibatch Size. While reducing the minibatch size is effective at reducing the memory footprint during training, it adversely affects the runtime of the training process because smaller minibatches lead to GPU underutilization [12]. This performance hit can be recovered by using more GPUs, where each GPU works on a smaller minibatch. But this is also an inefficient solution, as GPU machines are both costly and power hungry, and it might also result in sub-linear scaling due to stragglers and transfers across workers [9].

Recompute. Instead of saving the output of a large layer, prior work has considered recomputing the output of a layer's forward pass again in the backward pass [4]. Unfortunately, we observe that the largest layers are usually the ones that also take the longest to recompute, which can cause significant performance overhead. Yet, this technique is still applicable for some specific layers (like batch normalization) and can be used in conjunction with our work.

III. GIST: KEY IDEAS

Figure 2: Long temporal gap between the two uses of a feature map. The feature map is generated and first used in the forward pass and used a second time much later in the backward pass; in the baseline it is stashed in FP32 format for the whole gap, whereas Gist encodes it after the first use and decodes it just before the second use.

In this work, we design techniques to reduce the DNN training memory footprint by focusing on the primary contributors: feature maps. We find that a feature map typically has two uses in the computation timeline and that these uses are spread far apart temporally, as shown in Figure 2. Its first use is in the forward pass and the second use is much later in the backward pass. In the baseline, the data is stashed in single precision (FP32) even though these uses are far apart. In our approach, we represent the data in a much smaller encoded format during the temporal gap and decode it just before it is needed again in the backward pass; the forward use still gets the data in FP32 format, but the memory space is relinquished as soon as the forward use is complete. This results in an efficient memory sharing strategy (Section IV-C), reducing the total memory footprint.
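The minimal C++ sketch below illustrates this three-phase lifetime. The encoding used in it is only a stand-in (each FP32 value is truncated to its upper 16 bits) so that the example is self-contained; Gist's actual encodings, Binarize, SSDC and DPR, are described in Section IV and are implemented as CUDA kernels inside CNTK.

```cpp
// Sketch of the idea above: a stashed feature map is kept in FP32 only around
// its two uses, and in a compact encoded form for the long gap in between.
// The "encoding" here is a stand-in (truncate each FP32 value to its upper
// 16 bits); Gist's real encodings are Binarize, SSDC and DPR.
#include <cstdint>
#include <cstdio>
#include <cstring>
#include <vector>

static uint16_t encode_one(float v) {   // keep sign, exponent and top mantissa bits
    uint32_t bits;
    std::memcpy(&bits, &v, sizeof bits);
    return static_cast<uint16_t>(bits >> 16);
}

static float decode_one(uint16_t e) {   // dropped mantissa bits come back as zeros
    uint32_t bits = static_cast<uint32_t>(e) << 16;
    float v;
    std::memcpy(&v, &bits, sizeof v);
    return v;
}

struct StashedFeatureMap {
    std::vector<uint16_t> encoded;      // the only copy that lives across the gap

    // Right after the forward use: encode, then release the FP32 buffer so the
    // memory allocator can reuse its space for other data structures.
    void after_forward_use(std::vector<float>& fp32) {
        encoded.reserve(fp32.size());
        for (float v : fp32) encoded.push_back(encode_one(v));
        fp32.clear();
        fp32.shrink_to_fit();
    }

    // Right before the backward use: materialize a short-lived FP32 copy.
    std::vector<float> before_backward_use() const {
        std::vector<float> fp32;
        fp32.reserve(encoded.size());
        for (uint16_t e : encoded) fp32.push_back(decode_one(e));
        return fp32;
    }
};

int main() {
    std::vector<float> y = {0.5f, -1.25f, 3.0f, 0.0f};   // a tiny "feature map"
    StashedFeatureMap stash;
    stash.after_forward_use(y);                          // forward use done, FP32 freed
    for (float v : stash.before_backward_use()) std::printf("%g ", v);
    std::printf("\n");
    return 0;
}
```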

Our second key insight is that taking layer types and interactions into account opens up opportunities for designing highly aggressive encoding schemes, which were earlier hidden due to the layer-agnostic nature of prior work. In this section, we first identify such opportunities, which let us design two lossless encodings and one lossy encoding specifically targeted at reducing the memory footprint of stashed feature maps. Next, we observe that inplace computations can reduce the size of immediately consumed data structures. Table I shows a brief outline of our techniques and their target data structures.

Table I: Summary of Gist techniques
Target data structures | Footprint reduction technique | Type
ReLU-Pool feature map | Binarize | Lossless
ReLU-Conv feature map | Sparse Storage and Dense Compute | Lossless
Other feature maps | Delayed Precision Reduction | Lossy
Immediately consumed | Inplace computation | Lossless

A. Opportunities For Lossless Encodings

Figure 3: Breakdown of memory within stashed feature maps (ReLU-Pool, ReLU/Pool-Conv, and Others) for AlexNet (256), NiN (256), Overfeat (256), VGG16 (64), and Inception (64). The ReLU layer consumes a major portion of the footprint.

Among feature maps, we first target ReLU activation functions, which are heavily used in all major CNNs targeted at vision-related tasks [22], [48], [13], [31]. In these CNNs, the convolution layers are typically followed by ReLU activations. This group of a Conv-ReLU pair is either followed by the same group or by a pool layer, resulting in many ReLU-Conv and ReLU-Pool layer pairs. Consequently, we observe that ReLU feature maps form a major fraction of the total memory footprint. This is shown in Figure 3, which zooms in on the stashed feature maps (from Figure 1) and analyzes their breakdown across different CNN layers, with an emphasis on three categories: (i) ReLU outputs followed by a pool layer (ReLU-Pool), (ii) ReLU/Pool outputs followed by a convolution layer (ReLU-Conv), and (iii) the remaining stashed feature maps (Others). We observe that a significant portion of the memory footprint is attributed to ReLU outputs (pool outputs have a very low contribution). For example, VGG16 has 40% and 49% of the stashed feature maps for ReLU-Pool and ReLU-Conv, respectively (89% in total used for ReLU outputs). We make two key observations for these ReLU outputs that let us store the stashed feature map with far fewer bits. First, when carefully examining the backward pass computation of the ReLU layer, we observe that ReLU outputs for the ReLU-Pool combination can be stored with just 1 bit. Second, we observe that ReLU outputs typically have high sparsity that can be exploited to encode the feature map in a much smaller sparse format. Next, we expand on these opportunities.

ReLU-Pool. Typically, in a backward pass calculation, a layer uses its stashed input feature map (X), stashed output feature map (Y) and output gradient map (dy) to calculate the input gradient map (dx), as shown in Figure 4(a). However, not every layer requires all this data for the backward pass. Upon further investigation, we discover that although feature maps are stashed with the same precision across layer types, this is mostly for convenience, not out of computational necessity. We show the backward pass calculation for ReLU in Figure 4(b).
It only requires Y and dy to calculate dx. Moreover, an element of dy is passed to dx, only if the corresponding element in Y is positive, else dx is set to 0. With this observation in mind, it is natural to consider replacing Y with 1-bit values. Unfortunately, it is not always possible, because the next layer might require its stashed input feature map X (ReLU output in this case) for the backward pass calculation. However, upon further examination, we observe that in the case of ReLU-Pool, the pool backward pass does not require the actual values of ReLU output as shown in Figure 4(c) (described in more details in Section IV-A), resulting in significant encoding opportunities. This observation becomes the basis of our first lossless encoding, called Binarize. ReLU-Conv. Binarize is not applicable to ReLU-Conv pair, because convolution requires its stashed input feature map for the backward pass calculation (as shown in Figure 4(d)). However, upon careful data analysis, we observe that ReLU outputs have high sparsity (large number of zeroes) induced by the ReLU calculations in the forward pass. For example, for VGG16, we observe high sparsity, going even over 80%, for all the ReLU outputs, motivating us to apply sparse compression and computation for these feature maps. However, switching both compute and memory to sparse domain results in significant performance degradation [49], [23]. Building on this observation, we present Sparse Storage and Dense Compute (SSDC) encoding that stores the data in Compressed Sparse Row (CSR) encoding format while keeping the computation in dense format. B. Opportunities For Lossy Encodings For lossy encoding, we investigate precision reduction as it is amenable to GPU architecture compared to prior offline approaches like quantization and Huffman encoding [18]. We make two observations that let us achieve aggressive bit savings without any loss in accuracy. Non-Uniform Precision Reduction. We observe that uniform precision reduction to even 16 bits (IEEE half precision

format), where all the data structures are represented in 16 bits while the computation happens on 32 bits, leads to severe accuracy losses. We observe that if, instead, we restrict the precision reduction to only gradient maps, then the training accuracy is not affected. Note that, although this type of selective precision reduction has been studied in a recent work [11], it does not study the effect of reducing precision in stashed feature maps.

Figure 4: Backward pass computation for (a) a generic layer: dx = f(X, Y, dy); (b) the ReLU layer: dx = f(Y, dy), specifically dx = (Y > 0) ? dy : 0; (c) the pooling layer: YtoXmap = f(X, Y), dx = f(YtoXmap, dy); and (d) the convolution layer: dx = f(X, dy). Only the circled data structures are needed in the backward pass.

Delayed Precision Reduction. But more importantly, we find that the way precision reduction is applied in training should be significantly changed. Currently, the conventional wisdom [11], [15], [8], [16] is to apply precision reduction right after the value is generated by a particular layer. This design choice leads to a situation where the error generated by the precision reduction is injected directly into the next layer and is then propagated (potentially increasing in magnitude) to all future layers in the forward pass. In our work, we observe that it is better to separate the two uses of every layer output in the forward and backward pass (as previously shown in Figure 2). The first, immediate use (by the next layer) significantly benefits from the more precise (usually FP32) representation, avoiding any error injection due to precision reduction in the forward pass, while the second use, much later in the backward pass, can tolerate lower precision. This separation, implemented in our Delayed Precision Reduction encoding, pushes the bit lengths down to values as small as 8 bits for multiple DNNs (unseen in prior work).

C. Opportunities For Inplace Computation

Next, we shift our focus from stashed feature maps to immediately consumed data structures. We observe that a good portion of immediately consumed data can be removed by performing inplace computation. As discussed in prior work [4], this optimization is applicable for layers (specifically ReLU) that have a read-once and write-once property between each element of input and output. In the absence of the inplace optimization, the convolution output shows up in the immediately consumed category. With inplace computation, the memory space for the convolution output is reused by ReLU, reducing the immediately consumed memory footprint.

IV. DESIGN AND IMPLEMENTATION

Based on the observations in the previous section, we present the design of our system, Gist. Figure 5 shows our system architecture.

Figure 5: Gist system architecture. The Schedule Builder takes the original computation graph (a directed graph for DNN execution), identifies applicable Gist encodings, performs liveness analysis of the affected data structures, and produces a modified computation graph that is executed at runtime together with the CNTK static memory allocator.

Typically, DNN frameworks like
In isolation, encoded representations only add to the memory footprint. To solve this problem, we utilize the CNTK memory allocator that uses the lifetimes of various data structures to find an efficient static memory sharing strategy. Schedule Builder performs liveness analysis, infers the lifetime of affected stashed feature maps and encoded representations, and presents them to the CNTK memory allocator for optimization. Gist encodings reduce the lifetime of FP32 stashed feature maps, opening up more opportunities for memory sharing, and thus reducing the total memory footprint. In this section, we present the details of Gist encodings and design of Gist s Schedule Builder and its interaction with CNTK memory allocator. A. Encodings We present a generic view of encodings in Figure 6, illustrating the difference between baseline and Gist encodings in terms of new data structures and changes in backward pass calculations (shown in blue color). For the baseline, backward pass calculation is a function of intermediate feature map (Y 1 in the figure) along with other data structures. Gist introduces two new data structures E, the Encoded intermediate feature map in the forward pass, and D, the decoded intermediate feature map in the backward pass. These new data structures change the backward pass calculation, which are now dependent on D instead of Y 1. We now present the details of these encodings. Lossless Encoding - Binarize. As discussed in Section III, we observe that for ReLU-Pool layer combination, ReLU output can be encoded much more efficiently, because (i) the backward pass of ReLU only needs to know if the stashed feature map is positive (1 bit), and (ii) the pool layer backward pass can be optimized so that it does not

need ReLU outputs.

Figure 6: Gist encodings. For two consecutive layers L1 and L2 with Y1 = X2, the baseline backward pass computes dx1 = f(Y1, X1, dy1) and dx2 = f(X2, Y2, dy2). Gist adds two data structures, E = Encode(X2) in the forward pass and D = Decode(E) in the backward pass, so that dx1 = f(D, X1, dy1) and dx2 = f(D, Y2, dy2). The encode and decode functions per layer pair are:
Layer pairs | Encode function | Decode function
ReLU-Pool | FP32 to 1-bit | Work on encoded data
ReLU/Pool-Conv | FP32 to CSR-like | CSR-like to FP32
Others | FP32 to FP8/10/16 | FP8/10/16 to FP32

CNNs typically use the MaxPool layer to subsample the input matrix. MaxPool's forward pass slides a window of a specific size over the input matrix X, finds the maximum value in this window, and passes it on to the output Y. For the backward pass, it passes dy on to the location in dx from where the maximum value of X was chosen in the forward pass. In the baseline CNTK implementation, the MaxPool layer stashes both its input and output feature maps for the backward pass to find the locations of the maximum values. We instead create a mapping from Y to X in the forward pass that keeps track of these locations (YtoXmap in Figure 4(c)). We use this mapping in the MaxPool backward pass calculation, removing the dependence on its stashed input and output feature maps. With this optimization, the Binarize encoding lets us achieve a significant reduction in memory footprint for the ReLU-Pool layer combination. Binarize adds two encoded feature maps. First, a 1-bit data structure that replaces the ReLU feature map, storing whether the stashed feature map value was positive. Second, for the pool layer, Binarize stores a Y-to-X mapping of the locations of the maximum values. This data structure has as many elements as the pool output (typically, one-fourth or one-ninth of the preceding ReLU output), where each element is stored in 4 bits (the largest sliding window in our application suite is 3x3). Therefore, these encoded data structures result in a compression of close to 16x (32x for the ReLU output and 8x for the MaxPool output) for ReLU-Pool stashed feature maps. Tying back to Figure 6, both the encode and decode functions are implemented within the ReLU and pool layers using CUDA. The pool and ReLU backward passes have been updated to perform computation on the encoded data structures themselves. We observe small performance improvements with the Binarize encoding, because it significantly increases the effective memory bandwidth for ReLU layers, improving the memory-bandwidth-bound ReLU backward pass computation.

Lossless Encoding - Sparse Storage and Dense Compute. As discussed in Section III, the other set of ReLU layers, ReLU-Conv, exhibits a high amount of sparsity induced by the ReLU calculations, making it suitable to store these feature maps in a sparse format. We also observe high sparsity for a few Pool-Conv layer combinations if the preceding ReLU layer has high sparsity. The SSDC encoding is applicable to both layer combinations. However, switching to sparse computation on GPUs shows performance improvements only when the sparsity is very high (> 97.5%), which is typically not the case in CNNs [49], [23]. To tackle this problem, we present the Sparse Storage and Dense Compute (SSDC) encoding, which isolates storage from computation, storing the data in a sparse format while computing in dense (FP32) format. SSDC stores the data in a sparse format for the majority of its lifetime, and converts the data back into dense format only just before it is actually required for computation.
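A minimal CPU sketch of this storage layout is shown below, assuming a CSR-like structure in which the feature map is viewed as rows of 256 elements so that each column index fits in one byte (this anticipates the Narrow Value Optimization described in the following paragraphs). The actual implementation performs these conversions on the GPU using the cuSPARSE library, which this sketch does not attempt to reproduce.

```cpp
// CPU sketch of SSDC's storage layout: a CSR-like format over the feature map
// viewed as rows of 256 elements, so that each column index fits in one byte.
// It only illustrates the format and the decode back to dense FP32 right
// before the (dense) backward computation.
#include <cstdint>
#include <vector>

struct NarrowCsr {
    std::vector<float>    values;    // non-zero values (ReLU outputs are >= 0)
    std::vector<uint8_t>  cols;      // 1-byte column index within a 256-wide row
    std::vector<uint32_t> row_ptr;   // start offset of each row in values/cols
    uint64_t              count = 0; // original number of elements
};

NarrowCsr ssdc_encode(const std::vector<float>& dense) {
    NarrowCsr csr;
    csr.count = dense.size();
    csr.row_ptr.push_back(0);
    for (uint64_t i = 0; i < dense.size(); ++i) {
        if (dense[i] != 0.0f) {                    // zeros (the common case) cost nothing
            csr.values.push_back(dense[i]);
            csr.cols.push_back(static_cast<uint8_t>(i % 256));
        }
        if ((i + 1) % 256 == 0 || i + 1 == dense.size())
            csr.row_ptr.push_back(static_cast<uint32_t>(csr.values.size()));
    }
    return csr;
}

std::vector<float> ssdc_decode(const NarrowCsr& csr) {
    std::vector<float> dense(csr.count, 0.0f);
    for (uint64_t row = 0; row + 1 < csr.row_ptr.size(); ++row)
        for (uint32_t k = csr.row_ptr[row]; k < csr.row_ptr[row + 1]; ++k)
            dense[row * 256 + csr.cols[k]] = csr.values[k];
    return dense;
}

int main() {
    std::vector<float> relu_out(512, 0.0f);        // mostly zeros, as after ReLU
    relu_out[3] = 1.5f;
    relu_out[300] = 2.0f;
    NarrowCsr enc = ssdc_encode(relu_out);
    return ssdc_decode(enc) == relu_out ? 0 : 1;   // round-trip check
}
```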
This achieves significant memory footprint reduction, while retaining the performance benefits of highly optimized cudnn library. For choosing a suitable sparse format, we compare 3 commonly used formats - ELL, Hybrid and Compressed Sparse Row (CSR). We observe that CSR achieves lowest format-conversion latency among these options, achieving the best compression-performance overhead tradeoff. This format stores all the non-zero values, along with a meta array that holds the column indices of the non-zero values in each row. (There is an extra meta array which is very small in size, and, thus omitted for the rest of the discussion). Most DNN frameworks store data structures in an n-dimensional matrix, which can always be collapsed into two dimensions. We take these 2D matrices and convert them into a CSR format. We use Nvidia cusparse library to perform the encodings/decodings, listed in Figure 6. However, the original implementation stores each index as a 4-byte value, resulting in no improvement with compression if the sparsity is below 50%. This is due to cusparse conservative assumption that the number of elements in a row of the matrix can be high, allotting 4 bytes for every column index. We perform Narrow Value Optimization, where we reshape the 2D matrix and restrict the number of columns to 256, requiring only 1 byte per column index [36]. This reduces the minimal sparsity requirement for compression to be effective from 50% to 20%, resulting in both wider applicability and higher compression ratios. Lossy Encoding - Delayed Precision Reduction. Gist s third encoding uses precision reduction to exploit CNN error tolerance to reduce the memory footprint of remaining stashed feature maps. Similar to SSDC encoding, we isolate the storage from the computation. The computation still happens in FP32 format while the data is stored with lower precision for most of its lifetime. Though backward pass implementation can be modified to work directly on precision-reduced values, we convert back to FP32 because cudnn is closed-source library. Nevertheless, we discuss the impact of such optimization on compression ratio in Section V-H.

7 Our usage of precision reduction differs significantly from the previous research that applies it immediately after computation finishes. We delay the precision reduction until the feature map has been consumed in the forward pass, thus naming it Delayed Precision Reduction (DPR). This lets us achieve more aggressive precision reduction on GPUs. DPR is applicable to any layer combination. We also apply it over SSDC encoding, compressing the non-zero values array in the CSR format. We do not touch the meta array in CSR format and Binarize encoded data structures as these affect control, and thus are not suitable for lossy encoding. Figure 6 lists the encode and decode functions for DPR encoding. We use three smaller representations of 16, 10 and 8 bits, packing 2, 3 and 4 values, respectively, into 4 bytes. For packing 3 values into 4 bytes, 10 bits is the largest length possible (9 bits leave 5 bits unused, 11 bits requires one extra bit). For 16 bits, we use IEEE half precision floating point format (1 sign, 5 exponent and 10 mantissa bits), referred to as FP16. For 8-bits (FP8), we choose 1 bit for sign, 4 for exponent and 3 for mantissa, and for 10-bits (FP10), we use 1 sign, 5 exponent and 4 mantissa bits. In FP10, three 10-bit values are stored in a 4-byte space, rendering 2-bits useless. We ignore denormalized numbers as they have a negligible effect on CNNs accuracy. We use round-tonearest rounding strategy for these conversions. The value is clamped at maximum/minimum value if the FP32 value is larger/smaller than the range of the smaller format. We write CUDA implementations to perform these conversions. Since conversions can happen in parallel, DPR results in minimal performance overhead. B. Schedule Builder In Gist, the values in the forward pass are in FP32 format (for both lossless and lossy encodings) while only the data that is required for the backward pass is stored in an encoded format. However, since the feature maps are still generated in the original FP32 format in the forward pass before they are encoded, without any further optimization, the encodings will result in increased memory footprint. This gives rise to the question that how does Gist leads to memory footprint reduction. This task is handled by Gist s Schedule Builder that has two responsibilities. First, identifying the applicable layer encodings from the CNTK execution graph, performing a static analysis to distinguish between forward and backward use of a feature map, and inserting the encode and decode functions in the execution graph, thus creating a new execution graph that is used at runtime. And, second, performing a static liveness analysis for the affected stashed feature maps and newly generated encoded/decoded representations, and pass it on to the CNTK static memory allocator that finds an efficient memory allocation strategy (Section IV-C). Figure 2 illustrates the liveness analysis performed by the Schedule Builder. The figure shows the two uses of a feature map that are temporally far apart in the computation timeline one in the forward pass and one much later in the backward pass. In the baseline, the lifetime of this feature map is very long and it is stored in FP32 format for this whole duration. Gist breaks this lifetime into three regions FP32 format that is live only for the immediate forward use, encoded (much smaller) format that is live for the long temporal gap between the two uses, and FP32 format for the decoded value that is live only for the immediate backward use. 
The figure also shows the points at which Schedule Builder inserts encode and decode functions. C. CNTK Memory Allocator Schedule Builder passes this liveness analysis to the CNTK static memory allocator that finds an efficient strategy to allocate the memory. CNTK, similar to other deep learning frameworks, performs static memory allocation. The other alternative, dynamic allocation, results in a lower footprint but at the expense of many expensive cudamalloc calls for each minibatch, resulting in performance overhead. Nevertheless, we discuss the impact of dynamic memory allocation on compression ratio in Section V-H. The key idea employed in the CNTK memory allocator is memory sharing. It takes lifetimes of different data structures and their sizes as input, and finds an efficient memory sharing strategy. The memory allocator creates groups of data structures whose lifetimes do not overlap and thus can share the same memory space. Therefore, the size of this group is the largest size of the member within the group, as opposed to the sum of size of the members of the group. To come up with an efficient strategy, it first sorts the data structures on the basis of size, and then forms these groups, so that large data structures can share the same memory space. At the end of this process, memory allocator has multiple groups which are either dominated by feature maps that are stored for the backward pass or by immediately consumed feature maps or gradient maps. Gist encodings, by reducing the lifetime of FP32 stashed feature map, create higher opportunities for memory sharing, resulting in lower memory footprint. Example - Putting it all together. We present an example in Figure 7 to illustrate the interactions between static memory manager and Gist encodings. The example shows lifetimes of five variables (X, A, B, C and D). In part (a), we show the output of CNTK memory allocator for the baseline. Memory allocator forms 2 groups, resulting in a total size of 18 MB, 10 for stashed feature map (X) and 8 for immediately consumed. In part (b), we apply SSDC encoding, which breaks the lifetime of original X into three separate timelines. In this case, CNTK memory allocator again forms two groups, however, the total size is reduced to 12, 2 for (encoded) stashed feature map and 10 for immediately consumed. As shown in the figure, Gist encodings convert the original FP32 stashed feature maps

V. EVALUATION

A. Methodology

Infrastructure. We evaluate Gist's memory reduction capabilities on the Microsoft CNTK deep learning framework. We implement the Gist encodings, the inplace optimization, and the Schedule Builder, and make the necessary changes to the CNTK static memory allocator. The evaluation is performed on an Nvidia Maxwell GTX Titan X [26] card with 12 GB of GDDR5 memory, using cuDNN v6.0 and CUDA 8.0.

Applications. We evaluate Gist on 6 state-of-the-art image classification CNNs: AlexNet [32], NiN [34], Overfeat [44], VGG16 [45], Inception [46] and ResNet [22], using the ImageNet training dataset [42]. These CNNs present a wide range of layer shapes and sizes, while also capturing the evolution of CNNs over the past few years.

Baselines. Our first baseline, referred to as the CNTK baseline, is CNTK's original static memory allocation strategy, without any of our optimizations. In Section II, we showed that stashed feature maps and immediately consumed data structures are the major contributors to the total memory footprint; hence, the CNTK baseline consists of only these two data structures. It does not include weights, weight gradients, and workspace (also in line with previous work on DNN training [39], [40]). We also use a second baseline to study the effect of different encodings in isolation. This baseline, referred to as the investigation baseline, is a modified CNTK baseline where memory sharing is not allowed for stashed feature maps. Other data structures are shared exactly the same way as in the CNTK baseline. This baseline allows us to study the impact of our encodings on different data structures in isolation. For end-to-end memory reduction numbers, we still use the CNTK baseline. For performance overhead evaluation, we use the memory-optimized cuDNN configuration, as the focus of our work is memory footprint reduction. Memory-optimized cuDNN presents an optimized baseline for comparison. Note that the CNTK baseline and the investigation baseline have the same performance, as they do not affect computation.

Comparison Metric. We use the Memory Footprint Ratio (MFR) to evaluate the efficacy of Gist at reducing the memory footprint. MFR is defined as follows:

MFR = (Memory Footprint of Baseline) / (Memory Footprint after Encoding)    (1)

Figure 8: Memory footprint ratio against the CNTK baseline for the Lossless and Lossless + Lossy configurations (AlexNet, NiN, Overfeat, VGG16, Inception, and geomean). Gist cuts down the total memory footprint significantly.

B. Gist's Memory Footprint Reduction

Gist is designed to tackle the increasing memory footprint in DNN training. In this section, we evaluate Gist's efficacy in this respect. In this experiment, we first apply all lossless optimizations (Binarize, SSDC, and inplace) and measure the total memory footprint (stashed feature maps and immediately consumed data structures).
Then, we apply Gist's lossy encoding, DPR, on top of the lossless optimizations and measure the additional reduction in memory footprint. For the lossy encoding, we choose the smallest floating point representation that does not affect the training accuracy (detailed in Section V-D1). The findings of this experiment are presented in Figure 8.

Figure 8 shows the Memory Footprint Ratio (MFR) achieved by the Lossless and Lossy optimizations when compared to the CNTK baseline. We observe that the lossless optimizations result in an MFR of more than 1.5x for AlexNet and VGG16 (1.4x on average). DPR, on top of lossless, further reduces the total memory footprint, achieving an MFR of up to 2x for AlexNet, with an average of 1.8x. This experiment shows that Gist optimizations result in significant memory footprint reductions, making it possible to fit a network that is deeper or wider, or one that uses a larger minibatch size.

Performance Overhead. Next, we measure the performance overhead introduced by the Gist encodings. We run the same experiment again, measuring the execution time of processing a minibatch, averaged over 5 minutes of training time (hundreds of minibatches). The findings of this experiment are presented in Figure 9. The figure shows the performance degradation for the Lossless and Lossy encodings, with error bars capturing the performance variation. We observe minimal performance degradation across CNNs, resulting in an average 3% degradation for lossless and 4% for the lossy and lossless optimizations combined, with a maximum overhead of 7% for VGG16 when both lossy and lossless optimizations are applied.

Figure 9: Performance overhead of Gist encodings relative to the CNTK baseline (Lossless and Lossless + Lossy for AlexNet, NiN, Overfeat, VGG16, Inception, and geomean). Gist results in minimal performance overhead.

This shows that Gist achieves a significant MFR with minimal performance overhead.

C. Lossless Encodings

In this section, we evaluate the impact of the lossless techniques on memory footprint and performance.

1) Impact on Memory Footprint: In this experiment, we apply the Gist lossless encodings, Binarize and SSDC, in isolation and evaluate how they affect the memory consumed by stashed feature maps and immediately consumed data structures. Then, we study the same effect when both encodings are applied together. Finally, we evaluate the inplace optimization, which targets the immediately consumed data structures. The findings of this experiment are presented in Figure 10.

Figure 10: Impact of the Gist lossless techniques (S: SSDC, B: Binarize, I: Inplace) on the memory footprint of different data structure categories (ReLU-Pool, ReLU/Pool-Conv, other feature maps, and immediately consumed). The total MFR for each configuration (S, B, S+B, S+B+I) is shown at the top of each bar: 1.06x/1.26x/1.35x/1.56x for AlexNet, 1.14x/1.24x/1.47x/1.47x for NiN, 1.1x/1.3x/1.46x/1.55x for Overfeat, 1.16x/1.34x/1.65x/1.65x for VGG16, and 1.04x/1.09x/1.14x/1.17x for Inception.

We perform this study on the investigation baseline, where stashed feature maps are not allowed to participate in memory sharing, allowing us to study the impact of the encodings in isolation. When an encoding is applied, the stashed feature map is converted to an immediately consumed data structure (possibly increasing its footprint), and the much smaller encoded data structure is now stashed for the backward pass, as discussed in Section IV-B. The figure shows this effect by breaking down the total memory footprint into four regions: ReLU/Pool-Conv (suitable for SSDC), ReLU-Pool (Binarize), other feature maps (untouched in this experiment, as they are suitable for DPR), and immediately consumed. The first bar shows the breakdown across these categories for the baseline. Then, we apply the SSDC encoding, which reduces the ReLU/Pool-Conv footprint significantly and slightly increases the immediately consumed memory footprint.
We observe that, as expected, Binarize results in significant memory savings, reaching close to 16 MFR (32 for ReLU output and at least 8 for pool output). Reduction for SSDC varies significantly across CNNs, providing upto 7 MFR for Overfeat. Finally, inplace optimization results in upto 1.4 MFR for AlexNet. Inplace optimization does not always reduce the total memory footprint, because the affected immediately consumed data structure might not be a heavy hitter in the memory groups formed by CNTK memory allocator (Section IV-C). 2) Impact on Performance: To evaluate the performance overhead of lossless techniques, we run the previous experiment and measure the performance overhead. We observe that Binarize, instead of showing performance degradation, results in small performance improvement. This is because ReLU backward pass calculations are memory bandwidth bound and Binarize encoding increases effective memory bandwidth by representing the data in 1-bit format. We observe that SSDC encoding results in small performance overhead upto 4% on average. The combination of two lossless encodings result in slightly better performance than just SSDC encoding alone, because of the better performing Binarize encoding. And, finally, inplace optimization has no effect on performance as it does not incur any encoding or decoding overhead. D. Lossy Encodings In this section, we study the impact of lossy encodings on accuracy, memory footprint, and performance.

Figure 12: Impact of the DPR encoding on accuracy (training accuracy loss vs. number of epochs for AlexNet, Overfeat, VGG16, and Inception, comparing Baseline-FP32, All-FP16, Gist-FP16, Gist-FP10, and Gist-FP8). The smallest format having the same accuracy as the FP32 format is FP8 for AlexNet and Overfeat, FP10 for Inception, and FP16 for VGG16; DPR achieves aggressive bit savings.

Figure 13: Impact of the DPR encodings on the memory footprint (stashed feature maps vs. immediately consumed data) for AlexNet, NiN, Overfeat, VGG16, and Inception. The total MFR is shown at the top and the MFR achieved on the stashed feature maps at the bottom of each bar. The lowest precision for VGG16 with no loss in accuracy is FP16.

1) Impact on Accuracy: First, we study the effect of applying precision reduction on the training accuracy. In this experiment, we first train the network in FP32 precision (shown as Baseline-FP32). Second, in line with previous research [15], [8], [16], we represent all the data structures throughout the network in FP16 format and then train the network (shown as All-FP16). Finally, we train the network using the FP16, FP10 and FP8 DPR encodings (shown as Gist-FP*). The computation is still performed with FP32 precision; the values are converted back to FP32 format just before the computation. The findings of this experiment are shown in Figure 12, which shows the training accuracy loss for the different networks. A training accuracy loss curve captures training accuracy over time: at the start, the network has nearly 100% accuracy loss (almost 0% accuracy), and over time the accuracy improves and the accuracy loss correspondingly reduces. For example, at the 90th epoch, Overfeat reaches a 22% accuracy loss, i.e., Overfeat's accuracy is 78% (100% - 22%). The figure compares the deviation between the accuracy of the CNTK baseline (FP32) and the Gist-FP16/10/8 DPR encodings. Note that the y-axis is not the accuracy loss caused by the Gist encodings; instead, it measures the network's achieved accuracy as the training epochs increase. Therefore, for a DPR encoding to be as accurate as the FP32 baseline, the curves should overlap.

There are two key observations from this figure. First, representing all data structures in 16 bits leads to severe accuracy losses. This is because the precision reduction is applied immediately after each layer output is computed, propagating the error in the forward pass and resulting in severe accuracy losses. Second, applying precision reduction only for the backward use of stashed feature maps, as done in DPR, can result in aggressive bit savings (as low as 8 bits for AlexNet and Overfeat). For Inception, we observe that when FP8 is applied, the network stops training, but FP10 has enough precision to result in no accuracy losses. VGG16 needs the highest precision and does not train with a representation smaller than FP16, showing that the minimum acceptable precision is network dependent.

Figure 14: SSDC sensitivity to sparsity (15 epochs, VGG16): memory footprint ratio for the total and for the individual pool and ReLU feature maps (pool1 through relu5_1).

2) Impact on Footprint Reduction: In this section, we evaluate how much MFR these efficient representations achieve.
The experiment involves running DPR FP16 and the next smallest representation (FP10/FP8) that has no accuracy loss, and measuring the total memory footprint. The findings of this experiment are presented in Figure 13, showing the MFR against the investigation baseline. The DPR encoding converts the stashed feature maps into immediately consumed data and stashes the encoded feature map for the backward pass. To see this effect, we break the total memory consumption into stashed feature maps and immediately consumed data. When the DPR encoding uses FP16, the stashed feature maps are compressed 2x, with some increase in the immediately consumed footprint, resulting in a total MFR of 1.18x for AlexNet, as an example. FP8 further cuts down the memory footprint, resulting in an MFR of 4x for the stashed feature maps and a total MFR of 1.48x for AlexNet. As shown previously, FP8 leads to poor accuracy for VGG16, and thus we omit FP8 results for VGG16.

3) Performance Overhead: Next, we evaluate the performance overhead of the DPR encoding. For this study, we run the last experiment again and measure the execution time of processing a minibatch, averaged over 5 minutes of training. We observe that the highly parallel DPR encoding implementation has minimal performance overhead, with an average of 1%.
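For reference, the sketch below shows one possible CPU implementation of the FP8 (1 sign, 4 exponent, 3 mantissa bits) conversion used by DPR, with round-to-nearest rounding, clamping at the format's largest magnitude, and denormals flushed to zero, as described in Section IV-A. The exponent bias of 7 and the choice not to reserve an all-ones exponent are assumptions, since only the field widths are specified; the actual conversions are CUDA kernels, and FP10 and FP16 follow the same pattern with wider fields.

```cpp
// CPU sketch of an FP32 <-> FP8 (1-4-3) conversion in the spirit of DPR.
// Assumptions: exponent bias 7, no reserved all-ones exponent, denormals
// flushed to zero, round-to-nearest with clamping at the largest magnitude.
#include <cmath>
#include <cstdint>
#include <cstdio>
#include <cstring>

uint8_t fp32_to_fp8(float f) {
    uint32_t bits;
    std::memcpy(&bits, &f, sizeof bits);
    uint8_t  sign = static_cast<uint8_t>(bits >> 31);
    int32_t  exp  = static_cast<int32_t>((bits >> 23) & 0xFF);
    uint32_t man  = bits & 0x7FFFFF;
    if (exp == 0) return static_cast<uint8_t>(sign << 7);   // zero/denormal -> signed zero
    int32_t  e8      = (exp - 127) + 7;                      // rebias to the 4-bit exponent
    uint32_t rounded = man + 0x80000;                        // round-to-nearest on 3 kept bits
    if (rounded >> 23) { e8 += 1; rounded = 0; }             // rounding carried into exponent
    uint32_t m8 = (rounded >> 20) & 0x7;
    if (e8 <= 0) return static_cast<uint8_t>(sign << 7);     // underflow -> zero
    if (e8 > 15) { e8 = 15; m8 = 7; }                        // clamp at largest magnitude
    return static_cast<uint8_t>((sign << 7) | (e8 << 3) | m8);
}

float fp8_to_fp32(uint8_t v) {
    int sign = v >> 7;
    int e8   = (v >> 3) & 0xF;
    int m8   = v & 0x7;
    if (e8 == 0) return sign ? -0.0f : 0.0f;
    float mag = std::ldexp(1.0f + m8 / 8.0f, e8 - 7);
    return sign ? -mag : mag;
}

int main() {
    const float samples[] = {0.5f, -1.25f, 3.0f, 1.9375f, 1000.0f};
    for (float s : samples)
        std::printf("%10.4f -> %10.4f\n", s, fp8_to_fp32(fp32_to_fp8(s)));
    return 0;
}
```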


More information

Inception and Residual Networks. Hantao Zhang. Deep Learning with Python.

Inception and Residual Networks. Hantao Zhang. Deep Learning with Python. Inception and Residual Networks Hantao Zhang Deep Learning with Python https://en.wikipedia.org/wiki/residual_neural_network Deep Neural Network Progress from Large Scale Visual Recognition Challenge (ILSVRC)

More information

Research Faculty Summit Systems Fueling future disruptions

Research Faculty Summit Systems Fueling future disruptions Research Faculty Summit 2018 Systems Fueling future disruptions Wolong: A Back-end Optimizer for Deep Learning Computation Jilong Xue Researcher, Microsoft Research Asia System Challenge in Deep Learning

More information

arxiv: v1 [cs.dc] 8 Jun 2018

arxiv: v1 [cs.dc] 8 Jun 2018 PipeDream: Fast and Efficient Pipeline Parallel DNN Training Aaron Harlap Deepak Narayanan Amar Phanishayee Vivek Seshadri Nikhil Devanur Greg Ganger Phil Gibbons Microsoft Research Carnegie Mellon University

More information

A performance comparison of Deep Learning frameworks on KNL

A performance comparison of Deep Learning frameworks on KNL A performance comparison of Deep Learning frameworks on KNL R. Zanella, G. Fiameni, M. Rorro Middleware, Data Management - SCAI - CINECA IXPUG Bologna, March 5, 2018 Table of Contents 1. Problem description

More information

Training Deep Neural Networks (in parallel)

Training Deep Neural Networks (in parallel) Lecture 9: Training Deep Neural Networks (in parallel) Visual Computing Systems How would you describe this professor? Easy? Mean? Boring? Nerdy? Professor classification task Classifies professors as

More information

CMU Lecture 18: Deep learning and Vision: Convolutional neural networks. Teacher: Gianni A. Di Caro

CMU Lecture 18: Deep learning and Vision: Convolutional neural networks. Teacher: Gianni A. Di Caro CMU 15-781 Lecture 18: Deep learning and Vision: Convolutional neural networks Teacher: Gianni A. Di Caro DEEP, SHALLOW, CONNECTED, SPARSE? Fully connected multi-layer feed-forward perceptrons: More powerful

More information

Multi-dimensional Parallel Training of Winograd Layer on Memory-Centric Architecture

Multi-dimensional Parallel Training of Winograd Layer on Memory-Centric Architecture The 51st Annual IEEE/ACM International Symposium on Microarchitecture Multi-dimensional Parallel Training of Winograd Layer on Memory-Centric Architecture Byungchul Hong Yeonju Ro John Kim FuriosaAI Samsung

More information

Deep Learning with Tensorflow AlexNet

Deep Learning with Tensorflow   AlexNet Machine Learning and Computer Vision Group Deep Learning with Tensorflow http://cvml.ist.ac.at/courses/dlwt_w17/ AlexNet Krizhevsky, Alex, Ilya Sutskever, and Geoffrey E. Hinton, "Imagenet classification

More information

How to Estimate the Energy Consumption of Deep Neural Networks

How to Estimate the Energy Consumption of Deep Neural Networks How to Estimate the Energy Consumption of Deep Neural Networks Tien-Ju Yang, Yu-Hsin Chen, Joel Emer, Vivienne Sze MIT 1 Problem of DNNs Recognition Smart Drone AI Computation DNN 15k 300k OP/Px DPM 0.1k

More information

Scaling Distributed Machine Learning

Scaling Distributed Machine Learning Scaling Distributed Machine Learning with System and Algorithm Co-design Mu Li Thesis Defense CSD, CMU Feb 2nd, 2017 nx min w f i (w) Distributed systems i=1 Large scale optimization methods Large-scale

More information

Scalable and Modularized RTL Compilation of Convolutional Neural Networks onto FPGA

Scalable and Modularized RTL Compilation of Convolutional Neural Networks onto FPGA Scalable and Modularized RTL Compilation of Convolutional Neural Networks onto FPGA Yufei Ma, Naveen Suda, Yu Cao, Jae-sun Seo, Sarma Vrudhula School of Electrical, Computer and Energy Engineering School

More information

Layer-wise Performance Bottleneck Analysis of Deep Neural Networks

Layer-wise Performance Bottleneck Analysis of Deep Neural Networks Layer-wise Performance Bottleneck Analysis of Deep Neural Networks Hengyu Zhao, Colin Weinshenker*, Mohamed Ibrahim*, Adwait Jog*, Jishen Zhao University of California, Santa Cruz, *The College of William

More information

Deep Face Recognition. Nathan Sun

Deep Face Recognition. Nathan Sun Deep Face Recognition Nathan Sun Why Facial Recognition? Picture ID or video tracking Higher Security for Facial Recognition Software Immensely useful to police in tracking suspects Your face will be an

More information

Google Workloads for Consumer Devices: Mitigating Data Movement Bottlenecks Amirali Boroumand

Google Workloads for Consumer Devices: Mitigating Data Movement Bottlenecks Amirali Boroumand Google Workloads for Consumer Devices: Mitigating Data Movement Bottlenecks Amirali Boroumand Saugata Ghose, Youngsok Kim, Rachata Ausavarungnirun, Eric Shiu, Rahul Thakur, Daehyun Kim, Aki Kuusela, Allan

More information

Binary Convolutional Neural Network on RRAM

Binary Convolutional Neural Network on RRAM Binary Convolutional Neural Network on RRAM Tianqi Tang, Lixue Xia, Boxun Li, Yu Wang, Huazhong Yang Dept. of E.E, Tsinghua National Laboratory for Information Science and Technology (TNList) Tsinghua

More information

OPTIMIZED GPU KERNELS FOR DEEP LEARNING. Amir Khosrowshahi

OPTIMIZED GPU KERNELS FOR DEEP LEARNING. Amir Khosrowshahi OPTIMIZED GPU KERNELS FOR DEEP LEARNING Amir Khosrowshahi GTC 17 Mar 2015 Outline About nervana Optimizing deep learning at assembler level Limited precision for deep learning neon benchmarks 2 About nervana

More information

Compressing DMA Engine: Leveraging Activation Sparsity for Training Deep Neural Networks

Compressing DMA Engine: Leveraging Activation Sparsity for Training Deep Neural Networks Compressing DMA Engine: Leveraging Activation Sparsity for Training Deep Neural Networks Minsoo Rhu, Mike O Connor, Niladrish Chatterjee, Jeff Pool, Youngeun Kwon, and Stephen W. Keckler POSTECH, NVIDIA

More information

Know your data - many types of networks

Know your data - many types of networks Architectures Know your data - many types of networks Fixed length representation Variable length representation Online video sequences, or samples of different sizes Images Specific architectures for

More information

Nvidia Jetson TX2 and its Software Toolset. João Fernandes 2017/2018

Nvidia Jetson TX2 and its Software Toolset. João Fernandes 2017/2018 Nvidia Jetson TX2 and its Software Toolset João Fernandes 2017/2018 In this presentation Nvidia Jetson TX2: Hardware Nvidia Jetson TX2: Software Machine Learning: Neural Networks Convolutional Neural Networks

More information

Dynamic Routing Between Capsules

Dynamic Routing Between Capsules Report Explainable Machine Learning Dynamic Routing Between Capsules Author: Michael Dorkenwald Supervisor: Dr. Ullrich Köthe 28. Juni 2018 Inhaltsverzeichnis 1 Introduction 2 2 Motivation 2 3 CapusleNet

More information

Xilinx ML Suite Overview

Xilinx ML Suite Overview Xilinx ML Suite Overview Yao Fu System Architect Data Center Acceleration Xilinx Accelerated Computing Workloads Machine Learning Inference Image classification and object detection Video Streaming Frame

More information

Bandwidth-Centric Deep Learning Processing through Software-Hardware Co-Design

Bandwidth-Centric Deep Learning Processing through Software-Hardware Co-Design Bandwidth-Centric Deep Learning Processing through Software-Hardware Co-Design Song Yao 姚颂 Founder & CEO DeePhi Tech 深鉴科技 song.yao@deephi.tech Outline - About DeePhi Tech - Background - Bandwidth Matters

More information

Convolutional Neural Networks

Convolutional Neural Networks NPFL114, Lecture 4 Convolutional Neural Networks Milan Straka March 25, 2019 Charles University in Prague Faculty of Mathematics and Physics Institute of Formal and Applied Linguistics unless otherwise

More information

Unified Deep Learning with CPU, GPU, and FPGA Technologies

Unified Deep Learning with CPU, GPU, and FPGA Technologies Unified Deep Learning with CPU, GPU, and FPGA Technologies Allen Rush 1, Ashish Sirasao 2, Mike Ignatowski 1 1: Advanced Micro Devices, Inc., 2: Xilinx, Inc. Abstract Deep learning and complex machine

More information

Maximizing Server Efficiency from μarch to ML accelerators. Michael Ferdman

Maximizing Server Efficiency from μarch to ML accelerators. Michael Ferdman Maximizing Server Efficiency from μarch to ML accelerators Michael Ferdman Maximizing Server Efficiency from μarch to ML accelerators Michael Ferdman Maximizing Server Efficiency with ML accelerators Michael

More information

TETRIS: Scalable and Efficient Neural Network Acceleration with 3D Memory

TETRIS: Scalable and Efficient Neural Network Acceleration with 3D Memory TETRIS: Scalable and Efficient Neural Network Acceleration with 3D Memory Mingyu Gao, Jing Pu, Xuan Yang, Mark Horowitz, Christos Kozyrakis Stanford University Platform Lab Review Feb 2017 Deep Neural

More information

Computer Architecture A Quantitative Approach, Fifth Edition. Chapter 2. Memory Hierarchy Design. Copyright 2012, Elsevier Inc. All rights reserved.

Computer Architecture A Quantitative Approach, Fifth Edition. Chapter 2. Memory Hierarchy Design. Copyright 2012, Elsevier Inc. All rights reserved. Computer Architecture A Quantitative Approach, Fifth Edition Chapter 2 Memory Hierarchy Design 1 Introduction Programmers want unlimited amounts of memory with low latency Fast memory technology is more

More information

COMP 551 Applied Machine Learning Lecture 16: Deep Learning

COMP 551 Applied Machine Learning Lecture 16: Deep Learning COMP 551 Applied Machine Learning Lecture 16: Deep Learning Instructor: Ryan Lowe (ryan.lowe@cs.mcgill.ca) Slides mostly by: Class web page: www.cs.mcgill.ca/~hvanho2/comp551 Unless otherwise noted, all

More information

Spatial Localization and Detection. Lecture 8-1

Spatial Localization and Detection. Lecture 8-1 Lecture 8: Spatial Localization and Detection Lecture 8-1 Administrative - Project Proposals were due on Saturday Homework 2 due Friday 2/5 Homework 1 grades out this week Midterm will be in-class on Wednesday

More information

Towards a Uniform Template-based Architecture for Accelerating 2D and 3D CNNs on FPGA

Towards a Uniform Template-based Architecture for Accelerating 2D and 3D CNNs on FPGA Towards a Uniform Template-based Architecture for Accelerating 2D and 3D CNNs on FPGA Junzhong Shen, You Huang, Zelong Wang, Yuran Qiao, Mei Wen, Chunyuan Zhang National University of Defense Technology,

More information

Two FPGA-DNN Projects: 1. Low Latency Multi-Layer Perceptrons using FPGAs 2. Acceleration of CNN Training on FPGA-based Clusters

Two FPGA-DNN Projects: 1. Low Latency Multi-Layer Perceptrons using FPGAs 2. Acceleration of CNN Training on FPGA-based Clusters Two FPGA-DNN Projects: 1. Low Latency Multi-Layer Perceptrons using FPGAs 2. Acceleration of CNN Training on FPGA-based Clusters *Argonne National Lab +BU & USTC Presented by Martin Herbordt Work by Ahmed

More information

Flow-Based Video Recognition

Flow-Based Video Recognition Flow-Based Video Recognition Jifeng Dai Visual Computing Group, Microsoft Research Asia Joint work with Xizhou Zhu*, Yuwen Xiong*, Yujie Wang*, Lu Yuan and Yichen Wei (* interns) Talk pipeline Introduction

More information

Deep Learning Benchmarks Mumtaz Vauhkonen, Quaizar Vohra, Saurabh Madaan Collaboration with Adam Coates, Stanford Unviersity

Deep Learning Benchmarks Mumtaz Vauhkonen, Quaizar Vohra, Saurabh Madaan Collaboration with Adam Coates, Stanford Unviersity Deep Learning Benchmarks Mumtaz Vauhkonen, Quaizar Vohra, Saurabh Madaan Collaboration with Adam Coates, Stanford Unviersity Abstract: This project aims at creating a benchmark for Deep Learning (DL) algorithms

More information

Index. Springer Nature Switzerland AG 2019 B. Moons et al., Embedded Deep Learning,

Index. Springer Nature Switzerland AG 2019 B. Moons et al., Embedded Deep Learning, Index A Algorithmic noise tolerance (ANT), 93 94 Application specific instruction set processors (ASIPs), 115 116 Approximate computing application level, 95 circuits-levels, 93 94 DAS and DVAS, 107 110

More information

Frequency Domain Acceleration of Convolutional Neural Networks on CPU-FPGA Shared Memory System

Frequency Domain Acceleration of Convolutional Neural Networks on CPU-FPGA Shared Memory System Frequency Domain Acceleration of Convolutional Neural Networks on CPU-FPGA Shared Memory System Chi Zhang, Viktor K Prasanna University of Southern California {zhan527, prasanna}@usc.edu fpga.usc.edu ACM

More information

Lecture 6: Texturing Part II: Texture Compression and GPU Latency Hiding Mechanisms. Visual Computing Systems CMU , Fall 2014

Lecture 6: Texturing Part II: Texture Compression and GPU Latency Hiding Mechanisms. Visual Computing Systems CMU , Fall 2014 Lecture 6: Texturing Part II: Texture Compression and GPU Latency Hiding Mechanisms Visual Computing Systems Review: mechanisms to reduce aliasing in the graphics pipeline When sampling visibility?! -

More information

Parallel Deep Network Training

Parallel Deep Network Training Lecture 19: Parallel Deep Network Training Parallel Computer Architecture and Programming How would you describe this professor? Easy? Mean? Boring? Nerdy? Professor classification task Classifies professors

More information

Profiling DNN Workloads on a Volta-based DGX-1 System

Profiling DNN Workloads on a Volta-based DGX-1 System Profiling DNN Workloads on a Volta-based DGX-1 System Saiful A. Mojumder 1, Marcia S Louis 1, Yifan Sun 2, Amir Kavyan Ziabari 3, José L. Abellán 4, John Kim 5, David Kaeli 2, Ajay Joshi 1 1 ECE Department,

More information

Parallel Deep Network Training

Parallel Deep Network Training Lecture 26: Parallel Deep Network Training Parallel Computer Architecture and Programming CMU 15-418/15-618, Spring 2016 Tunes Speech Debelle Finish This Album (Speech Therapy) Eat your veggies and study

More information

Memory Bandwidth and Low Precision Computation. CS6787 Lecture 10 Fall 2018

Memory Bandwidth and Low Precision Computation. CS6787 Lecture 10 Fall 2018 Memory Bandwidth and Low Precision Computation CS6787 Lecture 10 Fall 2018 Memory as a Bottleneck So far, we ve just been talking about compute e.g. techniques to decrease the amount of compute by decreasing

More information

Natural Language Processing CS 6320 Lecture 6 Neural Language Models. Instructor: Sanda Harabagiu

Natural Language Processing CS 6320 Lecture 6 Neural Language Models. Instructor: Sanda Harabagiu Natural Language Processing CS 6320 Lecture 6 Neural Language Models Instructor: Sanda Harabagiu In this lecture We shall cover: Deep Neural Models for Natural Language Processing Introduce Feed Forward

More information

Throughput-Optimized OpenCL-based FPGA Accelerator for Large-Scale Convolutional Neural Networks

Throughput-Optimized OpenCL-based FPGA Accelerator for Large-Scale Convolutional Neural Networks Throughput-Optimized OpenCL-based FPGA Accelerator for Large-Scale Convolutional Neural Networks Naveen Suda, Vikas Chandra *, Ganesh Dasika *, Abinash Mohanty, Yufei Ma, Sarma Vrudhula, Jae-sun Seo, Yu

More information

Exploiting GPU Caches in Sparse Matrix Vector Multiplication. Yusuke Nagasaka Tokyo Institute of Technology

Exploiting GPU Caches in Sparse Matrix Vector Multiplication. Yusuke Nagasaka Tokyo Institute of Technology Exploiting GPU Caches in Sparse Matrix Vector Multiplication Yusuke Nagasaka Tokyo Institute of Technology Sparse Matrix Generated by FEM, being as the graph data Often require solving sparse linear equation

More information

Decentralized and Distributed Machine Learning Model Training with Actors

Decentralized and Distributed Machine Learning Model Training with Actors Decentralized and Distributed Machine Learning Model Training with Actors Travis Addair Stanford University taddair@stanford.edu Abstract Training a machine learning model with terabytes to petabytes of

More information

Challenges in large-scale graph processing on HPC platforms and the Graph500 benchmark. by Nkemdirim Dockery

Challenges in large-scale graph processing on HPC platforms and the Graph500 benchmark. by Nkemdirim Dockery Challenges in large-scale graph processing on HPC platforms and the Graph500 benchmark by Nkemdirim Dockery High Performance Computing Workloads Core-memory sized Floating point intensive Well-structured

More information

Digital Communication Prof. Bikash Kumar Dey Department of Electrical Engineering Indian Institute of Technology, Bombay

Digital Communication Prof. Bikash Kumar Dey Department of Electrical Engineering Indian Institute of Technology, Bombay Digital Communication Prof. Bikash Kumar Dey Department of Electrical Engineering Indian Institute of Technology, Bombay Lecture - 29 Source Coding (Part-4) We have already had 3 classes on source coding

More information

Research Faculty Summit Systems Fueling future disruptions

Research Faculty Summit Systems Fueling future disruptions Research Faculty Summit 2018 Systems Fueling future disruptions Efficient Edge Computing for Deep Neural Networks and Beyond Vivienne Sze In collaboration with Yu-Hsin Chen, Joel Emer, Tien-Ju Yang, Sertac

More information

Built for Speed: Comparing Panoply and Amazon Redshift Rendering Performance Utilizing Tableau Visualizations

Built for Speed: Comparing Panoply and Amazon Redshift Rendering Performance Utilizing Tableau Visualizations Built for Speed: Comparing Panoply and Amazon Redshift Rendering Performance Utilizing Tableau Visualizations Table of contents Faster Visualizations from Data Warehouses 3 The Plan 4 The Criteria 4 Learning

More information

Performance impact of dynamic parallelism on different clustering algorithms

Performance impact of dynamic parallelism on different clustering algorithms Performance impact of dynamic parallelism on different clustering algorithms Jeffrey DiMarco and Michela Taufer Computer and Information Sciences, University of Delaware E-mail: jdimarco@udel.edu, taufer@udel.edu

More information

Residual Networks And Attention Models. cs273b Recitation 11/11/2016. Anna Shcherbina

Residual Networks And Attention Models. cs273b Recitation 11/11/2016. Anna Shcherbina Residual Networks And Attention Models cs273b Recitation 11/11/2016 Anna Shcherbina Introduction to ResNets Introduced in 2015 by Microsoft Research Deep Residual Learning for Image Recognition (He, Zhang,

More information

Scalable Distributed Training with Parameter Hub: a whirlwind tour

Scalable Distributed Training with Parameter Hub: a whirlwind tour Scalable Distributed Training with Parameter Hub: a whirlwind tour TVM Stack Optimization High-Level Differentiable IR Tensor Expression IR AutoTVM LLVM, CUDA, Metal VTA AutoVTA Edge FPGA Cloud FPGA ASIC

More information

Memory Bandwidth and Low Precision Computation. CS6787 Lecture 9 Fall 2017

Memory Bandwidth and Low Precision Computation. CS6787 Lecture 9 Fall 2017 Memory Bandwidth and Low Precision Computation CS6787 Lecture 9 Fall 2017 Memory as a Bottleneck So far, we ve just been talking about compute e.g. techniques to decrease the amount of compute by decreasing

More information

Hello Edge: Keyword Spotting on Microcontrollers

Hello Edge: Keyword Spotting on Microcontrollers Hello Edge: Keyword Spotting on Microcontrollers Yundong Zhang, Naveen Suda, Liangzhen Lai and Vikas Chandra ARM Research, Stanford University arxiv.org, 2017 Presented by Mohammad Mofrad University of

More information

Deep Learning and Its Applications

Deep Learning and Its Applications Convolutional Neural Network and Its Application in Image Recognition Oct 28, 2016 Outline 1 A Motivating Example 2 The Convolutional Neural Network (CNN) Model 3 Training the CNN Model 4 Issues and Recent

More information

COMP9444 Neural Networks and Deep Learning 7. Image Processing. COMP9444 c Alan Blair, 2017

COMP9444 Neural Networks and Deep Learning 7. Image Processing. COMP9444 c Alan Blair, 2017 COMP9444 Neural Networks and Deep Learning 7. Image Processing COMP9444 17s2 Image Processing 1 Outline Image Datasets and Tasks Convolution in Detail AlexNet Weight Initialization Batch Normalization

More information

Copyright 2012, Elsevier Inc. All rights reserved.

Copyright 2012, Elsevier Inc. All rights reserved. Computer Architecture A Quantitative Approach, Fifth Edition Chapter 2 Memory Hierarchy Design 1 Introduction Programmers want unlimited amounts of memory with low latency Fast memory technology is more

More information

In Live Computer Vision

In Live Computer Vision EVA 2 : Exploiting Temporal Redundancy In Live Computer Vision Mark Buckler, Philip Bedoukian, Suren Jayasuriya, Adrian Sampson International Symposium on Computer Architecture (ISCA) Tuesday June 5, 2018

More information

Introduction to Video Compression

Introduction to Video Compression Insight, Analysis, and Advice on Signal Processing Technology Introduction to Video Compression Jeff Bier Berkeley Design Technology, Inc. info@bdti.com http://www.bdti.com Outline Motivation and scope

More information

QUANTIZER DESIGN FOR EXPLOITING COMMON INFORMATION IN LAYERED CODING. Mehdi Salehifar, Tejaswi Nanjundaswamy, and Kenneth Rose

QUANTIZER DESIGN FOR EXPLOITING COMMON INFORMATION IN LAYERED CODING. Mehdi Salehifar, Tejaswi Nanjundaswamy, and Kenneth Rose QUANTIZER DESIGN FOR EXPLOITING COMMON INFORMATION IN LAYERED CODING Mehdi Salehifar, Tejaswi Nanjundaswamy, and Kenneth Rose Department of Electrical and Computer Engineering University of California,

More information

Implementing Deep Learning for Video Analytics on Tegra X1.

Implementing Deep Learning for Video Analytics on Tegra X1. Implementing Deep Learning for Video Analytics on Tegra X1 research@hertasecurity.com Index Who we are, what we do Video analytics pipeline Video decoding Facial detection and preprocessing DNN: learning

More information

Threshold-Based Markov Prefetchers

Threshold-Based Markov Prefetchers Threshold-Based Markov Prefetchers Carlos Marchani Tamer Mohamed Lerzan Celikkanat George AbiNader Rice University, Department of Electrical and Computer Engineering ELEC 525, Spring 26 Abstract In this

More information

Training Deep Convolutional Neural Networks with Resistive Cross-Point Devices

Training Deep Convolutional Neural Networks with Resistive Cross-Point Devices Training Deep Convolutional Neural Networks with Resistive Cross-Point Devices Authors: Tayfun Gokmen,* O. Murat Onen, Wilfried Haensch Affiliations IBM T.J. Watson Research Center, Yorktown Heights, NY

More information

All You Want To Know About CNNs. Yukun Zhu

All You Want To Know About CNNs. Yukun Zhu All You Want To Know About CNNs Yukun Zhu Deep Learning Deep Learning Image from http://imgur.com/ Deep Learning Image from http://imgur.com/ Deep Learning Image from http://imgur.com/ Deep Learning Image

More information

GeoImaging Accelerator Pansharpen Test Results. Executive Summary

GeoImaging Accelerator Pansharpen Test Results. Executive Summary Executive Summary After demonstrating the exceptional performance improvement in the orthorectification module (approximately fourteen-fold see GXL Ortho Performance Whitepaper), the same approach has

More information

CS6303 Computer Architecture Regulation 2013 BE-Computer Science and Engineering III semester 2 MARKS

CS6303 Computer Architecture Regulation 2013 BE-Computer Science and Engineering III semester 2 MARKS CS6303 Computer Architecture Regulation 2013 BE-Computer Science and Engineering III semester 2 MARKS UNIT-I OVERVIEW & INSTRUCTIONS 1. What are the eight great ideas in computer architecture? The eight

More information

Adapted from David Patterson s slides on graduate computer architecture

Adapted from David Patterson s slides on graduate computer architecture Mei Yang Adapted from David Patterson s slides on graduate computer architecture Introduction Ten Advanced Optimizations of Cache Performance Memory Technology and Optimizations Virtual Memory and Virtual

More information

Copyright 2012, Elsevier Inc. All rights reserved.

Copyright 2012, Elsevier Inc. All rights reserved. Computer Architecture A Quantitative Approach, Fifth Edition Chapter 2 Memory Hierarchy Design 1 Introduction Introduction Programmers want unlimited amounts of memory with low latency Fast memory technology

More information

Computer Architecture. A Quantitative Approach, Fifth Edition. Chapter 2. Memory Hierarchy Design. Copyright 2012, Elsevier Inc. All rights reserved.

Computer Architecture. A Quantitative Approach, Fifth Edition. Chapter 2. Memory Hierarchy Design. Copyright 2012, Elsevier Inc. All rights reserved. Computer Architecture A Quantitative Approach, Fifth Edition Chapter 2 Memory Hierarchy Design 1 Programmers want unlimited amounts of memory with low latency Fast memory technology is more expensive per

More information

How Much Logic Should Go in an FPGA Logic Block?

How Much Logic Should Go in an FPGA Logic Block? How Much Logic Should Go in an FPGA Logic Block? Vaughn Betz and Jonathan Rose Department of Electrical and Computer Engineering, University of Toronto Toronto, Ontario, Canada M5S 3G4 {vaughn, jayar}@eecgutorontoca

More information

Fei-Fei Li & Justin Johnson & Serena Yeung

Fei-Fei Li & Justin Johnson & Serena Yeung Lecture 9-1 Administrative A2 due Wed May 2 Midterm: In-class Tue May 8. Covers material through Lecture 10 (Thu May 3). Sample midterm released on piazza. Midterm review session: Fri May 4 discussion

More information

Annex 10 - Summary of analysis of differences between frequencies

Annex 10 - Summary of analysis of differences between frequencies Annex 10 - Summary of analysis of differences between frequencies Introduction A10.1 This Annex summarises our refined analysis of the differences that may arise after liberalisation between operators

More information

Deep Learning on Modern Architectures. Keren Zhou 4/17/2017

Deep Learning on Modern Architectures. Keren Zhou 4/17/2017 Deep Learning on Modern Architectures Keren Zhou 4/17/2017 HPC Software Stack Application Algorithm Data Layout CPU GPU MIC Others HPC Software Stack Deep Learning Algorithm Data Layout CPU GPU MIC Others

More information

DEEP NEURAL NETWORKS AND GPUS. Julie Bernauer

DEEP NEURAL NETWORKS AND GPUS. Julie Bernauer DEEP NEURAL NETWORKS AND GPUS Julie Bernauer GPU Computing GPU Computing Run Computations on GPUs x86 CUDA Framework to Program NVIDIA GPUs A simple sum of two vectors (arrays) in C void vector_add(int

More information

CS 426 Parallel Computing. Parallel Computing Platforms

CS 426 Parallel Computing. Parallel Computing Platforms CS 426 Parallel Computing Parallel Computing Platforms Ozcan Ozturk http://www.cs.bilkent.edu.tr/~ozturk/cs426/ Slides are adapted from ``Introduction to Parallel Computing'' Topic Overview Implicit Parallelism:

More information

Design Tradeoffs for Data Deduplication Performance in Backup Workloads

Design Tradeoffs for Data Deduplication Performance in Backup Workloads Design Tradeoffs for Data Deduplication Performance in Backup Workloads Min Fu,DanFeng,YuHua,XubinHe, Zuoning Chen *, Wen Xia,YuchengZhang,YujuanTan Huazhong University of Science and Technology Virginia

More information

Fuzzy Set Theory in Computer Vision: Example 3

Fuzzy Set Theory in Computer Vision: Example 3 Fuzzy Set Theory in Computer Vision: Example 3 Derek T. Anderson and James M. Keller FUZZ-IEEE, July 2017 Overview Purpose of these slides are to make you aware of a few of the different CNN architectures

More information

Exploring the Structure of Data at Scale. Rudy Agovic, PhD CEO & Chief Data Scientist at Reliancy January 16, 2019

Exploring the Structure of Data at Scale. Rudy Agovic, PhD CEO & Chief Data Scientist at Reliancy January 16, 2019 Exploring the Structure of Data at Scale Rudy Agovic, PhD CEO & Chief Data Scientist at Reliancy January 16, 2019 Outline Why exploration of large datasets matters Challenges in working with large data

More information

DELL EMC DATA DOMAIN SISL SCALING ARCHITECTURE

DELL EMC DATA DOMAIN SISL SCALING ARCHITECTURE WHITEPAPER DELL EMC DATA DOMAIN SISL SCALING ARCHITECTURE A Detailed Review ABSTRACT While tape has been the dominant storage medium for data protection for decades because of its low cost, it is steadily

More information

Lecture 16. Today: Start looking into memory hierarchy Cache$! Yay!

Lecture 16. Today: Start looking into memory hierarchy Cache$! Yay! Lecture 16 Today: Start looking into memory hierarchy Cache$! Yay! Note: There are no slides labeled Lecture 15. Nothing omitted, just that the numbering got out of sequence somewhere along the way. 1

More information

ECE5775 High-Level Digital Design Automation, Fall 2018 School of Electrical Computer Engineering, Cornell University

ECE5775 High-Level Digital Design Automation, Fall 2018 School of Electrical Computer Engineering, Cornell University ECE5775 High-Level Digital Design Automation, Fall 2018 School of Electrical Computer Engineering, Cornell University Lab 4: Binarized Convolutional Neural Networks Due Wednesday, October 31, 2018, 11:59pm

More information

CENG 783. Special topics in. Deep Learning. AlchemyAPI. Week 11. Sinan Kalkan

CENG 783. Special topics in. Deep Learning. AlchemyAPI. Week 11. Sinan Kalkan CENG 783 Special topics in Deep Learning AlchemyAPI Week 11 Sinan Kalkan TRAINING A CNN Fig: http://www.robots.ox.ac.uk/~vgg/practicals/cnn/ Feed-forward pass Note that this is written in terms of the

More information

Convolutional Neural Network Layer Reordering for Acceleration

Convolutional Neural Network Layer Reordering for Acceleration R1-15 SASIMI 2016 Proceedings Convolutional Neural Network Layer Reordering for Acceleration Vijay Daultani Subhajit Chaudhury Kazuhisa Ishizaka System Platform Labs Value Co-creation Center System Platform

More information