Gregex: GPU based High Speed Regular Expression Matching Engine


2011 Fifth International Conference on Innovative Mobile and Internet Services in Ubiquitous Computing

Gregex: GPU based High Speed Regular Expression Matching Engine

Lei Wang 1, Shuhui Chen 2, Yong Tang 3, Jinshu Su 4
School of Computer Science, National University of Defense Technology, Changsha, China
1 wangleinuts@gmail.com 2 csh999@263.net 3 ytang@nudt.edu.cn 4 sjs@nudt.edu.cn

Abstract— A regular expression matching engine is a crucial piece of infrastructure, widely used in network security systems such as IDS. We propose Gregex, a Graphics Processing Unit (GPU) based regular expression matching engine for deep packet inspection (DPI). Gregex leverages the computational power and high memory bandwidth of GPUs by storing data in the appropriate GPU memory spaces and executing massive numbers of GPU threads concurrently to process many packets in parallel. Three optimization techniques, ATP, CAB, and CAT, are proposed to significantly improve the performance of Gregex. On a GTX260 GPU, Gregex achieves a regular expression matching throughput of Gbps, a 210× speedup over a traditional CPU-based implementation and a 7.9× speedup over the state-of-the-art GPU based regular expression engine.

I. INTRODUCTION

Signature-based deep packet inspection (DPI) is one of the most important mechanisms in network security systems today. DPI inspects entire packets traveling through the network in real time to detect threats such as intrusions, worms, viruses, and spam. Regular expressions are widely used to describe DPI signatures because they are much more expressive and flexible than simple strings. Network intrusion detection systems (NIDS), such as Snort [1], use regular expressions to describe more complicated signatures. Due to the limited computational power of CPUs and the high latency of I/O access [2], pure software implementations of regular expression matching engines cannot satisfy the performance requirements of DPI. A possible solution is to offload regular expression matching to hardware platforms [3], [4], [5], [6], such as ASICs, FPGAs, and NPs. Hardware-based solutions can achieve high performance, but they are complex and not flexible enough. Modern GPUs are specialized for compute-intensive, highly parallel computation. GPUs are also cheaper and more programmable than other hardware platforms.

In this paper, we propose Gregex, a high speed GPU based regular expression matching engine for DPI. In Gregex, the DFA state transition table compiled from the regular expressions resides in the GPU's texture memory, and a large number of packets are copied to the GPU's global memory for matching. Massive numbers of GPU threads run concurrently, with each GPU thread matching one packet. We describe three optimization techniques for Gregex. On a GTX260 device, Gregex achieves a regular expression matching throughput of Gbps, which is about 210× faster than a traditional CPU implementation [7] and 7.9× faster than the solution proposed in [8].

The rest of this paper is organized as follows. In Section II, we present background knowledge and related work on GPU based regular expression matching techniques. The design and optimization of Gregex are introduced in Section III. The performance results are evaluated in Section IV. Finally, we conclude our work in Section V.

II. BACKGROUND

A. Regular Expression Matching Techniques

Regular expression matching engines can be based on either nondeterministic finite automata (NFA) or deterministic finite automata (DFA). In DPI, DFA approaches are preferred for better performance.
In DFA approaches, a set of regular expressions is usually converted into one DFA by first compiling them into an NFA using the Thompson algorithm [9] and then converting the NFA to a DFA using the subset construction algorithm. Given the compiled DFA and an input string representing the network traffic, DPI needs to decide whether the DFA accepts the input string. A DFA is represented by a state transition table and a state acceptance table. The state transition table is a two-dimensional matrix whose width and height are equal to the size of the alphabet and the number of states in the DFA, respectively. Each cell of the state transition table contains the next state to move to in the DFA. The state acceptance table is a one-dimensional array whose length is equal to the number of states in the DFA. Each cell of the state acceptance table indicates whether the corresponding state is an accepting state or not. DFA matching requires two state table lookups (two memory accesses) per input byte: getting the next state and deciding whether it is an accepting state. In modern CPUs, one memory access may take many cycles to return a result. In contrast, when a GPU performs DFA matching, the concurrent execution of massive numbers of threads can hide the memory access latency efficiently.
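To make the two-lookups-per-byte scheme concrete, the following is a minimal sketch of sequential DFA matching over one input buffer; the function and table names (match_dfa, trans, accept) are illustrative assumptions, not code from the paper.

/* Sequential DFA matching: two table lookups per input byte.
 * trans:  num_states x 256 state transition table (row-major)
 * accept: per-state flag, nonzero if the state is an accepting state
 * Returns the position just past the first match, or -1 if none occurs. */
int match_dfa(const int *trans, const unsigned char *accept,
              const unsigned char *input, int len)
{
    int state = 0;                              /* DFA start state */
    for (int i = 0; i < len; i++) {
        state = trans[state * 256 + input[i]];  /* lookup 1: next state */
        if (accept[state])                      /* lookup 2: accepting? */
            return i + 1;
    }
    return -1;
}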

B. The CUDA Programming Model

We briefly review CUDA, which defines the architecture and programming model for NVIDIA GPUs. We focus on GeForce GTX 200 series GPUs; more information can be found in the CUDA documentation [10], [11].

GPU Architecture: The GeForce GTX 200 series GPUs are based on a reengineered, enhanced, and extended Scalable Processor Array (SPA) architecture which consists of 10 Thread Processing Clusters (TPCs). Each TPC is in turn made up of 3 Streaming Multiprocessors (SMs), and each SM contains 8 Streaming Processors (SPs). Every SM also includes texture filtering processors used in graphics processing. The GPU's compute architecture is SIMT (single instruction, multiple threads) for execution across each SM. SIMT improves upon pure SIMD (single instruction, multiple data) designs in both performance and ease of programmability.

Programming Model: In the CUDA model, the data-parallel portions of an application are expressed as device kernels which run on many threads. CUDA threads execute on the device (GPU), which operates as a coprocessor to the host (CPU) running the C program. A CUDA kernel is executed as a grid of thread blocks. The number of threads per block and the number of blocks per grid are specified by the programmer. Threads within a block can cooperate via shared memory, atomic operations, and barrier synchronization. All threads within a block are executed concurrently on an SM; several blocks can execute concurrently on an SM.

Memory Hierarchy: CUDA devices use several memory spaces, which have different characteristics that reflect their distinct usages in CUDA applications. In addition to a number of 32-bit registers shared across all the active threads, each multiprocessor carries 16 KB of on-chip shared memory. The off-chip global memory connected to each SM offers large capacity and high transfer bandwidth, but also high access latency. There are also two additional read-only memory spaces, accessible by all threads, that provide the additional benefit of hardware caching: the constant and texture memory spaces. The global, constant, and texture memory spaces are optimized for different memory usages, but their effectiveness cannot be guaranteed.

C. GPU based Regular Expression Matching Engines

Randy Smith et al. proposed a programmable signature matching system prototyped on an NVIDIA G80 GPU [12]. They made a detailed analysis of the regular control flow and of the parallelism available at the packet level. Two types of automata for regular expression matching were examined in their work: standard DFA and extended finite automata (XFA) [13], [14]. The XFA approach uses less memory than DFA but has a more complex execution control flow, which can impact GPU performance by causing threads of the same warp to diverge. Evaluation shows the GPU based prototype achieves a speedup of 6× to 9× compared to an implementation on a Pentium 4 CPU.

Giorgos Vasiliadis et al. presented a regular expression matching engine based on the GPU [8]. In their work, regular expressions were compiled separately, and whole packets were processed by each thread in isolation. The experimental results show that regular expression matching on an NVIDIA GeForce 9800 GX2 GPU can achieve up to 16 Gbps of raw processing throughput, a 48× speedup over CPU implementations. Furthermore, they extended the architecture of Gnort [7] by adding a GPU-assisted regular expression matching engine.
The overall processing throughput of Snort was increased by a factor of eight compared to the default implementation.

Shuai Mu et al. proposed GPU based solutions for a series of core IP routing applications [15]. In their work, they implemented a finite automata based regular expression matching algorithm for the deep packet inspection application. On an NVIDIA GTX280 GPU, the proposed regular expression matching algorithm achieves a matching throughput of up to 9.3 Gbps and an overall throughput of 3.2 Gbps.

III. THE PROPOSED GREGEX

A. Framework

The framework of Gregex is depicted in Fig. 1.

Fig. 1. Framework of Gregex, which uses a GTX 260 GPU: the packet buffer and result buffer reside in global memory, and the state table resides in texture memory, read by the SPs of each SM.

In Gregex, packets are stored in the GPU's global memory; the DFA state transition table resides in the GPU's texture memory. Texture memory has a hardware cache, so the DFA state transition table lookup latency can be significantly reduced. In Gregex, packets are processed in batches. Each thread processes one of the packets in isolation. Whenever a match occurs, the thread stores the regular expression's ID in the matching result buffer. The matching result buffer is a one-dimensional array allocated in global device memory; the size of the array is equal to the number of packets that are processed by the GPU at a time, as shown in Fig. 2(b).

B. Workflow

The packet processing workflow in Gregex can be divided into three phases: a pre-processing phase, a signature matching phase, and a post-processing phase. The pre-processing and post-processing phases, run by CPU threads, transfer packets from the CPU to the GPU and retrieve match results from GPU memory, respectively. The signature matching phase is run by GPU threads and performs the regular expression matching.
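As a concrete illustration of this three-phase workflow, the following is a minimal synchronous host-side sketch; the buffer sizes, the kernel name match_kernel, and its stub body are illustrative assumptions, not code from the paper.

/* Host-side workflow sketch: pre-processing (batch packet transfer),
 * signature matching (kernel launch), post-processing (result copy-back).
 * Illustrative only; error handling omitted. */
#include <cuda_runtime.h>

#define NUM_PACKETS 4096
#define PACKET_LEN  2048   /* fixed 2 KB slot per packet, as in Gregex */

__global__ void match_kernel(const unsigned char *packets, int *results)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;   /* one thread per packet */
    const unsigned char *packet = packets + (size_t)tid * PACKET_LEN;
    (void)packet;      /* the per-byte DFA loop of Algorithm 1 goes here */
    results[tid] = 0;  /* stub: record "no match" */
}

void process_batch(const unsigned char *host_packets, int *host_results)
{
    unsigned char *d_packets;
    int *d_results;
    cudaMalloc((void **)&d_packets, (size_t)NUM_PACKETS * PACKET_LEN);
    cudaMalloc((void **)&d_results, NUM_PACKETS * sizeof(int));

    /* Pre-processing: copy one batch of packets to global memory */
    cudaMemcpy(d_packets, host_packets, (size_t)NUM_PACKETS * PACKET_LEN,
               cudaMemcpyHostToDevice);

    /* Signature matching: one GPU thread per packet */
    int threads = 256;
    match_kernel<<<NUM_PACKETS / threads, threads>>>(d_packets, d_results);

    /* Post-processing: copy the matching result array back to the host */
    cudaMemcpy(host_results, d_results, NUM_PACKETS * sizeof(int),
               cudaMemcpyDeviceToHost);

    cudaFree(d_packets);
    cudaFree(d_results);
}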

Fig. 2. The format of (a) the packets buffer and (b) the matching results buffer in GPU global memory.

1) Pre-processing phase: In the pre-processing phase, Gregex performs the necessary preparation work, including constructing the DFA from the regular expressions and transferring packets to the GPU.

Compiling regular expressions to a DFA: In our work, the state acceptance table is merged into the state transition table as its last column when constructing the DFA. Once the DFA has been constructed, the state transition table is copied to the texture memory of the GPU in two steps: 1. copy the state transition table from CPU memory to GPU global memory; 2. bind the state transition table in global memory to the texture cache.

Transferring packets to the GPU: Now we consider how packets are transferred from CPU memory to device memory. Due to the overhead associated with each transfer, batching many packets into one larger transfer performs significantly better than making each transfer separately [11]. So Gregex copies packets to device memory in batches. The format of the buffer allocated for storing packets in global memory is illustrated in Fig. 2(a). The per-packet length is set to 2 KB. If a packet is shorter than 2 KB, Gregex pads it with 0x00 at the end; if a packet is longer than 2 KB, Gregex splits it into several smaller ones. The maximum IP packet may be up to 65,535 bytes in length; however, using the maximum packet length as the per-packet size in the buffer would waste bandwidth.

2) Signature matching phase: Each GPU thread processes its own packet in isolation during regular expression matching. Algorithm 1 gives the multi-threaded procedure for DFA matching on the GPU.

Algorithm 1. Multi-threaded DFA matching procedure.
Input: packets : a batch of packets to match
Input: DFA : state transition table
Output: Results : match results
1  packet ← packets[thread_ID];
2  current_state ← 0;
3  foreach byte in packet do
4      input ← packet[byte];
5      next_state ← DFA[current_state, input];
6      current_state ← next_state;
7      if DFA[current_state, alphabet_size + 1] = 1 then
8          Results[thread_ID] ← regex_ID;
9      end
10 end

Line 1 gets the address of the packet to match according to the thread's global ID. Lines 2-10 perform the DFA matching: at each iteration of the foreach loop, the matching thread reads one byte from the packet, looks up the state transition table for the next state, and determines whether it is an accepting state. If the DFA reaches an accepting state, the ID of the regular expression that matched the packet is recorded in Results.

3) Post-processing phase: When all GPU threads finish matching, the matching result array is copied to CPU memory. The kth cell of the matching result array contains the ID of the regular expression that matches the kth packet; if no match occurs, it is set to zero.

C. Optimizations

Gregex exploits optimization opportunities in the workflow by maximizing parallelism as well as reducing GPU memory access latency. Three optimization techniques, ATP, CAB, and CAT, are proposed to improve the performance of Gregex.

1) Asynchronous packet Transfer with Page-locked memory (ATP): Packet transfer throughput is the most important performance factor of Gregex. Higher bandwidth between the host and the device is achieved when using page-locked memory [11].
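A minimal sketch of how ATP could look, combining page-locked allocation with the asynchronous copy described next on a single CUDA stream; the names are reused from the earlier workflow sketch and remain illustrative assumptions.

/* ATP sketch: page-locked host buffer plus asynchronous transfers on a
 * CUDA stream, so data movement can overlap other host/device work.
 * Illustrative only; error handling omitted. */
void process_batch_atp(int *host_results)
{
    unsigned char *h_packets;            /* page-locked host packet buffer */
    cudaHostAlloc((void **)&h_packets,
                  (size_t)NUM_PACKETS * PACKET_LEN, cudaHostAllocDefault);
    /* ... fill h_packets with a batch of captured packets ... */

    unsigned char *d_packets;
    int *d_results;
    cudaMalloc((void **)&d_packets, (size_t)NUM_PACKETS * PACKET_LEN);
    cudaMalloc((void **)&d_results, NUM_PACKETS * sizeof(int));

    cudaStream_t stream;
    cudaStreamCreate(&stream);

    /* Non-blocking copy: control returns to the host thread immediately */
    cudaMemcpyAsync(d_packets, h_packets, (size_t)NUM_PACKETS * PACKET_LEN,
                    cudaMemcpyHostToDevice, stream);
    match_kernel<<<NUM_PACKETS / 256, 256, 0, stream>>>(d_packets, d_results);
    /* For a fully asynchronous copy-back, host_results should also be
     * page-locked */
    cudaMemcpyAsync(host_results, d_results, NUM_PACKETS * sizeof(int),
                    cudaMemcpyDeviceToHost, stream);
    cudaStreamSynchronize(stream);       /* wait for copies and the kernel */

    cudaStreamDestroy(stream);
    cudaFree(d_packets);
    cudaFree(d_results);
    cudaFreeHost(h_packets);
}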
Asynchronous copy: In CUDA, data transfers between the host and the device using the cudaMemcpyAsync function are non-blocking: control is returned immediately to the host thread. Asynchronous copies enable data transfers to overlap with host and device computation.

Zero copy: Zero copy requires mapped page-locked memory and enables GPU threads to access host memory directly. Zero copy can make kernel execution overlap with data transfers automatically.

2) Coalesced global memory access in regular expression matching: Global memory access has a very high latency, about 400-600 cycles for a load/store operation. All global memory accesses by a half-warp of threads (in CUDA, a warp is a group of threads executed physically in parallel; a half-warp is the first or second half of a warp) can be coalesced into one or two transactions if these threads access a contiguous range of addresses. In Algorithm 1, special attention must therefore be paid to how threads load packets from global memory and store matching results to Results.

Coalesced global memory Access by Buffering packets in shared memory (CAB): In this work, coalesced global memory access is obtained by having each half-warp read contiguous locations of global memory into shared memory. There is no performance penalty for non-contiguous access in shared memory as there is in global memory. We use s_packets, a 32 × 32 shared memory array of 32-bit words, to buffer packet data from global memory for every thread. If the total length of a packet is L bytes, it takes L/32 iterations in total for a thread to process a packet. In each iteration, the threads in a block read data into s_packets cooperatively to avoid uncoalesced global memory access, and then each begins to match signatures on one row of s_packets separately.
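A minimal sketch of the CAB idea, assuming one block of 32 threads that handles 32 packets stored as arrays of 32-bit words; the kernel shape and names are illustrative assumptions, and the one-column padding anticipates the bank-conflict analysis that follows.

/* CAB sketch: a block of TILE threads cooperatively stages TILE packets'
 * data in shared memory so global memory reads are coalesced; each thread
 * then matches on its own buffered row. Launch with TILE threads per block;
 * words_per_packet is assumed to be a multiple of TILE. Illustrative only. */
#define TILE 32

__global__ void match_kernel_cab(const unsigned int *packets_words,
                                 int *results, int words_per_packet)
{
    /* Padded to TILE+1 columns; the padding removes the shared memory
     * bank conflicts analyzed in the text below */
    __shared__ unsigned int s_packets[TILE][TILE + 1];

    int row = threadIdx.x;               /* this thread's row / packet */
    int packet0 = blockIdx.x * TILE;     /* first packet of this block */

    for (int base = 0; base < words_per_packet; base += TILE) {
        /* Cooperative, coalesced load: for each row r, consecutive
         * threads read consecutive words of packet (packet0 + r) */
        for (int r = 0; r < TILE; r++)
            s_packets[r][row] = packets_words[
                (size_t)(packet0 + r) * words_per_packet + base + row];
        __syncthreads();

        /* Each thread scans its own row; the per-byte DFA transitions
         * of Algorithm 1 would be applied to these buffered words */
        for (int j = 0; j < TILE; j++) {
            unsigned int w = s_packets[row][j];
            (void)w;
        }
        __syncthreads();                 /* before the next tile overwrites */
    }
    results[packet0 + row] = 0;          /* stub: record "no match" */
}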

Fig. 3. The format of the packets buffer after transposing.

Fig. 4. Throughput of transferring packets to an NVIDIA GTX 260 GPU for different data sizes.

However, shared memory bank conflicts will occur if two or more threads in a half-warp access bytes within different 32-bit words belonging to the same bank. A way to avoid this conflict is to pad the shared memory array by one column. After changing the size of s_packets to 32 × 33, the data in cell (i, j) and cell (x, y) of s_packets are mapped to the same bank if and only if

((i × 33 + j) − (x × 33 + y)) mod banks_num = 0,

where banks_num = 16 in the current GPU architecture. When threads in a half-warp read data in the same column, that is, j = y, we have

((i − x) × 33 + (j − y)) mod banks_num = ((i − x) × 33) mod 16.

Thus a bank conflict will never occur within a half-warp, since ((i − x) × 33) mod 16 ≠ 0 for any two distinct threads of a half-warp (0 < |i − x| < 16).

Coalesced global memory Access by Transposing the packets buffer (CAT): Another technique to avoid uncoalesced global memory access is to transpose the packets buffer before matching. Transposing the packets buffer is similar to transposing a matrix. Detailed documentation [16] on optimizing matrix transpose in CUDA, by Greg Ruetsch, is released along with the CUDA SDK. In our work, we implement a high performance CUDA matrix transpose kernel simply by following Ruetsch's steps in [16]. With packet buffer transposing, the total packet processing time of Gregex consists of the time to transfer packets to GPU memory, the time to transpose the packets buffer, and the time to match packets against the signatures. Transposing the packets buffer makes a half-warp of GPU threads access a contiguous range of addresses, as shown in Fig. 3.

TABLE I
PERFORMANCE COMPARISON BETWEEN GREGEX AND OTHER GPU BASED IMPLEMENTATIONS.

Hardware     Algorithm   Throughput (Gbps)   Speedup
GTX260 (1)   DFA (CAT)
GTX260 (1)   DFA (CAB)
GT (2)       Gnort AC    1.4 [7]
GX2 (3)      DFA         16 [8]              7.9
GTX280 (4)   AC          9.3 [15]

(1) Contains 216 SPs organized in 27 SMs, running at 1.35 GHz with 896 MB of memory.
(2) Contains 32 SPs organized in 4 SMs, running at 1.2 GHz with 512 MB of memory.
(3) Consists of 256 SPs organized in 16 SMs, running at 1.5 GHz with 512 MB of memory.
(4) Contains 240 SPs organized in 30 SMs, running at 1.45 GHz with 1024 MB of memory.

IV. EVALUATION RESULTS

A. Experimental Setup

Gregex is implemented on a PC with a 2.66 GHz Intel Core 2 Duo processor, 4 GB of memory, and an NVIDIA GeForce GTX 260 GPU card. The GTX260 GPU contains 216 SPs organized in 27 SMs, running at 1.35 GHz with 896 MB of global memory. We implement Gregex under CUDA version 3.1 with the corresponding device driver. Gregex uses signatures from the rule set released with Snort 2.7. The rule set consists of 56 different signature sets. For each signature set, we construct a single DFA for all the regular expressions in it. We use two different network traces for evaluating the performance of Gregex: a trace collected on the Internet and a trace from the DARPA intrusion detection evaluation data set [17]. In our experiments, Gregex reads packets from the local disk and then transfers them in batches to GPU memory for processing.

B. Packet Transfer Performance

We first evaluate the throughput of packet transfers from CPU memory to GPU global memory. The throughput of transferring packets to the GPU varies depending on the data size. For this experiment we test two different kinds of host memory: page-locked memory and pageable memory. Page-locked memory cannot be swapped out to disk by the operating system before the GPU has finished using it, so it is faster than pageable memory, as shown in Fig. 4.
Both the graphics card and the mainboard in our system support PCI-E ×16 Gen2. The theoretical peak bandwidth between host memory and device memory (64 Gbps) far exceeds what we actually obtain. Larger transfers perform significantly better than smaller ones, but when the data size exceeds 8 MB, the throughput no longer increases notably.

C. Regular Expression Matching Performance

In this experiment, we evaluate the processing performance of Gregex, which is measured as the mean number of bits of data processed per second.
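A minimal sketch of how such raw matching throughput could be measured with CUDA events, using the illustrative names from the earlier sketches; the event-based timing is an assumption, not a method stated in the paper.

/* Throughput measurement sketch: CUDA events time the matching kernel
 * alone, and throughput is reported in Gbps. Illustrative only. */
cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);

cudaEventRecord(start, 0);
match_kernel<<<NUM_PACKETS / 256, 256>>>(d_packets, d_results);
cudaEventRecord(stop, 0);
cudaEventSynchronize(stop);

float ms = 0.0f;
cudaEventElapsedTime(&ms, start, stop);       /* elapsed milliseconds */
/* bits processed / seconds elapsed, scaled to Gbps */
double gbps = ((double)NUM_PACKETS * PACKET_LEN * 8.0) / (ms * 1e6);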

Fig. 5. Performance of Gregex: (a) regular expression matching throughput and (b) overall throughput of the CAT, CAB, and ATP-combined configurations, as a function of the number of blocks per grid.

From Fig. 5(a), we can see that Gregex achieves a regular expression matching throughput of Gbps in the best case. Table I compares Gregex with other GPU based regular expression matching engines. The performance statistics presented in Table I are raw performance: the time used for transferring packets to GPU memory is not included in the processing time. Gregex is about 7.9× faster than the state-of-the-art GPU solution proposed in [8].

D. Overall Throughput of Gregex

We now evaluate the overall performance of Gregex. As shown in Fig. 5(b), the best-case overall performance of Gregex is 25.6 Gbps, achieved when packets are transferred asynchronously to GPU global memory using page-locked memory; this is 8× faster than the solution proposed in [15].

V. CONCLUSION

A high speed GPU based regular expression matching engine, Gregex, is introduced in this paper. Gregex takes advantage of the high parallelism of the GPU to process packets in parallel. We describe three optimization techniques for Gregex in detail: ATP, CAB, and CAT. These optimization techniques significantly improve the performance of Gregex. Our experimental results indicate that Gregex is about 7.9× faster than the state-of-the-art GPU based regular expression engine. Gregex is highly flexible and low-cost as well as high-speed, and can easily be applied to network security applications such as IDS and anti-virus systems.

VI. ACKNOWLEDGMENT

This work has been supported by the National High-Tech Research and Development Plan of China under Grant No. 09AA01A346.

REFERENCES

[1] Snort, http://www.snort.org/.
[2] N. Jacob and C. Brodley, "Offloading IDS computation to the GPU," in Proceedings of the 22nd Annual Computer Security Applications Conference. IEEE Computer Society, 2006.
[3] S. Kumar, J. Turner, and J. Williams, "Advanced algorithms for fast and scalable deep packet inspection," in Proceedings of the 2006 ACM/IEEE Symposium on Architecture for Networking and Communications Systems. San Jose, California, USA: ACM, 2006.
[4] F. Yu, Z. Chen, Y. Diao, T. V. Lakshman, and R. H. Katz, "Fast and memory-efficient regular expression matching for deep packet inspection," in Proceedings of the 2006 ACM/IEEE Symposium on Architecture for Networking and Communications Systems. San Jose, California, USA: ACM, 2006.
[5] B. C. Brodie, R. K. Cytron, and D. E. Taylor, "A scalable architecture for high-throughput regular-expression pattern matching," SIGARCH Comput. Archit. News, vol. 32, no. 2, 2006.
[6] M. Becchi, C. Wiseman, and P. Crowley, "Evaluating regular expression matching engines on network and general purpose processors," in Proceedings of the 2009 ACM/IEEE Symposium on Architectures for Networking and Communications Systems (ANCS), Princeton, New Jersey, 2009.
[7] G. Vasiliadis, S. Antonatos, M. Polychronakis, E. P. Markatos, and S. Ioannidis, "Gnort: High performance network intrusion detection using graphics processors," in Proceedings of the 11th International Symposium on Recent Advances in Intrusion Detection. Cambridge, MA, USA: Springer-Verlag, 2008.
[8] G. Vasiliadis, M. Polychronakis, S. Antonatos, E. P. Markatos, and S. Ioannidis, "Regular expression matching on graphics hardware for intrusion detection," in Proceedings of the 12th International Symposium on Recent Advances in Intrusion Detection, Saint-Malo, France, 2009.
[9] K. Thompson, "Programming techniques: Regular expression search algorithm," Commun. ACM, vol. 11, no. 6, pp. 419-422, 1968.
[10] NVIDIA, CUDA C Programming Guide, version 3.1.
[11] NVIDIA, CUDA C Best Practices Guide, version 3.1.
[12] R. Smith, N. Goyal, J. Ormont, K. Sankaralingam, and C. Estan, "Evaluating GPUs for network packet signature matching," in Proceedings of the International Symposium on Performance Analysis of Systems and Software, 2009.
[13] R. Smith, C. Estan, and S. Jha, "XFA: Faster signature matching with extended automata," in IEEE Symposium on Security and Privacy. IEEE Computer Society, 2008.
[14] R. Smith, C. Estan, S. Jha, and S. Kong, "Deflating the big bang: Fast and scalable deep packet inspection with extended finite automata," SIGCOMM Comput. Commun. Rev., vol. 38, no. 4, 2008.
[15] S. Mu, X. Zhang, N. Zhang, J. Lu, Y. S. Deng, and S. Zhang, "IP routing processing with graphic processors," in Design, Automation and Test in Europe, 2010.
[16] G. Ruetsch and P. Micikevicius, "Optimizing matrix transpose in CUDA," 2009.
[17] J. McHugh, "Testing intrusion detection systems: A critique of the 1998 and 1999 DARPA intrusion detection system evaluations as performed by Lincoln Laboratory," ACM Trans. Inf. Syst. Secur., vol. 3, no. 4, 2000.
