Using OpenCL for Implementing Simple Parallel Graph Algorithms

Michael J. Dinneen, Masoud Khosravani and Andrew Probert
Department of Computer Science, University of Auckland, Auckland, New Zealand

Abstract. For the graph algorithms encountered most frequently in practice (such as those introduced in typical entry-level algorithms courses: graph searching/traversals, shortest-path problems, strongly connected components and minimum spanning trees), we want to consider practical non-sequential platforms such as the recently emerged, cost-effective General-Purpose computation on Graphics Processing Units (GPGPU). In this paper we provide two simple design techniques that allow a non-specialist computer scientist to harness the power of their GPUs as parallel compute devices. These two natural ideas are (a) using a host CPU script to synchronize a distributed view of a graph algorithm, where each node of the input graph is associated with a unique processing thread ID, and (b) using GPU atomic operations to synchronize a single kernel launch, where a set of threads, bounded above by the number of streaming processing units available, stays continuously active and time-slices the total workload until the algorithm completes. We give concrete comparative implementations of both approaches for the simple problem of exploring a graph using breadth-first search. Finally, we conclude that OpenCL, in addition to CUDA, is a natural tool for modern graph algorithm designers, especially those who are not experts in GPU hardware architecture, to develop real-world usable graph applications.

Keywords: parallel graph algorithms, GPGPU, OpenCL, CUDA

Contact Author: M.J. Dinneen
Conference: PDPTA'11

I. INTRODUCTION

Parallel programming is a generic concept describing a range of technologies and approaches. In general, however, it describes a system in which threads of instruction are executed truly in parallel over a shared or partitioned data source.
As part of parallel computing, General-Purpose computation on Graphics Processing Units (GPGPU) is a new and active field. The main goal in GPGPU is to find parallel algorithms capable of concurrently processing huge amounts of data across a number of Graphics Processing Units (GPUs). GPGPU involves using the advanced parallel GPU devices now readily available for general-purpose parallel programming. Within GPGPU research, implementing graph algorithms is an important sub-field and is the focus of this paper.

Recently, GPUs have found their place among general computing devices. They are affordable and easily accessible for enterprises looking for relatively low-cost devices to process massive amounts of data. In some applications the input is so large that even a low-order polynomial-time algorithm exceeds the available time. Here one may reduce the running time by using more processors to carry out the computation concurrently. The main challenge then is to find a parallel GPU algorithm that achieves a significant speedup over a well-designed sequential one.

CUDA [11] is the GPGPU platform provided by Nvidia Corporation that enables software developers to access the low-level instructions and memory of Nvidia GPUs. In line with the current architecture of GPUs, CUDA follows the Single Instruction, Multiple Threads (SIMT) approach to parallel processing. While CUDA is restricted to Nvidia GPUs, OpenCL [8] is a generic overlay that provides a common interface for heterogeneous parallel processing on both CPU- and GPU-based systems across different devices, such as AMD Radeon graphics cards. Each GPGPU OpenCL application consists of a host program (or script) that runs on the CPU and launches kernels (kernel programs), which are compiled and run on the OpenCL devices.
We believe OpenCL makes programming on GPUs easier and safer because it limits access to the kernel (e.g. sandboxing).

Designing parallel algorithms for graph problems has been studied for many years [1], [12], [13]. Implementing these algorithms efficiently on GPUs is a challenging task. In [2], Dehne and Yogaratnam show that non-trivial changes may be needed to port a PRAM graph algorithm efficiently to GPUs. They identify the irregularity of graphs as one of the main challenges. Graph irregularity, as an obstacle to designing fast parallel GPU algorithms for graph problems, is also addressed in [5], [7], and [14]. The paper of Harish and Narayanan [6] on parallel GPU algorithms for graphs is widely cited. Another notable result is due to Luo, Wong and Hwu [10]. Both propose parallel versions of basic graph algorithms implemented directly on the Nvidia CUDA platform. As far as we know, the latter paper provides the fastest known breadth-first search algorithm for GPUs.

In principle we agree with the authors of [10] that the complexity of a GPGPU algorithm should be the same as that of the best known sequential one. However, from a practical point of view, simple (possibly non-optimal) correct algorithms are also of value. For instance, when the expected inputs are known to be relatively small, the extra time and overhead of implementing an optimal algorithm may not be justified.

II. TWO DIFFERENT GPGPU DESIGN APPROACHES

We now explain two simple ways to synchronize graph (and other) GPGPU computations that consist of a set of well-defined stages. For example, in a breadth-first search (BFS) of a graph, the stages correspond to the times at which the set of nodes at a given depth (level) is determined.

A. Host-based synchronization design

The first natural approach is to use a host CPU program (or script) to synchronize the stages of a graph algorithm, where each part (usually a node) of the input graph is associated with a unique processing thread.
This is a standard way of synchronizing processing threads. Here a global variable, shared by all threads, is set to false by the host and set to true by any thread inside the kernel that requires another stage. Figure 1 shows a PyOpenCL [9] snippet from our breadth-first search implementation using host-based synchronization, where n is the number of threads. We use the PyOpenCL call enqueue_write_buffer to send data from the host to the GPU, and enqueue_read_buffer to retrieve data from the GPU to the host. Each kernel thread sets the global variable continue_flag if the algorithm needs another synchronization stage. Because GPUs operate asynchronously from the host, an explicit wait call is required so that all kernel tasks finish before the next stage begins.

Comment: In its original form, the BFS algorithm of [6] uses kernel relaunch to provide a global inter-block barrier between search frontiers. In fact, their program launches two kernels per iteration: one to check the neighbors of each visited node and another to update the next frontier. In addition to our other changes, we have developed an algorithm (DKP-Host Sync, see Section III) that needs only one kernel launch per stage, by synchronizing threads and efficiently distributing data among the available threads.

B. Kernel synchronization using atomic operations

The second natural approach is to use GPU atomic operations to synchronize a single kernel launch. In this case a set of threads stays continuously active and time-slices the total workload until the algorithm completes. An important requirement for this approach is an efficient way to partition an algorithm's workload. Suppose we have n tasks to complete and only a fixed number m = MAX_THREADS of parallel processing threads. After distributing the tasks evenly, each thread performs at most c = ⌈n/m⌉ tasks.
This can be done in a number of ways, depending on how we stride through the tasks $t_1, t_2, \ldots, t_n$. If the data is stored in memory, we usually want to partition and stride through the tasks either as
$$[t_1, \ldots, t_c], [t_{c+1}, \ldots, t_{2c}], \ldots, [t_{(m-1)c+1}, \ldots, t_n]$$
or as
$$[t_1, t_{1+m}, t_{1+2m}, \ldots, t_{1+(c-1)m}], [t_2, t_{2+m}, \ldots, t_{2+(c-1)m}], \ldots, [t_m, t_{2m}, \ldots, t_n].$$
The kernel code listed in Figure 2 illustrates the distribution of these tasks to processing threads, where tid identifies one of the active threads operating in parallel. By our convention, the thread with tid = 0 manages the synchronization for the algorithm. In this kernel, current_stage acts as a global clock; each thread keeps a local clock and executes its set of tasks only when the two match. Note the use of atomic operations to ensure the correctness of shared data.
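To see why a global stage clock, per-thread local clocks and a finish_count counter are enough to keep a single launch in lockstep, the protocol can be simulated on the CPU. The sketch below is only an illustration under our own assumptions, not the DKP-Kernel Sync code: Python threads stand in for work-items, a lock-protected AtomicInt (a hypothetical helper) emulates atom_inc and atom_cmpxchg, each thread takes its tasks with the second (cyclic) stride, and thread 0 plays the role of a thread that discovers more work, requesting a new stage until NUM_STAGES (a hypothetical parameter) stages have run.

```python
import threading
import time

MAX_THREADS = 4   # hypothetical work-group size (the paper's C2050 uses 1024)
NUM_TASKS = 10    # hypothetical number of tasks n per stage
NUM_STAGES = 3    # hypothetical number of stages the "algorithm" needs


class AtomicInt:
    """Lock-based stand-in for a global int updated with OpenCL atomics."""

    def __init__(self, value=0):
        self._value = value
        self._lock = threading.Lock()

    def load(self):
        with self._lock:
            return self._value

    def inc(self):
        """Like atom_inc: add one and return the old value."""
        with self._lock:
            old = self._value
            self._value += 1
            return old

    def cmpxchg(self, expected, new):
        """Like atom_cmpxchg: store new only if value == expected; return the old value."""
        with self._lock:
            old = self._value
            if old == expected:
                self._value = new
            return old


def run_single_launch(num_threads=MAX_THREADS, n=NUM_TASKS, stages=NUM_STAGES):
    current_stage = AtomicInt(0)                  # global clock
    finish_count = AtomicInt(0)                   # threads done with the current stage
    continue_flag = [0] * (stages + 1)
    continue_flag[0] = 1                          # the first stage always runs
    local_clock = [0] * num_threads
    processed = [[0] * n for _ in range(stages)]  # processed[s][t]: runs of task t in stage s

    def worker(tid):
        while True:                               # spin loop, as in the kernel
            stage = current_stage.load()
            if stage != local_clock[tid]:
                time.sleep(0)                     # yield instead of busy-waiting on the GIL
                continue
            if continue_flag[stage] == 0:
                return                            # no thread asked for another stage
            for t in range(tid, n, num_threads):  # second stride: tid, tid+m, tid+2m, ...
                processed[stage][t] += 1
            if tid == 0 and stage + 1 < stages:
                continue_flag[stage + 1] = 1      # pretend more work was discovered
            finish_count.inc()                    # check in only after finishing this stage
            local_clock[tid] += 1
            if tid == 0:
                # spin until all threads have checked in, resetting the counter atomically
                while finish_count.cmpxchg(num_threads, 0) != num_threads:
                    time.sleep(0)
                current_stage.inc()               # release everyone into the next stage

    threads = [threading.Thread(target=worker, args=(tid,)) for tid in range(num_threads)]
    for th in threads:
        th.start()
    for th in threads:
        th.join()
    return current_stage.load(), processed
```

Each thread checks in on finish_count only after finishing its share of the current stage, and thread 0 resets the counter with a compare-and-exchange before advancing the clock, so no thread can run ahead into the next stage. Calling run_single_launch() should return a final stage clock equal to NUM_STAGES, with every task processed exactly once per stage.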

    import pyopencl as cl  # PyOpenCL API of the time; later releases replace these calls with cl.enqueue_copy

    # queue, kernel, n and the device buffers are assumed to be set up already;
    # continue_flag and current_stage are one-element NumPy arrays, continue_flag[0] starting at 1.
    while continue_flag[0]:
        continue_flag[0] = 0  # kernel threads set it back to 1 if another stage is needed
        cl.enqueue_write_buffer(queue, continue_flag_buf, continue_flag)
        cl.enqueue_write_buffer(queue, current_stage_buf, current_stage)
        cl.enqueue_nd_range_kernel(queue, kernel, (n, 1), None)  # one thread per node
        # wait for the kernel to finish and copy the flag back to the host
        cl.enqueue_read_buffer(queue, continue_flag_buf, continue_flag).wait()
        current_stage[0] += 1

Fig. 1. Host-based synchronization using PyOpenCL.

    // current_stage and finish_count are assumed to be volatile __global int pointers
    while (1) // spin loop
    {
        // using current_stage as global clock
        if (*current_stage == local_clock[tid])
        {
            // Is everything done?
            if (continue_flag[*current_stage] == 0) return;
            // process n/MAX_THREADS work at this sync time stage
            // ...
            // if needed, we set next continue_flag[*current_stage+1] = 1
            atom_inc(finish_count);
            local_clock[tid] += 1; // this work thread is done
            if (tid == 0) // thread 0 detects if everybody is done with stage
            {
                // atom_cmpxchg returns the old value: spin until the count reaches
                // MAX_THREADS, resetting it to 0 in the same atomic step
                while (atom_cmpxchg(finish_count, MAX_THREADS, 0) != MAX_THREADS) {}
                atom_inc(current_stage);
            }
        }
    } // end kernel's algorithm loop

Fig. 2. Thread-based synchronization in OpenCL kernel.

The finish_count value is used by this kernel to synchronize the threads (or processors) between stages. For example, our Nvidia C2050 device has MAX_THREADS = 1024, and we used the second stride technique described above in our program DKP-Kernel Sync, which is discussed in Section III.

Comment: Initially we experimented with a lock-based inter-block barrier as described by Xiao and Feng [15]. This worked reliably for small graphs and for very dense graphs. Unfortunately, Nvidia Corporation does not officially support this inter-block barrier technique, and its use can lead to unpredictable results on the current family of Nvidia GPGPU offerings. We eventually developed our own single-block synchronization technique (presented above), which uses atomic operations and avoids the disadvantages of the approach of [15].

III. EXPERIMENTAL RESULTS

As an illustrative example, we developed two BFS algorithms similar to the ideas first presented as CUDA implementations by Harish and Narayanan [6]. We recompiled their CUDA program to run on our platform (see below) to obtain comparable running times. Partly because of the limited precision available for timing GPU computations, all our times are given as elapsed milliseconds. Calculation times are the kernel run-time from launch until after the host program's final wait call. As with common
practice, we do not include disk I/O time or host-to-device copy times. We argue that an application will often copy a graph to memory or to the GPU once and run many algorithms on that copy (often marked read-only). So the real question is how long the algorithm itself takes, assuming the graph data structure is already available.

[Fig. 3 (panels: Graph; Adjacency Lists; Linear Array Representation). An effective way to represent sparse graphs in an array.]

For graph algorithms on sparse graphs, one often prefers adjacency lists over adjacency matrices, since they make it easy to iterate through the neighbors of a node in time proportional to the node's out-degree [3]. For a GPU representation, one usually linearizes these two-dimensional adjacency lists into a one-dimensional array without any loss of efficiency. We represent the flattened adjacency lists of a graph with $n$ nodes and $m$ edges as a vector of length $n + m + 2 = O(n + m)$:
$$[n, l_0, \ldots, l_n, v_0, v_1, \ldots, v_{m-1}].$$
Here $n$ is the order, and $l_i$ is the sub-index of the first neighbor of node $i$, i.e., it points to some $v_j$ with $0 \le j < m$. Figure 3 illustrates this array representation. In particular, $l_n$ (plus the sub-index offset $n + 2$) is an index one past the end of the vector, and the degree of node $i$ is $l_{i+1} - l_i$.

The expected running time of all three tested algorithms listed in Table I is $O(nx + m)$ for a graph of order $n$, size $m$ and eccentricity $x$ (the distance from the starting node to a farthest node). We note that for sparse graphs the value of $x$ is much less than $\log n$, based on the known average height of a random rooted tree. So for most graphs $m$ is the dominant term in the complexity of these algorithms; the graphs chosen as our test cases, whose eccentricities are large, are exceptions.

There are a number of variants of the BFS algorithm. One can gather information about predecessors (parents), BFS tree levels, and lists of children. In addition, one can compute the BFS trees for all possible starting nodes, which amounts to computing the distance matrix.

A. Test cases and benchmarking environment

As somewhat tough test cases, as suggested in the paper of Luo, Wong and Hwu [10], we picked sparse graphs from the 9th DIMACS Implementation Challenge [4]. We took each of the 51 state road (di)graphs and made it connected, so that a BFS from starting node 0 would span and process the entire graph. We removed loops, and connected each graph with $k > 1$ components by adding $k - 1$ arcs: for each component other than the one containing node 0, we added the arc $(i - 1, i)$, where $i$ is the lowest node index in that component. The orders (numbers of nodes), sizes (numbers of arcs) and eccentricities of node 0 (distance from node 0 to a farthest node) are listed in the first columns of Table I.

The system we used to implement the BFS algorithm with the two design approaches above is a rack-mountable server with two quad-core 2.5 GHz Intel CPUs and two Nvidia Tesla C2050 series (Fermi class) cards. The Tesla C2050 is classified as having Nvidia compute capability 2.0, which defines a range of attributes. In particular, the C2050 has 14 multiprocessors (MPs), each with 32 cores, and 3 GB of global memory. Each of the 448 cores operates at a frequency of 1.15 GHz. The Tesla C2050 supports block sizes of up to 1024 threads, which can be viewed as the MAX_THREADS value discussed earlier.

B. Observations from our experiments

All programs produced the same (correct) BFS depths for each node as a standard CPU-based BFS program. In addition to the depths from the source node that the original Harish-Narayanan algorithm computes, our DKP programs also record a BFS parent for each node, and we later performed a sequential BFS DAG search to verify correctness by ensuring that each parent is, in fact, adjacent to its child and has depth one less than the child.
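The flattened layout, a host-synchronized BFS and the parent check described above can be tied together in a small sequential sketch. It is a CPU illustration under our own assumptions, not the DKP code: flatten, neighbours, host_sync_bfs, reference_depths and verify are hypothetical names, nodes are numbered from 0, each pass of the outer while loop stands for one kernel launch of the host loop in Fig. 1, and the loop over u plays the role of one thread per node.

```python
from collections import deque


def flatten(adj):
    """Build [n, l_0, ..., l_n, v_0, ..., v_{m-1}] from a list of adjacency lists."""
    offsets, targets = [0], []
    for nbrs in adj:
        targets.extend(nbrs)
        offsets.append(len(targets))  # l_{i+1} = l_i + out-degree of node i
    return [len(adj)] + offsets + targets


def neighbours(g, i):
    """Neighbours of node i: v_j for l_i <= j < l_{i+1}, stored after the offset n + 2."""
    base = g[0] + 2
    return g[base + g[1 + i]: base + g[2 + i]]


def host_sync_bfs(g, source=0):
    """Level-synchronous BFS; each pass of the while loop models one kernel launch."""
    n = g[0]
    depth, parent = [-1] * n, [-1] * n
    depth[source] = 0
    current_stage, continue_flag = 0, True
    while continue_flag:
        continue_flag = False             # the host clears the flag before each launch
        for u in range(n):                # "one thread per node"
            if depth[u] != current_stage:
                continue                  # not on the current frontier
            for v in neighbours(g, u):
                if depth[v] == -1:
                    depth[v], parent[v] = current_stage + 1, u
                    continue_flag = True  # another stage is needed
        current_stage += 1
    return depth, parent, current_stage   # current_stage = number of launches


def reference_depths(g, source=0):
    """Standard queue-based CPU BFS used as the reference answer."""
    depth = [-1] * g[0]
    depth[source] = 0
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in neighbours(g, u):
            if depth[v] == -1:
                depth[v] = depth[u] + 1
                queue.append(v)
    return depth


def verify(g, depth, parent, source=0):
    """Check that every reached node's parent is adjacent to it and one level shallower."""
    if depth[source] != 0:
        return False
    for v in range(g[0]):
        if v == source or depth[v] == -1:
            continue
        p = parent[v]
        if p < 0 or v not in neighbours(g, p) or depth[p] != depth[v] - 1:
            return False
    return True
```

Each stage scans all n nodes, and the loop runs x + 1 times for a source of eccentricity x, which is where the nx term in O(nx + m) comes from; for flatten([[1, 2], [3], [3], [4], []]) the source has eccentricity 3 and host_sync_bfs performs 4 stages. On a GPU several threads may write parent[v] in the same stage; any of them is a valid parent, which is exactly what verify accepts.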
We conclude that the OpenCL and CUDA approaches differ very little in overall performance. Beyond what is reported here, we also converted our OpenCL DKP-Kernel Sync implementation to a pure CUDA implementation and measured its running times on the same test cases. In all cases, the times were within roughly ±2% of those listed in the last column of Table I. There are a couple of extreme cases (MI and MO) in our experiment where the times are unusually high relative to the other two programs, for reasons we do not fully understand (we reran each program a few times to double-check the reliability of our times).

TABLE I. GPU BFS ALGORITHM RUNNING TIMES (IN MILLISECONDS) ON USA STATE ROAD GRAPHS (ORIGINATING FROM NODE 0).
[Columns: State Graph, Nodes, Arcs, Eccentricity, Harish-Narayanan, DKP-Host Sync, DKP-Kernel Sync. Rows: AK, AL, AR, AZ, CA, CO, CT, DC, DE, FL, GA, HI, IA, ID, IL, IN, KS, KY, LA, MA, MD, ME, MI, MN, MO, MS, MT, NC, ND, NE, NH, NJ, NM, NV, NY, OH, OK, OR, PA, RI, SC, SD, TN, TX, UT, VA, VT, WA, WI, WV, WY, Total, Average.]

Recall that the Harish-Narayanan program also uses host synchronization, but it is written purely in C rather than PyOpenCL; we do not believe this difference is the cause. It turns out that for large graphs, such as the USA road graph with 24 million nodes, DKP-Host Sync (about 20 seconds of GPU time) is about 3 to 5 times faster than DKP-Kernel Sync. It appears that from about one million nodes upward (on sparse graphs), DKP-Host Sync runs faster than DKP-Kernel Sync (see, e.g., the large states such as CA, FL and TX). However, for these rare extreme cases it might be better to use a more optimal algorithm, such as the one given in [10]. For small (or dense) graphs, one should probably prefer the DKP-Kernel Sync program.

Finally, we note that the common best practice of ensuring memory coalescing should not necessarily be taken as absolute advice. Our DKP-Kernel Sync program actually performs better with memory strides in increments of MAX_THREADS than a version that uses memory strides of distance one. We suggest that users try several equivalent variations of their program and pick the best performer for their expected inputs.

IV. CONCLUSIONS AND OPEN PROBLEMS

In this paper we introduced and compared two simple techniques for synchronizing processes in parallel graph algorithms. We first considered the scenario where the host CPU is responsible for synchronization through kernel launches. For example, in DKP-Host Sync, each node of the graph is uniquely associated with a multiprocessor thread. For our second approach, we considered the case where the input is partitioned among at most as many threads as can run in parallel. For example, in DKP-Kernel Sync, we use only one block (work group) of parallel threads to avoid inter-block synchronization issues; here the threads use atomic operations to synchronize the computation.
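The two orders in which a thread can walk through its share of the tasks (stride one within a contiguous block, or stride MAX_THREADS) can be made concrete with a small sketch. It assumes 0-based task indices (the text numbers tasks from t_1), and blocked_tasks, cyclic_tasks and access_order are hypothetical helper names; access_order lists the task indices that threads 0, ..., m-1 touch at the same step.

```python
def blocked_tasks(tid, n, m):
    """First scheme: thread tid gets one contiguous run of c = ceil(n / m) tasks."""
    c = -(-n // m)                       # integer ceiling of n / m
    return list(range(tid * c, min((tid + 1) * c, n)))


def cyclic_tasks(tid, n, m):
    """Second scheme: thread tid gets tasks tid, tid + m, tid + 2m, ..."""
    return list(range(tid, n, m))


def access_order(scheme, n, m):
    """For each step k, the task indices that threads 0..m-1 process at that step."""
    per_thread = [scheme(tid, n, m) for tid in range(m)]
    steps = max(len(tasks) for tasks in per_thread)
    return [[tasks[k] for tasks in per_thread if k < len(tasks)] for k in range(steps)]
```

With n = 10 and m = 4, the first step touches indices 0, 1, 2, 3 under the cyclic order and 0, 3, 6, 9 under the blocked order. Both schemes cover every task exactly once; since the faster choice depended on the kernel in the experiments above, timing both variants on the expected inputs is the practical test.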
Our experiments showed that both approaches work well for specific categories of graphs: DKP-Kernel Sync is better for small and dense graphs, while DKP-Host Sync is more efficient on sparse graphs with over a million nodes. We also compared the running times of implementations of the same algorithm in OpenCL and CUDA and observed no notable difference in computation time between them. Hence OpenCL seems to be as mature and usable as CUDA, with at least one additional advantage: it is portable to more devices (CPUs and GPUs).

There are many problems left to investigate in this area. For example, we are interested in testing other graph algorithms with these synchronization techniques. Finding a way to reliably implement an inter-block barrier on GPU platforms would also be extremely valuable. Further work could include developing an OpenCL library of efficient parallel graph algorithms for GPUs.

ACKNOWLEDGMENTS

The authors would like to thank P.J. Narayanan and Wen-mei Hwu for providing samples of their BFS GPU code for comparison, and Radu Nicolescu for discussions and encouragement in designing GPGPU graph algorithms.

REFERENCES

[1] F. Y. Chin, J. Lam and I. Chen, "Efficient parallel algorithms for some graph problems," Communications of the ACM, 25(9), 1982.
[2] F. Dehne and K. Yogaratnam, "Exploring the Limits of GPUs With Parallel Graph Algorithms."
[3] M. J. Dinneen, G. Gimel'farb and M. C. Wilson, Introduction to Algorithms, Data Structures and Formal Languages, 2nd Edition, Pearson (Education New Zealand).
[4] D. Schultes, 9th DIMACS Implementation Challenge (challenge9): USA state road graphs, http: // challenge9/data/tiger/, October.
[5] Y. Frishman and A. Tal, "Multi-Level Graph Layout on the GPU," IEEE Transactions on Visualization and Computer Graphics, 13, 2007.
[6] P. Harish and P. J. Narayanan, "Accelerating large graph algorithms on the GPU using CUDA," in High Performance Computing (HiPC), 2007, LNCS 4873.
[7] K. A. Hawick, A. Leist and D. P. Playne, "Parallel graph component labelling with GPUs and CUDA," Parallel Computing, 36(12), 2010.
[8] Khronos Group, "Open Standards for Media Authoring and Acceleration."
[9] A. Klöckner, "PyCUDA and PyOpenCL: Even Simpler GPU Programming with Python," Nvidia GPU Technology Conference.
[10] L. Luo, M. Wong and W-M. Hwu, "An Effective GPU Implementation of Breadth-First Search," in Proceedings of the 47th Design Automation Conference (Anaheim, California).
[11] Nvidia, CUDA.
[12] M. J. Quinn and N. Deo, "Parallel Graph Algorithms," ACM Computing Surveys, 16(3), 1984.
[13] V. Rao and V. Kumar, "Parallel depth first search. Part I. Implementation," International Journal of Parallel Programming, 16(6), 1984.
[14] J. Soman, K. Kishore and P. J. Narayanan, "A fast GPU algorithm for graph connectivity," IEEE International Symposium on Parallel and Distributed Processing, 2010, pp. 1-8.
[15] S. Xiao and W. Feng, "Inter-block GPU communication via fast barrier synchronization," Technical Report TR-09-19, Dept. of Computer Science, Virginia Tech, 2009.


More information

CIS 467/602-01: Data Visualization

CIS 467/602-01: Data Visualization CIS 467/602-01: Data Visualization Tables Dr. David Koop Assignment 2 http://www.cis.umassd.edu/ ~dkoop/cis467/assignment2.html Plagiarism on Assignment 1 Any questions? 2 Recap (Interaction) Important

More information

MERGING DATAFRAMES WITH PANDAS. Appending & concatenating Series

MERGING DATAFRAMES WITH PANDAS. Appending & concatenating Series MERGING DATAFRAMES WITH PANDAS Appending & concatenating Series append().append(): Series & DataFrame method Invocation: s1.append(s2) Stacks rows of s2 below s1 Method for Series & DataFrames concat()

More information

Charter EZPort User Guide

Charter EZPort User Guide Charter EZPort User Guide Version 2.4 September 14, 2017 Contents Document Information... 3 Version Notice and Change Log... 3 General... 6 Getting Started...7 System Requirements... 7 Initial Access Procedure...

More information

THE LINEAR PROBABILITY MODEL: USING LEAST SQUARES TO ESTIMATE A REGRESSION EQUATION WITH A DICHOTOMOUS DEPENDENT VARIABLE

THE LINEAR PROBABILITY MODEL: USING LEAST SQUARES TO ESTIMATE A REGRESSION EQUATION WITH A DICHOTOMOUS DEPENDENT VARIABLE PLS 802 Spring 2018 Professor Jacoby THE LINEAR PROBABILITY MODEL: USING LEAST SQUARES TO ESTIMATE A REGRESSION EQUATION WITH A DICHOTOMOUS DEPENDENT VARIABLE This handout shows the log of a Stata session

More information

PulseNet Updates: Transitioning to WGS for Reference Testing and Surveillance

PulseNet Updates: Transitioning to WGS for Reference Testing and Surveillance National Center for Emerging and Zoonotic Infectious Diseases PulseNet Updates: Transitioning to WGS for Reference Testing and Surveillance Kelley Hise, MPH Enteric Diseases Laboratory Branch Division

More information

State HIE Strategic and Operational Plan Emerging Models. February 16, 2011

State HIE Strategic and Operational Plan Emerging Models. February 16, 2011 State HIE Strategic and Operational Plan Emerging Models February 16, 2011 Goals and Objectives The State HIE emerging models can be useful in a wide variety of ways, both within the ONC state-level HIE

More information

Optimization solutions for the segmented sum algorithmic function

Optimization solutions for the segmented sum algorithmic function Optimization solutions for the segmented sum algorithmic function ALEXANDRU PÎRJAN Department of Informatics, Statistics and Mathematics Romanian-American University 1B, Expozitiei Blvd., district 1, code

More information

A Capabilities Presentation

A Capabilities Presentation A Capabilities Presentation Full Systems Integrator, Value-Add-Reseller, Service Provider for the Federal, State and Local market. Founded in 2006 by former Military IT professionals with Telecommunications,

More information

Sideseadmed (IRT0040) loeng 4/2012. Avo

Sideseadmed (IRT0040) loeng 4/2012. Avo Sideseadmed (IRT0040) loeng 4/2012 Avo avots@lr.ttu.ee 1 Internet Evolution BACKBONE ACCESS NETWORKS WIRELESS NETWORKS OSI mudeli arendus 3 Access technologies PAN / CAN: Bluedooth, Zigbee, IrDA ( WiFi

More information

Prizm. manufactured by. White s Electronics, Inc Pleasant Valley Road Sweet Home, OR USA. Visit our site on the World Wide Web

Prizm. manufactured by. White s Electronics, Inc Pleasant Valley Road Sweet Home, OR USA. Visit our site on the World Wide Web Prizm II III IV * V Prizm manufactured by White s Electronics, Inc. 1011 Pleasant Valley Road Sweet Home, OR 97386 USA Visit our site on the World Wide Web www.whiteselectronics.com for the latest information

More information

Contact Center Compliance Webinar Bringing you the ANSWERS you need about compliance in your call center.

Contact Center Compliance Webinar Bringing you the ANSWERS you need about compliance in your call center. Contact Center Compliance Webinar Bringing you the ANSWERS you need about compliance in your call center. Welcome Mitch Roth Business to Business Compliance Protocols ATA General Counsel Partner Williams

More information

A POWER CHARACTERIZATION AND MANAGEMENT OF GPU GRAPH TRAVERSAL

A POWER CHARACTERIZATION AND MANAGEMENT OF GPU GRAPH TRAVERSAL A POWER CHARACTERIZATION AND MANAGEMENT OF GPU GRAPH TRAVERSAL ADAM MCLAUGHLIN *, INDRANI PAUL, JOSEPH GREATHOUSE, SRILATHA MANNE, AND SUDHKAHAR YALAMANCHILI * * GEORGIA INSTITUTE OF TECHNOLOGY AMD RESEARCH

More information

CSE 599 I Accelerated Computing - Programming GPUS. Parallel Patterns: Graph Search

CSE 599 I Accelerated Computing - Programming GPUS. Parallel Patterns: Graph Search CSE 599 I Accelerated Computing - Programming GPUS Parallel Patterns: Graph Search Objective Study graph search as a prototypical graph-based algorithm Learn techniques to mitigate the memory-bandwidth-centric

More information

ACCESS PROCESS FOR CENTRAL OFFICE ACCESS

ACCESS PROCESS FOR CENTRAL OFFICE ACCESS ACCESS PROCESS FOR CENTRAL OFFICE ACCESS NOTE: Every person doing work of any nature in the central offices MUST have an access badge. Anyone who does not have a current access badge will be escorted from

More information

The Outlook for U.S. Manufacturing

The Outlook for U.S. Manufacturing The Outlook for U.S. Manufacturing Economic Forecasting Conference J. Mack Robinson College of Business Georgia State University Atlanta, GA November 15, 2006 William Strauss Senior Economist and Economic

More information

Best Practices in Rapid Deployment of PI Infrastructure and Integration with OEM Supplied SCADA Systems

Best Practices in Rapid Deployment of PI Infrastructure and Integration with OEM Supplied SCADA Systems Best Practices in Rapid Deployment of PI Infrastructure and Integration with OEM Supplied SCADA Systems Kevin Schroeder & Mike Liska OVERVIEW Company Overview Data Background/History Challenges Solutions

More information

Introduction to GPU hardware and to CUDA

Introduction to GPU hardware and to CUDA Introduction to GPU hardware and to CUDA Philip Blakely Laboratory for Scientific Computing, University of Cambridge Philip Blakely (LSC) GPU introduction 1 / 35 Course outline Introduction to GPU hardware

More information

Performance impact of dynamic parallelism on different clustering algorithms

Performance impact of dynamic parallelism on different clustering algorithms Performance impact of dynamic parallelism on different clustering algorithms Jeffrey DiMarco and Michela Taufer Computer and Information Sciences, University of Delaware E-mail: jdimarco@udel.edu, taufer@udel.edu

More information

Double-Precision Matrix Multiply on CUDA

Double-Precision Matrix Multiply on CUDA Double-Precision Matrix Multiply on CUDA Parallel Computation (CSE 60), Assignment Andrew Conegliano (A5055) Matthias Springer (A995007) GID G--665 February, 0 Assumptions All matrices are square matrices

More information

GPU Sparse Graph Traversal

GPU Sparse Graph Traversal GPU Sparse Graph Traversal Duane Merrill (NVIDIA) Michael Garland (NVIDIA) Andrew Grimshaw (Univ. of Virginia) UNIVERSITY of VIRGINIA Breadth-first search (BFS) 1. Pick a source node 2. Rank every vertex

More information

Touch Input. CSE 510 Christian Holz Microsoft Research February 11, 2016

Touch Input. CSE 510 Christian Holz Microsoft Research   February 11, 2016 Touch Input CSE 510 Christian Holz Microsoft Research http://www.christianholz.net February 11, 2016 hall of fame/hall of shame? Nokia 5800, 2008 hall of fame/hall of shame? stylus we ve invented [Lightpen

More information

Using a Probabilistic Model to Assist Merging of Large-scale Administrative Records

Using a Probabilistic Model to Assist Merging of Large-scale Administrative Records Using a Probabilistic Model to Assist Merging of Large-scale Administrative Records Ted Enamorado Benjamin Fifield Kosuke Imai Princeton Harvard Talk at the Tech Science Seminar IQSS, Harvard University

More information

Homework Assignment #5

Homework Assignment #5 Homework Assignment #5-5, Data Mining SOLUTIONS. (a) Create a plot showing the location of each state, with longitude on the horizontal axis, latitude on the vertical axis, and the states names or abbreviations

More information

Chapter 3 Parallel Software

Chapter 3 Parallel Software Chapter 3 Parallel Software Part I. Preliminaries Chapter 1. What Is Parallel Computing? Chapter 2. Parallel Hardware Chapter 3. Parallel Software Chapter 4. Parallel Applications Chapter 5. Supercomputers

More information

DTFH61-13-C Addressing Challenges for Automation in Highway Construction

DTFH61-13-C Addressing Challenges for Automation in Highway Construction DTFH61-13-C-00026 Addressing Challenges for Automation in Highway Construction Learning Objectives Research Objectives Research Team Introduce Part I: Implementation Challenges and Success Stories Describe

More information

Expanding Transmission Capacity: Options and Implications. What role does renewable energy play in driving transmission expansion?

Expanding Transmission Capacity: Options and Implications. What role does renewable energy play in driving transmission expansion? Expanding Transmission Capacity: Options and Implications What role does renewable energy play in driving transmission expansion? Beth Soholt Director, Wind on the Wires bsoholt@windonthewires.org Office:

More information

Understanding Outstanding Memory Request Handling Resources in GPGPUs

Understanding Outstanding Memory Request Handling Resources in GPGPUs Understanding Outstanding Memory Request Handling Resources in GPGPUs Ahmad Lashgar ECE Department University of Victoria lashgar@uvic.ca Ebad Salehi ECE Department University of Victoria ebads67@uvic.ca

More information

NVIDIA GTX200: TeraFLOPS Visual Computing. August 26, 2008 John Tynefield

NVIDIA GTX200: TeraFLOPS Visual Computing. August 26, 2008 John Tynefield NVIDIA GTX200: TeraFLOPS Visual Computing August 26, 2008 John Tynefield 2 Outline Execution Model Architecture Demo 3 Execution Model 4 Software Architecture Applications DX10 OpenGL OpenCL CUDA C Host

More information

Execution Strategy and Runtime Support for Regular and Irregular Applications on Emerging Parallel Architectures

Execution Strategy and Runtime Support for Regular and Irregular Applications on Emerging Parallel Architectures Execution Strategy and Runtime Support for Regular and Irregular Applications on Emerging Parallel Architectures Xin Huo Advisor: Gagan Agrawal Motivation - Architecture Challenges on GPU architecture

More information

THE COMPARISON OF PARALLEL SORTING ALGORITHMS IMPLEMENTED ON DIFFERENT HARDWARE PLATFORMS

THE COMPARISON OF PARALLEL SORTING ALGORITHMS IMPLEMENTED ON DIFFERENT HARDWARE PLATFORMS Computer Science 14 (4) 2013 http://dx.doi.org/10.7494/csci.2013.14.4.679 Dominik Żurek Marcin Pietroń Maciej Wielgosz Kazimierz Wiatr THE COMPARISON OF PARALLEL SORTING ALGORITHMS IMPLEMENTED ON DIFFERENT

More information

Accelerated Load Balancing of Unstructured Meshes

Accelerated Load Balancing of Unstructured Meshes Accelerated Load Balancing of Unstructured Meshes Gerrett Diamond, Lucas Davis, and Cameron W. Smith Abstract Unstructured mesh applications running on large, parallel, distributed memory systems require

More information

Presentation to NANC. January 22, 2003

Presentation to NANC. January 22, 2003 Presentation to NANC January 22, 2003 Introduction Service Offering Numbering No Special Number Exhaust Issues Associated with VoIP Providers January 22, 2003 Who is Vonage? 2002 saw the introduction of

More information

Geographic Accuracy of Cell Phone RDD Sample Selected by Area Code versus Wire Center

Geographic Accuracy of Cell Phone RDD Sample Selected by Area Code versus Wire Center Geographic Accuracy of Cell Phone RDD Sample Selected by versus Xian Tao 1, Benjamin Skalland 1, David Yankey 2, Jenny Jeyarajah 2, Phil Smith 2, Meena Khare 3 1 NORC at the University of Chicago 2 National

More information

What Did You Learn? Key Terms. Key Concepts. 68 Chapter P Prerequisites

What Did You Learn? Key Terms. Key Concepts. 68 Chapter P Prerequisites 7_0P0R.qp /7/06 9:9 AM Page 68 68 Chapter P Prerequisites What Did You Learn? Key Terms real numbers, p. rational and irrational numbers, p. absolute value, p. variables, p. 6 algebraic epressions, p.

More information

Particle-in-Cell Simulations on Modern Computing Platforms. Viktor K. Decyk and Tajendra V. Singh UCLA

Particle-in-Cell Simulations on Modern Computing Platforms. Viktor K. Decyk and Tajendra V. Singh UCLA Particle-in-Cell Simulations on Modern Computing Platforms Viktor K. Decyk and Tajendra V. Singh UCLA Outline of Presentation Abstraction of future computer hardware PIC on GPUs OpenCL and Cuda Fortran

More information

A Comparative Study of Parallel Algorithms for the Girth Problem

A Comparative Study of Parallel Algorithms for the Girth Problem Proceedings of the Tenth Australasian Symposium on Parallel and Distributed Computing (AusPDC 2012), Melbourne, Australia A Comparative Study of Parallel Algorithms for the Girth Problem Michael J. Dinneen

More information

CUDA GPGPU Workshop 2012

CUDA GPGPU Workshop 2012 CUDA GPGPU Workshop 2012 Parallel Programming: C thread, Open MP, and Open MPI Presenter: Nasrin Sultana Wichita State University 07/10/2012 Parallel Programming: Open MP, MPI, Open MPI & CUDA Outline

More information

A Translation Framework for Automatic Translation of Annotated LLVM IR into OpenCL Kernel Function

A Translation Framework for Automatic Translation of Annotated LLVM IR into OpenCL Kernel Function A Translation Framework for Automatic Translation of Annotated LLVM IR into OpenCL Kernel Function Chen-Ting Chang, Yu-Sheng Chen, I-Wei Wu, and Jyh-Jiun Shann Dept. of Computer Science, National Chiao

More information

Towards Breast Anatomy Simulation Using GPUs

Towards Breast Anatomy Simulation Using GPUs Towards Breast Anatomy Simulation Using GPUs Joseph H. Chui 1, David D. Pokrajac 2, Andrew D.A. Maidment 3, and Predrag R. Bakic 4 1 Department of Radiology, University of Pennsylvania, Philadelphia PA

More information

New Approach of Bellman Ford Algorithm on GPU using Compute Unified Design Architecture (CUDA)

New Approach of Bellman Ford Algorithm on GPU using Compute Unified Design Architecture (CUDA) New Approach of Bellman Ford Algorithm on GPU using Compute Unified Design Architecture (CUDA) Pankhari Agarwal Department of Computer Science and Engineering National Institute of Technical Teachers Training

More information

Inter-Block GPU Communication via Fast Barrier Synchronization

Inter-Block GPU Communication via Fast Barrier Synchronization CS 3580 - Advanced Topics in Parallel Computing Inter-Block GPU Communication via Fast Barrier Synchronization Mohammad Hasanzadeh-Mofrad University of Pittsburgh September 12, 2017 1 General Purpose Graphics

More information

Data Structures and Algorithms for Counting Problems on Graphs using GPU

Data Structures and Algorithms for Counting Problems on Graphs using GPU International Journal of Networking and Computing www.ijnc.org ISSN 85-839 (print) ISSN 85-847 (online) Volume 3, Number, pages 64 88, July 3 Data Structures and Algorithms for Counting Problems on Graphs

More information

A GPU Implementation of Tiled Belief Propagation on Markov Random Fields. Hassan Eslami Theodoros Kasampalis Maria Kotsifakou

A GPU Implementation of Tiled Belief Propagation on Markov Random Fields. Hassan Eslami Theodoros Kasampalis Maria Kotsifakou A GPU Implementation of Tiled Belief Propagation on Markov Random Fields Hassan Eslami Theodoros Kasampalis Maria Kotsifakou BP-M AND TILED-BP 2 BP-M 3 Tiled BP T 0 T 1 T 2 T 3 T 4 T 5 T 6 T 7 T 8 4 Tiled

More information

Multipredicate Join Algorithms for Accelerating Relational Graph Processing on GPUs

Multipredicate Join Algorithms for Accelerating Relational Graph Processing on GPUs Multipredicate Join Algorithms for Accelerating Relational Graph Processing on GPUs Haicheng Wu 1, Daniel Zinn 2, Molham Aref 2, Sudhakar Yalamanchili 1 1. Georgia Institute of Technology 2. LogicBlox

More information

A new edge selection heuristic for computing the Tutte polynomial of an undirected graph.

A new edge selection heuristic for computing the Tutte polynomial of an undirected graph. FPSAC 2012, Nagoya, Japan DMTCS proc. (subm.), by the authors, 1 12 A new edge selection heuristic for computing the Tutte polynomial of an undirected graph. Michael Monagan 1 1 Department of Mathematics,

More information

An Execution Strategy and Optimized Runtime Support for Parallelizing Irregular Reductions on Modern GPUs

An Execution Strategy and Optimized Runtime Support for Parallelizing Irregular Reductions on Modern GPUs An Execution Strategy and Optimized Runtime Support for Parallelizing Irregular Reductions on Modern GPUs Xin Huo, Vignesh T. Ravi, Wenjing Ma and Gagan Agrawal Department of Computer Science and Engineering

More information

GPGPUs in HPC. VILLE TIMONEN Åbo Akademi University CSC

GPGPUs in HPC. VILLE TIMONEN Åbo Akademi University CSC GPGPUs in HPC VILLE TIMONEN Åbo Akademi University 2.11.2010 @ CSC Content Background How do GPUs pull off higher throughput Typical architecture Current situation & the future GPGPU languages A tale of

More information

Parallelising Pipelined Wavefront Computations on the GPU

Parallelising Pipelined Wavefront Computations on the GPU Parallelising Pipelined Wavefront Computations on the GPU S.J. Pennycook G.R. Mudalige, S.D. Hammond, and S.A. Jarvis. High Performance Systems Group Department of Computer Science University of Warwick

More information

Porting Performance across GPUs and FPGAs

Porting Performance across GPUs and FPGAs Porting Performance across GPUs and FPGAs Deming Chen, ECE, University of Illinois In collaboration with Alex Papakonstantinou 1, Karthik Gururaj 2, John Stratton 1, Jason Cong 2, Wen-Mei Hwu 1 1: ECE

More information

High Performance Computing on GPUs using NVIDIA CUDA

High Performance Computing on GPUs using NVIDIA CUDA High Performance Computing on GPUs using NVIDIA CUDA Slides include some material from GPGPU tutorial at SIGGRAPH2007: http://www.gpgpu.org/s2007 1 Outline Motivation Stream programming Simplified HW and

More information

INSTITUTO SUPERIOR TÉCNICO. Architectures for Embedded Computing

INSTITUTO SUPERIOR TÉCNICO. Architectures for Embedded Computing UNIVERSIDADE TÉCNICA DE LISBOA INSTITUTO SUPERIOR TÉCNICO Departamento de Engenharia Informática Architectures for Embedded Computing MEIC-A, MEIC-T, MERC Lecture Slides Version 3.0 - English Lecture 12

More information

Lecture 1: Gentle Introduction to GPUs

Lecture 1: Gentle Introduction to GPUs CSCI-GA.3033-004 Graphics Processing Units (GPUs): Architecture and Programming Lecture 1: Gentle Introduction to GPUs Mohamed Zahran (aka Z) mzahran@cs.nyu.edu http://www.mzahran.com Who Am I? Mohamed

More information

Introduction to CUDA Algoritmi e Calcolo Parallelo. Daniele Loiacono

Introduction to CUDA Algoritmi e Calcolo Parallelo. Daniele Loiacono Introduction to CUDA Algoritmi e Calcolo Parallelo References q This set of slides is mainly based on: " CUDA Technical Training, Dr. Antonino Tumeo, Pacific Northwest National Laboratory " Slide of Applied

More information

Configuring Oracle GoldenGate OGG 11gR2 local integrated capture and using OGG for mapping and transformations

Configuring Oracle GoldenGate OGG 11gR2 local integrated capture and using OGG for mapping and transformations Configuring Oracle GoldenGate OGG 11gR2 local integrated capture and using OGG for mapping and transformations In the article you will have a look at an OGG configuration example for local integrated capture

More information

Steve Stark Sales Executive Newcastle

Steve Stark Sales Executive Newcastle Theresa Lee Thermal Product Manager - Toshiba October 17, 2013 Theresa.lee@tabs.toshiba.com Copyright 2013 Toshiba Corporation. Steve Stark Sales Executive Newcastle sstark@newcastlesys.com Christine Wheeler

More information

Elevation Data Acquisition Update

Elevation Data Acquisition Update N G C E, F o r t W o r t h Elevation Data Acquisition Update November 27, 2017 Collin McCormick National Elevation Leader Data Acquisition Updates FY17/18 Contracts & Acquisitions Data Processing State

More information

ACCELERATING SELECT WHERE AND SELECT JOIN QUERIES ON A GPU

ACCELERATING SELECT WHERE AND SELECT JOIN QUERIES ON A GPU Computer Science 14 (2) 2013 http://dx.doi.org/10.7494/csci.2013.14.2.243 Marcin Pietroń Pawe l Russek Kazimierz Wiatr ACCELERATING SELECT WHERE AND SELECT JOIN QUERIES ON A GPU Abstract This paper presents

More information

CUDA PROGRAMMING MODEL Chaithanya Gadiyam Swapnil S Jadhav

CUDA PROGRAMMING MODEL Chaithanya Gadiyam Swapnil S Jadhav CUDA PROGRAMMING MODEL Chaithanya Gadiyam Swapnil S Jadhav CMPE655 - Multiple Processor Systems Fall 2015 Rochester Institute of Technology Contents What is GPGPU? What s the need? CUDA-Capable GPU Architecture

More information

Performance potential for simulating spin models on GPU

Performance potential for simulating spin models on GPU Performance potential for simulating spin models on GPU Martin Weigel Institut für Physik, Johannes-Gutenberg-Universität Mainz, Germany 11th International NTZ-Workshop on New Developments in Computational

More information

WHAT S NEW IN CHECKPOINT

WHAT S NEW IN CHECKPOINT WHAT S NEW IN CHECKPOINT This document covers the most recent Checkpoint enhancements as of September 15, 2014. STATE & LOCAL SALES TAXABILITY MATRIX: The new Sales Taxability Matrix has been added to

More information

2013 Product Catalog. Quality, affordable tax preparation solutions for professionals Preparer s 1040 Bundle... $579

2013 Product Catalog. Quality, affordable tax preparation solutions for professionals Preparer s 1040 Bundle... $579 2013 Product Catalog Quality, affordable tax preparation solutions for professionals 2013 Preparer s 1040 Bundle... $579 Includes all of the following: Preparer s 1040 Edition Preparer s 1040 All-States

More information

GPUfs: Integrating a file system with GPUs

GPUfs: Integrating a file system with GPUs GPUfs: Integrating a file system with GPUs Mark Silberstein (UT Austin/Technion) Bryan Ford (Yale), Idit Keidar (Technion) Emmett Witchel (UT Austin) 1 Traditional System Architecture Applications OS CPU

More information

Data Visualization (CIS/DSC 468)

Data Visualization (CIS/DSC 468) Data Visualization (CIS/DSC 468) Tabular Data Dr. David Koop Channel Considerations Discriminability Separability Visual Popout Weber's Law Luminance Perception 2 Separability Cannot treat all channels

More information

Exploring GPU Architecture for N2P Image Processing Algorithms

Exploring GPU Architecture for N2P Image Processing Algorithms Exploring GPU Architecture for N2P Image Processing Algorithms Xuyuan Jin(0729183) x.jin@student.tue.nl 1. Introduction It is a trend that computer manufacturers provide multithreaded hardware that strongly

More information

B. Tech. Project Second Stage Report on

B. Tech. Project Second Stage Report on B. Tech. Project Second Stage Report on GPU Based Active Contours Submitted by Sumit Shekhar (05007028) Under the guidance of Prof Subhasis Chaudhuri Table of Contents 1. Introduction... 1 1.1 Graphic

More information