Associative Operations from MASC to GPU


Mingxian Jin
Department of Mathematics and Computer Science, Fayetteville State University
1200 Murchison Road, Fayetteville, NC 28301, USA

Abstract - The Multiple Associative Computing (MASC) model is an enhanced SIMD (Single Instruction-stream Multiple Data-stream) model in the associative style for general parallel computation that has been studied for the last two decades. A number of algorithms have been developed for this model, and recent research shows that it is extremely efficient when used for real-time scheduling in air traffic control systems. Associative operations are key properties of this model; in particular, they all take constant time. In this paper, we present an implementation outline of these MASC associative operations on the popular architecture of GPUs (Graphics Processing Units). This research aims to provide a bridge, or general guidance, for converting MASC algorithms to GPU implementations. As the MASC architecture has not yet been built with today's technologies, this provides a possible way to implement MASC algorithms on an alternative platform so as to verify their correctness and efficiency, especially for massive data input.

Keywords: MASC, Associative computing, GPU, SIMD

1 Introduction

The Multiple Associative Computing (MASC) model is an enhanced SIMD (Single Instruction-stream Multiple Data-stream) model in the associative style for general parallel computation. It extends the concept of SIMD with associative properties and offers easy programming and high scalability. During the last two decades, intensive research has been conducted on this model, mainly at Kent State University. It has been shown that this model is as powerful as other well-known parallel computation models such as the PRAM (parallel random access machine) and restricted RM (reconfigurable meshes) [9]. A number of algorithms have been developed on this model across different application fields such as computational geometry, graphics, and string matching; examples are in [1, 2, 4, 8]. Most recently, it has been shown that this model is extremely efficient when used for real-time scheduling in air traffic control systems compared to its multiprocessor counterpart [12].

While more MASC algorithms are being developed, how to implement them becomes an interesting problem. Past effort has been made to build the MASC architecture, or ASC processor, using the modern technology of FPGAs (field-programmable gate arrays) [10]. This research is still in the experimental stage, with up to 52 processing units, so it is difficult to use for any massively parallel processing, which is normally expected in a MASC algorithm. A standard associative language, called ASC, has also been developed for MASC across several platforms including Goodyear/Loral/Martin-Marietta's ASPRO and Thinking Machines' CM-2 [7, 8]. However, these platforms are not accessible in today's computer lab settings. Although the ASC programming language has been emulated on both PCs and workstations running UNIX to compile and execute simple ASC programs, the running environment is restricted by the emulating software and the non-associative hardware underneath. It is therefore impossible to truly evaluate the performance of a MASC algorithm with massive data input.
We therefore look for an alternative platform that is able to emulate the MASC model so as to implement its algorithms in a massively parallel fashion with minimal efficiency loss. An ideal platform must have a similar architecture and also be easily accessible. The GPU is an excellent choice. A general-purpose computer with graphics processing units (GPU) is an emerging architecture that has attracted a lot of attention in the last few years. The original purpose of the GPU was to accelerate intensive graphics processing. Later, with the introduction of NVIDIA CUDA (Compute Unified Device Architecture), a high-level programming interface, the GPU evolved into a powerful platform supporting general-purpose parallel computation. It has been used in numerous application fields for massively data-parallel processing [6, 11]. The GPU is a typical SIMD architecture and is especially good for fine-grained, data-intensive parallel computation on large data sets. Its features make it possible to implement MASC algorithms with easy accessibility and high scalability.

Associative operations are the key properties of the MASC model. To implement a MASC algorithm on a different architecture, we need a way to execute each of these operations in the corresponding running environment. This is the contribution of this paper.

The remainder of the paper is organized as follows. Section 2 gives a description of the MASC model and related research. Section 3 provides a

brief overview of the GPU architecture with the CUDA framework. Section 4 presents implementation steps for each MASC associative operation on GPU. Section 5 concludes the paper with a general comparison between MASC and GPU and discusses future work.

2 The MASC Model

The MASC (Multiple Associative Computing) model was developed at Kent State University based on the earlier STARAN architecture at Goodyear Aerospace; the associative computing style has been studied since the early 1970s. The MASC model consists of an array of processing elements (PEs) and one or more instruction streams (ISs), each of which issues commands to a disjoint set of PEs that is partitioned dynamically. In a MASC machine, the number of ISs is normally expected to be small in comparison to the number of PEs. In this paper, we assume MASC with one IS unless explicitly specified otherwise. Detailed features of the MASC model can be found in [8]; a brief description follows.

Each PE (or cell) has a local memory and is capable of performing the usual functions of a sequential processor other than issuing instructions. An IS is logically a processor that has a bus connecting it to each cell and can send an instruction to all cells. Each cell listens to only one IS and can switch to another IS based on local data tests when multiple ISs are present. Cells can be active, inactive, or idle. An active cell executes the program steps issued by its IS, while an inactive cell only listens. An IS can instruct an inactive cell to become active again (Fig. 1).

Fig. 1 The MASC model with one instruction stream (an IS connected through a broadcast/reduction network to cells, each with a PE and local memory, linked by a cell network)

If the word length is assumed to be a constant, then the MASC model supports the following associative operations in constant time; these have all been justified in [3].

- Global reduction of OR and AND of binary values, each held by a PE.
- Global maximum and minimum of integer or real values, each held by a PE.
- Associative search, which finds all cells whose data value matches the search pattern. All data in the local memories of the cells is located by content rather than by address. The matching cells are called responders and the others are called non-responders.
- Pick-one, which is used by the IS to select (or "pick one") an arbitrary responder from the set of its active cells.
- Broadcast, which is used by the IS to instruct the selected cell to place a data item on the bus; all other cells listening to the IS receive this value in one step.

The MASC model may also include a cell network used for communications among PEs, an IS broadcast/reduction network (also called the resolver network) used for communication between the IS and the cells, and, in the case of multiple ISs, an IS network used for IS communications.

A wide range of algorithms and several large programs have been developed for the MASC model, and many of these have appeared in the literature; examples are in [1, 2, 4, 8]. Moreover, simulations between MASC and other well-known parallel computation models such as PRAM and restricted RM have been well studied and published as well (see [9] for example). Most recent research has shown that this model is extremely efficient when used for real-time scheduling in air traffic control systems compared to its multiprocessor counterpart [12].
As mentioned earlier, a standard associative language, called ASC, has been developed for MASC with one IS across several platforms including Goodyear/Loral/Martin-Marietta's ASPRO, the WaveTracer, and Thinking Machines' CM-2, and provides true portability for parallel algorithms [7]. In addition, an ASC simulator has been implemented on both PCs and workstations running UNIX. It provides an efficient and easy way to test simple programs for algorithms designed for the MASC model. However, these tools cannot be used to truly evaluate the performance of a MASC algorithm because of the restrictions of the emulating software and the non-associative hardware underneath. Since the MASC model was developed based on the early STARAN architecture that existed over 40 years ago, much effort has been made to build a new architecture based on today's FPGA technology so as to support the MASC model and implement its algorithms. In [10], a scalable ASC processor with one IS was built with FPGA technology and experimented with up to 52 PEs. It is still under study and thus difficult to utilize for executing MASC algorithms. It is therefore of rising interest to us to find an alternative platform on which to implement MASC algorithms and verify their correctness and efficiency with massive data input sizes.

GPU is an ideal choice, possessing the features of easy accessibility and high scalability.

3 The GPU with the CUDA framework

A general-purpose computer with graphics processing units (GPU) is an emerging architecture that has attracted a lot of attention in the past decade. The GPU was originally used to accelerate graphics computation and later evolved to perform general computation with the introduction of high-level programming languages and a specific framework, CUDA (Compute Unified Device Architecture), as the programming interface. It has become a very popular computing platform and has been used in numerous application fields [5, 6, 11].

Fig. 2 Architecture of GPU with CUDA (a host connected to global memory and to an array of blocks, each with its own shared memory)

A modern GPU system consists of a host computer and an array of streaming multiprocessors forming groups of building blocks as the device. The host is normally a traditional central processing unit (CPU). In each streaming multiprocessor, there is a fixed number of streaming processors that execute the same instruction stream but run on different data sets. Each streaming processor runs its own thread, as shown in Fig. 2. A CUDA-capable GPU supports several types of memory. Global memory and constant memory (not shown in the figure) can be read and written by the host, so that data can be transferred between the host and the device. Each block has its shared memory that can be accessed by all of its threads. This provides an efficient way for threads within the block to communicate by sharing their input data and intermediate results during program execution. Each individual thread has registers as its locally accessed memory (not shown in the figure).

Commonly, a GPU uses the host as the instruction stream to instruct multiple graphics processing units to perform data-intensive computation. This architecture is similar to the MASC model in that the host can be a multi-core CPU, corresponding to the multiple ISs in MASC, while the many-core GPU corresponds to the MASC cells. Specifically, the GPU executes in a SIMT (Single Instruction-stream Multiple Threads) manner in which a thread is comparable to a MASC cell.

4 Associative Operations on GPU

Implementation steps for the MASC associative operations on the GPU platform are presented in this section. We first map the data structures from MASC to GPU. Then each associative operation is discussed in turn with regard to its implementation on GPU.

4.1 Mapping of the data structures

On MASC, data is stored by content instead of by address. In particular, data is organized in a tabular format with each PE holding a group of associative data. In an associative search, the search pattern is compared against the table of stored data in a bit-serial fashion. This allows a search to be done in a massively parallel manner and much faster. For example, a graph can be stored as a table structured as its adjacency matrix, with each PE holding one row of data representing a vertex. Other data pertaining to the vertex can also be stored on the same PE for parallel processing. Alternatively, one PE can hold the data of one edge, including its weight, its two end vertices, and/or other data depending on the application needs. As another example, in an air traffic control system, each PE holds all the data pertaining to an aircraft, such as its (x, y) position, altitude, velocity, and so on. The GPU stores data in the traditional way.
A data item is identified by its memory address. In particular, on the device, a data item is stored either in shared memory, which has its own address space, or in the local memory of a thread, which is addressed by its block index and thread index. To simplify the discussion, we assume both MASC and the GPU have sufficient numbers of PEs and device threads. In order to map associative data items from a MASC PE to a GPU thread, we can use a direct index map: PE i (0 <= i < n) is assigned to the thread with global index i = b*k + t, where b is the block index, k is the number of threads per block, and t is the thread index within the block. A minimal code sketch of this mapping is given below.
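The following is a minimal CUDA sketch of this index computation, assuming a hypothetical kernel name and a per-PE data array peData used only for illustration.

    // Sketch: map MASC PE index i to a CUDA thread (illustrative only).
    // Each thread computes its global index i = b*k + t and works on the data
    // item that the corresponding MASC PE would hold in its local memory.
    __global__ void perPEKernel(const int *peData, int *result, int n)
    {
        int b = blockIdx.x;            // block index
        int k = blockDim.x;            // threads per block
        int t = threadIdx.x;           // thread index within the block
        int i = b * k + t;             // global thread index = MASC PE index

        if (i < n) {
            // The thread processes the data of "its" PE.
            result[i] = peData[i];
        }
    }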

All data that resides in the local memory of a PE is mapped to the local memory of the corresponding thread. This provides a straightforward data mapping, except for the identification number conversion.

4.2 Global reduction of OR and AND

Although CUDA provides the atomicOr() and atomicAnd() functions, these functions perform logic OR and AND on two words of data at a specified address in global memory or in shared memory. They cannot be used for the global reduction of OR and AND on a group of data items as on MASC. To perform a global reduction of OR on a group of data items on GPU, the n data items are assumed to reside originally in the local memory of each thread. A global reduction is computed in the shared memory of the device within a block and then in global memory among blocks. The final result is read by the host from global memory. We follow the idea in [5] for a global reduction computing a partial sum. Every two data items from two adjacent threads are reduced to one by moving the data items to shared memory. All even-numbered threads perform this reduction in parallel, which reduces the number of data items from n to n/2. In the next round, every two data items from two adjacent even-numbered threads are reduced, and the partial results reside in the threads whose indices are multiples of four. Iterating in this manner reduces all data items to one global result. In order to avoid thread divergence (a problem that occurs when some threads have to run a different instruction from other threads; it is usually caused by an if-then-else statement and degrades parallel performance), the data movements can be slightly changed by aligning the first half of the block with the second half. All aligned threads are reduced in pairs in parallel, and the partial results are stored in the shared memory locations corresponding to the first half of the threads. The iterations continue in the same manner until the last result is obtained. The reduction is done in place, which means the data item in shared memory is replaced by the partial result of the logic OR. A CUDA-compatible sketch of this reduction kernel is given below. After each block gets its final result, the result is sent to global memory. The final result across all blocks can be obtained using a few more logic OR operations and is then read by the host from global memory. Since this implementation uses software steps to emulate a hard-wired execution, it takes longer by a factor of O(log n) for n threads compared to the true MASC implementation. A global reduction of AND can obviously be done in a similar way.

4.3 Global reduction of maximum and minimum

On MASC, a global reduction of maximum executes a global OR in the bit-serial fashion. If the word length is ω, a global maximum takes O(ω). Since the word length is generally considered to be constant in all modern architectures, this is considered a constant-time operation as well; see [3] for a detailed discussion. On GPU, we do not use the bit-serial fashion as on MASC, because a global OR operation on GPU already uses programmed steps in its implementation, so there is no need to perform the operation bit-serially. The same approach as in Section 4.2 can be applied to implement a global reduction of maximum (or minimum): the partial maximum (or minimum) is updated by iterative comparisons, as in the sketch below.
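The following is a minimal sketch of the in-place shared-memory OR reduction described in Section 4.2, assuming a power-of-two block size; the kernel and array names are illustrative assumptions. The maximum/minimum reduction of Section 4.3 follows the same pattern with the OR replaced by a comparison.

    // Sketch: block-level OR reduction, assuming blockDim.x is a power of two.
    // Each thread copies its data item into shared memory; the block is then
    // folded in halves to avoid thread divergence, and thread 0 writes the
    // block's partial result to global memory (blockResults), where a few more
    // OR steps (or a second, smaller launch) combine the per-block results.
    __global__ void orReduce(const int *data, int *blockResults, int n)
    {
        extern __shared__ int sdata[];           // one word per thread

        unsigned int tid = threadIdx.x;
        unsigned int i = blockIdx.x * blockDim.x + tid;

        sdata[tid] = (i < n) ? data[i] : 0;      // load, padding with the OR identity
        __syncthreads();

        // Fold the first half of the block with the second half each round.
        for (unsigned int s = blockDim.x / 2; s > 0; s >>= 1) {
            if (tid < s) {
                sdata[tid] |= sdata[tid + s];    // in-place logic OR
                // For a maximum reduction (Section 4.3), replace the line above with:
                // sdata[tid] = max(sdata[tid], sdata[tid + s]);
            }
            __syncthreads();
        }

        if (tid == 0) {
            blockResults[blockIdx.x] = sdata[0]; // partial result of this block
        }
    }

Launching this kernel as orReduce<<<numBlocks, threadsPerBlock, threadsPerBlock * sizeof(int)>>>(data, blockResults, n) allocates the shared array; the per-block results can then be combined by the host or by a second, smaller kernel launch.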
The larger (or smaller) of each pair of intermediate data values is kept as the partial result for the next round of comparisons until the final result is obtained. We process the data from all threads within a block in shared memory and then process the data from all blocks. The final result is sent to global memory and read by the host. Since this reduction goes through the same iterations as the global OR/AND in Section 4.2, it takes the same extra time of O(log n) for n threads.

4.4 Broadcasting

Broadcasting on GPU is fairly easy, as the different levels of memory provide read/write access for the host and/or all units on the device(s). The data item to be broadcast can be placed by the sending thread in a specified shared memory location, and the blocks/threads can then read it directly. This does not take any extra time.

4.5 Associative search

A search on the ASC model is an operation combining broadcasting and a global reduction of OR. On MASC, the IS first broadcasts the search pattern to all PEs. Each PE compares this pattern with its local data. A PE with matching data sets a flag and is called a responder; otherwise, the PE resets the flag and is called a non-responder. A global reduction of OR is then performed and the search result is sent back to the IS. A minimal sketch of the corresponding GPU steps is given below.
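The following is a minimal, illustrative CUDA sketch of these steps on GPU, combining the broadcast of Section 4.4 with the per-thread comparison and flag setting; the kernel and variable names are assumptions, not the paper's original code. The resulting flags would then be combined by the OR reduction of Section 4.2.

    // Sketch: associative search within one block. One thread per block places
    // the search pattern in shared memory (broadcast); every thread then plays
    // the role of a MASC PE, comparing the pattern with its own data item and
    // setting or resetting its responder flag.
    __global__ void associativeSearch(const int *data, int *flags, int pattern, int n)
    {
        __shared__ int broadcastPattern;

        unsigned int i = blockIdx.x * blockDim.x + threadIdx.x;

        if (threadIdx.x == 0) {
            broadcastPattern = pattern;      // broadcast into shared memory
        }
        __syncthreads();                     // all threads in the block now see it

        if (i < n) {
            flags[i] = (data[i] == broadcastPattern) ? 1 : 0;
        }
        // A global reduction of OR over flags (Section 4.2) then tells the host
        // whether any responders exist.
    }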

A global reduction of OR on the designated flags can be performed on GPU as described in Section 4.2. The final result of 1 or 0 indicates whether the search is successful or unsuccessful. After an associative search, all responders are normally instructed to perform specified tasks while non-responders are set inactive to wait for future activation. On GPU, this can be executed by running these specific tasks on the corresponding threads and leaving the other threads idle until they are activated later. It is similar to executing an if-then-else statement on GPU in any general computation; once again, thread divergence could be a potential problem. The time complexity is dominated by the time of the global reduction of OR, which is O(log n), as broadcasting and setting the flags only take a constant number of steps performed by all threads in parallel.

4.6 PickOne

An associative search may also be followed by a PickOne operation: an arbitrary PE (or, on GPU, a thread), normally the first responder in the responder group, is selected to perform individually instructed tasks. This can be implemented using a function provided by the CUDA framework and takes the same time as that function. It is noted that responder execution after a PickOne operation can cause the thread divergence problem mentioned in Section 4.5. Depending on the algorithm design, we may expect different levels of performance degradation due to this problem.

5 Concluding remarks and Future Work

MASC and GPU present many similarities. Both are in the SIMD category of parallel computation (SIMD vs. SIMT). They are especially advantageous for fine-grained parallel processing of massively data-intensive problems. Both have a restricted ability for control parallelism, through multiple ISs and the multi-core CPU, respectively. Both are easily programmable (with ASC and CUDA, respectively). They are also highly scalable, because the array of cells and the array of threads can be easily extended, and there is normally little or no overhead due to the single instruction stream.

The two architectures have some differences as well, however. They are constructed significantly differently in hardware: MASC is a bit-serial model, whereas the GPU's processing units operate on whole words in the traditional manner. MASC allows cells to communicate through the cell network; on GPU, there is no separate network connecting blocks and threads, and all communication is through reads and writes to the different levels of memory, which is fast but lacks the flexibility to dynamically partition threads. Threads within a block can collaborate, but threads across blocks cannot. In a MASC algorithm, an advantage for representing complex data structures is the tabular representation: all data on a PE is stored and located by its content instead of by its address, which allows data to be located and search results returned in constant time through the hardware construction of the resolver network. The GPU, in contrast, locates data by memory address, in particular by block indices and thread indices.

With these similarities and differences in mind, we have presented in this paper an outlined implementation of the associative operations of the MASC model on the GPU architecture. As shown in Section 4, most of these associative operations can be implemented on GPU with an extra O(log n) efficiency loss.
This is due to the fact that software-programmed steps on GPU are used to replace the hardware wiring of the resolver (broadcast/reduction) network on MASC. These steps can be used directly to convert a MASC algorithm for implementation on the GPU-CUDA platform. Due to time and space constraints, the actual implementation results and performance analysis are not included in this paper. Future work is to analyze these results and evaluate their performance. Also, for some problems, if there exists an algorithm designed directly for GPU as well as a MASC algorithm, it would be interesting to compare the performance of the GPU algorithm with that of the converted MASC algorithm running on GPU.

6 Acknowledgement

This work is partially supported by the FSU Integrated STEM Academic Success (ISAS) program.

7 References

[1] M. M. Atwah and J. W. Baker, "An associative static and dynamic convex hull algorithm", in Proc. of the 16th International Parallel and Distributed Processing Symposium (Workshop in Massively Parallel Processing), abstract on page 249, full text on CD-ROM, April 2002.

[2] M. Esenwein and J. Baker, "VLCD string matching for associative computing and multiple broadcast mesh", in Proc. of the IASTED International Conference on Parallel and Distributed Computing and Systems.

[3] M. Jin, J. Baker, K. Batcher, "Timings for associative operations on the MASC model", in Proc. of the 15th International Parallel and Distributed Processing Symposium, IEEE Workshop on Massively Parallel Processing, San Francisco, CA, 2001.

[4] M. Jin, J. Baker, "Two graph algorithms on an associative computing model", in Proc. of the International Conference on Parallel and Distributed Processing Techniques and Applications (PDPTA), Las Vegas.

[5] D. B. Kirk and W. W. Hwu, Programming Massively Parallel Processors, Morgan Kaufmann Publishers, 2010.

[6] I. Park, N. Singhal, M. Lee, S. Cho, and C. Kim, "Design and performance evaluation of image processing algorithms on GPUs", IEEE Transactions on Parallel and Distributed Systems, Vol. 22, No. 1, January 2011.

[7] J. Potter, Associative Computing: A Programming Paradigm for Massively Parallel Computers, Plenum Press, New York, 1992.

[8] J. Potter, J. Baker, S. Scott, A. Bansal, C. Leangsuksun, C. Asthagiri, "ASC: an associative-computing paradigm", Computer, Vol. 27, No. 11, 1994.

[9] J. Trahan, M. Jin, W. Chantamas, J. Baker, "Relating the power of the multiple associative computing model (MASC) to that of reconfigurable bus-based models", Journal of Parallel and Distributed Computing, Vol. 70, No. 5, 2010.

[10] H. Wang and R. Walker, "Implementing a scalable ASC processor", in Proc. of the 17th International Parallel and Distributed Processing Symposium (Workshop in Massively Parallel Processing), April 2003.

[11] W. W. Hwu, GPU Computing Gems Emerald Edition (Applications of GPU Computing Series), Morgan Kaufmann, 1st ed., February 2011.

[12] M. Yuan, J. W. Baker, W. C. Meilander, "Comparisons of air traffic control implementations on an associative processor with a MIMD and consequences for parallel computing", Journal of Parallel and Distributed Computing, Vol. 73, No. 2, February 2013.
