Associative Operations from MASC to GPU


Mingxian Jin
Department of Mathematics and Computer Science, Fayetteville State University
1200 Murchison Road, Fayetteville, NC 28301, USA

Abstract - The Multiple Associative Computing (MASC) model is an enhanced SIMD (Single Instruction-stream Multiple Data-stream) model in the associative style for general parallel computation that has been studied for the last two decades. A number of algorithms have been developed for this model, and recent research shows that it is extremely efficient when used for real-time scheduling in air traffic control systems. Associative operations are key properties of this model; in particular, they all take constant time. In this paper, we present an implementation outline of these MASC associative operations on the popular architecture of GPUs (Graphics Processing Units). This research aims to provide a bridge, or general guidance, for converting MASC algorithms to GPU implementations. As the MASC architecture has not yet been built with today's technologies, this provides a possible way to implement MASC algorithms on an alternative platform so as to verify their correctness and efficiency, especially for massive data input.

Keywords: MASC, Associative computing, GPU, SIMD

1 Introduction

The Multiple Associative Computing (MASC) model is an enhanced SIMD (Single Instruction-stream Multiple Data-stream) model in the associative style for general parallel computation. It extends the concept of SIMD with associative properties and offers easy programming and high scalability. During the last two decades, intensive research has been conducted on this model, mainly at Kent State University. It has been shown that this model is as powerful as other well-known parallel computation models such as the PRAM (parallel random access machine) and restricted RM (reconfigurable meshes) [9]. A number of algorithms have been developed on this model across different application fields such as computational geometry, graphics, and string matching; examples are in [1, 2, 4, 8]. Most recently, it has been shown that this model is extremely efficient when used for real-time scheduling in air traffic control systems compared to its multiprocessor counterpart [12].

While more MASC algorithms are being developed, how to implement them becomes an interesting problem. Past effort has been made to build the MASC architecture, or ASC processor, using the modern technology of FPGAs (field-programmable gate arrays) [10]. This research is still in the experimental stage, with up to 52 processing units, so it is difficult to use for any massively parallel processing, which is normally expected in a MASC algorithm. A standard associative language, called ASC, has also been developed for MASC across several platforms including Goodyear/Loral/Martin-Marietta's ASPRO and Thinking Machines' CM-2 [7, 8]. However, these platforms are not accessible in today's computer lab settings. Although the ASC programming language has been emulated on both PCs and workstations running UNIX to compile and execute simple ASC programs, the running environment is restricted by the emulating software and the non-associative hardware underneath. It is therefore impossible to truly evaluate the performance of a MASC algorithm with massive data input.
We therefore look for an alternative platform that is able to emulate the MASC model so as to implement its algorithms in a massively parallel fashion with minimal efficiency loss. An ideal platform must have a similar architecture and also be easily accessible. The GPU is an excellent choice. A general-purpose computer with graphics processing units (GPU) is an emerging architecture that has attracted a lot of attention in the last few years. The original purpose of the GPU was to accelerate intensive graphics processing. Later, with the introduction of NVIDIA CUDA (Compute Unified Device Architecture), a high-level programming interface, the GPU evolved into a powerful platform supporting general-purpose parallel computation. It has been used in numerous application fields for massively data-parallel processing [6, 11]. The GPU is a typical SIMD architecture and is especially good for fine-grained, data-intensive parallel computation on large data sets. Its features make it possible to implement MASC algorithms with easy accessibility and high scalability.

Associative operations are the key properties of the MASC model. To implement a MASC algorithm on a different architecture, we need a way to execute each of these operations in the corresponding running environment. This is the contribution of this paper.

The remainder of the paper is organized as follows. Section 2 gives a description of the MASC model and related research. Section 3 provides a

brief overview of the GPU architecture with the CUDA framework. Section 4 presents implementation steps for each MASC associative operation on GPU. Section 5 concludes the paper with a general comparison between MASC and GPU and discusses future work.

2 The MASC Model

The MASC (Multiple Associative Computing) model was developed at Kent State University based on the earlier STARAN architecture at Goodyear Aerospace; the associative computing style has been studied since the early 1970s. The MASC model consists of an array of processing elements (PEs) and one or more instruction streams (ISs), each of which issues commands to a disjoint set of PEs that is partitioned dynamically. In a MASC machine, the number of ISs is normally expected to be small in comparison to the number of PEs. In this paper, we assume MASC with one IS unless explicitly specified otherwise. Detailed features of the MASC model can be found in [8]; a brief description follows.

Each PE (or cell) has a local memory and is capable of performing the usual functions of a sequential processor other than issuing instructions. An IS is logically a processor that has a bus connecting it to each cell and can send an instruction to all cells. Each cell listens to only one IS and can switch to another IS based on local data tests when multiple ISs are present. Cells can be active, inactive, or idle. An active cell executes the program steps issued by its IS, while an inactive cell only listens. An IS can instruct an inactive cell to become active again (Fig. 1).

Fig. 1 The MASC model with one instruction stream (an IS connected through a broadcast/reduction network to cells, each with a PE and local memory, linked by a cell network)

If the word length is assumed to be a constant, then the MASC model supports the following associative operations in constant time; these have all been justified in [3].

- Global reduction of OR and AND of binary values, each held by a PE.
- Global maximum and minimum of integer or real values, each held by a PE.
- Associative search, which finds all cells whose data value matches the search pattern. All data in the local memories of the cells is located by content rather than by address. The matching cells are called responders and the others are called non-responders.
- Pick-one, which is used by the IS to select (or "pick one") an arbitrary responder from the set of its active cells.
- Broadcast, which is used by the IS to instruct the selected cell to place a data item on the bus; all other cells listening to the IS receive this value in one step.

The MASC model may also include a cell network used for communications among PEs, an IS broadcast/reduction network (also called the resolver network) used for communication between the IS and the cells, and, in the case of multiple ISs, an IS network used for IS communications.

A wide range of algorithms and several large programs have been developed for the MASC model, and many of these have appeared in the literature; examples are in [1, 2, 4, 8]. Moreover, simulations between MASC and other well-known parallel computation models such as PRAM and restricted RM have been well studied and published as well (see [9] for example). Most recent research has shown that this model is extremely efficient when used for real-time scheduling in air traffic control systems compared to its multiprocessor counterpart [12].
As mentioned earlier, a standard associative language, called ASC, has been developed for MASC with one IS across several platforms including Goodyear/Loral/Martin-Marietta's ASPRO, the WaveTracer, and Thinking Machines' CM-2, and provides true portability for parallel algorithms [7]. In addition, an ASC simulator has been implemented on both PCs and workstations running UNIX. It provides an efficient and easy way to test simple programs for algorithms designed for the MASC model. However, these tools cannot be used to truly evaluate the performance of a MASC algorithm because of the restrictions of the emulating software and the non-associative hardware underneath. Since the MASC model was developed based on the early STARAN architecture that existed over 40 years ago, much effort has been made to build a new architecture based on today's FPGA technology so as to support the MASC model and implement its algorithms. In [10], a scalable ASC processor with one IS was built with FPGA technology and experimented with up to 52 PEs. It is still under study and thus difficult to utilize for executing MASC algorithms. It is therefore of rising interest to us to find an alternative platform on which to implement MASC algorithms and verify their correctness and efficiency with massive data input sizes.

GPU is an ideal choice, possessing the features of easy accessibility and high scalability.

3 The GPU with the CUDA framework

A general-purpose computer with graphics processing units (GPU) is an emerging architecture that has attracted a lot of attention in the past decade. The GPU was originally used to accelerate graphics computation and later evolved to perform general computation with the introduction of high-level programming languages and a specific framework, CUDA (Compute Unified Device Architecture), as the programming interface. It has become a very popular computing platform and has been used in numerous application fields [5, 6, 11].

Fig. 2 Architecture of GPU with CUDA (a host connected to global memory and to an array of blocks, each with its own shared memory)

A modern GPU system consists of a host computer and an array of streaming multiprocessors forming groups of building blocks as the device. The host is normally a traditional central processing unit (CPU). In each streaming multiprocessor, there is a fixed number of streaming processors that execute the same instruction stream but run on different data sets. Each streaming processor runs its own thread, as shown in Fig. 2. A CUDA-capable GPU supports several types of memory. Global memory and constant memory (not shown in the figure) can be read and written by the host, so that data can be transferred between the host and the device. Each block has its shared memory that can be accessed by all of its threads. This provides an efficient way for threads within the block to communicate by sharing their input data and intermediate results during program execution. Each individual thread has registers as its locally accessed memory (not shown in the figure).

Commonly, a GPU uses the host as the instruction stream to instruct multiple graphics processing units to perform data-intensive computation. This architecture is similar to the MASC model in that the host can be a multi-core CPU, corresponding to the multiple ISs in MASC, while the many-core GPU corresponds to the MASC cells. Specifically, the GPU executes in a SIMT (Single Instruction-stream Multiple Threads) manner in which a thread is comparable to a MASC cell.

4 Associative Operations on GPU

Implementation steps for the MASC associative operations on the GPU platform are presented in this section. We first map the data structures from MASC to GPU. Then each associative operation is discussed in turn with regard to its implementation on GPU.

4.1 Mapping of the data structures

On MASC, data is stored by content instead of by address. In particular, data is organized in a tabular format with each PE holding a group of associative data. In an associative search, the search pattern is compared against the table of stored data in a bit-serial fashion. This allows a search to be done in a massively parallel manner and much faster. For example, a graph can be stored as a table structured as its adjacency matrix, with each PE holding one row of data representing a vertex. Other data pertaining to the vertex can also be stored on the same PE for parallel processing. Alternatively, one PE can hold the data of one edge, including its weight, its two end vertices, and/or other data depending on the application needs. As another example, in an air traffic control system, each PE holds all the data pertaining to an aircraft, such as its (x, y) position, altitude, velocity, and so on. The GPU stores data in the traditional way.
A data item is identified by its memory address. In particular, on the device, a data item is stored either in shared memory, which has its own address space, or in the local memory of a thread, which is addressed by its block index and thread index. To simplify the discussion, we assume both MASC and the GPU have sufficient numbers of PEs and device threads. In order to map associative data items from a MASC PE to a GPU thread, we can use a direct index map: PE i (0 <= i < n) is assigned to the thread with global index i = b*k + t, where b is the block index, k is the number of threads per block, and t is the thread index within the block. A minimal code sketch of this mapping is given below.
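The following is a minimal CUDA sketch of this index computation, assuming a hypothetical kernel name and a per-PE data array peData used only for illustration.

    // Sketch: map MASC PE index i to a CUDA thread (illustrative only).
    // Each thread computes its global index i = b*k + t and works on the data
    // item that the corresponding MASC PE would hold in its local memory.
    __global__ void perPEKernel(const int *peData, int *result, int n)
    {
        int b = blockIdx.x;            // block index
        int k = blockDim.x;            // threads per block
        int t = threadIdx.x;           // thread index within the block
        int i = b * k + t;             // global thread index = MASC PE index

        if (i < n) {
            // The thread processes the data of "its" PE.
            result[i] = peData[i];
        }
    }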

All data that resides in the local memory of a PE is mapped to the local memory of the corresponding thread. This provides a straightforward data mapping, except for the identification number conversion.

4.2 Global reduction of OR and AND

Although CUDA provides the atomicOr() and atomicAnd() functions, these functions perform logic OR and AND on two words of data at a specified address in global memory or in shared memory. They cannot be used for the global reduction of OR and AND on a group of data items as on MASC. To perform a global reduction of OR on a group of data items on GPU, the n data items are assumed to reside originally in the local memory of each thread. A global reduction is computed in the shared memory of the device within a block and then in global memory among blocks. The final result is read by the host from global memory. We follow the idea in [5] for a global reduction computing a partial sum. Every two data items from two adjacent threads are reduced to one by moving the data items to shared memory. All even-numbered threads perform this reduction in parallel, which reduces the number of data items from n to n/2. In the next round, every two data items from two adjacent even-numbered threads are reduced, and the partial results reside in the threads whose indices are multiples of four. Iterating in this manner reduces all data items to one global result. In order to avoid thread divergence (a problem that occurs when some threads have to run a different instruction from other threads; it is usually caused by an if-then-else statement and degrades parallel performance), the data movements can be slightly changed by aligning the first half of the block with the second half. All aligned threads are reduced in pairs in parallel, and the partial results are stored in the shared memory locations corresponding to the first half of the threads. The iterations continue in the same manner until the last result is obtained. The reduction is done in place, which means the data item in shared memory is replaced by the partial result of the logic OR. A CUDA-compatible sketch of this reduction kernel is given below. After each block gets its final result, the result is sent to global memory. The final result across all blocks can be obtained using a few more logic OR operations and is then read by the host from global memory. Since this implementation uses software steps to emulate a hard-wired execution, it takes longer by a factor of O(log n) for n threads compared to the true MASC implementation. A global reduction of AND can obviously be done in a similar way.

4.3 Global reduction of maximum and minimum

On MASC, a global reduction of maximum executes a global OR in the bit-serial fashion. If the word length is ω, a global maximum takes O(ω). Since the word length is generally considered to be constant in all modern architectures, this is considered a constant-time operation as well; see [3] for a detailed discussion. On GPU, we do not use the bit-serial fashion as on MASC, because a global OR operation on GPU already uses programmed steps in its implementation, so there is no need to perform the operation bit-serially. The same approach as in Section 4.2 can be applied to implement a global reduction of maximum (or minimum): the partial maximum (or minimum) is updated by iterative comparisons, as in the sketch below.
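The following is a minimal sketch of the in-place shared-memory OR reduction described in Section 4.2, assuming a power-of-two block size; the kernel and array names are illustrative assumptions. The maximum/minimum reduction of Section 4.3 follows the same pattern with the OR replaced by a comparison.

    // Sketch: block-level OR reduction, assuming blockDim.x is a power of two.
    // Each thread copies its data item into shared memory; the block is then
    // folded in halves to avoid thread divergence, and thread 0 writes the
    // block's partial result to global memory (blockResults), where a few more
    // OR steps (or a second, smaller launch) combine the per-block results.
    __global__ void orReduce(const int *data, int *blockResults, int n)
    {
        extern __shared__ int sdata[];           // one word per thread

        unsigned int tid = threadIdx.x;
        unsigned int i = blockIdx.x * blockDim.x + tid;

        sdata[tid] = (i < n) ? data[i] : 0;      // load, padding with the OR identity
        __syncthreads();

        // Fold the first half of the block with the second half each round.
        for (unsigned int s = blockDim.x / 2; s > 0; s >>= 1) {
            if (tid < s) {
                sdata[tid] |= sdata[tid + s];    // in-place logic OR
                // For a maximum reduction (Section 4.3), replace the line above with:
                // sdata[tid] = max(sdata[tid], sdata[tid + s]);
            }
            __syncthreads();
        }

        if (tid == 0) {
            blockResults[blockIdx.x] = sdata[0]; // partial result of this block
        }
    }

Launching this kernel as orReduce<<<numBlocks, threadsPerBlock, threadsPerBlock * sizeof(int)>>>(data, blockResults, n) allocates the shared array; the per-block results can then be combined by the host or by a second, smaller kernel launch.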
The larger (or smaller) of each pair of intermediate data values is kept as the partial result for the next round of comparisons until the final result is obtained. We process the data from all threads within a block in shared memory and then process the data from all blocks. The final result is sent to global memory and read by the host. Since this reduction goes through the same iterations as the global OR/AND in Section 4.2, it takes the same extra time of O(log n) for n threads.

4.4 Broadcasting

Broadcasting on GPU is fairly easy, as the different levels of memory provide read/write access for the host and/or all units on the device(s). The data item to be broadcast can be placed by the sending thread in a specified shared memory location, and the blocks/threads can then read it directly. This does not take any extra time.

4.5 Associative search

A search on the ASC model is an operation combining broadcasting and a global reduction of OR. On MASC, the IS first broadcasts the search pattern to all PEs. Each PE compares this pattern with its local data. A PE with matching data sets a flag and is called a responder; otherwise, the PE resets the flag and is called a non-responder. A global reduction of OR is then performed and the search result is sent back to the IS. A minimal sketch of the corresponding GPU steps is given below.
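The following is a minimal, illustrative CUDA sketch of these steps on GPU, combining the broadcast of Section 4.4 with the per-thread comparison and flag setting; the kernel and variable names are assumptions, not the paper's original code. The resulting flags would then be combined by the OR reduction of Section 4.2.

    // Sketch: associative search within one block. One thread per block places
    // the search pattern in shared memory (broadcast); every thread then plays
    // the role of a MASC PE, comparing the pattern with its own data item and
    // setting or resetting its responder flag.
    __global__ void associativeSearch(const int *data, int *flags, int pattern, int n)
    {
        __shared__ int broadcastPattern;

        unsigned int i = blockIdx.x * blockDim.x + threadIdx.x;

        if (threadIdx.x == 0) {
            broadcastPattern = pattern;      // broadcast into shared memory
        }
        __syncthreads();                     // all threads in the block now see it

        if (i < n) {
            flags[i] = (data[i] == broadcastPattern) ? 1 : 0;
        }
        // A global reduction of OR over flags (Section 4.2) then tells the host
        // whether any responders exist.
    }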

A global reduction of OR on the designated flags can be performed on GPU as described in Section 4.2. The final result of 1 or 0 indicates whether the search is successful or unsuccessful. After an associative search, all responders are normally instructed to perform specified tasks while non-responders are set inactive to wait for future activation. On GPU, this can be executed by running these specific tasks on the corresponding threads and leaving the other threads idle until they are activated later. It is similar to executing an if-then-else statement on GPU in any general computation; once again, thread divergence could be a potential problem. The time complexity is dominated by the time of the global reduction of OR, which is O(log n), as broadcasting and setting the flags only take a constant number of steps performed by all threads in parallel.

4.6 PickOne

An associative search may also be followed by a PickOne operation: an arbitrary PE (or, on GPU, a thread), normally the first responder in the responder group, is selected to perform individually instructed tasks. This can be implemented using a function provided by the CUDA framework and takes the same time as that function. It is noted that responder execution after a PickOne operation can cause the thread divergence problem mentioned in Section 4.5. Depending on the algorithm design, we may expect different levels of performance degradation due to this problem.

5 Concluding remarks and Future Work

MASC and GPU present many similarities. Both are in the SIMD category of parallel computation (SIMD vs. SIMT). They are especially advantageous for fine-grained parallel processing of massively data-intensive problems. Both have a restricted ability for control parallelism, through multiple ISs and the multi-core CPU, respectively. Both are easily programmable (with ASC and CUDA, respectively). They are also highly scalable, because the array of cells and the array of threads can be easily extended, and there is normally little or no overhead due to the single instruction stream.

The two architectures have some differences as well, however. They are constructed significantly differently in hardware: MASC is a bit-serial model, whereas the GPU's processing units operate on whole words in the traditional manner. MASC allows cells to communicate through the cell network; on GPU, there is no separate network connecting blocks and threads, and all communication is through reads and writes to the different levels of memory, which is fast but lacks the flexibility to dynamically partition threads. Threads within a block can collaborate, but threads across blocks cannot. In a MASC algorithm, an advantage for representing complex data structures is the tabular representation: all data on a PE is stored and located by its content instead of by its address, which allows data to be located and search results returned in constant time through the hardware construction of the resolver network. The GPU, in contrast, locates data by memory address, in particular by block indices and thread indices.

With these similarities and differences in mind, we have presented in this paper an outlined implementation of the associative operations of the MASC model on the GPU architecture. As shown in Section 4, most of these associative operations can be implemented on GPU with an extra O(log n) efficiency loss.
This is due to the fact that software-programmed steps on GPU are used to replace the hardware wiring of the resolver (broadcast/reduction) network on MASC. These steps can be used directly to convert a MASC algorithm for implementation on the GPU-CUDA platform. Due to time and space constraints, the actual implementation results and performance analysis are not included in this paper. Future work is to analyze these results and evaluate their performance. Also, for some problems, if there exists an algorithm designed directly for GPU as well as a MASC algorithm, it would be interesting to compare the performance of the GPU algorithm with that of the converted MASC algorithm running on GPU.

6 Acknowledgement

This work is partially supported by the FSU Integrated STEM Academic Success (ISAS) program.

7 References

[1] M. M. Atwah and J. W. Baker, "An associative static and dynamic convex hull algorithm", in Proc. of the 16th International Parallel and Distributed Processing Symposium (Workshop in Massively Parallel Processing), abstract on page 249, full text on CD-ROM, April 2002.

[2] M. Esenwein and J. Baker, "VLCD string matching for associative computing and multiple broadcast mesh", in Proc. of the IASTED International Conference on Parallel and Distributed Computing and Systems.

[3] M. Jin, J. Baker, K. Batcher, "Timings for associative operations on the MASC model", in Proc. of the 15th International Parallel and Distributed Processing Symposium, IEEE Workshop on Massively Parallel Processing, San Francisco, CA, 2001.

[4] M. Jin, J. Baker, "Two graph algorithms on an associative computing model", in Proc. of the International Conference on Parallel and Distributed Processing Techniques and Applications (PDPTA), Las Vegas.

[5] D. B. Kirk and W. W. Hwu, Programming Massively Parallel Processors, Morgan Kaufmann Publishers, 2010.

[6] I. Park, N. Singhal, M. Lee, S. Cho, and C. Kim, "Design and performance evaluation of image processing algorithms on GPUs", IEEE Transactions on Parallel and Distributed Systems, Vol. 22, No. 1, January 2011.

[7] J. Potter, Associative Computing: A Programming Paradigm for Massively Parallel Computers, Plenum Press, New York, 1992.

[8] J. Potter, J. Baker, S. Scott, A. Bansal, C. Leangsuksun, C. Asthagiri, "ASC: an associative-computing paradigm", Computer, Vol. 27, No. 11, 1994.

[9] J. Trahan, M. Jin, W. Chantamas, J. Baker, "Relating the power of the multiple associative computing model (MASC) to that of reconfigurable bus-based models", Journal of Parallel and Distributed Computing, Vol. 70, No. 5, 2010.

[10] H. Wang and R. Walker, "Implementing a scalable ASC processor", in Proc. of the 17th International Parallel and Distributed Processing Symposium (Workshop in Massively Parallel Processing), April 2003.

[11] W. W. Hwu, GPU Computing Gems Emerald Edition (Applications of GPU Computing Series), Morgan Kaufmann, 1st ed., February 2011.

[12] M. Yuan, J. W. Baker, W. C. Meilander, "Comparisons of air traffic control implementations on an associative processor with a MIMD and consequences for parallel computing", Journal of Parallel and Distributed Computing, Vol. 73, No. 2, February 2013.
