Implementation and Evaluation of NAS Parallel CG Benchmark on GPU Cluster with Proprietary Interconnect TCA
|
|
- Megan Powers
- 5 years ago
- Views:
Transcription
1 Implementation an Evaluation of AS Parallel CG Benchmark on GPU Cluster with Proprietary Interconnect TCA Kazuya Matsumoto 1, orihisa Fujita 2, Toshihiro Hanawa 3, an Taisuke Boku 1,2 1 Center for Computational Sciences, University of Tsukuba 2 Grauate School of Systems an Information Engineering, University of Tsukuba 3 Information Technology Center, The University of Tokyo Abstract. We have been eveloping a proprietary interconnect technology calle Tightly Couple Accelerators (TCA) architecture to improve communication latency an banwith between accelerators (GPUs) over ifferent noes. This paper presents a Conjugate Graient (CG) benchmark implementation using the TCA an results of performance evaluation on the HA-PACS/TCA system, which is a proof-of-concept GPU cluster base on the TCA concept. The implementation is base on the CG benchmark in AS Parallel Benchmarks, an its parallelization is achieve by a two-imensional ecomposition of matrix ata. The TCA utilization improves the communication performance compare with the implementation with MPI/InfiniBan utilization for small size benchmark classes. This stuy also shows that the CG implementation with the two-imensional ecomposition is more suitable for the TCA utilization than a CG implementation with a one-imensional ecomposition to make use of the interconnect. 1 Introuction Currently, GPU clusters are wiely use as high performance computing systems. A problem of GPU clusters is that the communication spee between multiple compute noes is not fast enough compare to its high computation spee. In orer to aress this problem, we have been researching the Tightly Couple Accelerators (TCA) architecture [2]. The TCA is a technology on a proprietary interconnect network to enable irect communication between accelerators over ifferent noes. We have conucte the basic performance evaluation of the TCA [4, 1]. In [4], a Conjugate Graient (CG) metho has been implemente by utilizing allgather an allreuce collective communications with TCA s communication functions. The parallelization of the CG implementation is accomplishe by a one-imensional ecomposition of matrix ata. Results of the performance evaluation shows that the CG metho implementation using TCA outperforms the implementation using MPI/InfiniBan for sparse matrices whose matrix (problem) size is nine thousan or smaller; however, the communication performance
2 of TCA tens to become lower when either or both of the target matrix sizes an the number of utilizing processes are larger. In the present stuy, we evaluate a ifferent CG metho implementation with the TCA utilization. The CG implementation is base on the CG benchmark in AS Parallel Benchmarks. We apply a two-imensional ecomposition of matrix as the enhance implementation from [4], an present results of performance evaluation on the HA-PACS/TCA GPU cluster. Aitionally, this stuy presents communication performance ifferences between the CG implementation with the two-imensional ecomposition an the one-imensional ecomposition. 2 Tightly Couple Accelerators Architecture This section briefly explains the Tightly Couple Accelerators (TCA) architecture (see [2, 1] for etaile information on the TCA). The TCA is a novel technology of proprietary interconnect for PC clusters. The PCI Express Aaptive Communication Hub ver. 2 (PEACH2) is a prototype implementation of the TCA architecture. We can construct a cluster system by connecting the PEACH2 boars with each other. The communication using the PEACH2 is conucte only with the PCIe protocol, an the overhea time for protocol conversion, which is require in the InfiniBan, is eliminate. Consequently, the PEACH2 enables ata communication with extremely low latency. The PEACH2 provies two types of ata communication functions: PIO an DMA. In the PIO communication, the ata is transferre to a remote noe by the CPU s remote write operation. The latency of PIO is very small, an, as a result, the PIO is useful to transfer short messages. The DMA function is achieve by the DMA controller, which has four DMA channels. While the latency of DMA is larger than that of PIO, the DMA emonstrates higher maximum banwith performance. The HA-PACS (Highly Accelerate Parallel Avance system for Computational Sciences) is a GPU cluster system at the Center for Computational Sciences, University of Tsukuba. The HA-PACS/TCA is a proof-of-concept system of TCA architecture concept an a performance evaluation test-be of PEACH2 boar. The HA-PACS/TCA contains the PEACH2 as an interconnect aapter in aition to a commoity InfiniBan interconnect. Table 1 shows the specification of HA-PACS/TCA. The HA-PACS/TCA consists of four sub-clusters. Each sub-cluster is compose of 16 compute noes. The 16 noes are connecte by the PEACH2 an configure a 2 8 torus network. ote that the 64 noes of HA-PACS/TCA are connecte also by two ports of InfiniBan QDR in a fat-tree configuration with full bisection banwith. 3 Implementation The CG benchmark in AS Parallel Benchmarks (PB) is originally written in Fortran. Though several CG benchmark stuies on GPUs have been conucte [3, 8], there is no available implementations written in MPI an CUDA to our knowlege. Because of the fact, in the present stuy, we moify the MPI version
3 Table 1. HA-PACS/TCA system configuration oe configuration Motherboar SuperMicro X9DRG-QF CPU Intel Xeon E v2 2.8 GHz 2 (IvyBrige 10 cores / CPU) Memory DDR MHz 4 ch, 128 GB (=8 16 GB) Peak performance 224 Gflops / CPU GPU VIDIA Tesla K20X 732 MHz 4 (Kepler GK cores / GPU) Memory GDDR5 6 GB / GPU Peak performance 1.31 Tflops / GPU Interconnect InfiniBan: Mellanox Connect-X3 Dual-port QDR TCA: PEACH2 boar (Altera Stratix-IV GX 530 FPGA) System configuration umber of noes 64 Interconnect InfiniBan QDR 108 ports switch 2 ch Peak performance 364 Tflops /2 /2 /4 /4 /4 /2 /2 P = 2 P = 4 P = 8 P = 16 Fig. 1. Process istribution by two-imensional ecomposition of matrix ata. The number in each rectangular represents its process rank i. of CG benchmark in AS such that the main computation part (conj gra function) is written in C language an CUDA, an its inter-noe ata communications are conucte with the TCA/PEACH2. Let us enote a linear equation system as Ax = b, where A is an symmetric positive efinite matrix, an both x an b are a vector with elements in the following. The moifie implementation uses the ientical ata istribution an communication pattern to the original MPI version. The parallelization of CG benchmark is achieve by a two-imensional ecomposition of the matrix A an conformable istribution of vectors. When we efine the number of processes as P, the matrix A is two-imensionally ecompose by P = P r P c processes. Base on the ecomposition, each process contains (/P r ) (/P c ) sub-matrix of A an vectors with /P c elements. Figure 1 shows the process istribution by two-imensional ecomposition of matrix A ata for P = 2, 4, 8, 16. Almost all of computations in the implementation are carrie out by GPUs. While we implement the sparse-matrix vector multiplication (SpMV) by ourselves, our CG implementation utilizes the VIDIA s CUBLAS library for vec-
4 tor ot prouct (DOT) an vector aition (AXPY) operations. ote that the computation part is not tune so eeply since this stuy focuses on performance evaluation of ata communication. Three kins of ata communication are require in the implementation. The first require communication is to sen vector ata for obtaining the prouct of SpMV after the local SpMV computation on a GPU in each noe. The communication is conucte among P c processes in the same process row in a binary tree fashion, an thus P c communication steps are neee (each step nees to sen 8/P c Bytes of vector ata in ouble precision). The secon communication is to sen a scalar (8 Bytes) value to compute the sum of local prouct by the DOT computation. This communication is also mae among the P c processes an P c steps are require. The thir communication is to sen 8/P c Bytes of a vector for ata exchange. In the following, let us name the first, secon an thir communication as COMM SpMV, COMM DOT an COMM EXCH, respectively. The COMM DOT is scalar ata communications between CPU memories of ifferent processes an the latency for issuing communication operations occupies almost all of its communication time. The COMM SpMV an COMM EXCH are block ata communications of 8/P c Bytes between ifferent GPU memories an a communication banwith is important for high performance communication as well as the issue latency. Consiering the communication characteristics, we implement the COMM SpMV an COMM EXCH with the DMA communication function of TCA/PEACH2 an implement the COMM DOT with the PIO function. ote that, as shown in Figure 2, the DMA communication is faster than the PIO communication when message sizes are larger than 128 Bytes, which are smaller than sizes for the require block communications in any problem classes of CG benchmark. The TCA/PEACH2 configures a 2 8 torus network on a sub-cluster of HA-PACS/TCA; thus, a way of process (noe) mapping also affects the communication performance. As can be seen from Fig. 1, the CG implementation with two-imensional ecomposition requires communication among P c processes (4 processes at most when we utilize up to 16 noes). We use a noe mapping shown in Figure 3. This mapping oes not cause message ata contentions an collisions within the TCA/PEACH2 s communication network on a sub-cluster even when P = 16 cases. 4 Performance Evaluation We conuct performance measurements on a sub-cluster of the HA-PACS/TCA. A single GPU an a single CPU socket are utilize per noe 4. For comparison with the implementation using TCA/PEACH2, this section also presents the 4 This is because using two or more sub-clusters entails a hybri utilization of the TCA/PEACH2 an MPI/IB, an because two or more GPUs usage requires aitional consierations to use the TCA/PEACH2 effectively. Both of the hybri utilization an the multi GPU usage are among our future work.
5 WhWh W,W/K WhWh W,D WhWhDW// 'Wh'Wh W,D e e e eee D 'Wh'Wh DW// Fig. 2. Ping-pong communication performance between two neighboring noes on the HA-PACS/TCA. Fig. 3. oe mapping optimize for CG benchmark implementation on a sub-cluster of HA- PACS/TCA. Circles represent compute noes, lines between circles represent ata links, an the number correspons to its process rank i. performance of an implementation using MPI/InfiniBan (MPI/IB) for internoe communications. We use the MVAPICH2 GDR 2.1a (MV2GDR) MPI library implementation [6]. As with the TCA/PEACH2, the MV2GDR utilizes the GPUDirect for RDMA (GDR) technology [5] for irect communication between GPUs. ote that the theoretical peak banwith of TCA/PEACH2 (PCIe Gen2 x8) is twice lower than that of MPI/IB (ual-rail InfiniBan QDR 5 ); therefore, the implementation using TCA/PEACH2 is outperforme when message sizes become large. On the conition where the program is compile by Intel C compiler with MV2GDR 2.1a an CUDA 6.5 usage, the TCA/PEACH2 is faster for message sizes up to 64 KB than the MPI/IB in terms of the ping-pong communication performance as shown in Fig. 2. In the CG benchmark, the problem sizes (CLASS) an the number of processes (P ) can be esignate. This section presents results of performance evaluations for CLASS=S, W, A, B an P = 2, 4, 8, 16. The problem (matrix/vector) size is 1,400 for CLASS=S, 7,000 for CLASS=W, 14,000 for CLASS=A, an 75,000 for CLASS=B. We measure the consuming time for each computation/communication operation in the conj gra program function. Figure 4 shows the measure time breakown on average time of ten times calls to the function. ote that this is the breakown of process rank 0 an the communication time is the communication wait time. A single call of the conj gra function inclues 26 P c of COMM SpMV, 52 P c of COMM DOT, 26 of COMM EXCH, 26 of SpMV, 52 of DOT, an P c of AXPY operations. In Fig. 4, the performance is shown for the implementation in Fortran (original PB-MPI 5 The theoretical peak banwith of the ual-rail InfiniBan QDR is 8 GB/s, which is equivalent to that of PCIe Gen3 x8.
6 e e, / / W, / / & D W, / / W & D W, / / W & D W W & D W KDD^Ds KDDK KDDy, ywz K ^Ds D K e e, / / W, / / & D W, / / W & D W, / / W & D W W & D W Wl Wl Wle Wle Wl Wl Wle Wle KDD^Ds KDDK KDDy, ywz e e, / / W, / / & D W, / / W & D W, / / W & D W W & D W K ^Ds D K, / / W, / / & D W, / / W & D W, / / W & D W W & D W Wl Wl Wle Wle Wl Wl Wle Wle Fig. 4. Time breakown on a single conj gra function call of CG benchmark. In each figure, the performance (time in microsecon) is shown for the implementation in Fortran (original PB-MPI coe), C language, CUDA with TCA/PEACH2 communication, an CUDA with MPI/IB communication. The upper plot results for several results in Fortran an C, such as CLASS=B & P=2 case, are cut for simplicity. Fig. 5. Time breakown of CG implementation with one-imensional ecomposition coe), C language, CUDA with TCA/PEACH2 communication, an CUDA with MPI/IB communication. The Fortran implementation is faster than the C implementation (similar performance results were reporte for the PB-LU benchmark by Pennycook et al. [7]). Compare with the original Fortran implementation, the GPU/CUDA usage for PB implementation eteriorates the performance in several cases
7 (particularly in cases of smaller size class an larger number of processes utilization). This is mainly ue to that the GPU usage brings bigger overheas for the CUDA kernel invocations 6 an for the inter-noe communication call between GPUs. The TCA/PEACH2 utilization contributes the performance improvement for the three small size classes (CLASS=S, CLASS=W, an CLASS=A) compare with the MPI/IB utilization. Especially, the communication performance is 3.00 times higher in the case CLASS=S with P = 16, an the overall performance incluing the computation time an communication time is 1.24 times higher. The large performance improvement by TCA/PEACH2 is erive from improvements of COMM SpMV an COMM DOT. The communication time is 2.42 times shorter for COMM SpMV an 4.72 times shorter for COMM DOT in this case. In the case CLASS=A with P = 16, the communication performance using the TCA/PEACH2 is 1.44 times higher an the overall performance is 1.12 times higher. The performance of COMM DOT with TCA/PEACH2 is higher in all four size cases since the communication latency mostly ecies the performance for the scalar ata communication. However, the TCA/PEACH2 is not always effective. The performance for CLASS=B is almost same. In the CLASS=B case, the COMM EXCH is the main performance bottleneck. To see how the performance is ifferent between the CG benchmark implementation with two-imensional ecomposition (2D-CG) in the present stuy an the CG implementation with one-imensional ecomposition (1D-CG) in our previous stuy [4], we aitionally measure the performance of 1D-CG implementation on the equivalent conition an cases (1D-CG is implemente by ourselves an not in the AS Parallel Benchmarks). Figure 5 shows the measure time breakown of the 1D-CG for matrices corresponing CLASS=S an CLASS=B. The 1D-CG utilizes allgather an allreuce collective communications (see [4] for implementation etails). All P processes are involve for both the collective communications in 1D-CG, whereas P c processes are involve at most in 2D-CG. Since the network topology of TCA/PEACH2 is 2 8 torus network, collisions within the communication network cannot be avoie for P = 16 cases in 1D-CG. As shown in Fig. 5, the communication time in 1D-CG becomes larger when P increases. In aition, the largest message size of 1D-CG is larger than that of 2D-CG for P = 8, 16 cases (the size is 8/2 Bytes in 1D-CG an 8/P c in 2D-CG). This fact is isavantage for large matrix sizes (CLASS=B) an relative performance ifferences between TCA/PEACH2 an MPI/IB is large compare with 2D-CG implementation. In general, the message size of each communication in 2D-CG is shorter than or equal to that in 1D-CG for corresponing communication. The performance egraation with shorter message size is serious in MPI/IB while TCA/PEACH2 provies a goo performance thanks to its very small latency. Thus, the combination of such short messages an avoiance of message collision on the torus network of TCA/PEACH2 leas this performance improvement on 2D-CG benchmark. 6 A CUDA kernel invocation at least takes 9.7 µsec, incluing the CUDA stream synchronization time, in our measurement.
8 5 Conclusion The present stuy has utilize the TCA/PEACH2 for an implementation of AS Parallel CG benchmark an conucte its performance evaluation on the HA-PACS/TCA GPU cluster. Results of the performance evaluation show that the CG implementation with the parallelization by a two-imensional ecomposition of matrix ata oes not cause message ata collisions within the communication network of TCA/PEACH when processes are properly mappe to noes; an the present CG implementation is consiere to be better suite for the TCA/PEACH2 utilization than the previous CG metho implementation with a one-imensional ecomposition [4]. The performance improvement over MPI/IB utilization is ue to the very small latency of TCA/PEACH2. We will continue researches on the TCA with the view that reucing the latency between accelerators by irect communication is important for strong-scaling computing. Acknowlegements. The present stuy was supporte by the Japan Science an Technology Agency s CREST program entitle Research an Development of Unifie Environment on Accelerate Computing an Interconnection for Post- Petascale Era. The authors woul like to thank the Center for Computational Sciences, University of Tsukuba for allowing us to use the HA-PACS/TCA system as part of the interisciplinary Collaborative Research Program. References 1. Hanawa, T., Fujii, H., Fujita,., Oajima, T., Matsumoto, K., Boku, T.: Evaluation of FFT for GPU Cluster Using Tightly Couple Accelerators Architecture. In: Proc. Cluster pp IEEE (2015) 2. Hanawa, T., Koama, Y., Boku, T., Sato, M.: Tightly Couple Accelerators Architecture for Minimizing Communication Latency among Accelerators. In: Proc. IPDPSW pp IEEE (2013) 3. Lee, S., Vetter, J.S.: Early evaluation of irective-base GPU programming moels for prouctive exascale computing. In: Proc. SC 12 (2012) 4. Matsumoto, K., Hanawa, T., Koama, Y., Fujii, H., Boku, T.: Implementation of CG Metho on GPU Cluster with Proprietary Interconnect TCA for GPU Direct Communication. In: Proc. IPDPSW pp IEEE (2015) 5. VIDIA: VIDIA GPUDirect (Accesse April 25, 2016), 6. Pana, D.K.: MVAPICH2-GDR (MVAPICH2 with GPUDirect RDMA) (Accesse April 25, 2016), 7. Pennycook, S.J., Hammon, S.D., Jarvis, S.A., Mualige, G.R.: Performance Analysis of a Hybri MPI/CUDA Implementation of the AS- LU Benchmark. SIGMETRICS Perform. Eval. Rev. 38(4), (2011), 8. Xu, R., Tian, X., Charasekaran, S., Yan, Y., Chapman, B.: AS Parallel Benchmarks for GPGPUs Using a Directive-Base Programming Moel. In: Languages an Compilers for Parallel Computing. LCS, vol. 8967, pp Springer (2015)
Interconnection Network for Tightly Coupled Accelerators Architecture
Interconnection Network for Tightly Coupled Accelerators Architecture Toshihiro Hanawa, Yuetsu Kodama, Taisuke Boku, Mitsuhisa Sato Center for Computational Sciences University of Tsukuba, Japan 1 What
More informationHA-PACS/TCA: Tightly Coupled Accelerators for Low-Latency Communication between GPUs
HA-PACS/TCA: Tightly Coupled Accelerators for Low-Latency Communication between GPUs Yuetsu Kodama Division of High Performance Computing Systems Center for Computational Sciences University of Tsukuba,
More informationTightly Coupled Accelerators Architecture
Tightly Coupled Accelerators Architecture Yuetsu Kodama Division of High Performance Computing Systems Center for Computational Sciences University of Tsukuba, Japan 1 What is Tightly Coupled Accelerators
More informationTightly Coupled Accelerators with Proprietary Interconnect and Its Programming and Applications
1 Tightly Coupled Accelerators with Proprietary Interconnect and Its Programming and Applications Toshihiro Hanawa Information Technology Center, The University of Tokyo Taisuke Boku Center for Computational
More informationAlmost Disjunct Codes in Large Scale Multihop Wireless Network Media Access Control
Almost Disjunct Coes in Large Scale Multihop Wireless Network Meia Access Control D. Charles Engelhart Anan Sivasubramaniam Penn. State University University Park PA 682 engelhar,anan @cse.psu.eu Abstract
More information6.823 Computer System Architecture. Problem Set #3 Spring 2002
6.823 Computer System Architecture Problem Set #3 Spring 2002 Stuents are strongly encourage to collaborate in groups of up to three people. A group shoul han in only one copy of the solution to the problem
More informationLoop Scheduling and Partitions for Hiding Memory Latencies
Loop Scheuling an Partitions for Hiing Memory Latencies Fei Chen Ewin Hsing-Mean Sha Dept. of Computer Science an Engineering University of Notre Dame Notre Dame, IN 46556 Email: fchen,esha @cse.n.eu Tel:
More informationYet Another Parallel Hypothesis Search for Inverse Entailment Hiroyuki Nishiyama and Hayato Ohwada Faculty of Sci. and Tech. Tokyo University of Scien
Yet Another Parallel Hypothesis Search for Inverse Entailment Hiroyuki Nishiyama an Hayato Ohwaa Faculty of Sci. an Tech. Tokyo University of Science, 2641 Yamazaki, Noa-shi, CHIBA, 278-8510, Japan hiroyuki@rs.noa.tus.ac.jp,
More informationTable-based division by small integer constants
Table-base ivision by small integer constants Florent e Dinechin, Laurent-Stéphane Diier LIP, Université e Lyon (ENS-Lyon/CNRS/INRIA/UCBL) 46, allée Italie, 69364 Lyon Ceex 07 Florent.e.Dinechin@ens-lyon.fr
More informationQueueing Model and Optimization of Packet Dropping in Real-Time Wireless Sensor Networks
Queueing Moel an Optimization of Packet Dropping in Real-Time Wireless Sensor Networks Marc Aoun, Antonios Argyriou, Philips Research, Einhoven, 66AE, The Netherlans Department of Computer an Communication
More informationT2K & HA-PACS Projects Supercomputers at CCS
T2K & HA-PACS Projects Supercomputers at CCS Taisuke Boku Deputy Director, HPC Division Center for Computational Sciences University of Tsukuba Two Streams of Supercomputers at CCS Service oriented general
More informationIntensive Hypercube Communication: Prearranged Communication in Link-Bound Machines 1 2
This paper appears in J. of Parallel an Distribute Computing 10 (1990), pp. 167 181. Intensive Hypercube Communication: Prearrange Communication in Link-Boun Machines 1 2 Quentin F. Stout an Bruce Wagar
More informationCoupling the User Interfaces of a Multiuser Program
Coupling the User Interfaces of a Multiuser Program PRASUN DEWAN University of North Carolina at Chapel Hill RAJIV CHOUDHARY Intel Corporation We have evelope a new moel for coupling the user-interfaces
More informationTowards a Low-Power Accelerator of Many FPGAs for Stencil Computations
2012 Thir International Conference on Networking an Computing Towars a Low-Power Accelerator of Many FPGAs for Stencil Computations Ryohei Kobayashi Tokyo Institute of Technology, Japan E-mail: kobayashi@arch.cs.titech.ac.jp
More informationRandom Clustering for Multiple Sampling Units to Speed Up Run-time Sample Generation
DEIM Forum 2018 I4-4 Abstract Ranom Clustering for Multiple Sampling Units to Spee Up Run-time Sample Generation uzuru OKAJIMA an Koichi MARUAMA NEC Solution Innovators, Lt. 1-18-7 Shinkiba, Koto-ku, Tokyo,
More informationShift-map Image Registration
Shift-map Image Registration Svärm, Linus; Stranmark, Petter Unpublishe: 2010-01-01 Link to publication Citation for publishe version (APA): Svärm, L., & Stranmark, P. (2010). Shift-map Image Registration.
More informationParticle Swarm Optimization Based on Smoothing Approach for Solving a Class of Bi-Level Multiobjective Programming Problem
BULGARIAN ACADEMY OF SCIENCES CYBERNETICS AND INFORMATION TECHNOLOGIES Volume 17, No 3 Sofia 017 Print ISSN: 1311-970; Online ISSN: 1314-4081 DOI: 10.1515/cait-017-0030 Particle Swarm Optimization Base
More informationTransient analysis of wave propagation in 3D soil by using the scaled boundary finite element method
Southern Cross University epublications@scu 23r Australasian Conference on the Mechanics of Structures an Materials 214 Transient analysis of wave propagation in 3D soil by using the scale bounary finite
More informationStudy of Network Optimization Method Based on ACL
Available online at www.scienceirect.com Proceia Engineering 5 (20) 3959 3963 Avance in Control Engineering an Information Science Stuy of Network Optimization Metho Base on ACL Liu Zhian * Department
More informationSURVIVABLE IP OVER WDM: GUARANTEEEING MINIMUM NETWORK BANDWIDTH
SURVIVABLE IP OVER WDM: GUARANTEEEING MINIMUM NETWORK BANDWIDTH Galen H Sasaki Dept Elec Engg, U Hawaii 2540 Dole Street Honolul HI 96822 USA Ching-Fong Su Fuitsu Laboratories of America 595 Lawrence Expressway
More informationPerformance Modelling of Necklace Hypercubes
erformance Moelling of ecklace ypercubes. Meraji,,. arbazi-aza,, A. atooghy, IM chool of Computer cience & harif University of Technology, Tehran, Iran {meraji, patooghy}@ce.sharif.eu, aza@ipm.ir a Abstract
More informationFast Fractal Image Compression using PSO Based Optimization Techniques
Fast Fractal Compression using PSO Base Optimization Techniques A.Krishnamoorthy Visiting faculty Department Of ECE University College of Engineering panruti rishpci89@gmail.com S.Buvaneswari Visiting
More informationInterconnect Your Future
Interconnect Your Future Gilad Shainer 2nd Annual MVAPICH User Group (MUG) Meeting, August 2014 Complete High-Performance Scalable Interconnect Infrastructure Comprehensive End-to-End Software Accelerators
More informationarxiv: v1 [physics.comp-ph] 4 Nov 2013
arxiv:1311.0590v1 [physics.comp-ph] 4 Nov 2013 Performance of Kepler GTX Titan GPUs and Xeon Phi System, Weonjong Lee, and Jeonghwan Pak Lattice Gauge Theory Research Center, CTP, and FPRD, Department
More informationGPU-centric communication for improved efficiency
GPU-centric communication for improved efficiency Benjamin Klenk *, Lena Oden, Holger Fröning * * Heidelberg University, Germany Fraunhofer Institute for Industrial Mathematics, Germany GPCDP Workshop
More informationSupporting Fully Adaptive Routing in InfiniBand Networks
XIV JORNADAS DE PARALELISMO - LEGANES, SEPTIEMBRE 200 1 Supporting Fully Aaptive Routing in InfiniBan Networks J.C. Martínez, J. Flich, A. Robles, P. López an J. Duato Resumen InfiniBan is a new stanar
More informationComputer Organization
Computer Organization Douglas Comer Computer Science Department Purue University 250 N. University Street West Lafayette, IN 47907-2066 http://www.cs.purue.eu/people/comer Copyright 2006. All rights reserve.
More informationExploiting GPU Caches in Sparse Matrix Vector Multiplication. Yusuke Nagasaka Tokyo Institute of Technology
Exploiting GPU Caches in Sparse Matrix Vector Multiplication Yusuke Nagasaka Tokyo Institute of Technology Sparse Matrix Generated by FEM, being as the graph data Often require solving sparse linear equation
More information5th International Conference on Advanced Design and Manufacturing Engineering (ICADME 2015)
5th International Conference on Avance Design an Manufacturing Engineering (ICADME 25) Research on motion characteristics an application of multi egree of freeom mechanism base on R-W metho Xiao-guang
More informationAn Algorithm for Building an Enterprise Network Topology Using Widespread Data Sources
An Algorithm for Builing an Enterprise Network Topology Using Wiesprea Data Sources Anton Anreev, Iurii Bogoiavlenskii Petrozavosk State University Petrozavosk, Russia {anreev, ybgv}@cs.petrsu.ru Abstract
More informationOffloading Cellular Traffic through Opportunistic Communications: Analysis and Optimization
1 Offloaing Cellular Traffic through Opportunistic Communications: Analysis an Optimization Vincenzo Sciancalepore, Domenico Giustiniano, Albert Banchs, Anreea Picu arxiv:1405.3548v1 [cs.ni] 14 May 24
More informationLab work #8. Congestion control
TEORÍA DE REDES DE TELECOMUNICACIONES Grao en Ingeniería Telemática Grao en Ingeniería en Sistemas e Telecomunicación Curso 2015-2016 Lab work #8. Congestion control (1 session) Author: Pablo Pavón Mariño
More informationMILC Performance Benchmark and Profiling. April 2013
MILC Performance Benchmark and Profiling April 2013 Note The following research was performed under the HPC Advisory Council activities Special thanks for: HP, Mellanox For more information on the supporting
More information2008 International ANSYS Conference
2008 International ANSYS Conference Maximizing Productivity With InfiniBand-Based Clusters Gilad Shainer Director of Technical Marketing Mellanox Technologies 2008 ANSYS, Inc. All rights reserved. 1 ANSYS,
More informationCluster Center Initialization Method for K-means Algorithm Over Data Sets with Two Clusters
Available online at www.scienceirect.com Proceia Engineering 4 (011 ) 34 38 011 International Conference on Avances in Engineering Cluster Center Initialization Metho for K-means Algorithm Over Data Sets
More informationShift-map Image Registration
Shift-map Image Registration Linus Svärm Petter Stranmark Centre for Mathematical Sciences, Lun University {linus,petter}@maths.lth.se Abstract Shift-map image processing is a new framework base on energy
More informationClassifying Facial Expression with Radial Basis Function Networks, using Gradient Descent and K-means
Classifying Facial Expression with Raial Basis Function Networks, using Graient Descent an K-means Neil Allrin Department of Computer Science University of California, San Diego La Jolla, CA 9237 nallrin@cs.ucs.eu
More informationA Multi-class SVM Classifier Utilizing Binary Decision Tree
Informatica 33 (009) 33-41 33 A Multi-class Classifier Utilizing Binary Decision Tree Gjorgji Mazarov, Dejan Gjorgjevikj an Ivan Chorbev Department of Computer Science an Engineering Faculty of Electrical
More informationMultilevel Linear Dimensionality Reduction using Hypergraphs for Data Analysis
Multilevel Linear Dimensionality Reuction using Hypergraphs for Data Analysis Haw-ren Fang Department of Computer Science an Engineering University of Minnesota; Minneapolis, MN 55455 hrfang@csumneu ABSTRACT
More informationNon-homogeneous Generalization in Privacy Preserving Data Publishing
Non-homogeneous Generalization in Privacy Preserving Data Publishing W. K. Wong, Nios Mamoulis an Davi W. Cheung Department of Computer Science, The University of Hong Kong Pofulam Roa, Hong Kong {wwong2,nios,cheung}@cs.hu.h
More informationParallel Directionally Split Solver Based on Reformulation of Pipelined Thomas Algorithm
NASA/CR-1998-208733 ICASE Report No. 98-45 Parallel Directionally Split Solver Base on Reformulation of Pipeline Thomas Algorithm A. Povitsky ICASE, Hampton, Virginia Institute for Computer Applications
More informationGeneralized Edge Coloring for Channel Assignment in Wireless Networks
Generalize Ege Coloring for Channel Assignment in Wireless Networks Chun-Chen Hsu Institute of Information Science Acaemia Sinica Taipei, Taiwan Da-wei Wang Jan-Jan Wu Institute of Information Science
More informationSocially-optimal ISP-aware P2P Content Distribution via a Primal-Dual Approach
Socially-optimal ISP-aware P2P Content Distribution via a Primal-Dual Approach Jian Zhao, Chuan Wu The University of Hong Kong {jzhao,cwu}@cs.hku.hk Abstract Peer-to-peer (P2P) technology is popularly
More informationOn the Role of Multiply Sectioned Bayesian Networks to Cooperative Multiagent Systems
On the Role of Multiply Sectione Bayesian Networks to Cooperative Multiagent Systems Y. Xiang University of Guelph, Canaa, yxiang@cis.uoguelph.ca V. Lesser University of Massachusetts at Amherst, USA,
More informationSkyline Community Search in Multi-valued Networks
Syline Community Search in Multi-value Networs Rong-Hua Li Beijing Institute of Technology Beijing, China lironghuascut@gmail.com Jeffrey Xu Yu Chinese University of Hong Kong Hong Kong, China yu@se.cuh.eu.h
More informationEFFICIENT ON-LINE TESTING METHOD FOR A FLOATING-POINT ADDER
FFICINT ON-LIN TSTING MTHOD FOR A FLOATING-POINT ADDR A. Droz, M. Lobachev Department of Computer Systems, Oessa State Polytechnic University, Oessa, Ukraine Droz@ukr.net, Lobachev@ukr.net Abstract In
More informationTight Wavelet Frame Decomposition and Its Application in Image Processing
ITB J. Sci. Vol. 40 A, No., 008, 151-165 151 Tight Wavelet Frame Decomposition an Its Application in Image Processing Mahmu Yunus 1, & Henra Gunawan 1 1 Analysis an Geometry Group, FMIPA ITB, Banung Department
More informationLatest Advances in MVAPICH2 MPI Library for NVIDIA GPU Clusters with InfiniBand
Latest Advances in MVAPICH2 MPI Library for NVIDIA GPU Clusters with InfiniBand Presentation at GTC 2014 by Dhabaleswar K. (DK) Panda The Ohio State University E-mail: panda@cse.ohio-state.edu http://www.cse.ohio-state.edu/~panda
More informationAnyTraffic Labeled Routing
AnyTraffic Labele Routing Dimitri Papaimitriou 1, Pero Peroso 2, Davie Careglio 2 1 Alcatel-Lucent Bell, Antwerp, Belgium Email: imitri.papaimitriou@alcatel-lucent.com 2 Universitat Politècnica e Catalunya,
More informationMORA: a Movement-Based Routing Algorithm for Vehicle Ad Hoc Networks
: a Movement-Base Routing Algorithm for Vehicle A Hoc Networks Fabrizio Granelli, Senior Member, Giulia Boato, Member, an Dzmitry Kliazovich, Stuent Member Abstract Recent interest in car-to-car communications
More informationThe Journal of Systems and Software
The Journal of Systems an Software 83 (010) 1864 187 Contents lists available at ScienceDirect The Journal of Systems an Software journal homepage: www.elsevier.com/locate/jss Embeing capacity raising
More informationDisjoint Multipath Routing in Dual Homing Networks using Colored Trees
Disjoint Multipath Routing in Dual Homing Networks using Colore Trees Preetha Thulasiraman, Srinivasan Ramasubramanian, an Marwan Krunz Department of Electrical an Computer Engineering University of Arizona,
More informationA New Search Algorithm for Solving Symmetric Traveling Salesman Problem Based on Gravity
Worl Applie Sciences Journal 16 (10): 1387-1392, 2012 ISSN 1818-4952 IDOSI Publications, 2012 A New Search Algorithm for Solving Symmetric Traveling Salesman Problem Base on Gravity Aliasghar Rahmani Hosseinabai,
More informationExploiting Full Potential of GPU Clusters with InfiniBand using MVAPICH2-GDR
Exploiting Full Potential of GPU Clusters with InfiniBand using MVAPICH2-GDR Presentation at Mellanox Theater () Dhabaleswar K. (DK) Panda - The Ohio State University panda@cse.ohio-state.edu Outline Communication
More informationNew Geometric Interpretation and Analytic Solution for Quadrilateral Reconstruction
New Geometric Interpretation an Analytic Solution for uarilateral Reconstruction Joo-Haeng Lee Convergence Technology Research Lab ETRI Daejeon, 305 777, KOREA Abstract A new geometric framework, calle
More informationEffect of GPU Communication-Hiding for SpMV Using OpenACC
ICCM2014 28-30 th July, Cambridge, England Effect of GPU Communication-Hiding for SpMV Using OpenACC *Olav Aanes Fagerlund¹, Takeshi Kitayama 2,3, Gaku Hashimoto 2 and Hiroshi Okuda 2 1 Department of Systems
More informationn N c CIni.o ewsrg.au
@NCInews NCI and Raijin National Computational Infrastructure 2 Our Partners General purpose, highly parallel processors High FLOPs/watt and FLOPs/$ Unit of execution Kernel Separate memory subsystem GPGPU
More informationIndexing the Edges A simple and yet efficient approach to high-dimensional indexing
Inexing the Eges A simple an yet efficient approach to high-imensional inexing Beng Chin Ooi Kian-Lee Tan Cui Yu Stephane Bressan Department of Computer Science National University of Singapore 3 Science
More informationCUDA Accelerated Linpack on Clusters. E. Phillips, NVIDIA Corporation
CUDA Accelerated Linpack on Clusters E. Phillips, NVIDIA Corporation Outline Linpack benchmark CUDA Acceleration Strategy Fermi DGEMM Optimization / Performance Linpack Results Conclusions LINPACK Benchmark
More informationImage Segmentation using K-means clustering and Thresholding
Image Segmentation using Kmeans clustering an Thresholing Preeti Panwar 1, Girhar Gopal 2, Rakesh Kumar 3 1M.Tech Stuent, Department of Computer Science & Applications, Kurukshetra University, Kurukshetra,
More informationReversible Image Watermarking using Bit Plane Coding and Lifting Wavelet Transform
250 IJCSNS International Journal of Computer Science an Network Security, VOL.9 No.11, November 2009 Reversible Image Watermarking using Bit Plane Coing an Lifting Wavelet Transform S. Kurshi Jinna, Dr.
More information6 Gradient Descent. 6.1 Functions
6 Graient Descent In this topic we will iscuss optimizing over general functions f. Typically the function is efine f : R! R; that is its omain is multi-imensional (in this case -imensional) an output
More informationA Duality Based Approach for Realtime TV-L 1 Optical Flow
A Duality Base Approach for Realtime TV-L 1 Optical Flow C. Zach 1, T. Pock 2, an H. Bischof 2 1 VRVis Research Center 2 Institute for Computer Graphics an Vision, TU Graz Abstract. Variational methos
More informationPerformance Optimizations via Connect-IB and Dynamically Connected Transport Service for Maximum Performance on LS-DYNA
Performance Optimizations via Connect-IB and Dynamically Connected Transport Service for Maximum Performance on LS-DYNA Pak Lui, Gilad Shainer, Brian Klaff Mellanox Technologies Abstract From concept to
More informationBackpressure-based Packet-by-Packet Adaptive Routing in Communication Networks
1 Backpressure-base Packet-by-Packet Aaptive Routing in Communication Networks Eleftheria Athanasopoulou, Loc Bui, Tianxiong Ji, R. Srikant, an Alexaner Stolyar Abstract Backpressure-base aaptive routing
More informationLearning Subproblem Complexities in Distributed Branch and Bound
Learning Subproblem Complexities in Distribute Branch an Boun Lars Otten Department of Computer Science University of California, Irvine lotten@ics.uci.eu Rina Dechter Department of Computer Science University
More informationInvestigation into a new incremental forming process using an adjustable punch set for the manufacture of a doubly curved sheet metal
991 Investigation into a new incremental forming process using an ajustable punch set for the manufacture of a oubly curve sheet metal S J Yoon an D Y Yang* Department of Mechanical Engineering, Korea
More informationAn Extension of XcalableMP PGAS Lanaguage for Multi-node GPU Clusters
An Extension of XcalableMP PGAS Lanaguage for Multi-node Clusters Jinpil Lee, Minh Tuan Tran, Tetsuya Odajima, Taisuke Boku and Mitsuhisa Sato University of Tsukuba 1 Presentation Overview l Introduction
More informationSupport for GPUs with GPUDirect RDMA in MVAPICH2 SC 13 NVIDIA Booth
Support for GPUs with GPUDirect RDMA in MVAPICH2 SC 13 NVIDIA Booth by D.K. Panda The Ohio State University E-mail: panda@cse.ohio-state.edu http://www.cse.ohio-state.edu/~panda Outline Overview of MVAPICH2-GPU
More informationOverview. Operating Systems I. Simple Memory Management. Simple Memory Management. Multiprocessing w/fixed Partitions.
Overview Operating Systems I Management Provie Services processes files Manage Devices processor memory isk Simple Management One process in memory, using it all each program nees I/O rivers until 96 I/O
More informationOn Effectively Determining the Downlink-to-uplink Sub-frame Width Ratio for Mobile WiMAX Networks Using Spline Extrapolation
On Effectively Determining the Downlink-to-uplink Sub-frame With Ratio for Mobile WiMAX Networks Using Spline Extrapolation Panagiotis Sarigianniis, Member, IEEE, Member Malamati Louta, Member, IEEE, Member
More informationCPMD Performance Benchmark and Profiling. February 2014
CPMD Performance Benchmark and Profiling February 2014 Note The following research was performed under the HPC Advisory Council activities Special thanks for: HP, Mellanox For more information on the supporting
More informationA fast embedded selection approach for color texture classification using degraded LBP
A fast embee selection approach for color texture classification using egrae A. Porebski, N. Vanenbroucke an D. Hama Laboratoire LISIC - EA 4491 - Université u Littoral Côte Opale - 50, rue Ferinan Buisson
More informationAdjacency Matrix Based Full-Text Indexing Models
1000-9825/2002/13(10)1933-10 2002 Journal of Software Vol.13, No.10 Ajacency Matrix Base Full-Text Inexing Moels ZHOU Shui-geng 1, HU Yun-fa 2, GUAN Ji-hong 3 1 (Department of Computer Science an Engineering,
More informationfiltering LETTER An Improved Neighbor Selection Algorithm in Collaborative Taek-Hun KIM a), Student Member and Sung-Bong YANG b), Nonmember
107 IEICE TRANS INF & SYST, VOLE88 D, NO5 MAY 005 LETTER An Improve Neighbor Selection Algorithm in Collaborative Filtering Taek-Hun KIM a), Stuent Member an Sung-Bong YANG b), Nonmember SUMMARY Nowaays,
More informationHA-PACS Project Challenge for Next Step of Accelerating Computing
HA-PAS Project hallenge for Next Step of Accelerating omputing Taisuke Boku enter for omputational Sciences University of Tsukuba taisuke@cs.tsukuba.ac.jp 1 Outline of talk Introduction of S, U. Tsukuba
More informationETNA Kent State University
Electronic Transactions on Numerical Analysis. Volume 6, pp. 246-254, December 997. Copyright 997,. ISSN 068-963. ETNA etna@mcs.kent.eu FAST SOLUTION OF MSC/NASTRAN SPARSE MATRIX PROBLEMS USING A MULTILEVEL
More informationSecure Network Coding for Distributed Secret Sharing with Low Communication Cost
Secure Network Coing for Distribute Secret Sharing with Low Communication Cost Nihar B. Shah, K. V. Rashmi an Kannan Ramchanran, Fellow, IEEE Abstract Shamir s (n,k) threshol secret sharing is an important
More informationEDOVE: Energy and Depth Variance-Based Opportunistic Void Avoidance Scheme for Underwater Acoustic Sensor Networks
sensors Article EDOVE: Energy an Depth Variance-Base Opportunistic Voi Avoiance Scheme for Unerwater Acoustic Sensor Networks Safar Hussain Bouk 1, *, Sye Hassan Ahme 2, Kyung-Joon Park 1 an Yongsoon Eun
More informationAn Adaptive Routing Algorithm for Communication Networks using Back Pressure Technique
International OPEN ACCESS Journal Of Moern Engineering Research (IJMER) An Aaptive Routing Algorithm for Communication Networks using Back Pressure Technique Khasimpeera Mohamme 1, K. Kalpana 2 1 M. Tech
More informationOPEN MPI WITH RDMA SUPPORT AND CUDA. Rolf vandevaart, NVIDIA
OPEN MPI WITH RDMA SUPPORT AND CUDA Rolf vandevaart, NVIDIA OVERVIEW What is CUDA-aware History of CUDA-aware support in Open MPI GPU Direct RDMA support Tuning parameters Application example Future work
More informationChapter 9 Memory Management
Contents 1. Introuction 2. Computer-System Structures 3. Operating-System Structures 4. Processes 5. Threas 6. CPU Scheuling 7. Process Synchronization 8. Dealocks 9. Memory Management 10.Virtual Memory
More informationHandling missing values in kernel methods with application to microbiology data
an Machine Learning. Bruges (Belgium), 24-26 April 2013, i6oc.com publ., ISBN 978-2-87419-081-0. Available from http://www.i6oc.com/en/livre/?gcoi=28001100131010. Hanling missing values in kernel methos
More informationLecture 1 September 4, 2013
CS 84r: Incentives an Information in Networks Fall 013 Prof. Yaron Singer Lecture 1 September 4, 013 Scribe: Bo Waggoner 1 Overview In this course we will try to evelop a mathematical unerstaning for the
More informationUsing Vector and Raster-Based Techniques in Categorical Map Generalization
Thir ICA Workshop on Progress in Automate Map Generalization, Ottawa, 12-14 August 1999 1 Using Vector an Raster-Base Techniques in Categorical Map Generalization Beat Peter an Robert Weibel Department
More informationMessage Transport With The User Datagram Protocol
Message Transport With The User Datagram Protocol User Datagram Protocol (UDP) Use During startup For VoIP an some vieo applications Accounts for less than 10% of Internet traffic Blocke by some ISPs Computer
More informationRobust PIM-SM Multicasting using Anycast RP in Wireless Ad Hoc Networks
Robust PIM-SM Multicasting using Anycast RP in Wireless A Hoc Networks Jaewon Kang, John Sucec, Vikram Kaul, Sunil Samtani an Mariusz A. Fecko Applie Research, Telcoria Technologies One Telcoria Drive,
More informationOn the Placement of Internet Taps in Wireless Neighborhood Networks
1 On the Placement of Internet Taps in Wireless Neighborhoo Networks Lili Qiu, Ranveer Chanra, Kamal Jain, Mohamma Mahian Abstract Recently there has emerge a novel application of wireless technology that
More informationLearning convex bodies is hard
Learning convex boies is har Navin Goyal Microsoft Research Inia navingo@microsoftcom Luis Raemacher Georgia Tech lraemac@ccgatecheu Abstract We show that learning a convex boy in R, given ranom samples
More informationResources Current and Future Systems. Timothy H. Kaiser, Ph.D.
Resources Current and Future Systems Timothy H. Kaiser, Ph.D. tkaiser@mines.edu 1 Most likely talk to be out of date History of Top 500 Issues with building bigger machines Current and near future academic
More informationOptimized Non-contiguous MPI Datatype Communication for GPU Clusters: Design, Implementation and Evaluation with MVAPICH2
Optimized Non-contiguous MPI Datatype Communication for GPU Clusters: Design, Implementation and Evaluation with MVAPICH2 H. Wang, S. Potluri, M. Luo, A. K. Singh, X. Ouyang, S. Sur, D. K. Panda Network-Based
More informationBIJECTIONS FOR PLANAR MAPS WITH BOUNDARIES
BIJECTIONS FOR PLANAR MAPS WITH BOUNDARIES OLIVIER BERNARDI AND ÉRIC FUSY Abstract. We present bijections for planar maps with bounaries. In particular, we obtain bijections for triangulations an quarangulations
More informationImproving Spatial Reuse of IEEE Based Ad Hoc Networks
mproving Spatial Reuse of EEE 82.11 Base A Hoc Networks Fengji Ye, Su Yi an Biplab Sikar ECSE Department, Rensselaer Polytechnic nstitute Troy, NY 1218 Abstract n this paper, we evaluate an suggest methos
More informationOnline Appendix to: Generalizing Database Forensics
Online Appenix to: Generalizing Database Forensics KYRIACOS E. PAVLOU an RICHARD T. SNODGRASS, University of Arizona This appenix presents a step-by-step iscussion of the forensic analysis protocol that
More information1 Surprises in high dimensions
1 Surprises in high imensions Our intuition about space is base on two an three imensions an can often be misleaing in high imensions. It is instructive to analyze the shape an properties of some basic
More informationAdvanced method of NC programming for 5-axis machining
Available online at www.scienceirect.com Proceia CIRP (0 ) 0 07 5 th CIRP Conference on High Performance Cutting 0 Avance metho of NC programming for 5-axis machining Sergej N. Grigoriev a, A.A. Kutin
More informationOverview : Computer Networking. IEEE MAC Protocol: CSMA/CA Internet mobility TCP over noisy links
Overview 15-441 15-441: Computer Networking 15-641 Lecture 24: Wireless Eric Anerson Fall 2014 www.cs.cmu.eu/~prs/15-441-f14 Internet mobility TCP over noisy links Link layer challenges an WiFi Cellular
More informationEnabling Efficient Use of UPC and OpenSHMEM PGAS models on GPU Clusters
Enabling Efficient Use of UPC and OpenSHMEM PGAS models on GPU Clusters Presentation at GTC 2014 by Dhabaleswar K. (DK) Panda The Ohio State University E-mail: panda@cse.ohio-state.edu http://www.cse.ohio-state.edu/~panda
More informationBackpressure-based Packet-by-Packet Adaptive Routing in Communication Networks
1 Backpressure-base Packet-by-Packet Aaptive Routing in Communication Networks Eleftheria Athanasopoulou, Loc Bui, Tianxiong Ji, R. Srikant, an Alexaner Stoylar arxiv:15.4984v1 [cs.ni] 27 May 21 Abstract
More informationPDE-BASED INTERPOLATION METHOD FOR OPTICALLY VISUALIZED SOUND FIELD. Kohei Yatabe and Yasuhiro Oikawa
4 IEEE International Conference on Acoustic, Speech an Signal Processing (ICASSP) PDE-BASED INTERPOLATION METHOD FOR OPTICALLY VISUALIZED SOUND FIELD Kohei Yatabe an Yasuhiro Oikawa Department of Intermeia
More information