Enabling GPU Virtualization in Cloud Environments

Sergio Iserte, Francisco J. Clemente-Castelló, Adrián Castelló, Rafael Mayo and Enrique S. Quintana-Ortí
Department of Computer Science and Engineering, Universitat Jaume I, Castelló de la Plana, Spain

Keywords: Cloud Computing, GPU Virtualization, Amazon Web Services (AWS), OpenStack, Resource Management.

Abstract: The use of accelerators, such as graphics processing units (GPUs), to reduce the execution time of compute-intensive applications has become popular during the past few years. These devices increase the computational power of a node thanks to their parallel architecture. This trend has led cloud service providers, such as Amazon, and middlewares, such as OpenStack, to add virtual machines (VMs) including GPUs to their catalogs of instances. To fulfill these needs, the guest hosts must be equipped with GPUs which, unfortunately, will be barely utilized if a non GPU-enabled VM is running in the host. The solution presented in this work is based on GPU virtualization and shareability in order to reach an equilibrium between the service supply and the applications' demand for accelerators. Concretely, we propose to decouple real GPUs from the nodes by using the virtualization technology rCUDA. With this software configuration, GPUs can be accessed from any VM, avoiding the need to place a physical GPU in each guest host. Moreover, we study the viability of this approach using a public cloud service configuration, and we develop a module for OpenStack in order to add support for the virtualized devices and the logic to manage them. The results demonstrate that this is a viable configuration which adds flexibility to current and well-known cloud solutions.

1 INTRODUCTION

Nowadays, many cloud vendors have started offering virtual machines (VMs) with graphics processing units (GPUs) in order to provide GPGPU (general-purpose GPU) computation services. A few relevant examples include Amazon Web Services (AWS), Penguin Computing, Softlayer and Microsoft Azure. In the public scope, one of the most popular cloud vendors is AWS, which offers a wide range of preconfigured instances ready to be launched. Alternatively, owning the proper infrastructure, a private cloud can be deployed using a specific middleware such as OpenStack or OpenNebula. Unfortunately, sharing GPU resources among multiple VMs in cloud environments is more complex than in physical servers. On the one hand, instances in public clouds are not easily customizable. On the other, although the instances in a private cloud can be customized in many aspects, when it comes to GPUs the number of options is reduced. As a result, neither public vendors nor private cloud middlewares currently offer flexible GPGPU services.

Remote virtualization has been recently proposed to deal with the low-usage problem. Some relevant examples include rCUDA (Peña, 2013), DS-CUDA (Kawai et al., 2012), gVirtuS (Giunta et al., 2010), vCUDA (Shi et al., 2012), VOCL (Xiao et al., 2012), and SnuCL (Kim et al., 2012). Roughly speaking, these virtualization frameworks enable cluster configurations with fewer GPUs than nodes. The goal is that GPU-equipped nodes act as GPGPU servers, yielding a GPU-sharing solution that potentially achieves a higher overall utilization of the accelerators in the system.

The main goals of this work are to study current cloud solutions in an HPC GPU-enabled scenario, and to analyze and improve them by adding flexibility via GPU virtualization. In order to reach this goal, we select rCUDA, a virtualization tool that is possibly the most complete and up-to-date for NVIDIA GPUs.

The rest of the paper is structured as follows.
In Section 2 we introduce the technologies used in this work; Section 3 summarizes related work; the effort to use AWS is explained in Section 4, while the work with OpenStack is described in Section 5; finally, Section 6 summarizes the advances and Section 7 outlines the next steps of this research.

2 BACKGROUND

2.1 The rCUDA Framework

rCUDA (Peña et al., 2014) is a middleware that enables transparent access to any NVIDIA GPU device present in a cluster from all compute nodes. The GPUs can also be shared among nodes, and a single node can use all the graphics accelerators as if they were local. rCUDA is structured following a client-server distributed architecture and its client exposes the same interface as the regular NVIDIA CUDA 6.5 release (NVIDIA Corp.). With this middleware, applications are not aware that they are executed on top of a virtualization layer. To deal with new GPU programming models, rCUDA has been recently extended to accommodate directive-based models such as OmpSs (Castelló et al., 2015a) and OpenACC (Castelló et al., 2015b). The integration of remote GPGPU virtualization with global resource schedulers such as SLURM (Iserte et al., 2014) completes this appealing technology, making accelerator-enabled clusters more flexible and energy-efficient (Castelló et al., 2014).
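To make the client-server usage concrete, the sketch below shows how a GPU-less node might be pointed at two remote GPUs before launching an unmodified CUDA binary. This is a minimal illustration, not rCUDA's documented deployment procedure: the environment variable names (RCUDA_DEVICE_COUNT, RCUDA_DEVICE_i) follow the conventions of the rCUDA user guide, while the server addresses, library path, and application name are placeholders.

```python
import os
import subprocess

# Addresses of the GPU-equipped rCUDA servers (placeholders).
servers = ["10.0.0.1", "10.0.0.2"]

env = dict(os.environ)
# Number of remote GPUs the application will see as local devices.
env["RCUDA_DEVICE_COUNT"] = str(len(servers))
# Map each virtual device to a "server[:gpu]" endpoint on a remote node.
for i, host in enumerate(servers):
    env["RCUDA_DEVICE_%d" % i] = "%s:0" % host

# Point the loader at the rCUDA client library instead of the regular
# CUDA runtime, so the unmodified binary is transparently redirected.
env["LD_LIBRARY_PATH"] = "/opt/rCUDA/lib:" + env.get("LD_LIBRARY_PATH", "")

subprocess.run(["./my_cuda_app"], env=env, check=True)
```

With this configuration the application calls the usual CUDA API; the rCUDA client forwards each call over the network to the server daemons running on the GPU nodes.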
2.2 Amazon Web Services

AWS (Amazon Web Services, 2015) is a public cloud computing provider composed of several services, such as cloud-based computation, storage and other functionality, that enables organizations and/or individuals to deploy services and applications on demand. These services replace company-owned local IT infrastructure and provide agility and instant elasticity, matching enterprise software requirements. From the point of view of HPC, AWS offers high performance facilities via instances equipped with GPUs and high performance network interconnections.

2.3 OpenStack

OpenStack (OpenStack Foundation, 2015) is a cloud operating system (OS) that provides Infrastructure as a Service (IaaS). OpenStack controls large pools of compute, storage, and networking resources throughout a datacenter. All these resources are managed through a dashboard or an API that gives administrators control while empowering their users to provision resources through a web interface or a command-line interface. OpenStack supports most recent hypervisors and handles provisioning and life-cycle management of VMs. The OpenStack architecture offers flexibility to create a custom cloud, with no proprietary hardware or software requirements, and the ability to integrate with legacy systems and third party technologies. From the HPC perspective, OpenStack offers high performance virtual machine configurations with different hardware architectures. Even though in OpenStack it is possible to work with GPUs, the Nova project does not support this architecture yet.

3 STATE-OF-THE-ART

Our solution to the deficiencies exposed in the previous section relies on GPU virtualization, sharing resources in order to attain a fair balance between supply and demand. While several efforts with the same goal have been initiated in the past, as exposed next, none of them is as ambitious as ours. The work in (Younge et al., 2013) allows the VMs managed by the Xen hypervisor to access the GPUs in a physical node, but with this solution a node cannot use more GPUs than those locally hosted, and an idle GPU cannot be shared with other nodes. The solution presented by gVirtuS (Giunta et al., 2010) virtualizes GPUs and makes them accessible for any VM in the cluster. However, this kind of virtualization strongly depends on the hypervisor, and so does its performance. A similar solution is presented in gCloud (Diab et al., 2013). While this solution is not yet integrated in a cloud computing manager, its main drawback is that the application's code must be modified in order to run in the virtual-GPU environment. A runtime component to provide abstraction and sharing of GPUs is presented in (Becchi et al., 2012), which allows scheduling policies to isolate and share GPUs in a cluster for a set of applications. The work introduced in (Jun et al., 2014) is more mature; however, it is only focused on compute-intensive HPC applications. Our proposal goes further, not only bringing solutions for all kinds of HPC applications, but also aiming to boost flexibility in the use of GPUs.

4 GPGPU SHARE SERVICE WITH AWS

4.1 Current Features

4.1.1 Instances

An instance is a pre-configured VM focused on a specific target. Among the large list of instances offered by AWS, we can find specialized versions for general purpose (T2, M4 and M3); compute (C4 and C3); memory (R3); storage (I2 and D2); and GPU-capable (G2) workloads. Each type of instance has its own purpose and cost (price). Moreover, each type offers a different number of CPUs as well as a network interconnection, which can be: low, medium, high or 10 Gb. For our study, we worked in the AWS availability zone US EAST (N. VIRGINIA). The instances available in that case present the features reported in Table 1.

Table 1: HPC instances available in US EAST (N. VIRGINIA) in June.

  Name        vCPUs  Memory  Network  GPUs  Price
  c3.2xlarge    8    15 GB   High     0     $0.42
  c3.8xlarge   32    60 GB   10 Gb    0     $1.68
  g2.2xlarge    8    15 GB   High     1     $0.65
  g2.8xlarge   32    60 GB   10 Gb    4     $2.60

For the following experiments, we select C3 family instances, which are not equipped with GPUs, as clients, whereas instances of the G2 family will act as GPU-enabled servers.

4.1.2 Networking

Table 1 shows that each instance integrates a different network. As the bandwidth is critical when GPU virtualization is applied, we first perform a simple test to verify the real network bandwidth. To evaluate the actual bandwidth, we executed the IPERF tool between the instances described in Table 1, with the results shown in Table 2.

Table 2: IPERF results between selected instances.

  Server      Client      Network  Bandwidth
  g2.8xlarge  c3.2xlarge  High     1 Gb/s
  g2.8xlarge  c3.8xlarge  10 Gb    7.5 Gb/s
  g2.8xlarge  g2.2xlarge  High     1 Gb/s
  g2.8xlarge  g2.8xlarge  10 Gb    7.5 Gb/s

From this experiment, we can derive that the High network corresponds to a 1 Gb interconnect, while the 10 Gb network has a real bandwidth of 7.5 Gb/s. Moreover, it seems that the bandwidth of the instances equipped with a High interconnection network is constrained by software to 1 Gb/s, since the theoretical and real bandwidths match perfectly. The real gap between sustained and theoretical bandwidth can be observed with the 10 Gb interconnection, which reaches up to 7.5 Gb/s.

4.1.3 GPUs

An instance relies on a VM that runs on a real node with its own virtualized components. Therefore AWS can leverage a virtualization framework to offer GPU services to all the instances. Although the nvidia-smi command indicates that the GPUs installed are NVIDIA GRID K520, we need to verify that these are non-virtualized devices. For this purpose, we execute the NVIDIA SDK bandwidthtest. As shown in Table 3, the bandwidth achieved in this test is higher than the network bandwidth, which suggests that the accelerator is an actual local GPU.

Table 3: Results of bandwidthtest transferring 32 MB using pageable memory in a local GPU.

  Name        Data Movement   Bandwidth
  g2.2xlarge  Host to Device  3,004 MB/s
  g2.2xlarge  Device to Host  2,809 MB/s
  g2.8xlarge  Host to Device  2,814 MB/s
  g2.8xlarge  Device to Host  3,182 MB/s

4.2 Testbed Scenarios

All scenarios are based on the maximum number of instances that a user can freely select without submitting a formal request. In particular, the maximum number for g2.2xlarge is 5, while for g2.8xlarge it is 2. All the instances run a 64-bit RHEL OS and version 6.5 of CUDA. We design three configuration scenarios for our tests:

Scenario A (Figure 1(a)) shows a common configuration in GPU-accelerated clusters, with each node populated with a single GPU. Here, a node can access 5 GPUs using the High network.

Scenario B (Figure 1(b)) is composed of 2 server nodes, equipped with 4 GPUs each, and a GPU-less client. This scenario includes a 10 Gb network, and the client can execute the application using up to 8 GPUs.

Scenario C (Figure 1(c)) combines scenarios A and B. A single client, using a 1 Gb network interconnection, can leverage 13 GPUs as if they were local.

Figure 1: Configurations where an AWS instance, acting as a client, employs several remote GPUs: (a) 5 remote GPUs over 1 GbE; (b) 8 remote GPUs over 10 GbE; (c) 13 remote GPUs over 1 GbE.
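A link check such as the IPERF measurement of Section 4.1.2 can be scripted so that each client-server pair of a scenario is validated before launching applications. The sketch below is a minimal example, assuming iperf is installed on both endpoints, 'iperf -s' is already running on the server instance, and the host name is a placeholder.

```python
import subprocess

# Address of the GPU server instance (placeholder); the server side
# must already be running "iperf -s".
server = "gpu-server.example.com"

# 10-second TCP bandwidth test from this client, reported in Mbits/s.
result = subprocess.run(
    ["iperf", "-c", server, "-t", "10", "-f", "m"],
    capture_output=True, text=True, check=True)
print(result.stdout)
```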

Once the scenarios are configured from the point of view of the hardware, the rCUDA middleware needs to be installed in order to add the required flexibility to the system. The rCUDA server is executed in the server nodes and the rCUDA libraries are invoked from the node that acts as the client. In order to evaluate the network bandwidth using a remote GPU, we re-applied the NVIDIA SDK bandwidthtest. Table 4 exposes that the bandwidth is limited by the network.

Table 4: Results of bandwidthtest transferring 32 MB using pageable memory in a remote GPU using rCUDA.

  Scenario  Data Movement   Network  Bandwidth
  A         Host-to-Device  High     127 MB/s
  A         Device-to-Host  High     126 MB/s
  B         Host-to-Device  10 Gb    858 MB/s
  B         Device-to-Host  10 Gb    843 MB/s
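These figures are consistent with the link capacities measured in Section 4.1.2: converting the sustained link speed from Gb/s to MB/s shows that the rCUDA transfers saturate the interconnect. A small sanity check of this arithmetic, with the values hard-coded from Tables 2 and 4:

```python
# Raw link capacity vs. bandwidthtest over rCUDA (Tables 2 and 4).
links_gbps = {"High": 1.0, "10 Gb": 7.5}         # sustained link speed
measured_mbs = {"High": 127.0, "10 Gb": 858.0}   # host-to-device, Table 4

for name, gbps in links_gbps.items():
    capacity_mbs = gbps * 1000 / 8               # Gb/s -> MB/s
    ratio = measured_mbs[name] / capacity_mbs
    print("%s link: ~%.0f MB/s capacity, %.0f MB/s measured (%.0f%%)"
          % (name, capacity_mbs, measured_mbs[name], 100 * ratio))
```

The High link yields roughly 125 MB/s of capacity against 127 MB/s measured (the small excess is unit rounding), while the 10 Gb link yields about 937 MB/s against 858 MB/s; in both cases the network, not the GPU, is the limiting factor.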
4.3 Experimental Results

The first application is MonteCarloMultiGPU, from the NVIDIA SDK, a code that is compute bound (its execution barely involves memory operations). This was launched with the default configuration, scaling=weak, which adjusts the size of the problem depending on the number of accelerators. Figure 2 depicts the options per second calculated by the application running on the scenarios in Figure 1 as well as using local GPUs. For clarity, we have removed the results observed for Scenario B as they are exactly the same as those obtained from Scenario C with up to 8 GPUs.

Figure 2: NVIDIA SDK MonteCarloMultiGPU execution using local (CUDA) and remote (Scenarios A and C) GPUs.

In this particular case, rCUDA (remote GPUs) outperforms CUDA (local GPUs) because the former loads the libraries when the daemon is started (Peña, 2013). With rCUDA we can observe differences in the results between both scenarios. Here, Scenario A can increase the throughput because the GPUs do not share the PCI bus with other devices, as each node is equipped with only one GPU. On the other hand, when the 4-GPU instances (g2.8xlarge) are added (Scenario C), the PCI bus constrains the communication bandwidth, hurting the scalability.

The second application, LAMMPS, is a classical molecular dynamics simulator that can be applied at the atomic, meso, or continuum scale. From the implementation perspective, this multi-process application employs at least one GPU to host its processes, but can benefit from the presence of multiple GPUs. Figure 3(a) shows that, for this application, the use of remote GPUs does not offer any advantage over the original CUDA. Furthermore, for the execution on remote GPUs, the difference between both networks is small, although the results observed with the High network are worse than those obtained with the 10 Gb network. In the execution of LAMMPS on a larger problem (see Figure 3(b)), CUDA still performs better, but the interesting point is the execution time when using remote GPUs. These times are almost the same even with different networks, which indicates that the transfers turn the interconnection network into a bottleneck. For this type of application, enlarging the problem size compensates the negative effect of a slower network.

Figure 3: Execution time of LAMMPS: (a) run size = 10; (b) run size = 200.

4.4 Discussion

The previous experiments reveal that the AWS GPU-instances are not appropriate for HPC because neither the network nor the accelerators are powerful enough to deliver high performance when running compute-intensive parallel applications. As (Peña, 2013) demonstrates, network and device types are critical factors for performance. In other words, AWS is more oriented toward offering a general-purpose service than to providing high performance. Also, AWS fails in offering flexibility, as it forces the user to choose between a basic instance with a GPU and a powerful instance with 4 GPUs. Table 1 shows that the resources of the g2.8xlarge are quadrupled with respect to the g2.2xlarge, but so is the cost per hour. Therefore, in the case of having other necessities (instance types), using GPU virtualization technology we could in principle attach an accelerator to any type of instance. Furthermore, reducing the budget spent on cloud services is possible by customizing the resources of the available instances. For example, we can work on an instance with 2 GPUs for only $1.30 by launching 2 g2.2xlarge instances and using remote GPUs, avoiding paying double for features that we do not need in g2.8xlarge. In terms of GPU-shareability, AWS reserves GPU-capable physical machines which will be waiting for a GPU-instance request. Investing in expensive accelerators to keep them in a standby state is counter-productive. It makes more sense to dedicate fewer devices, accessible from any machine, resulting in a higher utilization rate.
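As a back-of-the-envelope illustration of the budget argument above, the snippet below compares a few instance combinations using the hourly prices of Table 1; the combinations with remote GPUs assume rCUDA is used as described in Section 4.2.

```python
# Hourly prices from Table 1.
price = {"c3.2xlarge": 0.42, "g2.2xlarge": 0.65, "g2.8xlarge": 2.60}

# Two remote GPUs assembled from two cheap GPU instances via rCUDA.
print("2 GPUs, 2x g2.2xlarge + rCUDA: $%.2f/h" % (2 * price["g2.2xlarge"]))
# The only stock alternative with more than one GPU.
print("4 GPUs, 1x g2.8xlarge:         $%.2f/h" % price["g2.8xlarge"])
# A GPU-less client borrowing a single remote accelerator.
print("c3.2xlarge + 1 remote GPU:     $%.2f/h"
      % (price["c3.2xlarge"] + price["g2.2xlarge"]))
```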

5 GPGPU SHARED SERVICE IN OPENSTACK

5.1 Managing Remote GPUs with OpenStack

The main idea is to evolve from the original OpenStack architecture (see Figure 4(a)) to a solution where a new shared service, responsible for the GPU accelerators, becomes integrated into the architecture (see Figure 4(b)). This new service brings more flexibility when managing GPUs and new working modes for GPGPU computation in the cloud. As illustrated in Figure 5, we alter the original OpenStack Dashboard with a new parser, which splits the HTTP query in order to make use of both the GPGPU API for GPU-related operations, and the Nova API for the rest of the computations. The new GPGPU Service grants access to GPUs in a VM, but also allows the creation of GPU pools, consisting of a set of independent GPUs (logically detached from the nodes) that can be assigned to one or more VMs. Thanks to the modular approach of the GPGPU Service, the Nova project does not need to be modified, and the tool can be easily ported to other cloud computing solutions.

Figure 4: OpenStack architecture: (a) version Icehouse; (b) with the GPGPU module.

Figure 5: Internal communication among modules.
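The paper describes the parser only at the architectural level; a minimal sketch of the routing idea is shown below. The field names and the two API objects are hypothetical illustrations, not actual OpenStack or GPGPU Service interfaces.

```python
# Illustrative sketch of the Dashboard parser of Section 5.1; the field
# names and the two API objects are hypothetical, not OpenStack APIs.
GPU_KEYS = {"gpu_pool", "gpu_count", "gpu_mode", "gpu_scope"}

def split_query(form):
    """Separate the GPU-related fields from the regular Nova fields."""
    gpgpu = {k: v for k, v in form.items() if k in GPU_KEYS}
    nova = {k: v for k, v in form.items() if k not in GPU_KEYS}
    return nova, gpgpu

def launch_instance(form, nova_api, gpgpu_api):
    nova, gpgpu = split_query(form)
    instance = nova_api.create_server(nova)   # unchanged Nova path
    if gpgpu:                                 # new GPGPU Service path
        gpgpu_api.attach(instance, gpgpu)
    return instance
```

Keeping the split at the Dashboard level is what lets the Nova project stay unmodified: Nova never sees the GPU fields.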

5.2 Working Modes

The developed module allows users to set up any of the scenarios displayed in Figure 6. The users are given two configuration options to decide whether a physical GPU will be completely reserved for an instance or the instance will address a partition of the GPU as if it were a real device. We refer to this as the mode, with possible values being: exclusive or shared. Let us assume there are 4 GPU devices in the cluster (independently of where they are hosted). An example of these scenarios is shown in Figure 6. There, in the exclusive mode the instance monopolizes all the GPUs, while in the shared mode the GPUs are partitioned. As a result of sharing the GPU memory, the instance will be able to work with up to 8 GPUs, provided that each partition can be addressed as an independent GPU. Moreover, the users are also responsible for deciding whether a GPU (or a pool) will be assigned to other instances. This behavior is referred to as the scope, and it determines that a group of instances is logically connected to a pool of GPUs. Working with the public scope (bottom row of Figure 6) implies that the GPUs of a pool can be used simultaneously by all the instances linked to it. Again, the GPU pool can be composed of exclusive or shared GPUs.

Figure 6: Examples of working modes.

5.3 User Interface

In order to deal with the new features, several modifications have been planned in the OpenStack Dashboard; they have not been implemented yet, though. First of all, the Instance Launch Panel should be extended with a new field, where the user could assign an existing GPU pool, create a new one, or keep the instance without accelerators of this kind. When the option New GPU Pool is chosen, fields for the pool configuration would appear (see Figure 7). Furthermore, a new panel with the existent GPUs displays all the information related to the GPUs (see Figure 8).

Figure 7: Launching instances and assigning GPUs.

Figure 8: GPU information panel.
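A compact way to picture the mode and scope semantics is the data-model sketch below; all class and attribute names are hypothetical. An exclusive pool holds whole physical GPUs, a shared pool holds memory partitions that are addressed as independent GPUs, and the scope controls how many instances may be linked to the pool.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class GPU:
    ident: int
    memory_mb: int    # a partition of a shared GPU is addressed as a GPU

@dataclass
class GPUPool:
    mode: str                       # "exclusive" or "shared"
    scope: str                      # "private" or "public"
    gpus: List[GPU] = field(default_factory=list)
    instances: List[str] = field(default_factory=list)

    def link(self, instance_id: str):
        # A private pool is logically connected to a single instance;
        # a public pool can serve several instances simultaneously.
        if self.scope == "private" and self.instances:
            raise ValueError("private pools serve one instance only")
        self.instances.append(instance_id)

# Example: 4 physical GPUs halved into 8 shared partitions of 2048 MB.
pool = GPUPool(mode="shared", scope="public",
               gpus=[GPU(i, 2048) for i in range(8)])
pool.link("vm-0001")
pool.link("vm-0002")
```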

5.4 Experimental Results

The following tests were executed on 2 sets of nodes connected via a 1 Gb Ethernet network. All the nodes comprised an Intel Xeon E7420 quad-core processor, at 2.13 GHz, and 16 GB of DDR2 RAM at 667 MHz. The first set, in charge of providing the cloud environment, consisted of 3 nodes. To deploy the IaaS, we used the OpenStack Icehouse version, and QEMU/KVM 0.12.1 as the hypervisor. A fully-featured OpenStack deployment requires at least three nodes: a controller manages the compute nodes where the VMs are hosted; a network node manages the logical virtual network for the VMs; and one or more compute nodes run the hypervisor and the VMs. The second set, composed of 4 nodes, were auxiliary servers with a Tesla C1060 GPU each. The OS was CentOS 6.6; the GPUs used CUDA 6.5; and rCUDA v5.0 acted as the GPU virtualization framework.

We have designed 6 different set-ups which can be divided into 2 groups: exclusive and shared GPUs. The exclusive mode provides, at most, 4 accelerators. The number of available GPUs in shared mode will depend on the partition size. In this case, we halved the GPU memory, resulting in 8 partitions that can be addressed as independent GPUs. For each group, we deployed virtual clusters of 1, 2 and 3 nodes, where the application processes were executed. The instances were launched with the OpenStack predefined flavor m1.medium, which determines a configuration of VMs consisting of 2 cores and 4 GB of memory.

MCUDA-MEME (Liu et al., 2010) was the application selected to test the set-ups. This is an MPI software, where each process must have access to a GPU. Therefore, the number of GPUs determines the maximum number of processes we can launch. Figure 9 compares the execution time of the application with different configurations over the different set-ups. We used up to 3 nodes to spread the processes and launched up to 8 processes (only 4 in exclusive mode), one per remote GPU. We can first observe that the performance is higher with more than one node, because the network traffic is distributed among the nodes. In addition, the shortest execution time is obtained by both modes (exclusive and shared) when running their maximum number of processes with more than one node. This seems to imply that it is not worth scaling up the number of resources further, because the performance growth rate is barely increasing. Although the time is lower when the GPUs are shared, the set-up cannot take advantage of an increase in the number of devices.

Figure 9: Scalability results of MCUDA-MEME with a different number of MPI processes.
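Since each MCUDA-MEME process must own a GPU, a launch has to match the number of MPI ranks to the number of (possibly partitioned) remote devices. The sketch below illustrates this one-rank-per-GPU mapping under the shared set-up; the rCUDA variable names follow the user-guide convention, and the host names, hostfile, binary and input file are placeholders.

```python
import os
import subprocess

# 8 ranks in shared mode (two 2048 MB partitions per Tesla C1060),
# or 4 in exclusive mode; one MPI process per remote GPU.
num_gpus = 8

env = dict(os.environ)
env["RCUDA_DEVICE_COUNT"] = str(num_gpus)
for i in range(num_gpus):
    # Two partitions hosted by each of the 4 auxiliary GPU servers;
    # host names are placeholders.
    env["RCUDA_DEVICE_%d" % i] = "gpu-server-%d" % (i // 2)

# Depending on the MPI implementation, the RCUDA_* variables may need
# explicit forwarding to the ranks (e.g. OpenMPI's -x flag).
subprocess.run(
    ["mpirun", "-np", str(num_gpus), "-hostfile", "hosts",
     "./mcuda-meme", "dataset.fasta"],
    env=env, check=True)
```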
5.5 Discussion

The network interconnect restricts the performance of our executions. The analysis in (Peña, 2013) reveals that improving the network infrastructure can make a big difference for GPU virtualization. The most remarkable achievement is the wide range of possible configurations and the flexibility to adapt a system to fit the user requirements. In addition, with this virtualization technology, the requests for GPU devices can be fulfilled with a small investment in infrastructure and maintenance. Energy can be saved not only thanks to the remote access and the ability to emulate several GPUs using only a few real ones, but also by consolidating the accelerators in a single machine (when possible), or turning down nodes when their GPUs are idle.

6 CONCLUSIONS

We have presented a complete study of the possibilities offered by AWS when it comes to GPUs. The constraints imposed by this service motivated us to deploy our own private cloud, in order to gain flexibility when dealing with these accelerators. For this purpose, we have introduced an extension of OpenStack which can be easily exploited to create GPU-instances as well as manage the physical GPUs to better profit. As we expected, due to the limited bandwidth of the interconnects used in the experimentation, the performance observed for the GPU-virtualized scenarios in the tests was quite low. On the other hand, we have created new operation modes that open interesting new ways to leverage GPUs in situations where having access to a GPU is more important than having a powerful GPU to boost the performance.

7 FUTURE WORK

The first item in the list of pending work is an upgrade of the network to an interconnect that is more prone to HPC. In particular, porting the setup and the tests to an infrastructure with an InfiniBand network will shed light on the viability of this kind of solution. Similar reasons motivate us to try other cloud vendors with better support for HPC. Looking for situations where performance is less important than flexibility will drive us to explore alternative tools to easily deploy GPU-programming computer labs. Finally, an interesting future work is to design new strategies in order to decide where a remote GPU is created and to which physical device it is assigned. Concretely, innovating on scheduling policies can enhance the flexibility offered by the GPGPU module for OpenStack.

ACKNOWLEDGEMENTS

The authors would like to thank the IT members of the department, Gustavo Edo and Vicente Roca, for their help. This research was supported by the Universitat Jaume I research project (P11B); and project TIN R from MINECO and FEDER. The initial version of rCUDA was jointly developed by Universitat Politècnica de València (UPV) and Universitat Jaume I de Castellón until year 2010. This initial development was later split into two branches. Part of the UPV version was used in this paper and it was supported by Generalitat Valenciana under Grants PROMETEO 2008/060 and Prometeo II 2013/009.

REFERENCES

Amazon Web Services (2015). Amazon Web Services.

Becchi, M., Sajjapongse, K., Graves, I., Procter, A., Ravi, V., and Chakradhar, S. (2012). A virtual memory based runtime to support multi-tenancy in clusters with GPUs. In 21st Int. Symp. on High-Performance Parallel and Distributed Computing.

Castelló, A., Duato, J., Mayo, R., Peña, A. J., Quintana-Ortí, E. S., Roca, V., and Silla, F. (2014). On the use of remote GPUs and low-power processors for the acceleration of scientific applications. In The Fourth Int. Conf. on Smart Grids, Green Communications and IT Energy-aware Technologies, pages 57-62, France.

Castelló, A., Mayo, R., Planas, J., and Quintana-Ortí, E. S. (2015a). Exploiting task-parallelism on GPU clusters via OmpSs and rCUDA virtualization. In IEEE Int. Workshop on Reengineering for Parallelism in Heterogeneous Parallel Platforms, Helsinki (Finland).

Castelló, A., Peña, A. J., Mayo, R., Balaji, P., and Quintana-Ortí, E. S. (2015b). Exploring the suitability of remote GPGPU virtualization for the OpenACC programming model using rCUDA. In IEEE Int. Conference on Cluster Computing, Chicago, IL (USA).

Diab, K. M., Rafique, M. M., and Hefeeda, M. (2013). Dynamic sharing of GPUs in cloud systems. In Parallel and Distributed Processing Symp. Workshops & PhD Forum, 2013 IEEE 27th International.

Giunta, G., Montella, R., Agrillo, G., and Coviello, G. (2010). A GPGPU transparent virtualization component for high performance computing clouds. In Euro-Par, Parallel Processing. Springer.

Iserte, S., Castelló, A., Mayo, R., Quintana-Ortí, E. S., Reaño, C., Prades, J., Silla, F., and Duato, J. (2014). SLURM support for remote GPU virtualization: Implementation and performance study. In Int. Symposium on Computer Architecture and High Performance Computing, Paris, France.

Jun, T. J., Van Quoc Dung, M. H. Y., Kim, D., Cho, H., and Hahm, J. (2014). GPGPU enabled HPC cloud platform based on OpenStack.

Kawai, A., Yasuoka, K., Yoshikawa, K., and Narumi, T. (2012). Distributed-shared CUDA: Virtualization of large-scale GPU systems for programmability and reliability. In The Fourth Int. Conf. on Future Computational Technologies and Applications.
Kim, J., Seo, S., Lee, J., Nah, J., Jo, G., and Lee, J. (2012). SnuCL: an OpenCL framework for heterogeneous CPU/GPU clusters. In Int. Conf. on Supercomputing (ICS).

Liu, Y., Schmidt, B., Liu, W., and Maskell, D. L. (2010). CUDA-MEME: Accelerating motif discovery in biological sequences using CUDA-enabled graphics processing units. Pattern Recognition Letters, 31(14).

NVIDIA Corp. CUDA API Reference Manual Version 6.5.

OpenStack Foundation (2015). OpenStack.

Peña, A. J. (2013). Virtualization of accelerators in high performance clusters. PhD thesis, Universitat Jaume I, Castellón, Spain.

Peña, A. J., Reaño, C., Silla, F., Mayo, R., Quintana-Ortí, E. S., and Duato, J. (2014). A complete and efficient CUDA-sharing solution for HPC clusters. Parallel Computing, 40(10).

Shi, L., Chen, H., Sun, J., and Li, K. (2012). vCUDA: GPU-accelerated high-performance computing in virtual machines. IEEE Trans. on Comput., 61(6).

Xiao, S., Balaji, P., Zhu, Q., Thakur, R., Coghlan, S., Lin, H., Wen, G., Hong, J., and Feng, W. (2012). VOCL: An optimized environment for transparent virtualization of graphics processing units. In Innovative Parallel Computing. IEEE.

Younge, A. J., Walters, J. P., Crago, S., and Fox, G. C. (2013). Enabling high performance computing in cloud infrastructure using virtualized GPUs.


More information

Towards Autonomous Service Composition in A Grid Environment

Towards Autonomous Service Composition in A Grid Environment Towards Autonomous Servce Composton n A Grd Envronment Wllam K. Cheung +, Jmng Lu +, Kevn H. Tsang +, Raymond K. Wong ++ Department of Computer Scence + Hong Kong Baptst Unversty Hong Kong {wllam,jmng,hhtsang}@comp.hkbu.edu.hk

More information

Cluster Analysis of Electrical Behavior

Cluster Analysis of Electrical Behavior Journal of Computer and Communcatons, 205, 3, 88-93 Publshed Onlne May 205 n ScRes. http://www.scrp.org/ournal/cc http://dx.do.org/0.4236/cc.205.350 Cluster Analyss of Electrcal Behavor Ln Lu Ln Lu, School

More information

mquest Quickstart Version 11.0

mquest Quickstart Version 11.0 mquest Quckstart Verson 11.0 cluetec GmbH Emmy-Noether-Straße 17 76131 Karlsruhe Germany www.cluetec.de www.mquest.nfo cluetec GmbH Karlsruhe, 2016 Document verson 5 27.04.2016 16:59 > Propretary notce

More information

Wishing you all a Total Quality New Year!

Wishing you all a Total Quality New Year! Total Qualty Management and Sx Sgma Post Graduate Program 214-15 Sesson 4 Vnay Kumar Kalakband Assstant Professor Operatons & Systems Area 1 Wshng you all a Total Qualty New Year! Hope you acheve Sx sgma

More information

Problem Definitions and Evaluation Criteria for Computational Expensive Optimization

Problem Definitions and Evaluation Criteria for Computational Expensive Optimization Problem efntons and Evaluaton Crtera for Computatonal Expensve Optmzaton B. Lu 1, Q. Chen and Q. Zhang 3, J. J. Lang 4, P. N. Suganthan, B. Y. Qu 6 1 epartment of Computng, Glyndwr Unversty, UK Faclty

More information

QOS AWARE HW/SW PARTITIONING ON RUN-TIME RECONFIGURABLE MULTIMEDIA PLATFORMS

QOS AWARE HW/SW PARTITIONING ON RUN-TIME RECONFIGURABLE MULTIMEDIA PLATFORMS QOS AWARE HW/SW PARTITIONING ON RUN-TIE RECONFIGURABE UTIEDIA PATFORS N. Pham Ngoc, G. afrut 2, J-Y. gnolet 2, G. Deconnc, and R. auwerens 2 Katholee Unverstet euven-esat/eecta Kasteelpar Arenberg 0 B-300

More information

A fair buffer allocation scheme

A fair buffer allocation scheme A far buffer allocaton scheme Juha Henanen and Kalev Klkk Telecom Fnland P.O. Box 228, SF-330 Tampere, Fnland E-mal: juha.henanen@tele.f Abstract An approprate servce for data traffc n ATM networks requres

More information

A Semi-Distributed Load Balancing Architecture and Algorithm for Heterogeneous Wireless Networks

A Semi-Distributed Load Balancing Architecture and Algorithm for Heterogeneous Wireless Networks A Sem-Dstrbuted oad Balancng Archtecture and Algorthm for Heterogeneous reless Networks Md. Golam Rabul Ala Choong Seon Hong * Kyung Hee Unversty, Korea rob@networkng.khu.ac.kr, cshong@khu.ac.kr Abstract

More information

Energy Aware Virtual Machine Migration Techniques for Cloud Environment

Energy Aware Virtual Machine Migration Techniques for Cloud Environment Energy Aware rtual Machne Mgraton Technques for Cloud Envronment Kamal Gupta Department of CSE MMU, Sadopur jay Katyar, PhD Department of CSE MMU, Mullana ABSTRACT Cloud Computng offers ndspensable nfrastructure

More information

Intro. Iterators. 1. Access

Intro. Iterators. 1. Access Intro Ths mornng I d lke to talk a lttle bt about s and s. We wll start out wth smlartes and dfferences, then we wll see how to draw them n envronment dagrams, and we wll fnsh wth some examples. Happy

More information

Term Weighting Classification System Using the Chi-square Statistic for the Classification Subtask at NTCIR-6 Patent Retrieval Task

Term Weighting Classification System Using the Chi-square Statistic for the Classification Subtask at NTCIR-6 Patent Retrieval Task Proceedngs of NTCIR-6 Workshop Meetng, May 15-18, 2007, Tokyo, Japan Term Weghtng Classfcaton System Usng the Ch-square Statstc for the Classfcaton Subtask at NTCIR-6 Patent Retreval Task Kotaro Hashmoto

More information

Towards High Fidelity Network Emulation

Towards High Fidelity Network Emulation Towards Hgh Fdelty Network Emulaton Lanje Cao, Xangyu Bu, Sona Fahmy, Syuan Cao Department of Computer Scence, Purdue nversty, West Lafayette, IN, SA E-mal: {cao62, xb, fahmy, cao208}@purdue.edu Abstract

More information

Sum of Linear and Fractional Multiobjective Programming Problem under Fuzzy Rules Constraints

Sum of Linear and Fractional Multiobjective Programming Problem under Fuzzy Rules Constraints Australan Journal of Basc and Appled Scences, 2(4): 1204-1208, 2008 ISSN 1991-8178 Sum of Lnear and Fractonal Multobjectve Programmng Problem under Fuzzy Rules Constrants 1 2 Sanjay Jan and Kalash Lachhwan

More information

Reducing Frame Rate for Object Tracking

Reducing Frame Rate for Object Tracking Reducng Frame Rate for Object Trackng Pavel Korshunov 1 and We Tsang Oo 2 1 Natonal Unversty of Sngapore, Sngapore 11977, pavelkor@comp.nus.edu.sg 2 Natonal Unversty of Sngapore, Sngapore 11977, oowt@comp.nus.edu.sg

More information