The Use of Cloud Computing Resources in an HPC Environment

Bill Labate, UCLA Office of Information Technology
Prakashan Korambath, UCLA Institute for Digital Research & Education

Cloud computing becomes relevant to HPC users only when their specific requirements can be met in a cloud environment. In broad terms these requirements involve hardware and OS/software dependencies. The more generic an HPC user's requirements (particularly for hardware), the greater the probability they can be met by a cloud. The hardware computing environment that HPC applications require, as opposed to the OS/software environment, is much harder to obtain from a general cloud computing provider. This is because HPC users are generally interested in extracting as much performance from the hardware as possible. Knowledge of the CPU type, the amount of L2 cache, floating point operations per clock cycle, the availability of accelerator hardware such as a GPU, special-purpose FPGA or Cell processor, the bus architecture, the amount of memory per core, the availability of a parallel file system versus standard NFS storage, an Ethernet network versus a high-performance interconnect, and a good compiler all contribute to how well a given code will perform. By appropriately modifying their code to match the environment, a skilled researcher can speed up a job's runtime by a factor of 10 to 100. At the same time, there are situations where running a job in a generic environment is sufficient no matter how long it takes to finish, generally because local resources are insufficient and access to something that runs slower is better than no access at all. In any case the cloud environment must meet some level of applicability to be useful.
Because the revenue model of many of the large cloud computing providers, such as Google and Amazon, is to use idle cycles from their fairly generic hardware, there is no incentive (at this time) for them to provide the highly optimized hardware that is often required for HPC applications. While as of now you can specify things like memory, CPU speed, number of cores and physical proximity of systems, to date no company has taken the next step and made truly HPC-caliber hardware resources available. For the main consumers of cloud services, a typical application might be one that includes a website, some type of application processing and the population of a database. In this case the user may only care about the CPU speed, memory and storage space. It is this type of environment that we attempt to match with HPC requirements.
Definition

Just as with Grid computing, cloud computing has many different definitions and interpretations. In this paper we define cloud computing, in terms of HPC, as a configurable resource, versus a fixed resource such as those provided through a computational Grid. Configurable in this context means the OS/software operating environment can be tailored to a specific application/code run. This could include the OS, kernel-level settings, libraries and other software (versus hardware) dependencies. As mentioned above, no cloud computing service that we are aware of offers the hardware tailoring that would be of interest in HPC, i.e., high-speed interconnects and storage/scratch space. Hardware tailoring would be done by utilizing a given cloud resource that meets a specific requirement. Currently there is no way to dynamically change hardware other than by targeting specific resources or through some type of scheduling/resource allocation system. Basically, the hardware you need must already be installed on the resource you want to use.

HPC Use Cases and Hardware Requirements

High performance computing can be divided into two major use cases, serial and parallel, plus a hybrid called multi-threaded, which is essentially parallel computing on a single node. Parallel can further be divided into loosely and tightly coupled. The most demanding characteristic is the amount of dependency or communication (coupling) between the multiple processes being used for a given job.

Serial (Single Threaded)

Serial applications, those that run simultaneous but separate processes on separate hardware (hosts), are good candidates for cloud computing. These single-threaded jobs will run as fast as the CPU, memory and I/O hardware permit. The only communication occurs when a given process completes and writes its results back to a central location. The job finishes when all processes complete.
This use case is often referred to as embarrassingly parallel, as there is no dependency on the other processes that are running, and it can easily be scaled in a distributed computing environment depending on the ability of a given application to do so. A good example of this type of scenario is SETI@home, run through the Berkeley Open Infrastructure for Network Computing (BOINC) project. Here over 500,000 hosts can work on discrete pieces of data and produce their results without any requirement to synchronize with any other host except the main SETI@home server. Note that some users run commercial applications in their serial slots, so licensing will be a factor, as some licenses are limited either to a number of nodes or to specific nodes identified by MAC address. It is entirely possible that some applications will not be usable in a cloud environment.

Multi Threaded

These are jobs that share memory on a single node, and all threads must run on that node. Most C/C++ and Fortran compilers can distribute
the compute-intensive part of a job into multiple threads with the insertion of some compiler directives by the programmer. OpenMP is the industry standard for these kinds of jobs. This type of computing is becoming increasingly important as core counts increase. Both Intel and AMD have invested substantial resources to allow their chips to run multi-threaded applications more efficiently.

Parallel (Distributed Memory)

Jobs that run in this environment are CPU intensive, memory intensive or both. Memory-intensive jobs are those whose memory requirements are higher than the maximum memory a single node can provide, no matter how fast the CPUs are. Distributed parallel jobs often spread parts of the data or the computation across multiple nodes interconnected through a network switch. Each thread may need to exchange data with, or update the local memory of, other threads on the same or remote nodes periodically. This makes parallel jobs dependent on both latency and memory/data bandwidth. Some parallel jobs become impossible to scale beyond a few nodes unless a faster interconnect fabric such as Infiniband or Myrinet is used. The amount of this communication (bandwidth) and its time-sensitive nature (latency) determine the application's coupling. In this case the network or interconnect determines how efficiently a given job is processed. A cloud resource would have some type of network, generally 100Mb Fast Ethernet or in some cases 1Gb Gigabit Ethernet. In contrast, most tightly coupled clusters rely on at least Gigabit Ethernet, while higher performance systems rely on a special interconnect such as Infiniband, Myrinet or another proprietary fabric with high bandwidth (10Gb or higher) and low latency (in the low, single-figure microsecond range). Note that there is no reason a tightly coupled job could not run on a 100Mb network.
But, depending on how much communication the application requires between hosts, the slowdown in processing a given job could be many orders of magnitude, to the point that the hardware is for all intents and purposes useless.

Custom Operating Environments

One of the benefits of the current Grid environment is also a drawback to usability: the ability to construct and run in a custom operating environment does not currently exist, as Grid resources are available only in a static, predefined way. On the positive side, with well-defined specifications, matching requirements to resources is much easier. For some users, however, this rigidity inhibits running their code in an environment optimized for them. Frequently users have prototyped their applications on hardware that they own and completely control. Generally the environment they have created fulfills the dependencies required by their application. In some cases, when an application is moved to a new resource, they
will have to modify it, sometimes substantially. If they could take their environment with them, this extra step would be avoided. On the other hand, HPC applications written with scalability in mind should keep the need for a custom environment to a minimum.

Building the Custom Environment

For an HPC user who requires a custom operating environment, using cloud resources takes some extra effort. They must recreate an operating environment that is compatible with the OS choices provided by the cloud, load it onto the cloud resource and deploy it over the virtual hardware assigned to them (Amazon, for instance, calls this hardware Instance Types). For most users who have created a custom operating environment, this is a somewhat trivial effort. For other, less sophisticated users, it could be beyond their ability to easily construct. One way around this issue is to provide pre-built environments that have been compiled and tested on a given hardware environment. Most cloud services do this with environments that can include web servers, databases and application environments on specific operating systems. The same can be done for HPC users, where specific libraries, compilers and applications can be pre-built for an environment. One could envision a service that would allow this to be done interactively from a menu of choices. Such services already exist for non-HPC environments and could be adapted for HPC users.

How Cloud Computing Could Fit Into the HPC Environment

From the various use cases discussed previously, there are several where on-demand resources from a cloud provider could be utilized in an HPC environment.

Serial (embarrassingly parallel) jobs

Serial-type jobs, with or without the requirement for a custom operating environment, are ideally suited for cloud environments. Generally their custom operating environment requirements are low, possibly limited to specific commercial applications, compilers or libraries.
Multi-threaded jobs

Multi-threaded jobs run more efficiently on hardware that has been optimized for the purpose. CPUs with advanced multi-threading capabilities, fast memory and a fast memory bus architecture are better suited for multi-threaded jobs. If a cloud computing provider is willing to make this type of hardware available, then this is a viable use case. Multi-threaded jobs can run on less optimized hardware, just not as efficiently; for some use cases this is entirely acceptable.

An add-on capability or overflow service on the Grid

For serial and multi-threaded jobs, if cloud resources could be coupled to a Grid scheduling system, it would be possible to extend a Grid with cloud resources when either sufficient resources were not available within
the Grid or a large burst requirement needed to be accommodated. For this to work, certain network and operating environment dependencies would have to be satisfied. One could look at this use case as either stable or ad hoc. For a stable resource, some type of longer-term agreement would have to be worked out with the cloud computing provider. The fact that these resources must be paid for would also require careful accounting of their usage. For an ad hoc resource, a method would have to be developed to quickly establish and tear down a connection with a cloud provider. Some companies are actually offering spare cycles to universities in exchange for a tax deduction. The notice given for this availability can be fairly short, so a way of establishing this link quickly is extremely important.

Supercomputing Centers as an HPC Cloud

The national supercomputing centers are in the best position to provide a true HPC cloud environment, one that could be used for serial, multi-threaded and parallel applications and could fulfill specific hardware requirements for interconnects, storage and node configurations. Loosely defined, supercomputing centers are a cloud without the ability to provide configurable resources. Whether the national supercomputing centers have the ability and the desire to provide a configurable environment is unknown at this time.

Conclusions

Cloud computing offers great promise for organizations to supplement their computing capabilities without building out their IT infrastructure. Surges in demand can be met more efficiently, and the build-out of very expensive data centers, with the related energy costs, can be avoided or mitigated. Provided the costs of cloud services are reasonable, this is a very attractive scenario. For HPC users, cloud computing provides a method of satisfying some of the specific use cases described above. At this time, however, cloud computing for high-end HPC usage is not a viable solution.
It remains to be seen whether cloud service providers can develop a revenue model that would make true HPC resources available at a reasonable price.