
ESRFUP - WP11
FINAL REPORT ON GRID EVALUATION PROJECT

EU Milestone M11.1
Document identifier: ESRFUP_M11.1.doc
Date: 26/02/10
Authors: F. Calvelo-Vazquez, A. Gotz, R. Dimper, E. Taurel, C. Koerdt, G. Foerstner

Abstract: This document contains the information which supports milestone 11.1, "Final report on operational experiences with the international test bed installation including future orientations for photon science Grid activities".

TABLE OF CONTENTS

1. EXECUTIVE SUMMARY AND RECOMMENDATIONS
2. OUR INCENTIVES FOR THE USE OF GRID TECHNOLOGY
   2.1. BACKGROUND
   2.2. PROJECT DESCRIPTION
   2.3. TERMINOLOGY
   2.4. IMMEDIATE AND LONG TERM NEEDS OF THE SYNCHROTRON USER FACILITIES
   2.5. THE EXISTING EUROPEAN GRID INFRASTRUCTURE, THE TECHNOLOGY IN USE AND SUSPECTED OVERLAP
        The European Grid infrastructure project EGEE
3. OUR EXPERIENCE WITH THE EGEE GRID
   3.1. THE INSTALLATION (AND THE OBSTACLES)
        Virtualization
        Networking
        Deployed components
        Monitoring tools
        MPI
   3.2. APPLICATIONS
        Different classes, long/short run, data intensive
        SPD - a class 1 type application
        Gasbor - a class 4 type application
        PyHST - a class 2 type application
        The process of porting
        Software management
   3.3. DATA TRANSFERS
        Throughput numbers, regular GridFTP transfer statistics
   3.4. SECURE REMOTE RESOURCE ACCESS AND USER MANAGEMENT
        Perimeter Protection
        User Management
   3.5. BUILDING A VIRTUAL PHOTON USER COMMUNITY
        The VO concept
        Experience gathered with collaborating partners
4. OUR EVALUATION
   4.1. MATCH OR MISMATCH OF THE TECHNOLOGY WITH NEEDS
   4.2. THE IDEAL SYNCHROTRON GRID
5. CONCLUSION

1. EXECUTIVE SUMMARY AND RECOMMENDATIONS

Grid computing has been actively developed and financed for more than 10 years now. It has been proposed as the next generation World Wide Web (WWW) for high-performance computing and has been used successfully in high energy physics. So far it has not found many uses in photon science. It is an obvious candidate for addressing the deluge of data produced in photon science at synchrotrons like the ESRF around the world. Because the data analysis needs of photon science are not the same as those of high energy physics, it is necessary to evaluate Grid technology specifically for photon science applications and communities. The ESRF Upgrade Programme offers a unique opportunity to evaluate the Grid for photon science at a photon source, i.e. a synchrotron. A work package (WP11) was set up as part of the European Union FP7 financed programme for the ESRF Upgrade (ESRFUP) to study Grid computing for photon science. This is the final report of this work package.

The study was restricted to EGEE Grid computing and did not cover other Grid computing technologies. In order to simulate a real Grid setup, hardware was procured and installed at 3 partner sites with the EGEE glite software. The test Grid enabled experience to be gained on setting up and managing a Grid site, both locally and at the partner sites.

This study has shown that overall the EGEE Grid is not suited to the case of photon science. The case studies demonstrated that, except for a small number of applications (embarrassingly parallel programs which are CPU-intensive and require little input or output data), the majority of photon science applications are not suited for highly distributed Grid computing like EGEE. The data intensiveness of photon science applications does not scale to such Grids: public networks are too slow for transporting large volumes of data to and from storage elements. Most simulations used in photon science which are CPU-intensive and require little data are not embarrassingly parallel. They require fast connections between the compute nodes, usually based on a protocol called MPI, and the EGEE Grid provides little support for installing MPI. The photon science communities are organised in a very heterogeneous manner and are much smaller than high energy physics communities. This makes it much more difficult to apply Grid solutions for managing communities.

Due to the high cost in terms of human resources of managing EGEE Grid computing sites, and the low added value for photon science, this report does not recommend EGEE-like Grid computing for photon science. We think the ESRF has much more to gain from investing in high performance computing centres locally and in speeding up data reduction and analysis by porting programs to the new generation of GPUs and multi-core processors. Remote access to HPC centres should be provided to users using well-known, simple-to-manage solutions like secure shell.

2. OUR INCENTIVES FOR THE USE OF GRID TECHNOLOGY

2.1. BACKGROUND

Grid computing has been around since the end of the 90s and has quickly developed and gained in complexity. At the ESRF, the first attempts to use Grid software date back to the beginning of 2005, when the Condor batch submission software was successfully installed. At that time it quickly became apparent that, while the Grid might have a huge potential to address emerging needs of our user community, an in-depth analysis was needed to understand and measure the technical and organisational implications.

The ESRF Upgrade Programme triggered an in-depth reflection on how to deal with the expected flood of data from high resolution detectors with unprecedented frame rates. Some of the challenges are technical, others are organisational; all of them deal with latency, i.e. the speed at which data can be transferred, stored, processed and visualised. The Grid was mentioned in the ESRF Science and Technology Programme as one of the promising technologies to deal with issues like computational resources, storage resources, code repositories, data catalogues, data security and data access. The ESRF Upgrade Programme being part of the ESFRI roadmap, preparatory money was made available for a number of activities, including a Grid feasibility study, the so-called ESRFUP WP11. WP11 gave us the opportunity to acquire hands-on experience with Grid software, with the overarching aim of being able to decide at the end of the project whether Grid software could really fulfil our expectations and solve one or several of our emerging needs deriving from the data avalanche.

2.2. PROJECT DESCRIPTION

The objective of WP11 is to study the feasibility of participating in the Enabling Grids for E-sciencE (EGEE) Grid initiative as a complementary data management tool (storage and analysis) for data intensive Synchrotron Radiation research. Grid software could help in handling the enormous data output of certain experiments the upgraded ESRF will allow for. The implementation of Grid software and the porting of scientific software to such an environment is an integral part of the ESRF Upgrade Programme. Scientists coming to the ESRF for data intensive experiments do not necessarily have the means in their home laboratory to store and analyse the vast amount of data they will collect at the ESRF or in other photon laboratories. Grid software may provide a cost effective solution to this problem. WP11 aimed at gaining hands-on experience with the glite software, organising a workshop with international experts and key photon science users, configuring a virtual organisation for synchrotron radiation, and training interested scientists in the usage of Grid tools. WP11 also included the procurement, installation and operation of a test bed between the ESRF and three other partner laboratories in Europe. This installation was intended to test one or two resource intensive applications, e.g. tomography volume reconstruction, to demonstrate data replication mechanisms, to test credential management, and to measure performance.

2.3. TERMINOLOGY

What do we call Grid in this report? Grid computing, or simply Grid, means many things to many people: from computers connected to a network, to a highly specific combination of software and

hardware to access a dedicated set of computers with a well defined set of services. The Wikipedia entry for Grid computing provides a good overview of the various meanings of Grid. In this project we have stuck to the general definition of Grid in Ian Foster's article "What is the Grid? A Three Point Checklist" [1]. In this article Ian Foster lists these primary attributes:

- Computing resources are not administered centrally.
- Open standards are used.
- Non-trivial quality of service is achieved.

We achieved these three attributes by installing computing resources at 4 sites managed by 4 different institutes connected via the Internet, i.e. dedicated network connections were not used. We used the open source glite [2] software to provide a non-trivial quality of service, e.g. computing and storage resources available 24 hours a day, 7 days a week. What we do not mean by Grid in this project is the sharing of centralised computer infrastructure as in TeraGrid, or the sharing of volunteered desktop resources as in the SETI@Home project, which uses the Berkeley Open Infrastructure for Network Computing (BOINC [3]) software. Similarly, the Cloud as a Grid-like technology has not been evaluated.

2.4. IMMEDIATE AND LONG TERM NEEDS OF THE SYNCHROTRON USER FACILITIES

The driving force behind the computing infrastructure at the ESRF is the rapidly increasing flow of data generated by the 31 public and 11 non-public beamlines. The data rate has been multiplied by 300 over the last ten years, and considering the ambitious developments of new beamlines with better detectors, increased automation, and optimised operation schedules, we expect this trend to continue or maybe even amplify. In 2009 the ESRF generated some 400 TB of data in more than files. In addition to the needs deriving directly from the data avalanche, additional requirements are emerging in the European science arena for managing and preserving scientific data.

The initial wish list of functionalities which motivated our desire to investigate whether Grid technology can provide solutions was:

Computational resources: Compute clusters made from commodity computers, GPUs, or blade systems are today's best solution for high-performance batch computing. We have started to concentrate our onsite compute power in several interconnected clusters connected to our storage facility and managed by the batch submission software Condor. This system has proven to be reliable and offers high performance. It will be upgraded to add functions such as checkpoint/restart and a better separation between interactive and batch processing. By adding a Grid software layer that uses Condor as a local workload manager, authenticated and authorised access to this resource could be given to local (Intranet) as well as remote (Internet) users, who could then share or aggregate the computational resources with other laboratories.

Storage resources: New ways of interacting with the very large data sets generated by the new science coming from the Upgrade Programme will need to be found. Grid software may leverage data analysis and curation of these large data sets. It could open the possibility to give secure access from remote sites to these data, to analyse them with compute resources made available at the ESRF, and to transfer raw or analysed data reliably to Grid resources at the users' home institutions.

Network resources and virtual organisation membership services: A high-performance and reliable networking infrastructure is fundamental for leveraging optimal usage of computing resources. Grid services for reliable data transfers, data replication and data curation, as well as data treatment, could give ESRF users the possibility to continue their scientific work from their home institutions in a secure and controlled manner.

Virtual organisations: VOs grouping scientists working in the same field could be configured to give secure access to compute and storage resources, which are either centralised at the ESRF or shared between participating parties inside the same virtual organisation. Grid certificates could be used to identify scientists across Europe and manage access rights to services and resources.

Code repositories: Scientific software needs continuous upgrading and maintenance. Code repositories in a Grid environment could improve the use of, and reduce the maintenance efforts for, such software. They could also give the possibility to establish Computing On Demand (COD) in VOs, reducing the need for time consuming data transfers.

Catalogues: In some scientific areas, databases of metadata and publications (catalogues) linked to measurements and their data made at the ESRF or other synchrotron radiation facilities would allow queries or browsing of scientific results for judging, comparing, or complementing new data sets. The basis for such catalogues is a common data format containing not only the raw data but also the curated data and metadata attached to a data set. Some data sets from a single experiment will be geographically dispersed and/or originate from cross disciplinary research. Merging such data sets requires transparent and easy access to distributed repositories. Data archiving and curation is increasingly important and indispensable to preserve data from unique samples. Tools are required to allow for long-term storage, browsing, and visualisation.

Grid security infrastructure: The connectivity layer in Grid software defines the core communication and authentication protocols required for the Grid Security Infrastructure (GSI). Authentication and authorisation protocols build on communication services to provide cryptographically secure mechanisms for verifying the identity of users and resources. The features available today make it possible to define VOs that span geographically distributed organisations with different administrations. The same functionality could be used to provide secure services to individual users or user groups.

2.5. THE EXISTING EUROPEAN GRID INFRASTRUCTURE, THE TECHNOLOGY IN USE AND SUSPECTED OVERLAP

The European Grid infrastructure project EGEE

As explained on their website, the Enabling Grids for E-sciencE project is the largest Grid computing project in Europe. The various project phases, from EGEE-I to currently EGEE-III, have covered six years with funding of roughly 20 million Euros for each of the phases. EGEE has

recently been replaced by the European Grid Initiative (EGI) project; refer to their website for the latest information. The EGEE Grid has a large number of participating Virtual Organizations and ranges from disciplines like High Energy Physics, its main client, to Life Science and Mathematics. The use of the infrastructure has increased steadily during the last couple of years, although disciplines other than High Energy Physics have played a secondary role only.

3. OUR EXPERIENCE WITH THE EGEE GRID

The path to achieving our goal of testing the EGEE Grid was far from straightforward. Some detours were made at the beginning before arriving at the final configuration. Along the way we learned the possibilities and limitations of the glite services and how best to manage them. On this course, the experience and feedback exchanged with our partners was of vital importance. We took important decisions for the final deployed testbed based on long discussions with EGEE experts and VO site administrators. Finally, with their help, we managed to deploy all the services that we needed for our tests.

3.1. THE INSTALLATION (AND THE OBSTACLES)

Virtualization

Selected Platform

Due to the large number of required middleware services and the limited number of servers in our test bed installation, it was decided to use virtualization techniques. This was expected to make better use of the available resources and improve the flexibility of the setup in case of necessary design changes after the first tests. We also expected to be able to easily clone the system for the setup of our partner sites.

In terms of performance, this design was not the best platform for a future production system. Nevertheless, the choice of virtualization technology favoured a system with limited impact on overall performance. Initially, two virtualization platforms were evaluated:

- VMware
- Citrix XenServer

The following points were considered an important advantage and at that time justified the final decision in favour of Citrix XenServer:

- Xen hypervisor structure
- Paravirtualization
- Virtual machine on-line migration
- Number of managed resources per VM
- Price

Some of these features, like the Xen hypervisor in conjunction with paravirtualization [4], can achieve high levels of performance even on the x86 host architecture, which is notoriously uncooperative with traditional virtualization techniques [5]. Others, like XenMotion (live migration of Virtual Machines between physical hosts), were included in the standard Citrix XenServer product (but were an add-on for VMware). Also, some characteristics like competitive prices, the number of managed resources per VM, support for hot-swap CPUs in Linux VMs, and support for Intel-VT and AMD-V hardware, were better covered by Citrix XenServer at the time of the study (beginning of 2008).

A first test XenServer setup was installed in August. The final Citrix XenServer platform (Version 4.0), placed in the DMZ, was set up early September. Our virtualized platform was composed initially of three (and later of six) Sun Fire x4150 servers, forming a so-called XenPool. Almost a dozen middleware servers have been deployed as Virtual Machines (VMs) within the ESRF's XenPool. The VMs run Scientific Linux 4.7. Both the i386 and the x86_64 architecture were required by the glite middleware.

Platform limitations

Of course, the use of a shared platform gave us an inexpensive environment where we could test all glite services in a flexible and easily manageable virtualized setting. However, these kinds of platforms, while quite suitable for test beds, also have their limitations. Within a virtualized setup we can profit from the possibility of sharing most resources, like CPU, network interfaces, and available storage. However, RAM cannot be overbooked, and throughput bandwidth can be seriously affected when the number of virtualized services grows.

[4] Paravirtualization is a technique that presents a layer between the virtual machine and the actual underlying hardware and is therefore able to limit any performance degradation due to virtualization.

Unfortunately, on our setup the XenMotion capabilities subsequently had to be abandoned, after we found that VMs deployed on shared storage via NFS severely degraded the performance of any I/O operation. All virtual disks of the guests (the VMs) had to be moved locally and distributed over the available XenServer hosts.

Networking

Working behind a NAT

The first network scenario was strongly limited by the number of public IP addresses available on the DMZ. Because of this, all servers were installed on a private DMZ segment, with NAT translation for all the services. Unfortunately, after considerable time had been invested to deploy this setup, the services never worked correctly behind the NAT box. Even with complex scenarios and some imaginative solutions, some middleware components (e.g. GridFTP on dCache) did not work correctly.

Figure 1: The initial network setup turned out to be problematic

Second scenario (native public subnet)

Finally a decision was made to reserve a new class C IP address range and to redeploy all glite middleware components in their own external DMZ. The network team then started the lengthy negotiation procedure with RENATER for the new segment. It took some time to incorporate this new network in the existing infrastructure, and even more delay was added to migrate the middleware services or redeploy them.

Figure 2: Final network setup at XRAY resource centres

However, the experience gained was useful afterwards for deploying the equivalent infrastructure at the partner sites PSI and Soleil.

The common backbone

The GÉANT2 [6] project is a collaboration between 34 participants: 32 European national research and education networks (NRENs), plus DANTE (managing partner, project co-ordinator and responsible for operations) and TERENA (which supports the coordinated development of the national NRENs). The GÉANT2 network provides connectivity and services to more than 30 million researchers at 8000 institutions in 34 different European countries, and links to a number of other world regions. It was established in 1993 and has since played a pivotal role in five consecutive generations of pan-European research network: EuropaNET, TEN-34, TEN-155, GÉANT and now GÉANT2. All our Grid partners are linked to this pan-European research network. Metropolitan LANs give them access to their local NRENs:

- DFN [7] for DESY
- SWITCH [8] for PSI
- RENATER [9] for Soleil and the ESRF

The main advantages of using this kind of high-speed public network to interconnect our resource centres are:

- The backbone is operated at data transfer speeds of up to 10 Gbps across 50,000 km of network infrastructure, of which 12,000 km is based on GÉANT2 fibre.
- It comprises 25 Points of Presence (PoPs), 44 routes and 18 dark fibre routes.
- Multiple 10 Gbps wavelengths are employed in the network's core.

The cost of fine tuning

Tuning and optimization, in order to get the best possible network performance, covers a very large range of different aspects. To achieve good end-to-end throughput, every network hop on the path has to be analyzed in depth. Switches, routers and even host kernel parameters have to be optimised. Some useful network tools have greatly helped to accomplish these tedious tasks:

- Tools for measuring the bandwidth: iperf
- Sniffers to examine the traffic: tcpdump, wireshark
- Monitoring tools: ntop, MRTG graphs

In our specific network scenario, the end switches have been enabled to use jumbo frames. This feature is also supported on the RENATER network. It means that we could employ Ethernet frames with more than 1,500 bytes of payload for end-to-end communications between GÉANT resource centres.

The most common network protocol used in glite services is the Transmission Control Protocol, or TCP. TCP uses a "congestion window" to determine how many packets it can send at one time. The larger the congestion window size, the higher the throughput. The TCP "slow start" and "congestion avoidance" algorithms determine the size of the congestion window. The maximum congestion window is related to the amount of buffer space that the kernel allocates for each socket. For each socket there is a default value for the buffer size, which programs can change by using a system library call just before opening the socket. For some operating systems there is also a kernel-enforced maximum buffer size. You can adjust the buffer size for both the sending and receiving ends of the socket.

To achieve maximum throughput, it is critical to use optimal TCP socket buffer sizes for the link you are using. If the buffers are too small, the TCP congestion window will never open up fully, so the sender will be throttled. If the buffers are too large, the sender can overrun the receiver, which will cause the receiver to drop packets and the TCP congestion window to shut down. Assuming there is no network congestion or packet loss, network throughput is directly related to the TCP buffer size and the network latency. There are two TCP settings to consider: the default TCP buffer size and the maximum TCP buffer size. A user-level program can modify the default buffer size, but the maximum buffer size must be modified at kernel level.
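As a concrete illustration (not taken from our test scripts), the small sketch below shows how a user-level program can request larger socket buffers before opening a connection, and which Linux kernel parameters cap that request; the 4 MB value and the endpoint name are arbitrary examples chosen for a high-latency gigabit path.

    import socket

    # Hypothetical example: request 4 MB send/receive buffers before connecting.
    # The kernel silently caps these at net.core.wmem_max / net.core.rmem_max,
    # which an administrator can raise with sysctl, e.g.:
    #   sysctl -w net.core.rmem_max=4194304
    #   sysctl -w net.core.wmem_max=4194304
    BUF = 4 * 1024 * 1024

    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    s.setsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF, BUF)   # must be done ...
    s.setsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF, BUF)   # ... before connect()
    s.connect(("data.example-partner.eu", 2811))             # invented endpoint

    # getsockopt shows what the kernel actually granted (Linux reports twice the request)
    print("effective send buffer:", s.getsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF))
    s.close()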

Some specific work along these lines was carried out at an early stage. Slow traffic rates between PSI and the ESRF were detected during the initial transfer tests. By reviewing every single step in the whole path, the problem was solved by tuning all the elements in the chain. Kernel parameters were also adapted on both sides according to the latencies and round-trip delays on the intermediate networks. The experience gained with these time-consuming tasks was reused to install the Grid hardware at SOLEIL.

Deployed components

We set up the different middleware components on top of two virtual machine templates running Scientific Linux 4.7, one for the i386 and another for the x86_64 architecture. The software packages were installed with the help of the yum tool. The configuration was done using YAIM.

Site Services

a. glite-UI

Installation of the glite-UI was rather straightforward. It was however disappointing that the glite-UI was not available for the x86_64 architecture. People who need to compile their programs then find a different architecture on the UI from the one on the Worker Nodes. We also installed a UI tarball accessible to the ESRF users, which was mounted via NFS on the ESRF's internal NICE cluster nodes.

b. Cream CE

We decided to start with the Cream CE. Some of us had learned how to set it up in the administration course during the GridKa school in September. In September 2008 it had been in PPS for a few months already and was expected to be released in production soon afterwards (as it was supposed to replace the older lcg-CE as soon as possible). Unfortunately, this release date turned out to be far too optimistic.

c. Lcg-CE

Due to the delay of the Cream CE production release we also had to install the older lcg-CE in order to get the official certification from the French ROC. Another reason was the necessity of running MPI jobs, for which the Cream CE was not (yet) tested; we could also compare the performance of the two implementations with respect to our needs. After the network topology was changed (described in the previous chapter) the installed lcg-CE became operational.

d. Site-BDII

The information on our site resources was collected and published by the site-BDII. The component, although released in the 32-bit version, was successfully installed on the x86_64 platform.

e. Monitoring Box

The mon-box was installed in the weeks before certification and subsequently hosted the nagios and ganglia monitoring tools.

VO Services

f. VOMS server

The VOMS server was operational in September 2008 and could be used from sites like DESY to run jobs under the XRAY virtual organization. We managed to run the VOMS server on the x86_64 platform, although the release was officially certified for the i386. The VOMS server suffered from a lot of bugs, and only a couple of them were fixable with workarounds. The situation improved once the newer version was released in November.

g. glite-WMS

A first installation attempt within the older, problematic network topology had been unsuccessful. After the move to the new network structure we repeated the attempt and everything went rather smoothly. It was a little unfortunate, however, that at the time of our site certification the Cream CE was not able to receive jobs via the WMS. The WMS release which supported the submission of jobs to the Cream CE was introduced approximately two months later, in March.

Worker Nodes

All our 14 compute nodes were installed with the glite-WN-torque module on the x86_64 architecture (SL4.7). Altogether 80 CPU cores were made available (6 nodes with two quad-core CPUs plus 8 nodes with two dual-core CPUs). A shared software area between the ESRF's User Interface and all Worker Nodes was added, including later versions of Python (2.5.2, 2.6.1), as well as MatLab Runtime 7.6, g95 and others. With the installation of the MPI packages in February 2009 we also moved to a shared /home directory.

Setting up the Storage Elements

a. dCache test

In September 2008 the situation was such that dCache had just produced a major new release. Moreover, it had changed the underlying technology for handling the name space (moving from pnfs to Chimera). Since we wanted to avoid installing an old system that we would have to change soon after, we decided to go for the latest version. Unfortunately, the documentation of this new release was only partial and the release was not yet in a very stable state. So after several unsuccessful attempts to run version 1.9, we were advised by the dCache support people to install version 1.8 in combination with the Chimera namespace in order to limit future migration problems. The installation was internally operational in October. However, we had to wait for the move to the new network scenario to be able to use the storage element from outside the ESRF. Even after the installation, quite a few problems had to be solved. Due to the lack of good documentation and the sheer complexity of the dCache technology, a lot of time had to be invested to get and keep the service running.

b. DPM

Because of the initial problems in making the dCache installation work, we decided to give DPM a try. We would then have another technology for comparison.

Indeed, the installation of DPM went rather smoothly and fewer problems had to be solved while operating this storage element. Nevertheless, when it comes to transferring very large files (in excess of the 4 GB recommended by high energy physics user groups) one also runs into problems and has to invest some time in finding the right configuration. In this respect, one has to mention that the support for DPM is quite well organised and the developers were responsive to our problems.

Monitoring tools

The whole XRAY VO infrastructure is monitored all the time, and a full set of complementary tools exchange information continuously, sending traps in case of any problem with the VO. We distinguish two different groups of tools: those which are part of the standard EGEE surveillance programme, and traditional/local monitoring systems.

EGEE Surveillance System

The EGEE SA1 group, as part of its activity, provides a set of tools that can be used to monitor and test registered EGEE partner sites. Examples are the CIC Portal, the SAM tests, the GOCDB3 centre and the EGEE security mailing list. One requirement for access to most of these services is to register and get your resource centre certified. After your site is certified, you start to get GGUS trouble tickets automatically, you are allowed to check your site status on GSTAT, and you can even join the security mailing lists.

[Screenshot from the GSTAT portal]

This layer provides a good level of monitoring from the point of view of the glite services, but it is not enough if the failure itself is at the operating system level or even lower, at the server level. Therefore we had to use complementary local tools.

Traditional monitoring tools

On the other hand, a complete set of traditional monitoring tools has been deployed locally at each partner site to monitor the site infrastructure hardware and the operating system layers. A full-mesh NAGIOS system monitors all basic resource parameters by checking the most important server and OS features. Moreover, Ganglia daemons have been deployed at each partner site to track resource utilization trends.

The MRTG graphs and the ntop tool have also been added to provide useful information about throughput utilization and protocol distributions. Based on the output of these tools, we extracted the relevant information to customize the QoS system (Packeteer) according to our needs.

[Ntop snapshot]

Specific monitoring tools

Another set of complementary tools monitored the glite services in more depth. This is the case of the WLCG Nagios component deployed on the MonBox server at each partner site, giving us a more detailed status of every single deployed facility on the resource centre.

[WLCG-Nagios screenshot]

MPI

What is MPI?

According to Wikipedia: MPI is a language-independent communication protocol used to program parallel computers. Both point-to-point and collective communication are supported. MPI "is a message-passing application programmer interface, together with protocol and semantic specifications for how its features must behave in any implementation." MPI's goals are high performance, scalability, and portability. MPI remains the dominant model used in high-performance computing today [10].

The last sentence is particularly relevant. MPI is the de-facto standard for parallel programming in the scientific world. Many simulation programs used in synchrotron science have been ported to MPI to enable them to run in parallel on compute clusters. For this reason, if we want to attract simulation programs to the Grid, it is essential to install MPI on the Grid.
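To make the programming model concrete, here is a minimal sketch using the mpi4py Python binding (chosen purely for illustration; it is not one of the applications discussed in this report). Each rank computes a partial result and a collective reduction combines them on rank 0; it is this kind of tightly coupled communication that requires a properly installed MPI runtime and a fast interconnect between the worker nodes.

    # Minimal MPI sketch using the mpi4py binding; illustrative only.
    # Run with e.g.: mpirun -np 4 python mpi_sum.py
    from mpi4py import MPI

    comm = MPI.COMM_WORLD
    rank = comm.Get_rank()          # index of this process
    size = comm.Get_size()          # total number of processes in the job

    # Each rank works on its own slice of the problem ...
    partial = sum(range(rank, 1000000, size))

    # ... and a collective reduction combines the partial results on rank 0.
    total = comm.reduce(partial, op=MPI.SUM, root=0)

    if rank == 0:
        print("sum computed by %d ranks: %d" % (size, total))

Getting such a job to run on EGEE means that the MPI runtime, the shared home directory (or key-based worker node access) and the batch system integration all have to be in place on every worker node, which is exactly where the difficulties described below arose.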

Unfortunately this is easier said than done, for a number of reasons:

- MPI is supposed to be supported with glite, but it actually isn't.
- Releases are very, very few, and new glite releases were not tested in an MPI environment.
- MPI is only released in combination with Torque (no Condor or others).
- The MPI version released uses an old version of Torque, making the installation tedious to maintain and to upgrade.
- It was necessary (or recommended) to change from key-based WN access to shared home directories.
- A YAIM installation procedure is proposed, but does (did) not work.
- The documentation was not adequate, and the different parts available sometimes contradicted each other.
- At the time of installation there were no functioning SAM tests.
- There were no easy tests available to check the installation for oneself:
  o the tests available were not sufficient, sometimes only showing that it did not work;
  o there were no recipes to systematically check the installation.
- It is unclear how to handle MPI jobs on multi-core machines.
- It is unclear how to distribute jobs over different sites.
- It is unclear how to recollect jobs efficiently.

For the above reasons, and after trying for some time, we eventually gave up on getting MPI running on glite CREAM. This had the consequence that a potentially large number of simulation applications, which only require compute power and MPI, were not attracted to the Grid. Although this was an unsatisfactory result, we need to add that even if we had managed to make MPI work on the Grid, it would not be as attractive as running MPI on a local cluster, for the following reasons:

- In our case the worker nodes were inter-connected with Ethernet, whereas a local cluster can offer a higher performance interconnect like Infiniband.
- It is easier to manage jobs on a local cluster than on the Grid.
- Grid jobs seem to be less reliable than local batch jobs.
- Users find local cluster computing easier to learn and manage than the Grid.

In the future we urge the glite team to support MPI out-of-the-box if they want to make the Grid attractive for these types of jobs.

3.2. APPLICATIONS

Different classes, long/short run, data intensive

The suitability of the Grid as a solution for any field depends entirely on the type of applications and how often they are executed. This is true for photon science too. By studying the typical jobs and the frequency with which they occur, we have established classes of applications for synchrotron science. One application from three of these classes has been used in our case studies. There is a large disparity between the different photon science experiments. Some of them run all their data reduction and analysis on a single machine or laptop. Others need huge resources of the local cluster to run, thereby monopolising it for a single application.

Applications for synchrotron science can be divided roughly into the following classes:

- Class 1: data intensive short jobs - typically data reduction type jobs to correct or calibrate images, e.g. the case study of SPD below.
- Class 2: data intensive long jobs - typically data analysis jobs on many images to reconstruct a 3D volume, e.g. the case study of PyHST below.
- Class 3: CPU intensive parallel jobs - typically modelling or simulation type applications requiring MPI, e.g. FDMNES, MOLDY, etc.
- Class 4: CPU intensive independent jobs - typically modelling or simulation type applications, e.g. the case study of Gasbor below.
- Class 5: CPU intensive single jobs - typically data analysis jobs for fitting a model to measured data, e.g. GSAS.

It is hard to give exact figures for what percentage each class of applications represents of the total number of photon science applications, because many of the jobs are run on hosts which are not monitored, e.g. on desktops, laptops or hosts that are part of the experiment. The distribution of application classes depends on the experimental technique used. The ESRF is a multi-disciplinary facility and many different types of techniques are used. Some produce only small amounts of data but make heavy use of simulation, e.g. spectroscopy, while others produce huge amounts and sometimes use simulations, e.g. tomography. Most imaging based techniques produce large numbers of images. Images constantly increase in size due to the increasing number of pixels, and the number of images produced also constantly increases. From our study we found that data intensive jobs (classes 1 and 2) are the most common and pose the biggest challenge. However, simulation jobs are always required, and as models increase in size their needs will increase. We found that the class of applications which is best suited to the Grid, i.e. class 4, is in fact the least common. Class 5 applications represent another very common class of applications but were not studied on the Grid because users run them on their local machines and they are not resource bound.

In the following sections we discuss case studies on 3 different applications, each typical of one of the classes of applications identified above.

SPD - a class 1 type application

The application named SPD (SPatial Distortion) was selected to be ported to the Grid to check what the benefits of using a Grid infrastructure could be. This application, which has been developed at the ESRF, is widely used on several beamlines. Its aim is to correct the images taken by the cameras used as beamline detectors. These cameras are not perfect, and this software generates corrected images from the raw images coming out of the camera. The application has one input file, the raw image, and generates one output file, the corrected image. The corrections are based on 3 calibration files which correct for:

- the camera dark level
- the camera imperfections (flood file)
- the camera distortion

The SPD usage can be summarized by the following diagram:

[Diagram: the raw image, together with the dark, flood and distortion calibration files, is fed into SPD, which produces the corrected image]

The Grid_SPD application

A Python script called Grid_SPD has been written to run the SPD software on a set of images using the Grid infrastructure. Several types of Grid usage have been implemented, ranging from something close to a real Grid usage (no knowledge of where the data are and no knowledge of where the software will be executed) to something close to a cluster usage (SPD running on dedicated computer(s) with NFS access to the image data set). The Grid_SPD script always has a parameter which allows the user to select the number of images which will be processed by each Grid job. Grid_SPD will start as many jobs as needed to correct all the images in the set. For instance, with an image set of 100 images, if the user requires that 10 images be corrected by each job, Grid_SPD will create 10 jobs. Grid_SPD takes timing measurements of the various actions it performs on the Grid. It also implements a loop mode in which it redoes its work repeatedly and stores its timing measurements in a CSV file.

The LFC (Logical File Catalogue) server was the DESY LFC. When using the LFC, the SE (Storage Element) was hard-coded to be the ESRF DPM SE (physically located at the ESRF). The CE (Computing Element) was always hard-coded in the job description file as the ESRF LCG-CE, except in the CREAM-CE mode. Therefore this Grid_SPD, even in its most Grid-like mode, cannot be considered a pure Grid application. The following Grid_SPD running modes have been implemented:

1. The UI mode: This mode is the closest to a real Grid usage. The three correction files (dark file, flood file and distortion file) are stored on the UI (User Interface) computer. The image set is also stored on the UI and the corrected images will be put on the UI as well.

2. The LFC mode: In this mode, we try to minimize the data transfer between the UI computer and the Grid infrastructure. The image set is already on the LFC. The three correction files are stored on the UI. The corrected images will be put on the UI as well.

3. The parametric mode: In this mode, the image set is already on the LFC. The three correction files are stored on the UI. The corrected images will be stored on the UI as well. Grid_SPD is used in the so-called parametric job mode to send the job request to the WMS (Workload Management System). It is this parametric job which will in turn start the underlying jobs.

4. The CREAM-CE mode: In this mode, the image set is already on the LFC. The three correction files are stored on the UI. The corrected images will be put on the UI as well, but the jobs are not submitted to the WMS (Workload Management System). They are submitted directly to the ESRF CREAM-CE.

5. The NFS mode: In this mode, the file system on which the image set is stored is mounted on the Grid worker node. The three correction files are also stored on this NFS-mounted file system. The corrected images will be put in the same directory as the raw images (therefore on the NFS-mounted file system as well).

6. The local mode: all files are on a local disk and the jobs run on the same local host.

The results

Two charts are given, one for 10 images per job and one for 50 images per job. In these two charts, the bar "Submit jobs" is the sum of:

- the time needed to send the correction files and the images to the Grid (when relevant)
- the time needed to submit the jobs

The bar "Retrieve job outputs" is the sum of:

- the time needed to retrieve the job outputs
- the time needed to retrieve the corrected images (when relevant)

The numbers in these charts are average numbers.

[Chart: 18 jobs, 10 images/job - submit, wait, retrieve and total times for the UI, LFC, PARAMETRIC, CREAM-CE, NFS and LOCAL modes]

[Chart: 4 jobs, 50 images/job - submit, wait, retrieve and total times for the UI, LFC, PARAMETRIC, CREAM-CE, NFS and LOCAL modes]

The detailed results of this study can be found in deliverable D11.4 of this work package.

Conclusions

As we can see from these charts, the time needed to correct images using a small number of images per job is quite high. Using the Grid to run many small jobs, each one correcting a single image, is not very efficient. The time needed to get the data to the right place (accessible to the jobs running on a CE) is also noticeable: the data have to be transferred from the storage element to a disk accessible by the compute element, using the LFC to locate the storage element. The WMS parametric job allows us to decrease the time needed to start the jobs, but in the end the best result was obtained using the so-called NFS mode. However, this mode is far from a typical Grid application (the data are on a file system NFS-mounted on the worker node). To conclude, it seems that the Grid as it is today is not well adapted to this kind of application (many small jobs which are I/O intensive). Two of its main components (WMS and LFC) introduce a noticeable overhead.

Gasbor - a class 4 type application

Gasbor calculates domain structures of proteins from X-ray solution scattering. It relies on an ab initio method for building a structural model of the proteins [D. Svergun et al., Biophysical Journal 80, 2001]. The execution of the program for a typical set of scattering data runs for several days, and often two weeks, on a local desktop computer. Both the required input data size and the calculated results are on the

order of Megabytes or less. For statistical reasons it is desirable to run many similar jobs on a given data set. The required computing resources quickly become very large. As these large resources are only needed occasionally, the Grid seems to offer the ideal solution. The tests presented here were done with a much shorter test job, on the order of a few hours, to allow for more rapid feedback.

Job submission times to the WMS

Job submission times improved considerably after the upgrade of the WMS. One can see in the figures below the perfect linearity while submitting 277 jobs. The histogram plot shows submission times narrowly centred on 5 seconds per job.

[Plots: cumulative job submission time versus job number, and a histogram of submission times per job in seconds]

The ganglia plots of Grid-wms.esrf.eu also show that the WMS can handle the 250 or so jobs. Before the upgrade, the 4 GB of memory was quickly filled and necessitated frequent restarts.

Job finish times

The following is a study of 277 jobs submitted at the same time and executing on three different CEs of our XRAY infrastructure at DESY, PSI, and the ESRF. A large part is executed rapidly after submission to the WMS and finishes after about one hour. The remaining jobs report a status of Done after two, four, and some only after six hours. The latter are due to busy resources and include the time spent waiting in the WMS queue.

[Histogram: number of jobs versus job finish time in hours]

Effective job run time on different CEs

The effective time of the job, between its start on the worker node and its reporting "finished" to the site's Compute Element, depends of course on the local hardware and software environment it encountered on the respective worker node. The Gasbor user reported a runtime of 3-4 hours for this job on their local machine. It runs in about two hours on Grid worker nodes at the ESRF and PSI, and in less than half that time on DESY machines. The runtime on ESRF and PSI machines shows a bigger dispersion, due to the fact that four to eight job slots were available on each worker node depending on its number of available cores. It turned out that two machines at the ESRF were overloaded, as these had eight job slots for four CPUs.

[Histogram: wall clock time per job - number of jobs per time interval versus job runtime in minutes]

Those two machines alone were responsible for the execution times above 140 minutes.

Similar software

There is another widely used program within the synchrotron radiation science community that fits somewhat into the same category as Gasbor, namely FDMNES. FDMNES relies on a finite-difference method to calculate X-ray absorption near-edge structures. As with Gasbor, the required input and output data are rather small.

Conclusion

This type of application, with the combination of small input and output data and the need for a large number of independent jobs with long execution times, seems to be the ideal Grid application. The figure below makes a comparison of job throughput in different environments: locally on the user's desktop, a batch job on a local cluster with 20 free job slots, and the results from the submission of the 277 jobs to the Grid, for which we had roughly 150 job slots immediately available. A fourth case includes an 'optimized' Grid job, where we assume a better submission framework that would eliminate waiting jobs in the presence of free resources.

The assumption that more resources are immediately available on the Grid comes from the fact that by sharing resources in a Grid, one can reduce the so-called 'wait-while-idle' cycles. This of course depends a lot on usage patterns in an actual production environment. More detailed studies of Grid taxonomy can be found in e.g. [Yin Fei et al., Computers and Electrical Engineering 35 (2009) and references therein].

[Chart: job throughput in minutes for a local job, a batch job on a local cluster, a Grid job, and an optimised Grid job]

But even in this case there are certain negative aspects. These result from the fact that Grid jobs have a non-negligible risk of failing. The risks range from configuration errors on sites, to middleware bugs, to network troubles. It is therefore wise to limit execution times to a day or so. Gasbor, however, does not offer this flexibility. Interaction with the developers becomes necessary, which is often impractical and would be resisted unless a critical mass of Grid users could be found.

PyHST - a class 2 type application

What is PyHST?

PyHST is a suite of programs for analysing synchrotron tomography data and producing 3D volumes. An example of a data set produced at the ESRF is the tomogram of the skull of Australopithecus sediba, recently found in Malapa in South Africa, which could represent the missing link between primates and humans.

Rendering of the 3-D scan of the skull of the Australopithecus sediba child. Credits: P. Tafforeau

More examples of palaeontology data sets can be found online. Experiments using imaging techniques are the biggest producers of data at the ESRF. One example of imaging is tomography. Tomography experiments account for over 50% of the data produced at the ESRF. For this reason it is important to study how the Grid can help analyse tomography data. The diagram below shows the data flow of PyHST from the beamline to the local cluster when it is run at the ESRF.

Running PyHST on the Grid

A theoretical study of the time needed to run PyHST on the Grid has been done. A typical use case for PyHST is to reconstruct a volume of 2048x2048x2048 floats from a set of 1600 images. Each image is an array of 2048x2048 float numbers. Using the ESRF cluster, this computation takes 20 hours with a single job, and the computation time decreases linearly with the number of jobs used. Therefore, the input numbers for our estimation are:

- Input data: 1600 files of 2048x2048 floats
- Output data: a volume of 2048x2048x2048 floats in one file
- Computation time: 20 hours for one job

To run PyHST on the Grid, a typical sequence is:

1. Step 1: Send the input data from the User Interface to a Storage Element.
2. Step 2-a: Copy the input data from the Storage Element to each job running on a Worker Node.
3. Step 2-b: Do the computation.
4. Step 2-c: Copy the computation result to the Storage Element.
5. Step 3: Retrieve the output volume on the User Interface from the Storage Element.

When the computation is divided into several jobs, each job needs all the data. Each job computes a volume slice, and at the end the volume needs to be reconstructed from the outputs of all the running jobs. This last step was neglected in this case study because it is the same for all the cases (Grid and non-Grid). To estimate the time needed by step 1 and step 3, we will do 3 computations with different bandwidths available between the User Interface and the Storage Element:

1. 1 MByte/sec for a slow transfer
2. 10 MBytes/sec, which is an average throughput
3. 40 MBytes/sec for a fast transfer

We will double these cases by studying the transfer of the input data (the 1600 files) in two different flavours: one big tar file containing the 1600 image files, and 1600 different files. This is done to estimate the impact of the Logical File Catalogue. Registering a file in the Grid is a two-step process:

1. Register the file in the Logical File Catalogue. The time used for this registration is typically 2 seconds.
2. Send the file to the Storage Element.

In the case of sending one big tar file, the time needed to create the tar file and to untar it will also be taken into account. The total time is the sum of the times needed for Step 1, Step 2 and Step 3.

Time needed for Step 1

One file of 2048x2048 floats means 16 MBytes. Therefore, the amount of input data is 16x1600 = 25600 MBytes, i.e. 25 GBytes. On the computer used for our test bench, the time needed to tar 20 files of 16 MBytes each is 10 sec, which means 13 min and 20 sec for 1600 files. The time needed for Step 1 is summarized in the following table:

                     1 MByte/sec                10 MBytes/sec        40 MBytes/sec
1 big tar file       7 hours 20 min. 02 sec.    56 min. 02 sec.      24 min. 02 sec.
1600 files           8 hours                    1 hour 36 min.       1 hour 4 min.

Time needed for Step 2

This is the sum of:

- the time needed to transfer the data from the Storage Element to the Worker Node(s)
- the computation time
- the time needed to transfer the data back from the Worker Node(s) to the Storage Element.

The resulting volume (2048x2048x2048 floats) is 32768 MBytes (32 GBytes). We have the final equation:

t = ((d / T) * N) + (20 * 60 * 60 / N) + (((c / N) / T) * N)

with:
d = input data size in MBytes (25600)
T = transfer rate in MBytes/sec (40)
N = number of jobs
c = output data size in MBytes (32768)

With these numbers the equation becomes t = (640 * N) + (72000 / N) + 819.2, where the last, constant term is the time needed to transfer the output volume back to the Storage Element. This equation has a minimum for N = sqrt(72000 / 640) = 10.6. Therefore, the optimal job number is 10 and the time for step 2 becomes 640 * 10 + (72000 / 10) + 819.2 = 14419 seconds, which is 4 hours and 19 seconds. In the case of the data being sent as one big tar file, the time to untar the file (16 minutes) has to be added. The following table summarizes the results:

1 big tar file       4 hours 16 min. 19 sec.
1600 files           4 hours 00 min. 19 sec.
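The small sketch below (our own illustration, not part of the WP11 deliverables) evaluates this cost model for a few job counts, using the numbers above; it reproduces the optimum of about 10 jobs and shows how quickly the input-transfer term dominates when too many jobs are used.

    import math

    # Cost model for step 2 of running PyHST on the Grid, using the numbers above.
    #   d: input data size [MB], c: output volume size [MB],
    #   T: SE<->WN transfer rate [MB/s], compute: single-job computation time [s].
    def step2_time(N, d=25600.0, c=32768.0, T=40.0, compute=20 * 3600.0):
        transfer_in = (d / T) * N     # every job must receive the full input set
        computation = compute / N     # the reconstruction parallelises linearly
        transfer_out = c / T          # the output slices add up to one volume
        return transfer_in + computation + transfer_out

    n_opt = math.sqrt((20 * 3600.0) / (25600.0 / 40.0))   # ~10.6, so use 10 jobs
    print("optimal number of jobs: about %.1f" % n_opt)
    for n in (1, 10, 100, 500):
        print("N = %3d: step 2 takes %.1f hours" % (n, step2_time(n) / 3600.0))

For 10 jobs this gives the 4 hours quoted above; for several hundred jobs the time explodes, because the full input set has to be shipped to every job.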

Time needed for Step 3

The time needed for this step is the time needed to transfer the resulting volume, which is 32768 MBytes (32 GBytes):

1 MByte/sec                10 MBytes/sec        40 MBytes/sec
9 hours 6 min. 8 sec.      54 min. 36 sec.      13 min. 39 sec.

Total time and conclusions

We are now able to compute an estimate for running PyHST on the Grid by summing the previous results:

                     1 MByte/sec                 10 MBytes/sec              40 MBytes/sec
1 big tar file       20 hours 42 min. 29 sec.    6 hours 6 min. 57 sec.     4 hours 54 min.
1600 files           21 hours 6 min. 27 sec.     6 hours 30 min. 55 sec.    5 hours 17 min. 58 sec.

If you choose to use 100 jobs instead of the optimum number of 10, you will get 20 hours 18 min. and 57 sec. using one big tar file with a 10 MBytes/sec bandwidth. Under the same conditions with 500 jobs, this time becomes 89 hours 9 min. and 23 sec. This time increases dramatically because all the input data has to be provided to all the jobs and, with EGEE as it is today, a job running on a Worker Node does not see the data of another job even if it is running on the same Worker Node. From this table we can conclude that:

- It is better to send one big tar file than 1600 different files (the Logical File Catalogue effect).
- The bandwidth between the User Interface and the Storage Element has a huge effect on the total time.
- For such an application, where all the jobs need all the data, the number of jobs must be chosen carefully.

The best result is 4 hours 54 min. This has to be compared with the 15 min that we get with the ESRF local cluster (running 80 jobs), which takes its input data from a file system shared between itself and the data producer (the beamline). At the ESRF, PyHST has also been ported to run on GPU (Graphics Processing Unit) hardware. Using the same set of input files, the time needed to do the computation using the GPU version of PyHST is 8 min. The following chart summarizes these results:

[Chart: PyHST computation time in seconds for the GPU version, the local cluster, and Grid runs with 10 jobs at 1, 10 and 40 MB/sec and 100 jobs at 10 MB/sec]

The process of porting

There are basically two cases when you want to run an application on a Grid infrastructure:

- the application is already parallelized and therefore well adapted to a possible Grid usage
- the application is not parallelized

By nature, all applications running on the Grid have to be parallelized. Therefore the first thing to do is to parallelize the application. Application parallelization is a complete subject on its own and will not be covered in this document. For applications which are already parallelized, porting to the Grid is a two-step process. First, you have to write a job description file. Then, you have to write a small script which will be executed on the worker node.

As its name says, the job description file is the file where the job is described. The main parameters described in this file are the name of the executable you want to run on the worker node, its arguments and, if required, the description of which files have to be transmitted using the job input or output sandboxes. These sandboxes are used to transfer small amounts of data, typically logging information or error reports. The main job data (input and output) are normally transferred using the Grid LFC. It is also in this file that you can define job-specific requirements like the number of retries in case of job failure, a specific computing element where the job has to be run, a specific worker node system architecture and many other parameters.

The second step, the script, describes what will really be executed on the worker node (very often, the name of this script is given in the job description file as the job executable name). The goals of this script are to retrieve the job input files from the LFC, to run the application with its necessary arguments (computed locally or given in the job description file) and to store the resulting data on the LFC, making them available to the user.
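To make these two steps concrete, the sketch below writes a minimal job description file and the wrapper it points to. It follows the structure described above but is purely illustrative: the JDL attribute names (Executable, InputSandbox, OutputSandbox, Requirements, RetryCount) are standard glite ones, while the file names, LFN paths, storage element and the exact lcg-cp/lcg-cr command lines are assumptions of ours, not extracts from the Grid_SPD code.

    # Illustrative sketch of the two porting artefacts: a JDL file and the wrapper
    # script executed on the worker node. Names, paths and LFNs are invented.
    JDL = '''\
    Executable     = "run_job.sh";
    Arguments      = "0 9";                       # e.g. first and last image index
    StdOutput      = "std.out";
    StdError       = "std.err";
    InputSandbox   = {"run_job.sh", "dark.edf", "flood.edf", "distortion.spline"};
    OutputSandbox  = {"std.out", "std.err"};      # small files only: logs and reports
    RetryCount     = 3;
    Requirements   = other.GlueCEUniqueID == "ce.example-site.eu:8443/cream-pbs-xray";
    '''

    WRAPPER = '''\
    #!/bin/sh
    # Runs on the worker node: fetch input from the SE via the LFC, process, store back.
    first=$1; last=$2
    for i in $(seq $first $last); do
        lcg-cp lfn:/grid/xray/raw/img_$i.edf file:$PWD/img_$i.edf
        ./correct img_$i.edf dark.edf flood.edf distortion.spline corr_$i.edf
        lcg-cr -d se.example-site.eu -l lfn:/grid/xray/corr/corr_$i.edf file:$PWD/corr_$i.edf
    done
    '''

    with open("job.jdl", "w") as f:
        f.write(JDL)
    with open("run_job.sh", "w") as f:
        f.write(WRAPPER)

The job description file would then be submitted with the usual glite command line tools (e.g. glite-wms-job-submit, or glite-ce-job-submit for direct CREAM submission).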

Software management

A wide variety of different programs is used for data reduction, analysis and modelling. Each experiment type has its own specialized programs for data processing. Some of them are simple executables without dependencies, others need a special environment (runtime libraries, Python modules, and/or several software packages) to run. The first category of programs, the standalone programs, can easily be referenced in the job description file and sent to the worker nodes, but the second one implies a global software installation on all CEs of all Grid sites that support the XRAY VO.

Before porting applications to the Grid, the important software packages should be installed on each CE of all Grid sites which support the XRAY VO. This is possible with the help of a dedicated software repository which is represented by VO_<name of the VO>_SW_DIR. This software area must be configured beforehand, which was done at DESY and the ESRF, but was missing at PSI. Further requirements:

- Only authorised users who authenticate with the software administrator role for the VO can install software.
- Software tags must be defined, which can be referenced in the JDL job submission files and ensure that the job is submitted to a CE which has the desired software installed.

Conclusions

The installation and maintenance of software in a common software area is possible, and a simple test installation of FDMNES was done successfully for DESY and the ESRF. The development and maintenance of the programs is done by software programmers or even by the scientists themselves, so a large community is installing and maintaining software for the data analysis of synchrotron experiments. Given the large number of different software packages and their dependencies, as well as the large number of software developers, it might become resource intensive and cumbersome to maintain software in a Grid environment.
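As an illustration of the software-tag mechanism (the tag name below is invented; the Glue attribute is the one normally published by the glite information providers), a job can be steered to sites whose software area contains a given package by building a requirement such as:

    # Hedged sketch: build a JDL Requirements expression that matches only those
    # Computing Elements advertising the (invented) software tag VO-xray-FDMNES.
    tag = "VO-xray-FDMNES"
    requirement = (
        'Member("%s", other.GlueHostApplicationSoftwareRunTimeEnvironment)' % tag
    )
    print("Requirements = %s;" % requirement)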

3.3. DATA TRANSFERS

One of the most important challenges for synchrotron jobs is to move data efficiently between Grid resource centres. Experiments carried out at the beamlines produce a huge amount of data which is then used as input for the data analysis work.

Throughput numbers, regular GridFTP transfer statistics

To measure the inherent Grid capabilities, regular transfers were performed between the partner sites. Every night, at off-peak hours, data files were transferred by cron tasks using different protocols: GridFTP, iperf tests, HTTP, etc. The results give us a good basis for comparison, as well as figures to measure the quality of service in terms of reliability and performance.

[Figures: HTTP/iperf transfer tests (single channel), GridFTP transfers (single channel) and GridFTP transfers (10-channel session), each with one curve for the outbound and one for the inbound connection, followed by a comparison chart between the protocols.]

Due to the inherent security mechanism employed by the GridFTP protocol (authentication and encryption for each file transfer), some overhead is introduced at the beginning of every new transfer. This becomes more critical when a job requires hundreds or even thousands of small files as input data. On the other hand, we have also confirmed that no relevant improvement is obtained even when GridFTP's inherent mechanism for striping transfers into multiple parallel data channels is employed: when the data source is unique (not distributed), the throughput rates are quite similar. See the applications in section 3.2 for practical examples.
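For reference, transfers of this kind can be reproduced with the standard Globus client; the following is a sketch only, with illustrative host names and paths rather than the actual test endpoints.

    # single-channel GridFTP transfer between two storage elements
    globus-url-copy gsiftp://se1.example.org/data/scan_0001.edf \
                    gsiftp://se2.example.org/data/scan_0001.edf

    # the same transfer striped over 10 parallel data channels (-p 10);
    # with a single, non-distributed source the observed gain was marginal
    globus-url-copy -p 10 gsiftp://se1.example.org/data/scan_0001.edf \
                    gsiftp://se2.example.org/data/scan_0001.edf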

3.4. SECURE REMOTE RESOURCE ACCESS AND USER MANAGEMENT

Since all Grid resources have to be deployed as part of a public infrastructure, we also have to provide proper mechanisms to guarantee the integrity of the services. Keeping in mind that the partner sites should be up and operational all the time, we have to supply tools to prevent abusive use, track all events and log all relevant and necessary information. This goal has been achieved by using two different and complementary mechanisms.

Perimeter Protection

This term covers all relevant aspects of how to secure the communication channels. Within this security framework, three different scenarios were adopted by the partner sites, from the simplest to the most hardened one:

- PSI: all resources were placed completely outside of the lab network. Protection was set up using built-in OS mechanisms such as iptables and Snort. Iptables provides solid performance, performs effective firewalling, and allows add-on functionality to enhance its reporting and response functions; Snort adds a complementary, free, lightweight network intrusion detection system to the Linux boxes.
- Soleil: all Grid resources were placed behind a corporate firewall, giving a centralized point for the security management policy and offering a strong platform of defence.
- ESRF: also working behind a corporate perimeter firewall (a Checkpoint cluster), the platform was further hardened with Quality of Service (QoS) appliances (Packeteer). These components guarantee that throughput is regulated by a third party, avoiding abusive use and ensuring that all services get the bandwidth they need to function at the desired level.

User Management

A European Virtual User Office

Federating users, in our case the scientists who use analytical facilities such as synchrotrons or neutron reactors, is a subject which has been under discussion for years. The potential benefits of a unique EU-wide system are enormous. Scientists often use more than one facility to carry out their research project, yet every facility currently manages the user information and the account creation separately. The maintenance of this information, and in particular the affiliation data of the scientists, is a daily, time-consuming activity in all labs. It is estimated that there are more than scientists using European photon and neutron facilities, coming from almost different institutions. A central repository of this information would allow for efficient update mechanisms and checking for double entries.

Once the user information is federated, account creation at the facilities could be derived from this information within the workflow of the peer-reviewed allocation process.

The same account information could be used to combine research done at two or more analytical facilities, e.g. for launching a data analysis job on data sets stored in several laboratories. A federated system would also foster a community identity, something which is currently difficult to achieve considering the large variety of origins of our user community.

Initially, the federated user database would simply act as a front-end to the individual User Office systems of the facilities. Gradually, new functionality could be envisaged, such as the parallel submission of beamtime requests to several facilities, or combining the peer review process between facilities. Ultimately this could lead to a real European Virtual User Office for a given class of facilities. A central repository of user information would also make it possible, with the agreement of the scientists, to foster information exchange about facility updates, workshops, special events, etc.

Three ESFRI roadmap projects are currently investigating and discussing how a federated user database could be adopted and interfaced to their respective User Office systems: the ILL 20/20, ESRFUP, and EuroFEL. Different authentication methods were considered, and a prototype setup is going to be implemented. The WP11 Grid project has allowed the testing of user authentication based on Grid certificates, the ESRFUP WP7 common entry point to the ILL and the ESRF is based on Yale CAS, and the EuroFEL WP2 will soon put in place an authentication system based on Shibboleth.

- Authentication with Grid certificates

The clear advantage of using Grid or X509 certificates is the much improved security that they provide to a user and to the institutions managing users, compared with the now common username/password. X509 certificates are based on so-called asymmetric cryptography algorithms in which every user gets two keys. One key remains private and the owner has to make sure this key does not get compromised. The other key is public and should be made available to all participants. No exchange of secrets is necessary for encryption/decryption or for authentication (signature). The other elements necessary to form a public key infrastructure (PKI) are the Certificate Authorities (CAs), which are trusted entities and guarantee the identity of the user as specified in the certificates they issue. Trust in a CA usually comes through an agreed set of policies, etc. and is controlled by a Trust Federation which accredits CAs.

The immediate advantage is that the EGEE project has already set up this infrastructure. National certification authorities, covering all European countries and beyond, have been created and accredited, which is quite a lengthy and tedious endeavour. All a user needs to do is identify the appropriate CA and provide his or her name, affiliation, and address; the identity is usually verified by showing a passport.
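Once a certificate has been issued, day-to-day use on the EGEE Grid goes through short-lived proxy credentials derived from it. The commands below are a sketch of the typical workflow; the file locations shown are the conventional defaults and are assumptions rather than project-specific settings.

    # inspect the issued certificate: owner, issuing CA and validity period
    openssl x509 -in ~/.globus/usercert.pem -noout -subject -issuer -dates

    # create a short-lived proxy credential from the certificate and private key;
    # Grid jobs and data transfers use the proxy, never the long-term key itself
    grid-proxy-init -valid 12:00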

The EUGridPMA [12] itself does not issue certificates; it coordinates the national and regional authorities that do the actual certificate issuing to end entities. For a new community to be integrated, one has to make sure that the users' home institutes are registered with the PKI and have people willing to act as the local registration authority (essentially checking people's passports). If a new community has users on the order of thousands or more - as is the case for the scientists working at or visiting synchrotron installations - one needs to make sure the national certificate authorities can handle the requests and support the users. If a community resists these last steps, it can decide to run its own certificate authority, thus keeping the policies under its own control and making sure that the delivery of certificates is timely and the user support adequate. Although this looked manageable to the author of this paragraph, the actual user community and the people responsible for managing user accounts and access were very resistant to Grid certificates.

[12] See the EUGridPMA Membership at

The concept looked very complex and the handling was too awkward to be considered an acceptable and feasible solution. A single, harmless security message from a browser (like the one in the screenshot above) was enough to scare people away from familiarising themselves with certificates.

In the EGEE context, authorisation to access information on web applications is handled quite successfully. The GOCDB web portal (see picture below), hosted in the UK, and the CIC portal, hosted in Lyon, are good examples of this. The developers confirmed the simplicity with which access to a page can be handled directly at the Apache level (SSL). This can easily be extended to handle roles, by storing certificate identification strings in a small database and basing permissions on the roles in the database and an associated scope of access.
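A hedged sketch of this Apache-level handling is given below; the paths, location and distinguished name are illustrative assumptions and do not reproduce the actual GOCDB or CIC configuration, where the DN-to-role mapping lives in a small database rather than in the configuration file.

    # require a client certificate issued by one of the trusted (EUGridPMA) CAs
    SSLVerifyClient      require
    SSLVerifyDepth       5
    SSLCACertificatePath /etc/grid-security/certificates

    <Location /admin>
        # grant access only to certificate subjects (DNs) holding the admin role
        SSLRequire %{SSL_CLIENT_S_DN} eq "/DC=org/DC=example/CN=Jane Operator"
    </Location>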

The authorization part can also be handled in a central fashion via a Virtual Organization Membership Service (VOMS), like the one we set up for the project.
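In practice the VOMS server extends the plain proxy described earlier with VO membership and role attributes, on which services can base their authorisation decisions. A minimal sketch for the XRAY VO follows; the role name is illustrative.

    # proxy carrying plain membership of the xray VO
    voms-proxy-init --voms xray

    # proxy additionally carrying a role, e.g. the software administrator role
    voms-proxy-init --voms xray:/xray/Role=swadmin

    # display the VO and role attributes embedded in the proxy
    voms-proxy-info --all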

