
ESRFUP - WP11
FINAL REPORT ON GRID EVALUATION PROJECT

EU Milestone M11.1
Document identifier: ESRFUP_M11.1.doc
Date: 26/02/10
Authors: F. Calvelo-Vazquez, A. Gotz, R. Dimper, E. Taurel, C. Koerdt, G. Foerstner

Abstract: This document contains the information which supports milestone 11.1, "Final report on operational experiences with the international test bed installation including future orientations for photon science Grid activities".

TABLE OF CONTENTS

1. EXECUTIVE SUMMARY AND RECOMMENDATIONS
2. OUR INCENTIVES FOR THE USE OF GRID TECHNOLOGY
   2.1. BACKGROUND
   2.2. PROJECT DESCRIPTION
   2.3. TERMINOLOGY
   2.4. IMMEDIATE AND LONG TERM NEEDS OF THE SYNCHROTRON USER FACILITIES
   2.5. THE EXISTING EUROPEAN GRID INFRASTRUCTURE, THE TECHNOLOGY IN USE AND SUSPECTED OVERLAP
        The European Grid infrastructure project EGEE
3. OUR EXPERIENCE WITH THE EGEE GRID
   3.1. THE INSTALLATION (AND THE OBSTACLES)
        Virtualization
        Networking
        Deployed components
        Monitoring tools
        MPI
   3.2. APPLICATIONS
        Different classes, long/short run, data intensive
        SPD - a class 1 type application
        Gasbor - a class 4 type application
        PyHST - a class 2 type application
        The process of porting
        Software management
   3.3. DATA TRANSFERS
        Throughput numbers, regular GridFTP transfer statistics
   3.4. SECURE REMOTE RESOURCE ACCESS AND USER MANAGEMENT
        Perimeter Protection
        User Management
   3.5. BUILDING A VIRTUAL PHOTON USER COMMUNITY
        The VO concept
        Experience gathered with collaborating partners
4. OUR EVALUATION
   4.1. MATCH OR MISMATCH OF THE TECHNOLOGY WITH NEEDS
   4.2. THE IDEAL SYNCHROTRON GRID
5. CONCLUSION

1. EXECUTIVE SUMMARY AND RECOMMENDATIONS

Grid computing has been actively developed and financed for more than 10 years now. It has been proposed as the next generation World Wide Web (WWW) for high-performance computing and has been used successfully in high energy physics. So far it has not found many uses in photon science. It is an obvious candidate for addressing the deluge of data produced in photon science at synchrotrons like the ESRF around the world. Because the data analysis needs of photon science are not the same as those of high energy physics, it is necessary to evaluate Grid technology specifically for photon science applications and communities. The ESRF Upgrade Programme offers a unique opportunity to evaluate the Grid for photon science at a photon source, i.e. a synchrotron. A work package (WP11) was set up as part of the European Union FP7 financed programme for the ESRF Upgrade (ESRFUP) to study Grid computing for photon science. This is the final report of this work package.

The study was restricted to EGEE Grid computing and did not cover other Grid computing technologies. In order to simulate a real Grid setup, hardware was procured and installed at 3 partner sites with the EGEE glite software. The test Grid enabled experience to be gained on setting up and managing a Grid site, both locally and at the partner sites.

This study has shown that overall the EGEE Grid is not suited to the case of photon science. The case studies demonstrated that, except for a small number of applications (embarrassingly parallel programs which are CPU-intensive and require little input or output data), the majority of photon science applications are not suited for highly distributed Grid computing like EGEE. The data intensiveness of photon science applications does not scale to such Grids: public networks are too slow for transporting large volumes of data to and from storage elements. Most simulations used in photon science which are CPU-intensive and require little data are not embarrassingly parallel. They require fast connections between the compute nodes, usually based on a protocol called MPI, and the EGEE Grid provides little support for installing MPI. The photon science communities are organised in a very heterogeneous manner and are much smaller than high energy physics communities. This makes it much more difficult to apply Grid solutions for managing communities.

Due to the high cost in terms of human resources of managing EGEE Grid computing sites, and the low added value for photon science, this report does not recommend EGEE-like Grid computing for photon science. We think the ESRF has much more to gain from investing in high performance computing centres locally and in speeding up data reduction and analysis by porting programs to the new generation of GPUs and multi-core processors. Remote access to HPC centres should be provided to users using well-known, simple-to-manage solutions like secure shell.

2. OUR INCENTIVES FOR THE USE OF GRID TECHNOLOGY

2.1. BACKGROUND

Grid computing has been around since the end of the 90s and has quickly developed and gained in complexity. At the ESRF, the first attempts to use Grid software date back to the beginning of 2005, when the Condor batch submission software was successfully installed. At that time it quickly became apparent that, while the Grid might have a huge potential to address emerging needs of our user community, an in-depth analysis was needed to understand and measure the technical and organisational implications.

The ESRF Upgrade Programme triggered an in-depth reflection on how to deal with the expected flood of data from high resolution detectors with unprecedented frame rates. Some of the challenges are technical, others are organisational; all of them deal with latency, i.e. the speed at which data can be transferred, stored, processed and visualised. The Grid was mentioned in the ESRF Science and Technology Programme as one of the promising technologies to deal with issues like computational resources, storage resources, code repositories, data catalogues, data security and data access. The ESRF Upgrade Programme being part of the ESFRI roadmap, preparatory money was made available for a number of activities, including a Grid feasibility study, the so-called ESRFUP WP11. WP11 gave us the opportunity to acquire hands-on experience with Grid software, with the overarching aim of being able to decide at the end of the project whether Grid software could really fulfil our expectations and solve one or several of our emerging needs deriving from the data avalanche.

2.2. PROJECT DESCRIPTION

The objective of WP11 is to study the feasibility of participating in the Enabling Grids for E-sciencE (EGEE) Grid initiative as a complementary data management tool (storage and analysis) for data intensive Synchrotron Radiation research. Grid software could help in handling the enormous data output of certain experiments the upgraded ESRF will allow for. The implementation of Grid software and the porting of scientific software to such an environment is an integral part of the ESRF Upgrade Programme. Scientists coming to the ESRF for data intensive experiments do not necessarily have the means in their home laboratory to store and analyse the vast amount of data they will collect at the ESRF or in other photon laboratories. Grid software may provide a cost effective solution to this problem. WP11 aimed at gaining hands-on experience with the glite software, organising a workshop with international experts and key photon science users, configuring a virtual organisation for synchrotron radiation, and training interested scientists in the usage of Grid tools. WP11 also included the procurement, installation and operation of a test bed between the ESRF and three other partner laboratories in Europe. This installation was intended to test one or two resource intensive applications, e.g. tomography volume reconstruction, to demonstrate data replication mechanisms, to test credential management, and to measure performance.

2.3. TERMINOLOGY

What do we call Grid in this report? Grid computing, or simply Grid, means many things to many people: from computers connected to a network, to a highly specific combination of software and

hardware to access a dedicated set of computers with a well defined set of services. The Wikipedia entry for Grid computing provides a good overview of the various meanings of Grid. In this project we have stuck to the general definition of Grid in Ian Foster's article "What is the Grid? A Three Point Checklist" [1]. In this article Ian Foster lists these primary attributes:

- Computing resources are not administered centrally.
- Open standards are used.
- Non-trivial quality of service is achieved.

We achieved these three attributes by installing computing resources at 4 sites managed by 4 different institutes connected via the Internet, i.e. dedicated network connections were not used. We used the open source glite [2] software to provide a non-trivial quality of service, e.g. computing and storage resources available 24 hours a day, 7 days a week. What we do not mean by Grid in this project is the sharing of centralised computer infrastructure as in TeraGrid, or the sharing of volunteered desktop resources as in the SETI@Home project, which uses the Berkeley Open Infrastructure for Network Computing (BOINC [3]) software. Similarly, the Cloud as a Grid-like technology has not been evaluated.

2.4. IMMEDIATE AND LONG TERM NEEDS OF THE SYNCHROTRON USER FACILITIES

The driving force behind the computing infrastructure at the ESRF is the rapidly increasing flow of data generated by the 31 public and 11 non-public beamlines. The data rate has been multiplied by 300 over the last ten years, and considering the ambitious developments of new beamlines with better detectors, increased automation, and optimised operation schedules, we expect this trend to continue or maybe even amplify. In 2009 the ESRF generated some 400 TB of data in more than files. In addition to the needs deriving directly from the data avalanche, additional requirements are emerging in the European science arena for managing and preserving scientific data.

The initial wish list of functionalities which motivated our desire to investigate whether Grid technology can provide solutions was:

Computational resources: Compute clusters made from commodity computers, GPUs, or blade systems are today's best solution for high-performance batch computing. We have started to concentrate our onsite compute power in several interconnected clusters connected to our storage facility and managed by the batch submission software Condor. This system has proven to be reliable and offers high performance. It will be upgraded to add functions such as checkpoint/restart and a better separation between interactive and batch processing. By adding a Grid software layer that uses Condor as a local workload manager, authenticated and authorised access to this resource could be given to local (Intranet) as well as remote (Internet) users, who could then share or aggregate the computational resources with other laboratories.

Storage resources: New ways of interacting with the very large data sets generated by the new science coming from the Upgrade Programme will need to be found. Grid software may leverage data analysis and curation of these large data sets. It could open the possibility to give secure access from remote sites to these data, to analyse them with compute resources made available at the ESRF, and to transfer raw or analysed data reliably to Grid resources at the users' home institutions.

Network resources and virtual organisation membership services: A high-performance and reliable networking infrastructure is fundamental for leveraging optimal usage of computing resources. Grid services for reliable data transfers, data replication and data curation, as well as data treatment, could give ESRF users the possibility to continue their scientific work from their home institutions in a secure and controlled manner.

Virtual organisations: VOs grouping scientists working in the same field could be configured to give secure access to compute and storage resources, which are either centralised at the ESRF or shared between participating parties inside the same virtual organisation. Grid certificates could be used to identify scientists across Europe and manage access rights to services and resources.

Code repositories: Scientific software needs continuous upgrading and maintenance. Code repositories in a Grid environment could improve the use of, and reduce the maintenance efforts for, such software. They could also give the possibility to establish Computing On Demand (COD) in VOs, reducing the need for time consuming data transfers.

Catalogues: In some scientific areas, databases of metadata and publications (catalogues) linked to measurements and their data made at the ESRF or other synchrotron radiation facilities would allow queries or browsing of scientific results for judging, comparing, or complementing new data sets. The basis for such catalogues is a common data format containing not only the raw data but also the curated data and metadata attached to a data set. Some data sets from a single experiment will be geographically dispersed and/or originate from cross disciplinary research. Merging such data sets requires transparent and easy access to distributed repositories. Data archiving and curation is increasingly important and indispensable to preserve data from unique samples. Tools are required to allow for long-term storage, browsing, and visualisation.

Grid security infrastructure: The connectivity layer in Grid software defines the core communication and authentication protocols required for the Grid Security Infrastructure (GSI). Authentication and authorisation protocols build on communication services to provide cryptographically secure mechanisms for verifying the identity of users and resources. The features available today make it possible to define VOs that span geographically distributed organisations with different administrations. The same functionality could be used to provide secure services to individual users or user groups.

2.5. THE EXISTING EUROPEAN GRID INFRASTRUCTURE, THE TECHNOLOGY IN USE AND SUSPECTED OVERLAP

The European Grid infrastructure project EGEE

As explained on their website, the Enabling Grids for E-sciencE project is the largest Grid computing project in Europe. The various project phases, from EGEE-I to currently EGEE-III, have covered six years with funding of roughly 20 million Euros for each of the phases. EGEE has

recently been replaced by the European Grid Initiative (EGI) project; refer to their website for the latest information. The EGEE Grid has a large number of participating Virtual Organizations and ranges from disciplines like High Energy Physics, its main client, to Life Science and Mathematics. The use of the infrastructure has increased steadily during the last couple of years, although disciplines other than High Energy Physics have played a secondary role only.

3. OUR EXPERIENCE WITH THE EGEE GRID

The path to achieving our goal of testing the EGEE Grid was far from straightforward. Some detours were made at the beginning before arriving at the final configuration. Along the way we learned the possibilities and limitations of the glite services and how best to manage them. On this course, the experience and feedback exchanged with our partners was of vital importance. We took important decisions for the final deployed testbed based on long discussions with EGEE experts and VO site administrators. Finally, with their help, we managed to deploy all the services that we needed for our tests.

3.1. THE INSTALLATION (AND THE OBSTACLES)

Virtualization

Selected Platform

Due to the large number of required middleware services and the limited number of servers in our test bed installation, it was decided to use virtualization techniques. This was expected to make better use of the available resources and improve the flexibility of the setup in case of necessary design changes after the first tests. We also expected to be able to easily clone the system for the setup of our partner sites.

In terms of performance, this design was not the best platform for a future production system. Nevertheless, the choice of virtualization technology favoured a system with limited impact on overall performance. Initially, two virtualization platforms were evaluated:

- VMware
- Citrix XenServer

The following points were considered an important advantage and at that time justified the final decision in favour of Citrix XenServer:

- Xen hypervisor structure
- Paravirtualization
- Virtual machine on-line migration
- Number of managed resources per VM
- Price

Some of these features, like the Xen hypervisor in conjunction with paravirtualization [4], can achieve high levels of performance even on the x86 host architecture, which is notoriously uncooperative with traditional virtualization techniques [5]. Others, like XenMotion (live migration of Virtual Machines between physical hosts), were included in the standard Citrix XenServer product (but were an add-on for VMware). Also, some characteristics like competitive prices, the number of managed resources per VM, support for hot-swap CPUs in Linux VMs, and support for Intel-VT and AMD-V hardware, were better covered by Citrix XenServer at the time of the study (beginning of 2008).

A first test XenServer setup was installed in August. The final Citrix XenServer platform (Version 4.0), placed in the DMZ, was set up early September. Our virtualized platform was composed initially of three (and later of six) Sun Fire x4150 servers, forming a so-called XenPool. Almost a dozen middleware servers have been deployed as Virtual Machines (VMs) within the ESRF's XenPool. The VMs run Scientific Linux 4.7. Both the i386 and the x86_64 architecture were required by the glite middleware.

Platform limitations

Of course, the use of a shared platform gave us an inexpensive environment where we could test all glite services in a flexible and easily manageable virtualized setting. However, these kinds of platforms, while quite suitable for test beds, also have their limitations. Within a virtualized setup we can profit from the possibility of sharing most resources, like CPU, network interfaces, and available storage. However, RAM cannot be overbooked, and throughput bandwidth can be seriously affected when the number of virtualized services grows.

[4] Paravirtualization is a technique that presents a layer between the virtual machine and the actual underlying hardware and is therefore able to limit any performance degradation due to virtualization.

Unfortunately, on our setup the XenMotion capabilities subsequently had to be abandoned, after we found that VMs deployed on shared storage via NFS severely degraded the performance of any I/O operation. All virtual disks of the guests (the VMs) had to be moved locally and distributed over the available XenServer hosts.

Networking

Working behind a NAT

The first network scenario was strongly limited by the number of public IP addresses available on the DMZ. Because of this, all servers were installed on a private DMZ segment, with NAT translation for all the services. Unfortunately, after considerable time had been invested to deploy this setup, the services never worked correctly behind the NAT box. Even with complex scenarios and some imaginative solutions, some middleware components (e.g. GridFTP on dCache) did not work correctly.

Figure 1: The initial network setup turned out to be problematic

Second scenario (native public subnet)

Finally a decision was made to reserve a new class C IP address range and to redeploy all glite middleware components in their own external DMZ. The network team then started the lengthy negotiation procedure with RENATER for the new segment. It took some time to incorporate this new network in the existing infrastructure, and even more delay was added to migrate the middleware services or redeploy them.

Figure 2: Final network setup at XRAY resource centres

However, the experience gained was useful afterwards for deploying the equivalent infrastructure at the partner sites PSI and Soleil.

The common backbone

The GÉANT2 [6] project is a collaboration between 34 participants: 32 European national research and education networks (NRENs), plus DANTE (managing partner, project co-ordinator and responsible for operations) and TERENA (which supports the coordinated development of the national NRENs). The GÉANT2 network provides connectivity and services to more than 30 million researchers at 8000 institutions in 34 different European countries, and links to a number of other world regions. It was established in 1993 and has since played a pivotal role in five consecutive generations of pan-European research network: EuropaNET, TEN-34, TEN-155, GÉANT and now GÉANT2. All our Grid partners are linked to this pan-European research network. Metropolitan LANs give them access to their local NRENs:

- DFN [7] for DESY
- SWITCH [8] for PSI
- RENATER [9] for Soleil and the ESRF

The main advantages of using this kind of high-speed public network to interconnect our resource centres are:

- The backbone is operated at data transfer speeds of up to 10 Gbps across 50,000 km of network infrastructure, of which 12,000 km is based on GÉANT2 fibre.
- It comprises 25 Points of Presence (PoPs), 44 routes and 18 dark fibre routes.
- Multiple 10 Gbps wavelengths are employed in the network's core.

The cost of fine tuning

Tuning and optimization, in order to get the best possible network performance, covers a very large range of different aspects. To achieve good end-to-end throughput, every network hop on the path has to be analyzed in depth. Switches, routers and even host kernel parameters have to be optimised. Some useful network tools have greatly helped to accomplish these tedious tasks:

- Tools for measuring the bandwidth: iperf
- Sniffers to examine the traffic: tcpdump, wireshark
- Monitoring tools: ntop, MRTG graphs

In our specific network scenario, the end switches have been enabled to use jumbo frames. This feature is also supported on the RENATER network. It means that we could employ Ethernet frames with more than 1,500 bytes of payload for end-to-end communications between GÉANT resource centres.

The most common network protocol used in glite services is the Transmission Control Protocol, or TCP. TCP uses a "congestion window" to determine how many packets it can send at one time. The larger the congestion window size, the higher the throughput. The TCP "slow start" and "congestion avoidance" algorithms determine the size of the congestion window. The maximum congestion window is related to the amount of buffer space that the kernel allocates for each socket. For each socket there is a default value for the buffer size, which programs can change by using a system library call just before opening the socket. For some operating systems there is also a kernel-enforced maximum buffer size. You can adjust the buffer size for both the sending and receiving ends of the socket.

To achieve maximum throughput, it is critical to use optimal TCP socket buffer sizes for the link you are using. If the buffers are too small, the TCP congestion window will never open up fully, so the sender will be throttled. If the buffers are too large, the sender can overrun the receiver, which will cause the receiver to drop packets and the TCP congestion window to shut down. Assuming there is no network congestion or packet loss, network throughput is directly related to the TCP buffer size and the network latency. There are two TCP settings to consider: the default TCP buffer size and the maximum TCP buffer size. A user-level program can modify the default buffer size, but the maximum buffer size must be modified at kernel level.
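As a concrete illustration (not taken from our test scripts), the small sketch below shows how a user-level program can request larger socket buffers before opening a connection, and which Linux kernel parameters cap that request; the 4 MB value and the endpoint name are arbitrary examples chosen for a high-latency gigabit path.

    import socket

    # Hypothetical example: request 4 MB send/receive buffers before connecting.
    # The kernel silently caps these at net.core.wmem_max / net.core.rmem_max,
    # which an administrator can raise with sysctl, e.g.:
    #   sysctl -w net.core.rmem_max=4194304
    #   sysctl -w net.core.wmem_max=4194304
    BUF = 4 * 1024 * 1024

    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    s.setsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF, BUF)   # must be done ...
    s.setsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF, BUF)   # ... before connect()
    s.connect(("data.example-partner.eu", 2811))             # invented endpoint

    # getsockopt shows what the kernel actually granted (Linux reports twice the request)
    print("effective send buffer:", s.getsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF))
    s.close()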

Some specific work along these lines was carried out at an early stage. Slow traffic rates between PSI and the ESRF were detected during the initial transfer tests. By reviewing every single step in the whole path, the problem was solved by tuning all the elements in the chain. Kernel parameters were also adapted on both sides according to the latencies and round-trip delays on the intermediate networks. The experience gained with these time-consuming tasks was reused to install the Grid hardware at SOLEIL.

Deployed components

We set up the different middleware components on top of two virtual machine templates running Scientific Linux 4.7, one for the i386 and another for the x86_64 architecture. The software packages were installed with the help of the yum tool. The configuration was done using YAIM.

Site Services

a. glite-UI

Installation of the glite-UI was rather straightforward. It was however disappointing that the glite-UI was not available for the x86_64 architecture. People who need to compile their programs then find a different architecture on the UI from the one on the Worker Nodes. We also installed a UI tarball accessible to the ESRF users, which was mounted via NFS on the ESRF's internal NICE cluster nodes.

b. Cream CE

We decided to start with the Cream CE. Some of us had learned how to set it up in the administration course during the GridKa school in September. In September 2008 it had been in PPS for a few months already and was expected to be released in production soon afterwards (as it was supposed to replace the older lcg-CE as soon as possible). Unfortunately, this release date turned out to be far too optimistic.

c. Lcg-CE

Due to the delay of the Cream CE production release we also had to install the older lcg-CE in order to get the official certification from the French ROC. Another reason was the necessity of running MPI jobs, for which the Cream CE was not (yet) tested; we could also compare the performance of the two implementations with respect to our needs. After the network topology was changed (described in the previous chapter) the installed lcg-CE became operational.

d. Site-BDII

The information on our site resources was collected and published by the site-BDII. The component, although released in the 32-bit version, was successfully installed on the x86_64 platform.

e. Monitoring Box

The mon-box was installed in the weeks before certification and subsequently hosted the nagios and ganglia monitoring tools.

VO Services

f. VOMS server

The VOMS server was operational in September 2008 and could be used from sites like DESY to run jobs under the XRAY virtual organization. We managed to run the VOMS server on the x86_64 platform, although the release was officially certified for the i386. The VOMS server suffered from a lot of bugs, and only a couple of them were fixable with workarounds. The situation improved once the newer version was released in November.

g. glite-WMS

A first installation attempt within the older, problematic network topology had been unsuccessful. After the move to the new network structure we repeated the attempt and everything went rather smoothly. It was a little unfortunate, however, that at the time of our site certification the Cream CE was not able to receive jobs via the WMS. The WMS release which supported the submission of jobs to the Cream CE was introduced approximately two months later, in March.

Worker Nodes

All our 14 compute nodes were installed with the glite-WN-torque module on the x86_64 architecture (SL4.7). Altogether 80 CPU cores were made available (6 nodes with two quad-core CPUs plus 8 nodes with two dual-core CPUs). A shared software area between the ESRF's User Interface and all Worker Nodes was added, including later versions of Python (2.5.2, 2.6.1), as well as MatLab Runtime 7.6, g95 and others. With the installation of the MPI packages in February 2009 we also moved to a shared /home directory.

Setting up the Storage Elements

a. dCache test

In September 2008 the situation was such that dCache had just produced a major new release. Moreover, it had changed the underlying technology for handling the name space (moving from pnfs to Chimera). Since we wanted to avoid installing an old system that we would have to change soon after, we decided to go for the latest version. Unfortunately, the documentation of this new release was only partial and the release was not yet in a very stable state. So after several unsuccessful attempts to run version 1.9, we were advised by the dCache support people to install version 1.8 in combination with the Chimera namespace in order to limit future migration problems. The installation was internally operational in October. However, we had to wait for the move to the new network scenario to be able to use the storage element from outside the ESRF. Even after the installation, quite a few problems had to be solved. Due to the lack of good documentation and the sheer complexity of the dCache technology, a lot of time had to be invested to get and keep the service running.

b. DPM

Because of the initial problems in making the dCache installation work, we decided to give DPM a try. We would then have another technology for comparison.

Indeed, the installation of DPM went rather smoothly and fewer problems had to be solved while operating this storage element. Nevertheless, when it comes to transferring very large files (in excess of the 4 GB recommended by high energy physics user groups) one also runs into problems and has to invest some time in finding the right configuration. In this respect, one has to mention that the support for DPM is quite well organised and the developers were responsive to our problems.

Monitoring tools

The whole XRAY VO infrastructure is monitored all the time, and a full set of complementary tools exchange information continuously, sending traps in case of any problem with the VO. We distinguish two different groups of tools: those which are part of the standard EGEE surveillance programme, and traditional/local monitoring systems.

EGEE Surveillance System

The EGEE SA1 group, as part of its activity, provides a set of tools that can be used to monitor and test registered EGEE partner sites. Examples are the CIC Portal, the SAM tests, the GOCDB3 centre and the EGEE security mailing list. One requirement for access to most of these services is to register and get your resource centre certified. After your site is certified, you start to get GGUS trouble tickets automatically, you are allowed to check your site status on GSTAT, and you can even join the security mailing lists.

[Screenshot from the GSTAT portal]

This layer provides a good level of monitoring from the point of view of the glite services, but it is not enough if the failure itself is at the operating system level or even lower, at the server level. Therefore we had to use complementary local tools.

Traditional monitoring tools

On the other hand, a complete set of traditional monitoring tools has been deployed locally at each partner site to monitor the site infrastructure hardware and the operating system layers. A full-mesh NAGIOS system monitors all basic resource parameters by checking the most important server and OS features. Moreover, Ganglia daemons have been deployed at each partner site to track resource utilization trends.

The MRTG graphs and the ntop tool have also been added to provide useful information about throughput utilization and protocol distributions. Based on the output of these tools, we extracted the relevant information to customize the QoS system (Packeteer) according to our needs.

[Ntop snapshot]

Specific monitoring tools

Another set of complementary tools monitored the glite services in more depth. This is the case of the WLCG Nagios component deployed on the MonBox server at each partner site, giving us a more detailed status of every single deployed facility on the resource centre.

[WLCG-Nagios screenshot]

MPI

What is MPI?

According to Wikipedia: MPI is a language-independent communication protocol used to program parallel computers. Both point-to-point and collective communication are supported. MPI "is a message-passing application programmer interface, together with protocol and semantic specifications for how its features must behave in any implementation." MPI's goals are high performance, scalability, and portability. MPI remains the dominant model used in high-performance computing today [10].

The last sentence is particularly relevant. MPI is the de-facto standard for parallel programming in the scientific world. Many simulation programs used in synchrotron science have been ported to MPI to enable them to run in parallel on compute clusters. For this reason, if we want to attract simulation programs to the Grid, it is essential to install MPI on the Grid.
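To make the programming model concrete, here is a minimal sketch using the mpi4py Python binding (chosen purely for illustration; it is not one of the applications discussed in this report). Each rank computes a partial result and a collective reduction combines them on rank 0; it is this kind of tightly coupled communication that requires a properly installed MPI runtime and a fast interconnect between the worker nodes.

    # Minimal MPI sketch using the mpi4py binding; illustrative only.
    # Run with e.g.: mpirun -np 4 python mpi_sum.py
    from mpi4py import MPI

    comm = MPI.COMM_WORLD
    rank = comm.Get_rank()          # index of this process
    size = comm.Get_size()          # total number of processes in the job

    # Each rank works on its own slice of the problem ...
    partial = sum(range(rank, 1000000, size))

    # ... and a collective reduction combines the partial results on rank 0.
    total = comm.reduce(partial, op=MPI.SUM, root=0)

    if rank == 0:
        print("sum computed by %d ranks: %d" % (size, total))

Getting such a job to run on EGEE means that the MPI runtime, the shared home directory (or key-based worker node access) and the batch system integration all have to be in place on every worker node, which is exactly where the difficulties described below arose.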

Unfortunately this is easier said than done, for a number of reasons:

- MPI is supposed to be supported with glite, but it actually isn't.
- Releases are very, very few, and new glite releases were not tested in an MPI environment.
- MPI is only released in combination with Torque (no Condor or others).
- The MPI version released uses an old version of Torque, making the installation tedious to maintain and to upgrade.
- It was necessary (or recommended) to change from key-based WN access to shared home directories.
- A YAIM installation procedure is proposed, but does (did) not work.
- The documentation was not adequate, and the different parts available sometimes contradicted each other.
- At the time of installation there were no functioning SAM tests.
- There were no easy tests available to check the installation for oneself:
  o the tests available were not sufficient, sometimes only showing that it did not work;
  o there were no recipes to systematically check the installation.
- It is unclear how to handle MPI jobs on multi-core machines.
- It is unclear how to distribute jobs over different sites.
- It is unclear how to recollect jobs efficiently.

For the above reasons, and after trying for some time, we eventually gave up on getting MPI running on glite CREAM. This had the consequence that a potentially large number of simulation applications, which only require compute power and MPI, were not attracted to the Grid. Although this was an unsatisfactory result, we need to add that even if we had managed to make MPI work on the Grid, it would not be as attractive as running MPI on a local cluster, for the following reasons:

- In our case the worker nodes were inter-connected with Ethernet, whereas a local cluster can offer a higher performance interconnect like Infiniband.
- It is easier to manage jobs on a local cluster than on the Grid.
- Grid jobs seem to be less reliable than local batch jobs.
- Users find local cluster computing easier to learn and manage than the Grid.

In the future we urge the glite team to support MPI out-of-the-box if they want to make the Grid attractive for these types of jobs.

3.2. APPLICATIONS

Different classes, long/short run, data intensive

The suitability of the Grid as a solution for any field depends entirely on the type of applications and how often they are executed. This is true for photon science too. By studying the typical jobs and the frequency with which they occur, we have established classes of applications for synchrotron science. One application from three of these classes has been used in our case studies. There is a large disparity between the different photon science experiments. Some of them run all their data reduction and analysis on a single machine or laptop. Others need huge resources of the local cluster to run, thereby monopolising it for a single application.

Applications for synchrotron science can be divided roughly into the following classes:

- Class 1: data intensive short jobs - typically data reduction type jobs to correct or calibrate images, e.g. the case study of SPD below.
- Class 2: data intensive long jobs - typically data analysis jobs on many images to reconstruct a 3D volume, e.g. the case study of PyHST below.
- Class 3: CPU intensive parallel jobs - typically modelling or simulation type applications requiring MPI, e.g. FDMNES, MOLDY, etc.
- Class 4: CPU intensive independent jobs - typically modelling or simulation type applications, e.g. the case study of Gasbor below.
- Class 5: CPU intensive single jobs - typically data analysis jobs for fitting a model to measured data, e.g. GSAS.

It is hard to give exact figures for what percentage each class of applications represents of the total number of photon science applications, because many of the jobs are run on hosts which are not monitored, e.g. on desktops, laptops or hosts that are part of the experiment. The distribution of application classes depends on the experimental technique used. The ESRF is a multi-disciplinary facility and many different types of techniques are used. Some produce only small amounts of data but make heavy use of simulation, e.g. spectroscopy, while others produce huge amounts and sometimes use simulations, e.g. tomography. Most imaging based techniques produce large numbers of images. Images constantly increase in size due to the increasing number of pixels, and the number of images produced also constantly increases. From our study we found that data intensive jobs (classes 1 and 2) are the most common and pose the biggest challenge. However, simulation jobs are always required, and as models increase in size their needs will increase. We found that the class of applications which is best suited to the Grid, i.e. class 4, is in fact the least common. Class 5 applications represent another very common class of applications but were not studied on the Grid because users run them on their local machines and they are not resource bound.

In the following sections we discuss case studies on 3 different applications, each typical of one of the classes of applications identified above.

SPD - a class 1 type application

The application named SPD (SPatial Distortion) was selected to be ported to the Grid to check what the benefits of using a Grid infrastructure could be. This application, which has been developed at the ESRF, is widely used on several beamlines. Its aim is to correct the images taken by the cameras used as beamline detectors. These cameras are not perfect, and this software generates corrected images from the raw images coming out of the camera. The application has one input file, the raw image, and generates one output file, the corrected image. The corrections are based on 3 calibration files which correct for:

- the camera dark level
- the camera imperfections (flood file)
- the camera distortion

The SPD usage can be summarized by the following diagram:

[Diagram: the raw image, together with the dark, flood and distortion calibration files, is fed into SPD, which produces the corrected image]

The Grid_SPD application

A Python script called Grid_SPD has been written to run the SPD software on a set of images using the Grid infrastructure. Several types of Grid usage have been implemented, ranging from something close to a real Grid usage (no knowledge of where the data are and no knowledge of where the software will be executed) to something close to a cluster usage (SPD running on dedicated computer(s) with NFS access to the image data set). The Grid_SPD script always has a parameter which allows the user to select the number of images which will be processed by each Grid job. Grid_SPD will start as many jobs as needed to correct all the images in the set. For instance, with an image set of 100 images, if the user requires that 10 images be corrected by each job, Grid_SPD will create 10 jobs. Grid_SPD takes timing measurements of the various actions it performs on the Grid. It also implements a loop mode in which it redoes its work repeatedly and stores its timing measurements in a CSV file.

The LFC (Logical File Catalogue) server was the DESY LFC. When using the LFC, the SE (Storage Element) was hard-coded to be the ESRF DPM SE (physically located at the ESRF). The CE (Computing Element) was always hard-coded in the job description file as the ESRF LCG-CE, except in the CREAM-CE mode. Therefore this Grid_SPD, even in its most Grid-like mode, cannot be considered a pure Grid application. The following Grid_SPD running modes have been implemented:

1. The UI mode: This mode is the closest to a real Grid usage. The three correction files (dark file, flood file and distortion file) are stored on the UI (User Interface) computer. The image set is also stored on the UI and the corrected images will be put on the UI as well.

2. The LFC mode: In this mode, we try to minimize the data transfer between the UI computer and the Grid infrastructure. The image set is already on the LFC. The three correction files are stored on the UI. The corrected images will be put on the UI as well.

3. The parametric mode: In this mode, the image set is already on the LFC. The three correction files are stored on the UI. The corrected images will be stored on the UI as well. Grid_SPD is used in the so-called parametric job mode to send the job request to the WMS (Workload Management System). It is this parametric job which will in turn start the underlying jobs.

4. The CREAM-CE mode: In this mode, the image set is already on the LFC. The three correction files are stored on the UI. The corrected images will be put on the UI as well, but the jobs are not submitted to the WMS (Workload Management System). They are submitted directly to the ESRF CREAM-CE.

5. The NFS mode: In this mode, the file system on which the image set is stored is mounted on the Grid worker node. The three correction files are also stored on this NFS-mounted file system. The corrected images will be put in the same directory as the raw images (therefore on the NFS-mounted file system as well).

6. The local mode: all files are on a local disk and the jobs run on the same local host.

The results

Two charts are given, one for 10 images per job and one for 50 images per job. In these two charts, the bar "Submit jobs" is the sum of:

- the time needed to send the correction files and the images to the Grid (when relevant)
- the time needed to submit the jobs

The bar "Retrieve job outputs" is the sum of:

- the time needed to retrieve the job outputs
- the time needed to retrieve the corrected images (when relevant)

The numbers in these charts are average numbers.

[Chart: 18 jobs, 10 images/job - submit, wait, retrieve and total times for the UI, LFC, PARAMETRIC, CREAM-CE, NFS and LOCAL modes]

[Chart: 4 jobs, 50 images/job - submit, wait, retrieve and total times for the UI, LFC, PARAMETRIC, CREAM-CE, NFS and LOCAL modes]

The detailed results of this study can be found in deliverable D11.4 of this work package.

Conclusions

As we can see from these charts, the time needed to correct images using a small number of images per job is quite high. Using the Grid to run many small jobs, each one correcting a single image, is not very efficient. The time needed to get the data to the right place (accessible to the jobs running on a CE) is also noticeable: the data have to be transferred from the storage element to a disk accessible by the compute element, using the LFC to locate the storage element. The WMS parametric job allows us to decrease the time needed to start the jobs, but in the end the best result was obtained using the so-called NFS mode. However, this mode is far from a typical Grid application (the data are on a file system NFS-mounted on the worker node). To conclude, it seems that the Grid as it is today is not well adapted to this kind of application (many small jobs which are I/O intensive). Two of its main components (WMS and LFC) introduce a noticeable overhead.

Gasbor - a class 4 type application

Gasbor calculates domain structures of proteins from X-ray solution scattering. It relies on an ab initio method for building a structural model of the proteins [D. Svergun et al., Biophysical Journal 80, 2001]. The execution of the program for a typical set of scattering data runs for several days, and often two weeks, on a local desktop computer. Both the required input data size and the calculated results are on the

order of Megabytes or less. For statistical reasons it is desirable to run many similar jobs on a given data set. The required computing resources quickly become very large. As these large resources are only needed occasionally, the Grid seems to offer the ideal solution. The tests presented here were done with a much shorter test job, on the order of a few hours, to allow for more rapid feedback.

Job submission times to the WMS

Job submission times improved considerably after the upgrade of the WMS. One can see in the figures below the perfect linearity while submitting 277 jobs. The histogram plot shows submission times narrowly centred on 5 seconds per job.

[Plots: cumulative job submission time versus job number, and a histogram of submission times per job in seconds]

The ganglia plots of Grid-wms.esrf.eu also show that the WMS can handle the 250 or so jobs. Before the upgrade, the 4 GB of memory was quickly filled and necessitated frequent restarts.

Job finish times

The following is a study of 277 jobs submitted at the same time and executing on three different CEs of our XRAY infrastructure at DESY, PSI, and the ESRF. A large part is executed rapidly after submission to the WMS and finishes after about one hour. The remaining jobs report a status of Done after two, four, and some only after six hours. The latter are due to busy resources and include the time spent waiting in the WMS queue.

[Histogram: number of jobs versus job finish time in hours]

Effective job run time on different CEs

The effective time of the job, between its start on the worker node and its reporting "finished" to the site's Compute Element, depends of course on the local hardware and software environment it encountered on the respective worker node. The Gasbor user reported a runtime of 3-4 hours for this job on their local machine. It runs in about two hours on Grid worker nodes at the ESRF and PSI, and in less than half that time on DESY machines. The runtime on ESRF and PSI machines shows a bigger dispersion, due to the fact that four to eight job slots were available on each worker node depending on its number of available cores. It turned out that two machines at the ESRF were overloaded, as these had eight job slots for four CPUs.

[Histogram: wall clock time per job - number of jobs per time interval versus job runtime in minutes]

Those two machines alone were responsible for the execution times above 140 minutes.

Similar software

There is another widely used program within the synchrotron radiation science community that fits somewhat into the same category as Gasbor, namely FDMNES. FDMNES relies on a finite-difference method to calculate X-ray absorption near-edge structures. As with Gasbor, the required input and output data are rather small.

Conclusion

This type of application, with the combination of small input and output data and the need for a large number of independent jobs with long execution times, seems to be the ideal Grid application. The figure below makes a comparison of job throughput in different environments: locally on the user's desktop, a batch job on a local cluster with 20 free job slots, and the results from the submission of the 277 jobs to the Grid, for which we had roughly 150 job slots immediately available. A fourth case includes an 'optimized' Grid job, where we assume a better submission framework that would eliminate waiting jobs in the presence of free resources.

The assumption that more resources are immediately available on the Grid comes from the fact that by sharing resources in a Grid, one can reduce the so-called 'wait-while-idle' cycles. This of course depends a lot on usage patterns in an actual production environment. More detailed studies of Grid taxonomy can be found in e.g. [Yin Fei et al., Computers and Electrical Engineering 35 (2009) and references therein].

[Chart: job throughput in minutes for a local job, a batch job on a local cluster, a Grid job, and an optimised Grid job]

But even in this case there are certain negative aspects. These result from the fact that Grid jobs have a non-negligible risk of failing. The risks range from configuration errors on sites, to middleware bugs, to network troubles. It is therefore wise to limit execution times to a day or so. Gasbor, however, does not offer this flexibility. Interaction with the developers becomes necessary, which is often impractical and would be resisted unless a critical mass of Grid users could be found.

PyHST - a class 2 type application

What is PyHST?

PyHST is a suite of programs for analysing synchrotron tomography data and producing 3D volumes. An example of a data set produced at the ESRF is the tomogram of the skull of Australopithecus sediba, recently found in Malapa in South Africa, which could represent the missing link between primates and humans.

Rendering of the 3-D scan of the skull of the Australopithecus sediba child. Credits: P. Tafforeau

More examples of palaeontology data sets can be found online. Experiments using imaging techniques are the biggest producers of data at the ESRF. One example of imaging is tomography. Tomography experiments account for over 50% of the data produced at the ESRF. For this reason it is important to study how the Grid can help analyse tomography data. The diagram below shows the data flow of PyHST from the beamline to the local cluster when it is run at the ESRF.

Running PyHST on the Grid

A theoretical study of the time needed to run PyHST on the Grid has been done. A typical use case for PyHST is to reconstruct a volume of 2048x2048x2048 floats from a set of 1600 images. Each image is an array of 2048x2048 float numbers. Using the ESRF cluster, this computation takes 20 hours with a single job, and the computation time decreases linearly with the number of jobs used. Therefore, the input numbers for our estimation are:

- Input data: 1600 files of 2048x2048 floats
- Output data: a volume of 2048x2048x2048 floats in one file
- Computation time: 20 hours for one job

To run PyHST on the Grid, a typical sequence is:

1. Step 1: Send the input data from the User Interface to a Storage Element.
2. Step 2-a: Copy the input data from the Storage Element to each job running on a Worker Node.
3. Step 2-b: Do the computation.
4. Step 2-c: Copy the computation result to the Storage Element.
5. Step 3: Retrieve the output volume on the User Interface from the Storage Element.

When the computation is divided into several jobs, each job needs all the data. Each job computes a volume slice, and at the end the volume needs to be reconstructed from the outputs of all the running jobs. This last step was neglected in this case study because it is the same for all the cases (Grid and non-Grid). To estimate the time needed by step 1 and step 3, we will do 3 computations with different bandwidths available between the User Interface and the Storage Element:

1. 1 MByte/sec for a slow transfer
2. 10 MBytes/sec, which is an average throughput
3. 40 MBytes/sec for a fast transfer

We will double these cases by studying the transfer of the input data (the 1600 files) in two different flavours: one big tar file containing the 1600 image files, and 1600 different files. This is done to estimate the impact of the Logical File Catalogue. Registering a file in the Grid is a two-step process:

1. Register the file in the Logical File Catalogue. The time used for this registration is typically 2 seconds.
2. Send the file to the Storage Element.

In the case of sending one big tar file, the time needed to create the tar file and to untar it will also be taken into account. The total time is the sum of the times needed for Step 1, Step 2 and Step 3.

Time needed for Step 1

One file of 2048x2048 floats means 16 MBytes. Therefore, the amount of input data is 16x1600 = 25600 MBytes, i.e. 25 GBytes. On the computer used for our test bench, the time needed to tar 20 files of 16 MBytes each is 10 sec, which means 13 min and 20 sec for 1600 files. The time needed for Step 1 is summarized in the following table:

                     1 MByte/sec                10 MBytes/sec        40 MBytes/sec
1 big tar file       7 hours 20 min. 02 sec.    56 min. 02 sec.      24 min. 02 sec.
1600 files           8 hours                    1 hour 36 min.       1 hour 4 min.

Time needed for Step 2

This is the sum of:

- the time needed to transfer the data from the Storage Element to the Worker Node(s)
- the computation time
- the time needed to transfer the data back from the Worker Node(s) to the Storage Element.

The resulting volume (2048x2048x2048 floats) is 32768 MBytes (32 GBytes). We have the final equation:

t = ((d / T) * N) + (20 * 60 * 60 / N) + (((c / N) / T) * N)

with:
d = input data size in MBytes (25600)
T = transfer rate in MBytes/sec (40)
N = number of jobs
c = output data size in MBytes (32768)

With these numbers the equation becomes t = (640 * N) + (72000 / N) + 819.2, where the last, constant term is the time needed to transfer the output volume back to the Storage Element. This equation has a minimum for N = sqrt(72000 / 640) = 10.6. Therefore, the optimal job number is 10 and the time for step 2 becomes 640 * 10 + (72000 / 10) + 819.2 = 14419 seconds, which is 4 hours and 19 seconds. In the case of the data being sent as one big tar file, the time to untar the file (16 minutes) has to be added. The following table summarizes the results:

1 big tar file       4 hours 16 min. 19 sec.
1600 files           4 hours 00 min. 19 sec.
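The small sketch below (our own illustration, not part of the WP11 deliverables) evaluates this cost model for a few job counts, using the numbers above; it reproduces the optimum of about 10 jobs and shows how quickly the input-transfer term dominates when too many jobs are used.

    import math

    # Cost model for step 2 of running PyHST on the Grid, using the numbers above.
    #   d: input data size [MB], c: output volume size [MB],
    #   T: SE<->WN transfer rate [MB/s], compute: single-job computation time [s].
    def step2_time(N, d=25600.0, c=32768.0, T=40.0, compute=20 * 3600.0):
        transfer_in = (d / T) * N     # every job must receive the full input set
        computation = compute / N     # the reconstruction parallelises linearly
        transfer_out = c / T          # the output slices add up to one volume
        return transfer_in + computation + transfer_out

    n_opt = math.sqrt((20 * 3600.0) / (25600.0 / 40.0))   # ~10.6, so use 10 jobs
    print("optimal number of jobs: about %.1f" % n_opt)
    for n in (1, 10, 100, 500):
        print("N = %3d: step 2 takes %.1f hours" % (n, step2_time(n) / 3600.0))

For 10 jobs this gives the 4 hours quoted above; for several hundred jobs the time explodes, because the full input set has to be shipped to every job.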

Time needed for Step 3

The time needed for this step is the time needed to transfer the resulting volume, which is 32768 MBytes (32 GBytes):

1 MByte/sec                10 MBytes/sec        40 MBytes/sec
9 hours 6 min. 8 sec.      54 min. 36 sec.      13 min. 39 sec.

Total time and conclusions

We are now able to compute an estimate for running PyHST on the Grid by summing the previous results:

                     1 MByte/sec                 10 MBytes/sec              40 MBytes/sec
1 big tar file       20 hours 42 min. 29 sec.    6 hours 6 min. 57 sec.     4 hours 54 min.
1600 files           21 hours 6 min. 27 sec.     6 hours 30 min. 55 sec.    5 hours 17 min. 58 sec.

If you choose to use 100 jobs instead of the optimum number of 10, you will get 20 hours 18 min. and 57 sec. using one big tar file with a 10 MBytes/sec bandwidth. Under the same conditions with 500 jobs, this time becomes 89 hours 9 min. and 23 sec. This time increases dramatically because all the input data has to be provided to all the jobs and, with EGEE as it is today, a job running on a Worker Node does not see the data of another job even if it is running on the same Worker Node. From this table we can conclude that:

- It is better to send one big tar file than 1600 different files (the Logical File Catalogue effect).
- The bandwidth between the User Interface and the Storage Element has a huge effect on the total time.
- For such an application, where all the jobs need all the data, the number of jobs must be chosen carefully.

The best result is 4 hours 54 min. This has to be compared with the 15 min that we get with the ESRF local cluster (running 80 jobs), which takes its input data from a file system shared between itself and the data producer (the beamline). At the ESRF, PyHST has also been ported to run on GPU (Graphics Processing Unit) hardware. Using the same set of input files, the time needed to do the computation using the GPU version of PyHST is 8 min. The following chart summarizes these results:

[Chart: PyHST computation time in seconds for the GPU version, the local cluster, and Grid runs with 10 jobs at 1, 10 and 40 MB/sec and 100 jobs at 10 MB/sec]

The process of porting

There are basically two cases when you want to run an application on a Grid infrastructure:

- the application is already parallelized and therefore well adapted to a possible Grid usage
- the application is not parallelized

By nature, all applications running on the Grid have to be parallelized. Therefore the first thing to do is to parallelize the application. Application parallelization is a complete subject on its own and will not be covered in this document. For applications which are already parallelized, porting to the Grid is a two-step process. First, you have to write a job description file. Then, you have to write a small script which will be executed on the worker node.

As its name says, the job description file is the file where the job is described. The main parameters described in this file are the name of the executable you want to run on the worker node, its arguments and, if required, the description of which files have to be transmitted using the job input or output sandboxes. These sandboxes are used to transfer small amounts of data, typically logging information or error reports. The main job data (input and output) are normally transferred using the Grid LFC. It is also in this file that you can define job-specific requirements like the number of retries in case of job failure, a specific computing element where the job has to be run, a specific worker node system architecture and many other parameters.

The second step, the script, describes what will really be executed on the worker node (very often, the name of this script is given in the job description file as the job executable name). The goals of this script are to retrieve the job input files from the LFC, to run the application with its necessary arguments (computed locally or given in the job description file) and to store the resulting data on the LFC, making them available to the user.
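To make these two steps concrete, the sketch below writes a minimal job description file and the wrapper it points to. It follows the structure described above but is purely illustrative: the JDL attribute names (Executable, InputSandbox, OutputSandbox, Requirements, RetryCount) are standard glite ones, while the file names, LFN paths, storage element and the exact lcg-cp/lcg-cr command lines are assumptions of ours, not extracts from the Grid_SPD code.

    # Illustrative sketch of the two porting artefacts: a JDL file and the wrapper
    # script executed on the worker node. Names, paths and LFNs are invented.
    JDL = '''\
    Executable     = "run_job.sh";
    Arguments      = "0 9";                       # e.g. first and last image index
    StdOutput      = "std.out";
    StdError       = "std.err";
    InputSandbox   = {"run_job.sh", "dark.edf", "flood.edf", "distortion.spline"};
    OutputSandbox  = {"std.out", "std.err"};      # small files only: logs and reports
    RetryCount     = 3;
    Requirements   = other.GlueCEUniqueID == "ce.example-site.eu:8443/cream-pbs-xray";
    '''

    WRAPPER = '''\
    #!/bin/sh
    # Runs on the worker node: fetch input from the SE via the LFC, process, store back.
    first=$1; last=$2
    for i in $(seq $first $last); do
        lcg-cp lfn:/grid/xray/raw/img_$i.edf file:$PWD/img_$i.edf
        ./correct img_$i.edf dark.edf flood.edf distortion.spline corr_$i.edf
        lcg-cr -d se.example-site.eu -l lfn:/grid/xray/corr/corr_$i.edf file:$PWD/corr_$i.edf
    done
    '''

    with open("job.jdl", "w") as f:
        f.write(JDL)
    with open("run_job.sh", "w") as f:
        f.write(WRAPPER)

The job description file would then be submitted with the usual glite command line tools (e.g. glite-wms-job-submit, or glite-ce-job-submit for direct CREAM submission).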

Software management

A wide variety of different programs is used for data reduction, analysis and modelling. Each experiment type has its own specialized programs for data processing. Some of them are simple executables without dependencies, others need a special environment (runtime libraries, Python modules, and/or several software packages) to run. The first category of programs, the standalone programs, can easily be referenced in the job description file and sent to the worker nodes, but the second one implies a global software installation on all CEs of all Grid sites that support the XRAY VO.

Before porting applications to the Grid, the important software packages should be installed on each CE of all Grid sites which support the XRAY VO. This is possible with the help of a dedicated software repository which is represented by VO_<name of the VO>_SW_DIR. This software area must be configured beforehand, which was done at DESY and the ESRF, but was missing at PSI. Further requirements:

- Only authorised users who authenticate with the software administrator role for the VO can install software.
- Software tags must be defined, which can be referenced in the JDL job submission files and ensure that the job is submitted to a CE which has the desired software installed.

Conclusions

The installation and maintenance of software in a common software area is possible, and a simple test installation of FDMNES was done successfully for DESY and the ESRF. The development and maintenance of the programs is done by software programmers or even by the scientists themselves, so a large community is installing and maintaining software for the data analysis of synchrotron experiments. Given the large number of different software packages and their dependencies, as well as the large number of software developers, it might become resource intensive and cumbersome to maintain software in a Grid environment.
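As an illustration of the software-tag mechanism (the tag name below is invented; the Glue attribute is the one normally published by the glite information providers), a job can be steered to sites whose software area contains a given package by building a requirement such as:

    # Hedged sketch: build a JDL Requirements expression that matches only those
    # Computing Elements advertising the (invented) software tag VO-xray-FDMNES.
    tag = "VO-xray-FDMNES"
    requirement = (
        'Member("%s", other.GlueHostApplicationSoftwareRunTimeEnvironment)' % tag
    )
    print("Requirements = %s;" % requirement)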

3.3. DATA TRANSFERS

One of the most important challenges for synchrotron jobs is to move data efficiently between Grid resource centres. Experiments carried out at the beamlines produce a huge amount of data which is then used as input for the data analysis work.

Throughput numbers, regular GridFTP transfer statistics

To measure the inherent Grid capabilities, regular transfers were performed between the partner sites. Every night, at off-peak hours, data files were transferred by cron tasks using different protocols: GridFTP, iperf tests, HTTP, etc. The results give us a good basis for comparison, as well as figures to measure the quality of service in terms of reliability and performance.

[Figures: HTTP/iperf transfer tests (single channel), GridFTP transfers (single channel) and GridFTP transfers (10-channel session), each with one curve for the outbound and one for the inbound connection, followed by a comparison chart between the protocols.]

Due to the inherent security mechanism employed by the GridFTP protocol (authentication and encryption for each file transfer), some overhead is introduced at the beginning of every new transfer. This becomes more critical when a job requires hundreds or even thousands of small files as input data. On the other hand, we have also confirmed that no relevant improvement is obtained even when GridFTP's inherent mechanism for striping transfers into multiple parallel data channels is employed: when the data source is unique (not distributed), the throughput rates are quite similar. See the applications in section 3.2 for practical examples.
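For reference, transfers of this kind can be reproduced with the standard Globus client; the following is a sketch only, with illustrative host names and paths rather than the actual test endpoints.

    # single-channel GridFTP transfer between two storage elements
    globus-url-copy gsiftp://se1.example.org/data/scan_0001.edf \
                    gsiftp://se2.example.org/data/scan_0001.edf

    # the same transfer striped over 10 parallel data channels (-p 10);
    # with a single, non-distributed source the observed gain was marginal
    globus-url-copy -p 10 gsiftp://se1.example.org/data/scan_0001.edf \
                    gsiftp://se2.example.org/data/scan_0001.edf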

3.4. SECURE REMOTE RESOURCE ACCESS AND USER MANAGEMENT

Since all Grid resources have to be deployed as part of a public infrastructure, we also have to provide proper mechanisms to guarantee the integrity of the services. Keeping in mind that the partner sites should be up and operational all the time, we have to supply tools to prevent abusive use, track all events and log all relevant and necessary information. This goal has been achieved by using two different and complementary mechanisms.

Perimeter Protection

This term covers all relevant aspects of how to secure the communication channels. Within this security framework, three different scenarios were adopted by the partner sites, from the simplest to the most hardened one:

- PSI: all resources were placed completely outside of the lab network. Protection was set up using built-in OS mechanisms such as iptables and Snort. Iptables provides solid performance, performs effective firewalling, and allows add-on functionality to enhance its reporting and response functions; Snort adds a complementary, free, lightweight network intrusion detection system to the Linux boxes.
- Soleil: all Grid resources were placed behind a corporate firewall, giving a centralized point for the security management policy and offering a strong platform of defence.
- ESRF: also working behind a corporate perimeter firewall (a Checkpoint cluster), the platform was further hardened with Quality of Service (QoS) appliances (Packeteer). These components guarantee that throughput is regulated by a third party, avoiding abusive use and ensuring that all services get the bandwidth they need to function at the desired level.

User Management

A European Virtual User Office

Federating users, in our case the scientists who use analytical facilities such as synchrotrons or neutron reactors, is a subject which has been under discussion for years. The potential benefits of a unique EU-wide system are enormous. Scientists often use more than one facility to carry out their research project, yet every facility currently manages the user information and the account creation separately. The maintenance of this information, and in particular the affiliation data of the scientists, is a daily, time-consuming activity in all labs. It is estimated that there are more than scientists using European photon and neutron facilities, coming from almost different institutions. A central repository of this information would allow for efficient update mechanisms and checking for double entries.

Once the user information is federated, account creation at the facilities could be derived from this information within the workflow of the peer-reviewed allocation process.

The same account information could be used to combine research done at two or more analytical facilities, e.g. for launching a data analysis job on data sets stored in several laboratories. A federated system would also foster a community identity, something which is currently difficult to achieve considering the large variety of origins of our user community.

Initially, the federated user database would simply act as a front-end to the individual User Office systems of the facilities. Gradually, new functionality could be envisaged, such as the parallel submission of beamtime requests to several facilities, or combining the peer review process between facilities. Ultimately this could lead to a real European Virtual User Office for a given class of facilities. A central repository of user information would also make it possible, with the agreement of the scientists, to foster information exchange about facility updates, workshops, special events, etc.

Three ESFRI roadmap projects are currently investigating and discussing how a federated user database could be adopted and interfaced to their respective User Office systems: the ILL 20/20, ESRFUP, and EuroFEL. Different authentication methods were considered, and a prototype setup is going to be implemented. The WP11 Grid project has allowed the testing of user authentication based on Grid certificates, the ESRFUP WP7 common entry point to the ILL and the ESRF is based on Yale CAS, and the EuroFEL WP2 will soon put in place an authentication system based on Shibboleth.

- Authentication with Grid certificates

The clear advantage of using Grid or X509 certificates is the much improved security that they provide to a user and to the institutions managing users, compared with the now common username/password. X509 certificates are based on so-called asymmetric cryptography algorithms in which every user gets two keys. One key remains private and the owner has to make sure this key does not get compromised. The other key is public and should be made available to all participants. No exchange of secrets is necessary for encryption/decryption or for authentication (signature). The other elements necessary to form a public key infrastructure (PKI) are the Certificate Authorities (CAs), which are trusted entities and guarantee the identity of the user as specified in the certificates they issue. Trust in a CA usually comes through an agreed set of policies, etc. and is controlled by a Trust Federation which accredits CAs.

The immediate advantage is that the EGEE project has already set up this infrastructure. National certification authorities, covering all European countries and beyond, have been created and accredited, which is quite a lengthy and tedious endeavour. All a user needs to do is identify the appropriate CA and provide his or her name, affiliation, and address; the identity is usually verified by showing a passport.
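Once a certificate has been issued, day-to-day use on the EGEE Grid goes through short-lived proxy credentials derived from it. The commands below are a sketch of the typical workflow; the file locations shown are the conventional defaults and are assumptions rather than project-specific settings.

    # inspect the issued certificate: owner, issuing CA and validity period
    openssl x509 -in ~/.globus/usercert.pem -noout -subject -issuer -dates

    # create a short-lived proxy credential from the certificate and private key;
    # Grid jobs and data transfers use the proxy, never the long-term key itself
    grid-proxy-init -valid 12:00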

The EUGridPMA [12] itself does not issue certificates; it coordinates the national and regional authorities that do the actual certificate issuing to end entities. For a new community to be integrated, one has to make sure that the users' home institutes are registered with the PKI and have people willing to act as the local registration authority (essentially checking people's passports). If a new community has users on the order of thousands or more - as is the case for the scientists working at or visiting synchrotron installations - one needs to make sure the national certificate authorities can handle the requests and support the users. If a community resists these last steps, it can decide to run its own certificate authority, thus keeping the policies under its own control and making sure that the delivery of certificates is timely and the user support adequate. Although this looked manageable to the author of this paragraph, the actual user community and the people responsible for managing user accounts and access were very resistant to Grid certificates.

[12] See the EUGridPMA Membership at

The concept looked very complex and the handling was too awkward to be considered an acceptable and feasible solution. A single, harmless security message from a browser (like the one in the screenshot above) was enough to scare people away from familiarising themselves with certificates.

In the EGEE context, authorisation to access information on web applications is handled quite successfully. The GOCDB web portal (see picture below), hosted in the UK, and the CIC portal, hosted in Lyon, are good examples of this. The developers confirmed the simplicity with which access to a page can be handled directly at the Apache level (SSL). This can easily be extended to handle roles, by storing certificate identification strings in a small database and basing permissions on the roles in the database and an associated scope of access.
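A hedged sketch of this Apache-level handling is given below; the paths, location and distinguished name are illustrative assumptions and do not reproduce the actual GOCDB or CIC configuration, where the DN-to-role mapping lives in a small database rather than in the configuration file.

    # require a client certificate issued by one of the trusted (EUGridPMA) CAs
    SSLVerifyClient      require
    SSLVerifyDepth       5
    SSLCACertificatePath /etc/grid-security/certificates

    <Location /admin>
        # grant access only to certificate subjects (DNs) holding the admin role
        SSLRequire %{SSL_CLIENT_S_DN} eq "/DC=org/DC=example/CN=Jane Operator"
    </Location>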

The authorization part can also be handled in a central fashion via a Virtual Organization Membership Service (VOMS), like the one we set up for the project.
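In practice the VOMS server extends the plain proxy described earlier with VO membership and role attributes, on which services can base their authorisation decisions. A minimal sketch for the XRAY VO follows; the role name is illustrative.

    # proxy carrying plain membership of the xray VO
    voms-proxy-init --voms xray

    # proxy additionally carrying a role, e.g. the software administrator role
    voms-proxy-init --voms xray:/xray/Role=swadmin

    # display the VO and role attributes embedded in the proxy
    voms-proxy-info --all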

