
Introduction to Grid Computing

Based on: Grid Intro and Fundamentals Review; talk by Gabrielle Allen; talk by Laura Bright / Bill Howe

Overview
- Background: what is the Grid?
- Related technologies
- Grid applications
- Communities
- Grid tools
- Case studies

PUMA: Analysis of Metabolism
- PUMA Knowledge Base: information about proteins, analyzed against ~2 million gene sequences
- Analysis on the Grid involves millions of BLAST, BLOCKS, and other processes
- Natalia Maltsev et al., http://compbio.mcs.anl.gov/puma2

Initial driver: High Energy Physics (tiered computing model)
- Online system: a bunch crossing every 25 nsecs and ~100 triggers per second, each triggered event ~1 MByte in size; ~100 MBytes/sec flows to the offline processor farm (~20 TIPS; 1 TIPS is approximately 25,000 SpecInt95 equivalents)
- Tier 0: CERN Computer Centre
- Tier 1: regional centres (France, Germany, Italy; FermiLab at ~4 TIPS), linked at ~622 Mbits/sec (or by air freight, deprecated)
- Tier 2: Tier2 centres (e.g., Caltech), ~1 TIPS each, with physics data caches
- Tier 3/4: institute servers (~0.25 TIPS) and physicist workstations (~1 MBytes/sec); physicists work on analysis channels, with ~10 physicists per institute working on one or more channels whose data should be cached by the institute server
(Image courtesy Harvey Newman, Caltech)

Computing clusters have commoditized supercomputing. A typical cluster has a cluster-management frontend; I/O servers, typically a RAID fileserver; a few head nodes, gatekeepers, and other service nodes; tape backup robots; disk arrays; and lots of worker nodes.

What is a Grid?
- Many definitions exist in the literature
- Early definition (Foster and Kesselman, 1998): "A computational grid is a hardware and software infrastructure that provides dependable, consistent, pervasive, and inexpensive access to high-end computational facilities"
- Kleinrock, 1969: "We will probably see the spread of computer utilities, which, like present electric and telephone utilities, will service individual homes and offices across the country."

Three-point checklist (Foster, 2002)
1. Coordinates resources that are not subject to centralized control
2. Uses standard, open, general-purpose protocols and interfaces
3. Delivers nontrivial qualities of service, e.g., response time, throughput, availability, security

Grid Architecture
- Autonomous, globally distributed computers/clusters

Why do we need Grids?
- Many large-scale problems cannot be solved by a single computer
- Data and resources are globally distributed

Background: related technologies
- Cluster computing
- Peer-to-peer computing
- Internet computing

Cluster computing
- Idea: put some PCs together and get them to communicate
- Cheaper to build than a mainframe supercomputer
- Clusters come in different sizes
- Scalable: a cluster can grow by adding more PCs

Cluster Architecture

Peer-to-peer computing
- Connect directly to other computers
- Can access files from any computer on the network
- Allows data sharing without going through a central server
- The decentralized approach is also useful for Grids

Peer-to-peer architecture

Internet computing
- Idea: there are many idle PCs on the Internet that can perform other computations while not being used
- Cycle scavenging: rely on getting free time on other people's computers
- Example: SETI@home
- What are the advantages and disadvantages of cycle scavenging?

Some Grid applications
- Distributed supercomputing
- High-throughput computing
- On-demand computing
- Data-intensive computing
- Collaborative computing

Distributed supercomputing
- Idea: aggregate computational resources to tackle problems that cannot be solved by a single system
- Examples: climate modeling, computational chemistry
- Challenges include scheduling scarce and expensive resources, scalability of protocols and algorithms, and maintaining high levels of performance across heterogeneous systems
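The cycle-scavenging idea described above can be sketched in a few lines: a scavenger runs work units only while the host machine reports itself idle, backing off whenever the owner is using it. This is a toy model, not how SETI@home or Condor are actually implemented; the idle detector here is a simulated stand-in.

```python
import time

def scavenge_cycles(work_units, is_idle, run_unit):
    """Toy cycle scavenger: run one work unit at a time, but only
    while the host reports itself idle (hypothetical check)."""
    results = []
    pending = list(work_units)
    while pending:
        if is_idle():
            results.append(run_unit(pending.pop(0)))
        else:
            time.sleep(0.01)  # back off while the owner is using the machine
    return results

# Simulated idle detector: the machine is busy for the first two polls.
polls = iter([False, False, True, True, True])
busy_then_idle = lambda: next(polls, True)

out = scavenge_cycles([1, 2, 3], busy_then_idle, lambda u: u * u)
```

One visible trade-off of this approach is that progress is entirely at the mercy of the host's idle time, which is why real systems also need checkpointing and job migration.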

High-throughput computing
- Schedule large numbers of independent tasks
- Goal: exploit unused CPU cycles (e.g., from idle workstations)
- Unlike distributed supercomputing, the tasks are loosely coupled
- Examples: parameter studies, cryptographic problems

On-demand computing
- Use Grid capabilities to meet short-term requirements for resources that cannot conveniently be located locally
- Unlike distributed supercomputing, driven by cost-performance concerns rather than absolute performance
- Dispatch expensive or specialized computations to remote servers

Data-intensive computing
- Synthesize data held in geographically distributed repositories; the synthesis may be both computationally and communication intensive
- Examples: high energy physics, which generates terabytes of distributed data and needs complex queries to detect interesting events; distributed analysis of Sloan Digital Sky Survey data

Collaborative computing
- Enable shared use of data archives and simulations
- Example: collaborative exploration of large geophysical data sets
- Challenges: the real-time demands of interactive applications, and a rich variety of interactions

Grid communities: who will use Grids?
- Broad view: the benefits of sharing outweigh the costs; the Grid becomes universal, like the power grid
- Narrow view: the cost of sharing across institutional boundaries is too high; resources are shared only when there is an incentive to do so, and Grids will be specialized to support specific communities with specific goals

Grid users: many levels of users
- Grid developers
- Tool developers
- Application developers
- End users
- System administrators

Some Grid challenges
- Data movement
- Data replication
- Resource management
- Job submission

Some Grid-related projects
- Globus
- Condor
- Nimrod-G

What is a grid made of? Middleware.
- Grid client: application and user interface
- Grid resources dedicated by UC, IU, and Boston; grid resources shared by OSG, LCG, and NorduGRID; grid resource time purchased from a commercial provider. Each site offers grid storage and a computing cluster behind grid middleware, speaking common grid protocols.
- Resource, workflow, and data catalogs
- Security to control access and protect communication (GSI)
- Directory to locate grid sites and services (VORS, MDS)
- Uniform interface to computing sites (GRAM)
- Facility to maintain and schedule queues of work (Condor-G)
- Fast and secure data set mover (GridFTP, RFT)
- Directory to track where datasets live (RLS)

Globus Grid Toolkit
- Open source toolkit for building Grid systems and applications
- Enabling technology for the Grid: share computing power, databases, and other tools securely online
- Facilities for resource monitoring, resource discovery, resource management, security, and file management

Data management in the Globus Toolkit
- Data movement: GridFTP, Reliable File Transfer (RFT)
- Data replication: Replica Location Service (RLS), Data Replication Service (DRS)

GridFTP
- High-performance, secure, reliable data transfer protocol
- Optimized for wide-area networks
- Superset of the Internet FTP protocol
- Features: multiple data channels for parallel transfers, partial file transfers, third-party transfers, reusable data channels, command pipelining

More GridFTP features
- Auto-tuning of parameters
- Striping: transfer data in parallel among multiple senders and receivers instead of just one
- Extended block mode: data is sent in blocks with a known block size and offset, so blocks can arrive out of order; this allows multiple streams

Striping architecture
- Uses striped servers

Limitations of GridFTP
- Not a web service protocol (does not employ SOAP, WSDL, etc.)
- Requires the client to maintain an open socket connection throughout the transfer, which is inconvenient for long transfers
- Cannot recover from client failures
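The key property of extended block mode described above, that each block carries its own offset so blocks from parallel streams can arrive in any order, can be illustrated with a toy reassembly routine. This is a conceptual sketch only, not the GridFTP wire format.

```python
def reassemble(blocks, total_size):
    """Reassemble a file from (offset, data) blocks that may arrive
    out of order, as in GridFTP's extended block mode (toy sketch)."""
    buf = bytearray(total_size)
    for offset, data in blocks:
        # Each block is self-describing: it knows where it belongs.
        buf[offset:offset + len(data)] = data
    return bytes(buf)

# Blocks arriving out of order from parallel streams:
blocks = [(6, b"world"), (0, b"hello "), (11, b"!")]
result = reassemble(blocks, 12)
```

Because order no longer matters, the receiver can accept blocks from multiple senders concurrently, which is exactly what makes striping possible.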

GridFTP

Reliable File Transfer (RFT)
- Web service with job-scheduler functionality for data movement
- The user provides source and destination URLs
- The service writes a job description to a database and moves the files
- Service methods allow querying of transfer status

RFT

Replica Location Service (RLS)
- Registry that keeps track of where replicas exist on physical storage systems
- Users or services register files in RLS when the files are created
- Distributed registry: it may consist of multiple servers at different sites, to increase scale and fault tolerance

Replica Location Service (RLS), continued
- Logical file name: a unique identifier for the contents of a file
- Physical file name: the location of a copy of the file on a storage system
- A user can provide a logical name and ask for its replicas, or query to find the logical name associated with a physical file location

Data Replication Service (DRS)
- Pull-based replication capability, implemented as a web service
- Higher-level data management service built on top of RFT and RLS
- Goal: ensure that a specified set of files exists on a storage site
- First, query RLS to locate the desired files; next, create a transfer request using RFT; finally, register the new replicas with RLS
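The logical-name-to-physical-name mapping at the heart of RLS can be sketched as a small in-memory registry. This is an illustration of the data model only, not the RLS API; the class, method names, and example file names are all hypothetical.

```python
from collections import defaultdict

class ReplicaRegistry:
    """Toy replica registry in the spirit of RLS: one logical file name
    maps to the physical locations of its replicas (illustrative only)."""
    def __init__(self):
        self._replicas = defaultdict(set)

    def register(self, logical, physical):
        """Register a new replica of a logical file."""
        self._replicas[logical].add(physical)

    def lookup(self, logical):
        """Given a logical name, return all known replica locations."""
        return sorted(self._replicas[logical])

    def logical_for(self, physical):
        """Reverse query: which logical name(s) match a physical location?"""
        return [lfn for lfn, pfns in self._replicas.items() if physical in pfns]

rls = ReplicaRegistry()
rls.register("lfn://sky-survey/run42.dat", "gsiftp://siteA/data/run42.dat")
rls.register("lfn://sky-survey/run42.dat", "gsiftp://siteB/data/run42.dat")
```

Both query directions from the slides are covered: forward (logical name to replicas) and reverse (physical location to logical name).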

Condor
- Original goal: high-throughput computing, harvesting wasted CPU power from other machines
- Can also be used on a dedicated cluster
- Condor-G: the Condor interface to Globus resources

Condor provides many features of batch systems:
- Job queueing
- Scheduling policy
- Priority scheme
- Resource monitoring and resource management
- Users submit their serial or parallel jobs; Condor places them into a queue, handles scheduling and monitoring, and informs the user upon completion
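The batch-system features listed above, a queue, a priority scheme, and completion notification, can be sketched with a minimal priority queue. This is a toy model of the idea, not the Condor API or its matchmaking language.

```python
import heapq

class JobQueue:
    """Toy batch queue in the spirit of Condor: jobs are queued with a
    priority, run in priority order, and the owner is notified on
    completion (sketch only, not how Condor is implemented)."""
    def __init__(self):
        self._heap = []
        self._seq = 0

    def submit(self, priority, job):
        # Lower number = higher priority; seq preserves FIFO order for ties.
        heapq.heappush(self._heap, (priority, self._seq, job))
        self._seq += 1

    def run_all(self, notify):
        while self._heap:
            _, _, job = heapq.heappop(self._heap)
            notify(job, job())  # inform the user upon completion

done = []
q = JobQueue()
q.submit(2, lambda: "analysis")
q.submit(1, lambda: "calibration")
q.run_all(lambda job, result: done.append(result))
```

The higher-priority calibration job runs first even though it was submitted second, which is the essential behavior a priority scheme adds over plain FIFO queueing.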

Nimrod-G Architecture

Security
- Grid security is a crucial component: resources are typically valuable, and the problems being solved might be sensitive
- Resources are located in distinct administrative domains; each resource has its own policies, procedures, security mechanisms, etc.
- The implementation must be broadly available and applicable: standard, well-tested, well-understood protocols, integrated with a wide variety of tools

Security services
- Form the underlying communication medium for all the services
- Secure authentication and authorization
- Single sign-on: the user explicitly authenticates only once, and single sign-on then works for all service requests
- Uniform credentials
- Example: GSI (Grid Security Infrastructure)

Authentication means verifying that you are who you claim to be; it stops imposters. Examples of authentication:
- Username and password
- Passport
- ID card
- Public keys or certificates
- Fingerprint

Authorization controls what you are allowed to do, e.g., is this device allowed to access this service?
- Read, write, and execute permissions in Unix
- Access control lists (ACLs) provide more flexible control
- Special callouts in the grid stack perform authorization checks in job and data management
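The distinction drawn above between authentication and authorization can be made concrete with a minimal ACL check: authentication has already established the caller's identity, and authorization then decides what that identity may do. Names and permissions here are purely illustrative.

```python
def authorize(acl, user, action):
    """Toy access-control-list check. `user` is an already-authenticated
    identity; authorization decides whether `action` is permitted."""
    return action in acl.get(user, set())

# Hypothetical ACL: alice may do anything, bob may only read.
acl = {
    "alice": {"read", "write", "execute"},
    "bob": {"read"},
}
```

An unknown identity gets no permissions at all, which mirrors the default-deny stance grid authorization callouts take.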

Job and resource management

Job management services provide a standard interface to remote resources
- Resources include CPU, storage, and bandwidth
- The Globus component is the Globus Resource Allocation Manager (GRAM)
- The primary Condor grid client component is Condor-G
- Other needs: scheduling, monitoring, job migration, notification

GRAM provides a uniform interface to diverse resource scheduling systems: through GRAM, a user can reach sites running Condor (Site A), LSF (Site C), PBS (Site B), or plain UNIX fork() (Site D), each operated by its own virtual organization.

GRAM: what is it?
- Globus Resource Allocation Manager
- Given a job specification, GRAM can: create an environment for the job, stage files to and from that environment, submit the job to a local resource manager, monitor the job, send notifications of job state changes, and stream the job's stdout/stderr during execution

A Local Resource Manager is a batch system for running jobs across a computing cluster. Examples used with GRAM:
- Condor
- PBS
- LSF
- Sun Grid Engine
- Most systems also let you access fork, the default behavior; it runs jobs on the gatekeeper, which is a bad idea in general but okay for testing

Managing your jobs
- We need something more than the basic functionality of the globus job submission commands
- Desired features: job tracking, submission of a set of interdependent jobs, checkpointing and job resubmission, and matchmaking to select an appropriate resource for executing each job
- Options: Condor, PBS, LSF, ...
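GRAM's job lifecycle, with a notification sent on every state change, can be sketched as a small state machine. The state names below are hypothetical, modeled loosely on the stage-in/submit/monitor/stage-out sequence described above; this is not the GRAM protocol itself.

```python
class ManagedJob:
    """Sketch of GRAM-style job tracking: the manager advances a job
    through its lifecycle and fires a callback on each state change
    (hypothetical state names, illustrative only)."""
    STATES = ["unsubmitted", "stage_in", "pending", "active", "stage_out", "done"]

    def __init__(self, callback):
        self.state = "unsubmitted"
        self._callback = callback

    def advance(self):
        i = self.STATES.index(self.state)
        if i + 1 < len(self.STATES):
            self.state = self.STATES[i + 1]
            self._callback(self.state)  # send notification of the state change

seen = []
job = ManagedJob(seen.append)
while job.state != "done":
    job.advance()
```

A client subscribing to these notifications can track a remote job without polling, which is the point of GRAM's state-change callbacks.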

Grid Workflow

A typical workflow pattern in image analysis runs many filtering applications. In the example workflow, pairs of input images (3a, 4a, 5a, 6a, each a .h/.i pair) are aligned against a reference pair (ref.h, ref.i) by align_warp jobs, resliced by reslice jobs, combined by a softmean job into an atlas (atlas.h, atlas.i), then cut along the x, y, and z axes by slicer jobs (atlas_x.ppm, atlas_y.ppm, atlas_z.ppm) and converted to JPEG by convert jobs (atlas_x.jpg, atlas_y.jpg, atlas_z.jpg). (Workflow courtesy James Dobson, Dartmouth Brain Imaging Center)

Workflows can process vast datasets. Many HEP and astronomy experiments consist of:
- Large datasets as inputs (find datasets)
- Transformations which work on the input datasets (process)
- Output datasets (store and publish)
The emphasis is on the sharing of the large datasets. The transformations are usually independent and can be parallelized, but they can vary greatly in duration. Example: a mosaic of M42 created on the TeraGrid with the Montage workflow, ~1200 jobs across 7 levels of data transfers and compute jobs.
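The dependency structure of one branch of the image-analysis workflow above can be captured as a DAG and executed in topological order, which is essentially what a workflow engine does before dispatching jobs to the grid. This simplified sketch keeps one task per stage and ignores the fan-out across image pairs.

```python
from graphlib import TopologicalSorter

# Simplified dependencies for one branch of the workflow described above:
# each task maps to the set of tasks it depends on.
deps = {
    "reslice": {"align_warp"},
    "softmean": {"reslice"},
    "slicer": {"softmean"},
    "convert": {"slicer"},
}

# A valid execution order respects every dependency edge.
order = list(TopologicalSorter(deps).static_order())
```

In the full workflow the align_warp/reslice stages fan out across the four input image pairs and can run in parallel; a real engine would dispatch every task whose predecessors have finished, not just walk a linear order.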

A virtual data model enables the workflow to abstract grid details.

Grid case studies
- Earth System Grid
- LIGO
- TeraGrid

Earth System Grid
- Provides climate scientists with access to large datasets
- The data is generated by computational models and requires massive computational power
- Most scientists work with subsets of the data, which requires access to local copies of the data

ESG infrastructure
- Archival storage systems and disk storage systems at several sites
- Storage resource managers and GridFTP servers to provide access to the storage systems
- Metadata catalog services
- Replica location services
- Web portal user interface

Earth System Grid

Earth System Grid interface

Laser Interferometer Gravitational Wave Observatory (LIGO)
- Instruments at two sites detect gravitational waves
- Each experiment run produces millions of files
- Scientists at other sites want these datasets on local storage
- LIGO deploys RLS servers at each site to register local mappings and to collect information about mappings at other sites

Large-scale data replication for LIGO
- Goal: detection of gravitational waves
- Three interferometers at two sites generate 1 TB of data daily
- This data must be replicated across 9 sites to make it available to scientists
- Scientists need to learn where data items are, and how to access them

LIGO

LIGO solution: the Lightweight Data Replicator (LDR)
- Uses parallel data streams, tunable TCP windows, and tunable write/read buffers
- Tracks where copies of specific files can be found
- Stores descriptive information (metadata) in a database, so files can be selected based on their description rather than their filename

TeraGrid
- NSF high-performance computing facility
- Nine distributed sites, each with different capabilities, e.g., computational power, archiving facilities, visualization software
- Applications may require more than one site
- Data sizes are on the order of gigabytes or terabytes

TeraGrid solution
- Use GridFTP and RFT with a front-end command line tool (tgcp)
- Benefits of the system: a simple user interface, high-performance data transfer, the ability to recover from both client and server software failures, and an extensible configuration

tgcp details
- Idea: hide low-level GridFTP commands from users
- To copy the file smallfile.dat from a working directory to another system:

    tgcp smallfile.dat tg-login.sdsc.teragrid.org:/users/ux454332

- The equivalent GridFTP command:

    globus-url-copy -p 8 -tcp-bs 1198372 \
      gsiftp://tg-gridftprr.uc.teragrid.org:2811/home/navarro/smallfile.dat \
      gsiftp://tg-login.sdsc.teragrid.org:2811/users/ux454332/smallfile.dat

The reality
- We have spent a lot of time talking about "The Grid"
- There is the Web, and there is the Internet
- Is there a single Grid?
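The translation tgcp performs can be sketched as a function that expands a short copy request into a full globus-url-copy argument vector. The function and its defaults are hypothetical, with parameter values taken from the example above; tgcp's real configuration logic (per-site endpoints, tuned buffer sizes) is more involved.

```python
def build_copy_command(src_host, src_path, dst_host, dst_path,
                       parallel=8, tcp_buffer=1198372):
    """Hypothetical tgcp-style wrapper: expand a short copy request into
    a globus-url-copy invocation (values from the example above)."""
    return [
        "globus-url-copy", "-p", str(parallel), "-tcp-bs", str(tcp_buffer),
        f"gsiftp://{src_host}:2811{src_path}",
        f"gsiftp://{dst_host}:2811{dst_path}",
    ]

cmd = build_copy_command(
    "tg-gridftprr.uc.teragrid.org", "/home/navarro/smallfile.dat",
    "tg-login.sdsc.teragrid.org", "/users/ux454332/smallfile.dat",
)
```

Hiding the parallelism (-p) and TCP buffer (-tcp-bs) settings behind sensible defaults is precisely the usability win the slides claim for tgcp.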

The reality
- Many types of Grids exist: private vs. public, regional vs. global, all-purpose vs. dedicated to a particular scientific problem

Conclusion: why Grids?
- New approaches to inquiry based on deep analysis of huge quantities of data, interdisciplinary collaboration, large-scale simulation and analysis, and smart instrumentation
- Dynamically assemble the resources needed to tackle a new scale of problem
- Enabled by access to resources and services without regard for location and other barriers

Grids: because science takes a village
- Teams organized around common goals: people, resources, software, data, instruments
- Diverse membership and capabilities: expertise in multiple areas is required
- Geographic and political distribution: no single location or organization possesses all the required skills and resources
- Must adapt as a function of the situation: adjust membership, reallocate responsibilities, renegotiate resources