GFS: A Distributed File System with Multi-source Data Access and Replication for Grid Computing


Chun-Ting Chen 1, Chun-Chen Hsu 1,2, Jan-Jan Wu 2, and Pangfeng Liu 1,3
1 Department of Computer Science and Information Engineering, National Taiwan University, Taipei, Taiwan {r94006,d95006,pangfeng}@csie.ntu.edu.tw
2 Institute of Information Science, Academia Sinica, Taipei, Taiwan wuj@iis.sinica.edu.tw
3 Graduate Institute of Networking and Multimedia, National Taiwan University, Taipei, Taiwan

Abstract. In this paper, we design and implement a distributed file system with multi-source data replication ability, called Grid File System (GFS), for Unix-based grid systems. Traditional distributed file system technologies designed for local and campus area networks do not adapt well to wide area grid computing environments. We therefore design GFS to meet the needs of grid computing. With GFS, existing applications can access remote files without any modification, and jobs submitted in grid systems can access data transparently. GFS can be deployed and accessed easily, without special accounts. Our system also provides strong security mechanisms and a multi-source data transfer method to increase communication throughput.

1 Introduction

Large-scale computing grids give ordinary users access to enormous computing power. Production systems such as Taiwan UniGrid [1] regularly provide CPUs to cycle-hungry researchers in a wide variety of domains. However, it is not easy to run data-intensive jobs in a computational grid. In most grid systems, a user must specify the precise set of files to be used by the jobs before submitting them. In some cases this may not be possible, because the set of files, or the fragments of a file, to be accessed may be determined only by the program at runtime rather than given as command line arguments.
In other cases, the user may wish to delay the assignment of data items to batch jobs until the moment of execution, so as to better schedule the processing of data items. To cope with the difficulties of running data-intensive applications with runtime-dependent data requirements, we propose a distributed file system that supports Unix-like run-time file access. The distributed file system provides the same namespace and semantics as if the files were stored on a local machine. Although a number of distributed file systems have been developed in the past
N. Abdennadher and D. Petcu (Eds.): GPC 2009, LNCS 5529, c Springer-Verlag Berlin Heidelberg 2009

decade, none of them are well suited for deployment on a computational grid. Even distributed file systems such as the Andrew File System [2] are not appropriate for grid computing systems, for two reasons: (1) they cannot be deployed without intervention by the administrator at both client and server, and (2) they do not provide the security mechanisms needed for grid computing. To address this problem, we have designed a distributed file system for cluster and grid computing, called Grid File System (GFS). GFS allows a grid user to easily deploy and harness distributed storage without any operating system kernel changes, special privileges, or attention from the system administrator at either client or server. This important property allows an end user to rapidly deploy GFS into an existing grid (or several grids simultaneously) and use the file system to access data transparently and securely from multiple sources. The rest of this paper is organized as follows. Section 2 describes the system architecture of GFS. Section 3 describes our security mechanisms for file access. Section 4 presents GFS's multi-source data transfer. Section 5 describes our implementation of GFS on Taiwan UniGrid, as well as experimental results on the improvement of communication throughput and system performance. Section 6 gives some concluding remarks.

2 Architecture of Grid File System

This section describes the functionality of the components of Grid File System (GFS), and how GFS utilizes these components to construct a grid-enabled distributed file system. The Grid File System consists of three major components: a directory server, file servers, and GFS clients. The directory server manages all metadata for GFS. File servers are responsible for the underlying file transfers between sites, and a GFS client serves as an interface between a user and GFS; users manipulate and access files in GFS via GFS clients.
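As a rough illustration of this division of labor, the three components can be mocked up as follows. All class and method names here are hypothetical sketches, not taken from the paper's implementation; the point is only that metadata questions go to the directory server while bytes move between file servers via the client.

```python
# Hypothetical sketch of the three GFS components and their interaction.

class DirectoryServer:
    """Holds all metadata: maps a logical file name to its replica hosts."""
    def __init__(self):
        self.replicas = {}          # logical path -> list of host names

    def register(self, path, host):
        self.replicas.setdefault(path, []).append(host)

    def lookup(self, path):
        return self.replicas.get(path, [])

class FileServer:
    """Stores physical files and serves reads/writes for one host."""
    def __init__(self, host):
        self.host = host
        self.store = {}             # physical path -> bytes

    def write(self, path, data):
        self.store[path] = data

    def read(self, path):
        return self.store[path]

class GFSClient:
    """Interface between user programs and GFS on one host."""
    def __init__(self, directory, servers):
        self.directory = directory  # the single directory server
        self.servers = servers      # host name -> FileServer

    def create(self, path, data, host):
        # Write the physical file, then register its location as metadata.
        self.servers[host].write(path, data)
        self.directory.register(path, host)

    def read(self, path):
        # Ask the directory server where the replicas are, then fetch one.
        hosts = self.directory.lookup(path)
        if not hosts:
            raise FileNotFoundError(path)
        return self.servers[hosts[0]].read(path)
```

A user-level `cp` into GFS then reduces to a `create` call on the local client, and any later read on any host resolves through the directory server first.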
2.1 Directory Server

The directory server contains five services: the file control service, which receives requests from GFS clients and relays them to the appropriate services; the host management service, which manages host information; the file management service, which maps physical files to logical files and locates logical files; the metadata service, which manages file metadata and searches for registered files; and the replica placement service, which decides where to create replicas of logical files.

File Control Service. The file control service is responsible for receiving requests from GFS clients and relaying them to the appropriate services of the directory server. It also updates the information in the host management service.

Host Management Service. The host management service maintains the available space and the status of the hosts. Each GFS host record contains the

following information: the host name, available/total disk space, and the status, which indicates whether a host is on-line or not. A participating host updates its host information periodically. The file control service marks the status of a host as off-line if the host does not update its information within a certain period of time.

File Management Service. The file management service manages files as logical GFS files. For each logical GFS file, it records the following information: the logical file number, the physical file information, the owner tag, the modifier tag, and the status. The logical file number is a globally unique identifier for a logical GFS file, determined by the metadata service. The physical file information helps GFS clients locate a logical file. It contains a physical file name, physical file location, physical file tag, and physical file status. The physical file name consists of the logical file name and a version number. The physical file location is the host name of the node where the physical file is stored. The physical file tag indicates whether this physical file is the master copy or a replica. With this information the file management service allows a GFS client to register, locate, modify, and delete physical files within GFS. The file management service also maintains the owner tag, the modifier tag, and the status of a logical file. The owner tag of a logical file is the name of the user who owns the file. We identify a GFS user by a GFS user name, which consists of the local user account name and the local host name, e.g., user@grid01. In this way, each user has a unique identity in the Grid File System. The modifier tag of a logical file records the name of the last user who modified the file. The status of a logical file indicates whether a physical file is the latest version of the logical file.

Metadata Service.
The metadata service creates and manages the metadata of GFS files. For each GFS file, it records the following information: logical file path name, logical file name, file size, mode, creation time, modified time, and status. The logical file path name is the global-space file path. The file size is the size of the logical file. The mode follows the traditional Unix access permission mechanism, and contains information such as the type of the file, e.g., a regular file or a directory, and the file access permissions for users. The creation time and the modified time are the times at which the logical file was created and last modified. The status indicates the availability of the logical file.

Replica Service. The replica service determines where to place the replicas of a logical file when it is created. It may use the information provided by the host management service to decide on appropriate locations.

2.2 GFS Clients and File Servers

GFS Clients. The GFS client on a host serves as an interface between user programs and the Grid File System. The GFS client follows the standard Unix

file system interface. With this interface, users can transparently access files in the Grid File System as if they were accessing files in a local file system. The GFS client performs host registration, and users manipulate GFS files through the GFS client. As we pointed out in Section 1, in most grid systems users must specify the precise set of files to be used by their jobs before submitting them, which makes it difficult to run data-intensive jobs. Therefore, we want the Grid File System to be easy to deploy in existing Unix-like distributed systems, and access to GFS files to be transparent. GFS achieves these two goals by following the standard Unix file system interface. Another important function of the GFS client is to notify the directory server when a host joins GFS. When a user mounts GFS on a local host, the GFS client first communicates with the directory server and sends it host information, such as the available space of the host and the host location. The GFS client then updates the local host information with the directory server periodically.

File Server. The file server is responsible for reading and writing physical files at the local host, and for transferring them to/from remote hosts. Each file server is configured to store physical files in a specified local directory. The file server accesses files based on requests from the local GFS client. The GFS client passes the physical file information of the logical file to the file server. The file server then looks in the specified local directory to see whether the requested physical file is available on the local host. If it is, the file server reads/writes data from/to the physical file and sends an acknowledgment back to the user through the GFS client.
On the other hand, if the requested file is at remote hosts, the file server sends requests to the GFS file servers at those remote hosts that own the data, and then receives the data from those remote servers simultaneously.

2.3 A Usage Scenario

We use a usage scenario to demonstrate the interaction among the GFS client, the GFS directory server, and the GFS file server. The scenario is illustrated in Fig. 1a and Fig. 1b. We assume that the user mounts the Grid File System at the directory /gfs, and that the file server is configured to store files in /.gfs. We also assume that a user John at the host grid01 wants to copy a file filea from the local file system into GFS, i.e., John at grid01 issues the command cp filea /gfs/dira/filea. After receiving the command from John, the GFS client first sends a LOOKUP query to the directory server, asking whether the logical file /dira/filea exists. The metadata service of the directory server processes this query. If the answer is no, the GFS client sends a CREATE request to the metadata service to create a logical file record with the logical path set to /dira and the logical name set to filea. The GFS client then asks the file server to create a physical file /.gfs/dira/filea and writes the content of filea into /.gfs/dira/filea, as illustrated in steps 1 to 12 in Fig. 1a.
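The path translation implied by this scenario, from a path under the /gfs mount point to the file server's /.gfs storage directory, can be sketched as below. This is only an illustration: the comma-suffix version encoding is an assumption, since the paper says only that the physical name combines the logical file name with a version number.

```python
# Illustrative translation between the GFS mount point and the
# file server's local storage directory, following the scenario above.

MOUNT_POINT = "/gfs"    # the same on every GFS host (first naming rule)
STORE_ROOT = "/.gfs"    # may differ from host to host

def logical_path(mounted_path):
    """Strip the mount point: /gfs/dira/filea -> /dira/filea."""
    assert mounted_path.startswith(MOUNT_POINT + "/")
    return mounted_path[len(MOUNT_POINT):]

def physical_path(logical, version=None):
    """Map a logical file to a path under the file server's storage root.
    The real system encodes a version number in the physical name;
    here it is an optional suffix for illustration only."""
    path = STORE_ROOT + logical
    if version is not None:
        path += "," + str(version)
    return path
```

Because every host uses the same mount point, the logical path computed on one host is meaningful on all of them, while `STORE_ROOT` stays a purely local choice.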

Fig. 1. The process of creating a file in GFS: (a) create a file at the local host; (b) place a replica at the remote host.

After completing the creation, the GFS client sends a REGISTER request to the directory server. The metadata service updates the metadata information of filea, and the file management service creates a record for the new file, including the logical file number and the physical file information, as illustrated in Fig. 1a. Finally, the GFS client sends a request to create replicas of this logical file. The purpose of replication is to enhance fault tolerance and improve the performance of GFS. The replica placement service decides where to put those replicas based on the information provided by the host management service.
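The paper does not spell out the placement policy itself, but one plausible sketch, using the free-space and on-line status kept by the host management service, is:

```python
def place_replicas(hosts, existing, count=1):
    """Pick hosts for new replicas (an assumed policy, for illustration):
    prefer on-line hosts with the most free space that do not already
    hold a copy of the file.
    hosts: host name -> (free_space_bytes, online_flag)
    existing: set of host names that already store a copy."""
    candidates = [h for h, (free, online) in hosts.items()
                  if online and h not in existing]
    # Most free space first.
    candidates.sort(key=lambda h: hosts[h][0], reverse=True)
    return candidates[:count]
```

Any policy with this interface would do; the essential point in the paper is only that the decision is informed by the host management service's records.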
After receiving the locations of the replicas, the GFS client passes this information to the file server, which is responsible for communicating with the remote file servers and creating replicas at those remote hosts. After successful replication, the GFS client sends a request to register those replicas with the file management service, which then adds the metadata of these replicas into the physical file information database as the primary replicas of the physical file, as illustrated in Fig. 1b.

Fig. 2. The global view of the GFS namespace

Two notes about the GFS naming convention are in order. The first rule is that all GFS hosts must have the same mount point for GFS. This restriction is due to compatibility issues with the Unix file system interface. By this convention, all GFS hosts share the same logical namespace. The root of the physical files, however, can differ from host to host. The second rule is that logical and physical files share the same directory structure, as shown in Fig. 2.

3 Security

We now describe the security mechanism of GFS. A user only needs a user account at a single host of a grid system in order to access GFS files. Therefore, a user who already has an account on one host of the grid system does not need to create new accounts on other machines in order to use GFS. This feature makes GFS easy to deploy in grid systems, since every user in a grid system can use GFS without extra effort from the system administrator.

3.1 Access Control

Whether a user can access a file depends on the identity of that user. The identity of a user is the concatenation of his/her user account and the hostname of the local machine. For example, john@grid01 is the identity of the user john at the host grid01. In GFS, the owner of a logical file can modify the mode of the logical file. GFS follows the traditional UNIX permission mechanism with minor modifications. We now discuss execution permission and read/write permission separately, as follows. For execution permission we consider two cases. First, if the account that the system uses to run the executable is the user in the owner tag of the executable, i.e., john@grid01 in our scenario, then the GFS client simply checks the GFS owner access permission to determine whether it can run this executable. Second, if the account is not the user in the owner tag of the executable, the GFS client checks the GFS access permission of others. If the permission is granted for others, the GFS client loads the executable for execution.
If execution permission for others is denied, the GFS client will not load the executable file. For read/write permission, we classify access to a file into two categories from the point of view of a GFS user. Direct: the permission is determined according to the GFS owner/others permissions. Indirect: the permission is determined according to the GFS group permission. This classification is motivated by the following usage scenario. We assume that John at grid01 wants to execute his program proga at two remote sites, host1 and host2, and that proga reads his file file1 as input data. The two files, proga and file1, are both created in GFS by John.

Now John wishes that file1 can only be accessed by proga. However, when a program is executed in a grid environment, it is usually executed by a temporary account on the remote host. That is, the administrators of the remote hosts decide which account is used to execute the program, and that decision is usually not predictable; it depends on the administration policies of the remote hosts. Thus, it is not possible to make file1 accessible only to proga with the traditional UNIX permission mechanism. Our solution is based on the fact that the program and the input files have the same owner tag, i.e., john@grid01, in GFS. When a user runs an executable file (proga in our scenario) as a process, the local GFS client records the owner tag of this executable file and the process ID (PID) of the process. Note that here we assume that each process is associated with one executable file, and that the process does not invoke other executables. Now, when this process attempts to access a GFS file, the GFS client first gets the owner tag and the GFS access mode of this file from the directory server. If the user identity of the process is the owner of the GFS file, the GFS client simply checks the GFS owner permission to determine whether the process can directly access the file. Otherwise, the GFS client checks the others permission. If the permission is granted for others, the process can also directly access the file. Next, we consider the case in which the permission for others is denied. In this case, the GFS client checks whether the GFS executable of the process has the same owner tag as the GFS file, using the PID and owner-tag pair recorded when the process was created. If they have the same owner tag, the GFS client checks the GFS group permission to determine whether the process can indirectly access the file.
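The decision procedure described above, owner check first, then others, then the group permission when the executable's owner tag matches the file's, can be sketched as follows. Function and parameter names are illustrative, and the mode bits follow the usual Unix owner/group/others convention.

```python
def can_access(file_owner, file_mode, user, exe_owner, want):
    """Sketch of the GFS read/write permission check.
    file_owner: owner tag of the GFS file, e.g. 'john@grid01'
    file_mode:  Unix-style permission bits, e.g. 0o640
    user:       identity of the process owner, e.g. 'temp@host1'
    exe_owner:  owner tag recorded for the process's executable
    want:       requested permission bits, 0o4 = read, 0o2 = write"""
    if user == file_owner:
        # Direct access: check the owner permission bits.
        return ((file_mode >> 6) & want) == want
    if (file_mode & want) == want:
        # Direct access: granted to others.
        return True
    if exe_owner == file_owner:
        # Indirect access: the executable and the file share an owner
        # tag, so the group permission bits decide.
        return ((file_mode >> 3) & want) == want
    # Mismatched owner tags: access denied.
    return False
```

With mode 0o640 on file1, a temporary remote account running proga (same owner tag) gets indirect read access via the group bits, while any other program run by that account is denied.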
If they do not have the same owner tag, the permission is denied.

4 Multiple Source Data Transfer Mechanism

In this section, we introduce the multiple-source data transfer mechanism among GFS hosts. This mechanism improves both the efficiency and the reliability of file transfer by downloading a file from multiple GFS hosts simultaneously. The data transfer mechanism works as follows. When a GFS client receives a user request for a logical GFS file, it sends a LOOKUP request to the file management service to find out where the physical files are. The file management service returns the replica list, which lists the on-line replicas of the requested file, to the GFS client. The GFS client passes the list to the file server at the local host. The file server first checks the list to find out whether a replica is available at the local host. If the local host is in the list, GFS simply uses the local replica. Otherwise, the file server sends a request to each of the hosts in the list to download the file simultaneously from those data sources. A GFS file is divided into blocks of equal size, and the file server requests and receives from the hosts in the replica list only the blocks that lie within the region requested by the user.
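The block-level scheduling can be sketched as follows. The round-robin assignment of blocks to replica hosts and the block size are assumptions for illustration; the paper does not specify how blocks are distributed among the sources.

```python
BLOCK_SIZE = 4 * 1024 * 1024   # illustrative block size (4 MiB)

def plan_download(offset, length, replica_hosts):
    """Map each block overlapping [offset, offset + length) to a source
    host, round-robin over the replica list, so that the blocks can be
    fetched from the different replicas in parallel."""
    first = offset // BLOCK_SIZE
    last = (offset + length - 1) // BLOCK_SIZE
    plan = []
    for i, block in enumerate(range(first, last + 1)):
        host = replica_hosts[i % len(replica_hosts)]
        plan.append((block, host))
    return plan
```

Only blocks inside the requested region appear in the plan, which is exactly what lets GFS serve a partial read without pulling the whole file.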

Note that data transfer can also improve the correctness of GFS host metadata. If a replica is not available, the GFS client reports it back to the file management service, which then marks that replica as off-line to indicate that the replica is not functioning. Note also that if the downloaded fragments constitute a complete file, the GFS client registers this physical file with the file management service as a secondary replica, so that other file servers can download the file from this host.

5 Performance Evaluation

We conduct experiments to evaluate the performance of GFS. We implemented a prototype of GFS on the Taiwan UniGrid system [1], a grid testbed developed among universities and academic institutes in Taiwan. The participating institutes of Taiwan UniGrid are connected by a wide area network. The first set of experiments compares the performance of GFS with two file transfer approaches, SCP and GridFTP [3, 4], both widely used for data transfer among grid hosts. The second set of experiments compares the performance of job execution with and without GFS. The third set of experiments tests autodock [5], a suite of automated docking tools that predict how small molecules, such as substrates or drug candidates, bind to a receptor of known 3D structure.

5.1 Experiment Settings

Fig. 3 illustrates the system configuration in our experiments. Table 1 lists the hardware parameters of the machines. Our prototype GFS implementation uses SQLite [6] and FUSE [7], without any grid middleware. SQLite is a database tool that the GFS directory server uses to keep track of metadata. FUSE is a free Unix kernel module that allows users to create their own file systems without changing the UNIX kernel code.
The FUSE kernel module has been officially merged into the mainstream Linux kernel tree. We used FUSE to implement GFS as a user-level grid file system.

Fig. 3. An illustration of the environment of our experiments. We use four sites in the Taiwan UniGrid system. The directory server resides in host grid01 at National Taiwan University.

Table 1. Hardware configurations in GFS experiments

Machine(s)  grid01       grid02    iisgrid01-08  uniblade02,03  srbn01
CPU         Intel Core2  Intel P4  Intel Xeon    Intel Xeon     Intel P4
            1.86GHz      2.00GHz   3.4GHz        3.20GHz        3.00GHz
Cache       4M           512K      2M            2M             1M
RAM         2G           1G        2G            1G             1G

The directory server was deployed on grid01, a machine located at National Taiwan University, and manages all metadata in our prototype GFS. Each of the other GFS hosts runs a GFS client and a GFS file server. File servers are responsible for the underlying file transfers between GFS sites, and GFS clients are the interfaces between users (programs) and GFS, i.e., users manipulate and access GFS files via GFS clients.

5.2 Experiment Results

We now describe the results from the three sets of experiments. The first set examines the effect of file length on performance. The second set examines the performance of GFS file transfer. The final set examines the job success rate using GFS.

Effects of File Length. In the first set of experiments we perform a number of file copies from a remote site to a local site under different environment settings. Each experiment copies 100 files with sizes ranging from 5MB to 1GB over the different Fast Ethernet switches. These file sizes are common in grid computing, e.g., in the autodock tasks that we tested. Although it is possible to transfer a task to where its data is located, it is more efficient to transfer the data to multiple sites so that a large number of tasks can run in parallel. This is particularly useful when running multiple tasks with different parameter settings. The file transfer commands differ among SCP, GridFTP, and GFS.
In the GridFTP/SCP environments, a dedicated command is invoked for each file to transfer data from a remote GridFTP/SSH server to the local disk. With GFS, after mounting GFS on the directory /gfs on each machine, we use the Unix copy command cp to transfer files. Each file has a master copy and a primary replica in the system, and is downloaded from both simultaneously, since GFS uses the multiple-source data transfer mechanism. Table 2 shows the results from the first set of experiments. For files ranging from 100M to 1G, all three methods have about the same performance, since the network overhead is not significant compared to the disk I/O overhead. However, when the file size ranges from 5M to 50M, our approach has about the same performance as the SCP approach, and is 26% to 43% faster than the popular GridFTP.

Table 2. Performance comparisons of SCP, GridFTP, and GFS. The numbers in the table are performance ratios relative to the transfer time of GFS.

            5M   10M   50M   100M   500M   1G
SCP
GridFTP
GFS

GFS File Transfer. The second set of experiments compares the performance of job execution with and without the GFS multiple-source file transfer mechanism. We run an MPI program, StringCount, that counts the number of occurrences of a given string in a file. The size of each input file is 1 GB. StringCount divides the input file into equal-size segments, and each computing machine is assigned one segment in which to count the occurrences of the given string. In the first setting, we use the GFS file transfer mechanism: we put the executable file and its input files into GFS and execute the string-counting MPI program. GFS file servers transfer these GFS files automatically. Note that the computing machines receive only the necessary segments of the input file, from multiple file replicas simultaneously. In the second setting, we do not use the GFS file transfer mechanism. Instead, we follow the job submission mechanism of the Globus [3] system. Under Globus, the local machine transfers the executable and the entire input file to the computing machines before execution. Users specify the locations of the input files and the executable file in a script file, and GridFTP transfers the files according to the script. The master copies of the executable file and its input files are on the host iisgrid01 and the primary replicas are on the host grid02. For the experiments that do not use GFS file transfer, the executable file and the input files are initially stored on the host iisgrid01. The number of worker machines ranges from 2 to 10. Fig. 4a shows the experimental results. The vertical axis is the execution time and the horizontal axis is the number of worker machines. From Fig. 4a we can see that the execution time under Globus increases as the number of hosts increases.
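The per-worker computation in StringCount can be sketched as below. Extending each segment by len(pattern) - 1 bytes, so that matches straddling a segment boundary are not lost, is a standard trick and an assumption here; the paper does not describe its boundary handling.

```python
def segment_range(file_size, workers, rank):
    """Equal-size byte segment [start, end) for worker `rank`;
    the last worker absorbs the remainder."""
    seg = file_size // workers
    start = rank * seg
    end = file_size if rank == workers - 1 else start + seg
    return start, end

def count_in_segment(data, pattern, start, end):
    """Count occurrences whose match begins inside [start, end).
    The slice is extended by len(pattern) - 1 bytes so that matches
    crossing `end` are still found, and each match is attributed to
    exactly one worker (the one where it begins)."""
    window = data[start:end + len(pattern) - 1]
    count, i = 0, window.find(pattern)
    while i != -1 and start + i < end:
        count += 1
        i = window.find(pattern, i + 1)  # step by 1: overlaps allowed
    return count
```

Summing `count_in_segment` over all ranks reproduces the single-machine count, which is what lets each worker fetch only its own segment (plus a few boundary bytes) from the replicas.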
This overhead is due to Globus transferring the entire input file between the worker machines and iisgrid01, which holds the input files. The execution time with GFS is much shorter, because the worker machines only need to fetch the necessary segments of the input file rather than the entire file, which greatly reduces the communication cost. Although it is possible for a programmer to use the GridFTP API to transfer only the necessary parts of an input file, it takes considerable effort to learn the API and to modify existing MPI programs. Another drawback is that, once modified, the program cannot run in grid systems that do not have GridFTP, such as a cluster without Globus. In contrast, our GFS approach does not require a user to change his program, since GFS works at the file system level.

Job Success Rate. The third set of experiments uses autodock [5] to illustrate that GFS improves the success rate of submitted jobs under Taiwan UniGrid.

Fig. 4. (a) Execution time comparison under Globus and GFS (results of the second set of experiments); (b) the number of completed jobs with respect to elapsed time (results of the third set of experiments).

When we submit a job to Taiwan UniGrid, the job may fail to complete because jobs assigned to the same host may request input data or the executable simultaneously. The amount of simultaneous traffic can then exceed the capacity of GridFTP at that site, and jobs fail to execute. In our experience, the failure rate is about 18.44% when we submit 100 jobs with two GridFTP servers [8]. GFS solves this I/O bottleneck problem by bypassing GridFTP and using a more efficient mechanism to transfer data, so that jobs execute successfully. In a previous paper, Ho et al. [8] reported that under the current Taiwan UniGrid Globus setting, the failure rate of an autodock task ranges from 18.44% to 52.94%, depending on the method of arranging executables and input files. The main reason for this high failure rate is the I/O bottleneck due to the capacity limitation of GridFTP; consequently, Globus GRAM jobs cannot stage in the executable program properly. The same problem occurs when tasks read the input files. When we use GFS, the GRAM resource file file.rsl specifies only the executable and its arguments, since the other information is implied by the GFS file system. For example, the value of the executable is a local file path such as /gfs/autodock, since GFS is treated as a local file system. The arguments of the executable file are specified as usual. The input data and the output data are accessible through GFS, so they are not required in file.rsl. Fig. 4b shows the results of virtual screening (a core computation of autodock) obtained by screening a database of 100 ligands against the avian influenza virus (H5N1) [9]. The job success rate is 100%, i.e., every submitted task completes successfully. In other words, GFS overcomes the I/O bottleneck that prevents multiple GRAM jobs from staging in the executable program due to the capacity limit of GridFTP.

6 Conclusion

To cope with the difficulties of running data-intensive applications with unknown data requirements and potential I/O bottlenecks in grid environments, we design

Grid File System (GFS), which provides a UNIX-like API and the same namespace and semantics as if the files were stored on a local machine. GFS has the following advantages. First, GFS uses the standard file I/O interfaces available in every UNIX system; therefore, applications need no modification to access remote GFS files. Second, GFS supports partial file access and a replication mechanism for fault tolerance. Third, GFS accesses remote files with a multi-source data transfer mechanism, which improves the data transfer rate by 26% to 43% compared with GridFTP and in turn enhances overall system performance. Fourth, GFS is a user-space file system that does not require kernel modification; therefore, it can be easily deployed in any Unix-like environment without the help of system administrators. We also plan to integrate authentication mechanisms such as GSI or PKI into a future release of GFS, and to conduct more experiments comparing GFS with other grid-enabled distributed file systems, such as XtreemFS [10].

Acknowledgement

The authors would like to thank the anonymous reviewers for their valuable advice. This research is supported in part by the National Science Council, Republic of China, under Grant NSC E, and by Excellent Research Projects of National Taiwan University, 97R.

References
1. Taiwan UniGrid project,
2. Howard, J., Kazar, M., Menees, S., Nichols, D., Satyanarayanan, M., Sidebotham, R., West, M.: Scale and performance in a distributed file system. ACM Transactions on Computer Systems (TOCS) 6(1) (1988)
3. Globus toolkit,
4. Allcock, W., Foster, I., Tuecke, S., Chervenak, A., Kesselman, C.: Protocols and services for distributed data-intensive science. Advanced Computing and Analysis Techniques in Physics Research 583 (2001)
5. Autodock docking tools,
6. SQLite,
7. Filesystem in Userspace (FUSE),
8.
Ho, L.-Y., Liu, P., Wang, C.-M., Wu, J.-J.: The development of a drug discovery virtual screening application on taiwan unigrid. In: The 4th Workshop on Grid Technologies and Application (WoGTA 2007), Providence University, Taichung,Taiwan (2007) 9. Russell, R., Haire, L., Stevens, D., Collins, P., Lin, Y., Blackburn, G., Hay, A., Gamblin, S., Skehel, J.: Structural biology: antiviral drugs fit for a purpose. Nature 443, (2006) 10. Xtreemfs,
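The transparency claim above, that unmodified applications access remote GFS files through standard file I/O, can be illustrated with a minimal sketch. This is an illustration under stated assumptions, not code from the paper: it assumes GFS is mounted in the file namespace (e.g., via FUSE), so the same POSIX calls work whether the path is local or resides under a hypothetical GFS mount point; the function name and path are made up for the example.

```c
/* Sketch (assumption, not from the paper): because GFS exposes a
 * UNIX-like namespace, the ordinary POSIX calls below -- open, write,
 * read, close -- would work unchanged if `path` pointed under a GFS
 * mount point instead of a local directory. */
#include <fcntl.h>
#include <string.h>
#include <unistd.h>

/* Write a message to `path`, read it back, and verify the contents.
 * Returns 0 on success, -1 on any failure. */
int demo_roundtrip(const char *path)
{
    const char *msg = "hello from standard I/O\n";

    /* Create and fill the file with plain POSIX calls. */
    int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd < 0)
        return -1;
    if (write(fd, msg, strlen(msg)) != (ssize_t)strlen(msg)) {
        close(fd);
        return -1;
    }
    close(fd);

    /* Read it back with the same unmodified calls. */
    fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;
    char buf[128];
    ssize_t n = read(fd, buf, sizeof buf - 1);
    close(fd);
    unlink(path);
    if (n < 0)
        return -1;
    buf[n] = '\0';
    return strcmp(buf, msg) == 0 ? 0 : -1;
}
```

A program written this way needs no relinking or source changes to use GFS; only the mount point in the path differs, which is precisely what a user-space (FUSE-based) file system makes possible.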


More information

LS-DYNA Best-Practices: Networking, MPI and Parallel File System Effect on LS-DYNA Performance

LS-DYNA Best-Practices: Networking, MPI and Parallel File System Effect on LS-DYNA Performance 11 th International LS-DYNA Users Conference Computing Technology LS-DYNA Best-Practices: Networking, MPI and Parallel File System Effect on LS-DYNA Performance Gilad Shainer 1, Tong Liu 2, Jeff Layton

More information

The Grid Monitor. Usage and installation manual. Oxana Smirnova

The Grid Monitor. Usage and installation manual. Oxana Smirnova NORDUGRID NORDUGRID-MANUAL-5 2/5/2017 The Grid Monitor Usage and installation manual Oxana Smirnova Abstract The LDAP-based ARC Grid Monitor is a Web client tool for the ARC Information System, allowing

More information

A NEW DISTRIBUTED COMPOSITE OBJECT MODEL FOR COLLABORATIVE COMPUTING

A NEW DISTRIBUTED COMPOSITE OBJECT MODEL FOR COLLABORATIVE COMPUTING A NEW DISTRIBUTED COMPOSITE OBJECT MODEL FOR COLLABORATIVE COMPUTING Güray YILMAZ 1 and Nadia ERDOĞAN 2 1 Dept. of Computer Engineering, Air Force Academy, 34807 Yeşilyurt, İstanbul, Turkey 2 Dept. of

More information