High Throughput WAN Data Transfer with Hadoop-based Storage
A Amin (2), B Bockelman (4), J Letts (1), T Levshina (3), T Martin (1), H Pi (1), I Sfiligoi (1), M Thomas (2), F Wuerthwein (1)

(1) University of California, San Diego, 9500 Gilman Dr, San Diego, CA 92093, USA
(2) California Institute of Technology, East California Blvd, Pasadena, CA 91125, USA
(3) Fermi National Accelerator Laboratory, P.O. Box 500, Batavia, IL 60510, USA
(4) University of Nebraska-Lincoln, 118 Schorr Center, Lincoln, NE 68588, USA

Abstract
The Hadoop Distributed File System (HDFS) has become increasingly popular in recent years as a key building block of integrated grid storage solutions in the field of scientific computing. Wide Area Network (WAN) data transfer is one of the most important data operations for large high-energy physics experiments, which must manage, share and process petabyte-scale datasets in a highly distributed grid computing environment. In this paper, we present our experience with high-throughput WAN data transfer using an HDFS-based Storage Element. Two transfer tools, GridFTP and Fast Data Transfer (FDT), are used to characterize the network performance of WAN data transfer.

1. Introduction
Hadoop [1] is a Java-based framework that includes a distributed file system and tightly integrated applications able to quickly process the large volumes of data stored in that file system. The framework is used extensively in the search-engine and advertising businesses by Yahoo! and other companies, at scales of up to thousands of servers and petabytes of raw data. The Hadoop Distributed File System (HDFS) is the key component of Hadoop: it is designed to store data on commodity hardware while still achieving high throughput for data access, a property that derives mainly from the simple write-once-read-many data coherency model used by HDFS. The reliable data replication and failure detection implemented in Hadoop enable fast and automatic system recovery. The emphasis on high throughput rather than low latency makes HDFS appealing for batch processing, and the portability of Hadoop eases integration with heterogeneous hardware and software platforms.
In 2009, a few computing centers of the Compact Muon Solenoid (CMS) experiment at the Large Hadron Collider (LHC) adopted HDFS as the principal technology for building a fully functional Storage Element (SE) [2]. This effort was endorsed by the Open Science Grid (OSG) [3]. Later, a number of computing centers, ranging from mid-scale university sites to large-scale national laboratories, became involved in the testing and deployment of HDFS.
Wide Area Network (WAN) data transfer is one of the most important data operations for large high-energy physics experiments, which must manage, share and process petabyte-scale datasets in a highly distributed grid computing environment. In CMS, all official data are shared among more than 50 sites organized in a tiered architecture. Handling WAN data transfer effectively, from multiple sources to multiple sinks at this scale, is a challenge from the networking point of view. CMS developed a dedicated data management layer called PhEDEx [4] (Physics Experiment Data Export), which runs agents at each site that initiate and control a series of processes for the WAN data transfer, as shown in Fig. 1.
Currently, more than 2000 WAN links among CMS computing sites serve data to thousands of users across the collaboration. Efficient WAN data transfer speeds up the data flow and increases the utilization of idle resources (CPU and storage) that are waiting for user jobs, which in general have very specific demands for particular datasets. A number of issues affect the availability of datasets at a given site and the development of an automatic system that dynamically moves data based on user demand: limited local storage space, limited network bandwidth for transferring data, unreliable data links that introduce large overheads and significant delays, and unpredictable, constantly changing user demand. At the organizational level, data bookkeeping, synchronization, reduction and replication are even more important issues, since they affect almost all users and computing sites. All of these data-operation issues require a reliable, high-throughput WAN data transfer service that is seamlessly integrated with every major computing site and with the various grid middleware platforms, in order to meet the needs of the collaboration.
In this paper, we focus on several important WAN data transfer issues for HDFS-based storage. This work is not only a continuation of the effort to enhance the HDFS-based SE technology, but also a study of WAN networking tools aimed at maximizing the utilization of network bandwidth. We believe this work will eventually be integrated into a network-based solution for a more dynamic data distribution model for high-energy physics experiments and other scientific projects.

Figure 1. CMS WAN data transfer managed by PhEDEx

2. Architecture of HDFS-based Storage Element and WAN transfer interface
The following is a short review of the architecture of the HDFS-based Storage Element (SE):
- The HDFS namenode (NN) provides the namespace of the SE file system.
- The HDFS datanodes (DN) provide the distributed storage space of the SE. In the typical deployment, compute nodes can also be configured as datanodes.
- FUSE, mounted on the HDFS and BeStMan servers, provides a POSIX-compliant interface to HDFS for local data access to the SE.
- GridFTP servers, either on dedicated hosts or shared with datanodes or compute nodes, provide grid data transfer to and from the SE.
- The BeStMan [5] server provides the SRM v2.2 interface of the SE and performs load balancing of the GridFTP servers.
The interface between GridFTP and the storage system (e.g. a distributed file system) can be a serious bottleneck for high-throughput data transfer. In an architecture that uses a distributed file system across the cluster, the worker-node I/O involves internal data replication by the file system, external WAN reads or writes of data, and local data access by user jobs. In the HDFS-GridFTP integration, GridFTP must use the native HDFS client to access the file system and must buffer the incoming data in memory, because HDFS does not support asynchronous (out-of-order) writes and in some cases GridFTP cannot write data to disk storage fast enough. With the current system integration and data access strategy, most of the system-dependent implementation resides at the GridFTP level. An overview of the typical SE architecture using BeStMan is shown in Fig. 2. The set of services implemented by the SE, from data transfer to data access, involves various components of the storage system and grid middleware, as shown in Fig. 3.
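To make the buffering scheme concrete, the sketch below illustrates the idea described above: data blocks that arrive from the network out of order are staged in memory and flushed to the HDFS-backed file strictly in offset order, since HDFS only accepts sequential writes. This is a minimal illustrative sketch, not the actual GridFTP-HDFS module; it assumes HDFS is reachable through the FUSE mount described earlier, so plain POSIX file I/O suffices, and the file paths used are hypothetical.

class SequentialHdfsWriter:
    """Stage out-of-order blocks in memory; write them to HDFS strictly in order."""

    def __init__(self, path):
        # 'path' would typically point into the FUSE mount of HDFS,
        # e.g. /mnt/hadoop/store/user/test.bin (hypothetical path).
        self.out = open(path, "wb")
        self.next_offset = 0   # next byte offset expected by HDFS
        self.pending = {}      # offset -> data, blocks waiting for earlier data

    def write_block(self, offset, data):
        """Accept one block from the network, arriving in any order."""
        self.pending[offset] = data
        # Flush every block that is now contiguous with what is already written.
        while self.next_offset in self.pending:
            chunk = self.pending.pop(self.next_offset)
            self.out.write(chunk)              # sequential write into HDFS
            self.next_offset += len(chunk)

    def close(self):
        assert not self.pending, "transfer incomplete: missing earlier blocks"
        self.out.close()

# Example: blocks received out of order are still written sequentially.
w = SequentialHdfsWriter("/tmp/demo.bin")
w.write_block(4, b"5678")   # staged, offset 0 not yet received
w.write_block(0, b"1234")   # triggers sequential writes of offsets 0 and 4
w.close()

The size of the pending buffer is exactly the memory cost discussed in Section 4: the more out-of-order data a transfer produces (for example with many parallel streams), the more memory the staging buffer requires.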
If there is no explicit bottleneck at the application level (file system, software, user jobs), the throughput of WAN transfer and local data access will be limited mainly by the network link bandwidth or by the local disk I/O, both of which are properties of the hardware infrastructure. High-throughput WAN transfer is one of the data operations that can challenge HDFS in several ways:
Figure 2. Architecture of HDFS-based Storage Element

- Various data access patterns and use cases may cause scalability problems for certain services. For example, a large volume of data access to small files in the SE introduces significant overhead in BeStMan and in the file system.
- Once the WAN transfer throughput reaches 10 gigabits per second (Gbps), a scalable distributed file system is needed to read or write data efficiently and simultaneously across the whole cluster.
- Local file access shouldn't be affected by high-throughput WAN transfer. This requires HDFS to have enough headroom to ensure that the serving of data is not blocked by its weakest point, such as the slowest node or a hot node that holds the data or storage space needed by the applications. Data redundancy and replication play important roles in the stability and scalability of HDFS under sustained high-throughput data transfer.
- Normal HDFS operation shouldn't be affected by high-throughput data transfer, which may last for hours or days. For HDFS to reach its maximal efficiency, its internal routine data operations must be able to run, for example the replication of data blocks and the balancing and distribution of files across all the datanodes.

Figure 3. File transfer and data access between various components of the HDFS-based SE

3. BeStMan Scalability and Testing Tool
The first major release of BeStMan was based on the Globus container; this flavor of BeStMan is currently widely deployed in production at many OSG grid sites. BeStMan2, the second major release, is based on the Jetty server and is now in a test release. It is expected to eventually replace the Globus container, which has lacked maintenance in recent years and shows limitations in scalability and reliability.
GlideinWMS (Glide-in based Workload Management System) [6] was used to measure the BeStMan scalability. GlideinWMS is a grid workload management system that automates Condor glide-in configuration, submission and management, and it is widely used by several large scientific grids to handle user jobs for physics analysis and Monte Carlo production. In our work, we used two methods based on GlideinWMS to benchmark BeStMan (the performance of BeStMan2 using the Glidein Tester is shown in Fig. 5):
- Use GlideinWMS to send jobs to typical grid sites and use the normal job slots at those sites. The test application is shipped by the glide-in and runs on the worker node. In this way, the randomness of the user job access
pattern is repeated in the test environment, since the availability of job slots at the remote sites, the type of worker node, and all the local hardware and WAN networking features are the same as for normal grid jobs. The performance of BeStMan using GlideinWMS is shown in Fig. 4.
- Use the Glidein Tester [7], a fully automated framework for scalability and reliability testing of centralized, network-facing services using Condor glide-ins. The main purpose of this testing system is to simplify test configuration and operation. The system actively acquires and monitors the number of free CPU slots in the Condor pool and sends testing jobs to those slots when certain conditions are met, for example when we want to run X jobs simultaneously.
We measured the scalability of the recently released OSG versions of BeStMan and BeStMan2. There is a significant difference in the scalability of the two BeStMan technologies: BeStMan2 shows better performance and reliability from the point of view of running a high-throughput, highly available service.

Figure 4. Correlation between Effective Processing Rate and Client Concurrency for small directories

Figure 5. Correlation between Effective Processing Rate and Client Concurrency for large directories (a) and correlation between Effective Processing Time and Client Concurrency for large directories (b)

4. WAN Data Transfer with UCSD HDFS-based SE
Fig. 6 shows typical hours of the UCSD SE moving CMS datasets to or from other computing sites over two 10 Gbps links, with a maximal data transfer rate of ~15 Gbps. GridFTP, managed by BeStMan and PhEDEx, is the main WAN transfer tool for this operation. It is important to understand how to fill the network pipe efficiently. Most of the effort concentrates on optimizing the number of streams used by each GridFTP transfer and the total number of GridFTP transfers that run simultaneously. Because HDFS cannot support asynchronous writes, it is widely confirmed that for sustained high-throughput WAN transfer a single stream per GridFTP transfer is the best solution. The GridFTP server implements an HDFS interface that uses a local buffer to keep all incoming data in memory or on the local file system before the data are written sequentially to HDFS storage. If GridFTP is not fast enough to sink the incoming data into the HDFS-based SE, the buffer fills up quickly, so the existence of the buffer does not by itself increase the overall WAN transfer throughput. A second issue is that multiple streams per transfer require a large buffer to hold all the out-of-order packets, and the buffer size is limited by the amount of system memory. A third issue is that the local buffer introduces significant system I/O overhead before the data are written to HDFS, which has a negative impact on the sustainable throughput.
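The dependence of aggregate and per-client throughput on client concurrency, shown in Figs. 7 and 8 below, can be explored with a simple load-generation harness of the kind sketched here. This is an illustrative sketch under stated assumptions, not the framework used for the measurements in this paper: run_one_transfer is a placeholder for whatever single-stream GridFTP or FDT copy invocation a site uses, and the test file size is an arbitrary assumption.

import time
from concurrent.futures import ThreadPoolExecutor

FILE_SIZE_GB = 2.0  # size of the test file moved by each client (assumed)

def run_one_transfer(_):
    """Placeholder for one single-stream transfer (e.g. a GridFTP or FDT copy)."""
    time.sleep(1.0)             # replace with the real transfer invocation
    return FILE_SIZE_GB * 8.0   # gigabits moved by this client

def measure(n_clients):
    """Run n_clients transfers concurrently; return (total, per-client) Gbps."""
    start = time.time()
    with ThreadPoolExecutor(max_workers=n_clients) as pool:
        gigabits = sum(pool.map(run_one_transfer, range(n_clients)))
    elapsed = time.time() - start
    return gigabits / elapsed, gigabits / elapsed / n_clients

for n in (10, 20, 30, 60):      # client counts used in Fig. 7
    total, per_client = measure(n)
    print("%2d clients: %6.2f Gbps total, %5.2f Gbps per client" % (n, total, per_client))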
Figure 6. WAN transfer throughput using GridFTP with two 10 Gbps links, measured at the UCSD HDFS-based SE

Figure 7. WAN transfer throughput in a 10 Gbps link with 10 GridFTP clients (top left), 20 GridFTP clients (top right), 30 GridFTP clients (bottom left) and 60 GridFTP clients (bottom right)

Figure 8. Correlation between WAN transfer throughput and number of GridFTP clients: total throughput vs number of GridFTP clients (left) and average individual throughput vs number of GridFTP clients (right)

Figure 9. WAN transfer throughput using FDT with two 10 Gbps links, measured at the UCSD HDFS-based SE
We used various numbers of GridFTP clients to measure the WAN throughput, as shown in Fig. 7. In order to fill the 10 Gbps link effectively, more than 30 clients are necessary. There is a very strong correlation between the total throughput and the number of GridFTP clients, while the data transfer rate per GridFTP client actually decreases as the number of clients increases, as shown in Fig. 8. We also used Fast Data Transfer (FDT) for the WAN transfer, with 30 clients filling two 10 Gbps links as shown in Fig. 9, and observed performance similar to that of the GridFTP-based transfer.
In the following, we list several important aspects of the system architecture at UCSD that have a non-trivial impact on the performance of the data transfer:
- The physical limit of each HDFS datanode. The throughput of a single node writing to HDFS is ~60-80 MB/s. HDFS normally writes a file block to the local disk, but, being a distributed file system with data-redundancy requirements, its automatic replication writes a second copy of each block to other nodes. To support up to 15 Gbps of WAN throughput, at least 30 WAN transfer agents spread over the datanodes need to run simultaneously in order to sink the data fast enough; a back-of-envelope estimate is sketched at the end of this section. This is consistent with the observed WAN throughput versus the number of GridFTP clients.
- The use of HDFS datanodes as WAN transfer servers and clients. Ideally the WAN transfer server should mainly deal with networking, but the transfer software needs to buffer the data in local memory first, which introduces considerable system I/O and CPU utilization and has a negative impact on the throughput and on the performance of HDFS.
The optimization of the transfer software configuration remains a critical task, and it has to be tuned according to each site's hardware, expected use cases, data access patterns, and so on.
Looking ahead to the availability of 100 Gbps networks for the scientific community, we believe the study of WAN transfer for HDFS-based SEs will remain an important area of work, because:
- The continuously growing size of storage systems will make HDFS an even more attractive solution for an SE. Its highly distributed and scalable nature leaves plenty of room for integration with existing and new grid middleware to conduct high-throughput data transfer.
- The availability of 100 Gbps networks requires efficient ways to move data between the grid middleware and HDFS. We expect either a scale-out strategy that adds more parallel clients, or new technology that uses a smaller number of hosts that are more capable of handling the disk I/O. The WAN transfer tool must be able to fill the network pipe more efficiently and control the network congestion that can arise from running too many simultaneous, independent WAN transfer streams.
- Finally, multi-core systems and the increasing size of hard drives might make the overall disk I/O throughput a potential bottleneck for high-throughput WAN transfer, because relatively fewer disks are available for the data operations. This can be addressed by improving the HDFS architecture, incorporating high-end hardware into the HDFS storage, and making the data transfer tools seamlessly integrated with the system.
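As a cross-check of the first factor listed above, the required number of concurrent transfer agents can be estimated directly from the per-datanode write ceiling. The short calculation below is only a back-of-envelope sketch using the numbers quoted in this section, and it reproduces the roughly 30 agents observed to be necessary.

# Back-of-envelope estimate of the number of concurrent single-stream transfers
# needed to sustain the target WAN rate, using the figures quoted above.
target_gbps = 15.0          # ~15 Gbps achieved over two 10 Gbps links
per_node_write_mb_s = 70.0  # single-datanode HDFS write throughput, ~60-80 MB/s

target_mb_s = target_gbps * 1000.0 / 8.0            # 15 Gbps ~= 1875 MB/s
agents_needed = target_mb_s / per_node_write_mb_s   # ~27, i.e. roughly 30 agents

# HDFS replication writes a second copy of every block to another datanode,
# so the cluster-wide disk write load is roughly twice the WAN ingest rate.
print("approximately %.0f concurrent transfer agents needed" % agents_needed)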
5. Summary
In this paper, we presented our experience with high-throughput WAN data transfer using an HDFS-based Storage Element. Two transfer tools, GridFTP and Fast Data Transfer (FDT), were used to characterize the network performance of WAN data transfer, and the WAN transfer configuration was optimized to maximize the throughput. We see very high scalability at the middleware level (HDFS and BeStMan) for WAN transfer. Various issues related to understanding the results and possible future improvements have been discussed.

References
[1] Hadoop,
[2] Pi H et al., Hadoop Distributed File System for the Grid, Proceedings of IEEE Nuclear Science Symposium and Medical Imaging Conference, 2009
[3] The Open Science Grid Executive Board, A Science Driven Production Cyberinfrastructure: the Open Science Grid, OSG Document 976, docid=976
[4] Rehn et al., PhEDEx high-throughput data transfer management system, Proceedings of Computing in High Energy Physics, 2006
[5] Berkeley Storage Manager (BeStMan),
[6] GlideinWMS,
[7] Sfiligoi I et al., Using Condor Glideins for Distributed Testing of Network Facing Services, Proceedings of Computational Science and Optimization, 2010