A high-speed data processing technique for time-sequential satellite observation data

Ken T. Murata 1a), Hidenobu Watanabe 1, Kazunori Yamamoto 1, Eizen Kimura 2, Masahiro Tanaka 3, Osamu Tatebe 3, Kentaro Ukawa 4, Kazuya Muranaga 4, Yutaka Suzuki 4, and Hirotsugu Kojima 5

1 National Institute of Information and Communications Technology, 4-2-1 Nukui-Kitamachi, Koganei, Tokyo 184-8795, Japan
2 Ehime University, Shitsukawa, Toon City, Ehime 791-0295, Japan
3 University of Tsukuba, 1-1-1 Tennodai, Tsukuba, Ibaraki 305-8577, Japan
4 Systems Engineering Consultants Co., LTD., Setagaya Business Square, 4-10-1 Yoga, Setagaya-ku, Tokyo 158-0097, Japan
5 Research Institute for Sustainable Humanosphere, Kyoto University, Gokasho, Uji, Kyoto 611-0011, Japan
a) ken.murata@nict.go.jp

Abstract: A variety of satellite missions are carried out every year. Most of these satellites yield big data, and high-performance data processing technologies are in strong demand. We have been developing a cloud system (the NICT Science Cloud) for big data analyses of Earth and space observations via spacecraft. In the present study, we propose a new technique for processing big data, built on the fact that high-speed I/O (data file read and write) matters more than raw data processing speed. We adopt a task scheduler, Pwrake, for easy development and management of parallel data processing. Using a long-term scientific satellite data set (the GEOTAIL satellite), we examine the performance of the system on the NICT Science Cloud and achieve data processing more than 100 times faster than on traditional data processing environments.

Keywords: earth and space observation data, data processing, science cloud, parallel file system, Pwrake, parallel processing

Classification: Sensing

References

[1] K. T. Murata, S. Watari, T. Nagatsuma, M. Kunitake, H. Watanabe, K. Yamamoto, Y. Kubota, H. Kato, T. Tsugawa, K. Ukawa, K. Muranaga, E. Kimura, O. Tatebe, K. Fukazawa, and Y. Murayama, A Science Cloud for Data Intensive Sciences, Data Science Journal, vol. 12, pp. WDS139-WDS146, April 2013.
[2] H. Matsumoto, I. Nagano, R. R. Anderson, H. Kojima, K. Hashimoto, M. Tsutsui, T. Okada, I. Kimura, Y. Omura, and M. Okada, Plasma wave observations with GEOTAIL spacecraft, J. Geomag. Geoelectr., vol. 46, pp. 59-95, 1994.
[3] K. T. Murata, A Software System Designed for Solar-Terrestrial Data Analysis and Reference via OMT Methodology, Proceedings of the 2nd EU-ECN Joint Seminar 2001, Matsuyama, Japan, pp. 16-22, Nov. 2001.
[4] J. Shafer, I/O virtualization bottlenecks in cloud computing today, Proceedings of the 2nd Conference on I/O Virtualization (WIOV '10), p. 5, 2010.
[5] M. Tanaka and O. Tatebe, Large-scale data processing with Pwrake, a parallel and distributed workflow system, JAXA Research and Development Report: Journal of Space Science Informatics Japan, vol. 1, JAXA-RR-11-007, pp. 67-75, May 2012.
[6] K. T. Murata, H. Watanabe, K. Yamamoto, Y. Kubota, O. Tatebe, M. Tanaka, K. Fukazawa, E. Kimura, K. Ukawa, K. Muranaga, Y. Suzuki, and F. Isoda, A Parallel Processing Technique on the NICT Science Cloud via Gfarm/Pwrake, IPSJ SIG Technical Report, High Performance Computing, 2013-HPC-139, no. 9, pp. 1-6, 2013.

1 Introduction

A variety of satellite missions, for Earth and environmental observation, space physics and astrophysics, and commercial use, are carried out every year. Since every satellite mission is a big project with a tremendous cost, effective and fruitful results are expected of each one. Because the design and implementation of satellite bodies and instruments are developed under long-term projects, the lifetime of each satellite tends to be long, and the observed data keep growing in scale. Some present-day satellites yield more than 1 TB a day, which leads to more than 300 TB archived in data storage every year. However, we currently have no general hardware systems or software techniques to process such big data in either scientific or operational satellite missions. We therefore end up constructing a system specialized for the big data of each satellite mission, at a cost that is no longer negligible.

Cloud computing provides users with big data processing environments that can be customized for their own purposes. In the present study, we propose a large-scale data processing system and technique designed for general satellite data, working on a science cloud. We pay attention to the fact that, in most cases of archived data processing, the data are large but the processing itself is not complicated, especially in the first-stage data survey of a satellite mission. It should also be noted that, when processing data files on a computer, I/O (file read and write) time accounts for a large portion of the overall processing time compared with the data processing time.
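As a concrete illustration of this read/process/write breakdown, the following is a minimal sketch in Ruby (the language of the Pwrake toolchain adopted later in this paper); it is not the actual STARS code, and the file names and the processing step are hypothetical placeholders.

# Minimal sketch: time the read, process and write phases of one file task
# separately. Not the STARS code; file names and the processing step are
# hypothetical placeholders.

def timed
  t0 = Process.clock_gettime(Process::CLOCK_MONOTONIC)
  result = yield
  [result, Process.clock_gettime(Process::CLOCK_MONOTONIC) - t0]
end

# Stand-in for the real computation (e.g. building a dynamic spectrum plot).
def process_data(bytes)
  bytes.reverse
end

data, t_read  = timed { File.binread("sfa_19920901.cdf") }        # hypothetical input
plot, t_proc  = timed { process_data(data) }
_,    t_write = timed { File.binwrite("sfa_19920901.png", plot) } # hypothetical output

puts format("read %.2f s, process %.2f s, write %.2f s", t_read, t_proc, t_write)

Separating the phases in this way is what makes it visible that, for simple survey processing, the read and write times dominate the processing time.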
2 High-Speed I/O using Parallel File System

Science cloud is a novel concept of cloud-based technologies for realizing data-intensive scientific research and operations that are not well served by current supercomputers, grids and HPC clusters. The NICT Science Cloud [1] is one such cloud system designed for big data science.

In the present study, we discuss a data processing system and technique for long-term observation data of the GEOTAIL satellite. GEOTAIL is a scientific satellite observing the Earth's magnetosphere. It was launched in 1992 and is still in operation. Seven mission instruments are equipped on the satellite, one of which is the PWI (Plasma Wave Instrument) [2]. The PWI carries out three types of observations: SFA (Sweep Frequency Analyzer), MCA (Multi-Channel Analyzer) and WFC (Wave Form Capture). Level-2 SFA data are in CDF format. Each Level-2 file is 7.3 MB in size, amounting to 189 GB for 20 years (from September 1992 to December 2011); one SFA file contains six hours of observation data, and 27,576 SFA files have been saved over more than 20 years of continuous observation. To read, plot and analyze the SFA data, we have developed STARS (Solar-Terrestrial data Analysis and Reference System) [3]. STARS offers a GUI for interactive operation and also provides functions that work in batch mode.

Fig. 1. An example of GEOTAIL satellite data processing: read and write time (I/O time) and processing time.

Fig. 1 shows an example of satellite data processing: creating a dynamic spectrum plot of GEOTAIL PWI/SFA data on a general data processing server (CPU: Xeon X5550 2.67 GHz, OS: openSUSE 11.1). It takes about 2.7 sec to process one file, and hence 74,455 sec (close to one day) to produce plots for the 20 years of data. This means that interactive operation is impossible on a single machine, and parallel processing is crucial. What should be noted in Fig. 1 is the I/O time: the time spent reading the data file from storage and writing the plot file back to storage is not negligible in satellite data processing. This I/O cost becomes even more serious in parallel data processing [4]. With a general network file system (Fig. 2 (1)), in which all data files are saved on a simple NAS storage, the I/O speed per process degrades in inverse proportion to (or faster than) the number of parallel processes. For example, if one processes GEOTAIL PWI/SFA data files with 100 parallel processes, it takes 145 sec or more to read one 7.3 MB file.
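This degradation is roughly what a back-of-the-envelope contention model predicts (our illustrative model, not an analysis taken from the measurements): if the NAS delivers an aggregate read bandwidth $B$ shared by $n$ concurrent processes, the time to read one file of size $s$ is at least

$$ t_{\mathrm{read}}(n) \;\gtrsim\; \frac{n\,s}{B}. $$

With $s = 7.3$ MB and $n = 100$, the observed $t_{\mathrm{read}} \approx 145$ sec implies an effective aggregate bandwidth of only $B \approx 100 \times 7.3\,\mathrm{MB} / 145\,\mathrm{s} \approx 5$ MB/s, far below the nominal network speed; contention and random access on the shared storage make the scaling even worse than linear.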
Fig. 2. (1) A general network file system without a task scheduler and (2) a parallel file system with a task scheduler; an example with four data processing servers. In each panel, the data files, in time-sequential order, are saved on the network storage.

To overcome this I/O bottleneck, we prepared a parallel file system (Fig. 2 (2)) on the NICT Science Cloud. The present data processing system consists of 10 data processing servers connected to a parallel file system via a 10 GbE network. The throughput between the switch and each disk array (7.4 to 7.6 Gbps) is an experimental (not specified) value. The total disk size is 620 TB, which is sufficient for the present study, and all 20 years of GEOTAIL PWI/SFA data are saved on this parallel storage. Since there are four management servers, the total data transfer speed is theoretically 40 Gbps; examining the basic I/O performance of the system, we find an experimental I/O speed of about 30 Gbps. Each data processing server has a 12-core CPU, and the hyper-threading (HT) setting, in which two virtual cores are seen for each physical core, is enabled on every server. The combination of two virtual cores per physical core, 12 cores per server and 10 servers makes high parallelization available: as many as 240 cores in total. The storage is mounted via GPFS (General Parallel File System), one of the parallel file systems, on each server at the same mount point (/gpfs/nfs/).
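The headline figures of this configuration follow directly (the ~30 Gbps value being measured, not specified):

$$ N_{\mathrm{cores}} = 10 \times 12 \times 2 = 240, \qquad B_{\mathrm{theoretical}} = 4 \times 10\ \mathrm{Gbps} = 40\ \mathrm{Gbps}, $$

so the measured 30 Gbps corresponds to about 75% of the theoretical aggregate bandwidth.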
3 Task Scheduling for Time-Sequential Data Files

In the present study, we perform parallel data processing of the 27,576 data files, using up to 240 cores of the data processing system discussed in Section 2. Many types of satellite data are saved in a time-sequential format, especially Level-2 and Level-3 data: one file usually corresponds to a fixed span of observation time, e.g. one month, one week, one day, one hour or one minute. Each GEOTAIL PWI/SFA file contains 6 hours of data. A data processing program usually handles one data file at a time, and there is no dependence or interaction between the processing of different data files. Since the STARS system [3], developed to process GEOTAIL PWI/SFA files, is already available, it is neither reasonable to develop another data processing program nor worth the heavy work of rewriting the existing program (STARS) for parallel processing with an inter-process library such as MPI.

For easy but effective parallel data processing on the NICT Science Cloud, we therefore adopt a task scheduler. Taking into account ease of setup, high performance, and future applications to other big data processing on the NICT Science Cloud, we chose Pwrake (Parallel Workflow extension for RAKE) [5]. Fig. 2 (2) shows a schematic picture of the Pwrake functions. The basic function of Pwrake is to allocate data files to the cores of the data processing servers. The user first prepares a Rakefile on the Pwrake controller; the Rakefile describes the server list with the number of cores on each server, together with the list of data files to be processed. The Pwrake controller then schedules tasks dynamically: it allocates a data file to a registered core on any server, and as soon as a core finishes processing one file, the next file is allocated to that core. There is thus almost no idle time between two tasks on a core. The order of the data files is given arbitrarily by the user; in the sample case in Fig. 2 (2) the file order is from #1 to #20, and in the present study we list the data files in the Rakefile from old to new (from Sep. 1992 to Dec. 2011). File sizes are not uniform over a long-term data set because of occasional gaps in observation; it should be noted that, even with such unbalanced tasks, Pwrake's dynamic scheduling keeps the processing load well balanced between cores. A sketch of such a Rakefile is shown below.
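The following is a minimal Rakefile sketch of the kind described above, under stated assumptions: the data paths and the stars_batch command are hypothetical stand-ins for the STARS batch-mode invocation, and the server/core list is omitted since its exact form depends on the Pwrake configuration.

# Minimal Rakefile sketch for Pwrake (hypothetical paths and command name).
# Pwrake runs ordinary Rake tasks, dispatching these independent file tasks
# to the registered cores on the data processing servers; the server list
# (10 servers x 24 logical cores) is supplied in the Pwrake configuration.

SFA_FILES = FileList["/gpfs/nfs/geotail/sfa/*.cdf"].sort  # old to new

SFA_FILES.each do |src|
  png = src.pathmap("/gpfs/nfs/geotail/plots/%n.png")
  file png => src do
    # One independent task per data file: read the CDF, plot, write the PNG.
    sh "stars_batch --plot #{src} -o #{png}"  # hypothetical STARS batch command
  end
  task :default => png
end

Because each file task depends only on its own input, Pwrake is free to hand the next pending task to whichever core finishes first, which is exactly the load-balancing behavior noted above.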
4 High-speed Data Processing for Long-Term Satellite Observation Data

We performed data processing with STARS on the long-term GEOTAIL PWI/SFA data set, using 1 to 240 cores on the NICT Science Cloud, and measured the I/O time (read and write time) and the data processing time in each case.

Fig. 3. Speed-up of total processing time (left) and parallel efficiency of total processing time (right) for one server and 10 servers. The horizontal axis shows the number of parallel processes.

Fig. 3 shows the results of these measurements, carried out for one server and for 10 servers. Note that each server has 12 physical cores, but 24 virtual cores are available owing to the hyper-threading (HT) setting. Up to 12 cores, the speed-up increases steadily thanks to the high-performance I/O of the parallel file system and the effective task scheduling. Beyond 12 parallel processes, the speed-up still increases with the degree of parallelization; this is the HT effect. The highest speed-up is 107, obtained at a parallelization of 200 (20 cores × 10 servers), as indicated in Fig. 3: total processing with 200 cores is 107 times faster than with one core on a single data processing server. The right-hand panel of Fig. 3 shows the parallelization efficiency on the same horizontal axis as the left-hand panel; here the speed-up with n parallel processes is S(n) = T(1)/T(n), where T(n) is the total processing time, and the efficiency is E(n) = S(n)/n, so that S(200) = 107 gives E(200) ≈ 0.54. The efficiency decreases once the number of parallel processes reaches 4 or more, because the I/O scalability is below 100% and because of the HT effect.

5 Conclusion

In the present study, we performed parallel data processing of 27,576 data files using up to 240 cores. By paying attention to I/O speed and task scheduling, we achieved processing 107 times faster than on a legacy system and technique. The present method is applicable not only to scientific satellite data but also to many types of Earth observation satellite data. Numerical simulation before the launch of a satellite is also important for mission success, and large-scale simulation data processing is another target of the present approach [6].

Acknowledgments

The present work was carried out on the NICT Science Cloud. The GEOTAIL satellite data were provided by RISH, Kyoto University. We are grateful to Mr. Kenji Inoue and Ms. Chie Toda for their help in setting up the Pwrake and STARS environments.