Machine learning with big data in the Hadoop Ecosystem for Scientific Computing

Size: px

Start display at page:

Download "Machine learning with big data in the Hadoop Ecosystem for Scientific Computing"

Phebe Stevens
6 years ago
Views:

1 Machine learning with big data in the Hadoop Ecosystem for Scientific Computing Suhua Wei, Yong Yu Abstract With an unprecedented and exponentially growing amount of data available to research communities at various fields of scientific inquiries, it is a great challenge in the data processing architectures and application software with regards to selecting machine learning tools for research purposes. This proposal aims to help scientific researchers without big data background in understanding the big data architecture for machine learning in application of Hadoop open source tools. Through the discussion of advantages and drawbacks for choices of tools, this proposal aims to show the roadmap to scientific researchers for smart decisions making. In addition, this proposal may serve as a starting point in building scientific toolkits tuned for specific purposes with regards to machine learning application of Hadoop ecosystem. I. Introduction 1.1 Hadoop ecosystem. Hadoop is an open-source project for reliable, scalable, distributed computing. The Apache Hadoop software library is a framework that allows for the distributed processing of large data sets across cluster of computers [1]. Hadoop was created by Doug Cutting and named after his son s toy elephant. Hadoop is a framework that is the core of a rapidly growing ecosystem in which a number of providers are building Platforms based on Hadoop. Following figure showed the timeline of Hadoop ecosystem evolution.

2 Fig 1. Timeline of major events in Hadoop. Courtesy of Around 2004 two formative papers were published by Google authors[2][3]. The former defined the Google file system and the latter defined of the architecture an innovative approach called Map-Reduce intended for processing data. Hadoop was initially housed in the Nutch Apache project, but split off to become an independent Apache project in In January 2010, Google was granted a patent that covers the MapReduce algorithm. Afterwards, the use of Hadoop has grown rapidly. As time goes on, the initial Hadoop project have matured and spun off to become independents projects and provided continuously improving products: Avro, HBase, Hive, Pig, Flume, Sqoop, Oozie, HCatalog and Zookeeper etc [4]. Of courses, the on-going Hadoop projects and products are not limited to those we mentioned here. In this survey paper, we do not attempt to introduce those products one by one. Instead, we aim to give the beginners a systematic introduction of Hadoop ecosystem and its products, under the framework of data storage, data processing, data access and data management as shown in Fig 2. At each category, we will give a brief introduction to one or two representative products, exploring both their advantages and disadvantages.

Fig 2. Hadoop Ecosystem. Courtesy of http://revieweasyhomemadecookies.com/what-is-hadoopecosystem/ 1.

3 Fig 2. Hadoop Ecosystem. Courtesy of Machine learning Machine learning focuses on prediction-making through the use of the mathematic optimization on large amount of data. There are three most common categories in applications of machine learning algorithms: supervised learning, in which the algorithms will give a desired output, usually is numerical or labeled; unsupervised learning, which is to discover the hidden patterns in the data, and reinforcement learning, which is to provide feedback by implementing rewards and punishments towards achieving certain goals. Other categorizations of machine learning considering the output of a machine learning system include: classification, regression, clustering, density estimation and dimensionality reduction. 1.3 Applications of machine learning and big data in scientific research Computational modeling has been widely used as powerfully tools in various scientific research communities. From unveiling chemical processes, such as a catalyst s purification of exhaust fumes to high-energy particle accelerators, pushing the knowledge limit of human beings on the universe s expansion and dark energy using data generated by Hubble Space Telescope, highthroughput DNA sequencers, or processing images collected by earth observatory satellites. Each such scientific instrument, as well as a host of others, is critically dependent on computing for sensor control, data processing, international collaboration, and access. As with successive

4 generations of other large-scale scientific instruments, each new generation of advanced computing brings new capabilities, along with technical design [5]. 2. Problem description Data-generation capabilities in most science domains are growing more rapidly than computing capabilities, causing these domains to become data-intensive. High-end data analytics (big data) and high-end computing (extra scale) are both essential elements of an integrated computingresearch-development [5]. Research and development of next generation algorithms, software, and applications is as crucial as investment in devices and hardware. Therefore, fostering development of a technology ecosystem of research-oriented computing and analysis tools/products is also important. This proposal attempts to disclose potential solutions in the technique challenges in computing-research areas regarding machine learning applications in the middleware and management level, as well as at application level based on Hadoop ecosystem. 3. Survey on the related work 3.1 Hadoop data analytics ecosystem Data Storage The representative Hadoop data storage projects are Hadoop Distributed File System (HDFS) and HBase. Yahoo! has developed and contributed to 80% of the core of Hadoop (HDFS and MapReduce). HDFS is designed to store very large dataset reliably, and to stream those data sets at high bandwidth to user applications. An important characteristic of Hadoop is the partitioning of data and computation across many (thousands) of hosts, and executing application computations in parallel close to their data. A Hadoop cluster scales computation capacity, storage capacity and IO bandwidth by simply adding commodity servers. HDFS stores file system metadata and application data separately [6]. The major components of HDFS architecture include NameNode, DataNode and Clients. The HDFS namespace is a hierarchy of files and directories. Files and directories are represented on the NameNode by inodes. The files contents is split into large blocks and each block of file is independently replicated at multiple DataNodes. The NameNodes maintans the namespace tree and the mapping of file blocks to DataNodes. Each block replica on DataNode is represented by two files in the local s host native file system. The first files contain the data itself and the second file is block s metadata including checksums for the block data and the block s generation stamp. HDFS Client is a code library that exports the HDFS file system interface. HDFS supports operations to read, write, and delete files and operations to create and delete directories. Unlike conventional file systems, HDFS provides an API that exposes the locations of files blocks. This

5 allows applications like Mapreduce framework to schedule a task to where the data are located [6]. HBase was originally developed at Powerset, now a department at Microsoft[4]. HBase is an open distributed database modeled after Google s Bigtable and is written in Java. It is a column-oriented database management system that s runs on top of HDFS. It is well suited for sparse data sets. It doesn t support a structured query language like SQL, so it is not relational database. A HBase systems contains tables. Each table has rows and columns, with an element defined as primary key, and all access attempts to HBase table must use this primary key. Hadoop is mainly meant for batch processing, in which data will be accessed only in a sequential manner, where HBase is used for quick random access of huge data Data Processing Since 1970s, the study on parallel processing technologies attracted more and more attention of the I/O bottleneck issue caused by the different bandwidth between processors and memory. As a result, the parallel DBMSs are widely researched by processing huge data on a cluster of computers [7]. The novel Mapreduce plateform achieves success as it can provide a flexible infrastructure, storing data distributive on a scalable system with a simple and flexible style. MapReduce is a programming model and an associated implementation for processing and generating big data sets with a parallel, distributed algorithm on a cluster [8]. Users have to implement their special map function and reduce function to process a key/value pair and mergers all intermediate values associated with the paired key. The programmer does not need to care about the details of partitioning the input data, the execution machines, machine failures and inter-machine communication, the run-time system would handle all of the details. The idea of the function model is inspired by the map and reduces primitives present in LISP. The new abstraction of mapreduce is to express the simple computation: hides the messy details of parallelization, fault-tolerance, data distribution and load balancing in a library. The process of Mapreduce action includes:1, The input files are splited into M pieces, usually 16 ~ 64M per piece; 2, Master assigns works to workers; 3, Worker reads the contents of the input split; 4, The buffered pairs are written to local disk; 5, Master read the buffered data, reduce works sort all intermediate data, group key and value; The reduce worker passes the key and values to reduce function; 7, Master wakes up the user program The performance of the Mapreduce is powerful: for example, it just took about 100 seconds to grep a 92,337 records of three-character pattern in byte records on a 1800 machines cluster. Mapreduce is a powerful and simple framework that has been used a cross wide range of domains: Large-scale machine learning problems; Clustering problems for Google news and Froogle products; Extraction of data used to product reports of popular queries; Extraction of properties of web pages for new experiments and products; Large-scale graph computations.

6 3.1.3 Data Access Hive was initially created by Facebook for its own infrastructure processing, later they made it open source and donated it to the Apache Software Foundation. Hive supports custom map-reduce scripts to be plugged into queries. The programmer can writes HQL (Hive Query Language), which is a SQL-like declarative language, to display results on the console. Data in Hive is organized into tables, partitions and buckets. Tables are analogous to tables in relational database. Each table has a corresponding HDFS directory. The data in tables is serialized and stored the files with that directory. Hive provides built-in serialization formats which compress data and support lazy de-serialization. Each table can have more one or more partitions which determine the distribution of data within sub-directories of the table directory. Also Data in each partition may be divided into buckets based on the hash of a column in the table. Each bucket is stored as a file in the partition directory [9]. The Hive architecture include the following components: External Interface-both iser interfaces like command line and web UI, and application programming interface(api) like JDBC and ODBC. The Hive Thrift server eposes a very simple client API to execute HQL statements. The metastore is the system catalog and the Driver manages the life cycle of a HQL statement during compilation, optimization and execution. The compiler is invoked by the driver upon receiving a HQL statement, then it translates the statement into a plan, which consists of DAG of Mapreduce jobs. The driver submits the individual Mapreduce jobs form the DAG to the execution engine in a topological order[9]. We will discuss machine learning toolkits like Mahout, MLlib, H 2 O, and Samoa separately in a single section Data Management According to ZooKeeper project wiki[10]: ZooKeeper is a centralized service for maintaining configuration information, naming, providing distributed synchronization, and providing group services. ZooKeeper allows distributed processes to coordinate with each other through a shared hierarchical name space of data registers (or znodes). Zoo keep is analogous to a file system. Unlike normal file systems ZooKeeper provides ordered access to the znodes. The performance aspects of ZooKeeper allow it to be used in large distributed systems. What differentiate ZooKeeper and standard file systems is that every znode can have data associated with it (every file can also be a directory and vice-versa). znodes only can have limited amount of data. Coordination data, status information, configuration, location information, etc. are stored in znodes. The server itself is replicated over a set of machines, which maintain an in-memory image of the data tree along with a transaction logs. One disadvantage is that data size of the database that ZooKeeper can manage is limited by memory. Clients only connect to a single ZooKeeper server[10] Privacy Data Protection Nowadays, some privacy information is necessary to provide in social life, including hospital, government, travel, etc. The information always was analysis to provide further

7 application, which may contain private sensitive information. The information can be retrieved by data mining techniques, consequently privacy preserving data mining is a new investigation in data mining and statistical database[11]. As hadoop leads the proliferation of massive data analysis, numerous research on studying mining techniques to the MapReduce framework in Hadoop. Randomization and distributed privacy preserving techniques are two classes of privacy preserving data mining technique. Randomization is a model that generally use data perturbation and noise addition and distribute privacy preserving technique is a method that process secure multi-party computation and encryption to share mining result without sensitive information [11]. Kangsoo Jung and his team published Hiding a Needle in a Haystack: Privacy Preserving Apriori Algorithm in MapReduce Framework in 2014[12] to introduce a new privacy preserving data mining (PPDM) to overcome some existing drawback. The concept of their privacy-preserving algorithm is based on hiding a needle in a haystack, which is the idea that detecting a rare class of data is hard to find in a large size of data. The proposed method, a dummy item was added as noise to original transaction data; then a special code is used to identify the dummy and the original items. This method does not reduce the performance a lot but protect the privacy requirement satisfies Machine Learning Toolkits A variety of machine learning toolkits have been created to facilitate the learning process but many researchers and practitioners do not use them for various reasons. Most often because they lack needed features or are difficult to integrate into an existing environment. Landset et al 2015 [13] paper summarized four of the most comprehensive machine learning packages that runs on Hadoop. Their direct comparisons are shown in the following Table:

8 Table 1. Overviews of machine learning toolkits run on Hadoop [13]

9 Mahout Mahout is one of the more well-known tools for machine learning, it has a wide selection of robust algorithms, including classification, clustering and collaborative filtering. In May 2017, Mahout was updated to version [14]. It integrated Samsara, which includes linear algebra, statistical operations and data structures. The goal of the Mahout-Samsara project is to help users to build their own distributed algorithms, rather than simply a library. MLib MLib originally convers the same range of learning categories as Mahout, and also adds regression models, which Mahout lacks. They also have algorithms for topic modeling and frequent pattern mining. Additional tools include dimensionality reduction, feature extraction and transformation, optimization, and basic statistics. Mlib is based on Spark s iterative batch and streaming approaches, as well as its use of in-memory computation, thus much faster than those using Mahout. The drawback is that is dependent on Spark, and may present a problem for those who perform machine learning on multiple platform[13][15]. H 2 O H 2 O is a mature product in comparing to other projects. It has an enterprise edition, as well as open source support without the purchase of a license. More importantly, it provides a graphic user interface (GUI) and numerous tools for deep neural networks. Additionally, it provides deep leaning algorithms that are also targeted for the purpose of research [13][16]. SAMOA SAMOA, a platform for machine learning from streaming data, was originally developed at Yahoo! Labs in Barcelona in 2013 and has been part of the Apache incubator since late It provides a collection of distributed streaming algorithms for the most common data mining and machine learning tasks such as classification, clustering, and regression, as well as programming abstractions to develop new algorithms that run on top of distributed stream processing engines (DSPEs). It features a pluggable architecture that allows it to run on several DSPEs such as Apache Storm, Apache S4, Apache Samza, and Apache Flink. SAMOA is similar to Mahout in spirit, but specific designed for stream mining [13].

4. Architecture of proposed system The proposed architecture is illustrated as the following: Domain Knowledge At the system software level, Linux OS and its variant is proposed to use.

10 4. Architecture of proposed system The proposed architecture is illustrated as the following: Domain Knowledge At the system software level, Linux OS and its variant is proposed to use. Linux provides system services, augmented with parallel files systems and batch schedulers for parallel job management. At top of this hardware, Hadoop includes a distributed files system (HDFS) for managing large data. On top the Hadoop storage system, tools (such as Pig and Hive) provide a high level programming model for data accessing. Flume is a distributed, reliable, and available system for efficiently collecting, aggregating and moving large amounts of log data from many different sources to a centralized data store [5]. On the top level is machine learning toolkits for

11 analytic purposes. Individuals or organizations can also develop stand-alone applications tuned for specific purposes with domain knowledge. 5. Proposed Approaches This proposal attempts to provide a framework to facilitate the fast-development of researchoriented machine learning toolkits in this big data era, not specifically targeting any specific scientific research domain. The required features of machine learning toolkits vary greatly among topics. A better approach to test the performance of the proposed framework is to do case studies by a given domain. Although the use cases vary, there are a few factors needed to be considered in judging the performance of the framework: Scalability: this is regarded to both the size and complexity of the data. The type of data to be processed should be taken into account. Speed: it is considered as throughput rate. Factors affecting speed include processing platform. It can be a crucial concern if the models are required to be updated often. Coverage: this refers to the range of option contained in the toolkit regarding machine learning algorithm Usability: this refers to user friendly interface, low cost of maintenance and good documentation, plenty of choices of programming languages. Extensibility: this refers how easy the system copes with changes. The development of tools is usually fast in the industry, can the system adapt new changes for better performance? 6. Future work plan High-end data analytics (big data) and high-end computing (extra scale) are both essential elements of an integrated computing-research-development. Research and development of next generation algorithms, software, and applications is as crucial as investment in devices and hardware. Single stand-alone application tuned for specific scientific research is not sufficient. Therefore, fostering developing sub- ecosystem of research-oriented computing and analysis tools/products targeted for each knowledge domain must become a trend.

12 References: [1] [2] Ghemawat, S., Gobioff, H., Leung, S.T.: The google file system. In: Proceedings of the 19th ACM Symposium on Operating Systems Principles. (2003) [3] Dean, J., Ghemawat, S. Mapreduce: Simplified data processing on large clusters. In: Proceedings of OSDI 04: Sixth Symposium on Operating System Design and Implementation. (2004) [4] J. Yates Monteith, John D. McGregor and John E. Ingram: Hadoop and its evolving ecosystem. Proceeding of IWSECO(2013) [5] D. Reed and Dongarra J. Exascale Computing and Big Data. DOI: / [6] Shvachko, K., H. Huang, S. Radia, R. Chansler: The Hadoop Distributed File System. The proceedings of the 2010 IEEE 26 th Symposium on Mass Storage Systems and Technologies(MSST), MSST 10, pages 1-10, Washington, DC, USA, IEEE Computer Society. [7], J. Lu and Jun Feng, International Conference on Cyberspace Technology (CCT 2014), Beijing, 2014, pp doi: /cp [8] Dean, J. and S. Ghemawat: MapReduce: Simplified Data Processing on large clusters. In OSDI 2004 [9] Thusoo A., J. Sen Sarma, N. Jain, Z. Shao, P. Chakka, S. Anthony, H. Liu, P. Wyckoff, R. Murthy. Hive-A Warehousing Solution Over a Map-reduce Framework. In VLDB (2009). [10] [11] Aris Gkoulalas Divanis, Vassilios S. Verykios: Association Rule Hiding For Data Mining. Springer, DOI / , Springer Science + Business Media, LLC 2010 [12] Kangsoo Jung, Sehwa Park and Seog Park. Hiding a needle in a Haystack: Privacy Preserving Apriori Algorithm in MapReduce Framework. Proceeding PSBD `14 Proceedings of the First International workshop on Privacy and Security of Big Data Pages [13] S. Landset, Khoshgoidraar T., Richter A.N. and Hasanin T. A survey of open source tools for machine learning with big data in the Hadoop ecosystem. Journal of Big Data 2: [14] [15] [16]

CIS 601 Graduate Seminar Presentation Introduction to MapReduce --Mechanism and Applicatoin. Presented by: Suhua Wei Yong Yu

CIS 601 Graduate Seminar Presentation Introduction to MapReduce --Mechanism and Applicatoin Presented by: Suhua Wei Yong Yu Papers: MapReduce: Simplified Data Processing on Large Clusters 1 --Jeffrey Dean