Homework 3: Wikipedia Clustering
Cliff Engle & Antonio Lupher
CS 294-1

Introduction:

Clustering is an important machine learning task that tackles the problem of classifying data into distinct groups based on their features. An ideal clustering algorithm maximizes feature similarity within a cluster while minimizing feature similarity across clusters. Some of the most common clustering algorithms include spectral clustering and k-means clustering. This project consists of two main parts: extracting features into a bag-of-words representation, and then performing clustering using these features.

In this assignment, we are given a 37 GB snapshot of Wikipedia stored in XML format. Our goal is to cluster the Wikipedia articles into hierarchical categories based on their features, which in this case are the word occurrences within each article. We first construct a bag-of-words representation of the data. To begin, we assign each word a unique ID, where IDs are given to words according to the number of occurrences of each word. Once we have created this dictionary, we can transform each article into its bag-of-words representation. Finally, using the bag-of-words features, we perform the clustering, with the spectral clustering algorithm used to seed k-means.

Setup Configuration:

We performed our clustering using a somewhat different setup from the one specified in the assignment (ours was also approved by Prof. Canny). Instead of running Hadoop on the icluster, we used Spark on Amazon EC2 to create our dictionary and build the feature matrix. Our reasons for this choice were that, in addition to supporting Hadoop's MapReduce paradigm, Spark provides a number of other high-level operations that simplified our implementation considerably. Moreover, Spark's support for distributed in-memory caching of data sets significantly improved performance: keeping the dictionary and frequency counts cached in memory dramatically reduced the I/O overhead that the feature extraction stage would have incurred had we chosen to use Hadoop.

We uploaded the 37 GB Wikipedia XML dump to Amazon S3 so that it could be used by EC2 instances. We then started a 10-node EC2 cluster of Large instances with 7.5 GB of memory each, for a total of 75 GB. We used the Mesos EC2 setup scripts to do this, which automatically set up both Spark and HDFS on the cluster nodes. We then transferred the XML data into HDFS, which partitions and distributes the file across the local disks of each node. Once the data was in place on the cluster, we used Spark to process the data into a dictionary and feature matrix. Finally, we began to use BIDMat to cluster the results generated by Spark.
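As a point of reference, attaching a Spark driver to this cluster amounts to little more than the following. This is a hedged sketch: the master URL, application name, HDFS path, and the SparkContext package and constructor shown here are illustrative placeholders that differ across Spark releases and deployments, not a record of our exact configuration.

import spark.SparkContext  // org.apache.spark.SparkContext in later Spark releases

// "master" stands in for the EC2 master host reported by the Mesos setup scripts.
val sc = new SparkContext("mesos://master:5050", "WikipediaClustering")
// The dump is then addressed through HDFS, e.g. hdfs://master:9000/wikipedia/wikipedia.xml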

Feature Matrix:

At a high level, creating the feature matrix requires first parsing and tokenizing the XML data, then creating a dictionary that maps IDs to words, and finally converting each article into its bag-of-words representation using this dictionary. Below, we describe the major steps in depth. We also walk through some of our code, both to explain the algorithms we used and to highlight the flexibility of Spark's high-level operators compared with Hadoop's more limited map and reduce. We performed most of our Spark jobs in an ad-hoc manner using the spark-shell, so we provide the Scala code here.

1) The first thing we must be able to do is tokenize the input data so that we can extract the relevant words from each article. We used the XMLInputFormat provided by Apache Mahout to read records from the dataset delimited by the <page> and </page> tags. This worked well for us and allowed us to read in a single article at a time. Next, we needed to extract the relevant tokens from each article; for this we used the Wikipedia tokenizer from Apache Lucene (a sketch of the resulting extractTokens helper appears below). Once we were able to parse the article data, we needed to create our word dictionary. This involves outputting the key-value pair (word, 1) for every word occurrence in each article; after the map step, multiple reducers merge the counts for each word. In Spark, this process translates to roughly:

val rdd = sc.hadoopFile("wikipedia.xml")  // schematic; in practice we supply the XMLInputFormat class (see the sketch below)
val wordCounts = rdd.flatMap(article => extractTokens(article))
                    .map(word => (word, 1))
                    .reduceByKey(_ + _)

Spark automatically uses combiners, so this step was quite easy. The next step is to compute a total ordering of the words by count, in descending order. We map each (word, count) pair to (count, word) and then sort by count. Spark has a distributed sortByKey function (implemented by Antonio!) that uses a range partitioner to partition the data by key so that multiple reducers can sort the data simultaneously. In Spark code, these steps translate to:

val ascending = false
val sortedCounts = wordCounts.map { case (word, count) => (count, word) }
                             .sortByKey(ascending)
                             .collect()

This collects the words sorted by count in descending order.
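For completeness, the sketch below shows how the pieces above fit together. It is hedged in several ways: it is written against Mahout's org.apache.mahout.text.wikipedia.XmlInputFormat (the class has lived in different packages across Mahout releases), against Lucene's WikipediaTokenizer (whose package and constructor also vary by Lucene version), and against a newAPIHadoopFile call whose exact signature depends on the Spark release. Treat the names, paths, and signatures as illustrative rather than the exact code we ran.

import java.io.StringReader
import scala.collection.mutable.ArrayBuffer
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute
import org.apache.lucene.analysis.wikipedia.WikipediaTokenizer
import org.apache.mahout.text.wikipedia.XmlInputFormat

// Configure Mahout's XmlInputFormat so that each record is one <page>...</page> article.
val xmlConf = new Configuration()
xmlConf.set(XmlInputFormat.START_TAG_KEY, "<page>")
xmlConf.set(XmlInputFormat.END_TAG_KEY, "</page>")
val rdd = sc.newAPIHadoopFile("hdfs://master:9000/wikipedia/wikipedia.xml",
    classOf[XmlInputFormat], classOf[LongWritable], classOf[Text], xmlConf)
  .map(_._2.toString)  // keep the article XML, drop the byte-offset key

// extractTokens: tokenize one article's text with Lucene's Wikipedia-aware tokenizer.
def extractTokens(article: String): Seq[String] = {
  val tokenizer = new WikipediaTokenizer(new StringReader(article))
  val term = tokenizer.addAttribute(classOf[CharTermAttribute])
  val tokens = new ArrayBuffer[String]()
  tokenizer.reset()
  while (tokenizer.incrementToken()) tokens += term.toString.toLowerCase
  tokenizer.end()
  tokenizer.close()
  tokens
}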

We can assign IDs to the words by position in this sorted list:

val dictionary = sortedCounts.map { case (count, word) => word }

Table 1: Top 10 Words

Word    Count
to      7587258
of      6109544
the     5715734
in      5322840
a       4719922
and     4681149
is      4297563
from    4184193
for     4029959
name    3816141

2) Next, we pruned the dictionary to make further computation more manageable, keeping only the top 50,000 most frequently occurring words. Once we had a pruned dictionary of the most frequent words, we also wanted to experiment with stemming to see how it would affect clustering. Stemming mitigates the effects of varying suffixes and inflected forms of common morphological roots. We also considered experimenting with other feature representations, including bigrams and tf-idf. For stemming, we used ScalaNLP's built-in Porter stemmer implementation to derive the root of each dictionary term. We then built a new dictionary of unique roots, mapping each root to the sum of the frequencies of the corresponding terms in the unstemmed dictionary. Finally, we sorted this list of counts (as described above) and took the first 50,000 entries. We performed all of this computation on the master node, since the dictionary is fairly small. Within the spark-shell, we used the following Scala statements:

val prunedDictionary = dictionary.take(50000)
val porter = new PorterStemmer()
val counts = sortedCounts.map { case (count, word) => count }
val stemmedUngrouped = dictionary.map(word => porter(word)).zip(counts)

val stemmedDictionary = stemmedUngrouped
  .groupBy(_._1)
  .map { case (root, pairs) => (root, pairs.map(_._2).sum) }
  .toArray
  .sortBy(-_._2)
  .map(_._1)

3) To construct the sparse feature matrix, we mapped each article to the set of (word ID, occurrence count) pairs for the terms contained in that article:

import scala.collection.mutable.HashMap

val encoder = new HashMap[String, Int]()
stemmedDictionary.view.zipWithIndex.foreach { case (term, index) => encoder += term -> index }

val featureMatrix = rdd.map { article =>
  extractTokens(article)
    .map(w => porter(w))       // stem each token
    .filter(encoder.contains)  // keep only terms in the pruned, stemmed dictionary
    .map(encoder)              // term -> word ID
    .groupBy(identity)
    .mapValues(_.size)         // word ID -> number of occurrences in this article
}

Note that we could have cached the RDD containing the extracted tokens for each article in the first step:

val tokensRDD = rdd.map(article => extractTokens(article))
tokensRDD.cache()

This loads the RDD of tokens into the cluster's memory, eliminating the overhead of re-extracting the tokens when building the feature matrix. With the cache in place, we would only need:

val featureMatrix = tokensRDD.map { tokens =>
  tokens.map(w => porter(w))  // and so on, as above, avoiding token re-extraction
}

Clustering:

After creating the feature matrix, the only step left is to actually perform the clustering. The two main clustering algorithms we considered were k-means and spectral clustering. We planned to use the BIDMat library to perform the matrix operations for clustering, but unfortunately we ran out of time and were not able to finish getting results. Nevertheless, we briefly describe the algorithm we experimented with.

We began to implement a spectral clustering algorithm to generate an initial set of cluster seeds, and then apply the standard k-means algorithm to these points. The main steps of this algorithm are as follows. We are given a weighted graph G = (V, E), where V is the set of vertices that we want to partition into k clusters and E is the set of edges, each associated with a weight w_ij. In our case, we construct the graph with edges between words that co-occurred in articles; the weight w_ij is computed from the frequency with which the two words co-occur in the documents. Spectral clustering works by bounding the conductance of the graph. First, we form the affinity matrix A, where A_ij = w_ij if i != j and A_ii = 0. Next, we create the diagonal matrix D whose (i, i) entry is the sum of the entries in the i-th row of A. Then we construct the matrix L = D^(-1/2) A D^(-1/2). We find x_1, ..., x_k, the k largest eigenvectors of L, and form the matrix X = [x_1 x_2 ... x_k] with these eigenvectors as columns. Then we normalize each row of X to unit length to create the new matrix Y. Finally, we would treat each row of Y as a point in R^k and apply the standard k-means algorithm on these seeds to cluster the points into k clusters.
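Although we did not get the BIDMat implementation working in time, the seeding procedure above is mechanical enough to sketch. The following is a minimal, dense illustration written against the Breeze linear algebra library rather than BIDMat; the symmetric affinity matrix A and the number of clusters k are assumed inputs built elsewhere from the co-occurrence weights, and a separate k-means routine would consume the rows of the returned matrix.

// Minimal sketch of the spectral seeding steps described above (Breeze, not BIDMat).
import breeze.linalg.{DenseMatrix, diag, eigSym, sum, Axis}

def spectralSeeds(A: DenseMatrix[Double], k: Int): DenseMatrix[Double] = {
  val n = A.rows
  // D: diagonal matrix whose (i, i) entry is the i-th row sum of A
  val d = sum(A, Axis._1)
  val dInvSqrt = diag(d.map(v => 1.0 / math.sqrt(v)))
  // L = D^(-1/2) A D^(-1/2); symmetric because A is symmetric
  val L = dInvSqrt * A * dInvSqrt
  // eigSym returns eigenvalues in ascending order, so the last k columns of the
  // eigenvector matrix correspond to the k largest eigenvalues
  val es = eigSym(L)
  val X = es.eigenvectors(::, (n - k) until n)
  // Normalize each row of X to unit length to obtain Y
  val Y = DenseMatrix.zeros[Double](n, k)
  for (i <- 0 until n) {
    val rowNorm = math.sqrt((0 until k).map(j => X(i, j) * X(i, j)).sum)
    for (j <- 0 until k) Y(i, j) = if (rowNorm > 0.0) X(i, j) / rowNorm else 0.0
  }
  Y  // each row is a point in R^k
}

The rows of Y produced here would then be handed to a standard k-means implementation (BIDMat, in our plan) to obtain the final k clusters.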