Experiences with a new Hadoop cluster: deployment, teaching and research. Andre Barczak February 2018


1 Experiences with a new Hadoop cluster: deployment, teaching and research Andre Barczak February 2018

2 Abstract. In 2017 the Machine Learning research group got funding for a new Hadoop cluster. However, the funding only sufficed for buying the hardware. The deployment of the cluster was done by the research group. The members had no previous specific experience with Hadoop clusters, so the learning curve was steep. This talk covers the pros and cons of deploying a new machine in this manner, and also illustrates how we are currently using the machine for research and teaching. The history of Hadoop is briefly covered, putting it in context with other parallel platforms such as Beowulf clusters.

3 The beginnings. Jan 2017: proposed a new Hadoop cluster for our research group. Several industry projects with big data. New Master of Analytics programme. New courses in data analysis and big data. No specific infrastructure for teaching or research. The cluster should have 2 master servers and 14 slave nodes. Budget: ~NZ$ 80K, everything included.

4 The machine

5 The machine
2 master nodes, each: 32 cores, 24 TB disk, 64 GB RAM, 2 x network
14 slave nodes, each: 8 cores, 8 TB disk, 32 GB RAM
TOTAL: 160 TB disk, 576 GB RAM
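The totals on this slide follow from the per-node figures, and the same arithmetic gives a feel for why the usable HDFS space is well below the nominal size. The sketch below is only illustrative: it assumes HDFS's default replication factor of 3 and that every disk is given over to HDFS, neither of which the talk states for this cluster.

```python
# Node counts and per-node specs are taken from the slide.
MASTERS = {"count": 2, "disk_tb": 24, "ram_gb": 64}
SLAVES = {"count": 14, "disk_tb": 8, "ram_gb": 32}

# Assumption: the HDFS default (dfs.replication = 3); the talk does not say
# which replication factor this cluster actually uses.
ASSUMED_REPLICATION = 3


def total(field):
    """Sum a per-node quantity over all master and slave nodes."""
    return sum(group["count"] * group[field] for group in (MASTERS, SLAVES))


raw_disk_tb = total("disk_tb")
raw_ram_gb = total("ram_gb")

# Upper bound on unique data, assuming (hypothetically) that all disk space
# holds replicated HDFS blocks and nothing else.
unique_data_tb = raw_disk_tb / ASSUMED_REPLICATION

print(f"Nominal disk: {raw_disk_tb} TB, nominal RAM: {raw_ram_gb} GB")
print(f"Unique data at replication {ASSUMED_REPLICATION}: about {unique_data_tb:.1f} TB")
```

Operating system, logs and intermediate job output reduce the figure further, so this is an optimistic bound rather than a measurement.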

6 Deployment. Ambari: the easy choice for installing all the components. Ubuntu: previous experience with Beowulf clusters. ITS wants the nodes isolated from the network. No ITS support: the academics become system administrators...

7 Deployment. Industry projects require confidentiality. Teaching requires sharing. [Diagram labels: Teaching, To the Internet, Research]

8 Flexible Configuration. Teaching: busy less than 30 weeks/year. Research: may need full resources. [Diagram labels: Teaching, To the Internet, Research]

9 The Software. We chose Ambari as the main platform. Free, open source. No free support (this took its toll later). Tools included with Ambari: HDFS, MapReduce, Spark, Hive, Pig, etc. Two biggest hurdles: hostname/IP; wrong space measurement in HDFS.

10 A view of the dashboard

11 A view of the hosts

12 From MPI to MapReduce. Early clusters (Beowulf type) with MPI (1994). Collective operations: Broadcast, Scatter, Gather, Reduce.

13 MPI Broadcast [Diagram: the master sends a copy of its data buffer to each of Node 1, Node 2, ..., Node N]

14 MPI Scatter [Diagram: the master splits its data into blocks B1, B2, B3, ..., Bn and sends one block to each node]

15 MPI Gather [Diagram: the master collects blocks B1, B2, B3, ..., Bn, one from each node, into its data buffer]

16 MPI Reduce [Diagram: the master combines the data from all nodes with a function F( ) into a single buffer]
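The four collectives on the preceding slides can be mimicked in a few lines of plain Python. This is a single-process sketch of the data movement only, not the MPI API: the function names are made up for illustration, and the uneven chunk sizes in `scatter` are a convenience of the sketch.

```python
from functools import reduce as fold


def broadcast(data, n_nodes):
    """Every node receives the same data (here, a reference to it)."""
    return [data for _ in range(n_nodes)]


def scatter(data, n_nodes):
    """Split data into n_nodes contiguous blocks, one per node."""
    size, extra = divmod(len(data), n_nodes)
    blocks, start = [], 0
    for rank in range(n_nodes):
        end = start + size + (1 if rank < extra else 0)
        blocks.append(data[start:end])
        start = end
    return blocks


def gather(blocks):
    """The master concatenates the blocks held by the nodes, in rank order."""
    return [item for block in blocks for item in block]


def reduce_all(values, op):
    """The master folds one value per node into a single result with op."""
    return fold(op, values)


data = list(range(10))
blocks = scatter(data, 4)                   # one block per node
partial_sums = [sum(b) for b in blocks]     # each node works on its own block
total = reduce_all(partial_sums, lambda a, b: a + b)
print(blocks, partial_sums, total)
```

Scatter followed by local work and a reduce is the pattern that the later MapReduce slides build on.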

17 Problem: Amdahl's Law Source: Wilkinson and Allen, 2005

18 Amdahl's Law Source: Wilkinson and Allen, 2005
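For reference, Amdahl's Law in its usual form gives the speedup on n processors as S(n) = 1 / (f + (1 - f) / n), where f is the fraction of the work that must run serially. A minimal sketch follows; the 5% serial fraction is an arbitrary illustrative value, not a figure from the talk.

```python
def amdahl_speedup(serial_fraction, n_processors):
    """Speedup predicted by Amdahl's Law: 1 / (f + (1 - f) / n)."""
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / n_processors)


# Hypothetical serial fraction, chosen only to show the shape of the curve.
F = 0.05

for n in (1, 2, 4, 8, 16, 64, 1024):
    print(f"n = {n:5d}  speedup = {amdahl_speedup(F, n):6.2f}")

# However many processors are added, the speedup stays below 1 / f.
print(f"upper bound = {1.0 / F:.1f}")
```

With f = 0.05 the speedup can never exceed 20, which is the limitation the next slide sets out to work around.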

19 Beyond Amdahl's Law. The serial percentage is not constant across problem sizes. Scatter the data before processing it. Distributed database. Develop an algorithm that is aware of the data distribution. Resilient and fault-tolerant. Scalable. One answer: MapReduce with HDFS.

20 MapReduce. Resembles Scatter / Reduce from MPI. Added benefits: scalability, fault-tolerance. Called: infrastructure, framework, technology. Criticism: is this really a new technology? Key element: HDFS.

21 MapReduce example: counting word occurrences in a book. Main:

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCount.class);
        // The reducer doubles as a combiner, pre-summing counts on each mapper.
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        // args[0] is the input path, args[1] the output directory.
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }

22 MapReduce example: map and reduce

    public static class TokenizerMapper
            extends Mapper<Object, Text, Text, IntWritable> {
        private final static IntWritable one = new IntWritable(1);
        private Text word = new Text();

        // Emit (word, 1) for every token in the input line.
        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, one);
            }
        }
    }

    public static class IntSumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        private IntWritable result = new IntWritable();

        // Sum all the counts received for one word.
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }
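To make the data flow between the mapper and the reducer explicit, the sketch below replays the same word count in plain Python on one machine. It only imitates the three phases (map, shuffle, reduce); the function names are illustrative and nothing here uses Hadoop.

```python
from collections import defaultdict


def map_phase(line):
    """Like TokenizerMapper: emit (word, 1) for each whitespace-separated token."""
    return [(word, 1) for word in line.split()]


def shuffle(pairs):
    """Group the values by key, as the framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Like IntSumReducer: sum the counts collected for one word."""
    return key, sum(values)


def word_count(lines):
    pairs = [pair for line in lines for pair in map_phase(line)]
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())


print(word_count(["the cat sat", "the cat ran"]))
```

On a cluster the map calls run on the nodes that hold each input split, and the shuffle moves data across the network, which is where much of the fixed overhead of small jobs comes from.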

23 Command Line

Hadoop:
    time hadoop jar wc.jar WordCount filename.txt output.txt

Stand alone machine:
    time cat filename.txt | tr '[:space:]' '[\n*]' | grep -v "^\s*$" | sort | uniq -c

... A pleasant smile broke quietly over his lips. The mockery of it! he said gaily. Your absurd name, an ancient Greek! He pointed his finger in friendly jest and went over to the parapet, laughing to himself. Stephen Dedalus stepped up, followed him wearily halfway and sat down on the edge of the gunrest, watching him still as he propped his mirror on the parapet, dipped the brush in the bowl and lathered cheeks and neck...

Pleasant Pleasants Please Please, Pleased Pleasure Pleiades, Plenty Plevna Plevna. Plot, Plough Plovers Pluck Plucking Plump. Plumped,

24 MapReduce example: counting words in a book. Compare (? marks a value missing from this transcript):

    File size (KB)   Single machine   1 master/5 slaves   1 master/9 slaves   Number of splits
    ?                ? s              23s                 21s                 ?
    ?                ? s              28s                 27s                 ?
    ?                ?m 4s            59s                 58s                 ?
    ?                ?m 53s           3m 36s              3m 33s              ?
    ?                ?m 47s           34m 4s              27m 11s             118
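The cluster timings for the largest file (34m 4s with 5 slaves, 27m 11s with 9 slaves) can be turned into a relative speedup. The helper below is a small sketch written for this purpose; `to_seconds` is not part of any tool used in the talk.

```python
import re


def to_seconds(text):
    """Convert a duration such as '34m 4s' or '59s' into seconds."""
    match = re.fullmatch(r"(?:(\d+)m\s*)?(\d+)s", text.strip())
    if match is None:
        raise ValueError(f"unrecognised duration: {text!r}")
    minutes, seconds = match.groups()
    return int(minutes or 0) * 60 + int(seconds)


five_slaves = to_seconds("34m 4s")    # largest file, 1 master/5 slaves
nine_slaves = to_seconds("27m 11s")   # largest file, 1 master/9 slaves

speedup = five_slaves / nine_slaves   # gain from adding four slaves
node_ratio = 9 / 5                    # ideal gain if scaling were linear

print(f"{five_slaves}s -> {nine_slaves}s, speedup {speedup:.2f}x "
      f"against an ideal {node_ratio:.1f}x")
```

Going from 5 to 9 slaves gives roughly a 1.25x speedup against an ideal 1.8x, and for the smaller files in the table the two configurations are almost indistinguishable.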

25 MapReduce example: counting words in a book. Compare: [Plot: Word Count MapReduce, runtime (s) against file size (KB), for single machine, 1 master/5 slaves and 1 master/9 slaves; runtime scale mark: 400]

26 MapReduce example: counting words in a book. Compare: [Plot: Word Count MapReduce, runtime (s) against file size (KB), for single machine, 1 master/5 slaves and 1 master/9 slaves; runtime scale mark: 4000]

27 Spark. Spark minimises I/Os. Keeps partial results in memory. Smart scheduling. pyspark example:

    text_file = sc.textFile("hdfs:///user/albarcza/test/4300.txt")
    counts = text_file.flatMap(lambda line: line.split(" ")) \
                      .map(lambda word: (word, 1)) \
                      .reduceByKey(lambda a, b: a + b)
    counts.saveAsTextFile("hdfs:///user/albarcza/test/sparktest")
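The three transformations in the pyspark example can be imitated locally to see what each step produces. The sketch below uses plain Python lists rather than Spark RDDs; `flat_map` and `reduce_by_key` are stand-ins written for illustration, not Spark functions.

```python
from functools import reduce
from itertools import chain, groupby
from operator import itemgetter


def flat_map(func, items):
    """Apply func to each item and flatten the resulting lists into one."""
    return list(chain.from_iterable(func(item) for item in items))


def reduce_by_key(func, pairs):
    """Combine the values of each key with func (local stand-in for reduceByKey)."""
    ordered = sorted(pairs, key=itemgetter(0))
    return [
        (key, reduce(func, (value for _, value in group)))
        for key, group in groupby(ordered, key=itemgetter(0))
    ]


lines = ["to be  or", "not to be"]   # note the double space in the first line

words = flat_map(lambda line: line.split(" "), lines)
pairs = [(word, 1) for word in words]
counts = dict(reduce_by_key(lambda a, b: a + b, pairs))
print(counts)
```

One detail this makes visible: `line.split(" ")` yields an empty string wherever two spaces are adjacent, so the empty string is counted as a word. Calling `split()` with no argument avoids that.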

28 Spark example Counting words in a book. Compare: File size (KB) 1 master/5 slaves 1 master/9 slaves Number of tasks s 1.4s s 2.8s s 13.9s s 31.0s s

29 Spark X MapReduce [Plot: Word Count, MapReduce versus Spark; runtime (s) against file size (KB), for single machine, 1/5 MapReduce, 1/9 MapReduce and Spark]

30 Teaching: Data Wrangling and Machine Learning. Perform data processing and data preparation tasks using domain-specific programming technologies. Integrate data from different sources and formats using a high-level programming language. Transform data into appropriate structures for analysis. Plot raw data and results of data analysis at an introductory level. Apply introductory machine learning and statistical techniques to generate data-driven solutions.

31 Teaching: Applied Machine Learning and Data Visualisation. Use a broad variety of sophisticated machine learning and data mining techniques to extract patterns in data. Assess the usefulness of predictive models. Perform advanced data visualisation techniques. Formulate problems for real-world datasets from various contexts. Present data-driven solutions to real-world problems. Devise strategies for Big Data problems.

32 Conclusions: negative aspects. Too much jargon. Too many competing tools. Documentation often incomplete (e.g., Ambari). Difficult to configure anything beyond the defaults. Very difficult to fine-tune particular jobs for performance (e.g., the MapReduce example). The effective size (disk and memory) is much smaller than the nominal one. Many of the tools are not mature yet (e.g., Zeppelin for multiple users).

33 Conclusions: positive aspects. When it all works, it is wonderful: one can really use big data and get results, e.g., we trained RF for a 1000-class problem with 200 GB of images; time series project (GDP prediction). Ambari facilitates the installation process. Very good performance for multiple jobs. Multiple uses for the machines (even when not used as a dedicated Hadoop cluster), with a flexible arrangement of the nodes. Teaching: students benefit from using a true platform rather than just a sandbox.
