Spark and Cassandra. Solving classical data analytic task by using modern distributed databases. Artem Aliev DataStax




3 Spark and Cassandra: Solving classical data analytic task by using modern distributed databases
Artem Aliev, Software Developer
2014 DataStax Confidential. Do not distribute without consent.

4 Agenda: Lambda Architecture


5 Apache Cassandra
Apache Cassandra is a massively scalable NoSQL OLTP database.
* Masterless architecture with read/write anywhere design.
* Continuous availability with no single point of failure.
* Multi-data center and cloud availability zone support.
* Linear scale performance with online capacity expansion.
* CQL: SQL-like language.
[Diagram: clusters of increasing node count, labelled 100,000 txns/sec, 200,000 txns/sec and 400,000 txns/sec.]


6 Cassandra Architecture Overview
* Cassandra was designed with the understanding that system/hardware failures can and do occur
* Peer-to-peer, distributed system
* All nodes the same
* Multi data centre support out of the box
* Configurable replication factor
* Configurable data consistency per request
* Active-Active replication architecture
[Diagram: two five-node rings, DC: USA and DC: EU. In each ring the 1st copy is on Node 1, the 2nd copy on Node 2 and the 3rd copy on Node 3.]
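The replica placement in the diagram on slide 6 can be sketched in a few lines of Python. This is a simplified illustration, not Cassandra's actual partitioner or replication strategy: the MD5-based `token` function, the node names and the `replicas` helper are all assumptions made for the example. A key is hashed to a position on the ring, and its copies go to the next `replication_factor` nodes clockwise.

```python
import bisect
import hashlib
from typing import List


def token(value: str) -> int:
    """Map a string to a ring position (illustrative hash, not Cassandra's partitioner)."""
    return int(hashlib.md5(value.encode("utf-8")).hexdigest(), 16)


def replicas(key: str, nodes: List[str], replication_factor: int) -> List[str]:
    """Return the nodes that hold copies of `key`, walking the ring clockwise."""
    ring = sorted((token(node), node) for node in nodes)
    tokens = [t for t, _ in ring]
    # First node whose token is greater than the key's token, wrapping around.
    start = bisect.bisect_right(tokens, token(key)) % len(ring)
    count = min(replication_factor, len(ring))
    return [ring[(start + i) % len(ring)][1] for i in range(count)]


nodes = ["node1", "node2", "node3", "node4", "node5"]  # hypothetical node names
owners = replicas("Mighty Mutts", nodes, replication_factor=3)
print(owners)  # three distinct nodes: 1st, 2nd and 3rd copy
```

Because every node can compute the same placement from the key alone, no master is needed to route a read or a write.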

7 Cassandra Query Language

    CREATE TABLE sporty_league (
      team_name varchar,
      player_name varchar,
      jersey int,
      PRIMARY KEY (team_name, player_name)
    );

    SELECT * FROM sporty_league
      WHERE team_name = 'Mighty Mutts' AND player_name = 'Lucky';

    INSERT INTO sporty_league (team_name, player_name, jersey)
      VALUES ('Mighty Mutts', 'Felix', 90);
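In the table on slide 7, `team_name` is the partition key and `player_name` is a clustering column. A rough in-memory model of what that means is sketched here in Python. The `insert` and `select` helpers and Lucky's jersey number are made up for illustration; only the Felix row comes from the slide.

```python
from collections import defaultdict

# partition key (team_name) -> {clustering key (player_name) -> row}
table = defaultdict(dict)


def insert(team_name, player_name, jersey):
    """CQL INSERT is an upsert: the same primary key overwrites the existing row."""
    table[team_name][player_name] = {
        "team_name": team_name,
        "player_name": player_name,
        "jersey": jersey,
    }


def select(team_name, player_name=None):
    """Look up one partition; rows come back ordered by the clustering column."""
    partition = table.get(team_name, {})
    if player_name is not None:
        row = partition.get(player_name)
        return [row] if row is not None else []
    return [partition[name] for name in sorted(partition)]


insert("Mighty Mutts", "Lucky", 7)    # jersey number is a made-up value
insert("Mighty Mutts", "Felix", 90)
insert("Mighty Mutts", "Felix", 91)   # same primary key: overwrites the previous row

print([row["player_name"] for row in select("Mighty Mutts")])
print(select("Mighty Mutts", "Felix"))
```

Queries that name the partition key touch a single partition, which is why the SELECT on the slide restricts `team_name` first.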

8 Apache Spark
* Distributed computing framework
* Created at the UC Berkeley AMPLab in 2009
* Open sourced in 2010, Apache project since 2013
* Solves problems Hadoop is bad at
  * Iterative algorithms
  * Interactive machine learning
* More general purpose than MapReduce
* Streaming!
* In memory!

9 Fast
* Logistic Regression Performance
[Chart: running time (s) against number of iterations, Hadoop vs. Spark. Surviving annotations: first iteration 80 sec, further iterations 1 sec.]


10 Simple

Hadoop MapReduce (Java):

    package org.myorg;

    import java.io.IOException;
    import java.util.*;

    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.conf.*;
    import org.apache.hadoop.io.*;
    import org.apache.hadoop.mapred.*;
    import org.apache.hadoop.util.*;

    public class WordCount {

      // Emits (word, 1) for every token in the input line.
      public static class Map extends MapReduceBase
          implements Mapper<LongWritable, Text, Text, IntWritable> {
        private final static IntWritable one = new IntWritable(1);
        private Text word = new Text();

        public void map(LongWritable key, Text value,
            OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException {
          String line = value.toString();
          StringTokenizer tokenizer = new StringTokenizer(line);
          while (tokenizer.hasMoreTokens()) {
            word.set(tokenizer.nextToken());
            output.collect(word, one);
          }
        }
      }

      // Sums the counts collected for each word.
      public static class Reduce extends MapReduceBase
          implements Reducer<Text, IntWritable, Text, IntWritable> {
        public void reduce(Text key, Iterator<IntWritable> values,
            OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException {
          int sum = 0;
          while (values.hasNext()) {
            sum += values.next().get();
          }
          output.collect(key, new IntWritable(sum));
        }
      }

      public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf(WordCount.class);
        conf.setJobName("wordcount");

        conf.setOutputKeyClass(Text.class);
        conf.setOutputValueClass(IntWritable.class);

        conf.setMapperClass(Map.class);
        conf.setCombinerClass(Reduce.class);
        conf.setReducerClass(Reduce.class);

        conf.setInputFormat(TextInputFormat.class);
        conf.setOutputFormat(TextOutputFormat.class);

        FileInputFormat.setInputPaths(conf, new Path(args[0]));
        FileOutputFormat.setOutputPath(conf, new Path(args[1]));

        JobClient.runJob(conf);
      }
    }

Spark (Python):

    file = spark.textFile("hdfs://...")
    counts = file.flatMap(lambda line: line.split(" ")) \
                 .map(lambda word: (word, 1)) \
                 .reduceByKey(lambda a, b: a + b)
    counts.saveAsTextFile("hdfs://...")
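To see what the three Spark steps on slide 10 compute, here is a single-machine sketch in plain Python. The `flat_map` and `reduce_by_key` helpers are illustrative stand-ins written for this example, not Spark APIs, and the input lines are made up.

```python
from collections import defaultdict
from functools import reduce
from itertools import chain


def flat_map(func, items):
    """Apply func to each item and flatten the resulting lists into one list."""
    return list(chain.from_iterable(func(item) for item in items))


def reduce_by_key(func, pairs):
    """Group (key, value) pairs by key, then fold each group's values with func."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return {key: reduce(func, values) for key, values in groups.items()}


lines = ["to be or not to be", "to do"]                   # made-up input
words = flat_map(lambda line: line.split(" "), lines)     # one record per word
pairs = [(word, 1) for word in words]                     # (word, 1)
counts = reduce_by_key(lambda a, b: a + b, pairs)         # sum per word
print(counts)
```

In Spark the same pipeline runs in parallel over partitions of the input, and the per-key fold requires a shuffle between nodes.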

11 A Quick Comparison to Hadoop
[Diagram: a Hadoop job (HDFS, map(), reduce()) beside a Spark job over Data Source 1 and Data Source 2 (map(), join(), reduce(), cache(), transform()).]

12 API map reduce

13 API map filter groupBy sort union join leftOuterJoin rightOuterJoin reduce count fold reduceByKey groupByKey cogroup cross zip sample take first partitionBy mapWith pipe save...

14 API
* Resilient Distributed Datasets (RDD)
  * Collections of objects spread across a cluster, stored in RAM or on disk
  * Built through parallel transformations
  * Automatically rebuilt on failure
* Operations
  * Transformations (e.g. map, filter, groupBy)
  * Actions (e.g. count, collect, save)
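The split between transformations and actions on slide 14 can be illustrated with a toy class in Python. `MiniRDD` is a made-up name and a single-process simplification, not Spark's implementation. Transformations only record a step in the lineage, and actions replay the lineage from the source, which is also how a lost partition can be rebuilt.

```python
class MiniRDD:
    """Toy stand-in for an RDD: a data source plus a recorded chain of transformations."""

    def __init__(self, source, lineage=()):
        self._source = source            # callable that returns the base data
        self._lineage = tuple(lineage)   # recorded (kind, function) steps

    def map(self, func):
        # Transformation: returns a new MiniRDD, computes nothing.
        return MiniRDD(self._source, self._lineage + (("map", func),))

    def filter(self, predicate):
        # Transformation: returns a new MiniRDD, computes nothing.
        return MiniRDD(self._source, self._lineage + (("filter", predicate),))

    def _compute(self):
        # Replay the lineage from the source.
        data = iter(self._source())
        for kind, func in self._lineage:
            data = map(func, data) if kind == "map" else filter(func, data)
        return data

    def collect(self):
        # Action: triggers computation and returns all elements.
        return list(self._compute())

    def count(self):
        # Action: triggers computation and returns the number of elements.
        return sum(1 for _ in self._compute())


calls = []

def load():
    calls.append("load")   # record every time the source is actually read
    return range(10)

rdd = MiniRDD(load).map(lambda x: x * x).filter(lambda x: x % 2 == 0)
loads_before_action = len(calls)   # still 0: nothing has run yet
result = rdd.collect()
total = rdd.count()
print(loads_before_action, result, total, len(calls))
```

Each action re-reads the source here because nothing is cached; keeping computed partitions in memory is what removes that repeated work.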

15 Operator Graph: Optimization and Fault Tolerance
[Diagram: RDDs A to F arranged in three stages. Stage 1: groupBy. Stage 2: map, filter. Stage 3: join. Legend: RDD, cached partition.]
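The cached partitions in the operator graph on slide 15 matter most for iterative jobs. A minimal Python sketch of the effect follows. The `Dataset` class and `run_iterations` function are hypothetical names and this is not Spark's caching mechanism. Without caching, every iteration reloads the data; with caching, only the first one does.

```python
class Dataset:
    """Toy dataset with an expensive loader; cache() keeps loaded data in memory."""

    def __init__(self, loader):
        self._loader = loader
        self._use_cache = False
        self._data = None
        self.loads = 0   # how many times the loader actually ran

    def cache(self):
        self._use_cache = True
        return self

    def values(self):
        if self._use_cache and self._data is not None:
            return self._data
        self.loads += 1
        data = list(self._loader())
        if self._use_cache:
            self._data = data
        return data


def run_iterations(dataset, iterations):
    """Stand-in for an iterative algorithm: every iteration scans the whole dataset."""
    total = 0
    for _ in range(iterations):
        total += sum(dataset.values())
    return total


uncached = Dataset(lambda: range(1000))
cached = Dataset(lambda: range(1000)).cache()
uncached_total = run_iterations(uncached, 5)
cached_total = run_iterations(cached, 5)
print(uncached.loads, cached.loads)
```

Both runs produce the same result; only the number of loads differs, which is the pattern behind a slow first iteration followed by fast later ones.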

16 Why Spark on Cassandra?
* Data model independent queries
* Cross-table operations (JOIN, UNION, etc.)
* Complex analytics (e.g. machine learning)
* Data transformation, aggregation, etc.
* Stream processing

17 How to Spark on Cassandra?
* DataStax Cassandra Spark driver
  * Open source:
* Compatible with
  * Spark 0.9+
  * Cassandra 2.0+
  * DataStax Enterprise 4.5+

18 Cassandra Spark Driver
* Cassandra tables exposed as Spark RDDs
* Read from and write to Cassandra
* Mapping of C* tables and rows to Scala objects
* All Cassandra types supported and converted to Scala types
* Server side data selection
* Spark Streaming support
* Scala and Java support

19 Connecting to Cassandra

    import org.apache.spark.{SparkConf, SparkContext}
    // Import Cassandra-specific functions on SparkContext and RDD objects
    import com.datastax.driver.spark._

    // Spark connection options
    val conf = new SparkConf(true)
      .setMaster("spark:// :7077")
      .setAppName("cassandra-demo")
      .set("cassandra.connection.host", " ") // initial contact
      .set("cassandra.username", "cassandra")
      .set("cassandra.password", "cassandra")

    val sc = new SparkContext(conf)

20 Accessing Data

    CREATE TABLE test.words (word text PRIMARY KEY, count int);
    INSERT INTO test.words (word, count) VALUES ('bar', 30);
    INSERT INTO test.words (word, count) VALUES ('foo', 20);

* Accessing table above as RDD:

    // Use table as RDD
    val rdd = sc.cassandraTable("test", "words")
    // rdd: CassandraRDD[CassandraRow] = CassandraRDD[0]

    rdd.toArray.foreach(println)
    // CassandraRow[word: bar, count: 30]
    // CassandraRow[word: foo, count: 20]

    rdd.columnNames   // Stream(word, count)
    rdd.size          // 2

    val firstRow = rdd.first
    // firstRow: CassandraRow = CassandraRow[word: bar, count: 30]
    firstRow.getInt("count")   // Int = 30

21 Saving Data

    val newRdd = sc.parallelize(Seq(("cat", 40), ("fox", 50)))
    // newRdd: org.apache.spark.rdd.RDD[(String, Int)] = ParallelCollectionRDD[2]

    newRdd.saveToCassandra("test", "words", Seq("word", "count"))

* RDD above saved to Cassandra:

    SELECT * FROM test.words;

     word | count
    ------+-------
      bar |    30
      foo |    20
      cat |    40
      fox |    50

    (4 rows)

22 Type Mapping

    CQL Type          Scala Type
    ----------------  ----------------------------------------------------------------
    ascii             String
    bigint            Long
    boolean           Boolean
    counter           Long
    decimal           BigDecimal, java.math.BigDecimal
    double            Double
    float             Float
    inet              java.net.InetAddress
    int               Int
    list              Vector, List, Iterable, Seq, IndexedSeq, java.util.List
    map               Map, TreeMap, java.util.HashMap
    set               Set, TreeSet, java.util.HashSet
    text, varchar     String
    timestamp         Long, java.util.Date, java.sql.Date, org.joda.time.DateTime
    timeuuid          java.util.UUID
    uuid              java.util.UUID
    varint            BigInt, java.math.BigInteger
    *nullable values  Option

23 Mapping Rows to Objects
* Mapping rows to Scala case classes
* CQL underscore case column mapped to Scala camel case property
* Custom mapping functions (see docs)

    CREATE TABLE test.cars (
      id text PRIMARY KEY,
      model text,
      fuel_type text,
      year int
    );

→

    case class Vehicle(
      id: String,
      model: String,
      fuelType: String,
      year: Int
    )

    sc.cassandraTable[Vehicle]("test", "cars").toArray
    // Array(Vehicle(KF334L, Ford Mondeo, Petrol, 2009),
    //       Vehicle(MT8787, Hyundai x35, Diesel, 2011))
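The underscore-to-camel-case rule from slide 23 is easy to show in isolation. The Python sketch below is not the driver's mapper. `to_camel` and `row_to_object` are hypothetical helpers, and the row is the first vehicle from the slide written as a plain dictionary.

```python
from dataclasses import dataclass, fields


def to_camel(name: str) -> str:
    """Convert an underscore_case column name to a camelCase property name."""
    head, *rest = name.split("_")
    return head + "".join(part.capitalize() for part in rest)


@dataclass
class Vehicle:
    id: str
    model: str
    fuelType: str   # mapped from the fuel_type column
    year: int


def row_to_object(row: dict, cls):
    """Build a cls instance from a row, matching columns to fields by camelCase name."""
    renamed = {to_camel(column): value for column, value in row.items()}
    wanted = {f.name for f in fields(cls)}
    return cls(**{name: value for name, value in renamed.items() if name in wanted})


row = {"id": "KF334L", "model": "Ford Mondeo", "fuel_type": "Petrol", "year": 2009}
car = row_to_object(row, Vehicle)
print(car)
```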

24 Server Side Data Selection
* Reduce the amount of data transferred
* Selecting columns

    sc.cassandraTable("test", "users").select("username").toArray.foreach(println)
    // CassandraRow{username: john}
    // CassandraRow{username: tom}

* Selecting rows (by clustering columns and/or secondary indexes)

    sc.cassandraTable("test", "cars").select("model").where("color = ?", "black").toArray.foreach(println)
    // CassandraRow{model: Ford Mondeo}
    // CassandraRow{model: Hyundai x35}
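Why server-side selection reduces data transfer can be shown with a small Python simulation. The `fetch` function plays the role of the database and counts the cells it would send. It is a made-up helper, and the third car and all colour values are invented for the example.

```python
cars = [
    {"id": "KF334L", "model": "Ford Mondeo", "color": "black", "year": 2009},
    {"id": "MT8787", "model": "Hyundai x35", "color": "black", "year": 2011},
    {"id": "XX0000", "model": "Example Car", "color": "red", "year": 2012},  # invented row
]


def fetch(table, columns=None, predicate=None):
    """Pretend server: filter and project before sending, report cells transferred."""
    selected = [row for row in table if predicate is None or predicate(row)]
    if columns is not None:
        selected = [{c: row[c] for c in columns} for row in selected]
    cells = sum(len(row) for row in selected)
    return selected, cells


# Client-side selection: transfer everything, then filter and project locally.
all_rows, cells_client_side = fetch(cars)
models_client = [r["model"] for r in all_rows if r["color"] == "black"]

# Server-side selection: the equivalent of .select("model").where("color = ?", "black").
rows, cells_server_side = fetch(cars, columns=["model"], predicate=lambda r: r["color"] == "black")
models_server = [r["model"] for r in rows]

print(cells_client_side, cells_server_side)
```

Both approaches return the same models, but the second sends only the cells the job needs.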

25 Spark SQL
[Diagram: component stack, labelled "Compatible". Spark SQL, Streaming, ML and Graph sit on Spark (general execution engine), which sits on Cassandra/HDFS.]

26 Spark SQL
* SQL query engine on top of Spark
* Hive compatible (JDBC, UDFs, types, metadata, etc.)
* Support for in-memory processing
* Pushdown of predicates to Cassandra when possible

27 Spark SQL Example

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.sql.cassandra.CassandraSQLContext
    import com.datastax.spark.connector._

    // Connect to the Spark cluster
    val conf = new SparkConf(true)...
    val sc = new SparkContext(conf)

    // Create Cassandra SQL context
    val cc = new CassandraSQLContext(sc)

    // Execute SQL query
    val df = cc.sql("SELECT * FROM keyspace.table WHERE ...")

28 Spark Streaming
[Diagram: component stack. Spark SQL, Streaming, ML and Graph sit on Spark (general execution engine), which sits on Cassandra/HDFS.]

29 Spark Streaming
* Micro batching
* Each batch represented as RDD
* Fault tolerant
* Exactly-once processing
* Unified stream and batch processing framework
[Diagram: a data stream split into a DStream, a sequence of RDDs.]
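Micro batching, as listed on slide 29, means cutting the stream into fixed time windows and running an ordinary batch computation on each one. The Python sketch below uses a made-up `micro_batches` helper and invented timestamped events. It skips empty windows, a simplification that a real streaming engine would not necessarily make.

```python
from collections import Counter, defaultdict


def micro_batches(events, batch_seconds):
    """Group (timestamp, line) events into consecutive fixed-length windows, in time order."""
    batches = defaultdict(list)
    for timestamp, line in events:
        batches[int(timestamp // batch_seconds)].append(line)
    return [batches[index] for index in sorted(batches)]


# Invented input: (arrival time in seconds, text line)
events = [(0.1, "foo bar"), (0.7, "foo"), (1.2, "bar bar"), (2.5, "foo")]

totals = Counter()
per_batch = []
for batch in micro_batches(events, batch_seconds=1):
    # Each batch is processed like a small, ordinary dataset.
    counts = Counter(word for line in batch for word in line.split(" "))
    per_batch.append(dict(counts))
    totals.update(counts)

print(per_batch)
print(dict(totals))
```

Because each window is just a batch, the same word count logic serves both streaming and batch jobs, which is the "unified" point on the slide.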

30 Streaming Example

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}
    import com.datastax.spark.connector.streaming._

    // Spark connection options
    val conf = new SparkConf(true)...

    // streaming with 1 second batch window
    val ssc = new StreamingContext(conf, Seconds(1))

    // stream input
    val lines = ssc.socketTextStream(serverIP, serverPort)

    // count words
    val wordCounts = lines.flatMap(_.split(" ")).map(word => (word, 1)).reduceByKey(_ + _)

    // stream output
    wordCounts.saveToCassandra("test", "words")

    // start processing
    ssc.start()
    ssc.awaitTermination()

31 Spark MLlib
[Diagram: component stack. Spark SQL, Streaming, ML and Graph sit on Spark (general execution engine), which sits on Cassandra/HDFS.]

32 Spark MLlib
* Classification (SVM, LogisticRegression, NaiveBayes, RandomForest)
* Clustering (K-means, LDA, PIC)
* Linear Regression
* Collaborative Filtering
* Dimensionality reduction (SVD, PCA)
* Frequent pattern mining
* Word2Vec

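33 Spark MLlib Example

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.mllib.classification.NaiveBayes
    import org.apache.spark.mllib.linalg.Vectors
    import org.apache.spark.mllib.regression.LabeledPoint
    import com.datastax.spark.connector._

    // Spark connection options
    val conf = new SparkConf(true)...
    val sc = new SparkContext(conf)

    // read data (the Iris case class and the class2id helper are not shown on the slide)
    val data = sc.cassandraTable[Iris]("test", "iris")

    // convert data to LabeledPoint
    val parsedData = data.map { i =>
      LabeledPoint(class2id(i.species), Vectors.dense(i.petal_l, i.petal_w, i.sepal_l, i.sepal_w))
    }

    // train the model
    val model = NaiveBayes.train(parsedData)

    // predict
    model.predict(Vectors.dense(5.0, 1.5, 6.4, 3.2))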
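For readers who want to see what a naive Bayes train/predict pair does, here is a compact multinomial naive Bayes with additive smoothing in plain Python. It is a sketch of the general technique, not MLlib's code. The `train` and `predict` functions and the four training points are made up for illustration and have nothing to do with the iris table.

```python
import math
from collections import defaultdict


def train(points, smoothing=1.0):
    """Fit log priors and log feature weights from (label, features) pairs.

    Feature values must be non-negative, as in a multinomial model.
    """
    n_features = len(points[0][1])
    label_counts = defaultdict(int)
    feature_sums = defaultdict(lambda: [0.0] * n_features)
    for label, features in points:
        label_counts[label] += 1
        for j, value in enumerate(features):
            feature_sums[label][j] += value

    model = {}
    for label, count in label_counts.items():
        total = sum(feature_sums[label])
        log_prior = math.log(count / len(points))
        log_theta = [
            math.log((s + smoothing) / (total + smoothing * n_features))
            for s in feature_sums[label]
        ]
        model[label] = (log_prior, log_theta)
    return model


def predict(model, features):
    """Return the label with the highest log posterior score."""
    def score(label):
        log_prior, log_theta = model[label]
        return log_prior + sum(x * t for x, t in zip(features, log_theta))
    return max(model, key=score)


# Invented training data: label 0 is dominated by feature 0, label 1 by feature 1.
training = [(0, [5.0, 1.0]), (0, [4.0, 0.0]), (1, [0.0, 4.0]), (1, [1.0, 6.0])]
model = train(training)
print(predict(model, [3.0, 0.0]), predict(model, [0.0, 5.0]))
```

On a cluster the per-label sums are computed in parallel over partitions and then merged, which is why this algorithm fits the map and reduce style well.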

34 APIs
* Scala: native
* Python: popular
* Java: ugly
* R: no closures

35 Questions?
