spark-testing-java Documentation, Release latest


Nov 20, 2018


Contents

1 Input data preparation
2 Java
   2.1 Context creation
   2.2 Data preparation
   2.3 Comparison
3 Scala
   3.1 Context creation
   3.2 Data preparation
   3.3 Comparison


Apache Spark has become widely used, Spark code has grown more complex, and integration tests have become important for checking code quality. Below are integration testing approaches with code samples. Two languages are covered, Java and Scala, in separate sections.

Testing steps

1. Resource allocation: a SparkContext/SparkSession is created for the test. It can be created manually, or an existing framework can be used;
2. Data preparation: an RDD/DataFrame is created in code or read from disk;
3. Running the functionality. There are two functionality types: a) reading data from storage, for which files have to be prepared; b) transformations, for which data can be created in code;
4. Comparison of expected and actual results.

Examples

Apple weight and color manipulation is used as the running example. Each section contains links to test projects available for download. A minimal sketch of the four steps is shown below.
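The following sketch walks through the four steps in Java. It assumes the Apple bean and an ApplesRepository with a mass method as in the example project (the exact signatures there may differ), plus the usual org.apache.spark.sql imports and JUnit's assertEquals; treat it as illustrative rather than the project's exact code.

// 1. Resource allocation: a local session with two cores
SparkSession spark = SparkSession.builder()
        .master("local[2]").appName("mass-test").getOrCreate();

// 2. Data preparation: a DataFrame built in code
List<Apple> apples = Arrays.asList(new Apple("green", 70), new Apple("red", 110));
Dataset<Row> input = spark.createDataFrame(apples, Apple.class);

// 3. Run the functionality under test
long actual = new ApplesRepository().mass(input);

// 4. Compare expected and actual
assertEquals(180L, actual);

spark.stop();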


CHAPTER 1

Input data preparation

Several approaches can be used.

Dynamically created in code, from:

- an empty dataset with a predefined structure
- one primitive
- a list of primitives
- a list of tuples
- a list of data objects

Pre-saved on disk, for checking reading and for data with complex structures, in the formats:

- parquet
- csv

Pre-saved and dynamically modified on the fly, to improve visibility when testing complex structures. A sketch of the pre-saved approach is shown below.
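As an illustration of the pre-saved approach, the sketch below writes a small parquet fixture once and reads it back in a test. The path and the Apple bean are assumptions in the spirit of the example project, not its exact code; a SparkSession named spark is taken as given.

// One-off fixture generation (run from a helper, not per test):
List<Apple> apples = Arrays.asList(new Apple("green", 70), new Apple("red", 110));
spark.createDataFrame(apples, Apple.class)
        .write().mode("overwrite")
        .parquet("src/test/resources/apples.parquet");

// In the test itself, read the pre-saved file back:
Dataset<Row> input = spark.read().parquet("src/test/resources/apples.parquet");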


CHAPTER 2

Java

Project with code examples on GitLab: spark-testing-java

The functionality under test is located in the package repository

2.1 Context creation

Code examples in the package: context

Manual

The asterisk means all available processor cores will be used. For most cases two cores are enough, and local[2] is appropriate.

SparkConf conf = new SparkConf()
        .setMaster("local[*]")
        .setAppName("test")
        .set("spark.ui.enabled", "false");
JavaSparkContext jsc = new JavaSparkContext(conf);
// close after usage
jsc.stop();

Code example: JavaSparkContextCreationTest.java

Framework: spark-testing-base

The spark-testing-base framework is hosted on GitHub. Drawback: the current version provides SparkContext and JavaSparkContext only, so some additional code is required to get a SparkSession; one possible approach is sketched below.
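One possible way to obtain a SparkSession on top of the context that spark-testing-base already started is the following sketch. It is an assumption about the wiring, not code from the framework; it relies on getOrCreate() reusing the active SparkContext rather than starting a second one.

// getOrCreate() picks up the SparkContext already created by
// SharedJavaSparkContext instead of allocating a new one.
SparkSession spark = SparkSession.builder().getOrCreate();
Dataset<Row> df = spark.range(3).toDF("id");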

Inclusion in the project with Maven

<dependency>
    <groupId>com.holdenkarau</groupId>
    <artifactId>spark-testing-base_2.11</artifactId>
    <version>2.3.1_0.10.0</version>
    <scope>test</scope>
</dependency>

Example of pom.xml

Enabling in test cases

Test classes have to extend SharedJavaSparkContext.

public class ApplesRepositoryTestingBaseIT extends SharedJavaSparkContext

SparkContext and JavaSparkContext

Can be obtained with the methods sc() and jsc():

Integer[] weights = {120, 150};
long expected = Arrays.stream(weights).reduce(0, (a, v) -> a + v);
Dataset<Row> df = new SQLContext(jsc())
        .createDataset(Arrays.asList(weights), Encoders.INT())
        .toDF("weight");
long actual = repository.mass(df);
assertEquals(expected, actual);

SQLContext creation

new SQLContext(jsc())

Code example: ApplesRepositoryTestingBaseIT.java

2.2 Data preparation

Code examples in the package: data_preparation

JavaRDD

Empty

jsc().emptyRDD();

From a primitive list

List<String> data = Lists.newArrayList("green", "red");
jsc().parallelize(data);

From an entity list

List<Apple> rows = Lists.newArrayList(new Apple("green", 70), new Apple("red", 110));
JavaRDD<Apple> result = jsc().parallelize(rows);

Example: RDDCreationTest.java

DataFrame (Dataset<Row> in Java)

Empty with a predefined structure

final Dataset<Row> actual = spark().emptyDataFrame().withColumn("color", lit("green"));
actual.printSchema();

Single primitive

Long value = 12L;
List<Row> rows = Collections.singletonList(RowFactory.create(value));
final Dataset<Row> actual = spark().createDataFrame(rows, Encoders.LONG().schema());

Primitive list

List<String> data = Lists.newArrayList("green", "red");
Dataset<Row> actual = spark().createDataset(data, Encoders.STRING()).toDF("color");

Row list with an assigned schema

List<Row> rows = Arrays.asList(RowFactory.create("green"), RowFactory.create("red"));
StructType schema = DataTypes.createStructType(
        new StructField[]{DataTypes.createStructField("color", DataTypes.StringType, false)});
final Dataset<Row> actual = spark().createDataFrame(rows, schema);

List of entities

List<Apple> rows = Lists.newArrayList(new Apple("green", 70), new Apple("red", 110));
final Dataset<Row> actual = spark().createDataFrame(rows, Apple.class);

Example: DataFrameCreationTest.java
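The Scala chapter below also shows a DataFrame with null values; a possible Java counterpart (my sketch, not from the example project) needs an explicit schema, since Spark cannot infer a column type from a null alone:

// The cast to Integer disambiguates the varargs call; the schema marks
// the column nullable so the null row is legal.
List<Row> rows = Collections.singletonList(RowFactory.create((Integer) null));
StructType schema = DataTypes.createStructType(new StructField[]{
        DataTypes.createStructField("weight", DataTypes.IntegerType, true)});
final Dataset<Row> actual = spark().createDataFrame(rows, schema);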

2.2.3 Dataset

From a list of entities

List<Apple> rows = Lists.newArrayList(new Apple("green", 70), new Apple("red", 110));
final Dataset<Apple> actual = spark().createDataset(rows, Encoders.bean(Apple.class));

Example: DataSetCreationTest.java

Dynamically modified on the fly

If the input data structure is complex, and only several fields take part in the transformation, pre-saved data can be read and the relevant fields modified on the fly:

@Test
public void testMass_whenWeightSpecified_thenWeightsSum() {
    int expected = 85;
    Dataset<Row> input = getBulkData().withColumn("weight", lit(85));
    assertEquals(expected, repository.mass(input));
}

@Test(expected = Exception.class)
public void testMass_whenWeightIsNull_thenNPE() {
    Dataset<Row> input = getBulkData().withColumn("weight", lit(null).cast(DataTypes.IntegerType));
    repository.mass(input);
}

private Dataset<Row> getBulkData() {
    return spark().read().option("header", "true").csv(getTestDataFolder());
}

Example: DynamicallyModifiedOnFlyTest.java

2.3 Comparison

Check the DataFrame structure

List<Apple> data = Lists.newArrayList(new Apple("Green", 85));
Dataset<Row> df = spark().createDataFrame(data, Apple.class);
assertEquals(Encoders.bean(Apple.class).schema(), df.schema());

Get one value

Apple apple = new Apple("Green", 85);
List<Apple> data = Lists.newArrayList(apple);
Dataset<Row> df = spark().createDataFrame(data, Apple.class);
Integer actual = df.first().getAs("weight");
assertEquals(apple.getWeight(), actual);

2.3.3 Primitives list - with as

List<String> data = Lists.newArrayList("green", "red");
Dataset<Row> df = spark().createDataset(data, Encoders.STRING()).toDF("color");
List<String> actual = df.select("color").as(Encoders.STRING()).collectAsList();

Compare DataFrames

List<Apple> data = Lists.newArrayList(new Apple("Green", 85));
Dataset<Row> expected = spark().createDataFrame(data, Apple.class);
Dataset<Row> actual = spark().createDataFrame(data, Apple.class);
assertEquals(0, expected.except(actual).count());

Example: ComparisonTest.java
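Note that except() is one-directional: the assertion above would not catch extra rows that appear only in actual. A hedged variant (my addition, suitable for small test data) checks both directions:

// Rows present in expected but missing from actual...
assertEquals(0, expected.except(actual).count());
// ...and rows present only in actual.
assertEquals(0, actual.except(expected).count());

Both checks are still set-based, so differences in duplicate row counts remain invisible to them.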


CHAPTER 3

Scala

ScalaTest FunSuite is used as the testing framework. Project with code examples on GitLab: spark-testing-scala

The functionality under test is located in the package repository

3.1 Context creation

The test library shipped with the Spark distribution is the best choice as a base for integration testing; it is referred to here as spark-test-jar.

Code examples in the package: context

Manual

Two cores (local[2]) are used in this example.

val spark = SparkSession.builder
  .appName(getClass().getSimpleName)
  .master("local[2]")
  .getOrCreate()
// use it
spark.close()

Code example: ManualSpec.scala
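For a manually managed session, a common wiring (a sketch assuming ScalaTest 3.0-era imports, not code from ManualSpec.scala) creates the session once per suite and stops it in afterAll:

import org.apache.spark.sql.SparkSession
import org.scalatest.{BeforeAndAfterAll, FunSuite}

class ManualSessionSpec extends FunSuite with BeforeAndAfterAll {
  @transient private var spark: SparkSession = _

  override def beforeAll(): Unit = {
    // One session for the whole suite keeps test startup cost down.
    spark = SparkSession.builder
      .appName(getClass.getSimpleName)
      .master("local[2]")
      .getOrCreate()
  }

  override def afterAll(): Unit = {
    // Release the context even if a test failed.
    if (spark != null) spark.stop()
  }

  test("session is usable") {
    assert(spark.range(3).count() === 3)
  }
}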

3.1.2 Framework: spark-test-jar

Inclusion in the project with Maven

<dependency>
    <groupId>${spark.groupId}</groupId>
    <artifactId>spark-sql_${scala.suffix}</artifactId>
    <version>${spark.version}</version>
    <scope>provided</scope>
    <type>test-jar</type>
</dependency>
<dependency>
    <groupId>${spark.groupId}</groupId>
    <artifactId>spark-core_${scala.suffix}</artifactId>
    <version>${spark.version}</version>
    <scope>provided</scope>
    <type>test-jar</type>
</dependency>
<dependency>
    <groupId>${spark.groupId}</groupId>
    <artifactId>spark-catalyst_${scala.suffix}</artifactId>
    <version>${spark.version}</version>
    <scope>provided</scope>
    <type>test-jar</type>
</dependency>

Example of pom.xml

Enabling in test cases

QueryTest and SharedSQLContext have to be extended to enable the resources.

class SparkTestJarSpec extends QueryTest with SharedSQLContext with Matchers

SparkSession and SparkContext

Can be obtained with the methods spark and sparkContext:

val weights = List(120, 150)
val expected = weights.reduce((a, v) => a + v)
val df = spark.createDataset(weights).toDF("weight")

Code example: SparkTestJarSpec.scala

3.2 Data preparation

Code examples in the package: data_preparation

3.2.1 RDD

Empty RDD with a predefined schema

val rdd = sparkContext.emptyRDD[Apple]

List of primitives

val data = List(1, 2, 3)
val rdd = sparkContext.parallelize(data)

List of entities

val expected = Apple("Green", 120)
val rdd = sparkContext.parallelize(List(expected))

Example: RDDCreationSpec.scala

DataFrame

Empty with a predefined structure

val df = Seq.empty[(String, Int)].toDF("color", "weight")

Empty with a case class structure

val df = Seq.empty[Apple].toDF()

Empty with a struct field

val df = spark.emptyDataFrame.withColumn("apple",
  struct(lit("green") as "color", lit(110) as "weight"))

From a primitive list

val df = List(1, 2, 3).toDF("values")

From tuples (often used)

val df = List(("green", 70), ("red", 110)).toDF("color", "weight")
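When a tuple column needs to be nullable, Option values are a convenient variant (my sketch, assuming the usual spark.implicits._ import behind the examples in this chapter): Some becomes a value and None becomes null.

// "weight" is inferred as a nullable integer column; the None row ends up as null.
val df = List(("green", Some(70)), ("rotten", None)).toDF("color", "weight")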

Array field

val df = List(
  Array("red", "green", "yellow"),
  Array("green", "yellow")
).toDF()

With null values

List(null.asInstanceOf[Integer]).toDF("color")

Example: DataFrameCreationSpec.scala

Dataset

Empty

val df = Seq.empty[Apple].toDS()

From a list

List(Apple("green", 70), Apple("red", 110)).toDS()

From a DataFrame

List(("green", 70)).toDF("color", "weight").as[Apple]

Example: DataSetCreationSpec.scala

Dynamically modified on the fly

If the input data structure is complex, and only several fields take part in the transformation, pre-saved data can be read and the relevant fields modified on the fly.

test("mass When weight specified Then weights sum") {
  val expected = 85
  val input = getBulkData.withColumn("weight", lit(85))
  repository.mass(input) shouldBe expected
}

test("mass When weight is null Then NPE") {
  val input = getBulkData.withColumn("weight", lit(null).cast(DataTypes.IntegerType))
  intercept[RuntimeException] {
    repository.mass(input)
  }
}

private def getBulkData = spark.read.option("header", "true").csv(getTestDataFolder)

Example: DynamicallyModifiedOnFlySpec.scala

3.3 Comparison

One value - with first

val apple = Apple("Green", 85)
val df = List(apple).toDF()
val actual: Int = df.first.getAs("weight")
actual shouldEqual apple.weight

Primitives list - with as

val df = List("Green", "Red").toDF("color")
val actual = df.select("color").as(Encoders.STRING).collect()

DataFrames - with checkAnswer

val apples = List(Apple("Green", 85))
val expected = apples.toDF()
val actual = apples.toDF()
checkAnswer(expected, actual)

Example: ComparisonSpec.scala
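checkAnswer compares the collected results ignoring row order. Where the spark-test-jar helpers are not on the classpath, a hedged alternative for small test data (my sketch, assuming the Matchers trait is mixed in as above) is to collect both sides and use ScalaTest matchers:

val expectedApples = List(Apple("Green", 85), Apple("Red", 110))
val actualDf = expectedApples.toDF()
// Order-insensitive comparison of the collected rows.
actualDf.as[Apple].collect() should contain theSameElementsAs expectedApples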
