Backtesting with Spark


Patrick Angeles, Cloudera · Sandy Ryza, Cloudera · Rick Carlin, Intel · Sheetal Parade, Intel

Traditional Grid
- Shared storage
- Storage and compute scale independently
- Bottleneck on the I/O fabric
- Typically exotic hardware
- Proprietary schedulers
- Homegrown application frameworks

Hadoop Cluster
- Shared nothing: storage and compute scale together
- Hierarchical architecture (core switch with per-rack ToR switches) minimizes data transfer
- Typically commodity hardware
- Open source schedulers and frameworks

Implementation Choices
- Storage engine: HBase, HDFS + CSV, HDFS + Parquet, HDFS + Avro files
- Processing language: SQL, native (C / C++), JVM (Java, Scala), Python
- Processing framework: Hive / Impala, MR4C, MapReduce, Spark

Spark in 60 Seconds
- Started in 2009 by Matei Zaharia at UC Berkeley's AMPLab
- Advancements over MapReduce: DAG execution engine, takes advantage of system memory, optimistic fault tolerance using lineage, 10x-100x faster
- Supports applications written in Scala, Java, and Python
- Rich ecosystem: Spark Streaming, SparkR, Spark SQL, MLlib, GraphX
- Strong community: ~40 committers, hundreds of contributors, multi-vendor support

BLASH: Algorithm Implementation
- Data layout: one file per symbol for a year
- Pipelining to avoid re-reading the data
- Process the order book for all symbols; sort and filter the results at the end
- Unit testing: separation of concerns keeps the parallelization apart from the algorithm; automated verification of correctness
- Optimizations: use trending to reduce expensive method invocations; keep memory in check by processing one order at a time
- ~2 weeks of effort, including coding, data generation, and running benchmarks
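The unit-testing point above, keeping the algorithm separate from its parallelization, can be sketched as a pure per-symbol function. This is an illustrative stand-in, not the actual BLASH code; `process_symbol` and its simple mark-to-market P&L logic are invented for the example:

```python
# Illustrative sketch, not the actual BLASH code: the backtest logic is a
# pure function over one symbol's order stream, so it can be unit-tested
# locally and handed to Spark for parallelization as a separate concern.

def process_symbol(orders):
    """Process one symbol's orders one at a time, keeping memory in check.

    Each order is a (side, qty, price) tuple; returns mark-to-market P&L.
    """
    position = 0
    pnl = 0.0
    last_price = None
    for side, qty, price in orders:
        if last_price is not None:
            pnl += position * (price - last_price)  # mark position to market
        position += qty if side == "BUY" else -qty
        last_price = price
    return pnl

# Automated verification on a tiny hand-checked stream: buy 100 at 10,
# price moves to 11, so the held position gains 100 * (11 - 10) = 100.
print(process_symbol([("BUY", 100, 10.0), ("SELL", 100, 11.0)]))  # 100.0
```

With the algorithm isolated like this, the Spark side reduces to mapping the function over per-symbol inputs, which matches the one-file-per-symbol layout described above.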

Setup: Hardware and Software
Hardware:
- 1 master, 1 management node, 12 workers
- Per worker: 2 x E5-2695 v2 @ 2.40 GHz (24 physical cores), 96 GB RAM, 8 x 1 TB SAS drives, 10 Gb Ethernet
Software:
- RHEL 6.6
- CDH 5.4.0: Apache Hadoop 2.6.0, Apache Spark 1.3, Apache Parquet 1.5

Setup: Spark Settings
- 4 cores and 4 GB of memory per executor
- Given 288 total physical cores, a theoretical maximum of 72 executors for the entire cluster, or 144 executors taking hyperthreading into account
- In reality, the effective core count is somewhere in between
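The executor arithmetic on this slide is straightforward division; as a quick check, using the figures from the hardware slide above:

```python
# Executor-count arithmetic from the cluster described above.
workers = 12
cores_per_worker = 24        # 2 x 12-core E5-2695 v2
cores_per_executor = 4       # matches the 4-core executor setting

physical_cores = workers * cores_per_worker             # 288
max_executors = physical_cores // cores_per_executor    # 72
max_with_ht = physical_cores * 2 // cores_per_executor  # 144 with hyperthreading

# The corresponding spark-submit flags would be, e.g.:
#   --executor-cores 4 --executor-memory 4g
print(physical_cores, max_executors, max_with_ht)  # 288 72 144
```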

Data Set

Format            Avg file size (high / low)   Total size, 1 year
CSV               7.5 GB / 550 MB              8.23 TB
Parquet + Snappy  2.2 GB / 164 MB              2.46 TB

Data specs:
- Full market symbols
- 251 trading days (1 year)
- Simulated high- and low-volume instruments
- Simulated high-volume trading days
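The totals above imply roughly a 3.3x storage reduction going from CSV to Parquet with Snappy compression:

```python
# Compression ratio implied by the data-set figures above.
csv_total_tb = 8.23
parquet_total_tb = 2.46
ratio = csv_total_tb / parquet_total_tb
print(round(ratio, 2))  # 3.35
```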

Vertical Scaling Test
[chart: y-axis 10.00-26.00, x-axis 40-180]

Parquet vs CSV
[chart: CSV vs. Parquet series; y-axis 0.00-45.00, x-axis 90 / 72 / 54]

Horizontal Scaling Test
[chart: y-axis 0.00-40.00, x-axis 4-13]

Observations
- Lots of room for optimization: refactor code to avoid expensive operations; pre-process common data such as order books, moving averages, and bars
- No built-in way of dealing with split time-series data: processing is local only for the first split
- Workaround: use bigger HDFS block sizes
- Better: an API to process file splits sequentially, with the ability to pass intermediate state to the next task
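The bigger-block workaround can be applied when loading the data; a sketch using the standard HDFS shell, where the path, file name, and 1 GB block size are all illustrative:

```shell
# Write per-symbol files with a larger HDFS block size (here 1 GB) so a
# year-long time series spans fewer splits and more processing stays local.
# Path and file name are illustrative.
hdfs dfs -D dfs.blocksize=1073741824 -put SYM-2014.parquet /data/market/
```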

Thank you! @patrickangeles