Applied Spark. From Concepts to Bitcoin Analytics. Andrew F.

1 Applied Spark From Concepts to Bitcoin Analytics Andrew F.

2 My Day Job CTO, Pogoseat Upgrade technology for live events 3/28/16 QCON-SP Andrew Hart 2

3 Additionally
- Member, Apache Software Foundation

4 Additionally
- Founder, Data Fluency
- Software consultancy specializing in appropriate data solutions for startups and SMBs
- Help clients make good decisions and leverage power traditionally accessible only to big business

5 Previously
- NASA Jet Propulsion Laboratory
- Building data management pipelines for research missions in many domains (climate, cancer, Mars, radio astronomy, etc.)

6 Apache Spark

7 Spark is
- General-purpose cluster computing software that maximizes use of cluster memory to process data
- Used by hundreds of organizations to realize performance gains over previous-generation cluster compute platforms, particularly for MapReduce-style problems

8 Spark
- Was developed at the Algorithms, Machines, and People Lab (AMPLab) at the University of California, Berkeley in 2009
- Is open source software presently under the governance of the Apache Software Foundation

9 Why Spark Exists
Confluence of three trends:
- Increased volume of digital data
- Decreasing cost of computer memory (RAM)
- Data processing technology liberation

10 Digital data volume: Early days
- Low-resolution sensors, comparatively few people on the internet
- Proprietary data, custom solutions, expensive, custom hardware

11 Digital data volume: Modern era
- Ubiquitous, high-resolution cameras
- Mobile devices packed with sensors
- IoT
- Open source software, cheap commodity hardware

12 We've gone from this: Internet

13 To this (Humanity's global communication platform)

14 1.5 billion connected PCs, 3.2 billion connected people, 6 billion connected mobile devices

15 What are we doing with all of this?
Every minute:
- 18,000 votes cast on Reddit
- 51,000 apps downloaded by Apple users
- 350,000 tweets posted on Twitter
- 4,100,000 likes recorded on Facebook

16 We are awash in data
- Monetizing this data is a core competency for many businesses
- Need tools to do this effectively at today's scale

17 2. Tool support and technology liberation

18 How far do you want to go back?
- We have always used tools to help us cope with data

19 VisiCalc
- Early "big data" tool
- Allowed business to move from the chalkboard to the digital spreadsheet
- Phenomenal increase in productivity running numbers for business

20 Modern-era Spreadsheet Tech
- Microsoft Excel: 1,048,576 rows x 16,384 columns

21 Open Source Alternatives Exist
- Microsoft Excel: 1,048,576 rows x 16,384 columns
- Apache OpenOffice: 1,048,576 rows x 1,024 columns

22 Relational Database Systems
- Support thousands of tables, millions of rows

23 Relational Database Systems
- Support thousands of tables, millions of rows
- Viable open source alternatives exist for many use cases

24 Modern Big Data Era
- MapReduce algorithm (2004): parallelize large-scale computation across clusters of servers

25 Modern Big Data Era
- Hadoop: open source processing framework for MapReduce applications to run on large clusters of commodity (unreliable) hardware
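
To make the MapReduce idea concrete, here is a minimal single-machine sketch of the three phases (map, shuffle, reduce) applied to word counting. It uses only the Python standard library and is not Hadoop code; the function names and the sample lines are assumptions chosen for illustration. On a real cluster, the map and reduce phases would run in parallel on many servers.

```python
from collections import defaultdict
from functools import reduce


def map_phase(lines):
    """Map: emit a (word, 1) pair for every word in every input line."""
    for line in lines:
        for word in line.split():
            yield (word.lower(), 1)


def shuffle_phase(pairs):
    """Shuffle: group all emitted values by their key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Reduce: combine the values of each key into a single result."""
    return {key: reduce(lambda a, b: a + b, values)
            for key, values in groups.items()}


# Hypothetical input, standing in for lines read from files on a cluster.
lines = ["spark and hadoop", "hadoop and mapreduce"]
word_counts = reduce_phase(shuffle_phase(map_phase(lines)))
print(word_counts)
```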

26 3. Commoditization of computer memory

27 Early Days
- Main memory was hand made. You could see each bit.

28 Modern Era
- AWS EC2 r3.8xlarge: 244 GiB for US$1.41/hr*

29 Why talk about memory?
- Ability to use memory efficiently distinguishes Spark from Hadoop and contributes to its speed advantages in many scenarios

30 How Spark Works

31 Primary abstraction in Spark: Resilient Distributed Datasets (RDD)
- Immutable (read-only), partitioned dataset
- Processed in parallel on each cluster node
- Fault-tolerant: resilient to node failure

32 Primary abstraction in Spark: Resilient Distributed Datasets (RDD)
- Uses the distributed memory of the cluster to store the state of a computation as a sharable object across jobs (instead of serializing to disk)

33 Traditional MapReduce (diagram):
Data on Disk -> HDFS Read -> Map-1 ... Map-n -> HDFS Write -> Tuples on Disk -> HDFS Read -> Reduce-1 ... Reduce-n -> HDFS Write -> Tuples on Disk

34 Spark RDD Architecture (diagram):
Data on Disk -> HDFS Read -> Map-1 ... Map-n -> RDD in Cluster Memory -> Reduce-1 ... Reduce-n -> HDFS Write -> Data on Disk

35 Unified computational model:
- Spark unifies batch & streaming models, which traditionally require different architectures
- Sort of like taking a limit in calculus: if you imagine a time-series set of RDDs with ever smaller windows of time, you can approximate streaming workflows
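
The "ever smaller windows" idea can be sketched in plain Python: the same batch computation is applied to each window of a timestamped stream, and shrinking the window moves the result closer to per-event streaming. This is a toy model rather than Spark Streaming code; the `micro_batches` helper and the sample events are hypothetical.

```python
from collections import defaultdict


def micro_batches(events, window_seconds):
    """Group (timestamp, value) events into consecutive fixed-size windows."""
    batches = defaultdict(list)
    for timestamp, value in events:
        window_start = int(timestamp // window_seconds) * window_seconds
        batches[window_start].append(value)
    return dict(sorted(batches.items()))


# Hypothetical stream of (seconds since start, trade amount) events.
events = [(0.5, 10), (1.2, 20), (1.9, 5), (3.1, 7)]

# The same batch job (a sum) is run on every window.
coarse = {start: sum(vals) for start, vals in micro_batches(events, 2).items()}
fine = {start: sum(vals) for start, vals in micro_batches(events, 1).items()}

print(coarse)  # 2-second windows
print(fine)    # 1-second windows: more batches, each closer to a single event
```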

36 Two ways to create an RDD in Spark programs:
- "Parallelize" an existing collection (e.g., a Python list) in the driver program
- Reference a dataset on external storage
  - text files on disk
  - anything with a supported Hadoop InputFormat
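
In PySpark these two routes correspond to `sc.parallelize(...)` and `sc.textFile(...)`. The sketch below does not call Spark; it is a standard-library model of what each route produces, namely a dataset split into partitions. The helper names, the partitioning rule, and the file name are assumptions for illustration and do not reflect how Spark actually slices data.

```python
import os
import tempfile


def parallelize(collection, num_partitions):
    """Split an in-memory collection into roughly equal partitions."""
    items = list(collection)
    size, extra = divmod(len(items), num_partitions)
    partitions, start = [], 0
    for i in range(num_partitions):
        end = start + size + (1 if i < extra else 0)
        partitions.append(items[start:end])
        start = end
    return partitions


def text_file(path, num_partitions):
    """Load a text file as a partitioned dataset, one record per line."""
    with open(path, encoding="utf-8") as handle:
        lines = [line.rstrip("\n") for line in handle]
    return parallelize(lines, num_partitions)


# Route 1: an existing collection in the driver program.
from_collection = parallelize(range(10), 3)

# Route 2: a dataset on external storage (here, a temporary text file).
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "trades.txt")  # hypothetical file name
    with open(path, "w", encoding="utf-8") as handle:
        handle.write("a\nb\nc\n")
    from_file = text_file(path, 2)

print(from_collection)
print(from_file)
```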

37 RDDs from RDDs:
- RDDs are immutable (read-only)
- The result of applying a transformation to an RDD is a new RDD (an action returns a value rather than an RDD)
- RDDs can be persisted to memory, reducing the need to re-compute the RDD every time

38 Writing Spark Programs:
- Think of Spark programs as a way to describe a sequence of transformations and actions that should be applied to an RDD

39 Writing Spark Programs:
- Transformations create a new dataset from an existing one (e.g., map)
- Actions return a value after running a computation (e.g., reduce)

40 Writing Spark Programs:
Spark provides a rich set of transformations:
map, flatMap, filter, sample, union, intersection, distinct, groupByKey, sortByKey, cogroup, pipe, join, cartesian, coalesce

41 Writing Spark Programs:
Spark provides a rich set of actions:
reduce, collect, count, first, take, takeSample, takeOrdered, saveAsTextFile, countByKey, foreach

42 Writing Spark Programs:
- Transformations are lazily evaluated: they are only computed when a subsequent action (which must return a result) is run
- Transformations get recomputed each time an action is run (unless you explicitly persist the resulting RDD to memory)
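
The lazy-evaluation and persistence behavior described above can be modeled with a small class. This is a toy stand-in written for this transcript, not Spark's RDD implementation: `LazyDataset` and its `compute_count` attribute are hypothetical names, and everything runs on one machine. Transformations only record a step, actions replay the recorded steps, and `persist()` keeps the computed result so later actions skip the replay.

```python
from functools import reduce


class LazyDataset:
    """Toy model of an RDD: lazily evaluated, optionally cached."""

    def __init__(self, source, steps=()):
        self._source = source     # original input data
        self._steps = steps       # recorded transformations (the lineage)
        self._persist = False
        self._cache = None
        self.compute_count = 0    # how many times the lineage was replayed

    def map(self, fn):
        # Transformation: returns a new dataset, computes nothing yet.
        return LazyDataset(self._source, self._steps + (("map", fn),))

    def filter(self, fn):
        # Transformation: returns a new dataset, computes nothing yet.
        return LazyDataset(self._source, self._steps + (("filter", fn),))

    def persist(self):
        # Ask for the computed result to be kept after the first action.
        self._persist = True
        return self

    def _compute(self):
        if self._cache is not None:
            return self._cache
        self.compute_count += 1
        data = list(self._source)
        for kind, fn in self._steps:
            if kind == "map":
                data = [fn(x) for x in data]
            else:
                data = [x for x in data if fn(x)]
        if self._persist:
            self._cache = data
        return data

    def collect(self):
        # Action: triggers the computation and returns all elements.
        return list(self._compute())

    def reduce(self, fn):
        # Action: triggers the computation and returns a single value.
        return reduce(fn, self._compute())


odd_squares = LazyDataset(range(1, 6)).map(lambda x: x * x).filter(lambda x: x % 2 == 1)
total = odd_squares.reduce(lambda a, b: a + b)
items = odd_squares.collect()
replays_without_persist = odd_squares.compute_count  # one replay per action

cached = LazyDataset(range(1, 6)).map(lambda x: x * x).persist()
cached.collect()
cached.reduce(lambda a, b: a + b)
replays_with_persist = cached.compute_count  # only the first action computes
```

In PySpark the analogous calls are `rdd.map`, `rdd.filter`, `rdd.reduce`, `rdd.collect`, and `rdd.persist()` (or `rdd.cache()`).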

43 Structure of Spark Programs:
Spark programs have two principal components:
- Driver program
- Worker function

44 Structure of Spark Programs: Driver program
- Executes on the master node
- Establishes context

45 Structure of Spark Programs: Worker (processing) function
- Executes on each worker node
- Computes transformations and actions on RDD partitions

46 Structure of Spark Programs: SparkContext
- Holds all of the information about the cluster
- Manages what gets shipped to nodes
- Makes life extremely easy for developers

47 Structure of Spark Programs: Shared Variables
- Broadcast variables: efficiently share static data with the cluster nodes
- Accumulators: variables that workers can only add to (write-only from the worker's side), which serve as counters
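
The two kinds of shared variables can be illustrated without a cluster. In the sketch below a plain dictionary plays the role of a broadcast variable (static lookup data that the worker function only reads) and a small `Accumulator` class plays the role of an add-only counter. The class, the exchange names, and the records are hypothetical; in PySpark the corresponding calls would be `sc.broadcast(...)` and `sc.accumulator(0)`.

```python
class Accumulator:
    """Counter that workers may only add to; the driver reads the total."""

    def __init__(self):
        self._value = 0

    def add(self, amount):
        self._value += amount

    @property
    def value(self):
        return self._value


# "Broadcast" data: a static lookup table that workers read but never change.
# The exchange names and currencies are made up for this example.
exchange_currency = {"exchange_a": "USD", "exchange_b": "EUR"}
unknown_exchanges = Accumulator()


def tag_currency(partition):
    """Worker function: label each (exchange, amount) record with a currency."""
    tagged = []
    for exchange, amount in partition:
        currency = exchange_currency.get(exchange)
        if currency is None:
            unknown_exchanges.add(1)  # counted on the side, not in the result
            continue
        tagged.append((currency, amount))
    return tagged


# Two partitions, processed here one after another instead of in parallel.
partitions = [
    [("exchange_a", 1.5), ("exchange_x", 2.0)],
    [("exchange_b", 0.5)],
]
results = [tag_currency(p) for p in partitions]

print(results)
print(unknown_exchanges.value)
```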

49 Interacting with Spark

50 Spark APIs:
- Spark provides APIs for several languages: Scala, Java, Python, R, SQL

51 Using Spark:
There are two main ways to leverage Spark:
- Interactively, through the included command line interface
- Programmatically, via standalone programs submitted as jobs to the master node

52 Using Spark:
- Spark is GREAT for experimenting
- Run experimental programs on small sample datasets on a single machine
- To scale up, simply re-target the SparkContext to point to the master node of a Spark cluster

53 Bitcoin

54 Bitcoin is
- A decentralized digital currency: value is exchanged directly between individuals with no need for a traditional central authority (e.g., a bank)
- A global payment network: transactions are broadcast and verified by network peers, with each using a complete copy of the network transaction history (the "blockchain")

56 Bitcoin Protocol
- Open source software ( for processing peer-to-peer financial transactions with zero trust
- The backbone of an experimental financial system secured by math (instead of a trusted authority)

57 Bitcoin: Open Data
- Unprecedented transparency into the workings of a financial network
- Peer-to-peer payments made using Bitcoin
- Trading markets speculating on Bitcoin value
- Global, 24x7, free
- Interesting research datasets abound

58 Bitcoin Exchange Companies

59 Bitcoin exchange logos

62 Spark Demos

63 The Hardware: Master Node 6 CPU Cores 8GB RAM 6 CPU Cores 8GB RAM 6 CPU Cores 8GB RAM

64 The Goal:
- Demonstrate experimenting with Spark CLI
- Demonstrate program execution on a standalone Spark cluster

65 The Data:
- 1 month of log files containing transactions from 10 global exchanges (in 3 currencies)
- 1 trade per line, JSON encoded
- Organized into files by exchange, currency, and date

66 The Question:
- What is the cumulative value of all "buy"-type transactions in the dataset?
- Bonus: bucketed by fiat currency
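
The demo code itself is not part of this transcript. As a rough single-machine sketch of the question, the snippet below parses one JSON trade per line, keeps the "buy" trades, and sums their value per fiat currency. The field names (`type`, `currency`, `price`, `amount`), the definition of value as price times amount, and the sample records are all assumptions; the real log format may differ. In Spark the same shape would be a map, a filter, and a reduce by key over an RDD of lines.

```python
import json
from collections import defaultdict

# Hypothetical sample records; the real logs may use different field names.
log_lines = [
    '{"type": "buy", "currency": "USD", "price": 400.0, "amount": 0.5}',
    '{"type": "sell", "currency": "USD", "price": 401.0, "amount": 1.0}',
    '{"type": "buy", "currency": "EUR", "price": 350.0, "amount": 2.0}',
    '{"type": "buy", "currency": "USD", "price": 402.0, "amount": 0.25}',
]


def buy_value_by_currency(lines):
    """Sum price * amount over all buy trades, bucketed by fiat currency."""
    trades = (json.loads(line) for line in lines)        # map: parse JSON
    buys = (t for t in trades if t["type"] == "buy")     # filter: buys only
    totals = defaultdict(float)
    for trade in buys:                                   # reduce by key
        totals[trade["currency"]] += trade["price"] * trade["amount"]
    return dict(totals)


totals = buy_value_by_currency(log_lines)
grand_total = sum(totals.values())

print(totals)
print(grand_total)
```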

68 Wrap-up

69 Things to keep in mind using Spark:
- Speed comes from avoiding writes to disk
- Allocate enough memory in your cluster to hold your data completely in memory
- Once data fits in memory, adding more memory is not going to boost performance

70 Things to keep in mind using Spark:
- Once data fits in memory, most apps are CPU or network bound
- Allocate more cores in your cluster to increase parallelism (and tune Spark to use them!)
- Have disks around to handle spillover, and configure Spark to reduce unnecessary writes

71 image credit: Spark ecosystem:

72 image credit: Spark ecosystem (BDAS Stack):

73 Spark at the ASF:
- Entered Incubator in 2013
- Graduated to "Top Level" project
- committers, 800+ contributors
- Outstanding documentation
- Community mailing lists
- Updated project status

74 Thank you! Contact: Web:
