Spark on Ceph at UPSud/LAL

1 Spark on Ceph at UPSud/LAL
- What Spark is about
- Why Spark on Ceph?
- Implementation ideas
Julien Nauroy, Spark on Ceph

2 What Spark is about
- Spark is a computing framework
  - Similar to Hadoop MapReduce from afar
  - Many more use cases: machine learning, bioinformatics, ...
- Key concept: Resilient Distributed Dataset (RDD)
  - Tries to fit the dataset into RAM

3 What Spark is about
- Spark runs on a cluster
  - Uses YARN, Mesos, or standalone mode
- Reads from and writes to distributed filesystems
  - HDFS, S3, ...
  - Not to Ceph (yet)
- Preferably uses HDFS
  - Data locality, but this doesn't make sense in VMs
  - Uses rename on writes: possible problem

4 Experiments at UPSud
- Life sciences
  - DNA/RNA sequence alignment
  - Galaxy on Spark
  - Simulating turtle embryo growth
- Astrophysics
  - Image coaddition
  - Cross-matching catalogs (CDS Strasbourg)

5 How HDFS works
- Split files into blocks
  - Split on data structure boundaries (e.g. line)
  - Indicative size: 8MB
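
The splitting described on this slide can be sketched in a few lines of Python. This is a minimal illustration of the slide's simplified picture, not actual HDFS code: the function name `split_into_blocks` and the `target_size` parameter are hypothetical, and the block size is left to the caller.

```python
def split_into_blocks(data: bytes, target_size: int) -> list[bytes]:
    """Cut data into blocks of roughly target_size bytes.

    Hypothetical helper: each block is extended to the next newline so that
    no line is cut in half, as the slide describes.
    """
    if target_size <= 0:
        raise ValueError("target_size must be positive")
    blocks = []
    start = 0
    while start < len(data):
        end = min(start + target_size, len(data))
        if end < len(data):
            # Look for the newline that closes the current line
            # (starting at the last byte already inside the block).
            newline = data.find(b"\n", end - 1)
            end = len(data) if newline == -1 else newline + 1
        blocks.append(data[start:end])
        start = end
    return blocks


if __name__ == "__main__":
    for block in split_into_blocks(b"aaa\nbbbb\ncc\n", 5):
        print(block)
```

Blocks may therefore be slightly larger than the target size, which is the price of keeping every line whole.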

6 How HDFS works
- Copy each block on multiple nodes
[Diagram: Node A, Node B, Node C, Node D, Node E]

7 How HDFS works
- Copy each block on multiple nodes
  - In general, 3 copies
[Diagram: Node A, Node B, Node C, Node D, Node E]
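
As a rough illustration of block replication, the sketch below assigns each block to several distinct nodes in round-robin order. The function `place_replicas` is hypothetical, and round-robin is an assumption made for simplicity; real HDFS placement is rack-aware and more involved. The default of 3 replicas matches the usual HDFS replication factor.

```python
def place_replicas(num_blocks: int, nodes: list[str], replication: int = 3) -> dict[int, list[str]]:
    """Map each block id to the nodes holding a copy of it.

    Hypothetical round-robin placement: block i goes to nodes i, i+1, ...
    (wrapping around), so the copies of one block never share a node.
    """
    if replication > len(nodes):
        raise ValueError("not enough nodes for the requested replication")
    placement = {}
    for block_id in range(num_blocks):
        placement[block_id] = [
            nodes[(block_id + offset) % len(nodes)] for offset in range(replication)
        ]
    return placement


if __name__ == "__main__":
    cluster = ["Node A", "Node B", "Node C", "Node D", "Node E"]
    for block_id, holders in place_replicas(5, cluster).items():
        print(block_id, holders)
```

Losing any single node still leaves two copies of every block on the remaining nodes.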

12 How MapReduce works
- Select nodes on which to run computations
  - Data has to be node-local (if possible)
[Diagram: Node A, Node B, Node C, Node D, Node E]

14 How MapReduce works
- Select nodes on which to run computations
  - The node must not be busy
[Diagram: Node A, Node B, Node C, Node D, Node E]
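
The two selection rules from these slides (prefer a node that holds the data, skip busy nodes) can be sketched as follows. The function `pick_node` and its arguments are hypothetical, and the fallback to any idle node is an assumption about what "if possible" means; actual schedulers use more elaborate policies.

```python
from typing import Optional


def pick_node(
    replica_holders: list[str], busy: set[str], all_nodes: list[str]
) -> tuple[Optional[str], bool]:
    """Choose a node for a task reading one block.

    Returns (node, is_local). Hypothetical policy: first try an idle node
    that stores a copy of the block, then fall back to any idle node, which
    has to fetch the data over the network.
    """
    for node in replica_holders:
        if node not in busy:
            return node, True
    for node in all_nodes:
        if node not in busy:
            return node, False
    return None, False  # every node is busy, the task has to wait


if __name__ == "__main__":
    cluster = ["Node A", "Node B", "Node C", "Node D", "Node E"]
    holders = ["Node A", "Node B", "Node C"]
    print(pick_node(holders, {"Node A"}, cluster))
    print(pick_node(holders, {"Node A", "Node B", "Node C"}, cluster))
```

With storage on a separate Ceph cluster, the first loop never finds a match, which is why data locality stops being meaningful for VMs.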

17 Why Spark on Ceph?
- Spark clusters in VMs work great
  - For computations at least
  - Main usage of Spark (public clouds)
- Spark requires distributed storage
  - HDFS, S3, NFS
- HDFS in a VM will not solve the problem
  - HDFS over Ceph = double penalty
  - Data locality doesn't make sense in VMs

18 Why Spark on Ceph?
- Ceph is coupled with our OpenStack cluster
  - Local expertise
- HDFS is not an option
  - Problems with data locality
  - Computing and storage are not paired in our cloud

19 Spark on Ceph ideas
- Using RGWFS
- Using CephFS-Hadoop
- Using a gateway with an S3 endpoint

20 RGWFS

21 RGWFS
- Pros
  - Should integrate well with Spark through rgw://
- Cons
  - Git repo doesn't exist anymore
  - Cannot find more info: vaporware?

22 CephFS-Hadoop
- Pros
  - Transparent for Spark through hdfs://
- Cons
  - VMs have to be within the OSD network
  - Performance?
  - Hadoop.X or doc not updated?

23 S3 gateway
- Pros
  - Hadoop supports the S3 protocol
  - VMs can stay outside of the OSD network
- Cons
  - Another layer of indirection?
  - Performance depends on the number of gateways?
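
For the gateway option, one possible setup is to point Hadoop's `s3a` connector at the gateway's S3 endpoint. The sketch below only builds the relevant Spark properties; it is an assumption about how such a deployment could be configured, not a setup described in the talk. The endpoint URL, credentials, and helper names are hypothetical, and the `s3a` connector additionally requires the `hadoop-aws` module on the Spark classpath.

```python
def s3a_conf_for_gateway(
    endpoint: str, access_key: str, secret_key: str, use_ssl: bool = False
) -> dict[str, str]:
    """Spark properties directing the Hadoop s3a connector to a custom S3 endpoint."""
    return {
        "spark.hadoop.fs.s3a.endpoint": endpoint,
        "spark.hadoop.fs.s3a.access.key": access_key,
        "spark.hadoop.fs.s3a.secret.key": secret_key,
        # Address buckets as endpoint/bucket rather than bucket.endpoint,
        # which avoids needing wildcard DNS for the gateway.
        "spark.hadoop.fs.s3a.path.style.access": "true",
        "spark.hadoop.fs.s3a.connection.ssl.enabled": "true" if use_ssl else "false",
    }


def build_session(conf: dict[str, str], app_name: str = "spark-on-ceph"):
    """Create a SparkSession with the given properties (needs pyspark installed)."""
    from pyspark.sql import SparkSession  # imported lazily so the dict helper works without Spark

    builder = SparkSession.builder.appName(app_name)
    for key, value in conf.items():
        builder = builder.config(key, value)
    return builder.getOrCreate()


if __name__ == "__main__":
    # Hypothetical gateway address and placeholder credentials.
    conf = s3a_conf_for_gateway("http://rgw.example.org:7480", "ACCESS_KEY", "SECRET_KEY")
    for key, value in conf.items():
        print(key, "=", value)
```

With a session built this way, data would be addressed through `s3a://bucket/path` URLs, and every read and write goes through the gateways, which is where the performance question on the slide comes from.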

24 Which solution is best suited?
- Discussion
