Fast Big Data Analytics with Spark on Tachyon


1 Fast Big Data Analytics with Spark on Tachyon
Shaoshan Liu

2 Fun Facts: Tachyon
A tachyon is a hypothetical particle that always moves faster than light. The word comes from the Greek ταχύς (tachys), meaning "swift, quick, fast, rapid", and was coined in 1967 by Gerald Feinberg. The complementary particle types, both of which exist, are the luxon (always moving at the speed of light) and the bradyon (always moving slower than light). In the movie K-PAX, Kevin Spacey's character claims to have traveled to Earth at tachyon speeds.

3 Fun Facts: Baidu
One of the top tech companies in the world, and we have an office here!

4 Serious Fact: When Tachyon Meets Baidu
- 30X acceleration of our big data analytics workload
- ~100 nodes in deployment, > 1 PB storage space

5 Agenda
- Motivation: Why Tachyon?
- Tachyon Production Usage at Baidu
- Problems Encountered in Practice
- Advanced Features
- Performance Deep Dive
- Future Works

6 Motivation: Why Tachyon?

7 Interactive Query System
Example:
- John is a PM who needs to keep track of the top queries submitted to Baidu every day
- Based on the top queries of the day, he performs additional analysis
- But John is very frustrated that each query takes tens of minutes to finish
Requirements:
- Manage PBs of data
- Finish 95% of queries within 30 seconds


8 Baidu Ad-hoc Query Architecture
Diagram labels: Product Group 1, Product Group 2, Product Group 3, Query UI, Query Engine, Data Warehouse
Sample query sequence:

  SELECT event_query, COUNT(event_query) as cnt
  FROM data_warehouse
  WHERE event_day=" AND event_action="query_click"
  GROUP BY event_query
  ORDER BY cnt DESC

  SELECT event_province, COUNT(event_query) as cnt
  FROM data_warehouse
  WHERE event_day=" AND event_action="query_click" AND event_query="baidu stock"
  GROUP BY event_province
  ORDER BY cnt DESC

9 Baidu Ad-hoc Query Architecture
Diagram labels: Hive on MR (Hive, MapReduce); 4X improvement, but not good enough!; Spark SQL; Compute Center; Data Center; Data Warehouse; BFS

10 A Cache Layer Is Needed!!
Three requirements:
- High performance
- Reliable
- Provides enough capacity

11 Transparent Cache Layer
Problem:
- Data nodes and compute nodes do not reside in the same data center, so data access latency may be too high
- This can be a major performance problem for ad-hoc query workloads in particular
Solution: use Tachyon as a transparent cache layer
- Cold query: read from the remote storage node
- Warm/hot query: read from Tachyon directly
Initially at Baidu, 50 machines were deployed with Spark and Tachyon
- Mostly serving Spark SQL ad-hoc queries
- Tachyon as a transparent cache layer

12 Architecture
Diagram labels: Spark Task, Spark mem, block 1 to block 4, Compute Center, Tachyon, HDFS, in-memory, disk, Data Center, Baidu File System (BFS)
- Read from remote data center: ~100 to 150 sec
- Read from Tachyon remote node: 10 to 15 sec
- Read from Tachyon local node: ~5 sec
Tachyon brings a 30X speed-up!
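The latency figures on this slide can be turned into speed-up ranges with simple division. The sketch below is only a back-of-the-envelope check, not a benchmark: the constant and function names are hypothetical, and the local-node latency is treated as a single point (5 sec) because the slide gives no range for it.

```python
# Latencies in seconds as (low, high) ranges, taken from the slide.
REMOTE_DATA_CENTER = (100.0, 150.0)
TACHYON_REMOTE_NODE = (10.0, 15.0)
TACHYON_LOCAL_NODE = (5.0, 5.0)  # assumption: "~5 sec" treated as exactly 5


def speedup_range(baseline, cached):
    """Return the (min, max) speed-up of `cached` reads over `baseline` reads."""
    worst = baseline[0] / cached[1]  # fastest baseline vs. slowest cached read
    best = baseline[1] / cached[0]   # slowest baseline vs. fastest cached read
    return (worst, best)


local_speedup = speedup_range(REMOTE_DATA_CENTER, TACHYON_LOCAL_NODE)
remote_speedup = speedup_range(REMOTE_DATA_CENTER, TACHYON_REMOTE_NODE)
print("local node:", local_speedup)
print("remote node:", remote_speedup)
```

Under these assumptions the 30X figure corresponds to the upper end of the local-node range (150 sec / 5 sec), while reads served from a remote Tachyon node gain roughly 7X to 15X.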

13 Tachyon Production Usage at Baidu

14 Architecture: Interactive Query Engine
Diagram labels: Query UI, Operation Manager, View Manager, Cache Meta, Spark, Tachyon, Data Warehouse

15 Architecture: Interactive Query Engine
Operation Manager:
- Accepts queries from the query UI
- Parses and optimizes queries using Spark SQL
- Checks whether the requested data is already cached: if so, reads from Tachyon
- Otherwise, initiates a Spark job to read from the data warehouse
View Manager:
- Manages view metadata
- Handles requests from the Operation Manager: on a cache miss, builds new views by reading from the data warehouse and then writing to Tachyon
Tachyon:
- View cache: instead of caching raw blocks, we cache views
- View: <table name, partition key, attributes, data>
Data Warehouse:
- HDFS-based data warehouse that stores all raw data
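The hit/miss flow between the Operation Manager, the View Manager, and Tachyon can be sketched as a read-through cache keyed by the view fields listed above. This is an illustrative model only: `ViewKey`, `ViewManager`, and the loader callback are hypothetical names, and a plain dictionary stands in for Tachyon.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass(frozen=True)
class ViewKey:
    """Identifies a cached view: <table name, partition key, attributes>."""
    table: str
    partition: str
    attributes: Tuple[str, ...]


class ViewManager:
    """Read-through view cache; a dict stands in for Tachyon in this sketch."""

    def __init__(self, load_from_warehouse: Callable[[ViewKey], List[tuple]]):
        self._load = load_from_warehouse
        self._cache: Dict[ViewKey, List[tuple]] = {}
        self.hits = 0
        self.misses = 0

    def get(self, key: ViewKey) -> List[tuple]:
        if key in self._cache:
            # Hot query: serve the view straight from the cache.
            self.hits += 1
            return self._cache[key]
        # Cold query: build the view from the warehouse, then cache it.
        self.misses += 1
        rows = self._load(key)
        self._cache[key] = rows
        return rows
```

The first request for a view pays the warehouse read; every later request for the same key is served from the cache, which matches the cold and hot query paths on the following slides.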

16 Query: Check Cache
Diagram labels: Query UI, Operation Manager, View Manager, Cache Meta, Spark, Tachyon, Data Warehouse

17 Hot Query: Cache Hit
Diagram labels: Query UI, Operation Manager, View Manager, Cache Meta, Spark, Tachyon, Data Warehouse

18 Cold Query: Cache Miss
Diagram labels: Query UI, Operation Manager, View Manager, Cache Meta, Spark, Tachyon, Data Warehouse

19 Examples

  SELECT a.key * (2 + 3), b.value
  FROM T a JOIN T b
  ON a.key=b.key AND a.key>3

  == Physical Plan ==
  Project [(CAST(key#27, DoubleType) * 5.0) AS c_0#24,value#30]
   BroadcastHashJoin [key#27], [key#29], BuildLeft
    Filter (CAST(key#27, DoubleType) > 3.0)
     HiveTableScan [key#27], (MetastoreRelation default, T, Some(a)), None
    HiveTableScan [key#29,value#30], (MetastoreRelation default, T, Some(b)), None

Once we have the Spark SQL physical plan, we parse the HiveTableScan nodes and then determine whether the requested view is in the cache
- Cache hit: pull data directly from Tachyon
- Cache miss: get data from remote data storage
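Extracting the scanned tables from a plan like the one above can be sketched with a regular expression over the plan text. This is an assumption about how such parsing might be done, not the method used at Baidu; the pattern matches only the `HiveTableScan` line format shown on the slide, which varies between Spark versions, and `table_scans` is a hypothetical helper.

```python
import re

# Matches lines such as:
#   HiveTableScan [key#27], (MetastoreRelation default, T, Some(a)), None
SCAN_RE = re.compile(
    r"HiveTableScan \[([^\]]*)\], "
    r"\(MetastoreRelation (\w+), (\w+), (?:Some\((\w+)\)|None)\)"
)

PLAN = """\
Project [(CAST(key#27, DoubleType) * 5.0) AS c_0#24,value#30]
 BroadcastHashJoin [key#27], [key#29], BuildLeft
  Filter (CAST(key#27, DoubleType) > 3.0)
   HiveTableScan [key#27], (MetastoreRelation default, T, Some(a)), None
  HiveTableScan [key#29,value#30], (MetastoreRelation default, T, Some(b)), None
"""


def table_scans(plan):
    """Return one dict per HiveTableScan node found in the plan text."""
    scans = []
    for match in SCAN_RE.finditer(plan):
        # Strip the "#<id>" suffix that Spark SQL appends to attribute names.
        attributes = [
            attr.split("#")[0].strip()
            for attr in match.group(1).split(",")
            if attr.strip()
        ]
        scans.append({
            "database": match.group(2),
            "table": match.group(3),
            "alias": match.group(4),
            "attributes": attributes,
        })
    return scans


for scan in table_scans(PLAN):
    print(scan)
```

Each returned entry carries the table name and attribute list, which is the information a cache lookup for a view would need.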

20 Caching Strategies
On-demand (default): triggered by a cold cache
- Query parsing and optimization using Spark SQL
- Checks whether the requested data is already cached: if so, reads from Tachyon
- Otherwise, initiates a Spark job to read from the data warehouse
Prefetch (new feature for Tachyon?):
- Current strategy: analyze the prefetch patterns of the past month, and then use a static strategy
- Based on user behavior, prefetch data before users actually access it
Finer details:
- Which storage tier should we put the data into?
- Do we actively delete obsolete blocks or just let them phase out?
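One way to read "analyze the patterns of the past month, then use a static strategy" is to count how often each view was accessed in a trailing window and prefetch the most frequent ones. The sketch below is an assumption about what such a strategy could look like; the function name, the 30-day window, and the log format of `(date, view_name)` pairs are all hypothetical.

```python
from collections import Counter
from datetime import date, timedelta


def static_prefetch_list(access_log, today, window_days=30, top_n=2):
    """Return the `top_n` most accessed views in the trailing window.

    `access_log` is an iterable of (day, view_name) pairs.
    """
    cutoff = today - timedelta(days=window_days)
    counts = Counter(
        view for day, view in access_log if cutoff <= day < today
    )
    return [view for view, _ in counts.most_common(top_n)]


# Hypothetical access log: views "a" and "b" are popular in the last month,
# while "c" was popular only long before the window.
log = [
    (date(2015, 6, 1), "a"), (date(2015, 6, 2), "a"), (date(2015, 6, 3), "a"),
    (date(2015, 6, 2), "b"), (date(2015, 6, 3), "b"),
    (date(2015, 6, 3), "c"),
    (date(2015, 1, 1), "c"), (date(2015, 1, 2), "c"), (date(2015, 1, 3), "c"),
]
prefetch = static_prefetch_list(log, today=date(2015, 6, 10))
print(prefetch)
```

Accesses older than the window are ignored, so a view that was popular months ago does not crowd out currently hot views.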

21 Problems Encountered in Practice

22 Problem 1: Failed to Cache Blocks
Problem: in our experiments, we observed that blocks were not being cached by Tachyon; the same query kept fetching blocks from the storage node instead of from Tachyon

23 Problem 1: Failed to Cache Blocks
- Root problem: Tachyon only caches a block if the whole block has been read
- Solution: read the whole block if you want to cache it
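The behavior described here can be modeled with a toy block store that caches a block only when a read covers the whole block. This is a simplified model of the rule stated on the slide, not Tachyon's API; `BlockStore` and its fields are hypothetical.

```python
class BlockStore:
    """Toy model: a block is cached only after it has been read in full."""

    def __init__(self, remote_blocks):
        self.remote = remote_blocks  # block_id -> bytes on the storage node
        self.cached = {}             # block_id -> bytes held in the cache
        self.remote_reads = 0

    def read(self, block_id, length=None):
        """Read `length` bytes of a block, or the whole block if None."""
        if block_id in self.cached:
            data = self.cached[block_id]
            return data if length is None else data[:length]
        # Cache miss: go to the remote storage node.
        self.remote_reads += 1
        data = self.remote[block_id]
        if length is None or length >= len(data):
            # Only a full read populates the cache.
            self.cached[block_id] = data
            return data
        return data[:length]


store = BlockStore({1: b"x" * 8})
store.read(1, 4)  # partial read: not cached
store.read(1, 4)  # goes to remote storage again
store.read(1)     # full read: now cached
store.read(1, 4)  # served from the cache
print(store.remote_reads)
```

In this model, repeated partial reads keep hitting remote storage, which mirrors the symptom on the previous slide; a single full read is enough to make later reads cache hits.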

24 Problem 2: Locality Problem
- DAGScheduler: when the DAGScheduler schedules tasks, it places them on the workers that hold the data, so that there is no network traffic and performance is high
- Also, the master thinks that the data is local (no remote fetch needed)

25 Problem 2: Reality
- However, we do observe heavy network traffic
- Impact: we expected a Tachyon cache hit rate of 100%, but ended up with a 33% cache hit rate
- Root problem: we were using a very old InputFormat
- Solution: update your InputFormat

26 Problem 3: SIGBUS

27 Problem 3: SIGBUS
- Root problem: bug in the Java 1.6 CompressedOops feature
- Solution: disable CompressedOops or update your Java version

28 Problem 4: Connection reset by peer
- Root problem: not enough memory in the Java heap
- Solution: tune your GC parameters

29 None of the Problems Is a Tachyon Problem!
- Problem 1: need to understand the design of Tachyon first
- Problem 2: HDFS InputFormat problem
- Problem 3: Java version problem
- Problem 4: memory budget / GC problem

30 Advanced Features

31 Not Enough Cache Space?
Problem: there is not enough cache space if we cache everything in memory
- E.g. with machines that have 60 GB of memory each, 30 GB given to Spark and 20 GB given to Tachyon, 10 such machines would only give us 200 GB of cache space
Solution: what if we extend Tachyon to use other storage media in addition to memory?
Tiered storage:
- Level 1: Memory
- Level 2: SSD
- Level 3: HDD

32 Tiered Storage Design: Write Path

33 Tiered Storage Design: Read Path

34 Tiered Storage Deployment
Currently use two layers: MEM and HDD
- MEM: 16 GB per machine (will expand when we get more memory)
- HDD: 10 disks with 2 TB each (currently use 6 of them, can expand)
- > 100 machines: over 2 PB storage space
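The capacity figures on the slides can be checked with a little arithmetic. The sketch below assumes decimal units (1 TB = 1000 GB) and treats "> 100 machines" as exactly 100; the function and constant names are hypothetical.

```python
GB = 1
TB = 1000 * GB  # assumption: decimal units
PB = 1000 * TB


def cluster_capacity_gb(machines, mem_gb, disks, disk_tb):
    """Total cache capacity in GB across memory and HDD tiers."""
    per_machine = mem_gb * GB + disks * disk_tb * TB
    return machines * per_machine


# Memory-only example from the "Not Enough Cache Space?" slide.
memory_only = cluster_capacity_gb(machines=10, mem_gb=20, disks=0, disk_tb=0)

# Tiered deployment: 6 disks currently in use vs. all 10 disks.
six_disks = cluster_capacity_gb(machines=100, mem_gb=16, disks=6, disk_tb=2)
ten_disks = cluster_capacity_gb(machines=100, mem_gb=16, disks=10, disk_tb=2)

print(memory_only, "GB")
print(six_disks / PB, "PB")
print(ten_disks / PB, "PB")
```

Under these assumptions, six disks per machine give about 1.2 PB and all ten disks give about 2 PB, compared with 200 GB for the memory-only example.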

35 A Cache Layer Is Needed!!
Three requirements:
- High performance
- Reliable
- Provides enough capacity
Also, with its tiered storage feature, Tachyon could provide almost infinite storage space

36 Performance Deep Dive


37 Overall Performance
Setup:
1. Use MR to query 6 TB of data
2. Use Spark to query 6 TB of data
3. Use Spark + Tachyon to query 6 TB of data
Chart: MR (sec), Spark (sec), Spark + Tachyon (sec)
Results:
1. Spark + Tachyon achieves a 50-fold speedup compared to MR

38 Tiered Storage Performance
Charts: Read Throughput (MB/s) and Write Throughput (MB/s), each comparing "original" and "hierarchy"

39 Write-Optimized Allocation
- Instead of writing to the top layer, write to the first layer that has space available
- Writes go through a mapped file, so the content should still be in the mapped file if it is read immediately after the write
- If the read does not happen immediately after the write, then it does not matter anyway
- Not suitable for all situations, so it is configurable
- With two layers, we see a 42% improvement in write latency on average
Chart: Latency (ms), No Change (ms) vs. With Change (ms)
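The allocation rule "write to the first layer that has space available" can be sketched as a scan over the tiers from top to bottom. This is a minimal model of the rule as stated on the slide, not Tachyon's allocator; `Tier`, `pick_tier_write_optimized`, and `write` are hypothetical names, and sizes are abstract units.

```python
class Tier:
    """A storage layer with a fixed capacity, in abstract size units."""

    def __init__(self, name, capacity):
        self.name = name
        self.capacity = capacity
        self.used = 0

    def has_space(self, size):
        return self.used + size <= self.capacity


def pick_tier_write_optimized(tiers, size):
    """Return the first tier (top to bottom) that can hold `size`, else None."""
    for tier in tiers:
        if tier.has_space(size):
            return tier
    return None


def write(tiers, size):
    """Place a block of `size` units and return the name of the chosen tier."""
    tier = pick_tier_write_optimized(tiers, size)
    if tier is None:
        raise RuntimeError("no tier has enough free space")
    tier.used += size
    return tier.name


tiers = [Tier("MEM", 10), Tier("HDD", 100)]
placements = [write(tiers, 8), write(tiers, 8), write(tiers, 2)]
print(placements)
```

The second write goes straight to HDD because MEM is nearly full, avoiding any eviction on the write path, while the small third write still fits in MEM.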


40 Micro-Benchmark
Chart: elapsed time (sec) for tiered storage 1 disk, tiered storage 6 disks, tiered storage 6 disks write optimization, and OS paging
Setup:
1. Tiered storage with 1 disk in HDD layer
2. Tiered storage with 6 disks in HDD layer
3. Tiered storage with 6 disks in HDD layer, and with write-optimization
4. OS paging/swapping on
Conclusions:
1. The current tiered storage implementation cannot beat OS paging
2. A better write mechanism is needed; a garbage collection mechanism would be even better

41 About Debugging: You are as good as your tools!
New feature for Tachyon?

42 Debugging: Master
Three logs generated on the master side:
- Master.log: normal logging info
- Master.out: mostly GC / JVM info
- User.log: rarely used

43 Debugging: Worker
Three logs generated on the worker side:
- Worker.log: normal logging info
- Worker.out: mostly GC / JVM info
- User.log: rarely used

44 Debugging: Client
- The client is built into the Spark executor
- Just check the Spark app stdout log for more information

45 Future Works

46 Welcome to Contribute
- Use of Tachyon as a parameter server (machine learning)
- RESTful API support for Tachyon
- Garbage collection feature
- Cache replacement policy
  - Currently LRU by default
  - Better policies may improve the hit rate in different scenarios
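For reference, the LRU policy named above evicts the entry that was used least recently once the cache is full. The sketch below is a generic textbook LRU built on Python's `OrderedDict`, not Tachyon's implementation; `LRUCache` is a hypothetical name.

```python
from collections import OrderedDict


class LRUCache:
    """Fixed-capacity cache that evicts the least recently used entry."""

    def __init__(self, capacity):
        self.capacity = capacity
        self._items = OrderedDict()

    def get(self, key):
        if key not in self._items:
            return None
        # Mark the entry as most recently used.
        self._items.move_to_end(key)
        return self._items[key]

    def put(self, key, value):
        if key in self._items:
            self._items.move_to_end(key)
        self._items[key] = value
        if len(self._items) > self.capacity:
            # Drop the entry at the front, which is the least recently used.
            self._items.popitem(last=False)


cache = LRUCache(capacity=2)
cache.put("a", 1)
cache.put("b", 2)
cache.get("a")     # "a" becomes most recently used
cache.put("c", 3)  # evicts "b"
print(cache.get("a"), cache.get("b"), cache.get("c"))
```

Alternative policies would change only which entry is dropped when the cache is full, which is why the replacement policy is a natural extension point.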

47 Make your system fly at tachyon speed

UNIFY DATA AT MEMORY SPEED. Haoyuan (HY) Li, Alluxio Inc. VAULT Conference 2017

Haoyuan (HY) Li, CEO @ Alluxio Inc. March 2017
HISTORY
- Started at UC Berkeley AMPLab in summer 2012
- Originally named Tachyon
- Rebranded to Alluxio in