The Evolution of a Data Project

Margery Flowers
5 years ago
2 The Evolution of a Data Project

3 The Evolution of a Data Project Python script

4 The Evolution of a Data Project Python script SQL on live DB

5 The Evolution of a Data Project Python script SQL on live DB SQL on reporting DB

6 The Evolution of a Data Project Python script SQL on live DB SQL on reporting DB Terrible confusion

7 The Evolution of a Data Project Python script SQL on live DB SQL on reporting DB Terrible confusion Hadoop / Spark cluster

8 What needs fixing image: Pexels

9 What needs fixing One cluster: data lock-in. image: Pexels

10 What needs fixing One cluster: data lock-in. Want cluster time? You have to wait. image: Pexels

11 What needs fixing One cluster: data lock-in. Want cluster time? You have to wait. Clusters are underutilized and EXPENSIVE image: Pexels

12 Elastic Big Data Datadog Doug Daniels Director, Engineering

13 What's our big data platform do? WHOM: Data Engineers, Data Scientists.

14 What's our big data platform do? WHOM: Data Engineers, Data Scientists. WHAT they do: App features, Statistical Analysis/ML, Ad-hoc investigation.

15 What's our big data platform do? WHOM: Data Engineers, Data Scientists. WHAT they do: App features, Statistical Analysis/ML, Ad-hoc investigation. WITH: Spark, Hadoop (Pig), Python (Luigi).

16 Exploring the platform COPIOUS TOOLING ELASTIC COMPUTE CLOUD STORAGE

18 CLOUD STORAGE

19 What do we store?

20 150 Integrations and more

21 What's time series data? timestamp; metric: system.cpu.idle; value; tags: host:i-xyz, role:cassandra,
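The four fields on this slide can be modeled as a small record. The sketch below is illustrative only: the class name `Point`, the `tag_dict` helper, and the timestamp and value shown are assumptions, not Datadog's internal format.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Point:
    """One time series sample: when, which metric, what value, which tags."""
    timestamp: int          # Unix seconds (the value used below is made up)
    metric: str             # e.g. "system.cpu.idle"
    value: float            # the value used below is made up
    tags: Tuple[str, ...]   # "key:value" strings, e.g. "host:i-xyz"

    def tag_dict(self) -> dict:
        """Split "key:value" tags into a dict for easy lookup."""
        return dict(tag.split(":", 1) for tag in self.tags)


point = Point(
    timestamp=1_450_000_000,   # illustrative only
    metric="system.cpu.idle",
    value=97.5,                # illustrative only
    tags=("host:i-xyz", "role:cassandra"),
)
print(point.tag_dict())
```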

22 We collect over a trillion of these per day and growing!

23 Where to put the petabytes? Amazon S3.

24 How data gets to S3 Kafka; GO: Buffer, Sort + Dedupe, Upload (AMAZON S3: Internal Format); LUIGI/SPARK/PIG: Partition + Sort, Write Parquet, Update Metastore (AMAZON S3: Parquet; HIVE METASTORE: Metadata)
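The first stage on this slide (buffer, sort + dedupe, upload) can be sketched in a few lines. The slide says that stage runs in Go; Python is used here only for illustration. The dedupe rule (keep the last value seen per metric and timestamp) and the `upload` callback are assumptions, since the slide does not say how duplicates are resolved or how batches are written.

```python
from typing import Callable, Iterable, List, NamedTuple


class Point(NamedTuple):
    timestamp: int
    metric: str
    value: float


def sort_and_dedupe(buffer: Iterable[Point]) -> List[Point]:
    """Sort by (metric, timestamp), keeping the last value seen per key."""
    latest = {}
    for p in buffer:
        latest[(p.metric, p.timestamp)] = p   # later arrivals overwrite earlier ones
    return [latest[key] for key in sorted(latest)]


def flush(buffer: Iterable[Point], upload: Callable[[List[Point]], None]) -> int:
    """Sort and dedupe a buffered batch, hand it to `upload`, return its size."""
    batch = sort_and_dedupe(buffer)
    upload(batch)   # hypothetical: a real uploader would write one object to S3
    return len(batch)
```

Passing `upload` as a parameter keeps the sketch runnable without any cloud credentials; a list's `append` method is enough to try it out.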

25 Isn't this a job for HDFS?

26 What we don't love about HDFS

27 What we don't love about HDFS Causes the one cluster problem

28 What we don't love about HDFS Causes the one cluster problem Come for the storage, get stuck with the servers

29 What we don't love about HDFS Causes the one cluster problem Come for the storage, get stuck with the servers No Java? No data!

30 S3 is flexible! Read data from as many clusters as you want

31 S3 is flexible! Read data from as many clusters as you want Store unlimited stuff(*) with no management * Accepting laws of physics and your credit card limit

32 S3 is flexible! Read data from as many clusters as you want Store unlimited stuff(*) with no management Rock solid: durability ( ), availability (99.99) * Accepting laws of physics and your credit card limit

33 S3 is flexible! Read data from as many clusters as you want Store unlimited stuff(*) with no management Rock solid: durability ( ), availability (99.99) Access from any programming language * Accepting laws of physics and your credit card limit

34 Decouple data and compute (BREAK THE RULES!)

35 Breaking the rules is fine. In benchmarks: S3 is ~2X slower than HDFS

37 It's not all roses

38 Listing is slooooow (A CAUTIONARY TALE)

39 How to fix slow listing Parallelize it Bigger files
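"Parallelize it" can be sketched as one listing call per prefix, run on a thread pool. This is a minimal sketch, assuming the data is laid out under independent prefixes; `list_prefix` is a hypothetical stand-in for whatever object-store listing call is in use, and the bucket contents are made up.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Iterable, List


def list_parallel(prefixes: Iterable[str],
                  list_prefix: Callable[[str], List[str]],
                  max_workers: int = 16) -> List[str]:
    """List each prefix in its own thread and merge results in prefix order."""
    prefixes = list(prefixes)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        results = pool.map(list_prefix, prefixes)   # map preserves input order
        return [key for keys in results for key in keys]


# Stand-in for a real bucket (hypothetical keys).
FAKE_BUCKET = {
    "dt=2016-01-01/": ["dt=2016-01-01/part-0.parquet"],
    "dt=2016-01-02/": ["dt=2016-01-02/part-0.parquet",
                       "dt=2016-01-02/part-1.parquet"],
}


def fake_list(prefix: str) -> List[str]:
    return FAKE_BUCKET.get(prefix, [])


keys = list_parallel(sorted(FAKE_BUCKET), fake_list)
```

"Bigger files" attacks the same problem from the other side: fewer, larger objects mean fewer keys to list in the first place.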

40 No way to quickly move data HDFS: Task write to Intermediate, atomic move to Final

41 No way to quickly move data HDFS: Task write to Intermediate, atomic move to Final S3: Task write

42 No way to quickly move data Say goodbye to speculative execution

43 No way to quickly move data Say goodbye to speculative execution Say hello to better task timeouts

44 But really: We ♥ S3 This is a great system. Data accessible from many clusters Storage is easy to manage It's a multi-language paradise up in here

45 CLOUD STORAGE ELASTIC COMPUTE

46 TRADITIONALLY One cluster to compute it all

47 Instead, we run many, many clusters New cluster for every automated job clusters at a time Median lifetime: 2hrs

48 Why so many clusters?

49 Total isolation We know what's happening and why

50 No more waiting on loaded clusters Tailor each cluster to the work you want to do Scale up when you need results faster Data scientists and data engineers don't have to wait

51 Pick the best hardware for each job == ~30% savings over general purpose hardware c3 for CPU-bound jobs r3 for memory-bound jobs m1.xlarge if you don't care (cheap!)
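The hardware rule on this slide amounts to a lookup from a job's bottleneck to an instance choice. The sketch below encodes only what the slide states; the function name and the `"cpu"` / `"memory"` labels are assumptions made for illustration.

```python
from typing import Optional

# From the slide: c3 for CPU-bound jobs, r3 for memory-bound jobs.
INSTANCE_FAMILY = {"cpu": "c3", "memory": "r3"}

# From the slide: m1.xlarge "if you don't care (cheap!)".
DEFAULT_INSTANCE = "m1.xlarge"


def pick_hardware(bottleneck: Optional[str] = None) -> str:
    """Return the instance family for a known bottleneck, else the cheap default."""
    return INSTANCE_FAMILY.get(bottleneck, DEFAULT_INSTANCE)
```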

52 100% spot-instance clusters, all the time.* * (ok, most of the time)

53 Ridiculous savings! 100% spot-instance clusters, all the time.* Disappearing clusters! * (ok, most of the time)

54 How we do spot clusters In the big data platform Bid the on-demand price, pay the spot price

55 How we do spot clusters In the big data platform Bid the on-demand price, pay the spot price Fallback to on-demand instances if you can't get spot

56 How we do spot clusters In the big data platform Bid the on-demand price, pay the spot price Fallback to on-demand instances if you can't get spot Monitor everything: jobs, clusters, spot market

57 How we do spot clusters In the big data platform Bid the on-demand price, pay the spot price Fallback to on-demand instances if you can't get spot Monitor everything: jobs, clusters, spot market Save up to 80% off the on-demand price
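The bidding rule above can be written as a small decision function. This is a sketch of the logic only, not the platform's code: `Launch`, `plan_launch`, and `savings` are hypothetical names, the prices in the usage note are made up, and a `spot_price` of `None` stands for "no spot capacity available".

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Launch:
    market: str           # "spot" or "on-demand"
    hourly_price: float   # what we expect to pay per instance-hour


def plan_launch(on_demand_price: float, spot_price: Optional[float]) -> Launch:
    """Bid the on-demand price; pay spot when it clears the bid, else fall back."""
    bid = on_demand_price
    if spot_price is not None and spot_price <= bid:
        return Launch("spot", spot_price)
    return Launch("on-demand", on_demand_price)


def savings(on_demand_price: float, paid: float) -> float:
    """Fraction saved relative to the on-demand price."""
    return 1 - paid / on_demand_price
```

With an on-demand price of 1.00 and a spot price of 0.20, `plan_launch` picks spot and `savings` comes out at roughly 0.8, which is the "up to 80%" case on the slide.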

58 Monitor the spot price Switch hardware when the market gets volatile
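One way to act on a volatile market is to compare recent price histories and prefer the calmest instance type. The sketch below is an assumption about how that could look, not a description of the platform: volatility is measured as standard deviation over mean, the 0.25 threshold is arbitrary, and the price histories are invented.

```python
from statistics import mean, pstdev
from typing import Dict, List, Optional


def volatility(prices: List[float]) -> float:
    """Coefficient of variation: standard deviation relative to the mean price."""
    return pstdev(prices) / mean(prices)


def pick_calmest(history: Dict[str, List[float]],
                 max_volatility: float = 0.25) -> Optional[str]:
    """Return the least volatile instance type under the threshold, else None."""
    scored = {itype: volatility(prices) for itype, prices in history.items()}
    calm = {itype: v for itype, v in scored.items() if v <= max_volatility}
    if not calm:
        return None   # caller might fall back to on-demand here
    return min(calm, key=calm.get)


# Hypothetical recent spot prices per instance type.
history = {
    "c3.xlarge": [0.10, 0.50, 0.10, 0.50],
    "r3.xlarge": [0.20, 0.21, 0.20, 0.21],
}
choice = pick_calmest(history)
```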

59 We like this strategy a lot! No waiting for the cluster you need No waste from hardware sitting idle Spot clusters are affordable enough to use everywhere Versus one cluster: Cluster is oversubscribed; everyone waiting in line to do their work Lots of expensive hardware sits idle when everyone's gone

60 What's challenging, though?

61 Many things that disappear.

62 COPIOUS TOOLING ELASTIC COMPUTE CLOUD STORAGE

63 Platform as a service Jobs, Clusters, Schedules, Users, Code, Monitoring, Logs, and more CLI Web and APIs

64 Big Data Platform Architecture DATA Amazon S3

65 Big Data Platform Architecture CLUSTER EMR DATA Amazon S3

66 Big Data Platform Architecture WORKER Pig Workers Spark Workers Luigi Workers CLUSTER EMR DATA Amazon S3

67 Big Data Platform Architecture STORAGE Metadata DB Queueing Logs WORKER Pig Workers Spark Workers Luigi Workers CLUSTER EMR DATA Amazon S3

68 Big Data Platform Architecture WEB Web API STORAGE Metadata DB Queueing Logs WORKER Pig Workers Spark Workers Luigi Workers CLUSTER EMR DATA Amazon S3

69 Big Data Platform Architecture USER CLI API Clients Job Scheduler WEB Web API STORAGE Metadata DB Queueing Logs WORKER Pig Workers Spark Workers Luigi Workers CLUSTER EMR DATA Amazon S3

70 Big Data Platform Architecture USER CLI API Clients Job Scheduler Datadog Monitoring WEB Web API STORAGE Metadata DB Queueing Logs WORKER Pig Workers Spark Workers Luigi Workers CLUSTER EMR DATA Amazon S3

71 How to find the right cluster when they disappear?

72 Cluster tagging for discovery #anomaly-detection #monitor-report
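Discovery by tag means looking clusters up by what they are for rather than by an ID that changes on every launch. A minimal sketch, assuming an in-memory list of cluster records; the `Cluster` class, its fields, and the state names are hypothetical and do not reflect any particular API.

```python
from dataclasses import dataclass, field
from typing import Iterable, List, Set


@dataclass
class Cluster:
    cluster_id: str
    state: str                          # e.g. "RUNNING" or "TERMINATED" (assumed names)
    tags: Set[str] = field(default_factory=set)


def find_clusters(clusters: Iterable[Cluster], tag: str,
                  state: str = "RUNNING") -> List[Cluster]:
    """Return clusters carrying `tag` that are currently in `state`."""
    return [c for c in clusters if tag in c.tags and c.state == state]


# Hypothetical inventory; IDs change per launch, tags do not.
inventory = [
    Cluster("j-001", "TERMINATED", {"anomaly-detection"}),
    Cluster("j-002", "RUNNING", {"anomaly-detection"}),
    Cluster("j-003", "RUNNING", {"monitor-report"}),
]
running = find_clusters(inventory, "anomaly-detection")
```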

73 How to monitor many disappearing clusters?

74 Dynamic Monitoring on Tags Dashboards Monitors cluster_tags: anomaly-detection anomaly-detection

75 How to debug problems when the cluster's gone?

76 Debugging In a Post-Cluster World

77 Debugging In a Post-Cluster World Send all logs to S3: HDFS, YARN, Pig, Spark

78 Debugging In a Post-Cluster World Send all logs to S3: HDFS, YARN, Pig, Spark Visualize the pipeline: Lipstick for Pig, Spark History Server, Luigi task flow

79 Debugging In a Post-Cluster World Send all logs to S3: HDFS, YARN, Pig, Spark Visualize the pipeline: Lipstick for Pig, Spark History Server, Luigi task flow Preserve historical monitoring data: Keep history, by tag, after the cluster disappears

80 How to handle certain cluster failure in your jobs?

81 Luigi: design for failure. Automatic cleanup and restart A B

82 Luigi: design for failure. Automatic cleanup and restart B

83 Luigi: design for failure. Automatic cleanup and restart
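The idea behind "automatic cleanup and restart" can be shown without Luigi itself. The sketch below uses only the standard library and is not Luigi's API: a task counts as complete when its output exists, output is published with an atomic rename, partial output is removed on failure, and a restart skips whatever already finished. All names and paths are hypothetical.

```python
import os
import tempfile


class Task:
    """A unit of work that is complete exactly when its output file exists."""

    def __init__(self, name, output_path, requires=()):
        self.name = name
        self.output_path = output_path
        self.requires = list(requires)

    def complete(self):
        return os.path.exists(self.output_path)

    def run(self, fail=False):
        tmp_path = self.output_path + ".tmp"
        try:
            with open(tmp_path, "w") as f:
                f.write(self.name)
                if fail:
                    raise RuntimeError("cluster disappeared")  # simulated failure
            os.replace(tmp_path, self.output_path)  # publish only finished output
        finally:
            if os.path.exists(tmp_path):
                os.remove(tmp_path)                 # clean up partial output


def build(task, ran):
    """Run incomplete dependencies first, then the task; skip finished work."""
    for dep in task.requires:
        build(dep, ran)
    if not task.complete():
        task.run()
        ran.append(task.name)


workdir = tempfile.mkdtemp()
a = Task("A", os.path.join(workdir, "a.out"))
b = Task("B", os.path.join(workdir, "b.out"), requires=[a])

a.run()                   # A succeeds
try:
    b.run(fail=True)      # B dies along with its cluster
except RuntimeError:
    pass

ran = []
build(b, ran)             # restart: A is skipped, only B reruns
print(ran)
```

Because completion is derived from outputs rather than from in-memory state, the restart can happen on a brand-new cluster.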

85 COPIOUS TOOLING ELASTIC COMPUTE CLOUD STORAGE

86 Recommendations for Cloud Big Data

87 Recommendations for Cloud Big Data Use S3 for permanent data, not HDFS

88 Recommendations for Cloud Big Data Use S3 for permanent data, not HDFS Start from EMR if building yourself

89 Recommendations for Cloud Big Data Use S3 for permanent data, not HDFS Start from EMR if building yourself Look into a PaaS: Netflix Genie, Qubole, Databricks

90 Recommendations for Cloud Big Data Use S3 for permanent data, not HDFS Start from EMR if building yourself Look into a PaaS: Netflix Genie, Qubole, Databricks Tag your clusters for dynamic monitoring

91 Recommendations for Cloud Big Data Use S3 for permanent data, not HDFS Start from EMR if building yourself Look into a PaaS: Netflix Genie, Qubole, Databricks Tag your clusters for dynamic monitoring Design for failure with a workflow tool (Luigi, Airflow)

92 Thanks! Want to work with us on Spark, Hadoop, Kafka, Parquet, and more? jobs.datadoghq.com DM or
