CloudExpo November 2017 Tomer Levi



2 About me
Full Stack Engineer @ Intel's Advanced Analytics group, part of the Artificial Intelligence unit at Intel.
Responsible for (1) radical improvement of critical processes and (2) help building AI competitive

3 Intel pharma analytics platform
An edge-to-cloud artificial intelligence (AI) solution that helps improve clinical trial outcomes, offering:
- Sensors & mobile
- Big data
- Analytics
Value to pharma:
- Reduce cost
- Increase objectivity
- Improve quality
- Improve compliance

4 Data & Technology
Hundreds of machines in
Technologies: Hadoop, Spark SQL and Streaming (Kafka & Kinesis), HBase, Athena...
Data: 100s of patients, 5-30 days, 50 Hz = 1B records / day
DS / Researchers / Trial Admins
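A quick back-of-the-envelope check of the "50 Hz = 1B records / day" figure. This is only a sketch: it assumes one record per sample and continuously sampled streams, neither of which the slide states, and the function names are illustrative.

```python
SAMPLE_RATE_HZ = 50                       # from the slide
SECONDS_PER_DAY = 24 * 60 * 60            # 86,400
TARGET_RECORDS_PER_DAY = 1_000_000_000    # "1B records / day" from the slide


def records_per_stream_per_day(rate_hz: int = SAMPLE_RATE_HZ) -> int:
    """Records produced by one continuously sampled stream in a day
    (assumption: one record per sample)."""
    return rate_hz * SECONDS_PER_DAY


def streams_needed(target: int = TARGET_RECORDS_PER_DAY,
                   rate_hz: int = SAMPLE_RATE_HZ) -> float:
    """Number of concurrent streams needed to reach the target volume."""
    return target / records_per_stream_per_day(rate_hz)


if __name__ == "__main__":
    print(records_per_stream_per_day())   # 4320000
    print(round(streams_needed(), 1))     # 231.5
```

Under these assumptions, roughly 230 concurrent 50 Hz streams produce a billion records a day, which is in line with "100s of patients".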

5 Agenda
- Background
- Storage types
- Databases
- Files
- Query Engines
- Summary & Questions

6 Typical Big Data Platform Architecture (diagram labels: Streaming, Device, Dashboard, Web Servers, ETL, SQL)

7 HL Platform Architecture (diagram labels: Streaming, Device, Dashboard, Web Servers, ETL, SQL, HBase schema limitations, New DS Algorithm, Trial management)

9 Goal: add interactivity
- Get results in seconds
- Minimal production performance impact
- Flexible queries
- Data science and algorithms

10 Goal: add interactivity
- Query results in seconds: query engine
- Store data in a queryable fashion: DB, files

11 Storage
The most important part of an interactive big-data platform is storage.
Options: databases, files

12 Storage: how to decide which storage type to use?
Databases:
- Sub-second queries
- Easier record version tracking
- Different levels of security
- Integrated query engine (*some databases)
Files:
- No need to manage/monitor a cluster
- Separating storage from compute: scalability
- Easier to back up and share
- Harder to update


14 Files

15 Columnar vs. Text files
SELECT AVG(num_requests) FROM daily_requests;
In CSV, reading the 6th column requires reading the entire row.

16 Columnar vs. Text files
Example columns: id, first_name, last_name, ip_address, num_requests
Columnar file format advantages:
- Much faster queries (examples later)
- Better encoding and compression, because each column contains a single data type
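The difference can be illustrated with a toy example. This is a minimal sketch and not a real columnar format such as Parquet or ORC: the sample rows are invented, and the "column store" is just a dict of lists, used to show that an aggregate over one column only needs that column once data is laid out column-wise.

```python
import csv
import io

# Invented sample data using the column names from the slide.
ROWS_CSV = """id,first_name,last_name,ip_address,num_requests
1,Ada,Lovelace,10.0.0.1,12
2,Alan,Turing,10.0.0.2,30
3,Grace,Hopper,10.0.0.3,18
"""


def avg_from_rows(text: str, column: str) -> float:
    """Row-oriented read: every field of every row is parsed,
    even though only one column is needed."""
    reader = csv.DictReader(io.StringIO(text))
    values = [int(row[column]) for row in reader]
    return sum(values) / len(values)


def to_columns(text: str) -> dict:
    """One-time conversion to a column-oriented layout (dict of lists)."""
    reader = csv.DictReader(io.StringIO(text))
    columns = {name: [] for name in reader.fieldnames}
    for row in reader:
        for name, value in row.items():
            columns[name].append(value)
    return columns


def avg_from_columns(columns: dict, column: str) -> float:
    """Column-oriented read: only the requested column is touched."""
    values = [int(v) for v in columns[column]]
    return sum(values) / len(values)


if __name__ == "__main__":
    print(avg_from_rows(ROWS_CSV, "num_requests"))                   # 20.0
    print(avg_from_columns(to_columns(ROWS_CSV), "num_requests"))    # 20.0
```

Both functions return the same average; the point is how much data each one has to touch to get there.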


17 Format Wars
Dataset: 1000 columns, 4 million rows
The queries run to measure read speed were of the form: SELECT COUNT(*) FROM TABLE WHERE
- Query 1 includes no additional conditions.
- Query 2 includes 5 conditions.
- Query 3 includes 10 conditions.
- Query 4 includes 20 conditions.
Source:
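A sketch of how the four benchmark queries might be generated. The slide does not show the actual conditions, so the table name, the column names (`col_0`, `col_1`, ...), the `> 0` predicate and the use of `AND` are all placeholders chosen for illustration.

```python
def build_count_query(table: str, num_conditions: int) -> str:
    """Build a COUNT(*) query with the given number of WHERE conditions.

    Column names and predicates are hypothetical placeholders.
    """
    query = f"SELECT COUNT(*) FROM {table}"
    if num_conditions == 0:
        return query
    conditions = [f"col_{i} > 0" for i in range(num_conditions)]
    return query + " WHERE " + " AND ".join(conditions)


# Condition counts taken from the slide: 0, 5, 10 and 20.
BENCHMARK_QUERIES = {
    f"Query {n + 1}": build_count_query("benchmark_table", count)
    for n, count in enumerate([0, 5, 10, 20])
}

if __name__ == "__main__":
    for name, sql in BENCHMARK_QUERIES.items():
        print(name, "->", sql[:80])
```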

18 Adding Interactivity - storage (diagram labels: Device, Dashboard, Web Servers)

19 Adding Interactivity? (diagram labels: Device, Dashboard, Web Servers)

20 Query engines

21 Query Engine
Basically, a SQL engine. It translates SQL into MapReduce/DAG jobs over data of various sizes and formats, either by using an existing framework (such as Hadoop MapReduce or Spark) or through an independent implementation. It separates storage from compute, with an emphasis on elasticity and scalability.

22 AWS Athena (based on )
- Interactive query service that makes it easy to analyze data in Amazon S3 using standard SQL.
- Athena is serverless, so there is no infrastructure to manage.
- Cost: you pay only for the queries that you run.
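As a rough illustration of what "no infrastructure to manage" looks like from code, here is a minimal sketch of submitting a query through the boto3 Athena client's `start_query_execution` call. The database name, SQL text and S3 output location are hypothetical, and the helper names are not from the talk; the client is passed in so the request can be inspected without AWS credentials.

```python
def athena_query_request(sql: str, database: str, output_location: str) -> dict:
    """Keyword arguments for the Athena client's start_query_execution call."""
    return {
        "QueryString": sql,
        "QueryExecutionContext": {"Database": database},
        # Athena writes query results to this S3 location.
        "ResultConfiguration": {"OutputLocation": output_location},
    }


def run_query(client, sql: str, database: str, output_location: str) -> str:
    """Submit the query and return its execution id.

    `client` is expected to behave like boto3.client("athena").
    """
    response = client.start_query_execution(
        **athena_query_request(sql, database, output_location)
    )
    return response["QueryExecutionId"]


# Example usage (requires boto3 and AWS credentials; names are hypothetical):
#   import boto3
#   query_id = run_query(
#       boto3.client("athena"),
#       "SELECT AVG(num_requests) FROM daily_requests",
#       "example_db",
#       "s3://example-bucket/athena-results/",
#   )
```

The call is asynchronous: it returns an execution id, and the results are fetched separately once the query has finished.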

23 Notebooks - Jupyter / Zeppelin
- Web-based
- Data-driven interactive analytics
- Easy collaboration
- Support Python, Spark and many other languages
Source:


25 Jupyter + Spark on AWS EMR (diagram labels: Web browser, AWS EMR, http, Notebook Server, Kernel, Cluster, spark-submit, Read data and save results)
Getting started:

26 Adding Interactivity (diagram labels: Jupyter (on EMR), Athena, Device, Dashboard, Web Servers)

27 Alternatives

28 Key Takeaways
- If you choose to use files, use partitions.
- Don't be afraid of duplicating the data (storage is cheap, computation isn't).
- No single solution fits all; do your own benchmarks.
- Store your files in the cloud: S3 + Athena or Jupyter + EMR.
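To make the "use partitions" takeaway concrete, here is a small sketch of a Hive-style `key=value` directory layout, which engines such as Athena and Spark can use to skip files that a query does not need. The bucket name and the choice of partition keys (`trial_id`, `year`, `month`, `day`) are assumptions for illustration, not the layout used in the talk.

```python
from datetime import date


def partition_prefix(base: str, trial_id: str, day: date) -> str:
    """Build a Hive-style partition prefix for one trial and one day.

    Partition keys are illustrative; pick keys that match your most
    common query filters.
    """
    return (
        f"{base.rstrip('/')}/"
        f"trial_id={trial_id}/"
        f"year={day.year:04d}/month={day.month:02d}/day={day.day:02d}/"
    )


if __name__ == "__main__":
    print(partition_prefix("s3://example-bucket/sensor-data/", "t1", date(2017, 11, 1)))
    # s3://example-bucket/sensor-data/trial_id=t1/year=2017/month=11/day=01/
```

A query that filters on `trial_id` and a date range then only has to read the files under the matching prefixes.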

29 Summary

30 Questions
