Parallel In-situ Data Processing Techniques

1 Parallel In-situ Data Processing Techniques Florin Rusu, Yu Cheng University of California, Merced

2 Outline
  Background
  SCANRAW Operator
  Speculative Loading
  Evaluation

3 Astronomy: FITS Format
  Sloan Digital Sky Survey
  PhotoObjAll (Star, Galaxy, Sky), 509 columns
  Source:

4 Sky Server Query Examples
  Colors of a 1% random sample of objects from all fields:
    SELECT TOP 100 u, g, r, i, z
    FROM Galaxy
    WHERE (htmid*p & 0x FFFF) < (650*x)
  Galaxies with blue centers:
    SELECT TOP 1000 objid, modelmag_u, modelmag_g
    FROM Galaxy
    WHERE petrorad_r < 18
      AND (modelmag_u-modelmag_g)-(psfmag_u-psfmag_g) < -0.4

Genomics: SAM/BAM Format Source: http://samtools.github.

5 Genomics: SAM/BAM Format
  Source:
  More than 200 TB of genomic data can be downloaded for research

6 Query Example
  Variant (e.g., genome mutation) identification:
    SELECT position, count(*) AS cnt
    FROM genome
    WHERE CIGAR IS NOT 'M'
    GROUP BY position
    HAVING cnt > threshold

  Sam/Bam Files     Querying        Time to query   Access path
  Sam/Bam Tools     write program   instant         sequential scan+
  External Tables   SQL             instant         sequential scan
  SQL*Loader        SQL             loading         optimal
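The variant-identification query on this slide can be tried on toy data. The sketch below is only an illustration: it uses Python's built-in sqlite3 module rather than the system from the talk, the table layout `genome(position, cigar)` and the single-letter CIGAR values are simplifying assumptions, and `variant_positions` is a hypothetical helper name.

```python
import sqlite3

def variant_positions(rows, threshold):
    """Return (position, cnt) pairs with more than `threshold` non-match reads."""
    conn = sqlite3.connect(":memory:")
    # Assumed toy schema: one row per read, with its position and CIGAR value.
    conn.execute("CREATE TABLE genome (position INTEGER, cigar TEXT)")
    conn.executemany("INSERT INTO genome VALUES (?, ?)", rows)
    result = conn.execute(
        "SELECT position, COUNT(*) AS cnt FROM genome "
        "WHERE cigar <> 'M' "
        "GROUP BY position HAVING COUNT(*) > ? "
        "ORDER BY position",
        (threshold,),
    ).fetchall()
    conn.close()
    return result

# Toy reads: position 100 has three non-match reads, 200 has one, 300 has two.
toy_rows = [
    (100, "M"), (100, "I"), (100, "D"), (100, "I"),
    (200, "M"), (200, "D"),
    (300, "I"), (300, "I"),
]
hits = variant_positions(toy_rows, threshold=1)
print(hits)
```

Real SAM records carry CIGAR strings such as run-length encoded operations, so a production predicate would need to inspect the string rather than compare it with a single letter.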

7 Comparison: External table vs. Loading
  [Figure: execution time of queries Q1 to Q4 for External Table and for Loading, split into loading time and query time]

8 Research Problem
  Can we find an optimal solution to execute SQL-like queries in-situ over raw files?
  Instant access to data
  Optimal performance across a query workload

9 Comparison: External table vs. Loading
  [Figure: execution time of queries Q1 to Q4 for External Table, Loading, and Optimal, split into loading time and query time]

10 Related Work
  Adaptive partial loading [Idreos et al., CIDR 2011]
    Only load necessary attributes before query starts
  NoDB [Alagiannis et al., SIGMOD 2012]
    Build index and cache necessary attributes in memory
  Invisible loading [Abouzied et al., EDBT/ICDT 2013]
    Portion of necessary data is loaded into database for every query
  Data vaults [Ivanova et al., SSDBM 2012]
    Memory cache for complex data in scientific repositories
  Instant loading [Mühlbauer et al., PVLDB 2013]
    Speed-up loading using vectorized instructions, e.g., SSE4 and AVX
  SDS/Q [Blanas et al., SIGMOD 2014]
    In-situ processing & analysis on HDF5 files
  RAW [Karpathiotakis et al., PVLDB 2014]
    Runtime creation of Just-In-Time access paths

11 Raw File Processing
  [Diagram labels: Storage, READ (page), WRITE (page), EXTRACT (Tokenize, Map: line to tuple), Execution Engine (tuple)]
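The EXTRACT step named on this slide separates finding field boundaries from converting fields into a tuple. A minimal sketch of that split follows, assuming comma-delimited lines with integer fields and no quoting; `tokenize` and `parse` are hypothetical helper names, not functions from the presented system.

```python
def tokenize(line, delimiter=","):
    """Return the starting offset of every field in `line`."""
    positions = [0]
    for i, ch in enumerate(line):
        if ch == delimiter:
            positions.append(i + 1)
    return positions

def parse(line, positions, columns, convert=int):
    """Convert only the requested columns to binary values, using the offsets."""
    values = []
    for c in columns:
        start = positions[c]
        # A field ends one character before the next field starts,
        # or at the end of the line for the last field.
        end = positions[c + 1] - 1 if c + 1 < len(positions) else len(line)
        values.append(convert(line[start:end]))
    return tuple(values)

line = "10,20,30,40"
offsets = tokenize(line)
row = parse(line, offsets, columns=[1, 3])
print(offsets, row)
```

Keeping the offsets separate from the converted values means a query that touches two columns out of many pays the conversion cost only for those two.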

12 SCANRAW Operator
  Can we achieve optimal execution time for the first query?
  How can we design a parallel operator using current multicore processors?
    Multi-threads
  What kind of architecture can take full advantage of the available parallelism?
    Task & Pipeline parallelism
  How to integrate the operator with a database system?
    SCANRAW operator

13 Where Does the Time Go?
  [Figure: time per chunk (%) spent in READ, TOKENIZE, PARSE, and WRITE versus number of columns]

14 SCANRAW Operator
  How many threads? What is the optimal ratio?
  [Diagram labels: parallel super-scalar pipeline; physical operator; Raw File, Read, text chunks buffer, Tokenize, position buffer, binary chunks buffer, Execution Engine, Write, DB]

15 SCANRAW Operator
  Thread pool: adaptive, dynamic
  [Diagram labels: parallel super-scalar pipeline; physical operator; Raw File, Read, text chunks buffer, Tokenize, position buffer, binary chunks buffer, Execution Engine, Write, DB]

16 SCANRAW Operator
  Task parallelism
  [Diagram labels: Scheduler, worker threads, stand-alone threads (Read, Write), Tokenize, consumer; messages: get worker, assign worker, done, write, stop, resume]
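The combination of a stand-alone reader, a pool of worker threads, and buffers between stages can be sketched with Python's standard `queue` and `threading` modules. This is a simplified illustration under stated assumptions, not the SCANRAW implementation: there is no scheduler, the worker count is fixed rather than adaptive, tokenizing and conversion are fused into one step, and `run_pipeline` is a hypothetical name.

```python
import queue
import threading

SENTINEL = None  # marks the end of the stream

def run_pipeline(lines, chunk_size=2, workers=3, buffer_size=4):
    text_chunks = queue.Queue(maxsize=buffer_size)    # READ -> workers
    binary_chunks = queue.Queue(maxsize=buffer_size)  # workers -> consumer

    def read():
        # Stand-alone reader: cut the raw lines into numbered text chunks.
        for start in range(0, len(lines), chunk_size):
            text_chunks.put((start // chunk_size, lines[start:start + chunk_size]))
        for _ in range(workers):
            text_chunks.put(SENTINEL)  # one stop signal per worker

    def convert():
        # Worker: turn a text chunk into a chunk of integer tuples.
        while True:
            item = text_chunks.get()
            if item is SENTINEL:
                binary_chunks.put(SENTINEL)
                return
            chunk_id, chunk = item
            rows = [tuple(int(f) for f in line.split(",")) for line in chunk]
            binary_chunks.put((chunk_id, rows))

    threads = [threading.Thread(target=read)]
    threads += [threading.Thread(target=convert) for _ in range(workers)]
    for t in threads:
        t.start()

    # Consumer (the calling thread): collect chunks until every worker stops.
    finished, chunks = 0, {}
    while finished < workers:
        item = binary_chunks.get()
        if item is SENTINEL:
            finished += 1
        else:
            chunks[item[0]] = item[1]
    for t in threads:
        t.join()
    # Chunks may arrive out of order; restore the file order by chunk id.
    return [row for cid in sorted(chunks) for row in chunks[cid]]

rows = run_pipeline(["1,2", "3,4", "5,6", "7,8", "9,10"])
print(rows)
```

The bounded queues give back-pressure: when the consumer falls behind, workers block on the binary buffer, and the reader in turn blocks on the text buffer.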

17 Speculative Loading
  Can we also achieve optimal execution time for a sequence of queries?
  How does speculative loading not interfere with query execution?
    Task & Pipeline parallelism
  How does speculative loading improve performance for a sequence of queries?
    Gradual loading
  How do we guarantee that new chunks are loaded for every query?
    Safeguard mechanism

18 Speculative Loading
  [Diagram labels: Raw file, Read, text chunk buffer (raw), Tokenize, position chunk buffer (pst), binary chunk buffer (Binary), consumer, Write, database, Scheduler, worker threads; messages: get worker, assign worker, done, full, WRITE, STOP]

19 Speculative Loading
  [Diagram labels: Safeguard, last one; Raw file, Read, text chunk buffer (raw), Tokenize, position chunk buffer (pst), binary chunk buffer (Binary), consumer, Write, database, Scheduler, worker threads; messages: get worker, assign worker, done, WRITE]
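Gradual loading with a safeguard can be illustrated by a toy model. Everything in the sketch below is an assumption made for illustration: the number of idle write slots per query is supplied by hand instead of being observed from buffer state, the safeguard is modeled as "write one still-unloaded chunk if nothing else was written", and `run_query` is a hypothetical name. It is not the mechanism as implemented in the presented system.

```python
def run_query(total_chunks, loaded, idle_write_slots):
    """Simulate one query over a file of `total_chunks` chunks.

    `loaded` is the set of chunk ids already in the database; it is updated
    in place. Returns (chunks read from the raw file, chunks read from the DB).
    """
    pending = [c for c in range(total_chunks) if c not in loaded]
    db_reads = total_chunks - len(pending)
    # Speculative part: write as many converted chunks as idle slots allow.
    newly_loaded = pending[:idle_write_slots]
    # Safeguard (assumed form): every query loads at least one new chunk.
    if pending and not newly_loaded:
        newly_loaded = pending[-1:]
    loaded.update(newly_loaded)
    return len(pending), db_reads

loaded = set()
history = [run_query(4, loaded, slots) for slots in (2, 0, 5)]
print(history, sorted(loaded))
```

In this run the second query has no idle write slots, yet the safeguard still loads one chunk, so the number of raw-file reads shrinks with every query until the whole file is in the database.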

20 Evaluation
  Data: CSV files with 2 to 256 attributes; 2^20 to 2^28 lines; 20 MB to 638 GB in size
  Query:
  System: 2 AMD 8-core processors (16 cores); 40 GB memory; 4 TB 7200 RPM HDD with 450 MB/s I/O bandwidth
  Illustration: 64 attributes, 2^26 lines, 40 GB

21 Optimal for the First Query?

22 Performance Improvement

23 Always Optimal?

24 Resource Utilization
  [Figure: I/O utilization (%) and CPU utilization (%) versus processing progress (%)]

25 Real-Data Experiments
  Everything is implemented in GLADE
  1000 Genomes NA12878: SAM (145 GB); BAM (26 GB)

  Method                               Execution time (seconds)
  External tables (SAM)                370
  External tables (BAM + BAMTools)     2714
  Data loading (SAM)                   945
  Database processing                  122
  Speculative loading (SAM)            370
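The numbers in this table allow a rough break-even estimate between external tables and loading. The sketch assumes, which the slide does not state explicitly, that every query over the external SAM table costs 370 seconds, that loading is a one-time cost of 945 seconds, and that each query over loaded data costs 122 seconds.

```python
import math

EXTERNAL_PER_QUERY = 370  # seconds, external tables (SAM)
LOADING_ONCE = 945        # seconds, data loading (SAM)
DB_PER_QUERY = 122        # seconds, database processing

def cumulative_external(k):
    """Total time for k queries that each scan the raw file."""
    return k * EXTERNAL_PER_QUERY

def cumulative_loading(k):
    """Total time for loading once and then running k queries in the database."""
    return LOADING_ONCE + k * DB_PER_QUERY

# Smallest k with 370 * k > 945 + 122 * k, i.e. k > 945 / 248.
break_even = math.floor(LOADING_ONCE / (EXTERNAL_PER_QUERY - DB_PER_QUERY)) + 1

for k in range(1, 6):
    print(k, cumulative_external(k), cumulative_loading(k))
print("loading pays off from query", break_even)
```

Under these assumptions external tables are cheaper for the first three queries (1110 s versus 1311 s after three) and loading is cheaper from the fourth query on (1480 s versus 1433 s).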

26 Conclusions
  SCANRAW Operator
    Super-scalar pipeline architecture
    No interference with query execution
    Easy to integrate into a database system
  Speculative Loading
    Make full use of system resources
    Improve performance for a sequence of queries
    Always achieve optimal execution time

27 Thank you! Questions?
