Data Sharing Made Easier through Programmable Metadata. University of Wisconsin-Madison

Zhe Zhang, IBM Research
Remzi Arpaci-Dusseau, University of Wisconsin-Madison

How do applications share data today?

Syncing data between storage systems:
  - Commonly used big data workflow
  - Slow, stale and strenuous

Mounting and using shared storage systems:
  - Difficult to serve heterogeneous workloads
  - Heavy workload on centralized name nodes

[Figure: primary data (transactions, logs, etc.) feeding a cloud analytics cluster and an in-house analytics cluster.]

Observations
  - Data always written and read through the same storage system (filesystem, DB, etc.)
      - Metadata updated with writes
      - Metadata used in reads
  - Data produced in form A and consumed in form B?
      - View DB records as a file?
      - Analyze thousands of local log files as a single text file?

Programming the Metadata

[Figure: logical definition of a file as segment 1, segment 2, segment 3; under the hood, a source DB table.]
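The idea of a file whose logical definition is a list of segments can be sketched as follows. This is a minimal illustration, not the authors' implementation: the names `Segment`, `ComposedFile` and `bytes_segment` are hypothetical, and in-memory byte strings stand in for rows of a source DB table. The point is that the file stores only metadata, and a read is resolved against the sources at read time.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List


@dataclass
class Segment:
    """One piece of a logical file (hypothetical structure).

    length:      number of bytes this segment contributes to the file
    read_source: fetches bytes from the underlying source,
                 called as read_source(offset_in_segment, size)
    """
    length: int
    read_source: Callable[[int, int], bytes]


class ComposedFile:
    """A file defined only by metadata: an ordered list of segments."""

    def __init__(self, segments: Iterable[Segment]) -> None:
        self.segments: List[Segment] = list(segments)

    def size(self) -> int:
        return sum(seg.length for seg in self.segments)

    def read(self, offset: int, size: int) -> bytes:
        """Resolve a logical read by visiting each overlapping segment."""
        end = min(offset + size, self.size())
        out = bytearray()
        seg_start = 0
        for seg in self.segments:
            seg_end = seg_start + seg.length
            lo = max(offset, seg_start)   # first logical byte wanted here
            hi = min(end, seg_end)        # one past the last byte wanted here
            if lo < hi:
                out += seg.read_source(lo - seg_start, hi - lo)
            seg_start = seg_end
        return bytes(out)


def bytes_segment(data: bytes) -> Segment:
    """Stand-in source: an in-memory record instead of a real DB row."""
    return Segment(len(data), lambda off, n: data[off:off + n])


if __name__ == "__main__":
    f = ComposedFile(bytes_segment(r) for r in (b"row1\n", b"row2\n", b"row3\n"))
    print(f.size())
    print(f.read(3, 5))
```

Because the sources are consulted on every read, a change in a source record is visible through the composed file without any copy or sync step, which is the property the slide is after.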

Challenges

API challenge: identification / namespace of source data
  - How to define a file in VM1 to include a source file in VM2?
  - Granularity-based source file selection: 1 out of 10 lines of text?
  - Content-based source file selection: all lines containing a certain keyword?
  - Arbitrary SELECT * FROM * WHERE * in source DB tables?

Performance challenge: frequent metadata updates

Layers and examples of listeners:
  Applications
    - Map to destination file if keyword matches
    - Map every 1 line out of 10 lines of text to destination file
  VFS
    - Map entire file to destination file
    - Map every 1MB out of 10MB to destination file
  Block storage
    - All VFS listeners can be implemented on the block layer with a reverse pointer from block to inode
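The two application-layer listeners named above (keyword match and 1-in-10 line sampling) can be sketched as plain predicates. This is an assumption-laden illustration: the slides do not define a listener API, so `Listener`, `keyword_listener`, `sampling_listener` and `apply_listener` are hypothetical names, and a Python list stands in for the destination file.

```python
from typing import Callable, Iterable, List

# Hypothetical listener signature: (line number, line text) -> map this line?
Listener = Callable[[int, str], bool]


def keyword_listener(keyword: str) -> Listener:
    """Content-based selection: map a line if it contains the keyword."""
    return lambda lineno, line: keyword in line


def sampling_listener(every: int) -> Listener:
    """Granularity-based selection: map 1 line out of every `every` lines."""
    return lambda lineno, line: lineno % every == 0


def apply_listener(lines: Iterable[str], listener: Listener) -> List[str]:
    """Return the source lines that the listener maps to the destination."""
    return [line for lineno, line in enumerate(lines) if listener(lineno, line)]


if __name__ == "__main__":
    log = ["ok", "ERROR disk", "ok", "ERROR net"]
    print(apply_listener(log, keyword_listener("ERROR")))

    numbered = [f"line {i}" for i in range(30)]
    print(apply_listener(numbered, sampling_listener(10)))
```

In a real system the listener would run on each write to the source and update the destination file's metadata, which is where the performance challenge of frequent metadata updates comes from.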

Use Case: Distributed Live Analytics

hadoop dfs -composefromlocal <configuration file> <path to HDFS file>

Configuration file:
  slave1:/opt/ibm/*/*.log
  slave2:/var/*.log

Challenges
  - Informing NameNode of local file size changes
  - Balancing workload

[Figure: server VMs in San Jose, New York, Dallas and Raleigh; composed files Server-Logs and Eastern-Coast-Logs; consumers MapReduce, Stream, etc.]
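A configuration file of this shape could be parsed as sketched below. The format is inferred only from the two example lines on the slide (one `host:pattern` entry per line), so treat it as an assumption; `parse_config` and `matches` are hypothetical helpers and are not part of Hadoop.

```python
import fnmatch
from typing import List, Tuple


def parse_config(text: str) -> List[Tuple[str, str]]:
    """Parse lines of the assumed form 'host:/path/pattern'."""
    entries = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        host, sep, pattern = line.partition(":")
        if not sep or not host or not pattern:
            raise ValueError(f"expected 'host:pattern', got {line!r}")
        entries.append((host, pattern))
    return entries


def matches(entries: List[Tuple[str, str]], host: str, path: str) -> bool:
    """True if a local file on `host` belongs to the composed file."""
    return any(
        h == host and fnmatch.fnmatchcase(path, pattern)
        for h, pattern in entries
    )


if __name__ == "__main__":
    config = "slave1:/opt/ibm/*/*.log\nslave2:/var/*.log\n"
    entries = parse_config(config)
    print(entries)
    print(matches(entries, "slave1", "/opt/ibm/app/out.log"))
    print(matches(entries, "slave2", "/opt/ibm/app/out.log"))
```

One caveat on the design choice: `fnmatch` lets `*` match across `/`, unlike shell globbing, so `/var/*.log` here would also match `/var/log/app/x.log`. A stricter implementation would match the path component by component.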
