The Pliny Database PDB

Size: px

Start display at page:

Download "The Pliny Database PDB"

Calvin Ryan
5 years ago
Views:

1 The Pliny Database PDB Chris Jermaine Carlos Monroy, Kia Teymourian, Sourav Sikdar Rice University 1

2 PDB Overview PDB: Distributed object store + compute platform In Pliny project: Used to store processed source code Used to store large volume of program analysis results Call graphs Parse trees Bags of system calls Text mining results for comments... Used to perform new program analyses at scale Program element at-a-time As well as batch (such as ML) Chris Jermaine, Rice University 2

3 Design Goals Want a system that combines Flexibility of a schema-less system (Hadoop, Spark) High performance of a database system Chris Jermaine, Rice University 3

4 Programming Model PDB class Foo : public PDBStoredDataType{ int a; PDBClient::addData() int b; }; node 2 node 1 cloud-based cluster node 4 class Bar : public PDBStoredDataType{ int c; int d; }; PDBClient::addData() node 3 Data Insertion PDBClient::runQuery() Queries, Updates, Deletes class MyQuery : public PDBQueryExecutor { operator +=; void aggregate (); }; User defines data types in host language (C++, at least initially) System stores data persistently in distributed compute cluster PDB manages object inserts, deletions, updates like a DB Data manipulation via code in the host language Chris Jermaine, Rice University 4

5 Programming Model No data manipulation language (SQL/XQuery/etc.) Easy to write new query types......machine learning computations...top-k queries (fundamental to Pliny!)...Graph analyses All access to stored objects via user-defined methods Chris Jermaine, Rice University 5

6 Why Don t Existing Systems Work for Pliny? Classical RDBs not a good fit Not easy to store program analysis results as flat tables Trees Graphs Lists Difficult to implement program analysis in classical DML Want to allow arbitrary user-defined computations Hard to do in SQL/XQuery/etc. Chris Jermaine, Rice University 6

7 Why Don t Existing Systems Work for Pliny? Existing Big Data systems not a good fit Hadoop/Spark are schema-never: BIG performance hit Excellent flexibility, cost of performance/complexity Expensive! Object allocation Data movement/ copy Class Foo Expensive! Garbage collection Data movement/ copy Class Foo Sequence (disk, network) RAM Chris Jermaine, Rice University 7

8 Scale to 100s of machines Performance/Scalability Goals Chris Jermaine, Rice University 8

9 Performance/Scalability Goals 10X faster than Spark for most batch computations. How? Pliny data DOES have structure Just not structure well-served by traditional data models Like DB: control data movement, layout in RAM and on disk Don t delegate to user, or 3rd party tools Inexpensive! No distinction in RAM/sequence layout Only bring part you need Class Foo Inexpensive! Same layout Only write dirty part Class Foo Sequence (disk, network) RAM Chris Jermaine, Rice University 9

10 Performance/Scalability Goals 100X faster than Spark in special cases How? Leverage classical DB techniques Indexing Fast parallel join algorithms Smart buffer management Chris Jermaine, Rice University 10

11 Preliminary Performance Results Task (1) Load up 20,000 text documents (2) Build a dictionary of most frequent 10,000 words (3) Add a sparse, 10,000 entry TF vector to each document (4) Close down, re-load data (5) Run a top-k query to find 12 closest docs to query doc Run on a single, 4-core Amazon EC2 machine Spark vs. PDB Task Spark Time (sec) PDB Time (sec) (1) (2) (3) (4) (5) Total Chris Jermaine, Rice University 11

Dealing with Data Especially Big Data

Dealing with Data Especially Big Data INFO-GB-2346.01 Fall 2017 Professor Norman White nwhite@stern.nyu.edu normwhite@twitter Teaching Assistant: Frenil Sanghavi fps241@stern.nyu.edu Administrative Assistant: