MDHIM: A Parallel Key/Value Store Framework for HPC

1 MDHIM: A Parallel Key/Value Store Framework for HPC Hugh Greenberg 7/6/2015 LA-UR

2 HPC Clusters
- Managed by a job scheduler (e.g., Slurm, Moab)
  - Designed for running user jobs
  - Difficult to run system services
- Parallel file systems
  - High performing with N-N workloads
    - N application processes accessing N files simultaneously

3 HPC Clusters
- High speed interconnects
  - InfiniBand
  - Cray Gemini
- Composed of high-end compute nodes
  - Server class hardware or better

5 Motivation
- Exascale
  - More processes performing file operations simultaneously
  - Less memory per CPU core
- Existing solutions
  - For cloud storage or web services
  - Do not efficiently utilize HPC environments and programming models

6 Motivation
- Existing solutions
  - Require long running daemons
  - TCP/IP only
  - Cannot be easily added to HPC applications
    - Lack C/C++ APIs
    - Require additional languages
  - E.g., Cassandra, Dynamo, HBase, Riak

7 Motivation
- Parallel Log Structured File System
  - Developed at LANL
  - Turns N-1 workloads into N-N
  - Requires each process to read a potentially large index into memory
- Needed a scalable index

8 Solution
- MDHIM
  - Multi-Dimensional Hashing Indexing Middleware
  - Distributed key/value store framework designed for HPC
  - Written in an HPC friendly programming model
    - MPI
  - Easily added to an MPI application

9 MDHIM - Features
- Doesn't require long running daemons
  - Servers (range servers) are spawned as separate threads
  - Starts with the application and dies with it
- Pluggable data stores
  - LevelDB as default
  - MySQL support
  - Not difficult to add additional data stores
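
The two features on this slide can be illustrated with a minimal Python sketch. It is not MDHIM code (MDHIM is written in C on top of MPI), and the names `DataStore`, `DictStore`, and `RangeServer` are hypothetical. It only shows the idea of a server thread that lives exactly as long as the application and talks to a swappable back end.

```python
import queue
import threading
from abc import ABC, abstractmethod


class DataStore(ABC):
    """Hypothetical back-end interface; not MDHIM's actual plug-in API."""

    @abstractmethod
    def put(self, key, value): ...

    @abstractmethod
    def get(self, key): ...


class DictStore(DataStore):
    """In-memory stand-in for a real back end such as LevelDB."""

    def __init__(self):
        self._data = {}

    def put(self, key, value):
        self._data[key] = value

    def get(self, key):
        return self._data.get(key)


class RangeServer(threading.Thread):
    """Serves requests from a queue until it receives a shutdown sentinel."""

    def __init__(self, store):
        super().__init__(daemon=True)
        self.store = store
        self.requests = queue.Queue()

    def run(self):
        while True:
            item = self.requests.get()
            if item is None:  # shutdown sentinel
                break
            op, key, value, reply = item
            if op == "put":
                self.store.put(key, value)
                reply.put(True)
            else:
                reply.put(self.store.get(key))

    def call(self, op, key, value=None):
        # Send one request and block until the server thread answers.
        reply = queue.Queue(maxsize=1)
        self.requests.put((op, key, value, reply))
        return reply.get()

    def shutdown(self):
        self.requests.put(None)
        self.join()


server = RangeServer(DictStore())
server.start()                     # the server starts with the application
server.call("put", 42, "answer")
result = server.call("get", 42)
server.shutdown()                  # and stops with it
```

Swapping the back end means passing a different `DataStore` subclass to `RangeServer`; nothing else in the sketch changes.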

10 MDHIM - Features
- Bulk operations
  - Transfer large packets with many key/value pairs over a high-speed interconnect
- Multiple dimensions
  - The primary index
    - Key/value pairs with arbitrary values
    - Globally ordered
  - Secondary indexes
    - Keys with values that point to the keys of the primary index
    - Globally ordered or local to the range server
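
The relationship between the primary index and a secondary index can be sketched in a few lines of Python. This is an illustration under assumptions, not MDHIM's API: the dictionaries `primary` and `by_owner` and the functions `put` and `get_by_owner` are hypothetical names, and everything lives in one process.

```python
primary = {}    # primary key -> arbitrary value
by_owner = {}   # secondary key -> set of primary keys


def put(primary_key, value, owner):
    """Store the value once, and record the primary key under the secondary key."""
    primary[primary_key] = value
    by_owner.setdefault(owner, set()).add(primary_key)


def get_by_owner(owner):
    """Follow the secondary index to primary keys, then fetch the values."""
    return {pk: primary[pk] for pk in sorted(by_owner.get(owner, ()))}


put(1, "temperature.dat", owner="alice")
put(2, "pressure.dat", owner="bob")
put(3, "velocity.dat", owner="alice")
alice_files = get_by_owner("alice")
```

The secondary index never holds the values themselves, only pointers back to primary keys, so a value is stored once no matter how many dimensions index it.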

11 MDHIM Global Indexes
- Supports global ordering
  - Keys can be retrieved in order
  - Order depends on key type
- Each key maps to a single range server
  - Clients use the partitioner algorithm for the key location
  - Avoids querying range servers
  - Requires a single large data transfer of statistics data (mdhimstatflush)
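
A minimal sketch of how a client could locate a key without asking any server is shown here in Python. It assumes non-negative integer keys that are grouped into fixed-size slices, with slices dealt out to range servers in round-robin order. The constants `SLICE_SIZE` and `NUM_RANGE_SERVERS` and both function names are assumptions for illustration, not values or calls taken from MDHIM.

```python
SLICE_SIZE = 1000        # assumed number of keys per slice
NUM_RANGE_SERVERS = 3    # assumed number of range servers


def slice_of(key):
    """Slice number for a non-negative integer key."""
    return key // SLICE_SIZE


def range_server_of(key):
    """Range server that owns the key's slice (round-robin assignment)."""
    return slice_of(key) % NUM_RANGE_SERVERS


locations = [(k, slice_of(k), range_server_of(k)) for k in (5, 1005, 2005, 3005)]
```

Because slice numbers grow with the key, walking the slices in increasing order visits keys in global order even though consecutive slices sit on different servers. Every client evaluates the same pure function, so each key maps to exactly one range server.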

12 MDHIM Global Indexes
- Cursor operations
  - Get next or previous key
  - Traverses range servers
  - Requires a single large data transfer of statistics data (mdhimstatflush)
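
A "get next" cursor over a globally ordered index can be sketched as follows, in Python and in a single process. The sketch assumes non-negative integer keys, fixed-size slices assigned round-robin to servers, and flushed statistics that simply list the non-empty slices. All names and constants are hypothetical, and the real statistics format is not described on the slide.

```python
import bisect

SLICE_SIZE = 10    # assumed keys per slice
NUM_SERVERS = 3    # assumed number of range servers

# servers[i] maps slice number -> sorted list of keys held by server i
servers = [dict() for _ in range(NUM_SERVERS)]


def put(key):
    s = key // SLICE_SIZE
    keys = servers[s % NUM_SERVERS].setdefault(s, [])
    bisect.insort(keys, key)


def flushed_stats():
    """Stand-in for flushed statistics: sorted list of non-empty slices."""
    return sorted(s for srv in servers for s in srv)


def get_next(key, stats):
    """Return the smallest stored key greater than `key`, or None."""
    start = key // SLICE_SIZE
    for s in stats[bisect.bisect_left(stats, start):]:
        keys = servers[s % NUM_SERVERS][s]   # the server that owns slice s
        i = bisect.bisect_right(keys, key)
        if i < len(keys):
            return keys[i]
    return None


for k in (3, 7, 12, 48):
    put(k)
stats = flushed_stats()
walk = [get_next(3, stats), get_next(7, stats), get_next(12, stats), get_next(48, stats)]
```

With the statistics in hand, the cursor skips the empty slices between 12 and 48 instead of querying the servers that would have held them.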

13 MDHIM - Local Indexes
- Supports local indexes
  - Each rank can store key/value pairs to itself
  - Lookups require querying multiple servers
  - mdhimstatflush can help to reduce the number of queries
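
The effect of flushed statistics on a local-index lookup can be illustrated with a small Python sketch. It assumes the statistics contain a minimum and maximum key per rank, which is a guess made for illustration. The data in `local_stores` and the function names are hypothetical.

```python
# Each rank keeps its own key/value pairs (a local index).
local_stores = {
    0: {17: "a", 42: "b"},
    1: {5: "c", 9: "d"},
    2: {100: "e", 250: "f"},
}


def flushed_stats():
    """Assumed statistics: (min key, max key) for every non-empty rank."""
    return {rank: (min(s), max(s)) for rank, s in local_stores.items() if s}


def lookup(key, stats=None):
    """Return (value, ranks_queried); prune ranks when stats are available."""
    if stats is None:
        candidates = list(local_stores)
    else:
        candidates = [r for r, (lo, hi) in stats.items() if lo <= key <= hi]
    queried = []
    for rank in candidates:
        queried.append(rank)
        if key in local_stores[rank]:
            return local_stores[rank][key], queried
    return None, queried


without_stats = lookup(100)
with_stats = lookup(100, flushed_stats())
```

Without statistics the lookup has to ask every rank in turn; with them it contacts only the ranks whose key range could contain the key.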

14 MDHIM - Partitioning
- Built-in partitioning algorithm with reasonable defaults
- Pluggable partitioning planned
  - User defined functions for mapping of keys to range servers

15 MDHIM - Design
- MDHIM contains a client and a range server
- Each rank in this image is running a range server
- Clients use the partitioner to determine which range server to contact

Figure: MDHIM software design
  Rank 1: App, Client, Range Server (ranges 1, 4, 7), MDHIM Library
  Rank 2: App, Client, Range Server (ranges 2, 5, 8), MDHIM Library
  Rank 3: App, Client, Range Server (ranges 3, 6, 9), MDHIM Library
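
The range assignment listed for the three ranks (1, 4, 7 on rank 1; 2, 5, 8 on rank 2; 3, 6, 9 on rank 3) is consistent with dealing ranges out round-robin. The Python sketch below reproduces that table under this assumption. `owner_rank` is a hypothetical helper, not an MDHIM function.

```python
from collections import defaultdict

NUM_RANKS = 3


def owner_rank(range_id):
    """Round-robin owner; ranges and ranks are both numbered from 1."""
    return (range_id - 1) % NUM_RANKS + 1


ownership = defaultdict(list)
for range_id in range(1, 10):
    ownership[owner_rank(range_id)].append(range_id)
```

Interleaving ranges this way spreads neighbouring key ranges over different ranks, so a run of nearby keys does not land on a single range server.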

16 MDHIM - Evaluation
- Compared against Cassandra
- Used the Yahoo Cloud Serving Benchmark
  - Created an MDHIM plugin
  - Used the built-in Cassandra plugin
  - Random integers as keys
- Tests performed on the LANL Mustang cluster
  - 2 AMD 12-core Magny-Cours
  - 64GB of memory per node
  - 1600 nodes

17 MDHIM - Evaluation
- Cassandra used IP over InfiniBand
- MDHIM used native InfiniBand
- Tuned Cassandra and LevelDB to use 50MB of memory
- Cassandra configured to use batch mode
  - The default is periodic
  - Forces Cassandra to wait until data is synced to disk before returning
  - Matches MDHIM/LevelDB

18 MDHIM - Evaluation
Figure: Two types of tests are represented: 1K records per node and 100K records per process. Three runs were performed at each point.

19 MDHIM - Evaluation
Figure: 1 million records inserted/retrieved in total for each run. Three runs were performed for each data point.
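
The difference between the two kinds of test is simple arithmetic, sketched here in Python. The function names are hypothetical. The 1 million total, the 256 processes, and the 128-node run with 100K records per node are figures quoted on the evaluation slides. That the work divides evenly among processes is an assumption.

```python
STRONG_SCALING_TOTAL = 1_000_000   # fixed total from the strong scaling test


def strong_scaling_per_process(processes):
    """Strong scaling: the total is fixed, so each process gets a smaller share."""
    return STRONG_SCALING_TOTAL // processes


def weak_scaling_total(units, records_per_unit):
    """Weak scaling: the per-unit load is fixed, so the total grows with the run size."""
    return units * records_per_unit


share_at_256 = strong_scaling_per_process(256)
total_at_128_nodes = weak_scaling_total(128, 100_000)
```

In the strong scaling case each of 256 processes handles roughly 3,900 records, while a weak scaling run on 128 nodes at 100K records per node handles 12.8 million records in total.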

20 MDHIM - Evaluation
- MDHIM performs slightly better than Cassandra in the weak scaling test with 1K records per node
- MDHIM outperforms Cassandra in the test with 100K records per node and in the strong scaling test
  - times faster with 256 processes

21 MDHIM - Evaluation
- Reasons for MDHIM's performance
  - Native InfiniBand support
  - Better key distribution
  - C vs Java

22 MDHIM - Evaluation
Figure: The frequency of database sizes for Cassandra and MDHIM after a run with 128 nodes and 100K records inserted per node.

23 Conclusion
- MDHIM is a parallel key/value store framework for HPC
  - Designed for HPC systems and job schedulers
  - Utilizes high speed interconnects and MPI
  - Easily added to a scientific application
- Outperformed Cassandra in all tests with the Yahoo Cloud Serving Benchmark (YCSB)

24 Thank you
- Code
- Contact
  - Hugh Greenberg
