A Distributed Hash Table for Shared Memory

Size: px
Start display at page:

Download "A Distributed Hash Table for Shared Memory"

Transcription

1 A Distributed Hash Table for Shared Memory Wytse Oortwijn Formal Methods and Tools, University of Twente August 31, 2015 Wytse Oortwijn (Formal Methods and Tools, AUniversity Distributed of Twente) Hash Table for Shared Memory August 31, / 20

2 Table of Contents 1 Introduction 2 Contribution 1: Resolving Hash Collisions 3 Contribution 2: Hiding Latency 4 Experimental Evaluation 5 Conclusion Wytse Oortwijn (Formal Methods and Tools, AUniversity Distributed of Twente) Hash Table for Shared Memory August 31, / 20

3 Table of Contents 1 Introduction 2 Contribution 1: Resolving Hash Collisions 3 Contribution 2: Hiding Latency 4 Experimental Evaluation 5 Conclusion Wytse Oortwijn (Formal Methods and Tools, AUniversity Distributed of Twente) Hash Table for Shared Memory August 31, / 20

4 Distributed Hash Table Main challenge Building a fast and CPU-efficient shared hash table: Minimal latency Minimal memory overhead Not relying on CPU-polling ytse Oortwijn (Formal Methods and Tools, AUniversity Distributed of Twente) Hash Table for Shared Memory August 31, / 20

5 Distributed Hash Table Main challenge Building a fast and CPU-efficient shared hash table: Minimal latency Minimal memory overhead Not relying on CPU-polling Many use cases in HPC Parallel graph searching Distributed model checking ytse Oortwijn (Formal Methods and Tools, AUniversity Distributed of Twente) Hash Table for Shared Memory August 31, / 20

6 Distributed Hash Table Main challenge Building a fast and CPU-efficient shared hash table: Minimal latency Minimal memory overhead Not relying on CPU-polling Many use cases in HPC Parallel graph searching Distributed model checking Distributed (LAN) vs Parallel Cheaper scalability Unlimited scalability, but ytse Oortwijn (Formal Methods and Tools, AUniversity Distributed of Twente) Hash Table for Shared Memory August 31, / 20

7 Distributed Hash Table Main challenge Building a fast and CPU-efficient shared hash table: Minimal latency Minimal memory overhead Not relying on CPU-polling Many use cases in HPC Parallel graph searching Distributed model checking Distributed (LAN) vs Parallel Cheaper scalability Unlimited scalability, but Performance overhead! ytse Oortwijn (Formal Methods and Tools, AUniversity Distributed of Twente) Hash Table for Shared Memory August 31, / 20

8 Distributed Hash Table Main challenge Building a fast and CPU-efficient shared hash table: Minimal latency Minimal memory overhead Not relying on CPU-polling Many use cases in HPC Parallel graph searching Distributed model checking Distributed (LAN) vs Parallel Cheaper scalability Unlimited scalability, but Performance overhead! Efficient distributed processing Specialized algorithms and data structures needed! ytse Oortwijn (Formal Methods and Tools, AUniversity Distributed of Twente) Hash Table for Shared Memory August 31, / 20

9 Distributed Hash Table Main challenge Building a fast and CPU-efficient shared hash table: Minimal latency Minimal memory overhead Not relying on CPU-polling Many use cases in HPC Parallel graph searching Distributed model checking Distributed (LAN) vs Parallel Cheaper scalability Unlimited scalability, but Performance overhead! Efficient distributed processing Specialized algorithms and data structures needed! Contribution: Reducing roundtrips while CPU-efficient ytse Oortwijn (Formal Methods and Tools, AUniversity Distributed of Twente) Hash Table for Shared Memory August 31, / 20

10 High-performance Networking Infiniband hardware Specialized hardware used to construct high-performance networks: Comparable in price to Ethernet Supports bandwidths up to 100 Gb/s Direct access to memory via PCI-E bus Wytse Oortwijn (Formal Methods and Tools, AUniversity Distributed of Twente) Hash Table for Shared Memory August 31, / 20

11 High-performance Networking Infiniband hardware Specialized hardware used to construct high-performance networks: Comparable in price to Ethernet Supports bandwidths up to 100 Gb/s Direct access to memory via PCI-E bus RDMA: Remote Direct Memory Access Directly access to remote memory without invoking remote CPUs Zero-copy networking Kernel bypassing No participation from remote CPUs Wytse Oortwijn (Formal Methods and Tools, AUniversity Distributed of Twente) Hash Table for Shared Memory August 31, / 20

12 High-performance Networking Infiniband hardware Specialized hardware used to construct high-performance networks: Comparable in price to Ethernet Supports bandwidths up to 100 Gb/s Direct access to memory via PCI-E bus RDMA: Remote Direct Memory Access Directly access to remote memory without invoking remote CPUs Zero-copy networking Kernel bypassing No participation from remote CPUs Performance: one-sided RDMA vs TCP Roundtrips latency: < 3µs (Infiniband) vs 60µs (traditional Ethernet) Wytse Oortwijn (Formal Methods and Tools, AUniversity Distributed of Twente) Hash Table for Shared Memory August 31, / 20

13 Hash Table: Challenges Notation: Hash table T = b 0,..., b n 1 as a sequence of buckets b i, where: n the hash table size and m the number of used entries α = m n the load factor Wytse Oortwijn (Formal Methods and Tools, AUniversity Distributed of Twente) Hash Table for Shared Memory August 31, / 20

14 Hash Table: Challenges Notation: Hash table T = b 0,..., b n 1 as a sequence of buckets b i, where: n the hash table size and m the number of used entries α = m n the load factor Operation: only find-or-put(d) Takes a data element d as parameter, and: if d T, return found if d T, insert d and return inserted if d T and d cannot be inserted, return full Wytse Oortwijn (Formal Methods and Tools, AUniversity Distributed of Twente) Hash Table for Shared Memory August 31, / 20

15 Hash Table: Challenges Notation: Hash table T = b 0,..., b n 1 as a sequence of buckets b i, where: n the hash table size and m the number of used entries α = m n the load factor Operation: only find-or-put(d) Takes a data element d as parameter, and: if d T, return found if d T, insert d and return inserted if d T and d cannot be inserted, return full Design: Challenges How to distribute and access T = b 0,..., b n 1 efficiently? How to design find-or-put to perform efficiently? Wytse Oortwijn (Formal Methods and Tools, AUniversity Distributed of Twente) Hash Table for Shared Memory August 31, / 20

16 PGAS: Partitioned Global Address Space Details Assuming N participating threads: ytse Oortwijn (Formal Methods and Tools, AUniversity Distributed of Twente) Hash Table for Shared Memory August 31, / 20

17 PGAS: Partitioned Global Address Space PGAS t 1 t 2... t N... Details Assuming N participating threads: PGAS: shared + distributed memory model ytse Oortwijn (Formal Methods and Tools, AUniversity Distributed of Twente) Hash Table for Shared Memory August 31, / 20

18 PGAS: Partitioned Global Address Space PGAS Hybrid PGAS t 1 t 2... t N... t 1 t t N Details Assuming N participating threads: PGAS: shared + distributed memory model Hybrid PGAS: PGAS + message passing (dashed edges) Wytse Oortwijn (Formal Methods and Tools, AUniversity Distributed of Twente) Hash Table for Shared Memory August 31, / 20

19 Table of Contents 1 Introduction 2 Contribution 1: Resolving Hash Collisions 3 Contribution 2: Hiding Latency 4 Experimental Evaluation 5 Conclusion Wytse Oortwijn (Formal Methods and Tools, AUniversity Distributed of Twente) Hash Table for Shared Memory August 31, / 20

20 Hash Collisions Efficiency: Resolving Hash Collisions Occurs when h(x) = h(y) for data elements x y Efficiency of find-or-put depends on hashing strategy! Wytse Oortwijn (Formal Methods and Tools, AUniversity Distributed of Twente) Hash Table for Shared Memory August 31, / 20

21 Hash Collisions Efficiency: Resolving Hash Collisions Occurs when h(x) = h(y) for data elements x y Efficiency of find-or-put depends on hashing strategy! Existing Work Pilaf, 2014 (Cuckoo) Nessie, 2014 (Cuckoo) FaRM, 2014 (Hopscotch) HERD, 2014 (CPU-intensive) Wytse Oortwijn (Formal Methods and Tools, AUniversity Distributed of Twente) Hash Table for Shared Memory August 31, / 20

22 Hash Collisions Efficiency: Resolving Hash Collisions Occurs when h(x) = h(y) for data elements x y Efficiency of find-or-put depends on hashing strategy! Existing Work Pilaf, 2014 (Cuckoo) Nessie, 2014 (Cuckoo) FaRM, 2014 (Hopscotch) HERD, 2014 (CPU-intensive) Contributions Existing implementations either: Require more roundtrips Require locking schemes Are CPU-intensive ytse Oortwijn (Formal Methods and Tools, AUniversity Distributed of Twente) Hash Table for Shared Memory August 31, / 20

23 Hash Collisions Efficiency: Resolving Hash Collisions Occurs when h(x) = h(y) for data elements x y Efficiency of find-or-put depends on hashing strategy! Existing Work Pilaf, 2014 (Cuckoo) Nessie, 2014 (Cuckoo) FaRM, 2014 (Hopscotch) HERD, 2014 (CPU-intensive) Contributions Existing implementations either: Require more roundtrips Require locking schemes Are CPU-intensive Best strategy for find-or-put Which strategy requires the least number of roundtrips? ytse Oortwijn (Formal Methods and Tools, AUniversity Distributed of Twente) Hash Table for Shared Memory August 31, / 20

24 Comparing Hashing Strategies Chained Hashing + Theoretical comp. Θ(1 + α) - Dynamic mem. management - Storing pointers ytse Oortwijn (Formal Methods and Tools, AUniversity Distributed of Twente) Hash Table for Shared Memory August 31, / 20

25 Comparing Hashing Strategies Chained Hashing + Theoretical comp. Θ(1 + α) - Dynamic mem. management - Storing pointers Cuckoo Hashing + Uses k hash functions - Lookups require k roundtrips - Relocations require locks ytse Oortwijn (Formal Methods and Tools, AUniversity Distributed of Twente) Hash Table for Shared Memory August 31, / 20

26 Comparing Hashing Strategies Chained Hashing + Theoretical comp. Θ(1 + α) - Dynamic mem. management - Storing pointers Cuckoo Hashing + Uses k hash functions - Lookups require k roundtrips - Relocations require locks Hopscotch Hashing + Using neighbourhoods + Lookups require 1 roundtrip - Relocations require locks ytse Oortwijn (Formal Methods and Tools, AUniversity Distributed of Twente) Hash Table for Shared Memory August 31, / 20

27 Comparing Hashing Strategies Chained Hashing + Theoretical comp. Θ(1 + α) - Dynamic mem. management - Storing pointers Hopscotch Hashing + Using neighbourhoods + Lookups require 1 roundtrip - Relocations require locks Cuckoo Hashing + Uses k hash functions - Lookups require k roundtrips - Relocations require locks Linear Probing + Buckets are consecutive + No locking or relocations - Roundtrips for lookups? ytse Oortwijn (Formal Methods and Tools, AUniversity Distributed of Twente) Hash Table for Shared Memory August 31, / 20

28 Comparing Hashing Strategies Chained Hashing + Theoretical comp. Θ(1 + α) - Dynamic mem. management - Storing pointers Hopscotch Hashing + Using neighbourhoods + Lookups require 1 roundtrip - Relocations require locks Cuckoo Hashing + Uses k hash functions - Lookups require k roundtrips - Relocations require locks Linear Probing + Buckets are consecutive + No locking or relocations - Roundtrips for lookups? Linear Probing versus Hopscotch Due to Hopscotch invariant, lookups may be more expensive, but Inserts are arguably cheaper (amortized complexity) ytse Oortwijn (Formal Methods and Tools, AUniversity Distributed of Twente) Hash Table for Shared Memory August 31, / 20

29 Linear Probing: Efficiency Bounds Knuth, 1997 The expected number of buckets to examine until the intended buckets is found is at most: 1 2 ( ) (1 α) 2 ytse Oortwijn (Formal Methods and Tools, AUniversity Distributed of Twente) Hash Table for Shared Memory August 31, / 20

30 Linear Probing: Efficiency Bounds Knuth, 1997 The expected number of buckets to examine until the intended buckets is found is at most: 1 2C ( ) (1 α) 2 ytse Oortwijn (Formal Methods and Tools, AUniversity Distributed of Twente) Hash Table for Shared Memory August 31, / 20

31 Linear Probing: Efficiency Bounds Knuth, 1997 The expected number of buckets to examine until the intended buckets is found is at most: 1 2C ( ) (1 α) 2 Efficiency bound A chunk is expected to contain the intended bucket if: 1 α 1 2C 1 ytse Oortwijn (Formal Methods and Tools, AUniversity Distributed of Twente) Hash Table for Shared Memory August 31, / 20

32 Linear Probing: Efficiency Bounds Knuth, 1997 The expected number of buckets to examine until the intended buckets is found is at most: 1 2C ( ) (1 α) 2 Efficiency bound A chunk is expected to contain the intended bucket if: 1 α 1 2C 1 Expected load-factor at which a chunk is full Chunk size C 3 2 Chunk size C load-factor α load-factor α ytse Oortwijn (Formal Methods and Tools, AUniversity Distributed of Twente) Hash Table for Shared Memory August 31, / 20

33 Linear Probing: Efficiency Bounds Knuth, 1997 The expected number of buckets to examine until the intended buckets is found is at most: 1 2C ( ) (1 α) 2 Efficiency bound A chunk is expected to contain the intended bucket if: 1 α 1 2C 1 Expected load-factor at which a chunk is full Chunk size C load-factor α Chunk size C C = 64 C = load-factor α ytse Oortwijn (Formal Methods and Tools, AUniversity Distributed of Twente) Hash Table for Shared Memory August 31, / 20

34 Table of Contents 1 Introduction 2 Contribution 1: Resolving Hash Collisions 3 Contribution 2: Hiding Latency 4 Experimental Evaluation 5 Conclusion Wytse Oortwijn (Formal Methods and Tools, AUniversity Distributed of Twente) Hash Table for Shared Memory August 31, / 20

35 Linear Probing: Hiding Latency Contribution: Asynchronous queries Before chunk iteration, first request the next chunk: Overlapping roundtrips with computational activity Find next chunk with quadratic probing to prevent clustering Wytse Oortwijn (Formal Methods and Tools, AUniversity Distributed of Twente) Hash Table for Shared Memory August 31, / 20

36 Linear Probing: Hiding Latency Contribution: Asynchronous queries Before chunk iteration, first request the next chunk: Overlapping roundtrips with computational activity Find next chunk with quadratic probing to prevent clustering Defining query-chunk(i, d) Obtains the i-th chunk, starting from bucket b h(d) Returns a handle s Wytse Oortwijn (Formal Methods and Tools, AUniversity Distributed of Twente) Hash Table for Shared Memory August 31, / 20

37 Linear Probing: Hiding Latency Contribution: Asynchronous queries Before chunk iteration, first request the next chunk: Overlapping roundtrips with computational activity Find next chunk with quadratic probing to prevent clustering Defining query-chunk(i, d) Obtains the i-th chunk, starting from bucket b h(d) Returns a handle s Defining sync-chunk(s) Takes a handle s as parameter, waits until the corresponding query has been completed. ytse Oortwijn (Formal Methods and Tools, AUniversity Distributed of Twente) Hash Table for Shared Memory August 31, / 20

38 Linear Probing: Querying Visualized Initiator RDMA device of target Main memory of target Wytse Oortwijn (Formal Methods and Tools, AUniversity Distributed of Twente) Hash Table for Shared Memory August 31, / 20

39 Linear Probing: Querying Visualized Initiator RDMA device of target Main memory of target s 0 = query-chunk(0, d) Wytse Oortwijn (Formal Methods and Tools, AUniversity Distributed of Twente) Hash Table for Shared Memory August 31, / 20

40 Linear Probing: Querying Visualized Initiator RDMA device of target Main memory of target s 0 = query-chunk(0, d) s 1 = query-chunk(1, d) Wytse Oortwijn (Formal Methods and Tools, AUniversity Distributed of Twente) Hash Table for Shared Memory August 31, / 20

41 Linear Probing: Querying Visualized Initiator RDMA device of target Main memory of target s 0 = query-chunk(0, d) s 1 = query-chunk(1, d) sync-chunk(s 0) Wytse Oortwijn (Formal Methods and Tools, AUniversity Distributed of Twente) Hash Table for Shared Memory August 31, / 20

42 Linear Probing: Querying Visualized Initiator RDMA device of target Main memory of target s 0 = query-chunk(0, d) s 1 = query-chunk(1, d) sync-chunk(s 0) Wytse Oortwijn (Formal Methods and Tools, AUniversity Distributed of Twente) Hash Table for Shared Memory August 31, / 20

43 Linear Probing: Querying Visualized Initiator RDMA device of target Main memory of target s 0 = query-chunk(0, d) s 1 = query-chunk(1, d) sync-chunk(s 0) Wytse Oortwijn (Formal Methods and Tools, AUniversity Distributed of Twente) Hash Table for Shared Memory August 31, / 20

44 Linear Probing: Querying Visualized Initiator RDMA device of target Main memory of target s 0 = query-chunk(0, d) s 1 = query-chunk(1, d) sync-chunk(s 0) Wytse Oortwijn (Formal Methods and Tools, AUniversity Distributed of Twente) Hash Table for Shared Memory August 31, / 20

45 Linear Probing: Querying Visualized Initiator RDMA device of target Main memory of target s 0 = query-chunk(0, d) s 1 = query-chunk(1, d) sync-chunk(s 0) s 2 = query-chunk(2, d) Wytse Oortwijn (Formal Methods and Tools, AUniversity Distributed of Twente) Hash Table for Shared Memory August 31, / 20

46 Linear Probing: Querying Visualized Initiator RDMA device of target Main memory of target s 0 = query-chunk(0, d) s 1 = query-chunk(1, d) sync-chunk(s 0) s 2 = query-chunk(2, d) sync-chunk(s 1) Wytse Oortwijn (Formal Methods and Tools, AUniversity Distributed of Twente) Hash Table for Shared Memory August 31, / 20

47 Linear Probing: Querying Visualized Initiator RDMA device of target Main memory of target s 0 = query-chunk(0, d) s 1 = query-chunk(1, d) sync-chunk(s 0) s 2 = query-chunk(2, d) sync-chunk(s 1) Wytse Oortwijn (Formal Methods and Tools, AUniversity Distributed of Twente) Hash Table for Shared Memory August 31, / 20

48 Linear Probing: Querying Visualized Initiator RDMA device of target Main memory of target s 0 = query-chunk(0, d) s 1 = query-chunk(1, d) sync-chunk(s 0) s 2 = query-chunk(2, d) sync-chunk(s 1) s 3 = query-chunk(3, d) sync-chunk(s 2) s 4 = query-chunk(4, d) sync-chunk(s 3)... Wytse Oortwijn (Formal Methods and Tools, AUniversity Distributed of Twente) Hash Table for Shared Memory August 31, / 20

49 Table of Contents 1 Introduction 2 Contribution 1: Resolving Hash Collisions 3 Contribution 2: Hiding Latency 4 Experimental Evaluation 5 Conclusion Wytse Oortwijn (Formal Methods and Tools, AUniversity Distributed of Twente) Hash Table for Shared Memory August 31, / 20

50 Hash Table: Evaluation Experimental Setup All experiments have been perfomed on the DAS-5 cluster: 66 machines 16 cores each (Intel E5-2630v3) 64 GB internal memory each connected via 48Gb/s Infiniband Wytse Oortwijn (Formal Methods and Tools, AUniversity Distributed of Twente) Hash Table for Shared Memory August 31, / 20

51 Hash Table: Evaluation Experimental Setup All experiments have been perfomed on the DAS-5 cluster: 66 machines 16 cores each (Intel E5-2630v3) 64 GB internal memory each connected via 48Gb/s Infiniband Benchmarks Under different workloads, we measured: Throughput of find-or-put Latency of find-or-put Roundtrips of find-or-put Wytse Oortwijn (Formal Methods and Tools, AUniversity Distributed of Twente) Hash Table for Shared Memory August 31, / 20

52 Hash Table: Throughput Total Throughput Speedup ytse Oortwijn (Formal Methods and Tools, AUniversity Distributed of Twente) Hash Table for Shared Memory August 31, / 20

53 Hash Table: Throughput Total Throughput Speedup Observations Throughputs up to reached (66 machines) Remote speedup up to 110 obtained Local throughput of reached (1 threads) ytse Oortwijn (Formal Methods and Tools, AUniversity Distributed of Twente) Hash Table for Shared Memory August 31, / 20

54 Hash Table: Latency Local latency latency (ns) C = 8 C = 16 C = 32 C = 64 C = load-factor (α) Wytse Oortwijn (Formal Methods and Tools, AUniversity Distributed of Twente) Hash Table for Shared Memory August 31, / 20

55 Hash Table: Latency Local latency latency (ns) C = 8 C = 16 C = 32 C = 64 C = load-factor (α) Remote latency latency (ns) 4,400 4,200 4,000 3,800 3, latency (ns) C = 8 C = 16 C = 32 C = 64 C = 128 load-factor (α) load-factor (α) Wytse Oortwijn (Formal Methods and Tools, AUniversity Distributed of Twente) Hash Table for Shared Memory August 31, / 20

56 Table of Contents 1 Introduction 2 Contribution 1: Resolving Hash Collisions 3 Contribution 2: Hiding Latency 4 Experimental Evaluation 5 Conclusion Wytse Oortwijn (Formal Methods and Tools, AUniversity Distributed of Twente) Hash Table for Shared Memory August 31, / 20

57 Conclusions General Minimizing roundtrips increases performance Overlapping queries reduces waiting-times and decreases latency Linear probing requires less roundtrips than Hopscotch and Cuckoo Wytse Oortwijn (Formal Methods and Tools, AUniversity Distributed of Twente) Hash Table for Shared Memory August 31, / 20

58 Conclusions General Minimizing roundtrips increases performance Overlapping queries reduces waiting-times and decreases latency Linear probing requires less roundtrips than Hopscotch and Cuckoo Performance find-or-put takes 4.5µs on average with α = 0.9 and C = 64 Peak-throughput of op/s obtained Wytse Oortwijn (Formal Methods and Tools, AUniversity Distributed of Twente) Hash Table for Shared Memory August 31, / 20

59 Conclusions General Minimizing roundtrips increases performance Overlapping queries reduces waiting-times and decreases latency Linear probing requires less roundtrips than Hopscotch and Cuckoo Performance find-or-put takes 4.5µs on average with α = 0.9 and C = 64 Peak-throughput of op/s obtained Performance Indication FaRM: Inserts take 35µs Pilaf: Operations take 30µs Nessie: Inserts take 25µs Wytse Oortwijn (Formal Methods and Tools, AUniversity Distributed of Twente) Hash Table for Shared Memory August 31, / 20

Distributed Binary Decision Diagrams for Symbolic Reachability

Distributed Binary Decision Diagrams for Symbolic Reachability Distributed Binary Decision Diagrams for Symbolic Reachability Wytse Oortwijn Formal Methods and Tools, University of Twente November 1, 2015 Wytse Oortwijn (Formal Methods and Tools, Distributed University

More information

A Distributed Hash Table for Shared Memory

A Distributed Hash Table for Shared Memory A Distributed Hash Table for Shared Memory Wytse Oortwijn (B), Tom van Dijk, and Jaco van de Pol Formal Methods and Tools, Department of EEMCS, University of Twente, P.O.-box 217, 7500 AE Enschede, The

More information

FaRM: Fast Remote Memory

FaRM: Fast Remote Memory FaRM: Fast Remote Memory Aleksandar Dragojević, Dushyanth Narayanan, Orion Hodson, Miguel Castro Microsoft Research Abstract We describe the design and implementation of FaRM, a new main memory distributed

More information

FaRM: Fast Remote Memory

FaRM: Fast Remote Memory FaRM: Fast Remote Memory Problem Context DRAM prices have decreased significantly Cost effective to build commodity servers w/hundreds of GBs E.g. - cluster with 100 machines can hold tens of TBs of main

More information

High-Performance Key-Value Store on OpenSHMEM

High-Performance Key-Value Store on OpenSHMEM High-Performance Key-Value Store on OpenSHMEM Huansong Fu*, Manjunath Gorentla Venkata, Ahana Roy Choudhury*, Neena Imam, Weikuan Yu* *Florida State University Oak Ridge National Laboratory Outline Background

More information

Near Memory Key/Value Lookup Acceleration MemSys 2017

Near Memory Key/Value Lookup Acceleration MemSys 2017 Near Key/Value Lookup Acceleration MemSys 2017 October 3, 2017 Scott Lloyd, Maya Gokhale Center for Applied Scientific Computing This work was performed under the auspices of the U.S. Department of Energy

More information

MemC3: MemCache with CLOCK and Concurrent Cuckoo Hashing

MemC3: MemCache with CLOCK and Concurrent Cuckoo Hashing MemC3: MemCache with CLOCK and Concurrent Cuckoo Hashing Bin Fan (CMU), Dave Andersen (CMU), Michael Kaminsky (Intel Labs) NSDI 2013 http://www.pdl.cmu.edu/ 1 Goal: Improve Memcached 1. Reduce space overhead

More information

S. Narravula, P. Balaji, K. Vaidyanathan, H.-W. Jin and D. K. Panda. The Ohio State University

S. Narravula, P. Balaji, K. Vaidyanathan, H.-W. Jin and D. K. Panda. The Ohio State University Architecture for Caching Responses with Multiple Dynamic Dependencies in Multi-Tier Data- Centers over InfiniBand S. Narravula, P. Balaji, K. Vaidyanathan, H.-W. Jin and D. K. Panda The Ohio State University

More information

2008 International ANSYS Conference

2008 International ANSYS Conference 2008 International ANSYS Conference Maximizing Productivity With InfiniBand-Based Clusters Gilad Shainer Director of Technical Marketing Mellanox Technologies 2008 ANSYS, Inc. All rights reserved. 1 ANSYS,

More information

LITE Kernel RDMA. Support for Datacenter Applications. Shin-Yeh Tsai, Yiying Zhang

LITE Kernel RDMA. Support for Datacenter Applications. Shin-Yeh Tsai, Yiying Zhang LITE Kernel RDMA Support for Datacenter Applications Shin-Yeh Tsai, Yiying Zhang Time 2 Berkeley Socket Userspace Kernel Hardware Time 1983 2 Berkeley Socket TCP Offload engine Arrakis & mtcp IX Userspace

More information

Performance Optimizations via Connect-IB and Dynamically Connected Transport Service for Maximum Performance on LS-DYNA

Performance Optimizations via Connect-IB and Dynamically Connected Transport Service for Maximum Performance on LS-DYNA Performance Optimizations via Connect-IB and Dynamically Connected Transport Service for Maximum Performance on LS-DYNA Pak Lui, Gilad Shainer, Brian Klaff Mellanox Technologies Abstract From concept to

More information

Introduction. hashing performs basic operations, such as insertion, better than other ADTs we ve seen so far

Introduction. hashing performs basic operations, such as insertion, better than other ADTs we ve seen so far Chapter 5 Hashing 2 Introduction hashing performs basic operations, such as insertion, deletion, and finds in average time better than other ADTs we ve seen so far 3 Hashing a hash table is merely an hashing

More information

The NE010 iwarp Adapter

The NE010 iwarp Adapter The NE010 iwarp Adapter Gary Montry Senior Scientist +1-512-493-3241 GMontry@NetEffect.com Today s Data Center Users Applications networking adapter LAN Ethernet NAS block storage clustering adapter adapter

More information

Heckaton. SQL Server's Memory Optimized OLTP Engine

Heckaton. SQL Server's Memory Optimized OLTP Engine Heckaton SQL Server's Memory Optimized OLTP Engine Agenda Introduction to Hekaton Design Consideration High Level Architecture Storage and Indexing Query Processing Transaction Management Transaction Durability

More information

Using RDMA Efficiently for Key-Value Services

Using RDMA Efficiently for Key-Value Services Using RDMA Efficiently for Key-Value Services Anuj Kalia Michael Kaminsky David G. Andersen Carnegie Mellon University Intel Labs {akalia,dga}@cs.cmu.edu michael.e.kaminsky@intel.com ABSTRACT This paper

More information

Using RDMA Efficiently for Key-Value Services

Using RDMA Efficiently for Key-Value Services Using RDMA Efficiently for Key-Value Services Anuj Kalia, Michael Kaminsky, David G. Andersen Carnegie Mellon University, Intel Labs CMU-PDL-14-16 June 214 Parallel Data Laboratory Carnegie Mellon University

More information

Chapter 5 Hashing. Introduction. Hashing. Hashing Functions. hashing performs basic operations, such as insertion,

Chapter 5 Hashing. Introduction. Hashing. Hashing Functions. hashing performs basic operations, such as insertion, Introduction Chapter 5 Hashing hashing performs basic operations, such as insertion, deletion, and finds in average time 2 Hashing a hash table is merely an of some fixed size hashing converts into locations

More information

I/O Buffering and Streaming

I/O Buffering and Streaming I/O Buffering and Streaming I/O Buffering and Caching I/O accesses are reads or writes (e.g., to files) Application access is arbitary (offset, len) Convert accesses to read/write of fixed-size blocks

More information

Introducing the Cray XMT. Petr Konecny May 4 th 2007

Introducing the Cray XMT. Petr Konecny May 4 th 2007 Introducing the Cray XMT Petr Konecny May 4 th 2007 Agenda Origins of the Cray XMT Cray XMT system architecture Cray XT infrastructure Cray Threadstorm processor Shared memory programming model Benefits/drawbacks/solutions

More information

CIS 601 Graduate Seminar. Dr. Sunnie S. Chung Dhruv Patel ( ) Kalpesh Sharma ( )

CIS 601 Graduate Seminar. Dr. Sunnie S. Chung Dhruv Patel ( ) Kalpesh Sharma ( ) Guide: CIS 601 Graduate Seminar Presented By: Dr. Sunnie S. Chung Dhruv Patel (2652790) Kalpesh Sharma (2660576) Introduction Background Parallel Data Warehouse (PDW) Hive MongoDB Client-side Shared SQL

More information

SR-IOV Support for Virtualization on InfiniBand Clusters: Early Experience

SR-IOV Support for Virtualization on InfiniBand Clusters: Early Experience SR-IOV Support for Virtualization on InfiniBand Clusters: Early Experience Jithin Jose, Mingzhe Li, Xiaoyi Lu, Krishna Kandalla, Mark Arnold and Dhabaleswar K. (DK) Panda Network-Based Computing Laboratory

More information

FaSST: Fast, Scalable, and Simple Distributed Transactions with Two-Sided (RDMA) Datagram RPCs

FaSST: Fast, Scalable, and Simple Distributed Transactions with Two-Sided (RDMA) Datagram RPCs FaSST: Fast, Scalable, and Simple Distributed Transactions with Two-Sided (RDMA) Datagram RPCs Anuj Kalia (CMU), Michael Kaminsky (Intel Labs), David Andersen (CMU) RDMA RDMA is a network feature that

More information

Flexible Architecture Research Machine (FARM)

Flexible Architecture Research Machine (FARM) Flexible Architecture Research Machine (FARM) RAMP Retreat June 25, 2009 Jared Casper, Tayo Oguntebi, Sungpack Hong, Nathan Bronson Christos Kozyrakis, Kunle Olukotun Motivation Why CPUs + FPGAs make sense

More information

Designing Next-Generation Data- Centers with Advanced Communication Protocols and Systems Services. Presented by: Jitong Chen

Designing Next-Generation Data- Centers with Advanced Communication Protocols and Systems Services. Presented by: Jitong Chen Designing Next-Generation Data- Centers with Advanced Communication Protocols and Systems Services Presented by: Jitong Chen Outline Architecture of Web-based Data Center Three-Stage framework to benefit

More information

Farewell to Servers: Hardware, Software, and Network Approaches towards Datacenter Resource Disaggregation

Farewell to Servers: Hardware, Software, and Network Approaches towards Datacenter Resource Disaggregation Farewell to Servers: Hardware, Software, and Network Approaches towards Datacenter Resource Disaggregation Yiying Zhang Datacenter 3 Monolithic Computer OS / Hypervisor 4 Can monolithic Application Hardware

More information

Portable SHMEMCache: A High-Performance Key-Value Store on OpenSHMEM and MPI

Portable SHMEMCache: A High-Performance Key-Value Store on OpenSHMEM and MPI Portable SHMEMCache: A High-Performance Key-Value Store on OpenSHMEM and MPI Huansong Fu*, Manjunath Gorentla Venkata, Neena Imam, Weikuan Yu* *Florida State University Oak Ridge National Laboratory Outline

More information

Birds of a Feather Presentation

Birds of a Feather Presentation Mellanox InfiniBand QDR 4Gb/s The Fabric of Choice for High Performance Computing Gilad Shainer, shainer@mellanox.com June 28 Birds of a Feather Presentation InfiniBand Technology Leadership Industry Standard

More information

Advanced Computer Networks. End Host Optimization

Advanced Computer Networks. End Host Optimization Oriana Riva, Department of Computer Science ETH Zürich 263 3501 00 End Host Optimization Patrick Stuedi Spring Semester 2017 1 Today End-host optimizations: NUMA-aware networking Kernel-bypass Remote Direct

More information

When MPPDB Meets GPU:

When MPPDB Meets GPU: When MPPDB Meets GPU: An Extendible Framework for Acceleration Laura Chen, Le Cai, Yongyan Wang Background: Heterogeneous Computing Hardware Trend stops growing with Moore s Law Fast development of GPU

More information

Single-Points of Performance

Single-Points of Performance Single-Points of Performance Mellanox Technologies Inc. 29 Stender Way, Santa Clara, CA 9554 Tel: 48-97-34 Fax: 48-97-343 http://www.mellanox.com High-performance computations are rapidly becoming a critical

More information

Tailwind: Fast and Atomic RDMA-based Replication. Yacine Taleb, Ryan Stutsman, Gabriel Antoniu, Toni Cortes

Tailwind: Fast and Atomic RDMA-based Replication. Yacine Taleb, Ryan Stutsman, Gabriel Antoniu, Toni Cortes Tailwind: Fast and Atomic RDMA-based Replication Yacine Taleb, Ryan Stutsman, Gabriel Antoniu, Toni Cortes In-Memory Key-Value Stores General purpose in-memory key-value stores are widely used nowadays

More information

Introduction to parallel Computing

Introduction to parallel Computing Introduction to parallel Computing VI-SEEM Training Paschalis Paschalis Korosoglou Korosoglou (pkoro@.gr) (pkoro@.gr) Outline Serial vs Parallel programming Hardware trends Why HPC matters HPC Concepts

More information

RDMA Read Based Rendezvous Protocol for MPI over InfiniBand: Design Alternatives and Benefits

RDMA Read Based Rendezvous Protocol for MPI over InfiniBand: Design Alternatives and Benefits RDMA Read Based Rendezvous Protocol for MPI over InfiniBand: Design Alternatives and Benefits Sayantan Sur Hyun-Wook Jin Lei Chai D. K. Panda Network Based Computing Lab, The Ohio State University Presentation

More information

IsoStack Highly Efficient Network Processing on Dedicated Cores

IsoStack Highly Efficient Network Processing on Dedicated Cores IsoStack Highly Efficient Network Processing on Dedicated Cores Leah Shalev Eran Borovik, Julian Satran, Muli Ben-Yehuda Outline Motivation IsoStack architecture Prototype TCP/IP over 10GE on a single

More information

Chapter 6. Storage and Other I/O Topics

Chapter 6. Storage and Other I/O Topics Chapter 6 Storage and Other I/O Topics Introduction I/O devices can be characterized by Behaviour: input, output, storage Partner: human or machine Data rate: bytes/sec, transfers/sec I/O bus connections

More information

Module 3: Hashing Lecture 9: Static and Dynamic Hashing. The Lecture Contains: Static hashing. Hashing. Dynamic hashing. Extendible hashing.

Module 3: Hashing Lecture 9: Static and Dynamic Hashing. The Lecture Contains: Static hashing. Hashing. Dynamic hashing. Extendible hashing. The Lecture Contains: Hashing Dynamic hashing Extendible hashing Insertion file:///c /Documents%20and%20Settings/iitkrana1/My%20Documents/Google%20Talk%20Received%20Files/ist_data/lecture9/9_1.htm[6/14/2012

More information

Data Processing on Emerging Hardware

Data Processing on Emerging Hardware Data Processing on Emerging Hardware Gustavo Alonso Systems Group Department of Computer Science ETH Zurich, Switzerland 3 rd International Summer School on Big Data, Munich, Germany, 2017 www.systems.ethz.ch

More information

Oncilla - a Managed GAS Runtime for Accelerating Data Warehousing Queries

Oncilla - a Managed GAS Runtime for Accelerating Data Warehousing Queries Oncilla - a Managed GAS Runtime for Accelerating Data Warehousing Queries Jeffrey Young, Alex Merritt, Se Hoon Shon Advisor: Sudhakar Yalamanchili 4/16/13 Sponsors: Intel, NVIDIA, NSF 2 The Problem Big

More information

Big data, little time. Scale-out data serving. Scale-out data serving. Highly skewed key popularity

Big data, little time. Scale-out data serving. Scale-out data serving. Highly skewed key popularity /7/6 Big data, little time Goal is to keep (hot) data in memory Requires scale-out approach Each server responsible for one chunk Fast access to local data The Case for RackOut Scalable Data Serving Using

More information

High-Performance and Scalable Non-Blocking All-to-All with Collective Offload on InfiniBand Clusters: A study with Parallel 3DFFT

High-Performance and Scalable Non-Blocking All-to-All with Collective Offload on InfiniBand Clusters: A study with Parallel 3DFFT High-Performance and Scalable Non-Blocking All-to-All with Collective Offload on InfiniBand Clusters: A study with Parallel 3DFFT Krishna Kandalla (1), Hari Subramoni (1), Karen Tomko (2), Dmitry Pekurovsky

More information

Memory Management Strategies for Data Serving with RDMA

Memory Management Strategies for Data Serving with RDMA Memory Management Strategies for Data Serving with RDMA Dennis Dalessandro and Pete Wyckoff (presenting) Ohio Supercomputer Center {dennis,pw}@osc.edu HotI'07 23 August 2007 Motivation Increasing demands

More information

High Performance Computing on GPUs using NVIDIA CUDA

High Performance Computing on GPUs using NVIDIA CUDA High Performance Computing on GPUs using NVIDIA CUDA Slides include some material from GPGPU tutorial at SIGGRAPH2007: http://www.gpgpu.org/s2007 1 Outline Motivation Stream programming Simplified HW and

More information

Memcached Design on High Performance RDMA Capable Interconnects

Memcached Design on High Performance RDMA Capable Interconnects Memcached Design on High Performance RDMA Capable Interconnects Jithin Jose, Hari Subramoni, Miao Luo, Minjia Zhang, Jian Huang, Md. Wasi- ur- Rahman, Nusrat S. Islam, Xiangyong Ouyang, Hao Wang, Sayantan

More information

Hash Joins for Multi-core CPUs. Benjamin Wagner

Hash Joins for Multi-core CPUs. Benjamin Wagner Hash Joins for Multi-core CPUs Benjamin Wagner Joins fundamental operator in query processing variety of different algorithms many papers publishing different results main question: is tuning to modern

More information

Sharing High-Performance Devices Across Multiple Virtual Machines

Sharing High-Performance Devices Across Multiple Virtual Machines Sharing High-Performance Devices Across Multiple Virtual Machines Preamble What does sharing devices across multiple virtual machines in our title mean? How is it different from virtual networking / NSX,

More information

Accelerating Real-Time Big Data. Breaking the limitations of captive NVMe storage

Accelerating Real-Time Big Data. Breaking the limitations of captive NVMe storage Accelerating Real-Time Big Data Breaking the limitations of captive NVMe storage 18M IOPs in 2u Agenda Everything related to storage is changing! The 3rd Platform NVM Express architected for solid state

More information

Intel Enterprise Edition Lustre (IEEL-2.3) [DNE-1 enabled] on Dell MD Storage

Intel Enterprise Edition Lustre (IEEL-2.3) [DNE-1 enabled] on Dell MD Storage Intel Enterprise Edition Lustre (IEEL-2.3) [DNE-1 enabled] on Dell MD Storage Evaluation of Lustre File System software enhancements for improved Metadata performance Wojciech Turek, Paul Calleja,John

More information

Multifunction Networking Adapters

Multifunction Networking Adapters Ethernet s Extreme Makeover: Multifunction Networking Adapters Chuck Hudson Manager, ProLiant Networking Technology Hewlett-Packard 2004 Hewlett-Packard Development Company, L.P. The information contained

More information

Future Routing Schemes in Petascale clusters

Future Routing Schemes in Petascale clusters Future Routing Schemes in Petascale clusters Gilad Shainer, Mellanox, USA Ola Torudbakken, Sun Microsystems, Norway Richard Graham, Oak Ridge National Laboratory, USA Birds of a Feather Presentation Abstract

More information

Efficient and Truly Passive MPI-3 RMA Synchronization Using InfiniBand Atomics

Efficient and Truly Passive MPI-3 RMA Synchronization Using InfiniBand Atomics 1 Efficient and Truly Passive MPI-3 RMA Synchronization Using InfiniBand Atomics Mingzhe Li Sreeram Potluri Khaled Hamidouche Jithin Jose Dhabaleswar K. Panda Network-Based Computing Laboratory Department

More information

The Use of Cloud Computing Resources in an HPC Environment

The Use of Cloud Computing Resources in an HPC Environment The Use of Cloud Computing Resources in an HPC Environment Bill, Labate, UCLA Office of Information Technology Prakashan Korambath, UCLA Institute for Digital Research & Education Cloud computing becomes

More information

Exploiting InfiniBand and GPUDirect Technology for High Performance Collectives on GPU Clusters

Exploiting InfiniBand and GPUDirect Technology for High Performance Collectives on GPU Clusters Exploiting InfiniBand and Direct Technology for High Performance Collectives on Clusters Ching-Hsiang Chu chu.368@osu.edu Department of Computer Science and Engineering The Ohio State University OSU Booth

More information

Indexing in RAMCloud. Ankita Kejriwal, Ashish Gupta, Arjun Gopalan, John Ousterhout. Stanford University

Indexing in RAMCloud. Ankita Kejriwal, Ashish Gupta, Arjun Gopalan, John Ousterhout. Stanford University Indexing in RAMCloud Ankita Kejriwal, Ashish Gupta, Arjun Gopalan, John Ousterhout Stanford University RAMCloud 1.0 Introduction Higher-level data models Without sacrificing latency and scalability Secondary

More information

LS-DYNA Best-Practices: Networking, MPI and Parallel File System Effect on LS-DYNA Performance

LS-DYNA Best-Practices: Networking, MPI and Parallel File System Effect on LS-DYNA Performance 11 th International LS-DYNA Users Conference Computing Technology LS-DYNA Best-Practices: Networking, MPI and Parallel File System Effect on LS-DYNA Performance Gilad Shainer 1, Tong Liu 2, Jeff Layton

More information

Deconstructing RDMA-enabled Distributed Transaction Processing: Hybrid is Better!

Deconstructing RDMA-enabled Distributed Transaction Processing: Hybrid is Better! Deconstructing RDMA-enabled Distributed Transaction Processing: Hybrid is Better! Xingda Wei, Zhiyuan Dong, Rong Chen, Haibo Chen Institute of Parallel and Distributed Systems (IPADS) Shanghai Jiao Tong

More information

Lessons from Post-processing Climate Data on Modern Flash-based HPC Systems

Lessons from Post-processing Climate Data on Modern Flash-based HPC Systems Lessons from Post-processing Climate Data on Modern Flash-based HPC Systems Adnan Haider 1, Sheri Mickelson 2, John Dennis 2 1 Illinois Institute of Technology, USA; 2 National Center of Atmospheric Research,

More information

Networking at the Speed of Light

Networking at the Speed of Light Networking at the Speed of Light Dror Goldenberg VP Software Architecture MaRS Workshop April 2017 Cloud The Software Defined Data Center Resource virtualization Efficient services VM, Containers uservices

More information

Heterogeneous platforms

Heterogeneous platforms Heterogeneous platforms Systems combining main processors and accelerators e.g., CPU + GPU, CPU + Intel MIC, AMD APU, ARM SoC Any platform using a GPU is a heterogeneous platform! Further in this talk

More information

Write a technical report Present your results Write a workshop/conference paper (optional) Could be a real system, simulation and/or theoretical

Write a technical report Present your results Write a workshop/conference paper (optional) Could be a real system, simulation and/or theoretical Identify a problem Review approaches to the problem Propose a novel approach to the problem Define, design, prototype an implementation to evaluate your approach Could be a real system, simulation and/or

More information

Designing High Performance DSM Systems using InfiniBand Features

Designing High Performance DSM Systems using InfiniBand Features Designing High Performance DSM Systems using InfiniBand Features Ranjit Noronha and Dhabaleswar K. Panda The Ohio State University NBC Outline Introduction Motivation Design and Implementation Results

More information

Deconstructing RDMA-enabled Distributed Transactions: Hybrid is Better!

Deconstructing RDMA-enabled Distributed Transactions: Hybrid is Better! Deconstructing RDMA-enabled Distributed Transactions: Hybrid is Better! Xingda Wei, Zhiyuan Dong, Rong Chen, Haibo Chen Institute of Parallel and Distributed Systems, Shanghai Jiao Tong University Contacts:

More information

Multi-Threaded UPC Runtime for GPU to GPU communication over InfiniBand

Multi-Threaded UPC Runtime for GPU to GPU communication over InfiniBand Multi-Threaded UPC Runtime for GPU to GPU communication over InfiniBand Miao Luo, Hao Wang, & D. K. Panda Network- Based Compu2ng Laboratory Department of Computer Science and Engineering The Ohio State

More information

MDHIM: A Parallel Key/Value Store Framework for HPC

MDHIM: A Parallel Key/Value Store Framework for HPC MDHIM: A Parallel Key/Value Store Framework for HPC Hugh Greenberg 7/6/2015 LA-UR-15-25039 HPC Clusters Managed by a job scheduler (e.g., Slurm, Moab) Designed for running user jobs Difficult to run system

More information

@ Massachusetts Institute of Technology All rights reserved.

@ Massachusetts Institute of Technology All rights reserved. ..... CPHASH: A Cache-Partitioned Hash Table with LRU Eviction by Zviad Metreveli MASSACHUSETTS INSTITUTE OF TECHAC'LOGY JUN 2 1 2011 LIBRARIES Submitted to the Department of Electrical Engineering and

More information

Adapted from: TRENDS AND ATTRIBUTES OF HORIZONTAL AND VERTICAL COMPUTING ARCHITECTURES

Adapted from: TRENDS AND ATTRIBUTES OF HORIZONTAL AND VERTICAL COMPUTING ARCHITECTURES Adapted from: TRENDS AND ATTRIBUTES OF HORIZONTAL AND VERTICAL COMPUTING ARCHITECTURES Tom Atwood Business Development Manager Sun Microsystems, Inc. Takeaways Understand the technical differences between

More information

Infiniswap. Efficient Memory Disaggregation. Mosharaf Chowdhury. with Juncheng Gu, Youngmoon Lee, Yiwen Zhang, and Kang G. Shin

Infiniswap. Efficient Memory Disaggregation. Mosharaf Chowdhury. with Juncheng Gu, Youngmoon Lee, Yiwen Zhang, and Kang G. Shin Infiniswap Efficient Memory Disaggregation Mosharaf Chowdhury with Juncheng Gu, Youngmoon Lee, Yiwen Zhang, and Kang G. Shin Rack-Scale Computing Datacenter-Scale Computing Geo-Distributed Computing Coflow

More information

Optimizing non-blocking Collective Operations for InfiniBand

Optimizing non-blocking Collective Operations for InfiniBand Optimizing non-blocking Collective Operations for InfiniBand Open Systems Lab Indiana University Bloomington, USA IPDPS 08 - CAC 08 Workshop Miami, FL, USA April, 14th 2008 Introduction Non-blocking collective

More information

Designing Next Generation Data-Centers with Advanced Communication Protocols and Systems Services

Designing Next Generation Data-Centers with Advanced Communication Protocols and Systems Services Designing Next Generation Data-Centers with Advanced Communication Protocols and Systems Services P. Balaji, K. Vaidyanathan, S. Narravula, H. W. Jin and D. K. Panda Network Based Computing Laboratory

More information

Latency-Tolerant Software Distributed Shared Memory

Latency-Tolerant Software Distributed Shared Memory Latency-Tolerant Software Distributed Shared Memory Jacob Nelson, Brandon Holt, Brandon Myers, Preston Briggs, Luis Ceze, Simon Kahan, Mark Oskin University of Washington USENIX ATC 2015 July 9, 2015 25

More information

ANALYZING CHARACTERISTICS OF PC CLUSTER CONSOLIDATED WITH IP-SAN USING DATA-INTENSIVE APPLICATIONS

ANALYZING CHARACTERISTICS OF PC CLUSTER CONSOLIDATED WITH IP-SAN USING DATA-INTENSIVE APPLICATIONS ANALYZING CHARACTERISTICS OF PC CLUSTER CONSOLIDATED WITH IP-SAN USING DATA-INTENSIVE APPLICATIONS Asuka Hara Graduate school of Humanities and Science Ochanomizu University 2-1-1, Otsuka, Bunkyo-ku, Tokyo,

More information

Improving Altibase Performance with Solarflare 10GbE Server Adapters and OpenOnload

Improving Altibase Performance with Solarflare 10GbE Server Adapters and OpenOnload Improving Altibase Performance with Solarflare 10GbE Server Adapters and OpenOnload Summary As today s corporations process more and more data, the business ramifications of faster and more resilient database

More information

Chapter 27 Hashing. Liang, Introduction to Java Programming, Eleventh Edition, (c) 2017 Pearson Education, Inc. All rights reserved.

Chapter 27 Hashing. Liang, Introduction to Java Programming, Eleventh Edition, (c) 2017 Pearson Education, Inc. All rights reserved. Chapter 27 Hashing 1 Objectives To know what hashing is for ( 27.3). To obtain the hash code for an object and design the hash function to map a key to an index ( 27.4). To handle collisions using open

More information

lecture23: Hash Tables

lecture23: Hash Tables lecture23: Largely based on slides by Cinda Heeren CS 225 UIUC 18th July, 2013 Announcements mp6.1 extra credit due tomorrow tonight (7/19) lab avl due tonight mp6 due Monday (7/22) Hash tables A hash

More information

Performance Analysis and Evaluation of Mellanox ConnectX InfiniBand Architecture with Multi-Core Platforms

Performance Analysis and Evaluation of Mellanox ConnectX InfiniBand Architecture with Multi-Core Platforms Performance Analysis and Evaluation of Mellanox ConnectX InfiniBand Architecture with Multi-Core Platforms Sayantan Sur, Matt Koop, Lei Chai Dhabaleswar K. Panda Network Based Computing Lab, The Ohio State

More information

Optimizing Out-of-Core Nearest Neighbor Problems on Multi-GPU Systems Using NVLink

Optimizing Out-of-Core Nearest Neighbor Problems on Multi-GPU Systems Using NVLink Optimizing Out-of-Core Nearest Neighbor Problems on Multi-GPU Systems Using NVLink Rajesh Bordawekar IBM T. J. Watson Research Center bordaw@us.ibm.com Pidad D Souza IBM Systems pidsouza@in.ibm.com 1 Outline

More information

Computer Organization and Structure. Bing-Yu Chen National Taiwan University

Computer Organization and Structure. Bing-Yu Chen National Taiwan University Computer Organization and Structure Bing-Yu Chen National Taiwan University Storage and Other I/O Topics I/O Performance Measures Types and Characteristics of I/O Devices Buses Interfacing I/O Devices

More information

Hash Tables. Gunnar Gotshalks. Maps 1

Hash Tables. Gunnar Gotshalks. Maps 1 Hash Tables Maps 1 Definition A hash table has the following components» An array called a table of size N» A mathematical function called a hash function that maps keys to valid array indices hash_function:

More information

Optimized Distributed Data Sharing Substrate in Multi-Core Commodity Clusters: A Comprehensive Study with Applications

Optimized Distributed Data Sharing Substrate in Multi-Core Commodity Clusters: A Comprehensive Study with Applications Optimized Distributed Data Sharing Substrate in Multi-Core Commodity Clusters: A Comprehensive Study with Applications K. Vaidyanathan, P. Lai, S. Narravula and D. K. Panda Network Based Computing Laboratory

More information

Communication has significant impact on application performance. Interconnection networks therefore have a vital role in cluster systems.

Communication has significant impact on application performance. Interconnection networks therefore have a vital role in cluster systems. Cluster Networks Introduction Communication has significant impact on application performance. Interconnection networks therefore have a vital role in cluster systems. As usual, the driver is performance

More information

High Performance Distributed Lock Management Services using Network-based Remote Atomic Operations

High Performance Distributed Lock Management Services using Network-based Remote Atomic Operations High Performance Distributed Lock Management Services using Network-based Remote Atomic Operations S. Narravula, A. Mamidala, A. Vishnu, K. Vaidyanathan, and D. K. Panda Presented by Lei Chai Network Based

More information

S WHAT THE PROFILER IS TELLING YOU: OPTIMIZING GPU KERNELS. Jakob Progsch, Mathias Wagner GTC 2018

S WHAT THE PROFILER IS TELLING YOU: OPTIMIZING GPU KERNELS. Jakob Progsch, Mathias Wagner GTC 2018 S8630 - WHAT THE PROFILER IS TELLING YOU: OPTIMIZING GPU KERNELS Jakob Progsch, Mathias Wagner GTC 2018 1. Know your hardware BEFORE YOU START What are the target machines, how many nodes? Machine-specific

More information

High Performance Packet Processing with FlexNIC

High Performance Packet Processing with FlexNIC High Performance Packet Processing with FlexNIC Antoine Kaufmann, Naveen Kr. Sharma Thomas Anderson, Arvind Krishnamurthy University of Washington Simon Peter The University of Texas at Austin Ethernet

More information

Accelerating Foreign-Key Joins using Asymmetric Memory Channels

Accelerating Foreign-Key Joins using Asymmetric Memory Channels Accelerating Foreign-Key Joins using Asymmetric Memory Channels Holger Pirk Stefan Manegold Martin Kersten holger@cwi.nl manegold@cwi.nl mk@cwi.nl Why? Trivia: Joins are important But: Many Joins are (Indexed)

More information

The End of a Myth: Distributed Transactions Can Scale

The End of a Myth: Distributed Transactions Can Scale The End of a Myth: Distributed Transactions Can Scale Erfan Zamanian, Carsten Binnig, Tim Harris, and Tim Kraska Rafael Ketsetsides Why distributed transactions? 2 Cheaper for the same processing power

More information

Scalable Concurrent Hash Tables via Relativistic Programming

Scalable Concurrent Hash Tables via Relativistic Programming Scalable Concurrent Hash Tables via Relativistic Programming Josh Triplett September 24, 2009 Speed of data < Speed of light Speed of light: 3e8 meters/second Processor speed: 3 GHz, 3e9 cycles/second

More information

Can Parallel Replication Benefit Hadoop Distributed File System for High Performance Interconnects?

Can Parallel Replication Benefit Hadoop Distributed File System for High Performance Interconnects? Can Parallel Replication Benefit Hadoop Distributed File System for High Performance Interconnects? N. S. Islam, X. Lu, M. W. Rahman, and D. K. Panda Network- Based Compu2ng Laboratory Department of Computer

More information

Performance COE 403. Computer Architecture Prof. Muhamed Mudawar. Computer Engineering Department King Fahd University of Petroleum and Minerals

Performance COE 403. Computer Architecture Prof. Muhamed Mudawar. Computer Engineering Department King Fahd University of Petroleum and Minerals Performance COE 403 Computer Architecture Prof. Muhamed Mudawar Computer Engineering Department King Fahd University of Petroleum and Minerals What is Performance? How do we measure the performance of

More information

Input/Output Introduction

Input/Output Introduction Input/Output 1 Introduction Motivation Performance metrics Processor interface issues Buses 2 Page 1 Motivation CPU Performance: 60% per year I/O system performance limited by mechanical delays (e.g.,

More information

Introduction. Motivation Performance metrics Processor interface issues Buses

Introduction. Motivation Performance metrics Processor interface issues Buses Input/Output 1 Introduction Motivation Performance metrics Processor interface issues Buses 2 Motivation CPU Performance: 60% per year I/O system performance limited by mechanical delays (e.g., disk I/O)

More information

Storage. Hwansoo Han

Storage. Hwansoo Han Storage Hwansoo Han I/O Devices I/O devices can be characterized by Behavior: input, out, storage Partner: human or machine Data rate: bytes/sec, transfers/sec I/O bus connections 2 I/O System Characteristics

More information

Optimizing LS-DYNA Productivity in Cluster Environments

Optimizing LS-DYNA Productivity in Cluster Environments 10 th International LS-DYNA Users Conference Computing Technology Optimizing LS-DYNA Productivity in Cluster Environments Gilad Shainer and Swati Kher Mellanox Technologies Abstract Increasing demand for

More information

KV-Direct: High-Performance In-Memory Key-Value Store with Programmable NIC

KV-Direct: High-Performance In-Memory Key-Value Store with Programmable NIC KV-Direct: High-Performance In-Memory Key-Value Store with Programmable NIC Bojie Li * Zhenyuan Ruan * Wencong Xiao Yuanwei Lu Yongqiang Xiong Andrew Putnam Enhong Chen Lintao Zhang Microsoft Research

More information

Performance Analysis of Memory Transfers and GEMM Subroutines on NVIDIA TESLA GPU Cluster

Performance Analysis of Memory Transfers and GEMM Subroutines on NVIDIA TESLA GPU Cluster Performance Analysis of Memory Transfers and GEMM Subroutines on NVIDIA TESLA GPU Cluster Veerendra Allada, Troy Benjegerdes Electrical and Computer Engineering, Ames Laboratory Iowa State University &

More information

Outline 1 Motivation 2 Theory of a non-blocking benchmark 3 The benchmark and results 4 Future work

Outline 1 Motivation 2 Theory of a non-blocking benchmark 3 The benchmark and results 4 Future work Using Non-blocking Operations in HPC to Reduce Execution Times David Buettner, Julian Kunkel, Thomas Ludwig Euro PVM/MPI September 8th, 2009 Outline 1 Motivation 2 Theory of a non-blocking benchmark 3

More information

TrafficDB: HERE s High Performance Shared-Memory Data Store Ricardo Fernandes, Piotr Zaczkowski, Bernd Göttler, Conor Ettinoffe, and Anis Moussa

TrafficDB: HERE s High Performance Shared-Memory Data Store Ricardo Fernandes, Piotr Zaczkowski, Bernd Göttler, Conor Ettinoffe, and Anis Moussa TrafficDB: HERE s High Performance Shared-Memory Data Store Ricardo Fernandes, Piotr Zaczkowski, Bernd Göttler, Conor Ettinoffe, and Anis Moussa EPL646: Advanced Topics in Databases Christos Hadjistyllis

More information

RAMCloud and the Low- Latency Datacenter. John Ousterhout Stanford University

RAMCloud and the Low- Latency Datacenter. John Ousterhout Stanford University RAMCloud and the Low- Latency Datacenter John Ousterhout Stanford University Most important driver for innovation in computer systems: Rise of the datacenter Phase 1: large scale Phase 2: low latency Introduction

More information

Rethinking Distributed Indexing for RDMA - Based Networks

Rethinking Distributed Indexing for RDMA - Based Networks CSCI 2980 Master s Project Report Rethinking Distributed Indexing for RDMA - Based Networks by Sumukha Tumkur Vani stumkurv@cs.brown.edu Under the guidance of Rodrigo Fonseca Carsten Binnig Submitted in

More information

GPU Performance Optimisation. Alan Gray EPCC The University of Edinburgh

GPU Performance Optimisation. Alan Gray EPCC The University of Edinburgh GPU Performance Optimisation EPCC The University of Edinburgh Hardware NVIDIA accelerated system: Memory Memory GPU vs CPU: Theoretical Peak capabilities NVIDIA Fermi AMD Magny-Cours (6172) Cores 448 (1.15GHz)

More information

Highly Scalable, Non-RDMA NVMe Fabric. Bob Hansen,, VP System Architecture

Highly Scalable, Non-RDMA NVMe Fabric. Bob Hansen,, VP System Architecture A Cost Effective,, High g Performance,, Highly Scalable, Non-RDMA NVMe Fabric Bob Hansen,, VP System Architecture bob@apeirondata.com Storage Developers Conference, September 2015 Agenda 3 rd Platform

More information

Voltaire Making Applications Run Faster

Voltaire Making Applications Run Faster Voltaire Making Applications Run Faster Asaf Somekh Director, Marketing Voltaire, Inc. Agenda HPC Trends InfiniBand Voltaire Grid Backbone Deployment examples About Voltaire HPC Trends Clusters are the

More information