A Distributed Hash Table for Shared Memory
|
|
- Myles Carson
- 6 years ago
- Views:
Transcription
1 A Distributed Hash Table for Shared Memory Wytse Oortwijn Formal Methods and Tools, University of Twente August 31, 2015 Wytse Oortwijn (Formal Methods and Tools, AUniversity Distributed of Twente) Hash Table for Shared Memory August 31, / 20
2 Table of Contents 1 Introduction 2 Contribution 1: Resolving Hash Collisions 3 Contribution 2: Hiding Latency 4 Experimental Evaluation 5 Conclusion Wytse Oortwijn (Formal Methods and Tools, AUniversity Distributed of Twente) Hash Table for Shared Memory August 31, / 20
3 Table of Contents 1 Introduction 2 Contribution 1: Resolving Hash Collisions 3 Contribution 2: Hiding Latency 4 Experimental Evaluation 5 Conclusion Wytse Oortwijn (Formal Methods and Tools, AUniversity Distributed of Twente) Hash Table for Shared Memory August 31, / 20
4 Distributed Hash Table Main challenge Building a fast and CPU-efficient shared hash table: Minimal latency Minimal memory overhead Not relying on CPU-polling ytse Oortwijn (Formal Methods and Tools, AUniversity Distributed of Twente) Hash Table for Shared Memory August 31, / 20
5 Distributed Hash Table Main challenge Building a fast and CPU-efficient shared hash table: Minimal latency Minimal memory overhead Not relying on CPU-polling Many use cases in HPC Parallel graph searching Distributed model checking ytse Oortwijn (Formal Methods and Tools, AUniversity Distributed of Twente) Hash Table for Shared Memory August 31, / 20
6 Distributed Hash Table Main challenge Building a fast and CPU-efficient shared hash table: Minimal latency Minimal memory overhead Not relying on CPU-polling Many use cases in HPC Parallel graph searching Distributed model checking Distributed (LAN) vs Parallel Cheaper scalability Unlimited scalability, but ytse Oortwijn (Formal Methods and Tools, AUniversity Distributed of Twente) Hash Table for Shared Memory August 31, / 20
7 Distributed Hash Table Main challenge Building a fast and CPU-efficient shared hash table: Minimal latency Minimal memory overhead Not relying on CPU-polling Many use cases in HPC Parallel graph searching Distributed model checking Distributed (LAN) vs Parallel Cheaper scalability Unlimited scalability, but Performance overhead! ytse Oortwijn (Formal Methods and Tools, AUniversity Distributed of Twente) Hash Table for Shared Memory August 31, / 20
8 Distributed Hash Table Main challenge Building a fast and CPU-efficient shared hash table: Minimal latency Minimal memory overhead Not relying on CPU-polling Many use cases in HPC Parallel graph searching Distributed model checking Distributed (LAN) vs Parallel Cheaper scalability Unlimited scalability, but Performance overhead! Efficient distributed processing Specialized algorithms and data structures needed! ytse Oortwijn (Formal Methods and Tools, AUniversity Distributed of Twente) Hash Table for Shared Memory August 31, / 20
9 Distributed Hash Table Main challenge Building a fast and CPU-efficient shared hash table: Minimal latency Minimal memory overhead Not relying on CPU-polling Many use cases in HPC Parallel graph searching Distributed model checking Distributed (LAN) vs Parallel Cheaper scalability Unlimited scalability, but Performance overhead! Efficient distributed processing Specialized algorithms and data structures needed! Contribution: Reducing roundtrips while CPU-efficient ytse Oortwijn (Formal Methods and Tools, AUniversity Distributed of Twente) Hash Table for Shared Memory August 31, / 20
10 High-performance Networking Infiniband hardware Specialized hardware used to construct high-performance networks: Comparable in price to Ethernet Supports bandwidths up to 100 Gb/s Direct access to memory via PCI-E bus Wytse Oortwijn (Formal Methods and Tools, AUniversity Distributed of Twente) Hash Table for Shared Memory August 31, / 20
11 High-performance Networking Infiniband hardware Specialized hardware used to construct high-performance networks: Comparable in price to Ethernet Supports bandwidths up to 100 Gb/s Direct access to memory via PCI-E bus RDMA: Remote Direct Memory Access Directly access to remote memory without invoking remote CPUs Zero-copy networking Kernel bypassing No participation from remote CPUs Wytse Oortwijn (Formal Methods and Tools, AUniversity Distributed of Twente) Hash Table for Shared Memory August 31, / 20
12 High-performance Networking Infiniband hardware Specialized hardware used to construct high-performance networks: Comparable in price to Ethernet Supports bandwidths up to 100 Gb/s Direct access to memory via PCI-E bus RDMA: Remote Direct Memory Access Directly access to remote memory without invoking remote CPUs Zero-copy networking Kernel bypassing No participation from remote CPUs Performance: one-sided RDMA vs TCP Roundtrips latency: < 3µs (Infiniband) vs 60µs (traditional Ethernet) Wytse Oortwijn (Formal Methods and Tools, AUniversity Distributed of Twente) Hash Table for Shared Memory August 31, / 20
13 Hash Table: Challenges Notation: Hash table T = b 0,..., b n 1 as a sequence of buckets b i, where: n the hash table size and m the number of used entries α = m n the load factor Wytse Oortwijn (Formal Methods and Tools, AUniversity Distributed of Twente) Hash Table for Shared Memory August 31, / 20
14 Hash Table: Challenges Notation: Hash table T = b 0,..., b n 1 as a sequence of buckets b i, where: n the hash table size and m the number of used entries α = m n the load factor Operation: only find-or-put(d) Takes a data element d as parameter, and: if d T, return found if d T, insert d and return inserted if d T and d cannot be inserted, return full Wytse Oortwijn (Formal Methods and Tools, AUniversity Distributed of Twente) Hash Table for Shared Memory August 31, / 20
15 Hash Table: Challenges Notation: Hash table T = b 0,..., b n 1 as a sequence of buckets b i, where: n the hash table size and m the number of used entries α = m n the load factor Operation: only find-or-put(d) Takes a data element d as parameter, and: if d T, return found if d T, insert d and return inserted if d T and d cannot be inserted, return full Design: Challenges How to distribute and access T = b 0,..., b n 1 efficiently? How to design find-or-put to perform efficiently? Wytse Oortwijn (Formal Methods and Tools, AUniversity Distributed of Twente) Hash Table for Shared Memory August 31, / 20
16 PGAS: Partitioned Global Address Space Details Assuming N participating threads: ytse Oortwijn (Formal Methods and Tools, AUniversity Distributed of Twente) Hash Table for Shared Memory August 31, / 20
17 PGAS: Partitioned Global Address Space PGAS t 1 t 2... t N... Details Assuming N participating threads: PGAS: shared + distributed memory model ytse Oortwijn (Formal Methods and Tools, AUniversity Distributed of Twente) Hash Table for Shared Memory August 31, / 20
18 PGAS: Partitioned Global Address Space PGAS Hybrid PGAS t 1 t 2... t N... t 1 t t N Details Assuming N participating threads: PGAS: shared + distributed memory model Hybrid PGAS: PGAS + message passing (dashed edges) Wytse Oortwijn (Formal Methods and Tools, AUniversity Distributed of Twente) Hash Table for Shared Memory August 31, / 20
19 Table of Contents 1 Introduction 2 Contribution 1: Resolving Hash Collisions 3 Contribution 2: Hiding Latency 4 Experimental Evaluation 5 Conclusion Wytse Oortwijn (Formal Methods and Tools, AUniversity Distributed of Twente) Hash Table for Shared Memory August 31, / 20
20 Hash Collisions Efficiency: Resolving Hash Collisions Occurs when h(x) = h(y) for data elements x y Efficiency of find-or-put depends on hashing strategy! Wytse Oortwijn (Formal Methods and Tools, AUniversity Distributed of Twente) Hash Table for Shared Memory August 31, / 20
21 Hash Collisions Efficiency: Resolving Hash Collisions Occurs when h(x) = h(y) for data elements x y Efficiency of find-or-put depends on hashing strategy! Existing Work Pilaf, 2014 (Cuckoo) Nessie, 2014 (Cuckoo) FaRM, 2014 (Hopscotch) HERD, 2014 (CPU-intensive) Wytse Oortwijn (Formal Methods and Tools, AUniversity Distributed of Twente) Hash Table for Shared Memory August 31, / 20
22 Hash Collisions Efficiency: Resolving Hash Collisions Occurs when h(x) = h(y) for data elements x y Efficiency of find-or-put depends on hashing strategy! Existing Work Pilaf, 2014 (Cuckoo) Nessie, 2014 (Cuckoo) FaRM, 2014 (Hopscotch) HERD, 2014 (CPU-intensive) Contributions Existing implementations either: Require more roundtrips Require locking schemes Are CPU-intensive ytse Oortwijn (Formal Methods and Tools, AUniversity Distributed of Twente) Hash Table for Shared Memory August 31, / 20
23 Hash Collisions Efficiency: Resolving Hash Collisions Occurs when h(x) = h(y) for data elements x y Efficiency of find-or-put depends on hashing strategy! Existing Work Pilaf, 2014 (Cuckoo) Nessie, 2014 (Cuckoo) FaRM, 2014 (Hopscotch) HERD, 2014 (CPU-intensive) Contributions Existing implementations either: Require more roundtrips Require locking schemes Are CPU-intensive Best strategy for find-or-put Which strategy requires the least number of roundtrips? ytse Oortwijn (Formal Methods and Tools, AUniversity Distributed of Twente) Hash Table for Shared Memory August 31, / 20
24 Comparing Hashing Strategies Chained Hashing + Theoretical comp. Θ(1 + α) - Dynamic mem. management - Storing pointers ytse Oortwijn (Formal Methods and Tools, AUniversity Distributed of Twente) Hash Table for Shared Memory August 31, / 20
25 Comparing Hashing Strategies Chained Hashing + Theoretical comp. Θ(1 + α) - Dynamic mem. management - Storing pointers Cuckoo Hashing + Uses k hash functions - Lookups require k roundtrips - Relocations require locks ytse Oortwijn (Formal Methods and Tools, AUniversity Distributed of Twente) Hash Table for Shared Memory August 31, / 20
26 Comparing Hashing Strategies Chained Hashing + Theoretical comp. Θ(1 + α) - Dynamic mem. management - Storing pointers Cuckoo Hashing + Uses k hash functions - Lookups require k roundtrips - Relocations require locks Hopscotch Hashing + Using neighbourhoods + Lookups require 1 roundtrip - Relocations require locks ytse Oortwijn (Formal Methods and Tools, AUniversity Distributed of Twente) Hash Table for Shared Memory August 31, / 20
27 Comparing Hashing Strategies Chained Hashing + Theoretical comp. Θ(1 + α) - Dynamic mem. management - Storing pointers Hopscotch Hashing + Using neighbourhoods + Lookups require 1 roundtrip - Relocations require locks Cuckoo Hashing + Uses k hash functions - Lookups require k roundtrips - Relocations require locks Linear Probing + Buckets are consecutive + No locking or relocations - Roundtrips for lookups? ytse Oortwijn (Formal Methods and Tools, AUniversity Distributed of Twente) Hash Table for Shared Memory August 31, / 20
28 Comparing Hashing Strategies Chained Hashing + Theoretical comp. Θ(1 + α) - Dynamic mem. management - Storing pointers Hopscotch Hashing + Using neighbourhoods + Lookups require 1 roundtrip - Relocations require locks Cuckoo Hashing + Uses k hash functions - Lookups require k roundtrips - Relocations require locks Linear Probing + Buckets are consecutive + No locking or relocations - Roundtrips for lookups? Linear Probing versus Hopscotch Due to Hopscotch invariant, lookups may be more expensive, but Inserts are arguably cheaper (amortized complexity) ytse Oortwijn (Formal Methods and Tools, AUniversity Distributed of Twente) Hash Table for Shared Memory August 31, / 20
29 Linear Probing: Efficiency Bounds Knuth, 1997 The expected number of buckets to examine until the intended buckets is found is at most: 1 2 ( ) (1 α) 2 ytse Oortwijn (Formal Methods and Tools, AUniversity Distributed of Twente) Hash Table for Shared Memory August 31, / 20
30 Linear Probing: Efficiency Bounds Knuth, 1997 The expected number of buckets to examine until the intended buckets is found is at most: 1 2C ( ) (1 α) 2 ytse Oortwijn (Formal Methods and Tools, AUniversity Distributed of Twente) Hash Table for Shared Memory August 31, / 20
31 Linear Probing: Efficiency Bounds Knuth, 1997 The expected number of buckets to examine until the intended buckets is found is at most: 1 2C ( ) (1 α) 2 Efficiency bound A chunk is expected to contain the intended bucket if: 1 α 1 2C 1 ytse Oortwijn (Formal Methods and Tools, AUniversity Distributed of Twente) Hash Table for Shared Memory August 31, / 20
32 Linear Probing: Efficiency Bounds Knuth, 1997 The expected number of buckets to examine until the intended buckets is found is at most: 1 2C ( ) (1 α) 2 Efficiency bound A chunk is expected to contain the intended bucket if: 1 α 1 2C 1 Expected load-factor at which a chunk is full Chunk size C 3 2 Chunk size C load-factor α load-factor α ytse Oortwijn (Formal Methods and Tools, AUniversity Distributed of Twente) Hash Table for Shared Memory August 31, / 20
33 Linear Probing: Efficiency Bounds Knuth, 1997 The expected number of buckets to examine until the intended buckets is found is at most: 1 2C ( ) (1 α) 2 Efficiency bound A chunk is expected to contain the intended bucket if: 1 α 1 2C 1 Expected load-factor at which a chunk is full Chunk size C load-factor α Chunk size C C = 64 C = load-factor α ytse Oortwijn (Formal Methods and Tools, AUniversity Distributed of Twente) Hash Table for Shared Memory August 31, / 20
34 Table of Contents 1 Introduction 2 Contribution 1: Resolving Hash Collisions 3 Contribution 2: Hiding Latency 4 Experimental Evaluation 5 Conclusion Wytse Oortwijn (Formal Methods and Tools, AUniversity Distributed of Twente) Hash Table for Shared Memory August 31, / 20
35 Linear Probing: Hiding Latency Contribution: Asynchronous queries Before chunk iteration, first request the next chunk: Overlapping roundtrips with computational activity Find next chunk with quadratic probing to prevent clustering Wytse Oortwijn (Formal Methods and Tools, AUniversity Distributed of Twente) Hash Table for Shared Memory August 31, / 20
36 Linear Probing: Hiding Latency Contribution: Asynchronous queries Before chunk iteration, first request the next chunk: Overlapping roundtrips with computational activity Find next chunk with quadratic probing to prevent clustering Defining query-chunk(i, d) Obtains the i-th chunk, starting from bucket b h(d) Returns a handle s Wytse Oortwijn (Formal Methods and Tools, AUniversity Distributed of Twente) Hash Table for Shared Memory August 31, / 20
37 Linear Probing: Hiding Latency Contribution: Asynchronous queries Before chunk iteration, first request the next chunk: Overlapping roundtrips with computational activity Find next chunk with quadratic probing to prevent clustering Defining query-chunk(i, d) Obtains the i-th chunk, starting from bucket b h(d) Returns a handle s Defining sync-chunk(s) Takes a handle s as parameter, waits until the corresponding query has been completed. ytse Oortwijn (Formal Methods and Tools, AUniversity Distributed of Twente) Hash Table for Shared Memory August 31, / 20
38 Linear Probing: Querying Visualized Initiator RDMA device of target Main memory of target Wytse Oortwijn (Formal Methods and Tools, AUniversity Distributed of Twente) Hash Table for Shared Memory August 31, / 20
39 Linear Probing: Querying Visualized Initiator RDMA device of target Main memory of target s 0 = query-chunk(0, d) Wytse Oortwijn (Formal Methods and Tools, AUniversity Distributed of Twente) Hash Table for Shared Memory August 31, / 20
40 Linear Probing: Querying Visualized Initiator RDMA device of target Main memory of target s 0 = query-chunk(0, d) s 1 = query-chunk(1, d) Wytse Oortwijn (Formal Methods and Tools, AUniversity Distributed of Twente) Hash Table for Shared Memory August 31, / 20
41 Linear Probing: Querying Visualized Initiator RDMA device of target Main memory of target s 0 = query-chunk(0, d) s 1 = query-chunk(1, d) sync-chunk(s 0) Wytse Oortwijn (Formal Methods and Tools, AUniversity Distributed of Twente) Hash Table for Shared Memory August 31, / 20
42 Linear Probing: Querying Visualized Initiator RDMA device of target Main memory of target s 0 = query-chunk(0, d) s 1 = query-chunk(1, d) sync-chunk(s 0) Wytse Oortwijn (Formal Methods and Tools, AUniversity Distributed of Twente) Hash Table for Shared Memory August 31, / 20
43 Linear Probing: Querying Visualized Initiator RDMA device of target Main memory of target s 0 = query-chunk(0, d) s 1 = query-chunk(1, d) sync-chunk(s 0) Wytse Oortwijn (Formal Methods and Tools, AUniversity Distributed of Twente) Hash Table for Shared Memory August 31, / 20
44 Linear Probing: Querying Visualized Initiator RDMA device of target Main memory of target s 0 = query-chunk(0, d) s 1 = query-chunk(1, d) sync-chunk(s 0) Wytse Oortwijn (Formal Methods and Tools, AUniversity Distributed of Twente) Hash Table for Shared Memory August 31, / 20
45 Linear Probing: Querying Visualized Initiator RDMA device of target Main memory of target s 0 = query-chunk(0, d) s 1 = query-chunk(1, d) sync-chunk(s 0) s 2 = query-chunk(2, d) Wytse Oortwijn (Formal Methods and Tools, AUniversity Distributed of Twente) Hash Table for Shared Memory August 31, / 20
46 Linear Probing: Querying Visualized Initiator RDMA device of target Main memory of target s 0 = query-chunk(0, d) s 1 = query-chunk(1, d) sync-chunk(s 0) s 2 = query-chunk(2, d) sync-chunk(s 1) Wytse Oortwijn (Formal Methods and Tools, AUniversity Distributed of Twente) Hash Table for Shared Memory August 31, / 20
47 Linear Probing: Querying Visualized Initiator RDMA device of target Main memory of target s 0 = query-chunk(0, d) s 1 = query-chunk(1, d) sync-chunk(s 0) s 2 = query-chunk(2, d) sync-chunk(s 1) Wytse Oortwijn (Formal Methods and Tools, AUniversity Distributed of Twente) Hash Table for Shared Memory August 31, / 20
48 Linear Probing: Querying Visualized Initiator RDMA device of target Main memory of target s 0 = query-chunk(0, d) s 1 = query-chunk(1, d) sync-chunk(s 0) s 2 = query-chunk(2, d) sync-chunk(s 1) s 3 = query-chunk(3, d) sync-chunk(s 2) s 4 = query-chunk(4, d) sync-chunk(s 3)... Wytse Oortwijn (Formal Methods and Tools, AUniversity Distributed of Twente) Hash Table for Shared Memory August 31, / 20
49 Table of Contents 1 Introduction 2 Contribution 1: Resolving Hash Collisions 3 Contribution 2: Hiding Latency 4 Experimental Evaluation 5 Conclusion Wytse Oortwijn (Formal Methods and Tools, AUniversity Distributed of Twente) Hash Table for Shared Memory August 31, / 20
50 Hash Table: Evaluation Experimental Setup All experiments have been perfomed on the DAS-5 cluster: 66 machines 16 cores each (Intel E5-2630v3) 64 GB internal memory each connected via 48Gb/s Infiniband Wytse Oortwijn (Formal Methods and Tools, AUniversity Distributed of Twente) Hash Table for Shared Memory August 31, / 20
51 Hash Table: Evaluation Experimental Setup All experiments have been perfomed on the DAS-5 cluster: 66 machines 16 cores each (Intel E5-2630v3) 64 GB internal memory each connected via 48Gb/s Infiniband Benchmarks Under different workloads, we measured: Throughput of find-or-put Latency of find-or-put Roundtrips of find-or-put Wytse Oortwijn (Formal Methods and Tools, AUniversity Distributed of Twente) Hash Table for Shared Memory August 31, / 20
52 Hash Table: Throughput Total Throughput Speedup ytse Oortwijn (Formal Methods and Tools, AUniversity Distributed of Twente) Hash Table for Shared Memory August 31, / 20
53 Hash Table: Throughput Total Throughput Speedup Observations Throughputs up to reached (66 machines) Remote speedup up to 110 obtained Local throughput of reached (1 threads) ytse Oortwijn (Formal Methods and Tools, AUniversity Distributed of Twente) Hash Table for Shared Memory August 31, / 20
54 Hash Table: Latency Local latency latency (ns) C = 8 C = 16 C = 32 C = 64 C = load-factor (α) Wytse Oortwijn (Formal Methods and Tools, AUniversity Distributed of Twente) Hash Table for Shared Memory August 31, / 20
55 Hash Table: Latency Local latency latency (ns) C = 8 C = 16 C = 32 C = 64 C = load-factor (α) Remote latency latency (ns) 4,400 4,200 4,000 3,800 3, latency (ns) C = 8 C = 16 C = 32 C = 64 C = 128 load-factor (α) load-factor (α) Wytse Oortwijn (Formal Methods and Tools, AUniversity Distributed of Twente) Hash Table for Shared Memory August 31, / 20
56 Table of Contents 1 Introduction 2 Contribution 1: Resolving Hash Collisions 3 Contribution 2: Hiding Latency 4 Experimental Evaluation 5 Conclusion Wytse Oortwijn (Formal Methods and Tools, AUniversity Distributed of Twente) Hash Table for Shared Memory August 31, / 20
57 Conclusions General Minimizing roundtrips increases performance Overlapping queries reduces waiting-times and decreases latency Linear probing requires less roundtrips than Hopscotch and Cuckoo Wytse Oortwijn (Formal Methods and Tools, AUniversity Distributed of Twente) Hash Table for Shared Memory August 31, / 20
58 Conclusions General Minimizing roundtrips increases performance Overlapping queries reduces waiting-times and decreases latency Linear probing requires less roundtrips than Hopscotch and Cuckoo Performance find-or-put takes 4.5µs on average with α = 0.9 and C = 64 Peak-throughput of op/s obtained Wytse Oortwijn (Formal Methods and Tools, AUniversity Distributed of Twente) Hash Table for Shared Memory August 31, / 20
59 Conclusions General Minimizing roundtrips increases performance Overlapping queries reduces waiting-times and decreases latency Linear probing requires less roundtrips than Hopscotch and Cuckoo Performance find-or-put takes 4.5µs on average with α = 0.9 and C = 64 Peak-throughput of op/s obtained Performance Indication FaRM: Inserts take 35µs Pilaf: Operations take 30µs Nessie: Inserts take 25µs Wytse Oortwijn (Formal Methods and Tools, AUniversity Distributed of Twente) Hash Table for Shared Memory August 31, / 20
Distributed Binary Decision Diagrams for Symbolic Reachability
Distributed Binary Decision Diagrams for Symbolic Reachability Wytse Oortwijn Formal Methods and Tools, University of Twente November 1, 2015 Wytse Oortwijn (Formal Methods and Tools, Distributed University
More informationA Distributed Hash Table for Shared Memory
A Distributed Hash Table for Shared Memory Wytse Oortwijn (B), Tom van Dijk, and Jaco van de Pol Formal Methods and Tools, Department of EEMCS, University of Twente, P.O.-box 217, 7500 AE Enschede, The
More informationFaRM: Fast Remote Memory
FaRM: Fast Remote Memory Aleksandar Dragojević, Dushyanth Narayanan, Orion Hodson, Miguel Castro Microsoft Research Abstract We describe the design and implementation of FaRM, a new main memory distributed
More informationFaRM: Fast Remote Memory
FaRM: Fast Remote Memory Problem Context DRAM prices have decreased significantly Cost effective to build commodity servers w/hundreds of GBs E.g. - cluster with 100 machines can hold tens of TBs of main
More informationHigh-Performance Key-Value Store on OpenSHMEM
High-Performance Key-Value Store on OpenSHMEM Huansong Fu*, Manjunath Gorentla Venkata, Ahana Roy Choudhury*, Neena Imam, Weikuan Yu* *Florida State University Oak Ridge National Laboratory Outline Background
More informationNear Memory Key/Value Lookup Acceleration MemSys 2017
Near Key/Value Lookup Acceleration MemSys 2017 October 3, 2017 Scott Lloyd, Maya Gokhale Center for Applied Scientific Computing This work was performed under the auspices of the U.S. Department of Energy
More informationMemC3: MemCache with CLOCK and Concurrent Cuckoo Hashing
MemC3: MemCache with CLOCK and Concurrent Cuckoo Hashing Bin Fan (CMU), Dave Andersen (CMU), Michael Kaminsky (Intel Labs) NSDI 2013 http://www.pdl.cmu.edu/ 1 Goal: Improve Memcached 1. Reduce space overhead
More informationS. Narravula, P. Balaji, K. Vaidyanathan, H.-W. Jin and D. K. Panda. The Ohio State University
Architecture for Caching Responses with Multiple Dynamic Dependencies in Multi-Tier Data- Centers over InfiniBand S. Narravula, P. Balaji, K. Vaidyanathan, H.-W. Jin and D. K. Panda The Ohio State University
More information2008 International ANSYS Conference
2008 International ANSYS Conference Maximizing Productivity With InfiniBand-Based Clusters Gilad Shainer Director of Technical Marketing Mellanox Technologies 2008 ANSYS, Inc. All rights reserved. 1 ANSYS,
More informationLITE Kernel RDMA. Support for Datacenter Applications. Shin-Yeh Tsai, Yiying Zhang
LITE Kernel RDMA Support for Datacenter Applications Shin-Yeh Tsai, Yiying Zhang Time 2 Berkeley Socket Userspace Kernel Hardware Time 1983 2 Berkeley Socket TCP Offload engine Arrakis & mtcp IX Userspace
More informationPerformance Optimizations via Connect-IB and Dynamically Connected Transport Service for Maximum Performance on LS-DYNA
Performance Optimizations via Connect-IB and Dynamically Connected Transport Service for Maximum Performance on LS-DYNA Pak Lui, Gilad Shainer, Brian Klaff Mellanox Technologies Abstract From concept to
More informationIntroduction. hashing performs basic operations, such as insertion, better than other ADTs we ve seen so far
Chapter 5 Hashing 2 Introduction hashing performs basic operations, such as insertion, deletion, and finds in average time better than other ADTs we ve seen so far 3 Hashing a hash table is merely an hashing
More informationThe NE010 iwarp Adapter
The NE010 iwarp Adapter Gary Montry Senior Scientist +1-512-493-3241 GMontry@NetEffect.com Today s Data Center Users Applications networking adapter LAN Ethernet NAS block storage clustering adapter adapter
More informationHeckaton. SQL Server's Memory Optimized OLTP Engine
Heckaton SQL Server's Memory Optimized OLTP Engine Agenda Introduction to Hekaton Design Consideration High Level Architecture Storage and Indexing Query Processing Transaction Management Transaction Durability
More informationUsing RDMA Efficiently for Key-Value Services
Using RDMA Efficiently for Key-Value Services Anuj Kalia Michael Kaminsky David G. Andersen Carnegie Mellon University Intel Labs {akalia,dga}@cs.cmu.edu michael.e.kaminsky@intel.com ABSTRACT This paper
More informationUsing RDMA Efficiently for Key-Value Services
Using RDMA Efficiently for Key-Value Services Anuj Kalia, Michael Kaminsky, David G. Andersen Carnegie Mellon University, Intel Labs CMU-PDL-14-16 June 214 Parallel Data Laboratory Carnegie Mellon University
More informationChapter 5 Hashing. Introduction. Hashing. Hashing Functions. hashing performs basic operations, such as insertion,
Introduction Chapter 5 Hashing hashing performs basic operations, such as insertion, deletion, and finds in average time 2 Hashing a hash table is merely an of some fixed size hashing converts into locations
More informationI/O Buffering and Streaming
I/O Buffering and Streaming I/O Buffering and Caching I/O accesses are reads or writes (e.g., to files) Application access is arbitary (offset, len) Convert accesses to read/write of fixed-size blocks
More informationIntroducing the Cray XMT. Petr Konecny May 4 th 2007
Introducing the Cray XMT Petr Konecny May 4 th 2007 Agenda Origins of the Cray XMT Cray XMT system architecture Cray XT infrastructure Cray Threadstorm processor Shared memory programming model Benefits/drawbacks/solutions
More informationCIS 601 Graduate Seminar. Dr. Sunnie S. Chung Dhruv Patel ( ) Kalpesh Sharma ( )
Guide: CIS 601 Graduate Seminar Presented By: Dr. Sunnie S. Chung Dhruv Patel (2652790) Kalpesh Sharma (2660576) Introduction Background Parallel Data Warehouse (PDW) Hive MongoDB Client-side Shared SQL
More informationSR-IOV Support for Virtualization on InfiniBand Clusters: Early Experience
SR-IOV Support for Virtualization on InfiniBand Clusters: Early Experience Jithin Jose, Mingzhe Li, Xiaoyi Lu, Krishna Kandalla, Mark Arnold and Dhabaleswar K. (DK) Panda Network-Based Computing Laboratory
More informationFaSST: Fast, Scalable, and Simple Distributed Transactions with Two-Sided (RDMA) Datagram RPCs
FaSST: Fast, Scalable, and Simple Distributed Transactions with Two-Sided (RDMA) Datagram RPCs Anuj Kalia (CMU), Michael Kaminsky (Intel Labs), David Andersen (CMU) RDMA RDMA is a network feature that
More informationFlexible Architecture Research Machine (FARM)
Flexible Architecture Research Machine (FARM) RAMP Retreat June 25, 2009 Jared Casper, Tayo Oguntebi, Sungpack Hong, Nathan Bronson Christos Kozyrakis, Kunle Olukotun Motivation Why CPUs + FPGAs make sense
More informationDesigning Next-Generation Data- Centers with Advanced Communication Protocols and Systems Services. Presented by: Jitong Chen
Designing Next-Generation Data- Centers with Advanced Communication Protocols and Systems Services Presented by: Jitong Chen Outline Architecture of Web-based Data Center Three-Stage framework to benefit
More informationFarewell to Servers: Hardware, Software, and Network Approaches towards Datacenter Resource Disaggregation
Farewell to Servers: Hardware, Software, and Network Approaches towards Datacenter Resource Disaggregation Yiying Zhang Datacenter 3 Monolithic Computer OS / Hypervisor 4 Can monolithic Application Hardware
More informationPortable SHMEMCache: A High-Performance Key-Value Store on OpenSHMEM and MPI
Portable SHMEMCache: A High-Performance Key-Value Store on OpenSHMEM and MPI Huansong Fu*, Manjunath Gorentla Venkata, Neena Imam, Weikuan Yu* *Florida State University Oak Ridge National Laboratory Outline
More informationBirds of a Feather Presentation
Mellanox InfiniBand QDR 4Gb/s The Fabric of Choice for High Performance Computing Gilad Shainer, shainer@mellanox.com June 28 Birds of a Feather Presentation InfiniBand Technology Leadership Industry Standard
More informationAdvanced Computer Networks. End Host Optimization
Oriana Riva, Department of Computer Science ETH Zürich 263 3501 00 End Host Optimization Patrick Stuedi Spring Semester 2017 1 Today End-host optimizations: NUMA-aware networking Kernel-bypass Remote Direct
More informationWhen MPPDB Meets GPU:
When MPPDB Meets GPU: An Extendible Framework for Acceleration Laura Chen, Le Cai, Yongyan Wang Background: Heterogeneous Computing Hardware Trend stops growing with Moore s Law Fast development of GPU
More informationSingle-Points of Performance
Single-Points of Performance Mellanox Technologies Inc. 29 Stender Way, Santa Clara, CA 9554 Tel: 48-97-34 Fax: 48-97-343 http://www.mellanox.com High-performance computations are rapidly becoming a critical
More informationTailwind: Fast and Atomic RDMA-based Replication. Yacine Taleb, Ryan Stutsman, Gabriel Antoniu, Toni Cortes
Tailwind: Fast and Atomic RDMA-based Replication Yacine Taleb, Ryan Stutsman, Gabriel Antoniu, Toni Cortes In-Memory Key-Value Stores General purpose in-memory key-value stores are widely used nowadays
More informationIntroduction to parallel Computing
Introduction to parallel Computing VI-SEEM Training Paschalis Paschalis Korosoglou Korosoglou (pkoro@.gr) (pkoro@.gr) Outline Serial vs Parallel programming Hardware trends Why HPC matters HPC Concepts
More informationRDMA Read Based Rendezvous Protocol for MPI over InfiniBand: Design Alternatives and Benefits
RDMA Read Based Rendezvous Protocol for MPI over InfiniBand: Design Alternatives and Benefits Sayantan Sur Hyun-Wook Jin Lei Chai D. K. Panda Network Based Computing Lab, The Ohio State University Presentation
More informationIsoStack Highly Efficient Network Processing on Dedicated Cores
IsoStack Highly Efficient Network Processing on Dedicated Cores Leah Shalev Eran Borovik, Julian Satran, Muli Ben-Yehuda Outline Motivation IsoStack architecture Prototype TCP/IP over 10GE on a single
More informationChapter 6. Storage and Other I/O Topics
Chapter 6 Storage and Other I/O Topics Introduction I/O devices can be characterized by Behaviour: input, output, storage Partner: human or machine Data rate: bytes/sec, transfers/sec I/O bus connections
More informationModule 3: Hashing Lecture 9: Static and Dynamic Hashing. The Lecture Contains: Static hashing. Hashing. Dynamic hashing. Extendible hashing.
The Lecture Contains: Hashing Dynamic hashing Extendible hashing Insertion file:///c /Documents%20and%20Settings/iitkrana1/My%20Documents/Google%20Talk%20Received%20Files/ist_data/lecture9/9_1.htm[6/14/2012
More informationData Processing on Emerging Hardware
Data Processing on Emerging Hardware Gustavo Alonso Systems Group Department of Computer Science ETH Zurich, Switzerland 3 rd International Summer School on Big Data, Munich, Germany, 2017 www.systems.ethz.ch
More informationOncilla - a Managed GAS Runtime for Accelerating Data Warehousing Queries
Oncilla - a Managed GAS Runtime for Accelerating Data Warehousing Queries Jeffrey Young, Alex Merritt, Se Hoon Shon Advisor: Sudhakar Yalamanchili 4/16/13 Sponsors: Intel, NVIDIA, NSF 2 The Problem Big
More informationBig data, little time. Scale-out data serving. Scale-out data serving. Highly skewed key popularity
/7/6 Big data, little time Goal is to keep (hot) data in memory Requires scale-out approach Each server responsible for one chunk Fast access to local data The Case for RackOut Scalable Data Serving Using
More informationHigh-Performance and Scalable Non-Blocking All-to-All with Collective Offload on InfiniBand Clusters: A study with Parallel 3DFFT
High-Performance and Scalable Non-Blocking All-to-All with Collective Offload on InfiniBand Clusters: A study with Parallel 3DFFT Krishna Kandalla (1), Hari Subramoni (1), Karen Tomko (2), Dmitry Pekurovsky
More informationMemory Management Strategies for Data Serving with RDMA
Memory Management Strategies for Data Serving with RDMA Dennis Dalessandro and Pete Wyckoff (presenting) Ohio Supercomputer Center {dennis,pw}@osc.edu HotI'07 23 August 2007 Motivation Increasing demands
More informationHigh Performance Computing on GPUs using NVIDIA CUDA
High Performance Computing on GPUs using NVIDIA CUDA Slides include some material from GPGPU tutorial at SIGGRAPH2007: http://www.gpgpu.org/s2007 1 Outline Motivation Stream programming Simplified HW and
More informationMemcached Design on High Performance RDMA Capable Interconnects
Memcached Design on High Performance RDMA Capable Interconnects Jithin Jose, Hari Subramoni, Miao Luo, Minjia Zhang, Jian Huang, Md. Wasi- ur- Rahman, Nusrat S. Islam, Xiangyong Ouyang, Hao Wang, Sayantan
More informationHash Joins for Multi-core CPUs. Benjamin Wagner
Hash Joins for Multi-core CPUs Benjamin Wagner Joins fundamental operator in query processing variety of different algorithms many papers publishing different results main question: is tuning to modern
More informationSharing High-Performance Devices Across Multiple Virtual Machines
Sharing High-Performance Devices Across Multiple Virtual Machines Preamble What does sharing devices across multiple virtual machines in our title mean? How is it different from virtual networking / NSX,
More informationAccelerating Real-Time Big Data. Breaking the limitations of captive NVMe storage
Accelerating Real-Time Big Data Breaking the limitations of captive NVMe storage 18M IOPs in 2u Agenda Everything related to storage is changing! The 3rd Platform NVM Express architected for solid state
More informationIntel Enterprise Edition Lustre (IEEL-2.3) [DNE-1 enabled] on Dell MD Storage
Intel Enterprise Edition Lustre (IEEL-2.3) [DNE-1 enabled] on Dell MD Storage Evaluation of Lustre File System software enhancements for improved Metadata performance Wojciech Turek, Paul Calleja,John
More informationMultifunction Networking Adapters
Ethernet s Extreme Makeover: Multifunction Networking Adapters Chuck Hudson Manager, ProLiant Networking Technology Hewlett-Packard 2004 Hewlett-Packard Development Company, L.P. The information contained
More informationFuture Routing Schemes in Petascale clusters
Future Routing Schemes in Petascale clusters Gilad Shainer, Mellanox, USA Ola Torudbakken, Sun Microsystems, Norway Richard Graham, Oak Ridge National Laboratory, USA Birds of a Feather Presentation Abstract
More informationEfficient and Truly Passive MPI-3 RMA Synchronization Using InfiniBand Atomics
1 Efficient and Truly Passive MPI-3 RMA Synchronization Using InfiniBand Atomics Mingzhe Li Sreeram Potluri Khaled Hamidouche Jithin Jose Dhabaleswar K. Panda Network-Based Computing Laboratory Department
More informationThe Use of Cloud Computing Resources in an HPC Environment
The Use of Cloud Computing Resources in an HPC Environment Bill, Labate, UCLA Office of Information Technology Prakashan Korambath, UCLA Institute for Digital Research & Education Cloud computing becomes
More informationExploiting InfiniBand and GPUDirect Technology for High Performance Collectives on GPU Clusters
Exploiting InfiniBand and Direct Technology for High Performance Collectives on Clusters Ching-Hsiang Chu chu.368@osu.edu Department of Computer Science and Engineering The Ohio State University OSU Booth
More informationIndexing in RAMCloud. Ankita Kejriwal, Ashish Gupta, Arjun Gopalan, John Ousterhout. Stanford University
Indexing in RAMCloud Ankita Kejriwal, Ashish Gupta, Arjun Gopalan, John Ousterhout Stanford University RAMCloud 1.0 Introduction Higher-level data models Without sacrificing latency and scalability Secondary
More informationLS-DYNA Best-Practices: Networking, MPI and Parallel File System Effect on LS-DYNA Performance
11 th International LS-DYNA Users Conference Computing Technology LS-DYNA Best-Practices: Networking, MPI and Parallel File System Effect on LS-DYNA Performance Gilad Shainer 1, Tong Liu 2, Jeff Layton
More informationDeconstructing RDMA-enabled Distributed Transaction Processing: Hybrid is Better!
Deconstructing RDMA-enabled Distributed Transaction Processing: Hybrid is Better! Xingda Wei, Zhiyuan Dong, Rong Chen, Haibo Chen Institute of Parallel and Distributed Systems (IPADS) Shanghai Jiao Tong
More informationLessons from Post-processing Climate Data on Modern Flash-based HPC Systems
Lessons from Post-processing Climate Data on Modern Flash-based HPC Systems Adnan Haider 1, Sheri Mickelson 2, John Dennis 2 1 Illinois Institute of Technology, USA; 2 National Center of Atmospheric Research,
More informationNetworking at the Speed of Light
Networking at the Speed of Light Dror Goldenberg VP Software Architecture MaRS Workshop April 2017 Cloud The Software Defined Data Center Resource virtualization Efficient services VM, Containers uservices
More informationHeterogeneous platforms
Heterogeneous platforms Systems combining main processors and accelerators e.g., CPU + GPU, CPU + Intel MIC, AMD APU, ARM SoC Any platform using a GPU is a heterogeneous platform! Further in this talk
More informationWrite a technical report Present your results Write a workshop/conference paper (optional) Could be a real system, simulation and/or theoretical
Identify a problem Review approaches to the problem Propose a novel approach to the problem Define, design, prototype an implementation to evaluate your approach Could be a real system, simulation and/or
More informationDesigning High Performance DSM Systems using InfiniBand Features
Designing High Performance DSM Systems using InfiniBand Features Ranjit Noronha and Dhabaleswar K. Panda The Ohio State University NBC Outline Introduction Motivation Design and Implementation Results
More informationDeconstructing RDMA-enabled Distributed Transactions: Hybrid is Better!
Deconstructing RDMA-enabled Distributed Transactions: Hybrid is Better! Xingda Wei, Zhiyuan Dong, Rong Chen, Haibo Chen Institute of Parallel and Distributed Systems, Shanghai Jiao Tong University Contacts:
More informationMulti-Threaded UPC Runtime for GPU to GPU communication over InfiniBand
Multi-Threaded UPC Runtime for GPU to GPU communication over InfiniBand Miao Luo, Hao Wang, & D. K. Panda Network- Based Compu2ng Laboratory Department of Computer Science and Engineering The Ohio State
More informationMDHIM: A Parallel Key/Value Store Framework for HPC
MDHIM: A Parallel Key/Value Store Framework for HPC Hugh Greenberg 7/6/2015 LA-UR-15-25039 HPC Clusters Managed by a job scheduler (e.g., Slurm, Moab) Designed for running user jobs Difficult to run system
More information@ Massachusetts Institute of Technology All rights reserved.
..... CPHASH: A Cache-Partitioned Hash Table with LRU Eviction by Zviad Metreveli MASSACHUSETTS INSTITUTE OF TECHAC'LOGY JUN 2 1 2011 LIBRARIES Submitted to the Department of Electrical Engineering and
More informationAdapted from: TRENDS AND ATTRIBUTES OF HORIZONTAL AND VERTICAL COMPUTING ARCHITECTURES
Adapted from: TRENDS AND ATTRIBUTES OF HORIZONTAL AND VERTICAL COMPUTING ARCHITECTURES Tom Atwood Business Development Manager Sun Microsystems, Inc. Takeaways Understand the technical differences between
More informationInfiniswap. Efficient Memory Disaggregation. Mosharaf Chowdhury. with Juncheng Gu, Youngmoon Lee, Yiwen Zhang, and Kang G. Shin
Infiniswap Efficient Memory Disaggregation Mosharaf Chowdhury with Juncheng Gu, Youngmoon Lee, Yiwen Zhang, and Kang G. Shin Rack-Scale Computing Datacenter-Scale Computing Geo-Distributed Computing Coflow
More informationOptimizing non-blocking Collective Operations for InfiniBand
Optimizing non-blocking Collective Operations for InfiniBand Open Systems Lab Indiana University Bloomington, USA IPDPS 08 - CAC 08 Workshop Miami, FL, USA April, 14th 2008 Introduction Non-blocking collective
More informationDesigning Next Generation Data-Centers with Advanced Communication Protocols and Systems Services
Designing Next Generation Data-Centers with Advanced Communication Protocols and Systems Services P. Balaji, K. Vaidyanathan, S. Narravula, H. W. Jin and D. K. Panda Network Based Computing Laboratory
More informationLatency-Tolerant Software Distributed Shared Memory
Latency-Tolerant Software Distributed Shared Memory Jacob Nelson, Brandon Holt, Brandon Myers, Preston Briggs, Luis Ceze, Simon Kahan, Mark Oskin University of Washington USENIX ATC 2015 July 9, 2015 25
More informationANALYZING CHARACTERISTICS OF PC CLUSTER CONSOLIDATED WITH IP-SAN USING DATA-INTENSIVE APPLICATIONS
ANALYZING CHARACTERISTICS OF PC CLUSTER CONSOLIDATED WITH IP-SAN USING DATA-INTENSIVE APPLICATIONS Asuka Hara Graduate school of Humanities and Science Ochanomizu University 2-1-1, Otsuka, Bunkyo-ku, Tokyo,
More informationImproving Altibase Performance with Solarflare 10GbE Server Adapters and OpenOnload
Improving Altibase Performance with Solarflare 10GbE Server Adapters and OpenOnload Summary As today s corporations process more and more data, the business ramifications of faster and more resilient database
More informationChapter 27 Hashing. Liang, Introduction to Java Programming, Eleventh Edition, (c) 2017 Pearson Education, Inc. All rights reserved.
Chapter 27 Hashing 1 Objectives To know what hashing is for ( 27.3). To obtain the hash code for an object and design the hash function to map a key to an index ( 27.4). To handle collisions using open
More informationlecture23: Hash Tables
lecture23: Largely based on slides by Cinda Heeren CS 225 UIUC 18th July, 2013 Announcements mp6.1 extra credit due tomorrow tonight (7/19) lab avl due tonight mp6 due Monday (7/22) Hash tables A hash
More informationPerformance Analysis and Evaluation of Mellanox ConnectX InfiniBand Architecture with Multi-Core Platforms
Performance Analysis and Evaluation of Mellanox ConnectX InfiniBand Architecture with Multi-Core Platforms Sayantan Sur, Matt Koop, Lei Chai Dhabaleswar K. Panda Network Based Computing Lab, The Ohio State
More informationOptimizing Out-of-Core Nearest Neighbor Problems on Multi-GPU Systems Using NVLink
Optimizing Out-of-Core Nearest Neighbor Problems on Multi-GPU Systems Using NVLink Rajesh Bordawekar IBM T. J. Watson Research Center bordaw@us.ibm.com Pidad D Souza IBM Systems pidsouza@in.ibm.com 1 Outline
More informationComputer Organization and Structure. Bing-Yu Chen National Taiwan University
Computer Organization and Structure Bing-Yu Chen National Taiwan University Storage and Other I/O Topics I/O Performance Measures Types and Characteristics of I/O Devices Buses Interfacing I/O Devices
More informationHash Tables. Gunnar Gotshalks. Maps 1
Hash Tables Maps 1 Definition A hash table has the following components» An array called a table of size N» A mathematical function called a hash function that maps keys to valid array indices hash_function:
More informationOptimized Distributed Data Sharing Substrate in Multi-Core Commodity Clusters: A Comprehensive Study with Applications
Optimized Distributed Data Sharing Substrate in Multi-Core Commodity Clusters: A Comprehensive Study with Applications K. Vaidyanathan, P. Lai, S. Narravula and D. K. Panda Network Based Computing Laboratory
More informationCommunication has significant impact on application performance. Interconnection networks therefore have a vital role in cluster systems.
Cluster Networks Introduction Communication has significant impact on application performance. Interconnection networks therefore have a vital role in cluster systems. As usual, the driver is performance
More informationHigh Performance Distributed Lock Management Services using Network-based Remote Atomic Operations
High Performance Distributed Lock Management Services using Network-based Remote Atomic Operations S. Narravula, A. Mamidala, A. Vishnu, K. Vaidyanathan, and D. K. Panda Presented by Lei Chai Network Based
More informationS WHAT THE PROFILER IS TELLING YOU: OPTIMIZING GPU KERNELS. Jakob Progsch, Mathias Wagner GTC 2018
S8630 - WHAT THE PROFILER IS TELLING YOU: OPTIMIZING GPU KERNELS Jakob Progsch, Mathias Wagner GTC 2018 1. Know your hardware BEFORE YOU START What are the target machines, how many nodes? Machine-specific
More informationHigh Performance Packet Processing with FlexNIC
High Performance Packet Processing with FlexNIC Antoine Kaufmann, Naveen Kr. Sharma Thomas Anderson, Arvind Krishnamurthy University of Washington Simon Peter The University of Texas at Austin Ethernet
More informationAccelerating Foreign-Key Joins using Asymmetric Memory Channels
Accelerating Foreign-Key Joins using Asymmetric Memory Channels Holger Pirk Stefan Manegold Martin Kersten holger@cwi.nl manegold@cwi.nl mk@cwi.nl Why? Trivia: Joins are important But: Many Joins are (Indexed)
More informationThe End of a Myth: Distributed Transactions Can Scale
The End of a Myth: Distributed Transactions Can Scale Erfan Zamanian, Carsten Binnig, Tim Harris, and Tim Kraska Rafael Ketsetsides Why distributed transactions? 2 Cheaper for the same processing power
More informationScalable Concurrent Hash Tables via Relativistic Programming
Scalable Concurrent Hash Tables via Relativistic Programming Josh Triplett September 24, 2009 Speed of data < Speed of light Speed of light: 3e8 meters/second Processor speed: 3 GHz, 3e9 cycles/second
More informationCan Parallel Replication Benefit Hadoop Distributed File System for High Performance Interconnects?
Can Parallel Replication Benefit Hadoop Distributed File System for High Performance Interconnects? N. S. Islam, X. Lu, M. W. Rahman, and D. K. Panda Network- Based Compu2ng Laboratory Department of Computer
More informationPerformance COE 403. Computer Architecture Prof. Muhamed Mudawar. Computer Engineering Department King Fahd University of Petroleum and Minerals
Performance COE 403 Computer Architecture Prof. Muhamed Mudawar Computer Engineering Department King Fahd University of Petroleum and Minerals What is Performance? How do we measure the performance of
More informationInput/Output Introduction
Input/Output 1 Introduction Motivation Performance metrics Processor interface issues Buses 2 Page 1 Motivation CPU Performance: 60% per year I/O system performance limited by mechanical delays (e.g.,
More informationIntroduction. Motivation Performance metrics Processor interface issues Buses
Input/Output 1 Introduction Motivation Performance metrics Processor interface issues Buses 2 Motivation CPU Performance: 60% per year I/O system performance limited by mechanical delays (e.g., disk I/O)
More informationStorage. Hwansoo Han
Storage Hwansoo Han I/O Devices I/O devices can be characterized by Behavior: input, out, storage Partner: human or machine Data rate: bytes/sec, transfers/sec I/O bus connections 2 I/O System Characteristics
More informationOptimizing LS-DYNA Productivity in Cluster Environments
10 th International LS-DYNA Users Conference Computing Technology Optimizing LS-DYNA Productivity in Cluster Environments Gilad Shainer and Swati Kher Mellanox Technologies Abstract Increasing demand for
More informationKV-Direct: High-Performance In-Memory Key-Value Store with Programmable NIC
KV-Direct: High-Performance In-Memory Key-Value Store with Programmable NIC Bojie Li * Zhenyuan Ruan * Wencong Xiao Yuanwei Lu Yongqiang Xiong Andrew Putnam Enhong Chen Lintao Zhang Microsoft Research
More informationPerformance Analysis of Memory Transfers and GEMM Subroutines on NVIDIA TESLA GPU Cluster
Performance Analysis of Memory Transfers and GEMM Subroutines on NVIDIA TESLA GPU Cluster Veerendra Allada, Troy Benjegerdes Electrical and Computer Engineering, Ames Laboratory Iowa State University &
More informationOutline 1 Motivation 2 Theory of a non-blocking benchmark 3 The benchmark and results 4 Future work
Using Non-blocking Operations in HPC to Reduce Execution Times David Buettner, Julian Kunkel, Thomas Ludwig Euro PVM/MPI September 8th, 2009 Outline 1 Motivation 2 Theory of a non-blocking benchmark 3
More informationTrafficDB: HERE s High Performance Shared-Memory Data Store Ricardo Fernandes, Piotr Zaczkowski, Bernd Göttler, Conor Ettinoffe, and Anis Moussa
TrafficDB: HERE s High Performance Shared-Memory Data Store Ricardo Fernandes, Piotr Zaczkowski, Bernd Göttler, Conor Ettinoffe, and Anis Moussa EPL646: Advanced Topics in Databases Christos Hadjistyllis
More informationRAMCloud and the Low- Latency Datacenter. John Ousterhout Stanford University
RAMCloud and the Low- Latency Datacenter John Ousterhout Stanford University Most important driver for innovation in computer systems: Rise of the datacenter Phase 1: large scale Phase 2: low latency Introduction
More informationRethinking Distributed Indexing for RDMA - Based Networks
CSCI 2980 Master s Project Report Rethinking Distributed Indexing for RDMA - Based Networks by Sumukha Tumkur Vani stumkurv@cs.brown.edu Under the guidance of Rodrigo Fonseca Carsten Binnig Submitted in
More informationGPU Performance Optimisation. Alan Gray EPCC The University of Edinburgh
GPU Performance Optimisation EPCC The University of Edinburgh Hardware NVIDIA accelerated system: Memory Memory GPU vs CPU: Theoretical Peak capabilities NVIDIA Fermi AMD Magny-Cours (6172) Cores 448 (1.15GHz)
More informationHighly Scalable, Non-RDMA NVMe Fabric. Bob Hansen,, VP System Architecture
A Cost Effective,, High g Performance,, Highly Scalable, Non-RDMA NVMe Fabric Bob Hansen,, VP System Architecture bob@apeirondata.com Storage Developers Conference, September 2015 Agenda 3 rd Platform
More informationVoltaire Making Applications Run Faster
Voltaire Making Applications Run Faster Asaf Somekh Director, Marketing Voltaire, Inc. Agenda HPC Trends InfiniBand Voltaire Grid Backbone Deployment examples About Voltaire HPC Trends Clusters are the
More information