Hash Table Design and Optimization for Software Virtual Switches


1 Hash Table Design and Optimization for Software Virtual Switches
Presenter: Ren Wang
Yipeng Wang, Sameh Gobriel, Ren Wang, Charlie Tai, Cristian Dumitrescu
Intel Labs

2 Outline
- Background and motivation
- Survey and understanding
- Analysis

3 Background
The most common data structure we found in virtual switches is the hash table:
- Wildcard match (tuple space search): routing table, ACL
- Exact match: connection-tracking table, flow cache, etc.
Compared with tree-based data structures, hash-table-based data structures have certain advantages:
- More parallelism: no pointer chasing
- Faster rule updates

4 Background
Hash table lookup is also one of the most time-consuming stages of packet processing. For example, in Open vSwitch (100k rules, 20 subtables):

Stage            Execution percentage
I/O              ~8%
Preprocessing    ~5%
Rule lookup      ~78%
Others           ~9%

A major source of hash table lookup overhead is memory access latency.

5 Motivation
- A hash table is a simple data structure, but there are many different designs and implementations.
- Understanding hash table performance and how to design an efficient hash table structure is key to a good software switch.
- A general guideline for hash table design will benefit future vswitch development.

6 Basic hash table structure
The evolution of hash table algorithms: single array -> bucket-based -> n-hash
[Figure: Key A in each of the three designs]

7 Cuckoo hashing
Cuckoo hash algorithm: existing keys can be displaced to an alternative bucket.
[Figure: Key A, Key B, and the hash function]
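The displacement step can be sketched in C as follows. This is a minimal illustration, not the code of any surveyed switch: it assumes one key per bucket, reserves key 0 as "empty", and uses two toy hash functions (`hash1`, `hash2`, both hypothetical) chosen only so that the displacement is easy to trace by hand. `NUM_BUCKETS` and `MAX_KICKS` are likewise illustrative values.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define NUM_BUCKETS 8u   /* hypothetical table size */
#define MAX_KICKS   16   /* hypothetical bound on the displacement chain */
#define EMPTY_KEY   0u   /* key 0 is reserved to mark an empty bucket */

typedef struct { uint32_t slot[NUM_BUCKETS]; } cuckoo_table;

/* Toy hash functions, easy to evaluate by hand; real tables use stronger ones. */
static uint32_t hash1(uint32_t k) { return k % NUM_BUCKETS; }
static uint32_t hash2(uint32_t k) { return (k / NUM_BUCKETS) % NUM_BUCKETS; }

/* A key can only live in one of its two candidate buckets. */
static bool cuckoo_lookup(const cuckoo_table *t, uint32_t key)
{
    return t->slot[hash1(key)] == key || t->slot[hash2(key)] == key;
}

/* Insert a key, displacing existing keys to their alternative bucket. */
static bool cuckoo_insert(cuckoo_table *t, uint32_t key)
{
    assert(key != EMPTY_KEY);
    if (cuckoo_lookup(t, key))
        return true;                      /* already present */

    uint32_t idx = hash1(key);
    for (int kick = 0; kick < MAX_KICKS; kick++) {
        if (t->slot[idx] == EMPTY_KEY) {
            t->slot[idx] = key;
            return true;
        }
        /* Bucket is occupied: take it over and move the old occupant. */
        uint32_t victim = t->slot[idx];
        t->slot[idx] = key;
        key = victim;
        idx = (hash1(key) == idx) ? hash2(key) : hash1(key);
    }
    /* Gave up: the key held in `key` is dropped in this sketch.
     * A real table would rehash, resize, or spill to an extended table. */
    return false;
}
```

With these toy hashes, inserting key 1 and then key 9 (both map to bucket 1 under `hash1`) leaves key 9 in bucket 1 and moves key 1 to its alternative bucket 0.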

8 Survey
We also studied various open source virtual switch applications to learn their implementations. These applications use hash tables for three major purposes:
- Routing table/ACL: tuple space search
- Connection-tracking table: exact match
- Flow cache: exact/signature match with a replacement policy

9 Observations
- Set-associative tables and cuckoo hashing are widely used.
- Bucket size is usually 4-8 entries.
  - Cache alignment
  - Vectorization
- A capacity guarantee is needed in telecom use cases.
  - Linked-list-based hash table as an extended table
- Software techniques to improve performance:
  - Software pipelining
  - Batching
- Read-write concurrency
  - Optimistic locking
  - Intel TSX

10 Analysis - Table organization and data structure
Number of keys per bucket: more entries in a bucket directly improve table utilization.
[Figure: load factor vs. associativity (series: hash, 2hash, cuckoo)]
[Figure: cuckoo hash insertion cycles vs. load factor]
Conclusion: when table utilization is important, cuckoo hashing should be used. Multiple hash functions and multiple ways per bucket also help considerably.

11 Analysis - Separate key storage and cache alignment
Hash tables can store the key-data pairs in a separate memory location and keep only signatures and indexes in the table.
- Pros: signatures and indexes are easy to cache-align, which benefits the lookup-miss case.
- Cons: a hit requires another memory access.
[Figure: table entry holding SIG, pointing to separately stored Key+data]
Our tests show that with optimized DPDK hash tables, storing keys inside or outside the table makes no major difference with 16- or 32-byte keys. However, cache alignment improves hash table lookup speed by [...]% in our DPDK-based performance test.
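One possible layout for the separate-storage design is sketched below in C11. All names (`sig_bucket`, `kv_entry`, `bucket_lookup`) and sizes are assumptions made for illustration, including the 64-byte cache line and the convention that signature 0 marks an empty way; this is not the DPDK hash library's actual layout.

```c
#include <assert.h>
#include <stdalign.h>
#include <stdint.h>
#include <string.h>

#define CACHE_LINE 64   /* assumed cache-line size in bytes */
#define WAYS       8    /* entries per bucket, within the 4-8 range observed */
#define KEY_LEN    16   /* one of the key sizes mentioned above */

/* The table itself holds only short signatures and indexes. */
typedef struct {
    alignas(CACHE_LINE) uint16_t sig[WAYS];   /* 0 means "empty way" */
    uint32_t key_idx[WAYS];                   /* index into the key store */
} sig_bucket;

/* Full keys and data live in a separate array. */
typedef struct {
    uint8_t  key[KEY_LEN];
    uint64_t data;
} kv_entry;

/* One bucket occupies exactly one cache line, so probing it costs one fetch. */
static_assert(sizeof(sig_bucket) == CACHE_LINE, "bucket must fill one cache line");

/* Returns the key-store index on a hit, -1 on a miss. */
static int bucket_lookup(const sig_bucket *b, const kv_entry *store,
                         uint16_t sig, const uint8_t key[KEY_LEN])
{
    assert(sig != 0);                         /* 0 is reserved for empty ways */
    for (int w = 0; w < WAYS; w++) {
        if (b->sig[w] != sig)
            continue;                         /* miss path: no key-store access */
        const kv_entry *e = &store[b->key_idx[w]];   /* the extra memory access */
        if (memcmp(e->key, key, KEY_LEN) == 0)
            return (int)b->key_idx[w];
    }
    return -1;
}
```

The miss path touches only the aligned bucket, while a hit pays for one more access into the key store, which mirrors the pros and cons listed on the slide.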

12 Analysis - Hash table based cache
When a hash table is used as a flow cache, we need to consider the cache miss ratio. 4-8 ways per bucket already keep the miss ratio reasonably low.
We propose a new AVX-based LRU implementation that uses an Intel AVX instruction to permute the bucket.
[Figure: miss ratio vs. ways per bucket]
[Figure: insert and lookup speed comparison (lower is better) for avx, age, ll, bplru, tplru]

    /* Pseudocode: avx_load, avx_permute, and avx_store stand for AVX intrinsics. */
    void adjust_location(int location, bucket* bucket){
        /* load the whole bucket into one 256-bit register */
        __m256i array = avx_load(bucket);
        /* pick the precomputed permutation for the accessed position */
        __m256i permute_pattern = avx_load(permute_index[location]);
        /* reorder all entries at once */
        __m256i permuted_array = avx_permute(array, permute_pattern);
        avx_store(bucket, permuted_array);
    }
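To make the permutation idea concrete, here is a scalar C sketch of the same move-to-front reordering. It assumes a bucket of eight 32-bit entries with position 0 as most recently used, and a `permute_index` table in which `permute_index[location][i]` names the source position for destination `i`; the slide does not show the table contents, so this layout is an assumption. On AVX2 hardware the inner loop could presumably be replaced by one `_mm256_permutevar8x32_epi32` call, though the slide does not say which instruction the authors used.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define WAYS 8   /* eight 32-bit entries fill one 256-bit register */

/* permute_index[location][i] = source position for destination i
 * (assumed layout; filled in by init_permute_index). */
static uint32_t permute_index[WAYS][WAYS];

static void init_permute_index(void)
{
    for (int loc = 0; loc < WAYS; loc++) {
        permute_index[loc][0] = (uint32_t)loc;          /* accessed entry moves to front */
        for (int i = 1; i <= loc; i++)
            permute_index[loc][i] = (uint32_t)(i - 1);  /* newer entries shift back by one */
        for (int i = loc + 1; i < WAYS; i++)
            permute_index[loc][i] = (uint32_t)i;        /* older entries stay in place */
    }
}

/* Scalar stand-in for load / permute / store on one bucket. */
static void adjust_location(int location, uint32_t bucket[WAYS])
{
    assert(location >= 0 && location < WAYS);
    uint32_t permuted[WAYS];
    for (int i = 0; i < WAYS; i++)
        permuted[i] = bucket[permute_index[location][i]];
    memcpy(bucket, permuted, sizeof permuted);
}
```

After `init_permute_index()`, calling `adjust_location(3, bucket)` on a bucket holding 10, 11, ..., 17 moves 13 to the front and shifts 10, 11, 12 back by one position, so the entry at position 7 is always the least recently used candidate for eviction.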

13 Analysis - Software pipelining and batching
Batching enables us to prefetch hash table buckets for different lookup keys. Together with batching, software pipelining can further improve performance. Software pipelining plus batching easily improves performance by 2x in our test case.
[Figure: cycles per packet with no optimization, with prefetch, and with pipelining]
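A two-stage batched lookup can be sketched in C as follows: the first loop hashes every key in the batch and issues a prefetch for its bucket, and the second loop does the comparisons once the memory requests are in flight. The table layout, `bucket_of`, `table_insert`, and `lookup_batch` are all hypothetical names for this sketch, key 0 is reserved as "empty", and `__builtin_prefetch` is a GCC/Clang builtin, so it is guarded for other compilers.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define WAYS      4     /* entries per bucket (illustrative) */
#define MAX_BATCH 32    /* illustrative upper bound on batch size */
#define EMPTY_KEY 0u    /* key 0 marks an empty way in this sketch */

typedef struct { uint32_t key[WAYS]; int32_t val[WAYS]; } bucket4;

/* Illustrative multiplicative hash; mask is (number of buckets - 1). */
static uint32_t bucket_of(uint32_t key, uint32_t mask)
{
    return ((key * 2654435761u) >> 16) & mask;
}

static bool table_insert(bucket4 *table, uint32_t mask, uint32_t key, int32_t val)
{
    bucket4 *b = &table[bucket_of(key, mask)];
    for (int w = 0; w < WAYS; w++) {
        if (b->key[w] == EMPTY_KEY) {
            b->key[w] = key;
            b->val[w] = val;
            return true;
        }
    }
    return false;   /* bucket full; no displacement in this sketch */
}

/* Looks up n keys; out[i] receives the value, or -1 on a miss. */
static void lookup_batch(const bucket4 *table, uint32_t mask,
                         const uint32_t *keys, int n, int32_t *out)
{
    uint32_t idx[MAX_BATCH];
    assert(n >= 0 && n <= MAX_BATCH);

    /* Stage 1: hash every key and start fetching its bucket. */
    for (int i = 0; i < n; i++) {
        idx[i] = bucket_of(keys[i], mask);
#if defined(__GNUC__) || defined(__clang__)
        __builtin_prefetch(&table[idx[i]]);
#endif
    }

    /* Stage 2: compare keys; buckets are likely in cache by now. */
    for (int i = 0; i < n; i++) {
        const bucket4 *b = &table[idx[i]];
        out[i] = -1;
        for (int w = 0; w < WAYS; w++) {
            if (b->key[w] == keys[i]) {
                out[i] = b->val[w];
                break;
            }
        }
    }
}
```

Splitting the work this way lets the memory latency of one key's bucket overlap with the hashing of the following keys, which is the effect the slide attributes to batching and pipelining.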

14 Analysis - Vectorization
Besides using Intel AVX instructions for the LRU operation, we can also use AVX instructions to perform signature comparison. We compare three mechanisms:
- No vectorization.
- Horizontal vectorization: compare one key's signature to all signatures in a bucket.
- Vertical vectorization: compare the signatures of all keys in a batch to different entries across different buckets.
Observations:
- Vertical or scalar is better for low table utilization.
- Horizontal is better for high table utilization.
- An adaptive method could be beneficial.
[Figure: lookup cycles vs. average entry location for scalar, vertical, and horizontal comparison]
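The scalar and horizontal variants can be contrasted in a short C sketch. It assumes eight 16-bit signatures per bucket and uses SSE2 intrinsics (128-bit) rather than AVX so that it builds on any x86-64 compiler; on other targets it falls back to the scalar loop. The function names `match_scalar` and `match_horizontal` are hypothetical.

```c
#include <assert.h>
#include <stdint.h>
#if defined(__SSE2__)
#include <emmintrin.h>
#endif

#define WAYS 8   /* eight 16-bit signatures fit one 128-bit register */

/* No vectorization: test one signature at a time.
 * Bit w of the result is set when sig[w] equals target. */
static unsigned match_scalar(const uint16_t sig[WAYS], uint16_t target)
{
    unsigned mask = 0;
    for (int w = 0; w < WAYS; w++)
        if (sig[w] == target)
            mask |= 1u << w;
    return mask;
}

/* Horizontal vectorization: compare one signature against the whole bucket. */
static unsigned match_horizontal(const uint16_t sig[WAYS], uint16_t target)
{
#if defined(__SSE2__)
    __m128i bucket = _mm_loadu_si128((const __m128i *)sig);
    __m128i needle = _mm_set1_epi16((short)target);      /* broadcast the signature */
    __m128i eq     = _mm_cmpeq_epi16(bucket, needle);    /* 0xFFFF in matching lanes */
    /* Narrow each 16-bit lane to one byte, then take one bit per byte. */
    __m128i packed = _mm_packs_epi16(eq, _mm_setzero_si128());
    return (unsigned)_mm_movemask_epi8(packed) & 0xFFu;
#else
    return match_scalar(sig, target);                    /* portable fallback */
#endif
}
```

The scalar loop can stop early when the match sits near the front of the bucket, whereas the vector version always pays a fixed cost for the whole bucket, which is consistent with the slide's observation that the better choice depends on table utilization.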

15 Future directions of hash table design
Cuckoo hash + extended linked list design
- A linked-list-based hash table provides a capacity guarantee.
- A cuckoo hash table provides high table utilization and constant lookup time.
- Combining the two achieves both a capacity guarantee and better utilization.
Adaptive vrouter
- From the study, we found that no single data structure fits all use cases.
- A runtime decision based on traffic patterns could be beneficial.
- At runtime, a learning phase (e.g., trial and rank) tries various hash table data structures.

16 Conclusion
- We investigated multiple hash table algorithms and implementations in popular virtual switches.
- We analyzed various hash table designs and provided guidelines for different use cases.
- We proposed an Intel AVX based LRU cache implementation and adaptive signature comparison.
- We proposed future directions for hash table design in virtual switches.
