Counter Braids: A novel counter architecture


Counter Braids: A novel counter architecture
Balaji Prabhakar, Stanford University
Joint work with: Yi Lu, Andrea Montanari, Sarang Dharmapurikar and Abdul Kabbani

Overview
- Counter Braids
  - Background: current approaches
    - Exact, per-flow accounting
    - Approximate, large-flow accounting
  - Our approach
    - The Counter Braid architecture
    - A simple, efficient message-passing algorithm
  - Performance, comparisons and further work
- Congestion notification in Ethernet
  - Overview of the IEEE standards effort

Traffic Statistics: Background
- Routers collect traffic statistics, useful for accounting/billing, traffic engineering, and security/forensics.
- Several products exist in this area; notably, Cisco's NetFlow, Juniper's cflowd, and Huawei's NetStream.
- Similar problems arise in other areas, e.g. counting distinct items in database streams and in web server logs.
- Key problem: at high line rates, memory technology is a limiting factor. With 500,000+ active flows and packets arriving once every 10 ns on a 40 Gbps line, we need memories that are both fast and large to implement the counters: very expensive.
- This has spawned two approaches:
  - Exact, per-flow accounting: use a hybrid SRAM-DRAM architecture.
  - Approximate, large-flow accounting: exploit the heavy-tailed nature of the flow size distribution.

Per-flow Accounting
- Naive approach: one counter per flow.
  [Figure: an array of per-flow counters F1, F2, ..., Fn, each split into LSB and MSB portions]
- Problem: needs memories that are both fast and large; infeasible.

An initial approach
- Shah, Iyer, Prabhakar, McKeown (2001): a hybrid SRAM-DRAM architecture. (A sketch of the update path follows this slide.)
  - LSBs in SRAM: high-speed, on-chip updates.
  - MSBs in DRAM: updated less frequently, so slower, off-chip DRAMs can be used.
  [Figure: per-flow SRAM counters F1, ..., Fn feed DRAM counters through an interconnect of speed L/S, driven by a counter-management algorithm]
- Result: under adversarial inputs, they derive the minimum number of bits needed for each SRAM counter.
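In code, the hybrid idea looks roughly like this. This is a minimal sketch, assuming a largest-counter-first flush policy and an illustrative SRAM width; it is not the exact counter-management algorithm of the paper:

    # Minimal sketch of a hybrid SRAM-DRAM statistics counter (illustrative;
    # the SRAM width and the largest-counter-first flush policy are
    # assumptions, not the exact scheme from Shah et al. 2001).
    SRAM_BITS = 5                      # width of each on-chip counter

    sram = [0] * 1000                  # small, fast counters (low-order bits)
    dram = [0] * 1000                  # large, slow counters (high-order bits)

    def count_packet(flow_id: int) -> None:
        """Fast path, runs at line rate: bump the small SRAM counter."""
        sram[flow_id] += 1

    def flush_largest() -> None:
        """Slow path, runs at DRAM speed: move the fullest SRAM counter
        to DRAM before it can overflow (largest-counter-first)."""
        flow_id = max(range(len(sram)), key=lambda i: sram[i])
        dram[flow_id] += sram[flow_id]
        sram[flow_id] = 0

    def read(flow_id: int) -> int:
        """Exact count = high-order bits in DRAM + residue in SRAM."""
        return dram[flow_id] + sram[flow_id]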

Related work
- Ramabhadran and Varghese (2003) obtained a simpler version of the LCF (largest counter first) algorithm.
- Zhao et al. (2006) randomized the initial values of the SRAM counters to prevent an adversary from making several counters overflow close together.
  [Figure: SRAM counters with a counter-management algorithm (CMA) and FIFO feeding DRAM over an interconnect of speed L/S]
- Main problems of exact methods:
  - The counters cannot fit into a single SRAM.
  - We need to know the flow-to-counter association: a perfect hash function, or a fully associative memory (e.g. a CAM).

Approximate counting
- Statistical in nature: exploit the heavy-tailed (Pareto) distribution of network flow sizes, where 80% of the data is brought by the biggest 20% of the flows.
- So, quickly identify these big flows and count their packets.
- Sample and hold: Estan et al. (2004). (A sketch follows this slide.)
  [Figure: packets off the wire are classified as belonging to a large flow or not; large flows go to a counter array]
- Given the cost of memory, it strikes a good trade-off, and the flow-to-counter association problem is manageable.
- But the counts are very approximate.
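A minimal sketch of sample-and-hold, assuming a fixed per-packet sampling probability; the parameter value and table layout are illustrative, not Estan et al.'s exact design:

    import random

    # Minimal sample-and-hold sketch (illustrative; the sampling probability
    # and table layout are assumptions, not Estan et al.'s exact design).
    SAMPLE_PROB = 0.001             # chance of sampling an untracked packet
    counters: dict[str, int] = {}   # flows currently being tracked

    def on_packet(flow_id: str) -> None:
        if flow_id in counters:
            # Once a flow is "held", every subsequent packet is counted.
            counters[flow_id] += 1
        elif random.random() < SAMPLE_PROB:
            # Otherwise the flow enters the table only if sampled, so mostly
            # large (elephant) flows end up being tracked.
            counters[flow_id] = 1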

Summary
- Exact counting methods: space-intensive, complex.
- Approximate methods: focus on large flows, inaccurate.
- Problems to address:
  - Save space.
  - Get rid of the flow-to-counter association problem.

Compress Space via Braiding
- Save counter space by braiding counters: give each flow nearly exclusive LSBs, and share the MSBs.
  [Figure: per-flow LSB counters braided into shared MSB counters]

Counter Braids for Measurement (in anticipation)
- Mouse traps: many, shallow counters.
- Elephant traps: few, deep counters.
- A status bit on each mouse trap indicates overflow into the elephant traps.

Counting with CBs

Multiple hashes to get rid of the flow-to-counter association problem
- A single hash function leads to collisions; however, with two or more hash functions, one can use the redundancy to recover the flow sizes.
  [Figure: flows hashed into multiple counters; each counter holds the sum of the flows hashing to it]
- We need an efficient decoding algorithm for solving C = MF, i.e. inverting C --> F. (The write path is sketched below.)
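On the write path, each arriving packet simply increments the k counters its flow hashes to, so the counter vector is the linear transform C = MF. A minimal single-layer sketch, with the number of hashes, array size, and hash construction as illustrative choices:

    import hashlib

    # Single-layer encoding sketch: each packet increments the K counters its
    # flow hashes to, so the counter vector is C = M F.
    K = 3          # number of hash functions (illustrative)
    M = 1 << 16    # number of counters (illustrative)

    counters = [0] * M

    def hash_k(flow_id: str, k: int) -> int:
        digest = hashlib.sha256(f"{k}:{flow_id}".encode()).digest()
        return int.from_bytes(digest[:8], "big") % M

    def on_packet(flow_id: str) -> None:
        for k in range(K):
            counters[hash_k(flow_id, k)] += 1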

Decoder 1: The MLE
- Consider a single stage of counters and multiple (random) hash functions.
- Let F be the vector of flow sizes and C = MF the vector of counter values, where M is the (random) adjacency matrix of dimensions m x n, with m < n.
- Let the {f_i} be IID, and let H(F) be the entropy of the flow-size vector.
- The MLE decoder: for an instance of the problem, let F_1, ..., F_k be the list of all solutions. F_MLE is the solution which is most likely; i.e., if P_flow is the flow size distribution, then F_MLE = argmin_i D(F_i || P_flow). (Rendered as an equation below.)
- Theorem (Lu, Montanari, Prabhakar): the MLE decoder is optimal; that is, the space needed asymptotically equals H(F).
- This is interesting because C is a linear, incremental function of the data F.
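Written out as an equation, with the argmin taken over all flow vectors consistent with the counters, and D most naturally read as the Kullback-Leibler divergence between the empirical distribution of a candidate solution and P_flow:

    \[
    \hat{F}_{\mathrm{MLE}}
      \;=\; \operatorname*{arg\,min}_{F_i \,:\, C = M F_i}
            \; D\big(F_i \,\big\|\, P_{\mathrm{flow}}\big)
    \]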

Related Work
- Compressed sensing: storing sparse vectors using random linear transformations.
  - Candes and Tao, Donoho, Indyk, Muthukrishnan, Wainwright, et al.
  - Problem statement: minimize ||F||_1 subject to C = MF.
  - Main result of CS: reconstruction is exact if F is sparse.
  - But, for us: the linear transformations are not necessarily sparse, which means a lot of updating; and LP decoding has worst-case cubic complexity.
- Noiseless data compression with LDPC codes: uses regular graphs (i.e. not hash-based).
  - Caire, Shamai, Verdu; and Aji, Jin, Khandekar, MacKay, McEliece.

Practical algorithms: The Count-Min Algorithm
- This algorithm is due to Cormode and Muthukrishnan. (A sketch follows this slide.)
- Algorithm: estimate flow j's size as the minimum of the counters it hits.
  [Figure: three flows hashed into counters; their sizes are estimated as 34, 34, 32]
- Major drawbacks:
  - It needs lots of counters for accurate estimation.
  - You don't know how large the error is; in fact, you don't know whether there is an error at all.
- We shall see that applying the Turbo-principle to this algorithm gives terrific results.
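For reference, a minimal Count-Min sketch in the standard form of Cormode and Muthukrishnan, with one counter row per hash function; the sizes and hash construction here are illustrative choices:

    import hashlib

    # Minimal Count-Min sketch (illustrative sizes and hashing).
    K = 3          # hash functions (rows)
    M = 1 << 12    # counters per row

    rows = [[0] * M for _ in range(K)]

    def _h(flow_id: str, k: int) -> int:
        digest = hashlib.sha256(f"{k}:{flow_id}".encode()).digest()
        return int.from_bytes(digest[:8], "big") % M

    def update(flow_id: str, count: int = 1) -> None:
        for k in range(K):
            rows[k][_h(flow_id, k)] += count

    def estimate(flow_id: str) -> int:
        # Every counter the flow hits overcounts (collisions only add), so
        # the minimum is the tightest available upper bound on its size.
        return min(rows[k][_h(flow_id, k)] for k in range(K))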

The Turbo-principle

Example
[Figure: worked example of the iterative decoder on the Count-Min counters above; over iterations 0 through 4 the flow estimates converge to the true sizes (1, 1, 32), whereas Count-Min returns 34, 34, 32]
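The message-passing (MP) decoder used in this example can be sketched as follows. This is a simplified, node-based variant of the iteration (the published decoder passes per-edge messages); the data layout and iteration count are illustrative. Starting from the lower bound of 1 packet per flow, each pass alternately produces upper and lower bounds on every flow size, the "sandwich" of the next slides:

    def mp_decode(counters, flow_to_counters, iters=10):
        """counters: list of counter values (C).
        flow_to_counters: for each flow, the counter indices it hits."""
        n = len(flow_to_counters)
        counter_to_flows = [[] for _ in counters]
        for i, cs in enumerate(flow_to_counters):
            for a in cs:
                counter_to_flows[a].append(i)

        est = [1] * n        # lower bound: every flow has at least 1 packet
        lower = True         # does `est` currently hold lower bounds?
        for _ in range(iters):
            new = []
            for i, cs in enumerate(flow_to_counters):
                # What each counter says about flow i, after subtracting the
                # current estimates of the other flows sharing that counter.
                residuals = [counters[a] - sum(est[j]
                             for j in counter_to_flows[a] if j != i)
                             for a in cs]
                # Lower bounds on the others make residuals upper bounds,
                # and vice versa; lower bounds are clipped at 1 packet.
                new.append(min(residuals) if lower else max(1, max(residuals)))
            est, lower = new, not lower
        return est

    # Made-up example: flows of true sizes (1, 1, 33), each hitting 2 of 3
    # counters, so C = (34, 2, 34); the decoder recovers (1, 1, 33) exactly.
    print(mp_decode([34, 2, 34], [[0, 1], [1, 2], [0, 2]]))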

Properties of the MP Algorithm
- Anti-monotonicity: with initial estimates of 1 for the flow sizes, successive iterates alternately lower- and upper-bound the true flow sizes, sandwiching them.
  [Plot: flow size vs. flow index, showing the alternating bounds around the true sizes]
- Note: because of this property, estimation errors are both detectable and bounded!

When does the sandwich close?
- Answer 1: no assumption on the flow size distribution. Suppose we use k hash functions. Then, if m > k(k-1)n, the counters-flows graph becomes a tree and decoding is exact.
- Answer 2: given the flow size distribution. Using the density evolution technique of coding theory, one can show that it suffices for m > c*n, for a constant c* determined by the distribution. This means that for heavy-tailed flow sizes, where approximately 35% of flows have 1 packet, c* is roughly 0.8.

[Plot: fraction of flows incorrectly decoded vs. iteration number, threshold c* = 0.72; the error starts at Count-Min's error and is reduced over the iterations by the Turbo-principle]

The 2-stage Architecture: Counter Braids
- First stage (mouse traps): lots of shallow counters.
- Second stage (elephant traps): very few deep counters.
- First-stage counters hash into the second stage; an overflow status bit on each first-stage counter indicates whether it has overflowed into the second stage.
- If a first-stage counter overflows, it resets and counts again; the second-stage counters track the most significant bits.
- Apply the MP algorithm recursively. (The two-stage update path is sketched below.)
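A sketch of the two-stage update path described above; the counter widths, array sizes, and hash construction are illustrative assumptions, not the paper's parameters:

    import hashlib

    # Two-stage Counter Braids update sketch (illustrative parameters).
    K1, M1, BITS1 = 3, 1 << 16, 5    # stage 1: many shallow counters
    K2, M2 = 2, 1 << 10              # stage 2: very few deep counters

    stage1 = [0] * M1
    overflowed = [False] * M1        # status bit per first-stage counter
    stage2 = [0] * M2

    def _h(key: str, k: int, m: int) -> int:
        d = hashlib.sha256(f"{k}:{key}".encode()).digest()
        return int.from_bytes(d[:8], "big") % m

    def on_packet(flow_id: str) -> None:
        for k in range(K1):
            a = _h(flow_id, k, M1)
            stage1[a] += 1
            if stage1[a] == 1 << BITS1:            # shallow counter overflows:
                stage1[a] = 0                      # reset and count again...
                overflowed[a] = True               # ...set the status bit...
                for k2 in range(K2):               # ...and carry the MSBs into
                    stage2[_h(str(a), k2, M2)] += 1    # the braided 2nd stage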

Counter Braids vs. the One-layer Architecture
[Plot: fraction of flows incorrectly decoded (log scale, 10^0 down to 10^-4) vs. number of counter bits per flow (0 to 9), comparing the one-layer and two-layer architectures]

Internet Trace Simulations
- Used two OC-48 (2.5 Gbps) one-hour contiguous traces collected by CAIDA at a San Jose router, divided into twelve 5-minute segments.
  - Trace 1: 0.9 million flows and 20 million packets per segment.
  - Trace 2: 0.7 million flows and 9 million packets per segment.
- We used a total counter space of 1.28 MB and ran 50 experiments, each with different hash functions, for a total of 1200 runs. No error was observed.

Comparison (on 900,000 flows)
- Hybrid SRAM-DRAM: all flow sizes, exact. 900,000 flows counted. Counter memory: 4.5 Mbit SRAM (plus 31.5 Mbit DRAM and counter management). Flow-to-counter association: > 25 Mbit SRAM. Error: none (exact).
- Sample-and-Hold: elephant flows, approximate. 98,000 flows counted. Counter memory: 1 Mbit SRAM. Flow-to-counter association: 1.6 Mbit SRAM. Error (fractional): large flows 0.03745%, medium 1.090%, small 43.87%.
- Count-Min: all flow sizes, approximate. 900,000 flows counted. Counter memory: 10 Mbit SRAM. Flow-to-counter association: not needed. Error: Pe ~ 1, average absolute error = 24.7.
- Counter Braids: all flow sizes, exact (lossless recovery). 900,000 flows counted. Counter memory: 10 Mbit SRAM. Flow-to-counter association: not needed.

Conclusions for Counter Braids
- A cheap and accurate solution to the network traffic measurement problem, with good initial results.
- Further work:
  - Lossy compression.
  - A multi-router solution: the same flow passes through many routers.

Congestion Notification in Ethernet: Part of the IEEE 802.1 Data Center Bridging standardization effort
Berk Atikoglu, Abdul Kabbani, Balaji Prabhakar (Stanford University); Rong Pan (Cisco Systems); Mick Seaman

Background
- Switches and routers send congestion signals to end-systems to regulate the amount of network traffic.
- Two types of congestion:
  - Transient: caused by random fluctuations in the arrival rate of packets, and effectively dealt with using buffers and link-level pausing (or dropping packets, in the case of the Internet).
  - Oversubscription: caused by an increase in the applied load, either because existing flows send more traffic or (more likely) because new flows have arrived.
- We have been developing QCN (Quantized Congestion Notification), an algorithm being studied by the IEEE 802.1 Data Center Bridging group for deployment in Ethernet. (A rough sketch of the congestion-point computation follows.)
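These slides do not detail QCN's mechanics; purely for orientation, here is a sketch of a QCN-style congestion-point computation along the lines of the 802.1Qau proposals. The constants (equilibrium setpoint, weight w) and variable names are assumptions of this sketch, not taken from the talk:

    # Sketch of a QCN-style congestion-point computation (not detailed in
    # these slides; modeled on the 802.1Qau QCN proposals, with
    # illustrative constants).
    Q_EQ = 26        # equilibrium queue setpoint, in packets (assumed)
    W = 2.0          # weight on the queue's rate of change (assumed)

    q_old = 0        # queue length at the previous sample

    def sample_feedback(q_now: int) -> int | None:
        """Return a negative feedback value for the sampled packet's source,
        or None if the switch is not congested."""
        global q_old
        q_off = q_now - Q_EQ          # how far the queue is above the setpoint
        q_delta = q_now - q_old       # how fast the queue is growing
        q_old = q_now
        fb = -(q_off + W * q_delta)   # more negative = more congested
        return fb if fb < 0 else None # feedback is sent only under congestion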

Switched Ethernet vs Internet: Some significant differences
1. There is no end-to-end signaling in Ethernet a la per-packet ACKs in the Internet. So congestion must be signaled to the source by switches; it is not possible to know the round-trip time, and the algorithm is not automatically self-clocked (unlike TCP).
2. Links can be paused; i.e., packets may not be dropped.
3. There is no sequence numbering of L2 packets.
4. Sources do not start transmission gently (like TCP slow-start); they can potentially come on at the full line rate of 10 Gbps.
5. Ethernet switch buffers are much smaller than router buffers (100s of KBs vs. 100s of MBs).
6. Most importantly, the algorithm should be simple enough to be implemented completely in hardware.
- An interesting environment in which to develop a congestion control algorithm.
- QCN is derived from the earlier BCN algorithm. Closest Internet relatives: BIC-TCP at the source, a REM/PI controller at the switch.