Design and Implementation of High-Speed Network Security Systems. KyoungSoo Park, School of Electrical Engineering, KAIST. (Collaboration with many students & faculty members at KAIST)


1 Design and Implementation of High-Speed Network Security Systems
  KyoungSoo Park, School of Electrical Engineering, KAIST
  (Collaboration with many students & faculty members at KAIST)

2 Agenda
  High-performance packet processing on x86 systems
  Kargus: high-performance network IDS [CCS '12]
  FloSIS: network flow capture/retrieval system [USENIX '15]
  mOS: reusable network stack for middleboxes [NSDI '17]

3 Network Traffic Prediction for 2021
  Global IP traffic will be 127x that of 2005
  Smartphone traffic will exceed PC traffic by 2021
  Number of connected devices = 3x the global population
  Broadband speeds will double
  Business Internet traffic will grow at a faster pace than WAN traffic
  Need faster security monitoring at the edge (10/40/100G)
  Data from Cisco Visual Networking Index

4 Real Traffic Monitoring Is Complex
  Figure labels: stateful firewalls, L7 protocol analyzers, and network intrusion detection systems facing the Internet
  Complexity comes from the TCP and application layers
  Can't easily be done on programmable switches or routers
  Commodity x86 machines are attractive but slow
  => However, a solid design allows 10-40 Gbps monitoring

5 Packet Capture Technology in 2005
  Packet CAPture (PCAP) library
    Captures packets at the network card
    Still widely used (tcpdump, Ethereal, Wireshark, OmniPeek, etc.)
    Performance range: 0.4 to 6.7 Gbps (on a 12-core machine)
  Why is the PCAP library slow?
    Uses a single CPU core per NIC even on a multi-core system
    Performs worse with small packets (per-packet overhead); detailed information in PacketShader [SIGCOMM '10]
      Heavyweight kernel data structures
      Frequent memory allocation and deallocation
      Interrupt overhead

6 Packet I/O Technology in 2017
  Kernel-bypass packet I/O
    Inspired by PacketShader [SIGCOMM '10]
    DPDK, PSIO, Netmap, PF_Ring, etc.
  Common techniques
    Static buffers for packet content/metadata
    Batch processing of 32 to 4096 packets
    Full utilization of multiple CPU cores (e.g., receive-side scaling)
  Performance: ~20 Gbps per CPU core (64-byte packets)
  Figure: five CPU cores, each serving its own pair of receive queues (core i polls RxQ Ai and RxQ Bi) on 10 Gbps NIC A and 10 Gbps NIC B
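The static-buffer and batching techniques listed on this slide can be illustrated with a small sketch. It is not taken from DPDK, PSIO, or Netmap: `rx_ring`, `ring_push`, `ring_rx_batch`, and `drain_and_count_bytes` are hypothetical names, and the "NIC" is simulated by pushing packet lengths into a preallocated ring. The point is only that one call returns many packets, so the per-call overhead is amortized over the batch.

```c
#include <stddef.h>
#include <stdint.h>

#define RING_SIZE 1024   /* hypothetical ring capacity (power of two) */
#define MAX_BATCH 64     /* packets fetched per call */

struct pkt { uint16_t len; };

struct rx_ring {
    struct pkt slots[RING_SIZE]; /* preallocated once, never freed per packet */
    uint32_t head;               /* next slot to read  */
    uint32_t tail;               /* next slot to write */
};

/* Producer side, standing in for the NIC. Returns 0 if the ring is full. */
static int ring_push(struct rx_ring *r, uint16_t len)
{
    if (r->tail - r->head == RING_SIZE)
        return 0;
    r->slots[r->tail % RING_SIZE].len = len;
    r->tail++;
    return 1;
}

/* Fetch up to max packets in a single call; returns how many were fetched. */
static size_t ring_rx_batch(struct rx_ring *r, const struct pkt **out, size_t max)
{
    size_t n = 0;
    while (n < max && r->head != r->tail) {
        out[n++] = &r->slots[r->head % RING_SIZE];
        r->head++;
    }
    return n;
}

/* Drain the ring batch by batch, summing packet bytes and counting calls. */
static uint64_t drain_and_count_bytes(struct rx_ring *r, unsigned *calls)
{
    const struct pkt *batch[MAX_BATCH];
    uint64_t bytes = 0;
    size_t n;

    *calls = 0;
    while ((n = ring_rx_batch(r, batch, MAX_BATCH)) > 0) {
        (*calls)++;
        for (size_t i = 0; i < n; i++)
            bytes += batch[i]->len;
    }
    return bytes;
}
```

With 100 queued 64-byte packets, the loop above needs only two non-empty fetch calls instead of 100 per-packet calls.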

7 My Research for the Past 7 Years
  PacketShader [SIGCOMM '10]: 40G S/W router
  SSLShader [NSDI '11]: fastest SSL reverse proxy
  Kargus [CCS '12]: 33 Gbps NIDS with Snort-like ruleset
  Monbot [MobiSys '13]: 10G flow-level content analyzer
  mTCP [NSDI '14]: fastest user-level TCP stack
  FloSIS [USENIX '15]: 30 Gbps flow capture system
  APUNet [NSDI '17]: fast packet processing on AMD APU
  mOS [NSDI '17]: a reusable network stack for middleboxes


8 Kargus
  Goal: a highly scalable software-based NIDS for high-speed networks
  Bottlenecks of slow software NIDS
    Inefficient packet acquisition
    Expensive string & PCRE pattern matching
  Solutions for a fast software NIDS
    Multi-core packet acquisition
    Parallel processing & GPU offloading
  Outcome: fastest S/W signature-based IDS, 33 Gbps
    100% malicious traffic: 10 Gbps
    Real network traffic: ~25 Gbps

9 Typical Signature-based NIDS
  Example rule:
    alert tcp $EXTERNAL_NET any -> $HTTP_SERVERS 80 (msg:"possible attack attempt BACKDOOR optix runtime detection"; content:"/whitepages/page_me/100.html"; pcre:"/body=\x2521\x2521\x2521optix\s+pro\s+v\d+\x252e\d+\s+server\s+online\x2521\x2521\x2521/")
  Pipeline: Packet Acquisition -> Preprocessing (decode, flow management, reassembly) -> Multi-string Pattern Matching -> Rule Options Evaluation -> Output
    Match failure at multi-string pattern matching: innocent flow
    Match success: proceed to rule options evaluation
    Evaluation failure: innocent flow
    Evaluation success: malicious flow, passed to output
  The figure marks the bottleneck stages
  * PCRE: Perl Compatible Regular Expression

10 Pattern Matching
  CPU-intensive tasks for serial packet scanning
  Major bottlenecks
    Multi-string matching (Aho-Corasick phase)
    PCRE evaluation (if a pcre rule option exists in the rule)
  On an Intel Xeon X5680, 3.33 GHz, 12 MB L3 cache
    Aho-Corasick analyzing bandwidth per core: 2.15 Gbps
    PCRE analyzing bandwidth per core: 0.52 Gbps
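For readers unfamiliar with the multi-string matching phase, here is a minimal Aho-Corasick sketch in C. It is an illustration under simplifying assumptions, not the Kargus or Snort implementation: the state limit, the 32-pattern bitmask, and all names (`ac_init`, `ac_add`, `ac_build`, `ac_scan`) are choices made for this example. After `ac_build`, the automaton is a full DFA, so scanning costs one table lookup per payload byte regardless of how many patterns are loaded.

```c
#include <assert.h>
#include <string.h>

#define AC_MAX_STATES 256   /* enough for a handful of short patterns */
#define AC_ALPHABET   256

struct ac {
    int next[AC_MAX_STATES][AC_ALPHABET]; /* transition table (-1 = missing) */
    int fail[AC_MAX_STATES];              /* failure links */
    unsigned out[AC_MAX_STATES];          /* bitmask of patterns ending here */
    int nstates;
};

static void ac_init(struct ac *a)
{
    memset(a, 0, sizeof(*a));
    memset(a->next[0], -1, sizeof(a->next[0]));
    a->nstates = 1; /* state 0 is the root */
}

/* Insert one pattern into the trie. Returns 0, or -1 if out of states. */
static int ac_add(struct ac *a, const char *pat, int id)
{
    int s = 0;
    assert(id >= 0 && id < 32);
    for (const unsigned char *p = (const unsigned char *)pat; *p; p++) {
        if (a->next[s][*p] < 0) {
            if (a->nstates == AC_MAX_STATES)
                return -1;
            memset(a->next[a->nstates], -1, sizeof(a->next[0]));
            a->next[s][*p] = a->nstates++;
        }
        s = a->next[s][*p];
    }
    a->out[s] |= 1u << id;
    return 0;
}

/* Breadth-first pass: compute failure links and fill in every missing
 * transition, turning the trie into a complete DFA. */
static void ac_build(struct ac *a)
{
    int queue[AC_MAX_STATES], head = 0, tail = 0;

    for (int c = 0; c < AC_ALPHABET; c++) {
        int t = a->next[0][c];
        if (t < 0)
            a->next[0][c] = 0;
        else {
            a->fail[t] = 0;
            queue[tail++] = t;
        }
    }
    while (head < tail) {
        int s = queue[head++];
        a->out[s] |= a->out[a->fail[s]]; /* inherit matches of the suffix */
        for (int c = 0; c < AC_ALPHABET; c++) {
            int t = a->next[s][c];
            if (t < 0)
                a->next[s][c] = a->next[a->fail[s]][c];
            else {
                a->fail[t] = a->next[a->fail[s]][c];
                queue[tail++] = t;
            }
        }
    }
}

/* Scan a payload once; returns the bitmask of all patterns found. */
static unsigned ac_scan(const struct ac *a, const unsigned char *buf, size_t len)
{
    unsigned found = 0;
    int s = 0;
    for (size_t i = 0; i < len; i++) {
        s = a->next[s][buf[i]];
        found |= a->out[s];
    }
    return found;
}
```

The scan loop contains no data-dependent branches, which is the property that makes DFA-based matching a good fit for GPU offloading.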

11 GPU-based Pattern Matching
  GPUs
    Contain 100-1000s of SIMD processors
    Ideal for parallel data processing without branches
  DFA-based pattern matching on GPUs
    Multi-string matching using the Aho-Corasick algorithm
    PCRE matching
  Pipelined execution in CPU/GPU
    Concurrent copy and execution
  Aho-Corasick bandwidth: 2.15 Gbps -> 39 Gbps
  PCRE bandwidth: 0.52 Gbps -> 8.9 Gbps
  Figure labels: packet acquisition; preprocess; engine thread (multi-string matching offloading, rule option evaluation offloading); multi-string matching queue; PCRE matching queue; GPU dispatcher thread; GPU (multi-string matching, PCRE matching)

12 Alternative: a better pattern matching algorithm, DFC [NSDI '16]

13 Overall Architecture of Kargus
  Non-Uniform Memory Access (NUMA)-aware
  Core framework as deployed on a dual hexa-core system
  Can be configured for various NUMA set-ups accordingly
  Figure: Kargus configuration on a dual-NUMA hexa-core machine with 4 NICs and 2 GPUs

14 Innocent Traffic Performance
  40 Gbps input traffic
  2.7-4.5x faster than multi-core Snort w/ PF_Ring
  1.9x to at least 4x faster than MIDeA (state of the art before Kargus)
  Figure: throughput (Gbps) vs. packet size (bytes) for MIDeA, Snort w/ PF_Ring, Kargus CPU-only, and Kargus CPU/GPU, shown with the actual payload analyzing bandwidth

15 Malicious Traffic Performance
  5x faster than Snort
  Figure: throughput (Gbps) vs. packet size (bytes) for Kargus and Snort+PF_Ring at 25%, 50%, and 100% malicious traffic

16 FloSIS
  Goal: a high-speed flow capture & retrieval system
  Design
    A pipelined parallel processing architecture
    Two-stage flow-level indexing & retrieval
  Optimization
    Latency limitation for high zero-drop performance
    Deduplication for storage efficiency
  Results
    36 Gbps peak performance
    5 times faster retrieval
    34.5% storage saving

17 Packet-based Capture Systems
  tcpdump, Wireshark: packet capture & analysis
  n2disk*: high-speed packet recorder with indexing
  Figure: arriving packets over time and packets on disk, comparing retrieval without an index and retrieval with an index
  * Deri, L., Cardigliano, A., and Fusco, F., "10 Gbit line rate packet-to-disk using n2disk," in Proceedings of the IEEE INFOCOM Workshop on Traffic Monitoring and Analysis.

18 Flow-based Capture Systems
  Write/read a flow at one location
  Figure: arriving packets over time, flows on disk, and flow retrieval
  Real-time flow content reassembly
    Payload content is reassembled before writing
    See the content right after searching
  Improve storage efficiency by deduplication

19 Overall Architecture of FloSIS
  Figure: overall architecture and thread arrangement of FloSIS
  Hardware: 16 CPU cores, 4 x 10G NIC ports, 24 HDDs, and 2 SSDs

20 Key Ideas in FloSIS
  Pipelined parallel processing architecture
  Two-stage flow-level indexing & retrieval
  Low packet processing latency
  Direct disk I/O, sequential disk I/O
  User-level memory management
  Deduplication for storage efficiency
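The slides do not say how FloSIS deduplicates content, so the following is only a generic sketch of content-based deduplication, not the FloSIS design. The assumptions are all choices made for this example: a 64-bit FNV-1a hash as the content fingerprint, a byte-for-byte comparison to rule out hash collisions, a fixed-size table with linear lookup, and the hypothetical names `flow_store` and `store_flow`. The store keeps pointers, so the caller must keep the buffers alive.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define STORE_CAP 1024   /* hypothetical capacity of the in-memory index */

struct flow_store {
    const void *data[STORE_CAP]; /* stored payloads (owned by the caller) */
    size_t len[STORE_CAP];
    uint64_t hash[STORE_CAP];    /* content fingerprint of each payload */
    int count;
    size_t bytes_written;        /* bytes that actually went to storage */
    size_t bytes_saved;          /* bytes skipped thanks to deduplication */
};

/* 64-bit FNV-1a hash of a buffer. */
static uint64_t fnv1a64(const void *buf, size_t len)
{
    const unsigned char *p = buf;
    uint64_t h = 14695981039346656037ULL;
    for (size_t i = 0; i < len; i++) {
        h ^= p[i];
        h *= 1099511628211ULL;
    }
    return h;
}

/* Store a flow payload. Returns the index of the stored copy, reusing an
 * existing entry when identical content is already present; -1 if full. */
static int store_flow(struct flow_store *s, const void *data, size_t len)
{
    uint64_t h = fnv1a64(data, len);

    for (int i = 0; i < s->count; i++) {
        if (s->hash[i] == h && s->len[i] == len &&
            memcmp(s->data[i], data, len) == 0) {
            s->bytes_saved += len; /* duplicate: only a reference is kept */
            return i;
        }
    }
    if (s->count == STORE_CAP)
        return -1;
    s->data[s->count] = data;
    s->len[s->count] = len;
    s->hash[s->count] = h;
    s->bytes_written += len;
    return s->count++;
}
```

A production system would replace the linear scan with a hash index and would likely use a stronger fingerprint; the sketch only shows where the storage saving comes from.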

21 Packet Capture & Dumping Performance
  Stress test with 40 Gbps of synthetic packet workload
  H/W limit: sequential disk writing throughput
  tcpdump is not optimized for multicore environments
  n2disk10g and FloSIS show comparable performance

22 Real Traffic Processing Performance
  Peak & zero-drop performance with real traffic trace replay (LTE traffic), 10 Gbps
  Peak performance is restricted by the sequential disk writing performance
  FloSIS shows higher zero-drop performance than n2disk10g

23 Flow-level Query Response Time
  Elapsed time for processing retrieval
  Search the flows by querying 10 randomly selected source IP addresses
  FloSIS is 2-5 times faster than n2disk10g
  Figure: elapsed time (sec) per source IP address query for n2disk10g and FloSIS, annotated with the number of flows, the number of disk accesses, and the retrieval processing time
  Flows returned per query, as extracted: ,768; 1,668; 1,693; 185; 318; 276; 119; 137; 21,300; 4,255

24 Most Middleboxes Deal with TCP Traffic
  TCP dominates the Internet
    95+% of traffic is TCP [1] (figure: traffic share of TCP, UDP, etc.)
  Flow-processing middleboxes
    Stateful firewalls
    Protocol analyzers
    Cellular data accounting
    Intrusion detection/prevention systems
    Network address translation
    And many others!
  TCP state management is complex and error-prone!
  [1] Comparison of Caching Strategies in Modern Cellular Backhaul Networks, ACM MobiSys

25 Example: Cellular Data Accounting System
  Custom middlebox application
  No open-source projects
  In South Korea, cellular ISPs must not charge for TCP retransmission packets
  Figure: Internet, gateway, data accounting system, cellular core network
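To make the retransmission rule concrete, here is a small C sketch that bills only bytes advancing a flow's sequence space, so a retransmitted segment adds nothing. It is a simplification written for this example, not the accounting system from the talk: `flow_acct` and `account_segment` are hypothetical names, only one direction of a flow is tracked, and out-of-order delivery is ignored (a segment that later fills a hole would not be billed).

```c
#include <stdint.h>

struct flow_acct {
    uint32_t next_seq; /* first sequence number not yet billed */
    uint64_t billed;   /* total billed payload bytes */
    int started;       /* set once the first segment is seen */
};

/* Account one TCP segment (sequence number and payload length).
 * Returns the number of newly billed bytes, 0 for a pure retransmission. */
static uint32_t account_segment(struct flow_acct *f, uint32_t seq, uint32_t len)
{
    uint32_t end = seq + len; /* wraps naturally modulo 2^32 */
    int32_t fresh;

    if (!f->started) {
        f->next_seq = seq;
        f->started = 1;
    }
    /* Signed distance keeps the comparison correct across wraparound. */
    fresh = (int32_t)(end - f->next_seq);
    if (fresh <= 0)
        return 0;                /* everything was already billed */
    if ((uint32_t)fresh > len)
        fresh = (int32_t)len;    /* gap before seq: bill this segment only */
    f->next_seq = end;
    f->billed += (uint32_t)fresh;
    return (uint32_t)fresh;
}
```

A partially overlapping retransmission is billed only for its new tail, which is the behavior a regulator-compliant accounting middlebox needs.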

26 Programming TCP End-Host Application
  Typical TCP end-host applications
    User level: TCP application on top of the Berkeley socket API
    Kernel level: TCP/IP stack
  Typical TCP middleboxes?
    Middlebox logic, packet processing, flow state tracking, and flow reassembly mixed together
    Spaghetti code? No clear separation!
  Berkeley socket API
    Nice abstraction that separates flow management from the application
    You write better code if you know TCP internals
    Never requires you to write the TCP stack itself

27 mOS Networking Stack
  Reusable networking stack for middleboxes
    Programming abstraction and APIs for developers
  Key concepts
    Separation of flow management from custom logic
    Event-based middlebox development (event/action)
    Per-flow flexible resource consumption
  Benefits
    Clean, modular development of stateful middleboxes
    Developers focus on core logic rather than flow management
    High-performance flow management on the mTCP stack

28 Key Abstraction: mOS Monitoring Socket
  Represents the middlebox viewpoint on network traffic
  Monitors both TCP connections and IP packets
  Provides an API similar to the Berkeley socket API
  Figure: packets enter the mOS stack, which keeps the flow context and generates events; above the mOS socket API, the custom middlebox logic holds the monitoring socket, user context, and custom event handlers
  Separation of flow management from custom middlebox logic!

29 Key Abstraction: mOS Events
  An event is a notable condition that merits middlebox processing
  Built-in event (BE)
    Events that happen naturally in TCP processing
    e.g., packet arrival, TCP connection start/teardown, retransmission
  User-defined event (UDE)
    Users can define their own events (= base event + filter function)
  Examples from the figure
    New data arrival (built-in) + filter (HTTP request) = HTTP request arrival (UDE)
    Packet arrival (built-in) + filter (ACK packet) = ACK packet arrival (UDE)
    ACK packet arrival (UDE) + filter (counter) = 3 duplicate ACK arrival (UDE)
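The "base event + filter function" idea can be sketched without the mOS library. The code below is a stand-alone illustration, not the mOS API: `pkt_view`, `user_event`, `on_packet_arrival`, and `three_dupacks` are hypothetical names invented for this example. A user-defined event fires only when its filter returns true on an occurrence of the base event; here the filter is a per-flow counter that detects the third duplicate ACK.

```c
#include <stdbool.h>
#include <stdint.h>

/* Minimal view of a packet, enough for the duplicate-ACK filter. */
struct pkt_view {
    uint32_t ack;          /* acknowledgment number */
    uint16_t payload_len;  /* 0 for a pure ACK */
    bool is_ack;
};

/* Per-flow context kept by the filter (hypothetical layout). */
struct dupack_ctx {
    uint32_t last_ack;
    int dup_count;
    bool seen_ack;
};

typedef bool (*filter_fn)(void *ctx, const struct pkt_view *pkt);

/* A user-defined event: a filter attached to a base event. */
struct user_event {
    filter_fn filter;
    void *ctx;
    int fired;  /* how many times the derived event was raised */
};

/* Filter: true exactly when the third duplicate ACK arrives. */
static bool three_dupacks(void *c, const struct pkt_view *p)
{
    struct dupack_ctx *ctx = c;

    if (!p->is_ack || p->payload_len > 0)
        return false;                 /* not a pure ACK */
    if (ctx->seen_ack && p->ack == ctx->last_ack)
        return ++ctx->dup_count == 3; /* fires once, on the third duplicate */
    ctx->last_ack = p->ack;           /* new ACK number: restart counting */
    ctx->dup_count = 0;
    ctx->seen_ack = true;
    return false;
}

/* Base event handler: called for every packet arrival. */
static void on_packet_arrival(struct user_event *ev, const struct pkt_view *pkt)
{
    if (ev->filter(ev->ctx, pkt))
        ev->fired++;  /* a real stack would invoke the registered callback */
}
```

Feeding the same pure ACK four times (the original plus three duplicates) raises the derived event once; further duplicates do not raise it again until the ACK number changes.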


30 mOS Flow Management
  Dual TCP stack management
    Infer the states of both the client and server TCP stacks
  Figure: mOS stack emulation between a TCP client and a TCP server exchanging SYN, SYN/ACK, and DATA/ACK
    Client-side TCP stack state: CLOSED -> SYN_SENT -> ESTABLISHED
    Server-side TCP stack state (with receive buffer): LISTEN -> SYN_RCVD -> ESTABLISHED
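The state inference shown on this slide can be approximated with a few lines of C. This is a hedged sketch of the idea, not the mOS implementation: it covers only the three-way handshake, ignores teardown, retransmission, and sequence number validation, and the names `flow`, `flow_init`, and `flow_observe` are hypothetical. Each observed packet first updates the inferred state of its sender and then the state its receiver will reach on delivery.

```c
enum tcp_state { CLOSED, LISTEN, SYN_SENT, SYN_RCVD, ESTABLISHED };
enum side { FROM_CLIENT, FROM_SERVER };

#define F_SYN 0x02u  /* TCP header flag bits */
#define F_ACK 0x10u

/* Middlebox-side view of one connection: one emulated state per endpoint. */
struct flow {
    enum tcp_state client;
    enum tcp_state server;
};

static void flow_init(struct flow *f)
{
    f->client = CLOSED;
    f->server = LISTEN;
}

/* Update both emulated endpoints from one observed packet. */
static void flow_observe(struct flow *f, enum side from, unsigned flags)
{
    if (from == FROM_CLIENT) {
        if ((flags & F_SYN) && !(flags & F_ACK) && f->client == CLOSED) {
            f->client = SYN_SENT;          /* client sent SYN */
            if (f->server == LISTEN)
                f->server = SYN_RCVD;      /* server will receive it */
        } else if ((flags & F_ACK) && f->client == ESTABLISHED &&
                   f->server == SYN_RCVD) {
            f->server = ESTABLISHED;       /* final ACK of the handshake */
        }
    } else {
        if ((flags & F_SYN) && (flags & F_ACK) && f->client == SYN_SENT)
            f->client = ESTABLISHED;       /* client receives SYN/ACK */
    }
}
```

Replaying SYN, SYN/ACK, and ACK through `flow_observe` walks the client through CLOSED, SYN_SENT, ESTABLISHED and the server through LISTEN, SYN_RCVD, ESTABLISHED, matching the state sequences in the figure.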

31 Current mOS Stack API
  Socket creation and traffic filter
    int mtcp_socket(mctx_t mctx, int domain, int type, int protocol);
    int mtcp_close(mctx_t mctx, int sock);
    int mtcp_bind_monitor_filter(mctx_t mctx, int sock, monitor_filter_t ft);
  User-defined event management
    event_t mtcp_define_event(event_t ev, FILTER filt);
    int mtcp_register_callback(mctx_t mctx, int sock, event_t ev, int hook, CALLBACK cb);
  Per-flow user-level context management
    void *mtcp_get_uctx(mctx_t mctx, int sock);
    void mtcp_set_uctx(mctx_t mctx, int sock, void *uctx);
  Flow data reading
    ssize_t mtcp_peek(mctx_t mctx, int sock, int side, char *buf, size_t len);
    ssize_t mtcp_ppeek(mctx_t mctx, int sock, int side, char *buf, size_t count, off_t seq_off);

32 Current mOS Stack API
  Packet information retrieval and modification
    int mtcp_getlastpkt(mctx_t mctx, int sock, int side, struct pkt_info *pinfo);
    int mtcp_setlastpkt(mctx_t mctx, int sock, int side, off_t offset, byte *data, uint16_t datalen, int option);
  Flow information retrieval and flow attribute modification
    int mtcp_getsockopt(mctx_t mctx, int sock, int l, int name, void *val, socklen_t *len);
    int mtcp_setsockopt(mctx_t mctx, int sock, int l, int name, void *val, socklen_t len);
  Retrieve end-node IP addresses
    int mtcp_getpeername(mctx_t mctx, int sock, struct sockaddr *addr, socklen_t *addrlen);
  Per-thread context management
    mctx_t mtcp_create_context(int cpu);
    int mtcp_destroy_context(mctx_t mctx);
  Initialization
    int mtcp_init(const char *mos_conf_fname);

33 Sample Code: Initialization
    static void thread_init(mctx_t mctx)
    {
        monitor_filter ft = {0};
        int msock;
        event_t http_event;

        msock = mtcp_socket(mctx, AF_INET, MOS_SOCK_MONITOR_STREAM, 0);

        /* Set up a traffic filter in Berkeley packet filter (BPF) syntax */
        ft.stream_syn_filter = "dst net and dst port 80";
        mtcp_bind_monitor_filter(mctx, msock, &ft);

        /* Use a built-in event that monitors each TCP connection start */
        mtcp_register_callback(mctx, msock, MOS_ON_CONN_START, MOS_HK_SND, on_flow_start);

        /* Define a user-defined event that detects an HTTP request */
        http_event = mtcp_define_event(MOS_ON_CONN_NEW_DATA, chk_http_request);
        mtcp_register_callback(mctx, msock, http_event, MOS_HK_RCV, on_http_request);
    }

34 UDE Filter Function
  Called whenever the base event is triggered
  If it returns TRUE, the UDE callback function is called
    static bool chk_http_request(mctx_t m, int sock, int side, event_t event)
    {
        struct httpbuf *p;
        char *temp;
        int r;

        if (side != MOS_SIDE_SVR)  /* monitor only the server-side buffer */
            return false;
        if ((p = mtcp_get_uctx(m, sock)) == NULL) {
            p = calloc(1, sizeof(struct httpbuf));  /* user-level structure */
            if (p == NULL)
                return false;
            mtcp_set_uctx(m, sock, p);
        }
        /* Append newly arrived bytes, leaving room for the terminating NUL */
        r = mtcp_peek(m, sock, side, p->buf + p->len, REQMAX - p->len - 1);
        if (r <= 0)  /* nothing read or error: do not touch the buffer */
            return false;
        p->len += r;
        p->buf[p->len] = 0;
        /* The request header is complete once an empty line is seen */
        if ((temp = strstr(p->buf, "\n\n")) ||
            (temp = strstr(p->buf, "\r\n\r\n"))) {
            p->reqlen = temp - p->buf;
            return true;
        }
        return false;
    }

35 mOS Stack Implementation
  Per-thread library TCP stack
    ~26K lines of C code (mTCP [1]: ~11K lines)
  Shared-nothing parallel architecture
  Figure: CPU cores 1 to N each run their own application and mOS core (sender TCP stack, receiver TCP stack) over packet I/O with separate Rx and Tx; the NIC distributes flows to cores with symmetric RSS
  [1] mTCP: a highly scalable user-level TCP stack for multicore systems, NSDI '14
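A shared-nothing design only works if both directions of a connection reach the same core, which is what symmetric RSS provides in the NIC. The sketch below shows the symmetry property in software with a plain XOR-and-mix hash; it is a substitute chosen for brevity, not the Toeplitz hash that NICs implement and not code from mOS, and `flow_to_core` is a hypothetical name. The caller must pass `ncores > 0`.

```c
#include <stdint.h>

/* Map a flow to a core so that (src, dst) and (dst, src) agree.
 * XOR is commutative, so swapping the endpoints leaves the inputs
 * to the mixing step unchanged. */
static unsigned flow_to_core(uint32_t sip, uint32_t dip,
                             uint16_t sport, uint16_t dport,
                             unsigned ncores)
{
    uint32_t h = (sip ^ dip) ^ ((uint32_t)(sport ^ dport) * 0x9E3779B1u);

    /* Cheap integer mixing to spread nearby inputs across cores. */
    h ^= h >> 16;
    h *= 0x85EBCA6Bu;
    h ^= h >> 13;
    return h % ncores;
}
```

Because client-to-server and server-to-client packets land on the same core, each core can keep both emulated TCP stacks of a flow in its own memory without locks.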

36 Real-world mOS Applications
  msnort (2,104 lines modified from 80K lines): Snort with mOS flow management
  mabacus (561 lines vs. [...] lines): secure cellular data accounting system
  mhalfback proxy (128 lines): low-latency proxy with proactive TCP retransmission
  mprads (615 lines modified from 11K): pattern matching on flow-reassembled content
  mndpi (765 lines modified from 25K): pattern matching on flow-reassembled content
  midstat: netstat for middleboxes
  mnat: high-performance NAT

37 Performance Scalability on Multicores
  File download traffic with 192K concurrent flows
    Each flow downloads X-byte content in one TCP connection
    A new flow is spawned when a flow terminates
  Two simple applications
    Counting packets per flow (packet arrival event)
    Searching for a string in flow-reassembled data (full flow reassembly & DPI)
  Performance scales linearly as the number of cores increases
  Figure: throughput (Gbps) vs. number of CPU cores for counting packets and searching for a string, with 64B and 8KB files; values labeled in the figure include 53.0 and 42.5 Gbps

38 Dynamic Event Registration Evaluation
  Monitor 192K concurrent flows
    Flow size: 4KB
  Searching for a string in flow-reassembled data
    Dynamically register a new event when the target string is found
    50% of client flows have target strings
  Figure: throughput (Gbps) vs. number of event nodes in the tree, naïve implementation vs. mOS

39 mOS Stack Is Open-Sourced
  Public release of the mOS stack/API on GitHub
  mOS online manual

40 Conclusion
  Fast packet processing on commodity machines
  Scalable system architecture is important
    Kargus & FloSIS
  A simple programming model is needed for middleboxes
    mOS stack and API
  I can't build your system for you, but I can offer consulting


Developing Stateful Middleboxes with the mos API KYOUNGSOO PARK & YOUNGGYOUN MOON ASIM JAMSHED, DONGHWI KIM, & DONGSU HAN SCHOOL OF ELECTRICAL ENGINEERING, KAIST Network Middlebox Networking devices that