소프트웨어기반고성능침입탐지시스템설계및구현

Similar documents
초고속네트워크보안시스템설계및구현. KyoungSoo Park School of Electrical Engineering, KAIST. (Collaboration with many students & faculty members at KAIST)

PacketShader: A GPU-Accelerated Software Router

PacketShader as a Future Internet Platform

The Power of Batching in the Click Modular Router

G-NET: Effective GPU Sharing In NFV Systems

GASPP: A GPU- Accelerated Stateful Packet Processing Framework

Recent Advances in Software Router Technologies

Accelerating String Matching Algorithms on Multicore Processors Cheng-Hung Lin

Ruler: High-Speed Packet Matching and Rewriting on Network Processors

GPGPU introduction and network applications. PacketShaders, SSLShader

Accelerating String Matching Using Multi-threaded Algorithm

A Capability-Based Hybrid CPU/GPU Pattern Matching Algorithm for Deep Packet Inspection

Fast packet processing in the cloud. Dániel Géhberger Ericsson Research

Using (Suricata over) PF_RING for NIC-Independent Acceleration

Intel PRO/1000 PT and PF Quad Port Bypass Server Adapters for In-line Server Appliances

Hermes: An Integrated CPU/GPU Microarchitecture for IP Routing

NetSlices: Scalable Mul/- Core Packet Processing in User- Space

Supra-linear Packet Processing Performance with Intel Multi-core Processors

LEoNIDS: a Low-latency and Energyefficient Intrusion Detection System

Learning with Purpose

Deep Packet Inspection of Next Generation Network Devices

RouteBricks: Exploiting Parallelism To Scale Software Routers

GASPP: A GPU-Accelerated Stateful Packet Processing Framework

GPU > CPU. FOR HIGH PERFORMANCE COMPUTING PRESENTATION BY - SADIQ PASHA CHETHANA DILIP

DPDK Summit China 2017

Switch and Router Design. Packet Processing Examples. Packet Processing Examples. Packet Processing Rate 12/14/2011

Developing Stateful Middleboxes with the mos API KYOUNGSOO PARK & YOUNGGYOUN MOON

Rhythm: Harnessing Data Parallel Hardware for Server Workloads

Advanced Computer Networks. End Host Optimization

Disclaimer This presentation may contain product features that are currently under development. This overview of new technology represents no commitme

High Performance Packet Processing with FlexNIC

Dynamic Fine Grain Scheduling of Pipeline Parallelism. Presented by: Ram Manohar Oruganti and Michael TeWinkle

Rule Hashing for Efficient Packet Classification in Network Intrusion Detection

NBA (Network Balancing Act): A High-performance Packet Processing Framework for Heterogeneous Processors

MiAMI: Multi-Core Aware Processor Affinity for TCP/IP over Multiple Network Interfaces

Research on DPDK Based High-Speed Network Traffic Analysis. Zihao Wang Network & Information Center Shanghai Jiao Tong University

BUILDING A NEXT-GENERATION FIREWALL

Experiences in Building a 100 Gbps (D)DoS Traffic Generator

Goverlan Reach Server Hardware & Operating System Guidelines

ASPERA HIGH-SPEED TRANSFER. Moving the world s data at maximum speed

LegUp: Accelerating Memcached on Cloud FPGAs

Project Proposal. ECE 526 Spring Modified Data Structure of Aho-Corasick. Benfano Soewito, Ed Flanigan and John Pangrazio

Networking at the Speed of Light

Network Processors. Nevin Heintze Agere Systems

ntop Users Group Meeting

Optimizing Performance: Intel Network Adapters User Guide

Suricata IDPS and Linux kernel

PARA-SNORT : A MULTI-THREAD SNORT ON MULTI-CORE IA PLATFORM

Today s Paper. Routers forward packets. Networks and routers. EECS 262a Advanced Topics in Computer Systems Lecture 18

Chapter 7. Network Intrusion Detection and Analysis. SeoulTech UCS Lab (Daming Wu)

Improving the Database Logging Performance of the Snort Network Intrusion Detection Sensor

A Next Generation Home Access Point and Router

Feature Rich Flow Monitoring with P4

High-Speed Network Processors. EZchip Presentation - 1

SWM: Simplified Wu-Manber for GPU-based Deep Packet Inspection

Reducing CPU and network overhead for small I/O requests in network storage protocols over raw Ethernet

Cisco Wide Area Application Services (WAAS) Mobile

Evaluation of the Chelsio T580-CR iscsi Offload adapter

WHITE PAPER: BEST PRACTICES. Sizing and Scalability Recommendations for Symantec Endpoint Protection. Symantec Enterprise Security Solutions Group

CEC 450 Real-Time Systems

Accelerate Applications Using EqualLogic Arrays with directcache

Be Fast, Cheap and in Control with SwitchKV. Xiaozhou Li

RDMA and Hardware Support

Disclaimer This presentation may contain product features that are currently under development. This overview of new technology represents no commitme

Intel Many Integrated Core (MIC) Matt Kelly & Ryan Rawlins

TriBiCa: Trie Bitmap Content Analyzer for High-Speed Network Intrusion Detection

PERG-Rx: An FPGA-based Pattern-Matching Engine with Limited Regular Expression Support for Large Pattern Database. Johnny Ho

Evaluating the Suitability of Server Network Cards for Software Routers

EECS 122: Introduction to Computer Networks Switch and Router Architectures. Today s Lecture

BlackBerry AtHoc Networked Crisis Communication Capacity Planning Guidelines. AtHoc SMS Codes

Design and Implementation of Virtual TAP for Software-Defined Networks

TRex Realistic Traffic Generator

Introduction to Multicore architecture. Tao Zhang Oct. 21, 2010

It's kind of fun to do the impossible with DPDK Yoshihiro Nakajima, Hirokazu Takahashi, Kunihiro Ishiguro, Koji Yamazaki NTT Labs

Performance Tuning Snort. Steve Sturges SANS Incident Detection Summit 9 December

Accelerating 4G Network Performance

Network Design Considerations for Grid Computing

IsoStack Highly Efficient Network Processing on Dedicated Cores

Project Proposal. ECE 526 Spring Modified Data Structure of Aho-Corasick. Benfano Soewito, Ed Flanigan and John Pangrazio

Software and Tools for HPE s The Machine Project

Administrivia. HW0 scores, HW1 peer-review assignments out. If you re having Cython trouble with HW2, let us know.

NFS/RDMA over 40Gbps iwarp Wael Noureddine Chelsio Communications

Chapter 1. Introduction

Today s Paper. Routers forward packets. Networks and routers. EECS 262a Advanced Topics in Computer Systems Lecture 18

Suricata Performance with a S like Security

Evolution of the netmap architecture

Chapter 4. Routers with Tiny Buffers: Experiments. 4.1 Testbed experiments Setup

ntop Users Group Meeting

Gen 6 Fibre Channel Evaluation of Products from Emulex and Brocade

100% PACKET CAPTURE. Intelligent FPGA-based Host CPU Offload NIC s & Scalable Platforms. Up to 200Gbps

Hash-Based String Matching Algorithm For Network Intrusion Prevention systems (NIPS)

Configurable String Matching Hardware for Speeding up Intrusion Detection

Parallel Computing. Hwansoo Han (SKKU)

New Approach to OVS Datapath Performance. Founder of CloudNetEngine Jun Xiao

Improving Packet Processing Performance of a Memory- Bounded Application

Scalable Distributed Training with Parameter Hub: a whirlwind tour

Authenticated Storage Using Small Trusted Hardware Hsin-Jung Yang, Victor Costan, Nickolai Zeldovich, and Srini Devadas

Reliably Scalable Name Prefix Lookup! Haowei Yuan and Patrick Crowley! Washington University in St. Louis!! ANCS 2015! 5/8/2015!

Snort Virtual Network Function with DPI Service

W H I T E P A P E R. Comparison of Storage Protocol Performance in VMware vsphere 4

Transcription:

소프트웨어기반고성능침입탐지시스템설계및구현 KyoungSoo Park Department of Electrical Engineering, KAIST M. Asim Jamshed *, Jihyung Lee*, Sangwoo Moon*, Insu Yun *, Deokjin Kim, Sungryoul Lee, Yung Yi* Department of Electrical Engineering, KAIST Cyber R&D Division, NSRI

Network Intrusion Detection Systems (NIDS) Detect known malicious activities Port scans, SQL injections, buffer overflows, etc. Deep packet inspection Detect malicious signatures (rules) in each packet Desirable features High performance (> 10Gbps) with precision Easy maintenance Frequent ruleset updates NIDS Attack Internet 2

Hardware vs. Software H/W-based NIDS Specialized hardware ASIC, TCAM, etc. High performance Expensive Annual servicing costs Low flexibility S/W-based NIDS Commodity machines High flexibility Low performance DDoS/packet drops IDS/IPS Sensors (10s of Gbps) ~ US$ 20,000-60,000 IDS/IPS M8000 (10s of Gbps) ~ US$ 10,000-24,000 Open-source S/W ~2 Gbps 3

Goals High performance S/W-based NIDS Commodity machines High flexibility 4

Typical Signature-based NIDS Architecture alert tcp $EXTERNAL_NET any -> $HTTP_SERVERS 80 (msg: possible attack attempt BACKDOOR optix runtime detection"; content:"/whitepages/page_me/100.html"; pcre:"/body=\x2521\x2521\x2521optix\s+pro\s+v\d+\x252e\d+\s+server\s+online\x2521\x2521\x2521/") Packet Acquisition Preprocessing Multi-string Pattern Matching Match Success Rule Options Evaluation Evaluation Success Output Decode Flow management Reassembly Match Failure (Innocent Flow) Evaluation Failure (Innocent Flow) Malicious Flow Bottlenecks * PCRE: Perl Compatible Regular Expression 5

Contributions Goal A highly-scalable software-based NIDS for high-speed network Slow software NIDS Bottlenecks Inefficient packet acquisition Expensive string & PCRE pattern matching Fast software NIDS Solutions Multi-core packet acquisition Parallel processing & GPU offloading Outcome Fastest S/W signature-based IDS: 33Gbps 100% malicious traffic: 10 Gbps Real network traffic: ~25 Gbps 6

Challenge 1: Packet Acquisition Default packet module: Packet CAPture (PCAP) library Unsuitable for multi-core environment Poor performance More power consumption Multi-core packet capture library is required Packet RX bandwidth * 0.4-6.7 Gbps CPU utilization 100 % [Core 1] [Core 2] [Core 3] [Core 4] [Core 5] [Core 7] [Core 8] [Core 9] [Core 10] [Core 11] 10 Gbps NIC A 10 Gbps NIC B 10 Gbps NIC C 10 Gbps NIC D * Intel Xeon X5680, 3.33 GHz, 12 MB L3 Cache 7

Solution: PacketShader I/O PacketShader I/O Uniformly distributes packets based on flow info by RSS hashing Source/destination IP addresses, port numbers, protocol-id 1 core can read packets from RSS queues of multiple NICs Reads packets in batches (32 ~ 4096) Symmetric Receive-Side Scaling (S-RSS) Passes packets of 1 connection to the same queue [Core 1] [Core 2] [Core 3] [Core 4] [Core 5] Packet RX bandwidth 0.4-6.7 Gbps 40 Gbps Rx Q A1 Rx Q B1 Rx Q A2 Rx Q B2 10 Gbps NIC A Rx Q A3 Rx Q B3 Rx Q A4 Rx Q B4 10 Gbps NIC B Rx Q A5 Rx Q B5 CPU utilization 100 % 16-29% * S. Han et al., PacketShader: a GPU-accelerated software router, ACM SIGCOMM 2010 8

Challenge 2: Pattern Matching CPU intensive tasks for serial packet scanning Major bottlenecks Multi-string matching (Aho-Corasick phase) PCRE evaluation (if pcre rule option exists in rule) On an Intel Xeon X5680, 3.33 GHz, 12 MB L3 Cache Aho-Corasick analyzing bandwidth per core: 2.15 Gbps PCRE analyzing bandwidth per core: 0.52 Gbps 9

Solution: GPU for Pattern Matching GPUs Containing 100s of SIMD processors 512 cores for NVIDIA GTX 580 Ideal for parallel data processing without branches DFA-based pattern matching on GPUs Multi-string matching using Aho-Corasick algorithm PCRE matching Pipelined execution in CPU/GPU Concurrent copy and execution Engine Thread Aho-Corasick bandwidth 2.15 Gbps 39 Gbps PCRE bandwidth 0.52 Gbps 8.9 Gbps GPU Packet Acquisition Preprocess Multi-string Matching Rule Option Evaluation Multi-string Matching GPU Dispatcher Thread Offloading Offloading PCRE Matching Multi-string Matching Queue PCRE Matching Queue 10

Optimization 1: IDS Architecture How to best utilize the multi-core architecture? Pattern matching is the eventual bottleneck Function Time % Module acsmsearchsparsedfa_full 51.56 multi-string matching List_GetNextState 13.91 multi-string matching msearch 9.18 multi-string matching in_chksum_tcp 2.63 preprocessing * GNU gprof profiling results Run entire engine on each core 11

Solution: Single-process Multi-thread Runs multiple IDS engine threads & GPU dispatcher threads concurrently Shared address space Less GPU memory consumption Higher GPU utilization & shorter service latency GPU memory usage 1/6 Single thread pinned at core 1 Core 6 GPU Dispatcher Thread Rule Option Evaluation Rule Option Evaluation Rule Option Evaluation Rule Option Evaluation Rule Option Evaluation Multi-string Matching Multi-string Matching Multi-string Matching Multi-string Matching Multi-string Matching Preprocess Preprocess Preprocess Preprocess Preprocess Packet Acquisition Packet Acquisition Packet Acquisition Packet Acquisition Packet Acquisition Core 1 Core 2 Core 3 Core 4 Core 5 12

Architecture Non Uniform Memory Access (NUMA)-aware Core framework as deployed in dual hexa-core system Can be configured to various NUMA set-ups accordingly Kargus configuration on a dual NUMA hexanode machine having 4 NICs, and 2 GPUs 13

Optimization 2: GPU Usage Caveats Use: Long per-packet processing latency: Buffering in GPU dispatcher More power consumption NVIDIA GTX 580: 512 cores CPU when ingress rate is low (idle GPU) GPU when ingress rate is high 14

Solution: Dynamic Load Balancing Load balancing between CPU & GPU Reads packets from NIC queues per cycle Analyzes smaller # of packets at each cycle (a < b < c) Increases analyzing rate if queue length increases Activates GPU if queue length increases Packet latency with GPU : 640 μsecs CPU: 13 μsecs a b c Internal packet queue (per engine) CPU GPU CPU GPU α β γ a b c Queue Length 15

Optimization 3: Batched Processing Huge per-packet processing overhead > 10 million packets per second for small-sized packets at 10 Gbps reduces overall processing throughput Function call batching Reads group of packets from RX queues at once Pass the batch of packets to each function Decode(p) Preprocess(p) Multistring_match(p) 2x faster processing rate Decode(list-p) Preprocess(list-p) Multistring_match(list-p) 16

Kargus Specifications 12 GB DRAM (3GB x 4) $100 NUMA node 1 Intel X5680 3.33 GHz (hexacore) 12 MB L3 NUMA-Shared Cache $1,210 NUMA node 2 NVIDIA GTX 580 GPU $370 $512 Intel 82599 Gigabit Ethernet Adapter (dual port) Total Cost (incl. serverboard) = ~$7,000 17

IDS Benchmarking Tool Generates packets at line rate (40 Gbps) Random TCP packets (innocent) Attack packets are generated by attack rule-set Support packet replay using PCAP files Useful for performance evaluation 18

Kargus Performance Evaluation Micro-benchmarks Input traffic rate: 40 Gbps Evaluate Kargus (~3,000 HTTP rules) against: Kargus-CPU-only (12 engines) Snort with PF_RING MIDeA * Refer to the paper for more results * G. Vasiliadis et al., MIDeA: a multi-parallel intrusion detection architecture, ACM CCS 11 19

Throughput (Gbps) Innocent Traffic Performance 2.7-4.5x faster than Snort 1.9-4.3x faster than MIDeA 35 30 25 20 MIDeA Snort w/ PF_Ring Kargus CPU-only Kargus CPU/GPU 15 10 5 0 64 218 256 818 1024 1518 Packet size (Bytes) Actual payload analyzing bandwidth 20

Throughput (Gbps) Malicious Traffic Performance 5x faster than Snort 35 30 25 20 Kargus, 25% Kargus, 50% Kargus, 100% Snort+PF_Ring, 25% Snort+PF_Ring, 50% Snort+PF_Ring, 100% 15 10 5 0 64 256 1024 1518 Packet size (Bytes) 21

Real Network Traffic Three 10Gbps LTE backbone traces of a major ISP in Korea: Time duration of each trace: 30 mins ~ 1 hour TCP/IPv4 traffic: 84 GB of PCAP traces 109.3 million packets 845K TCP sessions Total analyzing rate: 25.2 Gbps Bottleneck: Flow Management (preprocessing) 22

Power Consumption (Watts) Effects of Dynamic GPU Load Balancing Varying incoming traffic rates Packet size = 1518 B 900 850 800 750 700 650 600 550 500 450 400 Kargus w/o LB (polling) Kargus w/o LB Kargus w/ LB 20% 8.7% 0 5 10 20 33 Offered Incoming Traffic (Gbps) [Packet Size: 1518 B] 23

Conclusion Software-based NIDS: Based on commodity hardware Competes with hardware-based counterparts 5x faster than previous S/W-based NIDS Power efficient Cost effective > 25 Gbps (real traffic) > 33 Gbps (synthetic traffic) US $~7,000/- 24

Thank You fast-ids@list.ndsl.kaist.edu https://shader.kaist.edu/kargus/ 25