Bringing&the&Performance&to&the& Cloud &

Size: px
Start display at page:

Download "Bringing&the&Performance&to&the& Cloud &"

Transcription

1 Bringing&the&Performance&to&the& Cloud & Dongsu&Han& KAIST & Department&of&Electrical&Engineering& Graduate&School&of&InformaAon&Security& &

2 The&Era&of&Cloud&CompuAng& Datacenters&at&Amazon,&Google,&Facebook& & 2

3 Customers&of&the&Cloud& Customers& rent &either&physical&or&virtual& machines&from&the&cloud.& Public&and&private&cloud:&External&or&internal& customers&enaaes& &&&&&& 3&

4 Scale&of&the& Cloud & Facebook:& hundreds)of)thousands)of)machines. ) MicrosoP:&1&million&servers&& Google&envisions&10&million&servers.&& Google&spends&about&$3&Billion&every&year&on&data& centers.*& * sdata-center-spend-slows &

5 Efficiency&is&important& What&if&we&can&increase&a&single&machine s& performance&by&e.g.,&10,&20,&40%?& Equipment&cost&savings& Energy&savings:&By&2012,&the&cost&of&power&for&the& data&center&is&expected&to&exceed&the&cost&of&the& original&capital&investment.&[u.s&doe]& Reduced&complexity&in&system&design&as&#&of& machine&involved&decreases& 5&

6 How do we improve the Cloud efficiency (and performance)? We start with a single machine. What does a machine look like? What kind of workload does it handle? 6&

7 What&does&a&machine&look&like&today?& General&purpose&hardware&(x86&architecture)& MulAcore:&4&~60&cores&(tens&of&CPU&cores)& MulAple&10hGigabit&Ethernet&(becoming&the&norm)& & 7

8 A&Typical&Cluster&ConfiguraAon& Front-end Web tier Back-end Cache tier Service tier (e.g., Advertising) Storage tier 8&

9 A&Typical&Cluster&ConfiguraAon& Front-end Web tier Back-end Cache tier Front-end interaction happens over HTTP (TCP). Back-end interaction can use any protocol. 9&

10 A&Typical&Cluster&ConfiguraAon& Web tier Up&to&3x&improvement&in& performance&&& Cache tier Up&to&7x&improvement&in& performance&&& 1. Improving the performance of a cache server [NSDI 14] Joint work with H. Lim, M. Kaminsky, and D. Andersen 2. Improving the performance of a Web server [NSDI 14] w/ E. Jeong, S. Woo, M. Jamshed, H. Jeong, S. Ihm, and K. Park 10&

11 InhMemory&KeyhValue&Cache& Request& Keyhvalue&Cache& Database& Keyhvalues&are&stored&in&DRAM&( Memcache )& 11&

12 InhMemory&KeyhValue&Cache& Request& What is the workload of a K-V cache like? Keyhvalue&Cache& [SOSP]&2007,&2009,&2011&[NSDI]&2013a,&2013b&[EuroSys]&2012& [SIGCOMM]&2012&[SOCC]&2010,&2012&[SIGMETRICS]&2012&[ATC]&2013& 12& Database&

13 Workload&of&a&Keyhvalue&cache& CDF 5 different memcached pools Facebook&KhV&size&distribuAon&[SIGMETRICS2012]& 13&

14 Diverse&Workload&[SIGMETRICS2012]& Read intensive Write intensive 14&

15 Fast&InhMemory&KeyhValue&Cache& Must&handle&small,&variablehlength&items&efficiently& Support&both&readh&and&writehintensive&workloads& & 15&

16 Current&KhV&Cache&Performance& 80& 70& 60& 50& 40& 30& 20& 10& 0& 1.4& Workload:&YCSBhB&(95%&GET,&5%&PUT)& Throughput&(M&operaAons/sec)& <&1& 4.4& 8.9& 77.8& Memcached& RAMCloud& MemC3& Masstree& (Ideal)& Endhtohend&performance&using&UDP& 16&

17 ReadhIntensive&Workload& 80& 70& 60& 50& 40& 30& 20& 10& 0& 1.5& Workload:&YCSBhB&(95%&GET,&5%&PUT)& Throughput&(M&operaAons/sec)& 3.1& 6.3& 23.5& 77.8& Memcached& RAMCloud& MemC3& Masstree& (Ideal)& Endhtohend&performance&using&our&opAmized&network&stack& 17&

18 ReadhIntensive&Workload& 80& 70& 60& 50& 40& 30& 20& 10& 0& 1.5& Workload:&YCSBhB&(95%&GET,&5%&PUT)& Throughput&(M&operaAons/sec)& 3.1& Problem(1:( Large(gap( 6.3& 23.5& 77.8& Memcached& RAMCloud& MemC3& Masstree& (Ideal)& Endhtohend&performance&using&our&opAmized&network&stack& 18& Maximum& packets/sec& asainable&using& UDP&protocol&

19 WritehIntensive& Workload:&YCSBhA&(50%&GET,&50%&PUT)& Throughput&(M&operaAons/sec)& 80& 70& 60& 50& 40& 30& 20& 10& 0& Problem(2:( Performance(collapses( under(heavy(write( <&1& <&1& <&1& 19& 11.7& 77.8& Memcached& RAMCloud& MemC3& Masstree& (Ideal)& Endhtohend&performance&using&our&opAmized&network&stack& Maximum& packets/sec& asainable&using& UDP&protocol&

20 MICA&Performance&Preview& Workload:&YCSBhA&(50%&GET,&50%&PUT)& Throughput&(M&operaAons/sec)& 80& 70& 60& 50& 40& 30& 20& 10& 0& MICA:(70.1(Mops( <&1& <&1& <&1& 20& About(14(ns(per(request( Maximum& 77.8& packets/sec& asainable&using& UDP&protocol& 11.7& Memcached& RAMCloud& MemC3& Masstree& (Ideal)& Endhtohend&performance&using&our&opAmized&network&stack&

21 MICA&Approach& MICA&redesigns&a&KhV&cache&in&a&holisAc&way.&& Server& Client& NIC& CPU& CPU& Memory& 2.&Request& direcaon& 1.&Parallel& data&access& 3.&Keyhvalue& data&structures& 21&

22 Parallel&Data&Access& Server&node& Client& NIC& CPU& CPU& Memory& 2.&Request& direcaon& 1.&Parallel& data&access& 3.&Keyhvalue& data&structures& Modern&CPUs&have&many&cores&(8,&12,& )& We&must&exploit&CPU&parallelism&efficiently.& 22&

23 Concurrent&Read/Write&(CRCW)& Any&core&can&read/write&any&part&of&memory& & CPU&core& CPU&core& Memory& +)&Can&distribute&load&to&mulAple&cores& Memcached,&RAMCloud&[SOSP],&MemC3&[NSDI],& Masstree&[EuroSys]& h)&limits&scalability&with&mulaple&cores& Lock&contenAon& Expensive&cacheline&transfer&caused&by& concurrent&writes&on&the&same&memory&locaaon& 23&

24 MICA&Scales&Well&with&Many&Cores& Throughput&(Mops)& 1( YCSBhA& Skewed,&50%&GET& MemC3& Memcached& RAMCloud& 0.1( 2& 4& 8& 16& Number&of&CPU&cores& 24&

25 MICA&Scales&Well&with&Many&Cores& Throughput&(Mops)& 100( YCSBhA& Skewed,&50%&GET& 10( 1( MICA& Masstree& MemC3& Memcached& RAMCloud& 0.1( 2& 4& 8& 16& Number&of&CPU&cores& 25&

26 MICA s&parallel&data&access& ParAAon&data&using&the&hash&of&keys& Exclusive&Read/Write&(EREW)& Only&one&core&accesses&a&parAcular&parAAon& CPU&core& ParAAon& & CPU&core& ParAAon& +)&Avoids&synchronizaAon/interhcore& communicaaon&[hhstore,voltdb]& h)&can&be&slow&under&skewed&key&popularity& A&popular&item&cannot&be&served&by&mulAple&cores& 26&

27 EREW&Outperforms&CRCW& Throughput&(Mops)& 80& 77.0& 76.8& 70& 60& 50& 70.1& Skewed & &Same&as&YCSB s& Zipf&key&popularity&with&skewness=0.99& 70.2& 70.9& 70.3& 40& 30& 20& 10& 0& Uniform,&50%&GET& Uniform,&95%&GET& Skewed,&50%&GET& Skewed,&95%&GET& EREW& 35.4& 19.6& CRCW& 27&

28 Skew&Does&Not&Hurt&(Much)& Throughput&(Mops)& 8& 6& 4& 2& Some&cores&can&process&more& requests&under&skewed&workloads& Uniform& Skewed& & 0& 0& 1& 2& 3& 4& 5& 6& 7& 8& 9& 10& 11& 12& 13& 14& 15& Perhcore&performance&breakdown& Hot&parAAons&contain&a&few&popular&keys,& making&cpu&cache&very&effecave& Core&#& 28&

29 Request&DirecAon& Server&node& Client& NIC& CPU& CPU& Memory& 2.&Request& direcaon& 1.&Parallel& data&access& 3.&Keyhvalue& data&structures& EREW&requires&correct&request&direcAon.& A&request&must&be&sent&to&the&core/parAAon&that& handles&the&requested&key.& 29&

30 Common&Request&DirecAon&Scheme& FlowFbased(affinity( Server&node& Client& Client& NIC& CPU& CPU& Using&5htuple&to&assign&packets&to&cores& Useful&for&flowhbased&protocols&(e.g.,&TCP)& Does&not&work&well&with&MICA s&erew& A&client&can&request&keys&from&different&parAAons& 30&

31 MICA s&request&direcaon& ObjectFbased(affinity( MICA&overcomes&commodity&NICs &limited& programmability&by&using&client&assistance& & Server&node& Client& Client& Key&2& Key&1& Key&1& Key&2& NIC& CPU& CPU& Key s&paraaon&id&is&put&into& packet&header&(udp&dest&port)& Programmed&to&use&the&embedded&parAAon& ID&to&direct&packets&to&cores& Uses&Intel&Data&Plane&Development&Kit&(DPDK)&for& lowhoverhead&burst&packet&i/o&bypassing&os&kernel& 31&

32 NIC&HW&for&Request&DirecAon& Throughput&(Mops)& 80& 70& 60& 50& Uniform& Skewed& 77.0& 70.1& 40& 30& 20& 10& 34.8& 25.6& 0& SoPwarehonly& Clienthassisted&hardwarehbased& Using&EREW&for&parallel&data&access& 32&

33 KeyhValue&Data&Structures& Server&node& Client& NIC& CPU& CPU& Memory& & 2.&Request& direcaon& 1.&Parallel& data&access& 3.&Keyhvalue& data&structures& Significant&impact&on&keyhvalue&processing&speed& New&design&required&to&support&both&read&and& write&operaaons&at&high&speeds& 33&

34 MICA s&keyhvalue&data&structures& Each&parAAon&has&two&data&structures:& Circular&log&store&& Lossy&concurrent&hash&index& Omised&in&this&talk:&numerous&opAmizaAons& Garbage&collecAon& LRU&approximaAon& NUMAhaware&memory&allocaAon,&memory&mapping& memory&prefetching,&cachehfriendly&data&structures& Concurrency&support&(for& CREW )& 34&

35 Circular&Log&Store& Allocates&space&for&keyhvalue&items&of&any&length& Simple&garbage&collecAon&and&free&space& defragmentaaon& New&item&is& appended&at&tail& Head& Insufficient&space& for&new&item?& Tail& (fixed&log&size)& Evict&oldest&item& at&head&(fifo)& Tail& Head& 35&

36 Lossy&Concurrent&Hash&Index& Indexes&items&in&the&circular&log&with&a&seth associaave&hash&index& hash(key)& Circular& log& bucket&0& bucket&1& & bucket&nh1& Hash& index& Key,Val& Full&bucket?&Evict&oldest&entry&from&it&(FIFO)& Allows&fast&indexing&of&new&keyhvalue&items& 36&

37 KeyhValue&Data&Structure&Comparison& Throughput&(Mops)& 80& 70& 60& 50%&GET& 95%&GET& EREW,&Skewed,&Small&Items& 70.1& 70.2& 50& 40& 30& 29.6& 20& 10& 0& 6.9& ParAAoned&Masstree& MICA& 37&

38 Throughput&Comparison& Throughput&(Mops)& 80& 70& 60& 50& 40& Uniform,&50%&GET& Uniform,&95%&GET& Skewed,&50%&GET& Skewed,&95%&GET& 77.0& 76.8& 70.1& 70.2& 77.8& 77.8& 77.8& 77.8& 30& 20& 10& 0& 23.5& 17.4& 11.9& 11.7& 6.1& 6.3& 0.7& 1.5& 2.2& 3.1& 0.7& 1.5& 0.6& 0.5& 0.9& 0.9& Memcached& RAMCloud& MemC3& Masstree& MICA& (Ideal)& Endhtohend&performance&using&our&opAmized&network&stack& 38&

39 ThroughputhLatency&(on&Ethernet)& Average&latency&(μs)& 140& 120& 100& 80& 60& 40& 20& 140& 120& 100& 80& Memcached&UDP& MICA& 60& Two(orders(of(magnitude(( 40& 20& 0& 0& 0.1& 0.2& 0.3& 0& 0& 20& 40& 60& 80& Throughput&(Mops)& Memcached&UDP&using&standard&socket&I/O& 39&

40 Summary& MICA&takes&a&holisAc&approach&to&designing& fast&inhmemory&keyhvalue&caches.& Efficient&parallel&data&access&& Hardwarehbased&request&direcAon& OpAmized&data&structures&for&keyhvalue&caching& MICA&consistently&achieves&high&performance& under&diverse&workloads.& 40&

41 Typical&Server&Cluster&ConfiguraAon& Web tier Up&to&3x&improvement&in& performance&&& Cache tier Up&to&7x&improvement&in& performance&&& 1. Improving the performance of a cache server [NSDI 14] Joint work with H. Lim, M. Kaminsky, and D. Andersen 2. Improving the performance of a Web server [NSDI 14] E. Jeong, S. Woo, M. Jamshed, H. Jeong, S. Ihm, and K. Park 41&

42 Workload&for&Userhfacing&Servers Measurement of TCP flows in commercial cellular backbone [Woo,mobisys 13] 100%& 80%& 90%& CDF 60%& 40%& 20%& 0%& 0& 1K& 4K& 8K& 16K& 32K& 64K&128K&256K&512K&1M&Total& Flow&Size&(Bytes)& Over 90% (50%) of TCP flows are smaller than 64 KB (4 KB). 42

43 Web&Server&Performance Large&transfers:&easy&to&fill&up&10&Gbps&& Small&transacAons:&1.2&Gbps&under&SpecWeb) Kernel&is&not&designed&well&for&mulAcore&systems.& ConnecKons/sec((x(10 5 ) 2.5&& 2.0&& 1.5&& 1.0&& 0.5&& 0.0&& TCP(ConnecKon(Setup(Performance Linux: Intel Xeon E Intel 10Gbps NIC Performance Meltdown 1& 2& 4& 6& 8& Number(of(CPU(Cores 43

44 Performance&Analysis&of&a&Web&Server 83% of CPU usage spent inside kernel! CPU&usage&on&Linuxh3.10& Web&server&(Lighspd)&serving&a&64&byte&file 17%& 45%& 34%& 4%& Kernel& Packet&I/O& TCP/IP& ApplicaAon& 44

45 Inefficiencies&in&Kernel&TCP/IP&Stack& 1.&Lack&of&connecAon&locality& & & & ApplicaAon& thread CPU Packet/& TCP&processing&& (ksopirqd) TCP send/recv buffer Core 0 Core 1 packet 45&

46 Inefficiencies&in&Kernel&TCP/IP&Stack& 1.&Lack&of&connecAon&locality& & while (1) { epoll_wait( ) fd = accept(listen_fd, NULL); & read(fd, buf, 1024); & write(fd, buf, 1024); } while (1) { epoll_wait( ) fd = accept(listen_fd, NULL); read(fd, buf, 1024); write(fd, buf, 1024); } ApplicaKon(thread ApplicaKon(thread Core 0 Core 1 AccepthAffinity[EUROSYS 12]:&connecAon&affinity&(only)& Linux&SO_REUSEPORT&opAon&(v3.9.4):&perhcore&listen&socket& 46&

47 Inefficiencies&in&Kernel&TCP/IP&Stack& 2.&Shared&file&descriptor&space& &&&&&fd&=&accept(listen_fd,&null)& Socket API Kernel Socket layer Virtual file system layer e.g., vfs_open() File System VFS overhead: Creates inode for each socket file descriptor. Finds the lowest available integer [POSIX] 47&

48 Inefficiencies&in&Kernel&TCP/IP&Stack& 3.&System&call&overhead&(frequent&and&expensive)& & & & & & while (1) { epoll_wait( ) fd = accept(listen_fd, NULL); read(fd, buf, 1024); write(fd, buf, 1024); } 4.&Inefficient&perhpacket&processing& Perhpacket&memory&allocaAon/deallocaAon&overhead& MegaPipe[OSDI 12]:&parAally&address&problems&1,2,3&& & &All&prior&work&reuses&kernel s&tcp/ip.& 48&

49 mtcp&approach mtcp:&a&highhperformance&userhlevel&tcp& design&for&mulacore&systems& Cleanhslate&approach&to&divorce&kernel s& complexity& 1. Leverage&userhlevel&packet&I/O& 2. Support&mulAcorehaware&flow&processing& 3. Provide&a&userhlevel&socket&API& 49

50 mtcp&overview& ApplicaAon&thread User process ApplicaAon&thread User-space memory mtcp&socket&interface mtcp&thread mtcp&epollhlike&& event&system mtcp&thread Kernel Userhlevel&Packet&I/O&library Device&Driver Core 0 Core 1 50&

51 mtcp&overview& Core-affinity (1) Per-core file descriptor, listen socket (2) Kernel bypass, No system call (3) Batched packet processing (4) Per-core Event Q Socket buffer ApplicaAon&thread mtcp&socket&interface mtcp&thread Userhlevel&Packet&I/O&library Device&Driver ApplicaAon&thread mtcp_accept() mtcp&epollhlike&& event&system mtcp&thread Event Q Socket buffer Core 0 Core 1 Packet queue Receiver-side scaling (H/W) 51&

52 mtcp&overview& Core-affinity (1) Per-core file descriptor, listen socket (2) Kernel bypass, No system call (3) Batched packet processing (4) ApplicaAon&thread mtcp&socket&interface mtcp&thread ApplicaAon&thread mtcp&epollhlike&& event&system mtcp&thread Userhlevel&Packet&I/O&library Device&Driver Core 0 Core 1 Packet queue 52&

53 mtcp&design& Highly&scalability&on&mulAcore&systems& 25x&faster&than&latest&Linux&version& 3x&faster&than&MegaPipe& Easy&to&use;&lisle&porAng&effort& Modified&29&lines&of&the&Apache&library&(out&of&66,493)&& EvaluaAon& HTTP&server/client:&lighspd,&Apache& Web&Replayer:&replays&cellular&backbone&traffic&(Korea)& SSL&Proxy& 53&

54 MulAcore&Scalability 64B&message&per&each&connecAon& Heavy&connecAon,&small&packet&processing&overhead& 25x&Linux,&5x&REUSEPORT,&3x&MegaPipe&[OSDI&2012]& TransacKons/sec((x(10 5 ) 15& 12& 9& 6& 3& 0& 0& 1 2& 4& 6& 8& Number(of(CPU(Cores Linux& REUSEPORT& MegaPipe& mtcp&

55 Message&Benchmark Scaling&by&message&size&& Persistent&connecAon&with&64&byte&messages Throughput((Gbps) 10&& 8&& 6&& 4&& 2&& 0&& Linux& REUSEPORT& MegaPipe& mtcp& Link&saturaAon 64B& 256B& 1KiB& 4KiB& 8KiB& Message(Size TransacKons/sec((x(10 5 ) 40& 30& 20& 10& 0& 1& 2& 8& 32& 64& 128& #(Messages/connecKon

56 Web&Server&(Lighspd)&Performance SpecWeb2009&staAc&file&workload&(738B&average)& 3.2x&faster&than&Linux,&1.5x&faster&than&MegaPipe& Throughput&(Gbps) 5& 4& 3& 2& 1& 0& Throughput&(Gbps)& TransacAons/sec& Linux& REUSEPORT& MegaPipe& mtcp& 4& 3.5& 3& 2.5& 2& 1.5& 1& 0.5& 0& TransacAons/s&(x&10 5 )

57 Summary mtcp:&a&highhperformance&userhlevel&tcp&stack&for& mulacore&systems& Efficiently&uAlize&mulAcore&resources&by& EliminaAng&system&call&overhead& Reducing&context&switch&cost&by&event&batching& Using&perhcore&resource&management& Using&cachehaware&threading& Achieve&high&performance&scalability& Small&message&transacAons:&3x&to&25x& ExisAng&applicaAons:&&33%&(SSLShader)&to&320%&(lighspd)&

58 Conclusion& Despite&many&efforts&from&academia&and&industry,& there&sall&exists&lots&of&room&for&innovaaons&for& Cloudhbased&systems&and&services.& EssenAal&building&blocks&for&Cloud&services&can& benefit&from&a&holisac,&mulacorehaware&design& that&leverages&the&underlying&h/w&and&that& carefully&considers&the&workload.& More&research&is&ahead&in&bringing&new& applicaaons&to&the&cloud.&& 58&

59 Reference& [DPDK]& hsp:// technology/packethprocessinghishenhancedhwithhsopwarehfromhintelh dpdk.html& [FacebookMeasurement]&Berk&AAkoglu,&Yuehai&Xu,&Eitan& Frachtenberg,&Song&Jiang,&and&Mike&Paleczny.&Workload&analysis&of&a& largehscale&keyhvalue&store.&in&proc.)sigmetrics)2012.) [Masstree]&Yandong&Mao,&Eddie&Kohler,&and&Robert&Tappan&Morris.& Cache&CraPiness&for&Fast&MulAcore&KeyhValue&Storage.&In&Proc.) EuroSys)2012.) [MemC3]&Bin&Fan,&David&G.&Andersen,&and&Michael&Kaminsky.&MemC3:& Compact&and&Concurrent&MemCache&with&Dumber&Caching&and& Smarter&Hashing.&In&Proc.)NSDI)2013.) [Memcached]&hsp://memcached.org/& [RAMCloud]&Diego&Ongaro,&Stephen&M.&Rumble,&Ryan&Stutsman,&John& Ousterhout,&and&Mendel&Rosenblum.&Fast&Crash&Recovery&in& RAMCloud.&In&Proc.)SOSP)2011.) 59&

MegaPipe: A New Programming Interface for Scalable Network I/O

MegaPipe: A New Programming Interface for Scalable Network I/O MegaPipe: A New Programming Interface for Scalable Network I/O Sangjin Han in collabora=on with Sco? Marshall Byung- Gon Chun Sylvia Ratnasamy University of California, Berkeley Yahoo! Research tl;dr?

More information

Speeding up Linux TCP/IP with a Fast Packet I/O Framework

Speeding up Linux TCP/IP with a Fast Packet I/O Framework Speeding up Linux TCP/IP with a Fast Packet I/O Framework Michio Honda Advanced Technology Group, NetApp michio@netapp.com With acknowledge to Kenichi Yasukata, Douglas Santry and Lars Eggert 1 Motivation

More information

MICA: A Holistic Approach to Fast In-Memory Key-Value Storage

MICA: A Holistic Approach to Fast In-Memory Key-Value Storage MICA: A Holistic Approach to Fast In-Memory Key-Value Storage Hyeontaek Lim, Carnegie Mellon University; Dongsu Han, Korea Advanced Institute of Science and Technology (KAIST); David G. Andersen, Carnegie

More information

IX: A Protected Dataplane Operating System for High Throughput and Low Latency

IX: A Protected Dataplane Operating System for High Throughput and Low Latency IX: A Protected Dataplane Operating System for High Throughput and Low Latency Adam Belay et al. Proc. of the 11th USENIX Symp. on OSDI, pp. 49-65, 2014. Presented by Han Zhang & Zaina Hamid Challenges

More information

Light: A Scalable, High-performance and Fully-compatible User-level TCP Stack. Dan Li ( 李丹 ) Tsinghua University

Light: A Scalable, High-performance and Fully-compatible User-level TCP Stack. Dan Li ( 李丹 ) Tsinghua University Light: A Scalable, High-performance and Fully-compatible User-level TCP Stack Dan Li ( 李丹 ) Tsinghua University Data Center Network Performance Hardware Capability of Modern Servers Multi-core CPU Kernel

More information

PASTE: A Network Programming Interface for Non-Volatile Main Memory

PASTE: A Network Programming Interface for Non-Volatile Main Memory PASTE: A Network Programming Interface for Non-Volatile Main Memory Michio Honda (NEC Laboratories Europe) Giuseppe Lettieri (Università di Pisa) Lars Eggert and Douglas Santry (NetApp) USENIX NSDI 2018

More information

Tales of the Tail Hardware, OS, and Application-level Sources of Tail Latency

Tales of the Tail Hardware, OS, and Application-level Sources of Tail Latency Tales of the Tail Hardware, OS, and Application-level Sources of Tail Latency Jialin Li, Naveen Kr. Sharma, Dan R. K. Ports and Steven D. Gribble February 2, 2015 1 Introduction What is Tail Latency? What

More information

FlexNIC: Rethinking Network DMA

FlexNIC: Rethinking Network DMA FlexNIC: Rethinking Network DMA Antoine Kaufmann Simon Peter Tom Anderson Arvind Krishnamurthy University of Washington HotOS 2015 Networks: Fast and Growing Faster 1 T 400 GbE Ethernet Bandwidth [bits/s]

More information

PASTE: Fast End System Networking with netmap

PASTE: Fast End System Networking with netmap PASTE: Fast End System Networking with netmap Michio Honda, Giuseppe Lettieri, Lars Eggert and Douglas Santry BSDCan 2018 Contact: @michioh, micchie@sfc.wide.ad.jp Code: https://github.com/micchie/netmap/tree/stack

More information

Kernel Bypass. Sujay Jayakar (dsj36) 11/17/2016

Kernel Bypass. Sujay Jayakar (dsj36) 11/17/2016 Kernel Bypass Sujay Jayakar (dsj36) 11/17/2016 Kernel Bypass Background Why networking? Status quo: Linux Papers Arrakis: The Operating System is the Control Plane. Simon Peter, Jialin Li, Irene Zhang,

More information

The Power of Batching in the Click Modular Router

The Power of Batching in the Click Modular Router The Power of Batching in the Click Modular Router Joongi Kim, Seonggu Huh, Keon Jang, * KyoungSoo Park, Sue Moon Computer Science Dept., KAIST Microsoft Research Cambridge, UK * Electrical Engineering

More information

IX: A Protected Dataplane Operating System for High Throughput and Low Latency

IX: A Protected Dataplane Operating System for High Throughput and Low Latency IX: A Protected Dataplane Operating System for High Throughput and Low Latency Adam Belay 1 George Prekas 2 Ana Klimovic 1 Samuel Grossman 1 Christos Kozyrakis 1 1 Stanford University Edouard Bugnion 2

More information

Arrakis: The Operating System is the Control Plane

Arrakis: The Operating System is the Control Plane Arrakis: The Operating System is the Control Plane Simon Peter, Jialin Li, Irene Zhang, Dan Ports, Doug Woos, Arvind Krishnamurthy, Tom Anderson University of Washington Timothy Roscoe ETH Zurich Building

More information

Software Routers: NetMap

Software Routers: NetMap Software Routers: NetMap Hakim Weatherspoon Assistant Professor, Dept of Computer Science CS 5413: High Performance Systems and Networking October 8, 2014 Slides from the NetMap: A Novel Framework for

More information

A Proxy-based Query Aggregation Method for Distributed Key-Value Stores

A Proxy-based Query Aggregation Method for Distributed Key-Value Stores A Proxy-based Query Aggregation Method for Distributed Key-Value Stores Daichi Kawanami, Masanari Kamoshita, Ryota Kawashima and Hiroshi Matsuo Nagoya Institute of Technology, in Nagoya, Aichi, 466-8555,

More information

Light & NOS. Dan Li Tsinghua University

Light & NOS. Dan Li Tsinghua University Light & NOS Dan Li Tsinghua University Performance gain The Power of DPDK As claimed: 80 CPU cycles per packet Significant gain compared with Kernel! What we care more How to leverage the performance gain

More information

Demystifying Network Cards

Demystifying Network Cards Demystifying Network Cards Paul Emmerich December 27, 2017 Chair of Network Architectures and Services About me PhD student at Researching performance of software packet processing systems Mostly working

More information

Research on DPDK Based High-Speed Network Traffic Analysis. Zihao Wang Network & Information Center Shanghai Jiao Tong University

Research on DPDK Based High-Speed Network Traffic Analysis. Zihao Wang Network & Information Center Shanghai Jiao Tong University Research on DPDK Based High-Speed Network Traffic Analysis Zihao Wang Network & Information Center Shanghai Jiao Tong University Outline 1 Background 2 Overview 3 DPDK Based Traffic Analysis 4 Experiment

More information

TLDK Overview. Transport Layer Development Kit Ray Kinsella February ray.kinsella [at] intel.com IRC: mortderire

TLDK Overview. Transport Layer Development Kit Ray Kinsella February ray.kinsella [at] intel.com IRC: mortderire TLDK Overview Transport Layer Development Kit Ray Kinsella February 2017 Email : ray.kinsella [at] intel.com IRC: mortderire Contributions from Keith Wiles & Konstantin Ananyev Legal Disclaimer General

More information

Advanced Computer Networks. End Host Optimization

Advanced Computer Networks. End Host Optimization Oriana Riva, Department of Computer Science ETH Zürich 263 3501 00 End Host Optimization Patrick Stuedi Spring Semester 2017 1 Today End-host optimizations: NUMA-aware networking Kernel-bypass Remote Direct

More information

WORKLOAD CHARACTERIZATION OF INTERACTIVE CLOUD SERVICES BIG AND SMALL SERVER PLATFORMS

WORKLOAD CHARACTERIZATION OF INTERACTIVE CLOUD SERVICES BIG AND SMALL SERVER PLATFORMS WORKLOAD CHARACTERIZATION OF INTERACTIVE CLOUD SERVICES ON BIG AND SMALL SERVER PLATFORMS Shuang Chen*, Shay Galon**, Christina Delimitrou*, Srilatha Manne**, and José Martínez* *Cornell University **Cavium

More information

Master s Thesis (Academic Year 2015) Improving TCP/IP stack performance by fast packet I/O framework

Master s Thesis (Academic Year 2015) Improving TCP/IP stack performance by fast packet I/O framework Master s Thesis (Academic Year 2015) Improving TCP/IP stack performance by fast packet I/O framework Keio University Graduate School of Media and Governance Kenichi Yasukata Master s Thesis Academic Year

More information

소프트웨어기반고성능침입탐지시스템설계및구현

소프트웨어기반고성능침입탐지시스템설계및구현 소프트웨어기반고성능침입탐지시스템설계및구현 KyoungSoo Park Department of Electrical Engineering, KAIST M. Asim Jamshed *, Jihyung Lee*, Sangwoo Moon*, Insu Yun *, Deokjin Kim, Sungryoul Lee, Yung Yi* Department of Electrical

More information

Memory Management Strategies for Data Serving with RDMA

Memory Management Strategies for Data Serving with RDMA Memory Management Strategies for Data Serving with RDMA Dennis Dalessandro and Pete Wyckoff (presenting) Ohio Supercomputer Center {dennis,pw}@osc.edu HotI'07 23 August 2007 Motivation Increasing demands

More information

libvnf: building VNFs made easy

libvnf: building VNFs made easy libvnf: building VNFs made easy Priyanka Naik, Akash Kanase, Trishal Patel, Mythili Vutukuru Dept. of Computer Science and Engineering Indian Institute of Technology, Bombay SoCC 18 11 th October, 2018

More information

Much Faster Networking

Much Faster Networking Much Faster Networking David Riddoch driddoch@solarflare.com Copyright 2016 Solarflare Communications, Inc. All rights reserved. What is kernel bypass? The standard receive path The standard receive path

More information

Exception-Less System Calls for Event-Driven Servers

Exception-Less System Calls for Event-Driven Servers Exception-Less System Calls for Event-Driven Servers Livio Soares and Michael Stumm University of Toronto Talk overview At OSDI'10: exception-less system calls Technique targeted at highly threaded servers

More information

Fast packet processing in the cloud. Dániel Géhberger Ericsson Research

Fast packet processing in the cloud. Dániel Géhberger Ericsson Research Fast packet processing in the cloud Dániel Géhberger Ericsson Research Outline Motivation Service chains Hardware related topics, acceleration Virtualization basics Software performance and acceleration

More information

TLDK Overview. Transport Layer Development Kit Keith Wiles April Contributions from Ray Kinsella & Konstantin Ananyev

TLDK Overview. Transport Layer Development Kit Keith Wiles April Contributions from Ray Kinsella & Konstantin Ananyev TLDK Overview Transport Layer Development Kit Keith Wiles April 2017 Contributions from Ray Kinsella & Konstantin Ananyev Notices and Disclaimers Intel technologies features and benefits depend on system

More information

FlexSC. Flexible System Call Scheduling with Exception-Less System Calls. Livio Soares and Michael Stumm. University of Toronto

FlexSC. Flexible System Call Scheduling with Exception-Less System Calls. Livio Soares and Michael Stumm. University of Toronto FlexSC Flexible System Call Scheduling with Exception-Less System Calls Livio Soares and Michael Stumm University of Toronto Motivation The synchronous system call interface is a legacy from the single

More information

Towards a Software Defined Data Plane for Datacenters

Towards a Software Defined Data Plane for Datacenters Towards a Software Defined Data Plane for Datacenters Arvind Krishnamurthy Joint work with: Antoine Kaufmann, Ming Liu, Naveen Sharma Tom Anderson, Kishore Atreya, Changhoon Kim, Jacob Nelson, Simon Peter

More information

IsoStack Highly Efficient Network Processing on Dedicated Cores

IsoStack Highly Efficient Network Processing on Dedicated Cores IsoStack Highly Efficient Network Processing on Dedicated Cores Leah Shalev Eran Borovik, Julian Satran, Muli Ben-Yehuda Outline Motivation IsoStack architecture Prototype TCP/IP over 10GE on a single

More information

Be Fast, Cheap and in Control with SwitchKV Xiaozhou Li

Be Fast, Cheap and in Control with SwitchKV Xiaozhou Li Be Fast, Cheap and in Control with SwitchKV Xiaozhou Li Raghav Sethi Michael Kaminsky David G. Andersen Michael J. Freedman Goal: fast and cost-effective key-value store Target: cluster-level storage for

More information

StackMap: Low-Latency Networking with the OS Stack and Dedicated NICs

StackMap: Low-Latency Networking with the OS Stack and Dedicated NICs StackMap: Low-Latency Networking with the OS Stack and Dedicated NICs Kenichi Yasukata 1, Michio Honda 2, Douglas Santry 2, and Lars Eggert 2 1 Keio University 2 NetApp Abstract StackMap leverages the

More information

MemC3: MemCache with CLOCK and Concurrent Cuckoo Hashing

MemC3: MemCache with CLOCK and Concurrent Cuckoo Hashing MemC3: MemCache with CLOCK and Concurrent Cuckoo Hashing Bin Fan (CMU), Dave Andersen (CMU), Michael Kaminsky (Intel Labs) NSDI 2013 http://www.pdl.cmu.edu/ 1 Goal: Improve Memcached 1. Reduce space overhead

More information

MegaPipe: A New Programming Interface for Scalable Network I/O

MegaPipe: A New Programming Interface for Scalable Network I/O MegaPipe: A New Programming Interface for Scalable Network I/O Sangjin Han +, Scott Marshall +, Byung-Gon Chun *, and Sylvia Ratnasamy + + University of California, Berkeley *Yahoo! Research Abstract We

More information

Exploring mtcp based Single-Threaded and Multi-Threaded Web Server Design

Exploring mtcp based Single-Threaded and Multi-Threaded Web Server Design Exploring mtcp based Single-Threaded and Multi-Threaded Web Server Design A Thesis Submitted in partial fulfillment of the requirements for the degree of Master of Technology by Pijush Chakraborty (153050015)

More information

SoftRDMA: Rekindling High Performance Software RDMA over Commodity Ethernet

SoftRDMA: Rekindling High Performance Software RDMA over Commodity Ethernet SoftRDMA: Rekindling High Performance Software RDMA over Commodity Ethernet Mao Miao, Fengyuan Ren, Xiaohui Luo, Jing Xie, Qingkai Meng, Wenxue Cheng Dept. of Computer Science and Technology, Tsinghua

More information

Mega KV: A Case for GPUs to Maximize the Throughput of In Memory Key Value Stores

Mega KV: A Case for GPUs to Maximize the Throughput of In Memory Key Value Stores Mega KV: A Case for GPUs to Maximize the Throughput of In Memory Key Value Stores Kai Zhang Kaibo Wang 2 Yuan Yuan 2 Lei Guo 2 Rubao Lee 2 Xiaodong Zhang 2 University of Science and Technology of China

More information

FaRM: Fast Remote Memory

FaRM: Fast Remote Memory FaRM: Fast Remote Memory Problem Context DRAM prices have decreased significantly Cost effective to build commodity servers w/hundreds of GBs E.g. - cluster with 100 machines can hold tens of TBs of main

More information

Developing Stateful Middleboxes with the mos API KYOUNGSOO PARK & YOUNGGYOUN MOON

Developing Stateful Middleboxes with the mos API KYOUNGSOO PARK & YOUNGGYOUN MOON Developing Stateful Middleboxes with the mos API KYOUNGSOO PARK & YOUNGGYOUN MOON ASIM JAMSHED, DONGHWI KIM, & DONGSU HAN SCHOOL OF ELECTRICAL ENGINEERING, KAIST Network Middlebox Networking devices that

More information

An Implementation of the Homa Transport Protocol in RAMCloud. Yilong Li, Behnam Montazeri, John Ousterhout

An Implementation of the Homa Transport Protocol in RAMCloud. Yilong Li, Behnam Montazeri, John Ousterhout An Implementation of the Homa Transport Protocol in RAMCloud Yilong Li, Behnam Montazeri, John Ousterhout Introduction Homa: receiver-driven low-latency transport protocol using network priorities HomaTransport

More information

Assignment 2 Group 5 Simon Gerber Systems Group Dept. Computer Science ETH Zurich - Switzerland

Assignment 2 Group 5 Simon Gerber Systems Group Dept. Computer Science ETH Zurich - Switzerland Assignment 2 Group 5 Simon Gerber Systems Group Dept. Computer Science ETH Zurich - Switzerland t Your task Write a simple file server Client has to be implemented in Java Server has to be implemented

More information

An Analysis of Linux Scalability to Many Cores

An Analysis of Linux Scalability to Many Cores An Analysis of Linux Scalability to Many Cores 1 What are we going to talk about? Scalability analysis of 7 system applications running on Linux on a 48 core computer Exim, memcached, Apache, PostgreSQL,

More information

Introduction to OpenOnload Building Application Transparency and Protocol Conformance into Application Acceleration Middleware

Introduction to OpenOnload Building Application Transparency and Protocol Conformance into Application Acceleration Middleware White Paper Introduction to OpenOnload Building Application Transparency and Protocol Conformance into Application Acceleration Middleware Steve Pope, PhD Chief Technical Officer Solarflare Communications

More information

socketservertcl a Tcl extension for using SCM_RIGHTS By Shannon Noe - FlightAware

socketservertcl a Tcl extension for using SCM_RIGHTS By Shannon Noe - FlightAware socketservertcl a Tcl extension for using SCM_RIGHTS By Shannon Noe - FlightAware Presented at the 24th annual Tcl/Tk conference, Houston Texas, October 2017 Abstract: Horizontal scaling is used to distribute

More information

EXTENDING AN ASYNCHRONOUS MESSAGING LIBRARY USING AN RDMA-ENABLED INTERCONNECT. Konstantinos Alexopoulos ECE NTUA CSLab

EXTENDING AN ASYNCHRONOUS MESSAGING LIBRARY USING AN RDMA-ENABLED INTERCONNECT. Konstantinos Alexopoulos ECE NTUA CSLab EXTENDING AN ASYNCHRONOUS MESSAGING LIBRARY USING AN RDMA-ENABLED INTERCONNECT Konstantinos Alexopoulos ECE NTUA CSLab MOTIVATION HPC, Multi-node & Heterogeneous Systems Communication with low latency

More information

IX: A Protected Dataplane Operating System for High Throughput and Low Latency

IX: A Protected Dataplane Operating System for High Throughput and Low Latency IX: A Protected Dataplane Operating System for High Throughput and Low Latency Belay, A. et al. Proc. of the 11th USENIX Symp. on OSDI, pp. 49-65, 2014. Reviewed by Chun-Yu and Xinghao Li Summary In this

More information

Meltdown and Spectre Interconnect Performance Evaluation Jan Mellanox Technologies

Meltdown and Spectre Interconnect Performance Evaluation Jan Mellanox Technologies Meltdown and Spectre Interconnect Evaluation Jan 2018 1 Meltdown and Spectre - Background Most modern processors perform speculative execution This speculation can be measured, disclosing information about

More information

A Scalable Event Dispatching Library for Linux Network Servers

A Scalable Event Dispatching Library for Linux Network Servers A Scalable Event Dispatching Library for Linux Network Servers Hao-Ran Liu and Tien-Fu Chen Dept. of CSIE National Chung Cheng University Traditional server: Multiple Process (MP) server A dedicated process

More information

Problem Set: Processes

Problem Set: Processes Lecture Notes on Operating Systems Problem Set: Processes 1. Answer yes/no, and provide a brief explanation. (a) Can two processes be concurrently executing the same program executable? (b) Can two running

More information

NetSlices: Scalable Mul/- Core Packet Processing in User- Space

NetSlices: Scalable Mul/- Core Packet Processing in User- Space NetSlices: Scalable Mul/- Core Packet Processing in - Space Tudor Marian, Ki Suh Lee, Hakim Weatherspoon Cornell University Presented by Ki Suh Lee Packet Processors Essen/al for evolving networks Sophis/cated

More information

Maximum Performance. How to get it and how to avoid pitfalls. Christoph Lameter, PhD

Maximum Performance. How to get it and how to avoid pitfalls. Christoph Lameter, PhD Maximum Performance How to get it and how to avoid pitfalls Christoph Lameter, PhD cl@linux.com Performance Just push a button? Systems are optimized by default for good general performance in all areas.

More information

Lecture 19: File System Implementation. Mythili Vutukuru IIT Bombay

Lecture 19: File System Implementation. Mythili Vutukuru IIT Bombay Lecture 19: File System Implementation Mythili Vutukuru IIT Bombay File System An organization of files and directories on disk OS has one or more file systems Two main aspects of file systems Data structures

More information

Memshare: a Dynamic Multi-tenant Key-value Cache

Memshare: a Dynamic Multi-tenant Key-value Cache Memshare: a Dynamic Multi-tenant Key-value Cache ASAF CIDON*, DANIEL RUSHTON, STEPHEN M. RUMBLE, RYAN STUTSMAN *STANFORD UNIVERSITY, UNIVERSITY OF UTAH, GOOGLE INC. 1 Cache is 100X Faster Than Database

More information

Disclaimer This presentation may contain product features that are currently under development. This overview of new technology represents no commitme

Disclaimer This presentation may contain product features that are currently under development. This overview of new technology represents no commitme NET1343BU NSX Performance Samuel Kommu #VMworld #NET1343BU Disclaimer This presentation may contain product features that are currently under development. This overview of new technology represents no

More information

Thread. Disclaimer: some slides are adopted from the book authors slides with permission 1

Thread. Disclaimer: some slides are adopted from the book authors slides with permission 1 Thread Disclaimer: some slides are adopted from the book authors slides with permission 1 IPC Shared memory Recap share a memory region between processes read or write to the shared memory region fast

More information

OpenOnload. Dave Parry VP of Engineering Steve Pope CTO Dave Riddoch Chief Software Architect

OpenOnload. Dave Parry VP of Engineering Steve Pope CTO Dave Riddoch Chief Software Architect OpenOnload Dave Parry VP of Engineering Steve Pope CTO Dave Riddoch Chief Software Architect Copyright 2012 Solarflare Communications, Inc. All Rights Reserved. OpenOnload Acceleration Software Accelerated

More information

High Performance Packet Processing with FlexNIC

High Performance Packet Processing with FlexNIC High Performance Packet Processing with FlexNIC Antoine Kaufmann, Naveen Kr. Sharma Thomas Anderson, Arvind Krishnamurthy University of Washington Simon Peter The University of Texas at Austin Ethernet

More information

Reducing CPU and network overhead for small I/O requests in network storage protocols over raw Ethernet

Reducing CPU and network overhead for small I/O requests in network storage protocols over raw Ethernet Reducing CPU and network overhead for small I/O requests in network storage protocols over raw Ethernet Pilar González-Férez and Angelos Bilas 31 th International Conference on Massive Storage Systems

More information

Learning with Purpose

Learning with Purpose Network Measurement for 100Gbps Links Using Multicore Processors Xiaoban Wu, Dr. Peilong Li, Dr. Yongyi Ran, Prof. Yan Luo Department of Electrical and Computer Engineering University of Massachusetts

More information

DPDK Summit 2016 OpenContrail vrouter / DPDK Architecture. Raja Sivaramakrishnan, Distinguished Engineer Aniket Daptari, Sr.

DPDK Summit 2016 OpenContrail vrouter / DPDK Architecture. Raja Sivaramakrishnan, Distinguished Engineer Aniket Daptari, Sr. DPDK Summit 2016 OpenContrail vrouter / DPDK Architecture Raja Sivaramakrishnan, Distinguished Engineer Aniket Daptari, Sr. Product Manager CONTRAIL (MULTI-VENDOR) ARCHITECTURE ORCHESTRATOR Interoperates

More information

mos: An open middlebox platform with programmable network stacks

mos: An open middlebox platform with programmable network stacks mos: An open middlebox platform with programmable network stacks Shinae Woo ABSTRACT Though the growing popularity of software-based middleboxes raises new requirements for network stack functionality,

More information

Titan: Fair Packet Scheduling for Commodity Multiqueue NICs. Brent Stephens, Arjun Singhvi, Aditya Akella, and Mike Swift July 13 th, 2017

Titan: Fair Packet Scheduling for Commodity Multiqueue NICs. Brent Stephens, Arjun Singhvi, Aditya Akella, and Mike Swift July 13 th, 2017 Titan: Fair Packet Scheduling for Commodity Multiqueue NICs Brent Stephens, Arjun Singhvi, Aditya Akella, and Mike Swift July 13 th, 2017 Ethernet line-rates are increasing! 2 Servers need: To drive increasing

More information

PacketShader: A GPU-Accelerated Software Router

PacketShader: A GPU-Accelerated Software Router PacketShader: A GPU-Accelerated Software Router Sangjin Han In collaboration with: Keon Jang, KyoungSoo Park, Sue Moon Advanced Networking Lab, CS, KAIST Networked and Distributed Computing Systems Lab,

More information

HyperDex. A Distributed, Searchable Key-Value Store. Robert Escriva. Department of Computer Science Cornell University

HyperDex. A Distributed, Searchable Key-Value Store. Robert Escriva. Department of Computer Science Cornell University HyperDex A Distributed, Searchable Key-Value Store Robert Escriva Bernard Wong Emin Gün Sirer Department of Computer Science Cornell University School of Computer Science University of Waterloo ACM SIGCOMM

More information

FaRM: Fast Remote Memory

FaRM: Fast Remote Memory FaRM: Fast Remote Memory Aleksandar Dragojević, Dushyanth Narayanan, Orion Hodson, Miguel Castro Microsoft Research Abstract We describe the design and implementation of FaRM, a new main memory distributed

More information

May 1, Foundation for Research and Technology - Hellas (FORTH) Institute of Computer Science (ICS) A Sleep-based Communication Mechanism to

May 1, Foundation for Research and Technology - Hellas (FORTH) Institute of Computer Science (ICS) A Sleep-based Communication Mechanism to A Sleep-based Our Akram Foundation for Research and Technology - Hellas (FORTH) Institute of Computer Science (ICS) May 1, 2011 Our 1 2 Our 3 4 5 6 Our Efficiency in Back-end Processing Efficiency in back-end

More information

PASTE: A Networking API for Non-Volatile Main Memory

PASTE: A Networking API for Non-Volatile Main Memory PASTE: A Networking API for Non-Volatile Main Memory Michio Honda (NEC Laboratories Europe) Lars Eggert (NetApp) Douglas Santry (NetApp) TSVAREA@IETF 99, Prague May 22th 2017 More details at our HotNets

More information

Bringing the Power of ebpf to Open vswitch. Linux Plumber 2018 William Tu, Joe Stringer, Yifeng Sun, Yi-Hung Wei VMware Inc. and Cilium.

Bringing the Power of ebpf to Open vswitch. Linux Plumber 2018 William Tu, Joe Stringer, Yifeng Sun, Yi-Hung Wei VMware Inc. and Cilium. Bringing the Power of ebpf to Open vswitch Linux Plumber 2018 William Tu, Joe Stringer, Yifeng Sun, Yi-Hung Wei VMware Inc. and Cilium.io 1 Outline Introduction and Motivation OVS-eBPF Project OVS-AF_XDP

More information

LegUp: Accelerating Memcached on Cloud FPGAs

LegUp: Accelerating Memcached on Cloud FPGAs 0 LegUp: Accelerating Memcached on Cloud FPGAs Xilinx Developer Forum December 10, 2018 Andrew Canis & Ruolong Lian LegUp Computing Inc. 1 COMPUTE IS BECOMING SPECIALIZED 1 GPU Nvidia graphics cards are

More information

Data Path acceleration techniques in a NFV world

Data Path acceleration techniques in a NFV world Data Path acceleration techniques in a NFV world Mohanraj Venkatachalam, Purnendu Ghosh Abstract NFV is a revolutionary approach offering greater flexibility and scalability in the deployment of virtual

More information

An Intelligent NIC Design Xin Song

An Intelligent NIC Design Xin Song 2nd International Conference on Advances in Mechanical Engineering and Industrial Informatics (AMEII 2016) An Intelligent NIC Design Xin Song School of Electronic and Information Engineering Tianjin Vocational

More information

What s Wrong with the Operating System Interface? Collin Lee and John Ousterhout

What s Wrong with the Operating System Interface? Collin Lee and John Ousterhout What s Wrong with the Operating System Interface? Collin Lee and John Ousterhout Goals for the OS Interface More convenient abstractions than hardware interface Manage shared resources Provide near-hardware

More information

Chapter 8: I/O functions & socket options

Chapter 8: I/O functions & socket options Chapter 8: I/O functions & socket options 8.1 Introduction I/O Models In general, there are normally two phases for an input operation: 1) Waiting for the data to arrive on the network. When the packet

More information

Slides on cross- domain call and Remote Procedure Call (RPC)

Slides on cross- domain call and Remote Procedure Call (RPC) Slides on cross- domain call and Remote Procedure Call (RPC) This classic paper is a good example of a microbenchmarking study. It also explains the RPC abstraction and serves as a case study of the nuts-and-bolts

More information

CSE398: Network Systems Design

CSE398: Network Systems Design CSE398: Network Systems Design Instructor: Dr. Liang Cheng Department of Computer Science and Engineering P.C. Rossin College of Engineering & Applied Science Lehigh University February 23, 2005 Outline

More information

Ben Walker Data Center Group Intel Corporation

Ben Walker Data Center Group Intel Corporation Ben Walker Data Center Group Intel Corporation Notices and Disclaimers Intel technologies features and benefits depend on system configuration and may require enabled hardware, software or service activation.

More information

Be Fast, Cheap and in Control with SwitchKV. Xiaozhou Li

Be Fast, Cheap and in Control with SwitchKV. Xiaozhou Li Be Fast, Cheap and in Control with SwitchKV Xiaozhou Li Goal: fast and cost-efficient key-value store Store, retrieve, manage key-value objects Get(key)/Put(key,value)/Delete(key) Target: cluster-level

More information

Problem Set: Processes

Problem Set: Processes Lecture Notes on Operating Systems Problem Set: Processes 1. Answer yes/no, and provide a brief explanation. (a) Can two processes be concurrently executing the same program executable? (b) Can two running

More information

PROCESSES AND THREADS THREADING MODELS. CS124 Operating Systems Winter , Lecture 8

PROCESSES AND THREADS THREADING MODELS. CS124 Operating Systems Winter , Lecture 8 PROCESSES AND THREADS THREADING MODELS CS124 Operating Systems Winter 2016-2017, Lecture 8 2 Processes and Threads As previously described, processes have one sequential thread of execution Increasingly,

More information

ntop Users Group Meeting

ntop Users Group Meeting ntop Users Group Meeting PF_RING Tutorial Alfredo Cardigliano Overview Introduction Installation Configuration Tuning Use cases PF_RING Open source packet processing framework for

More information

Achieve Low Latency NFV with Openstack*

Achieve Low Latency NFV with Openstack* Achieve Low Latency NFV with Openstack* Yunhong Jiang Yunhong.Jiang@intel.com *Other names and brands may be claimed as the property of others. Agenda NFV and network latency Why network latency on NFV

More information

Performance Objects and Counters for the System

Performance Objects and Counters for the System APPENDIXA Performance Objects and for the System May 19, 2009 This appendix provides information on system-related objects and counters. Cisco Tomcat Connector, page 2 Cisco Tomcat JVM, page 4 Cisco Tomcat

More information

Custom UDP-Based Transport Protocol Implementation over DPDK

Custom UDP-Based Transport Protocol Implementation over DPDK Custom UDPBased Transport Protocol Implementation over DPDK Dmytro Syzov, Dmitry Kachan, Kirill Karpov, Nikolai Mareev and Eduard Siemens Future Internet Lab Anhalt, Anhalt University of Applied Sciences,

More information

Disclaimer: this is conceptually on target, but detailed implementation is likely different and/or more complex.

Disclaimer: this is conceptually on target, but detailed implementation is likely different and/or more complex. Goal: get a better understanding about how the OS and C libraries interact for file I/O and system calls plus where/when buffering happens, what s a file, etc. Disclaimer: this is conceptually on target,

More information

URDMA: RDMA VERBS OVER DPDK

URDMA: RDMA VERBS OVER DPDK 13 th ANNUAL WORKSHOP 2017 URDMA: RDMA VERBS OVER DPDK Patrick MacArthur, Ph.D. Candidate University of New Hampshire March 28, 2017 ACKNOWLEDGEMENTS urdma was initially developed during an internship

More information

New Approach to OVS Datapath Performance. Founder of CloudNetEngine Jun Xiao

New Approach to OVS Datapath Performance. Founder of CloudNetEngine Jun Xiao New Approach to OVS Datapath Performance Founder of CloudNetEngine Jun Xiao Agenda VM virtual network datapath evolvement Technical deep dive on a new OVS datapath Performance comparisons Q & A 2 VM virtual

More information

IBM B2B INTEGRATOR BENCHMARKING IN THE SOFTLAYER ENVIRONMENT

IBM B2B INTEGRATOR BENCHMARKING IN THE SOFTLAYER ENVIRONMENT IBM B2B INTEGRATOR BENCHMARKING IN THE SOFTLAYER ENVIRONMENT 215-4-14 Authors: Deep Chatterji (dchatter@us.ibm.com) Steve McDuff (mcduffs@ca.ibm.com) CONTENTS Disclaimer...3 Pushing the limits of B2B Integrator...4

More information

CS118 Discussion Week 2. Taqi

CS118 Discussion Week 2. Taqi CS118 Discussion Week 2 Taqi Outline Any Questions for Course Project 1? Socket Programming: Non-blocking mode Lecture Review: Application Layer Much of the other related stuff we will only discuss during

More information

Netchannel 2: Optimizing Network Performance

Netchannel 2: Optimizing Network Performance Netchannel 2: Optimizing Network Performance J. Renato Santos +, G. (John) Janakiraman + Yoshio Turner +, Ian Pratt * + HP Labs - * XenSource/Citrix Xen Summit Nov 14-16, 2007 2003 Hewlett-Packard Development

More information

Solid State Storage Technologies. Jin-Soo Kim Computer Systems Laboratory Sungkyunkwan University

Solid State Storage Technologies. Jin-Soo Kim Computer Systems Laboratory Sungkyunkwan University Solid State Storage Technologies Jin-Soo Kim (jinsookim@skku.edu) Computer Systems Laboratory Sungkyunkwan University http://csl.skku.edu NVMe (1) The industry standard interface for high-performance NVM

More information

<Insert Picture Here> Boost Linux Performance with Enhancements from Oracle

<Insert Picture Here> Boost Linux Performance with Enhancements from Oracle Boost Linux Performance with Enhancements from Oracle Chris Mason Director of Linux Kernel Engineering Linux Performance on Large Systems Exadata Hardware How large systems are different

More information

EbbRT: A Framework for Building Per-Application Library Operating Systems

EbbRT: A Framework for Building Per-Application Library Operating Systems EbbRT: A Framework for Building Per-Application Library Operating Systems Overview Motivation Objectives System design Implementation Evaluation Conclusion Motivation Emphasis on CPU performance and software

More information

Network device drivers in Linux

Network device drivers in Linux Network device drivers in Linux Aapo Kalliola Aalto University School of Science Otakaari 1 Espoo, Finland aapo.kalliola@aalto.fi ABSTRACT In this paper we analyze the interfaces, functionality and implementation

More information

Infiniswap. Efficient Memory Disaggregation. Mosharaf Chowdhury. with Juncheng Gu, Youngmoon Lee, Yiwen Zhang, and Kang G. Shin

Infiniswap. Efficient Memory Disaggregation. Mosharaf Chowdhury. with Juncheng Gu, Youngmoon Lee, Yiwen Zhang, and Kang G. Shin Infiniswap Efficient Memory Disaggregation Mosharaf Chowdhury with Juncheng Gu, Youngmoon Lee, Yiwen Zhang, and Kang G. Shin Rack-Scale Computing Datacenter-Scale Computing Geo-Distributed Computing Coflow

More information

An Experimental review on Intel DPDK L2 Forwarding

An Experimental review on Intel DPDK L2 Forwarding An Experimental review on Intel DPDK L2 Forwarding Dharmanshu Johar R.V. College of Engineering, Mysore Road,Bengaluru-560059, Karnataka, India. Orcid Id: 0000-0001- 5733-7219 Dr. Minal Moharir R.V. College

More information

A Look at Intel s Dataplane Development Kit

A Look at Intel s Dataplane Development Kit A Look at Intel s Dataplane Development Kit Dominik Scholz Chair for Network Architectures and Services Department for Computer Science Technische Universität München June 13, 2014 Dominik Scholz: A Look

More information

Networking at the Speed of Light

Networking at the Speed of Light Networking at the Speed of Light Dror Goldenberg VP Software Architecture MaRS Workshop April 2017 Cloud The Software Defined Data Center Resource virtualization Efficient services VM, Containers uservices

More information

CPU Scheduling. Operating Systems (Fall/Winter 2018) Yajin Zhou ( Zhejiang University

CPU Scheduling. Operating Systems (Fall/Winter 2018) Yajin Zhou (  Zhejiang University Operating Systems (Fall/Winter 2018) CPU Scheduling Yajin Zhou (http://yajin.org) Zhejiang University Acknowledgement: some pages are based on the slides from Zhi Wang(fsu). Review Motivation to use threads

More information