Bringing&the&Performance&to&the& Cloud &

Size: px

Start display at page:

Download "Bringing&the&Performance&to&the& Cloud &"

Georgia Nelson
5 years ago
Views:

1 Bringing&the&Performance&to&the& Cloud & Dongsu&Han& KAIST & Department&of&Electrical&Engineering& Graduate&School&of&InformaAon&Security& &

2 The&Era&of&Cloud&CompuAng& Datacenters&at&Amazon,&Google,&Facebook& & 2

3 Customers&of&the&Cloud& Customers& rent &either&physical&or&virtual& machines&from&the&cloud.& Public&and&private&cloud:&External&or&internal& customers&enaaes& &&&&&& 3&

4 Scale&of&the& Cloud & Facebook:& hundreds)of)thousands)of)machines. ) MicrosoP:&1&million&servers&& Google&envisions&10&million&servers.&& Google&spends&about&$3&Billion&every&year&on&data& centers.*& * sdata-center-spend-slows &

5 Efficiency&is&important& What&if&we&can&increase&a&single&machine s& performance&by&e.g.,&10,&20,&40%?& Equipment&cost&savings& Energy&savings:&By&2012,&the&cost&of&power&for&the& data&center&is&expected&to&exceed&the&cost&of&the& original&capital&investment.&[u.s&doe]& Reduced&complexity&in&system&design&as&#&of& machine&involved&decreases& 5&

6 How do we improve the Cloud efficiency (and performance)? We start with a single machine. What does a machine look like? What kind of workload does it handle? 6&

7 What&does&a&machine&look&like&today?& General&purpose&hardware&(x86&architecture)& MulAcore:&4&~60&cores&(tens&of&CPU&cores)& MulAple&10hGigabit&Ethernet&(becoming&the&norm)& & 7

8 A&Typical&Cluster&ConfiguraAon& Front-end Web tier Back-end Cache tier Service tier (e.g., Advertising) Storage tier 8&

9 A&Typical&Cluster&ConfiguraAon& Front-end Web tier Back-end Cache tier Front-end interaction happens over HTTP (TCP). Back-end interaction can use any protocol. 9&

10 A&Typical&Cluster&ConfiguraAon& Web tier Up&to&3x&improvement&in& performance&&& Cache tier Up&to&7x&improvement&in& performance&&& 1. Improving the performance of a cache server [NSDI 14] Joint work with H. Lim, M. Kaminsky, and D. Andersen 2. Improving the performance of a Web server [NSDI 14] w/ E. Jeong, S. Woo, M. Jamshed, H. Jeong, S. Ihm, and K. Park 10&

11 InhMemory&KeyhValue&Cache& Request& Keyhvalue&Cache& Database& Keyhvalues&are&stored&in&DRAM&( Memcache )& 11&

12 InhMemory&KeyhValue&Cache& Request& What is the workload of a K-V cache like? Keyhvalue&Cache& [SOSP]&2007,&2009,&2011&[NSDI]&2013a,&2013b&[EuroSys]&2012& [SIGCOMM]&2012&[SOCC]&2010,&2012&[SIGMETRICS]&2012&[ATC]&2013& 12& Database&

13 Workload&of&a&Keyhvalue&cache& CDF 5 different memcached pools Facebook&KhV&size&distribuAon&[SIGMETRICS2012]& 13&

14 Diverse&Workload&[SIGMETRICS2012]& Read intensive Write intensive 14&

15 Fast&InhMemory&KeyhValue&Cache& Must&handle&small,&variablehlength&items&efficiently& Support&both&readh&and&writehintensive&workloads& & 15&

16 Current&KhV&Cache&Performance& 80& 70& 60& 50& 40& 30& 20& 10& 0& 1.4& Workload:&YCSBhB&(95%&GET,&5%&PUT)& Throughput&(M&operaAons/sec)& <&1& 4.4& 8.9& 77.8& Memcached& RAMCloud& MemC3& Masstree& (Ideal)& Endhtohend&performance&using&UDP& 16&

17 ReadhIntensive&Workload& 80& 70& 60& 50& 40& 30& 20& 10& 0& 1.5& Workload:&YCSBhB&(95%&GET,&5%&PUT)& Throughput&(M&operaAons/sec)& 3.1& 6.3& 23.5& 77.8& Memcached& RAMCloud& MemC3& Masstree& (Ideal)& Endhtohend&performance&using&our&opAmized&network&stack& 17&

18 ReadhIntensive&Workload& 80& 70& 60& 50& 40& 30& 20& 10& 0& 1.5& Workload:&YCSBhB&(95%&GET,&5%&PUT)& Throughput&(M&operaAons/sec)& 3.1& Problem(1:( Large(gap( 6.3& 23.5& 77.8& Memcached& RAMCloud& MemC3& Masstree& (Ideal)& Endhtohend&performance&using&our&opAmized&network&stack& 18& Maximum& packets/sec& asainable&using& UDP&protocol&

19 WritehIntensive& Workload:&YCSBhA&(50%&GET,&50%&PUT)& Throughput&(M&operaAons/sec)& 80& 70& 60& 50& 40& 30& 20& 10& 0& Problem(2:( Performance(collapses( under(heavy(write( <&1& <&1& <&1& 19& 11.7& 77.8& Memcached& RAMCloud& MemC3& Masstree& (Ideal)& Endhtohend&performance&using&our&opAmized&network&stack& Maximum& packets/sec& asainable&using& UDP&protocol&

20 MICA&Performance&Preview& Workload:&YCSBhA&(50%&GET,&50%&PUT)& Throughput&(M&operaAons/sec)& 80& 70& 60& 50& 40& 30& 20& 10& 0& MICA:(70.1(Mops( <&1& <&1& <&1& 20& About(14(ns(per(request( Maximum& 77.8& packets/sec& asainable&using& UDP&protocol& 11.7& Memcached& RAMCloud& MemC3& Masstree& (Ideal)& Endhtohend&performance&using&our&opAmized&network&stack&

21 MICA&Approach& MICA&redesigns&a&KhV&cache&in&a&holisAc&way.&& Server& Client& NIC& CPU& CPU& Memory& 2.&Request& direcaon& 1.&Parallel& data&access& 3.&Keyhvalue& data&structures& 21&

22 Parallel&Data&Access& Server&node& Client& NIC& CPU& CPU& Memory& 2.&Request& direcaon& 1.&Parallel& data&access& 3.&Keyhvalue& data&structures& Modern&CPUs&have&many&cores&(8,&12,& )& We&must&exploit&CPU&parallelism&efficiently.& 22&

23 Concurrent&Read/Write&(CRCW)& Any&core&can&read/write&any&part&of&memory& & CPU&core& CPU&core& Memory& +)&Can&distribute&load&to&mulAple&cores& Memcached,&RAMCloud&[SOSP],&MemC3&[NSDI],& Masstree&[EuroSys]& h)&limits&scalability&with&mulaple&cores& Lock&contenAon& Expensive&cacheline&transfer&caused&by& concurrent&writes&on&the&same&memory&locaaon& 23&

24 MICA&Scales&Well&with&Many&Cores& Throughput&(Mops)& 1( YCSBhA& Skewed,&50%&GET& MemC3& Memcached& RAMCloud& 0.1( 2& 4& 8& 16& Number&of&CPU&cores& 24&

25 MICA&Scales&Well&with&Many&Cores& Throughput&(Mops)& 100( YCSBhA& Skewed,&50%&GET& 10( 1( MICA& Masstree& MemC3& Memcached& RAMCloud& 0.1( 2& 4& 8& 16& Number&of&CPU&cores& 25&

26 MICA s&parallel&data&access& ParAAon&data&using&the&hash&of&keys& Exclusive&Read/Write&(EREW)& Only&one&core&accesses&a&parAcular&parAAon& CPU&core& ParAAon& & CPU&core& ParAAon& +)&Avoids&synchronizaAon/interhcore& communicaaon&[hhstore,voltdb]& h)&can&be&slow&under&skewed&key&popularity& A&popular&item&cannot&be&served&by&mulAple&cores& 26&

27 EREW&Outperforms&CRCW& Throughput&(Mops)& 80& 77.0& 76.8& 70& 60& 50& 70.1& Skewed & &Same&as&YCSB s& Zipf&key&popularity&with&skewness=0.99& 70.2& 70.9& 70.3& 40& 30& 20& 10& 0& Uniform,&50%&GET& Uniform,&95%&GET& Skewed,&50%&GET& Skewed,&95%&GET& EREW& 35.4& 19.6& CRCW& 27&

28 Skew&Does&Not&Hurt&(Much)& Throughput&(Mops)& 8& 6& 4& 2& Some&cores&can&process&more& requests&under&skewed&workloads& Uniform& Skewed& & 0& 0& 1& 2& 3& 4& 5& 6& 7& 8& 9& 10& 11& 12& 13& 14& 15& Perhcore&performance&breakdown& Hot&parAAons&contain&a&few&popular&keys,& making&cpu&cache&very&effecave& Core&#& 28&

29 Request&DirecAon& Server&node& Client& NIC& CPU& CPU& Memory& 2.&Request& direcaon& 1.&Parallel& data&access& 3.&Keyhvalue& data&structures& EREW&requires&correct&request&direcAon.& A&request&must&be&sent&to&the&core/parAAon&that& handles&the&requested&key.& 29&

30 Common&Request&DirecAon&Scheme& FlowFbased(affinity( Server&node& Client& Client& NIC& CPU& CPU& Using&5htuple&to&assign&packets&to&cores& Useful&for&flowhbased&protocols&(e.g.,&TCP)& Does&not&work&well&with&MICA s&erew& A&client&can&request&keys&from&different&parAAons& 30&

31 MICA s&request&direcaon& ObjectFbased(affinity( MICA&overcomes&commodity&NICs &limited& programmability&by&using&client&assistance& & Server&node& Client& Client& Key&2& Key&1& Key&1& Key&2& NIC& CPU& CPU& Key s&paraaon&id&is&put&into& packet&header&(udp&dest&port)& Programmed&to&use&the&embedded&parAAon& ID&to&direct&packets&to&cores& Uses&Intel&Data&Plane&Development&Kit&(DPDK)&for& lowhoverhead&burst&packet&i/o&bypassing&os&kernel& 31&

32 NIC&HW&for&Request&DirecAon& Throughput&(Mops)& 80& 70& 60& 50& Uniform& Skewed& 77.0& 70.1& 40& 30& 20& 10& 34.8& 25.6& 0& SoPwarehonly& Clienthassisted&hardwarehbased& Using&EREW&for&parallel&data&access& 32&

33 KeyhValue&Data&Structures& Server&node& Client& NIC& CPU& CPU& Memory& & 2.&Request& direcaon& 1.&Parallel& data&access& 3.&Keyhvalue& data&structures& Significant&impact&on&keyhvalue&processing&speed& New&design&required&to&support&both&read&and& write&operaaons&at&high&speeds& 33&

34 MICA s&keyhvalue&data&structures& Each&parAAon&has&two&data&structures:& Circular&log&store&& Lossy&concurrent&hash&index& Omised&in&this&talk:&numerous&opAmizaAons& Garbage&collecAon& LRU&approximaAon& NUMAhaware&memory&allocaAon,&memory&mapping& memory&prefetching,&cachehfriendly&data&structures& Concurrency&support&(for& CREW )& 34&

35 Circular&Log&Store& Allocates&space&for&keyhvalue&items&of&any&length& Simple&garbage&collecAon&and&free&space& defragmentaaon& New&item&is& appended&at&tail& Head& Insufficient&space& for&new&item?& Tail& (fixed&log&size)& Evict&oldest&item& at&head&(fifo)& Tail& Head& 35&

36 Lossy&Concurrent&Hash&Index& Indexes&items&in&the&circular&log&with&a&seth associaave&hash&index& hash(key)& Circular& log& bucket&0& bucket&1& & bucket&nh1& Hash& index& Key,Val& Full&bucket?&Evict&oldest&entry&from&it&(FIFO)& Allows&fast&indexing&of&new&keyhvalue&items& 36&

37 KeyhValue&Data&Structure&Comparison& Throughput&(Mops)& 80& 70& 60& 50%&GET& 95%&GET& EREW,&Skewed,&Small&Items& 70.1& 70.2& 50& 40& 30& 29.6& 20& 10& 0& 6.9& ParAAoned&Masstree& MICA& 37&

38 Throughput&Comparison& Throughput&(Mops)& 80& 70& 60& 50& 40& Uniform,&50%&GET& Uniform,&95%&GET& Skewed,&50%&GET& Skewed,&95%&GET& 77.0& 76.8& 70.1& 70.2& 77.8& 77.8& 77.8& 77.8& 30& 20& 10& 0& 23.5& 17.4& 11.9& 11.7& 6.1& 6.3& 0.7& 1.5& 2.2& 3.1& 0.7& 1.5& 0.6& 0.5& 0.9& 0.9& Memcached& RAMCloud& MemC3& Masstree& MICA& (Ideal)& Endhtohend&performance&using&our&opAmized&network&stack& 38&

39 ThroughputhLatency&(on&Ethernet)& Average&latency&(μs)& 140& 120& 100& 80& 60& 40& 20& 140& 120& 100& 80& Memcached&UDP& MICA& 60& Two(orders(of(magnitude(( 40& 20& 0& 0& 0.1& 0.2& 0.3& 0& 0& 20& 40& 60& 80& Throughput&(Mops)& Memcached&UDP&using&standard&socket&I/O& 39&

40 Summary& MICA&takes&a&holisAc&approach&to&designing& fast&inhmemory&keyhvalue&caches.& Efficient&parallel&data&access&& Hardwarehbased&request&direcAon& OpAmized&data&structures&for&keyhvalue&caching& MICA&consistently&achieves&high&performance& under&diverse&workloads.& 40&

41 Typical&Server&Cluster&ConfiguraAon& Web tier Up&to&3x&improvement&in& performance&&& Cache tier Up&to&7x&improvement&in& performance&&& 1. Improving the performance of a cache server [NSDI 14] Joint work with H. Lim, M. Kaminsky, and D. Andersen 2. Improving the performance of a Web server [NSDI 14] E. Jeong, S. Woo, M. Jamshed, H. Jeong, S. Ihm, and K. Park 41&

42 Workload&for&Userhfacing&Servers Measurement of TCP flows in commercial cellular backbone [Woo,mobisys 13] 100%& 80%& 90%& CDF 60%& 40%& 20%& 0%& 0& 1K& 4K& 8K& 16K& 32K& 64K&128K&256K&512K&1M&Total& Flow&Size&(Bytes)& Over 90% (50%) of TCP flows are smaller than 64 KB (4 KB). 42

43 Web&Server&Performance Large&transfers:&easy&to&fill&up&10&Gbps&& Small&transacAons:&1.2&Gbps&under&SpecWeb) Kernel&is&not&designed&well&for&mulAcore&systems.& ConnecKons/sec((x(10 5 ) 2.5&& 2.0&& 1.5&& 1.0&& 0.5&& 0.0&& TCP(ConnecKon(Setup(Performance Linux: Intel Xeon E Intel 10Gbps NIC Performance Meltdown 1& 2& 4& 6& 8& Number(of(CPU(Cores 43

44 Performance&Analysis&of&a&Web&Server 83% of CPU usage spent inside kernel! CPU&usage&on&Linuxh3.10& Web&server&(Lighspd)&serving&a&64&byte&file 17%& 45%& 34%& 4%& Kernel& Packet&I/O& TCP/IP& ApplicaAon& 44

45 Inefficiencies&in&Kernel&TCP/IP&Stack& 1.&Lack&of&connecAon&locality& & & & ApplicaAon& thread CPU Packet/& TCP&processing&& (ksopirqd) TCP send/recv buffer Core 0 Core 1 packet 45&

46 Inefficiencies&in&Kernel&TCP/IP&Stack& 1.&Lack&of&connecAon&locality& & while (1) { epoll_wait( ) fd = accept(listen_fd, NULL); & read(fd, buf, 1024); & write(fd, buf, 1024); } while (1) { epoll_wait( ) fd = accept(listen_fd, NULL); read(fd, buf, 1024); write(fd, buf, 1024); } ApplicaKon(thread ApplicaKon(thread Core 0 Core 1 AccepthAffinity[EUROSYS 12]:&connecAon&affinity&(only)& Linux&SO_REUSEPORT&opAon&(v3.9.4):&perhcore&listen&socket& 46&

47 Inefficiencies&in&Kernel&TCP/IP&Stack& 2.&Shared&file&descriptor&space& &&&&&fd&=&accept(listen_fd,&null)& Socket API Kernel Socket layer Virtual file system layer e.g., vfs_open() File System VFS overhead: Creates inode for each socket file descriptor. Finds the lowest available integer [POSIX] 47&

48 Inefficiencies&in&Kernel&TCP/IP&Stack& 3.&System&call&overhead&(frequent&and&expensive)& & & & & & while (1) { epoll_wait( ) fd = accept(listen_fd, NULL); read(fd, buf, 1024); write(fd, buf, 1024); } 4.&Inefficient&perhpacket&processing& Perhpacket&memory&allocaAon/deallocaAon&overhead& MegaPipe[OSDI 12]:&parAally&address&problems&1,2,3&& & &All&prior&work&reuses&kernel s&tcp/ip.& 48&

49 mtcp&approach mtcp:&a&highhperformance&userhlevel&tcp& design&for&mulacore&systems& Cleanhslate&approach&to&divorce&kernel s& complexity& 1. Leverage&userhlevel&packet&I/O& 2. Support&mulAcorehaware&flow&processing& 3. Provide&a&userhlevel&socket&API& 49

50 mtcp&overview& ApplicaAon&thread User process ApplicaAon&thread User-space memory mtcp&socket&interface mtcp&thread mtcp&epollhlike&& event&system mtcp&thread Kernel Userhlevel&Packet&I/O&library Device&Driver Core 0 Core 1 50&

51 mtcp&overview& Core-affinity (1) Per-core file descriptor, listen socket (2) Kernel bypass, No system call (3) Batched packet processing (4) Per-core Event Q Socket buffer ApplicaAon&thread mtcp&socket&interface mtcp&thread Userhlevel&Packet&I/O&library Device&Driver ApplicaAon&thread mtcp_accept() mtcp&epollhlike&& event&system mtcp&thread Event Q Socket buffer Core 0 Core 1 Packet queue Receiver-side scaling (H/W) 51&

52 mtcp&overview& Core-affinity (1) Per-core file descriptor, listen socket (2) Kernel bypass, No system call (3) Batched packet processing (4) ApplicaAon&thread mtcp&socket&interface mtcp&thread ApplicaAon&thread mtcp&epollhlike&& event&system mtcp&thread Userhlevel&Packet&I/O&library Device&Driver Core 0 Core 1 Packet queue 52&

53 mtcp&design& Highly&scalability&on&mulAcore&systems& 25x&faster&than&latest&Linux&version& 3x&faster&than&MegaPipe& Easy&to&use;&lisle&porAng&effort& Modified&29&lines&of&the&Apache&library&(out&of&66,493)&& EvaluaAon& HTTP&server/client:&lighspd,&Apache& Web&Replayer:&replays&cellular&backbone&traffic&(Korea)& SSL&Proxy& 53&

54 MulAcore&Scalability 64B&message&per&each&connecAon& Heavy&connecAon,&small&packet&processing&overhead& 25x&Linux,&5x&REUSEPORT,&3x&MegaPipe&[OSDI&2012]& TransacKons/sec((x(10 5 ) 15& 12& 9& 6& 3& 0& 0& 1 2& 4& 6& 8& Number(of(CPU(Cores Linux& REUSEPORT& MegaPipe& mtcp&

55 Message&Benchmark Scaling&by&message&size&& Persistent&connecAon&with&64&byte&messages Throughput((Gbps) 10&& 8&& 6&& 4&& 2&& 0&& Linux& REUSEPORT& MegaPipe& mtcp& Link&saturaAon 64B& 256B& 1KiB& 4KiB& 8KiB& Message(Size TransacKons/sec((x(10 5 ) 40& 30& 20& 10& 0& 1& 2& 8& 32& 64& 128& #(Messages/connecKon

56 Web&Server&(Lighspd)&Performance SpecWeb2009&staAc&file&workload&(738B&average)& 3.2x&faster&than&Linux,&1.5x&faster&than&MegaPipe& Throughput&(Gbps) 5& 4& 3& 2& 1& 0& Throughput&(Gbps)& TransacAons/sec& Linux& REUSEPORT& MegaPipe& mtcp& 4& 3.5& 3& 2.5& 2& 1.5& 1& 0.5& 0& TransacAons/s&(x&10 5 )

57 Summary mtcp:&a&highhperformance&userhlevel&tcp&stack&for& mulacore&systems& Efficiently&uAlize&mulAcore&resources&by& EliminaAng&system&call&overhead& Reducing&context&switch&cost&by&event&batching& Using&perhcore&resource&management& Using&cachehaware&threading& Achieve&high&performance&scalability& Small&message&transacAons:&3x&to&25x& ExisAng&applicaAons:&&33%&(SSLShader)&to&320%&(lighspd)&

58 Conclusion& Despite&many&efforts&from&academia&and&industry,& there&sall&exists&lots&of&room&for&innovaaons&for& Cloudhbased&systems&and&services.& EssenAal&building&blocks&for&Cloud&services&can& benefit&from&a&holisac,&mulacorehaware&design& that&leverages&the&underlying&h/w&and&that& carefully&considers&the&workload.& More&research&is&ahead&in&bringing&new& applicaaons&to&the&cloud.&& 58&

59 Reference& [DPDK]& hsp:// technology/packethprocessinghishenhancedhwithhsopwarehfromhintelh dpdk.html& [FacebookMeasurement]&Berk&AAkoglu,&Yuehai&Xu,&Eitan& Frachtenberg,&Song&Jiang,&and&Mike&Paleczny.&Workload&analysis&of&a& largehscale&keyhvalue&store.&in&proc.)sigmetrics)2012.) [Masstree]&Yandong&Mao,&Eddie&Kohler,&and&Robert&Tappan&Morris.& Cache&CraPiness&for&Fast&MulAcore&KeyhValue&Storage.&In&Proc.) EuroSys)2012.) [MemC3]&Bin&Fan,&David&G.&Andersen,&and&Michael&Kaminsky.&MemC3:& Compact&and&Concurrent&MemCache&with&Dumber&Caching&and& Smarter&Hashing.&In&Proc.)NSDI)2013.) [Memcached]&hsp://memcached.org/& [RAMCloud]&Diego&Ongaro,&Stephen&M.&Rumble,&Ryan&Stutsman,&John& Ousterhout,&and&Mendel&Rosenblum.&Fast&Crash&Recovery&in& RAMCloud.&In&Proc.)SOSP)2011.) 59&

MegaPipe: A New Programming Interface for Scalable Network I/O

MegaPipe: A New Programming Interface for Scalable Network I/O Sangjin Han in collabora=on with Sco? Marshall Byung- Gon Chun Sylvia Ratnasamy University of California, Berkeley Yahoo! Research tl;dr?