Apple. Massive Scale Deployment / Connectivity. This is not a contribution

Similar documents
Zevenet EE 4.x. Performance Benchmark.

Heap Off Memory WTF?? Reducing Heap memory stress

G1 Garbage Collector Details and Tuning. Simone Bordet

Eduardo

Web. Computer Organization 4/16/2015. CSC252 - Spring Web and HTTP. URLs. Kai Shen

RxNetty vs Tomcat Performance Results

Top Ten Enterprise Java performance problems. Vincent Partington Xebia

September 15th, Finagle + Java. A love story (

HTTP, WebSocket, SPDY, HTTP/2.0

Application Layer Introduction; HTTP; FTP

Kotlin/Native concurrency model. nikolay

The Google File System

Webcenter Application Performance Tuning guide

Questions answered in this lecture: CS 537 Lecture 19 Threads and Cooperation. What s in a process? Organizing a Process

Sybase Adaptive Server Enterprise on Linux

Servlet Performance and Apache JServ

The G1 GC in JDK 9. Erik Duveblad Senior Member of Technical Staf Oracle JVM GC Team October, 2017

Inside the Garbage Collector

SEDA: An Architecture for Well-Conditioned, Scalable Internet Services

Tuesday, June 22, JBoss Users & Developers Conference. Boston:2010

A Library and Proxy for SPDY

Operating Systems. Lecture File system implementation. Master of Computer Science PUF - Hồ Chí Minh 2016/2017

Low Latency Java in the Real World

(if you can t read this, move closer!) The high-performance protocol construction toolkit. Peter Royal

Capriccio : Scalable Threads for Internet Services

Hierarchical PLABs, CLABs, TLABs in Hotspot

Shenandoah: An ultra-low pause time garbage collector for OpenJDK. Christine Flood Roman Kennke Principal Software Engineers Red Hat

SpongeFiles: Mitigating Data Skew in MapReduce Using Distributed Memory. Khaled Elmeleegy Turn Inc.

SE Memory Consumption

SPDY - A Web Protocol. Mike Belshe Velocity, Dec 2009

SSL/TLS and HTTP/2 State of the Art in Our Servers Jean-Frederic Clere

Grand Central Dispatch and NSOperation. CSCI 5828: Foundations of Software Engineering Lecture 28 12/03/2015

Using OpenSSL to boost Tomcat. Jean-Frederic Clere

SE Memory Consumption

Events, Memory Management

SAMPLE CHAPTER. Norman Maurer Marvin Allen Wolfthal. FOREWORD BY Trustin Lee MANNING

High Performance Transactions in Deuteronomy

Java On Steroids: Sun s High-Performance Java Implementation. History

JVM and application bottlenecks troubleshooting

Oracle WebCenter Portal Performance Tuning

JAVA PERFORMANCE. PR SW2 S18 Dr. Prähofer DI Leopoldseder

The Google File System

Performance Best Practices Paper for IBM Tivoli Directory Integrator v6.1 and v6.1.1

Java Performance: The Definitive Guide

JVM Troubleshooting MOOC: Troubleshooting Memory Issues in Java Applications

AppGate 11.0 RELEASE NOTES

SSL/TLS and HTTP/2 State of the Art in Our Servers Jean-Frederic Clere

Jaguar: Enabling Efficient Communication and I/O in Java

SCYLLA: NoSQL at Ludicrous Speed. 主讲人 :ScyllaDB 软件工程师贺俊

The Google File System

Asynchronous Events on Linux

OPERATING SYSTEM. Chapter 9: Virtual Memory

THE TROUBLE WITH MEMORY

Today. Adding Memory Does adding memory always reduce the number of page faults? FIFO: Adding Memory with LRU. Last Class: Demand Paged Virtual Memory

Using TCnative with Comet/Asynch. Jean-Frederic Clere, Red Hat November 9th

Virtualizing JBoss Enterprise Middleware with Azul

Memory Management. q Basic memory management q Swapping q Kernel memory allocation q Next Time: Virtual memory

Concurrent Garbage Collection

ION - Large pages for devices

Installing on WebLogic Server

(if you can t read this, move closer!) The high-performance protocol construction toolkit. Peter Royal

Kodewerk. Java Performance Services. The War on Latency. Reducing Dead Time Kirk Pepperdine Principle Kodewerk Ltd.

NPTEL Course Jan K. Gopinath Indian Institute of Science

Processes, Threads and Processors

Chapter 8: Virtual Memory. Operating System Concepts

FAQ. Release rc2

CLOUD-SCALE FILE SYSTEMS

Certified Apache Cassandra Professional VS-1046

Last Class: Demand Paged Virtual Memory

Performance Profiling. Curtin University of Technology Department of Computing

A Study Paper on Performance Degradation due to Excessive Garbage Collection in Java Based Applications using Profiler

Systems I: Programming Abstractions

VMem. By Stewart Lynch.

HDFS: Hadoop Distributed File System. Sector: Distributed Storage System

High performance and scalable architectures

One Server Per City: Using TCP for Very Large SIP Servers. Kumiko Ono Henning Schulzrinne {kumiko,

Shenandoah: An ultra-low pause time garbage collector for OpenJDK. Christine Flood Principal Software Engineer Red Hat

What s in a traditional process? Concurrency/Parallelism. What s needed? CSE 451: Operating Systems Autumn 2012

Flash: an efficient and portable web server

! Design constraints. " Component failures are the norm. " Files are huge by traditional standards. ! POSIX-like

GFS: The Google File System

The Z Garbage Collector Scalable Low-Latency GC in JDK 11

Performance Analysis of Java Communications with and without CORBA

A RESTful Java Framework for Asynchronous High-Speed Ingest

Distributed Systems 16. Distributed File Systems II

Web Client And Server

The Z Garbage Collector An Introduction

Tuning Intelligent Data Lake Performance

Grand Central Dispatch. Sri Teja Basava CSCI 5528: Foundations of Software Engineering Spring 10

Processes COMPSCI 386

The TCPProxy. Table of contents

An Introduction to CICS JVMServers

What s Wrong with the Operating System Interface? Collin Lee and John Ousterhout

Java performance - not so scary after all

Comet and WebSocket Web Applications How to Scale Server-Side Event-Driven Scenarios

Compact Off-Heap Structures in the Java Language

CSE 4/521 Introduction to Operating Systems

My malloc: mylloc and mhysa. Johan Montelius HT2016

The C4 Collector. Or: the Application memory wall will remain until compaction is solved. Gil Tene Balaji Iyengar Michael Wolf

Optimizing RDM Server Performance

Transcription:

Netty @ Apple Massive Scale Deployment / Connectivity

Norman Maurer Senior Software Engineer @ Apple Core Developer of Netty Formerly worked @ Red Hat as Netty Project Lead (internal Red Hat) Author of Netty in Action (Published by Manning) Apache Software Foundation Eclipse Foundation

Massive Scale

Massive Scale What does Massive Scale mean Instances of Netty based Services in Production: 400,000+ Data / Day: 10s of PetaBytes Requests / Second: 10s of Millions Versions: 3.x (migrating to 4.x), 4.x

Part of the OSS Community Contributing back to the Community 250+ commits from Apple Engineers in 1 year

Services Using an Apple Service? Chances are good Netty is involved somehow.

Areas of importance Native Transport TCP / UDP / Domain Sockets PooledByteBufAllocator OpenSslEngine ChannelPool Build-in codecs + custom codecs for different protocols

With Scale comes Pain

JDK NIO some pains

Some of the pains Selector.selectedKeys() produces too much garbage NIO implementation uses synchronized everywhere! Not optimized for typical deployment environment (support common denominator of all environments) Internal copying of heap buffers to direct buffers

JNI to the rescue Java J N I C/C++ Optimized transport for Linux only Supports Linux specific features Directly operate on pointers for buffers Synchronization optimized for Netty s Thread-Model

Native Transport epoll based high-performance transport NIO Transport Bootstrap bootstrap = new Bootstrap().group( new NioEventLoopGroup()); bootstrap.channel(niosocketchannel.class); Native Transport Bootstrap bootstrap = new Bootstrap().group( new EpollEventLoopGroup()); bootstrap.channel(epollsocketchannel.class); Less GC pressure due less Objects Advanced features SO_REUSEPORT TCP_CORK, TCP_NOTSENT_LOWAT TCP_FASTOPEN TCP_INFO LT and ET Unix Domain Sockets

Buffers

JDK ByteBuffer Direct buffers are free ed by GC Not run frequently enough May trigger GC Hard to use due not separate indices

Buffers Direct buffers == expensive Heap buffers == cheap (but not for free*) Fragmentation *byte[] needs to be zero-out by the JVM!

Buffers - Memory fragmentation Waste memory May trigger GC due lack of coalesced free memory Can t insert int here as we need 4 continuous slots

Allocation times Unpooled Heap Pooled Heap Unpooled Direct Pooled Direct 6000 4500 NanoSeconds 3000 1500 0 0 256 1024 4096 16384 65536 Bytes

PooledByteBufAllocator Thread 1 ThreadLocal Cache 1 Based on jemalloc paper (3.x) Thread 2 ThreadLocal caches for lock-free allocation in most cases #808 ThreadLocal Cache 2 Synchronize per Arena that holds the different chunks of memory Arena 1 Arena 2 Arena 3 Size-classes Size-classes Size-classes Different size classes Reduce fragmentation

ThreadLocal caches 4000 Cache Title No Cache Able to enable / disable ThreadLocal caches Contention Count 3000 2000 1000 Fine tuning of Caches can make a big difference Best effect if number of allocating Threads are low. Using ThreadLocal + MPSC queue #3833 0

JDK SSL Performance. it s slow!

Why handle SSL directly? Secure communication between services Used for HTTP2 / SPDY negotiation Advanced verification of Certificates Unfortunately JDK's SSLEngine implementation is very slow :(

HTTPS Benchmark JDK SSLEngine implementation HTTP/1.1 200 OK Content-Length: 15 Content-Type: text/plain; charset=utf-8 Server: Netty.io Date: Wed, 17 Apr 2013 12:00:00 GMT Hello, World! Response Result Running 2m test @ https://xxx:8080/plaintext 16 threads and 256 connections Thread Stats Avg Stdev Max +/- Stdev Latency 553.70ms 81.74ms 1.43s 80.22% Req/Sec 7.41k 595.69 8.90k 63.93% 14026376 requests in 2.00m, 1.89GB read Socket errors: connect 0, read 0, write 0, timeout 114 Requests/sec: 116883.21 Transfer/sec: 16.16MB Benchmark./wrk -H 'Host: localhost' -H 'Accept: text/html,application/xhtml+xml,application/ xml;q=0.9,*/*;q=0.8' -H 'Connection: keep-alive' -d 120 -c 256 -t 16 -s scripts/ pipeline-many.lua https://xxx:8080/plaintext

HTTPS Benchmark JDK SSLEngine implementation Unable to fully utilize all cores SSLEngine API limiting in some cases SSLEngine.unwrap( ) can only take one ByteBuffer as src

JNI based SSLEngine to the rescue Java J N I C/C++

JNI based SSLEngine one to rule them all Supports OpenSSL, LibreSSL and BoringSSL Based on Apache Tomcat Native Was part of Finagle but contributed to Netty in 2014

HTTPS Benchmark OpenSSL SSLEngine implementation HTTP/1.1 200 OK Content-Length: 15 Content-Type: text/plain; charset=utf-8 Server: Netty.io Date: Wed, 17 Apr 2013 12:00:00 GMT Hello, World! Response Result Running 2m test @ https://xxx:8080/plaintext 16 threads and 256 connections Thread Stats Avg Stdev Max +/- Stdev Latency 131.16ms 28.24ms 857.07ms 96.89% Req/Sec 31.74k 3.14k 35.75k 84.41% 60127756 requests in 2.00m, 8.12GB read Socket errors: connect 0, read 0, write 0, timeout 52 Requests/sec: 501120.56 Transfer/sec: 69.30MB Benchmark./wrk -H 'Host: localhost' -H 'Accept: text/html,application/xhtml+xml,application/ xml;q=0.9,*/*;q=0.8' -H 'Connection: keep-alive' -d 120 -c 256 -t 16 -s scripts/ pipeline-many.lua https://xxx:8080/plaintext

HTTPS Benchmark OpenSSL SSLEngine implementation All cores utilized! Makes use of native code provided by OpenSSL Low object creation Drop in replacement* *supported on Linux, OSX and Windows

Optimizations made Added client support: #7, #11, #3270, #3277, #3279 Added support for Auth: #10, #3276 GC-Pressure caused by heavy object creation: #8, #3280, #3648 Too many JNI calls: #3289 Proper SSLSession implementation: #9, #16, #17, #20, #3283, #3286, #3288 ALPN support #3481 Only do priming read if there is no space in dsts buffers #3958

Thread Model Thread Easier to reason about Event Loop I/O I/O I/O Channel Channel Channel Less worry about concurrency Easier to maintain Clear execution order

Thread Model I/O Channel Thread Event Loop Proxy I/O Channel public class ProxyHandler extends ChannelInboundHandlerAdapter { @Override public void channelactive(channelhandlercontext ctx) { final Channel inboundchannel = ctx.channel(); Bootstrap b = new Bootstrap(); b.group(inboundchannel.eventloop()); ctx.channel().config().setautoread(false); ChannelFuture f = b.connect(remotehost, remoteport); f.addlistener(f -> { if (f.issuccess()) { ctx.channel().config().setautoread(true); } else {...} }); } }

Backpressure Network Peer1 Peer2 TCP SND Slow? Slow? Fast TCP SND RCV RCV Slow? Fast Application Slow? OOME Application Slow peers due slow connection Risk of writing too fast Backoff writing and reading

Memory Usage Handling a lot of concurrent connections Need to safe memory to reduce heap sizes Use Atomic*FieldUpdater Lazy init fields

Connection Pooling Having an extensible connection pool is important #3607 flexible / extensible implementation

Thanks We are hiring! http://www.apple.com/jobs/us/