Software Architecture for Highly Available, Scalable Trading Apps: Meeting Low-Latency Requirements Intentionally
Craig Blitz, Oracle Coherence Product Management
Copyright 2011, Oracle and/or its affiliates. All rights reserved.
The following is intended to outline general product use and direction. It is intended for information purposes only, and may not be incorporated into any contract. It is not a commitment to deliver any material, code, or functionality, and should not be relied upon in making purchasing decisions. The development, release, and timing of any features or functionality described for Oracle's products remains at the sole discretion of Oracle.
Agenda
- Level-Setting: Why We Care and What We Mean
- Legacy Solutions and Architectural Patterns
- A New Paradigm
Why Care About Scalability?
This is a low-latency event, isn't it?
- Growth is driven by multiple factors: trading growth, product growth, and customer acquisition
- All in the context of competitive pressures
  - Low latency = more business
  - High latency = Sucker!
OK, So We'll Just Scale Up
- Application deployments deliver low latency at given loads
- Scale-up strategies are risky
  - They depend on systems growing larger and larger
  - All components must still be able to scale up
Limits to Scale-Up
- Size of available systems
- Programming constraints
- JVM garbage collection
- Network capacity
What Do We Mean By Scalability?
- Scale linearly and predictably by adding resources as load increases
- Question: does the system on the right scale?
[Chart: throughput (axis 0 to 900) for 2, 4, 8, and 16 systems]
What Do We Mean By Scalability?
- Latency must not change
- Are we there yet?
[Chart: throughput (axis 0 to 900) and mean latency (axis 0 to 2.5) for 2, 4, 8, and 16 systems]
What Do We Mean By Scalability?
- Doh! Increased standard deviation means increased SLA failures
- Done? OK. Enough.
[Chart: throughput (axis 0 to 900), mean latency, and latency standard deviation (axis 0 to 2.5) for 2, 4, 8, and 16 systems]
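The three scalability slides amount to three checks: throughput grows in proportion to the number of systems, mean latency stays flat, and the latency standard deviation stays flat. A minimal sketch of those checks follows. The function name `scaling_report`, the 10% tolerance, and the sample measurements are all assumptions made for illustration; none of them come from the presentation.

```python
from statistics import mean, pstdev


def scaling_report(runs, tolerance=0.10):
    """Compare each run against the smallest cluster.

    runs: list of (system_count, throughput, latency_samples), smallest first.
    tolerance: allowed relative growth in latency mean and std dev (assumed 10%).
    """
    base_systems, base_throughput, base_latencies = runs[0]
    base_mean = mean(base_latencies)
    base_std = pstdev(base_latencies)
    report = []
    for systems, throughput, latencies in runs:
        # Linear scaling means throughput grows in proportion to system count.
        ideal = base_throughput * systems / base_systems
        report.append({
            "systems": systems,
            "efficiency": throughput / ideal,
            "mean_ok": mean(latencies) <= base_mean * (1 + tolerance),
            "stddev_ok": pstdev(latencies) <= base_std * (1 + tolerance),
        })
    return report


# Hypothetical measurements: throughput scales linearly and the mean latency
# is unchanged at 8 systems, but the latency spread is much wider.
runs = [
    (2, 100.0, [1.0, 1.1, 0.9, 1.0]),
    (4, 200.0, [1.0, 1.1, 0.9, 1.0]),
    (8, 400.0, [1.0, 2.0, 0.1, 0.9]),
]
report = scaling_report(runs)
```

In this made-up data set the 8-system run passes the throughput and mean-latency checks and fails only the standard deviation check, which is the situation the third chart describes.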
Why Care About High Availability?
- Revelation: not everyone does
  - "We'll just stop trading if we crash"
  - But if HA were free, this would be silly
  - How cheap does it have to be?
- Downtime = lost opportunity, at the very least
- But more scalability = more chance of component failure
- HA needs to be scalable, architectural, and strategic
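The point that more scalability means more chance of component failure can be made concrete with a little arithmetic. The sketch below assumes that each component fails independently with the same probability over some period; that independence assumption and the 1% figure are illustrative and not taken from the slides.

```python
def prob_any_failure(p_single, components):
    """Probability that at least one of `components` fails, assuming
    independent failures that each occur with probability `p_single`."""
    return 1.0 - (1.0 - p_single) ** components


# Hypothetical 1% per-component failure probability over some period.
for n in (2, 4, 8, 16):
    print(n, "systems:", round(prob_any_failure(0.01, n), 4))
```

Under these assumptions the chance of seeing at least one failure rises from about 2% with 2 systems to roughly 15% with 16, which is why HA has to scale along with the deployment.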
Conceptual Trading & Risk Platform
Scalable Apps, Stove-piped Systems
Simplified process: Pre-Trade Analysis, Order Management, Trade Execution, Post-Trade Analysis
- Scalable best practices applied per application
- Low-latency messaging to communicate between applications
Challenges Scaling Stovepipe Architectures
Which came first, the organization or the silo?
Technical
- Who is managing state? Database? Distributed cache?
- Processing does not scale with data
- Excessive data movement
- HA managed at component level
- Low-level messaging must scale
- Deploying systems and networks for new scale requirements is difficult
Organizational
- Many organizations involved: app teams, networking, QA, systems, database
- Cross-org communications are difficult
- Vested interests
Scalable Apps, Stove-piped Systems: A Little Better
Components: Pre-Trade Analysis, Order Management, Trade Execution, Post-Trade Analysis, Distributed Cache
- Recoverable state managed on the data tier
- Data tier scalable as demand or data grows
- Still expensive (will revisit this later)
Distributed Caching and Data Grids
Distributed Caches
- Scalable object caching across multiple servers
- Possibly lossy: it's a cache!
- No clustering or backups
- Read-through/write-through to data sources
- Expiration
- Eviction
Data Grids
- Processing scales with data
- Event model
- Cannot be lossy
- Clustering and backups
- Death detection and transparent recovery
- Queries
- Map-Reduce
- Aggregations
- Write-behind
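The distributed cache features listed above (read-through, write-through, expiration, eviction) can be illustrated with a small single-process sketch. This is not the Coherence API: the class name `ReadThroughCache`, its parameters, the least-recently-used eviction policy, and the use of a plain dictionary as the backing data source are all assumptions chosen for illustration.

```python
from collections import OrderedDict


class ReadThroughCache:
    """Toy cache with read-through, write-through, expiration, and LRU eviction."""

    def __init__(self, store, max_entries, ttl_seconds, clock):
        self.store = store            # backing data source (a dict stands in for a database)
        self.max_entries = max_entries
        self.ttl = ttl_seconds
        self.clock = clock            # callable returning the current time in seconds
        self.entries = OrderedDict()  # key -> (value, expires_at), least recently used first
        self.loads = 0                # number of reads that went to the data source

    def get(self, key):
        hit = self.entries.get(key)
        if hit is not None and hit[1] > self.clock():
            self.entries.move_to_end(key)  # mark as most recently used
            return hit[0]
        # Miss or expired entry: read through to the data source.
        value = self.store[key]
        self.loads += 1
        self._remember(key, value)
        return value

    def put(self, key, value):
        # Write-through: the data source is updated before the call returns.
        self.store[key] = value
        self._remember(key, value)

    def _remember(self, key, value):
        self.entries[key] = (value, self.clock() + self.ttl)
        self.entries.move_to_end(key)
        while len(self.entries) > self.max_entries:
            self.entries.popitem(last=False)  # evict the least recently used entry
```

Everything here lives in one process with no backups, so a crash loses the cached entries. That matches the "possibly lossy" description of a plain cache, and it is the gap that the data grid features on the same slide are meant to close.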
From Distributed Cache
- Fast, scalable, highly available access to application objects
- App tier and data tier scale separately
- Too many network roundtrips for low latency
- Lock held across many network roundtrips
[Diagram: the App sends LOCK (1), GET (3), PUT (4), and UNLOCK (6) to the Cache Server holding the Primary Partition, which sends LOCK (2), PUT (5), and UNLOCK (7) to the Cache Server holding the Backup Partition]
To Data Grid
- Processing moved to the data grid
- App tier and data tier scale together
- Lockless processing
- Transactional processing on co-located related objects (trade and orders)
- State always recoverable
[Diagram: the Client Tier sends INVOKE to the Primary Partition (App) on one Cache Server, which sends BACKUP to the Backup Partition (App) on another Cache Server]
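The difference between the two diagrams is mostly a message count: seven messages for lock, get, put, and unlock against a cache, versus two when the update is shipped to the data. The sketch below models that count in a single process. The `Partition` class, its method names, and the order-fill example are hypothetical and are not the Coherence API; the per-operation message counts follow the numbering in the diagrams.

```python
class Partition:
    """A primary partition and its backup; counts network messages."""

    def __init__(self):
        self.primary = {}
        self.backup = {}
        self.locked = set()
        self.messages = 0

    # Distributed-cache style operations, each issued by the remote app.
    def lock(self, key):
        self.messages += 2  # app -> primary, primary -> backup
        if key in self.locked:
            raise RuntimeError("already locked")
        self.locked.add(key)

    def get(self, key):
        self.messages += 1  # app -> primary
        return self.primary.get(key)

    def put(self, key, value):
        self.messages += 2  # app -> primary, primary -> backup
        self.primary[key] = value
        self.backup[key] = value

    def unlock(self, key):
        self.messages += 2  # app -> primary, primary -> backup
        self.locked.discard(key)

    # Data-grid style operation: ship the function to where the data lives.
    def invoke(self, key, fn):
        self.messages += 2  # app -> primary (INVOKE), primary -> backup (BACKUP)
        value = fn(self.primary.get(key))
        self.primary[key] = value
        self.backup[key] = value
        return value


def fill(order, qty):
    """Pure update: reduce the open quantity of an order."""
    return {**order, "open": order["open"] - qty}


def fill_via_cache(partition, order_id, qty):
    # The lock is held across the get and put roundtrips.
    partition.lock(order_id)
    try:
        partition.put(order_id, fill(partition.get(order_id), qty))
    finally:
        partition.unlock(order_id)


def fill_via_grid(partition, order_id, qty):
    # No client-side lock: the update runs next to the data.
    partition.invoke(order_id, lambda order: fill(order, qty))
```

Both paths leave the same order state in the primary and the backup. The sketch does not model how a real grid serializes concurrent invocations on the same key; it only shows why moving the processing to the data removes roundtrips and the client-held lock.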
To Event Driven Architecture
- Live Objects listen to their own state changes to schedule the next process phase
- State (and hence processing) always recoverable
- Eliminates the need for messaging between application processors
- Highly scalable, completely asynchronous
[Diagram: the Client Tier sends INVOKE to the Primary Partition on one Cache Server, where Process 1, Process 2, and Process 3 run; each process sends BACKUP to the Backup Partition (App) on another Cache Server]
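The Live Object idea can be sketched as a store in which every state change is backed up and then raises an event that schedules the next phase for that same object. In the sketch below the `Grid` class, the phase functions, and the state names (NEW, VALIDATED, EXECUTED, SETTLED) are hypothetical stand-ins for Process 1 through 3 in the diagram, and a simple in-process event queue stands in for asynchronous event delivery.

```python
from collections import deque


def validate(trade):
    return {**trade, "state": "VALIDATED"}


def execute(trade):
    return {**trade, "state": "EXECUTED"}


def settle(trade):
    return {**trade, "state": "SETTLED"}


# The current state alone decides which phase runs next.
NEXT_PHASE = {"NEW": validate, "VALIDATED": execute, "EXECUTED": settle}


class Grid:
    """Toy store: every put is backed up and raises a state-change event."""

    def __init__(self):
        self.primary = {}
        self.backup = {}
        self.history = []      # (key, state) for every change, for inspection
        self.events = deque()  # pending state-change events

    def put(self, key, obj):
        self.primary[key] = obj
        self.backup[key] = dict(obj)  # state is backed up with every change
        self.history.append((key, obj["state"]))
        self.events.append(key)

    def run(self):
        # Drain events; each one lets the object schedule its own next phase.
        while self.events:
            key = self.events.popleft()
            obj = self.primary[key]
            phase = NEXT_PHASE.get(obj["state"])
            if phase is not None:
                self.put(key, phase(obj))
```

Because the next phase is derived only from the stored state, reloading an object from the backup and raising its event again resumes processing where it stopped. No message queue between the phases is needed, which is the property the slide highlights.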
Oracle Coherence Data Grid
Distributed In-Memory Data Management
- Provides a reliable data tier with a single, consistent view of data
- Enables dynamic data capacity, including fault tolerance and load balancing
- Ensures that data capacity scales with processing capacity
[Diagram labels: Enterprise Applications; Real Time Clients; Data Services; Web Services; Oracle Coherence Data Grid; Databases; Mainframes; Web Services]
Coherence: A Unique Approach
- In Coherence, members share responsibilities (health, services, data)
  - No Single Points of Bottleneck (SPOBs)
  - No Single Points of Failure (SPOFs)
  - Linearly scalable to hundreds of servers by design
- Servers form a full mesh
  - No masters/slaves, etc.
  - Data Grid members work together as a team
  - Communication is almost always point-to-point
  - Scalable throughput up to the limit of the backplane
How Does Oracle Coherence Work?
- Data is load-balanced in memory across a cluster of servers
- Data is automatically and synchronously replicated to at least one other server for continuous availability
- Single System Image: logical view of all data on all servers
- Servers monitor the health of each other
- In the event a server fails or is unhealthy, other servers cooperatively diagnose the state
- The healthy servers immediately assume the responsibilities of the failed server
- Continuous Operation: no interruption of service or loss of data when a server fails
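The behavior described above (partitioned data, a synchronous backup on another server, and survivors taking over when a server fails) can be sketched in a few lines. This is a toy model and not how Coherence is implemented: the `Cluster` class, the round-robin partition assignment, and the CRC32-based key hashing are assumptions made for illustration.

```python
import zlib


class Cluster:
    """Toy partitioned store with one backup copy per partition."""

    def __init__(self, members, partitions=8):
        self.members = list(members)
        self.partitions = partitions
        self.store = {m: {} for m in self.members}  # member -> {partition: {key: value}}
        n = len(self.members)
        # Round-robin assignment: the backup always sits on a different member.
        self.owners = {p: [self.members[p % n], self.members[(p + 1) % n]]
                       for p in range(partitions)}

    def _partition(self, key):
        return zlib.crc32(key.encode()) % self.partitions

    def put(self, key, value):
        p = self._partition(key)
        for member in self.owners[p]:  # written to primary and backup together
            self.store[member].setdefault(p, {})[key] = value

    def get(self, key):
        p = self._partition(key)
        primary = self.owners[p][0]
        return self.store[primary].get(p, {}).get(key)

    def fail(self, dead):
        """Remove a member; survivors take over its partitions."""
        self.members.remove(dead)
        del self.store[dead]
        for p, pair in list(self.owners.items()):
            if dead not in pair:
                continue
            # The surviving copy becomes (or stays) the primary.
            survivor = pair[1] if pair[0] == dead else pair[0]
            candidates = [m for m in self.members if m != survivor]
            if candidates:
                # Re-create the lost copy on another live member.
                new_backup = candidates[0]
                self.store[new_backup][p] = dict(self.store[survivor].get(p, {}))
                self.owners[p] = [survivor, new_backup]
            else:
                self.owners[p] = [survivor]  # last member standing: no backup
```

A short usage example with hypothetical member names:

```python
cluster = Cluster(["A", "B", "C"])
cluster.put("order-1", {"qty": 100})
cluster.fail("B")
print(cluster.get("order-1"))
```

The read after the failure still returns the stored order, because either the primary copy was untouched or the backup copy was promoted.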
Trading Platform Example
- System designed as a Finite State Machine
- Data Affinity co-locates Orders and Market Matching Engines in the cluster
- Coherence manages recoverable state (always recoverable)
- Used the standard Java concurrency library for asynchronous tasks
- Individual components are unit-testable and provable, which simplifies development
- Throughput and performance depend on cores and network
- Designed to minimize storage and network tasks
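Data affinity means that an order is placed by the key of the thing it relates to rather than by its own identifier, so the order and its matching engine land in the same partition. A minimal sketch follows, assuming orders are associated with a matching engine by instrument symbol. The `OrderKey` class, the `associated_key` method, and the fixed partition count are hypothetical and are not the Coherence API.

```python
import zlib

PARTITIONS = 16  # assumed partition count for the sketch


def partition_for(routing_key):
    """Map a routing key to a partition with a stable hash."""
    return zlib.crc32(routing_key.encode()) % PARTITIONS


class OrderKey:
    """Key for an order; routed by its symbol rather than its own id."""

    def __init__(self, order_id, symbol):
        self.order_id = order_id
        self.symbol = symbol

    def associated_key(self):
        # Affinity: use the matching engine's key (the symbol) for placement.
        return self.symbol


def partition_of_order(key):
    return partition_for(key.associated_key())


def partition_of_engine(symbol):
    return partition_for(symbol)
```

With this placement, every order for a given symbol resolves to the same partition as that symbol's matching engine, so matching can run against local data without a network hop.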
Revisiting Silos
- Co-located processing elements (PEs) via Coherence EDA
  - Scaling and HA architected into the system
  - Messaging component between PEs eliminated
- Several teams still involved in elastic scaling
  - Need to procure, configure, and deploy new systems
  - Need to configure and test new systems on the network
- Latency much better than where we started
  - Removed network hops and data movement
  - Limited by network speed
Exabus: Exalogic I/O and Network Design
Eliminates cloud, cluster, and network virtualization I/O bottlenecks
[Diagram labels: Exalogic X2-2; Ethernet Gateway Switches; Spine Switch; IB; Data Center Service Network (10GbE); Standard Oracle Database; Data Center Mgmt Network (GbE); Management Switch; Exabus (InfiniBand I/O Backplane); Compute Nodes; Storage; Exadata; Exalogic; SPARC SuperCluster; Management Network (GbE); ZFS Storage]
Coherence Exabus Optimizations
- Direct Memory I/O for Java and C++
- Leverage new Java APIs and Exalogic Elastic Cloud Software
  - Low-latency support for InfiniBand
  - Optimized implementation for Exalogic InfiniBand
- Scalable to massively multi-core systems
- Surfacing low-level advanced networking capabilities
- 4x throughput, 6x better response time
Coherence on Exalogic Engineered System
Optimized Scalability and Performance in a Box
- Coherence optimized for Exabus
- Pre-configured, pre-optimized
- Elastic Data: expand capacity with flash
- Easy deployment as demand spikes
- Scale from ¼ rack to multi-rack
Risk Systems Built on Oracle Coherence

Credit Suisse
Challenges
- Achieve five-millisecond or lower response time for pre-transaction credit checks against counterparties globally
- Process intraday credit checks for a large number of transactions daily and scale by up to a factor of ten without risk of increase in latency
Solutions
- Built an in-memory application grid for its performance, resilience, and risk-free scale-out with Oracle Coherence and JRockit to achieve consistent low latency for credit checks
- Preferred Coherence for its simplicity, which enabled a team of four to deliver the system quickly and support it globally
- Coherence stores intraday data and processes credit checks
- Installed regional system instances to ensure proximity to clients and enable low latency and instant failover

JP Morgan Chase
Challenges
- Provide traders, researchers, and financial controllers with accurate, timely risk exposure and profit and loss (P&L) figures for the rates, exotics, and hybrids business in a volatile trading environment
- Gain drill-down from aggregated book-level to trade-level details and slice data in multiple dimensions while reducing preparation and run times to support real-time decisions
- Create a fully backed-up, highly redundant, lossless environment to guarantee data availability in case of IT failure
Solutions
- Built Project Orion, a risk exposure and P&L reporting solution, on Oracle Coherence for maximum resilience and risk-free linear scalability with a distributed, in-memory data cache
- Deployed Orion on a large (more than 200 node) cluster in Europe, the Middle East, Asia, and North America
- Loaded data into Coherence to provide dynamic aggregations for on-demand slicing and dicing
- Reduced turnaround time for delivering trade-level risk exposure and P&L to users
For More Information
- General Information: http://coherence.oracle.com
- Coherence YouTube Channel: http://www.youtube.com/user/oraclecoherence
- Coherence Training: http://education.oracle.com
- Coherence Discussion Forum: http://forums.oracle.com
- Coherence User Group on LinkedIn
- Oracle Coherence 3.5 by Aleks Seovic
- My email: craig.blitz@oracle.com
Q&A