SATURN 2017
Is Your Project in Trouble on System Performance?
Charles Chow
May 1-4, 2017
[Copyright 2017 Charles Chow]
Agenda
- Why do so many projects have system performance issues?
- How to salvage a troubled project with performance issues?
- Lessons learned on the architectural impact on performance
- Q&A
Why do so many projects have system performance issues?
1. Lack of non-functional requirements on performance at an early stage of the architectural design
2. Lack of a user-centric approach in detailed design and implementation
3. Performance testing overlooked until the start of the System Integration Test
Most common causes of system performance issues
- Lack of network / server capacity
- Unoptimized and/or undersized databases
- Unmanaged growth of data
- Poor resource utilization
- Poor code quality
- Peak user load or traffic spikes
How to salvage a troubled project with performance issues?
- Identify the root causes by monitoring key metrics at each layer of the application stack (a metric-capture sketch follows)
- Conduct performance testing to analyze the performance of the overall application
- Collaborate with business stakeholders on performance issue resolution

Metrics* to monitor across the application / service stack:
- Application Layer: online transaction metrics (response time, throughput, transaction pass/fail rate, error rate, distribution); batch transaction metrics (records processed, execution time)
- Server Layer: HTTP server, app server, database server, other servers
- Platform Layer: cloud services, OS, storage, etc.

* Available metrics may vary by the vendor's willingness to allow monitoring.
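Where a vendor does not expose these numbers, even a thin in-process wrapper can capture the application-layer metrics listed above. A minimal sketch, assuming a Python service; the names run_transaction and report are illustrative, not from any specific monitoring product:

```python
import time
from statistics import mean

samples = []  # (transaction name, elapsed seconds, pass/fail)

def run_transaction(name, fn, *args, **kwargs):
    """Invoke one online transaction, recording response time and pass/fail."""
    start = time.perf_counter()
    try:
        result = fn(*args, **kwargs)
        samples.append((name, time.perf_counter() - start, True))
        return result
    except Exception:
        samples.append((name, time.perf_counter() - start, False))
        raise

def report(window_seconds):
    """Aggregate the key online-transaction metrics from the slide."""
    times = [t for _, t, _ in samples]
    failures = sum(1 for _, _, ok in samples if not ok)
    print(f"throughput:   {len(samples) / window_seconds:.1f} tx/s")
    print(f"avg response: {mean(times):.3f} s")
    print(f"error rate:   {failures / len(samples):.1%}")
```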
Background
- This is a web portal project for a customized solution on Sales Opportunities
- The architecture was built on top of an existing mobile application
- The client had already invested over $2 million in the solution
- A Quality Attribute Workshop (QAW) was not conducted; performance as a critical non-functional requirement was not defined in the early phase

[Architecture diagram spanning on-premise, 3rd-party provider, hosted-application, and client-application tiers: User Notes, SFDC, application front end, proxy server, SAML SSO login provider (OAM), Portal, EDW with nightly reload (archive old tables and create new copies), Postgres DB, I/O, API web services, iOS app]
Problem Statement
- The API currently queries the PostgreSQL database to return results to the Portal front end
- These calls are especially expensive due to data volume and data propagation
- Timeouts occur when tested with 50 or more concurrent users during UAT
- The client wanted to keep the current architecture and tech stack, and the go-live date was 4 weeks away
- Options for database tuning, code optimization, and hardware capacity increases had been exhausted

Proposed Solution
- Introduce Solr into the architecture by moving data aggregation and sorting to Solr as an indexing service
- Target indexing at the data sets that take the longest to return due to complex calculations
- Have the API services query the Solr index for results instead of the PostgreSQL database directly
- Use a pre-generated index to serve results from the Solr instance hosted on AWS (a query sketch follows this slide)
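A minimal sketch of the proposed query path, assuming a Solr core named "opportunities" with fields owner_id, status, and computed_score (all hypothetical names); the /select handler and its q, fq, sort, start, and rows parameters are standard Solr:

```python
import requests

# Hypothetical host and core name for illustration
SOLR_URL = "http://solr.example.internal:8983/solr/opportunities/select"

def search_opportunities(owner_id, page=0, page_size=25):
    """Return one page of open opportunities, sorted by a score that was
    previously computed through an expensive SQL aggregation."""
    params = {
        "q": f"owner_id:{owner_id}",
        "fq": "status:open",            # filter query hits the index, not the DB
        "sort": "computed_score desc",  # pre-calculated at index time
        "start": page * page_size,
        "rows": page_size,
        "wt": "json",
    }
    resp = requests.get(SOLR_URL, params=params, timeout=5)
    resp.raise_for_status()
    return resp.json()["response"]["docs"]
```

Because the aggregation and sorting happen at index time (the index can be regenerated alongside the nightly table reload), each request becomes a cheap index lookup instead of a heavy SQL aggregation.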
Updated Architecture

[Updated architecture diagram: the same on-premise, 3rd-party, hosted, and client components as before (User Notes, SFDC, application front end, proxy server, SAML SSO login provider (OAM), Portal, EDW nightly reload, Postgres DB, I/O, API web services, iOS app), with the AWS-hosted Solr indexing service from the proposed solution added between the API web services and the data layer]
Performance Testing: defined and simulated a day-in-the-life user flow through the different functionalities of the Portal, by user role, with a sustained load of 100 users.

Parameter | Configuration | Details
Duration | 50 minutes | Ramped up from 0 to 100 virtual users (threads) in 10 minutes, sustained the load for 30 minutes, and ramped down from 100 to 0 in 10 minutes
User think time | 4-5 seconds / 10-12 seconds | Simulated wait time between steps in scenarios, per user behavior
Data volume | Open opportunities: 17 million to 33 million | Simulated increase in opportunities based on estimated growth

Scenarios and weightages (a load-generator sketch follows this slide):
User Group | Scenarios | Weightage per user group | Weightage per scenario in user group
User Group 1 | 3 | 45% | Scenario 1: 45%, Scenario 2: 20%, Scenario 3: 35%
User Group 2 | 3 | 35% | Scenario 1: 40%, Scenario 2: 40%, Scenario 3: 20%
User Group 3 | 2 | 10% | Scenario 1: 75%, Scenario 2: 25%
User Group 4 | 2 | 10% | Scenario 1: 60%, Scenario 2: 40%

Major Performance Activities:
Action | Accomplishment
Implemented Solr indexing | Resolved the major performance issue
Query and API tuning | Resolved the performance issue in a specific functional area
Load testing | Caught a Redis cache issue and mitigated a future production issue
Database environment tuning | Identified a database connection pooling issue and resolved a scalability issue
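A minimal load-generator sketch of the configuration above, assuming Python threads as virtual users; run_scenario is a placeholder for driving the Portal, and the single 4-12 second think-time range simplifies the two ranges in the table:

```python
import random
import threading
import time

RAMP_UP, SUSTAIN = 600, 1800  # 10-minute ramp-up, 30-minute sustain (seconds)
USERS = 100

# User-group weightages and per-scenario weightages from the tables above
GROUPS = {
    "group1": (45, [("s1", 45), ("s2", 20), ("s3", 35)]),
    "group2": (35, [("s1", 40), ("s2", 40), ("s3", 20)]),
    "group3": (10, [("s1", 75), ("s2", 25)]),
    "group4": (10, [("s1", 60), ("s2", 40)]),
}

def virtual_user(stop_at):
    # Assign this user to a group by group weightage, then loop weighted scenarios
    group, (_, scenarios) = random.choices(
        list(GROUPS.items()), weights=[w for w, _ in GROUPS.values()])[0]
    while time.time() < stop_at:
        name = random.choices([s for s, _ in scenarios],
                              weights=[w for _, w in scenarios])[0]
        # run_scenario(group, name)  # placeholder: drive the Portal here
        time.sleep(random.uniform(4, 12))  # think time between steps

def run():
    stop_at = time.time() + RAMP_UP + SUSTAIN
    for _ in range(USERS):
        threading.Thread(target=virtual_user, args=(stop_at,), daemon=True).start()
        time.sleep(RAMP_UP / USERS)  # steady ramp: one new user every 6 seconds
    time.sleep(max(0.0, stop_at - time.time()))  # hold until the sustain window ends

if __name__ == "__main__":
    run()
```

A real run would use a dedicated tool (JMeter, Gatling, and the like); the point here is only how ramp-up, think time, and the weightages map onto the test design.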
Performance results: the average response time for user actions in each scenario was recorded to be within the SLA (a sketch of the check follows).
- User actions on all page loads except the performance dashboard: average response time was below 3 seconds
- User actions on the performance dashboard: average response time was below 3 seconds, with spikes averaging 4.2 seconds

[Response-time chart for user actions: Home Page Load, Opportunity Search Results, My Scorecard Page Load, My Account Dashboard, Drill into top/bottom value]
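The SLA check itself is simple arithmetic over the recorded samples. A sketch with made-up placeholder timings, not the project's actual measurements:

```python
from statistics import mean

SLA = 3.0  # seconds: the response-time target from the slide

def check_sla(action, times):
    """Report average response time against the SLA, plus any spikes."""
    spikes = [t for t in times if t > SLA]
    avg = mean(times)
    print(f"{action}: avg {avg:.1f}s "
          f"({'within' if avg < SLA else 'over'} SLA), "
          f"{len(spikes)} spikes"
          + (f" averaging {mean(spikes):.1f}s" if spikes else ""))

# Placeholder samples for illustration only
check_sla("Home Page Load", [1.8, 2.1, 2.4, 1.9])
check_sla("Performance Dashboard", [2.6, 2.9, 4.2, 2.7, 4.1])
```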
SATURN 2017
Is Your Project in Trouble on System Performance?
Questions? Thank You