Addressing a Performance Issue
Laurent Léturgez
Whoami
- Oracle consultant since 2001
- Former developer (C, Java, Perl, PL/SQL)
- Hadoop aficionado
- Owner@Premiseo: Data Management on Premises and in the Cloud
- Blogger since 2004: http://laurent-leturgez.com
- Twitter: @lleturgez
Agenda
- The drill down approach
  - It's [always] a question of time
  - Active Average Session and database load
  - Identify bottlenecks with various tools
  - Qualify identified bottlenecks to reduce time consumption
- Various tools for a better analysis
  - Code instrumentation
  - PL/SQL profiling
The drill down approach
Have you ever met this kind of user reaction? Have you ever answered like this?
- "My application is slow, can you help me?"
- "It's slow (or it hangs), help!"
- "It must be the database?"
Usually, we need more information
- When?
- How did the problem occur?
- Is it general or limited to a specific use case?
- Any error message?
- Are you sure it's really the database?
We need to trust our user (or interview more than one user).
We have to analyze the issue by ourselves.
Time is the key to the analysis
A session can spend time in different ways:
- It waits for work to do: Idle Wait Time
- It waits for a system call or something else to complete (a lock, an I/O, etc.): Active Wait Time (or Non-Idle Wait Time)
- It executes Oracle code on CPU: DB CPU Time
Active time in a session = Active Wait Time + DB CPU Time
Active time in the database is DB Time:
$\text{DB Time} = \sum_{\text{SID}=1}^{n} (\text{Active Wait Time} + \text{DB CPU Time})$
For a session
[Timeline figure: DB Time within Elapsed Time; segments of 2 sec, 8 sec, 3 sec and 3.5 sec; legend: User IO, DB CPU, TX contention]
- DB CPU Time = 3 sec
- Non-idle wait time = 2 + 8 + 3.5 = 13.5 sec
- DB Time = 3 + 13.5 = 16.5 sec
- Elapsed Time = 60 sec
At database level: Active Average Session
$\text{AAS} = \frac{\sum \text{DB Time}}{\text{Elapsed Time}}$
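The arithmetic of this slide can be replayed in a few lines. This is only a sketch of the definitions above, using the figures of the example; the function name `average_active_sessions` is illustrative.

```python
# Worked example from the session timeline (all durations in seconds).
db_cpu_time = 3.0
non_idle_waits = [2.0, 8.0, 3.5]   # the three wait segments of the timeline
elapsed_time = 60.0

active_wait_time = sum(non_idle_waits)
db_time = db_cpu_time + active_wait_time


def average_active_sessions(db_times, elapsed):
    """AAS = sum of DB Time over all sessions / elapsed time."""
    return sum(db_times) / elapsed


# Only one session here, so the sum has a single term.
aas = average_active_sessions([db_time], elapsed_time)

print(f"Active wait time: {active_wait_time} s")
print(f"DB Time: {db_time} s")
print(f"AAS: {aas:.3f}")
```

With several sessions, pass one DB Time per session in the list: the sum over sessions is what makes AAS a database-level indicator.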
Database Time for a system that is not overloaded (2 CPUs) [Figure legend: User IO, DB CPU, TX contention]
Database Time for a system overloaded over a short period (2 CPUs) [Figure legend: User IO, DB CPU, TX contention]
Database Time for an overloaded system (2 CPUs) [Figure legend: User IO, DB CPU, TX contention]
Active Average Session is a key indicator for database load
- AAS = 0 or close to 0: database is idle
- AAS < # CPU cores: no system bottleneck
- AAS ~ # CPU cores: database uses all system resources (if one DB per system)
- AAS > # CPU cores: database is loaded (depends on the share of CPU in AAS)
- AAS >> 2 x # CPU cores: database is overloaded
Database load = AAS / # CPU cores
- DB load = 0 or close to 0: database is idle
- DB load < 1: no system bottleneck
- DB load ~ 1: database uses all system resources (if one DB per system)
- DB load > 1: database is loaded (depends on the share of CPU in AAS)
- DB load >> 2: database is overloaded
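The thresholds above can be turned into a small helper. This is a minimal sketch: the tolerances `near_zero` and `near_one` (what counts as "close to 0" and "~ 1") are assumptions, not values from the slide, and ">> 2" is simplified to "> 2".

```python
def db_load(aas, cpu_cores):
    """Database load = AAS / number of CPU cores."""
    return aas / cpu_cores


def classify_load(load, near_zero=0.05, near_one=0.1):
    """Map a DB load value to the categories of the slide.

    near_zero and near_one are assumed tolerances for "close to 0" and "~ 1".
    """
    if load <= near_zero:
        return "idle"
    if load < 1 - near_one:
        return "no system bottleneck"
    if load <= 1 + near_one:
        return "uses all system resources"
    if load <= 2:
        return "loaded"
    return "overloaded"


for aas in (0.05, 1.5, 2.0, 3.0, 6.0):
    load = db_load(aas, cpu_cores=2)
    print(f"AAS={aas:<4} load={load:<5} -> {classify_load(load)}")
```

A "loaded" verdict still needs the CPU / wait breakdown of AAS before concluding anything, as the slide points out.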
When does AAS / database load peak?
- Identify the AAS or database load peak time
  - Not very easy with AWR or Statspack reports
  - A little bit easier with the EM Top Activity page (Performance tab)
- Trending and data visualisation is the solution
  - Explore AWR tables and views (and ASH)
  - Explore Statspack tables and views
  - Graph and plot: heatmaps, trends
Trending and Data visualisation of AAS
Granularity matters
- From ASH, you can get AAS:
  - every second from V$ACTIVE_SESSION_HISTORY (or less by modifying _ash_sampling_interval)
  - every 10 seconds from DBA_HIST_ACTIVE_SESS_HISTORY
  - or over longer periods, by writing the correct SQL statement
- From STATSPACK, you can get AAS from the time model analysis between two snapshots
Trending and Data visualisation of AAS
Granularity matters: example
AAS every hour from DBA_HIST_ACTIVE_SESS_HISTORY (thanks to Marcin Przepiórowski)

    -- 360 = number of 10-second ASH samples in one hour
    SELECT TO_CHAR(sample_time,'YYYY-MM-DD HH24') mtime,
           ROUND(SUM(DECODE(session_state,'WAITING',1,0))/360,2) aas_wait,
           ROUND(SUM(DECODE(session_state,'ON CPU',1,0))/360,2) aas_cpu,
           ROUND(COUNT(*)/360,2) aas
    FROM dba_hist_active_sess_history
    GROUP BY TO_CHAR(sample_time,'YYYY-MM-DD HH24')
    ORDER BY mtime;
Trending and Data visualisation of AAS

    MTIME           AAS_WAIT    AAS_CPU        AAS
    ------------- ---------- ---------- ----------
    2017-09-20 00          1       4.65       5.65
    2017-09-20 01       2.96       3.09       6.04
    2017-09-20 02       2.41      11.99      14.41
    2017-09-20 03       2.59          7       9.59
    2017-09-20 04       4.41       7.39       11.8
    2017-09-20 05        .38        .67       1.04
    2017-09-20 06        .05        .11        .16
    2017-09-20 07        .85       4.22       5.07
    2017-09-20 08        .12        .51        .63
    2017-09-20 09        .84       3.17       4.01
    2017-09-20 10        1.4       3.16       4.56
    2017-09-20 11       1.08       3.41       4.49
    2017-09-20 12         .8       1.99       2.79
    2017-09-20 13       1.09       3.19       4.27
    2017-09-20 14        3.4       9.88      13.27
    2017-09-20 15       1.55       4.07       5.62

[Chart: AAS_WAIT, AAS_CPU and AAS per hour]
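To see why the sample count is divided by 360, the hourly bucketing can be reproduced outside the database. This is a sketch on synthetic samples, assuming the 10-second sampling of DBA_HIST_ACTIVE_SESS_HISTORY; `hourly_aas` is an illustrative name and the sample data is made up for the demonstration.

```python
from collections import defaultdict
from datetime import datetime, timedelta

SAMPLES_PER_HOUR = 360  # one persisted ASH sample every 10 s => 360 per hour


def hourly_aas(samples):
    """samples: iterable of (sample_time, session_state) pairs.

    Returns {"YYYY-MM-DD HH": (aas_wait, aas_cpu, aas)}.
    """
    counts = defaultdict(lambda: {"WAITING": 0, "ON CPU": 0})
    for sample_time, state in samples:
        counts[sample_time.strftime("%Y-%m-%d %H")][state] += 1
    result = {}
    for hour, c in sorted(counts.items()):
        wait = c["WAITING"] / SAMPLES_PER_HOUR
        cpu = c["ON CPU"] / SAMPLES_PER_HOUR
        result[hour] = (round(wait, 2), round(cpu, 2), round(wait + cpu, 2))
    return result


# Synthetic data: one session on CPU for the whole hour,
# another one waiting during the first half hour.
base = datetime(2017, 9, 20, 2, 0, 0)
samples = [(base + timedelta(seconds=10 * i), "ON CPU") for i in range(360)]
samples += [(base + timedelta(seconds=10 * i), "WAITING") for i in range(180)]

for hour, (aas_wait, aas_cpu, aas) in hourly_aas(samples).items():
    print(hour, aas_wait, aas_cpu, aas)
```

One session that is active during the whole hour contributes 360 samples, hence an AAS of 1 for that hour.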
But is my database overloaded or not?
Add the number of CPU cores directly to your graph.
[Chart: AAS_CPU and AAS per hour, with the number of CPU cores (CORE) plotted as a reference]
But is my database overloaded or not?
Or modify your query to get the DB load and plot it directly.

    SELECT mtime, ROUND(SUM(load),2) LOAD
    FROM (SELECT TO_CHAR(sample_time,'YYYY-MM-DD HH24') mtime,
                 DECODE(session_state,'WAITING',COUNT(*),0)/360 c1,
                 DECODE(session_state,'ON CPU',COUNT(*),0)/360 c2,
                 COUNT(*)/360 cnt,
                 COUNT(*)/360/cpu.core_nb load
          FROM dba_hist_active_sess_history,
               (SELECT value AS core_nb FROM v$osstat WHERE stat_name='NUM_CPU_CORES') cpu
          GROUP BY TO_CHAR(sample_time,'YYYY-MM-DD HH24'), session_state, cpu.core_nb)
    GROUP BY mtime
    ORDER BY mtime;

    MTIME               LOAD
    ------------- ----------
    2017-09-20 00         .4
    2017-09-20 01        .43
    2017-09-20 02       1.03
    2017-09-20 03        .69
    2017-09-20 04        .84
    2017-09-20 05        .07
    2017-09-20 06        .01
    2017-09-20 07        .36
    2017-09-20 08        .05
    2017-09-20 09        .29
    2017-09-20 10        .33
    2017-09-20 11        .32
    2017-09-20 12         .2
    2017-09-20 13        .31
    2017-09-20 14        .95
    2017-09-20 15         .4

[Chart: LOAD per hour]
Heatmap to identify bottlenecks
- Based on the previous queries and the Oracle PIVOT function
- See: https://laurent-leturgez.com/2016/12/15/database-load-heatmap-with-awr-and-python/
- Dataviz can be done with various tools:
  - Tableau software
  - Microsoft Excel with conditional formatting
  - Python with the plotly library
Example (With Microsoft Excel)
Examples (with Python and Plotly)
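Whatever the plotting tool, a heatmap needs the hourly load pivoted into a day x hour matrix. The sketch below does that pivot in plain Python and renders a crude text heatmap; it is not the code of the blog post, the function names are illustrative, and treating missing hours as a load of 0.0 is an assumption. The sample values are a few of the hourly loads computed earlier in the deck.

```python
def pivot_load(hourly_load):
    """hourly_load: {"YYYY-MM-DD HH": load}. Returns (days, matrix).

    matrix has one row per day and 24 columns (hours 00..23);
    hours without data are assumed to have a load of 0.0.
    """
    days = sorted({key[:10] for key in hourly_load})
    matrix = [
        [hourly_load.get(f"{day} {hour:02d}", 0.0) for hour in range(24)]
        for day in days
    ]
    return days, matrix


def render(days, matrix, shades=" .:*#"):
    """Text heatmap: one character per hour, darker means higher load."""
    lines = []
    for day, row in zip(days, matrix):
        cells = "".join(shades[min(int(v * 2), len(shades) - 1)] for v in row)
        lines.append(f"{day} |{cells}|")
    return "\n".join(lines)


hourly_load = {
    "2017-09-20 00": 0.4,
    "2017-09-20 01": 0.43,
    "2017-09-20 02": 1.03,
    "2017-09-20 14": 0.95,
}
days, matrix = pivot_load(hourly_load)
print(render(days, matrix))
```

The same `matrix` can be handed to a plotting library or pasted into a spreadsheet with conditional formatting.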
So what is this drilldown approach?
- First, identify whether the database is overloaded (user interviews, AAS wide analysis)
- Then, identify when the database is overloaded (heatmap, AAS trending)
- Then, identify how the DB time is distributed (AAS trending)
  - More CPU time than active wait time?
  - More active wait time than CPU time?
- Run the AWR or Statspack report to get more details
- Reduce CPU time, active wait time, or both (with the help of your brain!!)
  - If there is more CPU time, analyze, for example, the SQL statements that burn the buffer cache
  - If there is more active wait time, identify which wait event(s) dominate and resolve the issue(s)
OK, but what if I haven't bought the Diagnostics Pack, or if I run a Standard Edition?
- A heatmap is not possible because it's based on ASH
- You can graph AAS, AAS_WAIT and AAS_CPU over a long period
- Then reduce the time scale and redo the same AAS trending
How?
- See: https://laurent-leturgez.com/2015/11/06/active-average-session-trending-in-statspack/
- Time model analysis with a specific function
  - Get DB Time and DB CPU, and calculate Active Wait Time for every period between two snapshots
  - Calculate AAS = DB Time / Elapsed for every period between two snapshots
  - And plot it!
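The per-period calculation described above boils down to deltas between consecutive snapshots. The sketch below assumes you have already extracted, for each snapshot, its time and the cumulative "DB time" and "DB CPU" time model values in microseconds; the function name and the sample figures are made up for the illustration, and this is not the function from the blog post.

```python
from datetime import datetime


def aas_between_snapshots(snapshots):
    """snapshots: list of (snap_time, db_time_usec, db_cpu_usec), ordered by time.

    The two counters are cumulative, so each period is a delta
    between two consecutive snapshots.
    """
    periods = []
    for (t0, dbt0, cpu0), (t1, dbt1, cpu1) in zip(snapshots, snapshots[1:]):
        elapsed = (t1 - t0).total_seconds()
        db_time = (dbt1 - dbt0) / 1e6        # microseconds -> seconds
        db_cpu = (cpu1 - cpu0) / 1e6
        active_wait = db_time - db_cpu       # Active Wait Time
        periods.append({
            "start": t0,
            "aas": db_time / elapsed,
            "aas_cpu": db_cpu / elapsed,
            "aas_wait": active_wait / elapsed,
        })
    return periods


# Made-up snapshots, one hour apart.
snaps = [
    (datetime(2017, 9, 20, 10, 0), 0, 0),
    (datetime(2017, 9, 20, 11, 0), 7_200_000_000, 5_400_000_000),
]
for p in aas_between_snapshots(snaps):
    print(p["start"], p["aas"], p["aas_cpu"], p["aas_wait"])
```

An instance restart between two snapshots resets the counters and produces negative deltas; such periods should be discarded before plotting.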
Various tools for a better analysis
Code Instrumentation
Use of DBMS_APPLICATION_INFO
- Adds information in V$SESSION, V$SESSION_LONGOPS, V$SQL_MONITOR, V$SQL (and some others):
  - MODULE
  - ACTION
  - CLIENT_INFO (only in V$SESSION and V$SQL_MONITOR)
- Then dispatched to:
  - ASH (V$ACTIVE_SESSION_HISTORY, DBA_HIST_ACTIVE_SESS_HISTORY)
  - AWR (DBA_HIST_SQLSTAT)
  - Statspack (only MODULE, for STATS$V_$SQLXS, STATS$SQL_SUMMARY and STATS$TEMP_SQLSTATS)
- Note: CLIENT_INFO is not dispatched
Code Instrumentation
[Figures: without code instrumentation / with code instrumentation]
Code Instrumentation: Java sample code

    import java.sql.CallableStatement;
    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.SQLException;

    // The class name is illustrative; the slide only shows the main method.
    public class PaycheckEdition {
        public static void main(String[] args) throws Exception {
            DriverManager.registerDriver(new oracle.jdbc.OracleDriver());
            // Warning: this is a simple example program. In a long-running application,
            // error handlers MUST clean up connections, statements and result sets.
            String module, prev_module;
            String action, prev_action;
            Connection c = DriverManager.getConnection("jdbc:oracle:thin:@192.168.99.8:1521:orcl", "system", "oracle");
            CallableStatement call = c.prepareCall("begin dbms_application_info.set_module(module_name => ?, action_name => ?); end;");

            // Back up the previous module and action (into prev_module / prev_action)
            // with the dbms_application_info.read_module procedure.

            module = "paycheck"; action = "header";
            try {
                call.setString(1, module);
                call.setString(2, action);
                call.execute();
            } catch (SQLException e) { e.printStackTrace(); }
            // PAYCHECK HEADER EDITION HERE

            module = "paycheck"; action = "main";
            try {
                call.setString(1, module);
                call.setString(2, action);
                call.execute();
            } catch (SQLException e) { e.printStackTrace(); }
            finally { call.close(); }
            // PAYCHECK MAIN PART EDITION HERE

            c.close();
        }
    }
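The same set / restore pattern can be wrapped so that the previous module and action always come back, even when the instrumented code raises. The sketch below is an illustration, not part of the original talk: it assumes a DB-API style cursor exposing `callproc`, it expects the caller to supply the previous values (a real application would first read them with DBMS_APPLICATION_INFO.READ_MODULE), and `RecordingCursor` is a hypothetical stand-in used only so that the example runs without a database.

```python
from contextlib import contextmanager


@contextmanager
def instrumented(cursor, module, action, previous=(None, None)):
    """Tag the session with module/action, then restore the previous values."""
    cursor.callproc("dbms_application_info.set_module", [module, action])
    try:
        yield
    finally:
        # Restore whatever was set before, even if the block raised.
        cursor.callproc("dbms_application_info.set_module", list(previous))


class RecordingCursor:
    """Hypothetical stand-in for a real cursor: it only records the calls."""

    def __init__(self):
        self.calls = []

    def callproc(self, name, parameters):
        self.calls.append((name, tuple(parameters)))


cursor = RecordingCursor()
with instrumented(cursor, "paycheck", "header"):
    pass  # paycheck header edition here
with instrumented(cursor, "paycheck", "main"):
    pass  # paycheck main part edition here

for name, params in cursor.calls:
    print(name, params)
```

Keeping the tagging in one place makes it harder to forget the restore step, which otherwise leaves later work attributed to the wrong module in ASH and AWR.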
Code Instrumentation: the drilldown approach
- Identify the top module activity
- For the top modules, identify the top action
- When possible, identify the client with CLIENT_INFO
- Complete the performance analysis of the specific module / action
  - AWR / ASH
  - Top SQL identification in this module / action
  - Code inspection
  - Code profiling
PL/SQL Profiling
- Profiles the runtime behaviour of PL/SQL code
- Allows bottleneck identification
- Introduced in Oracle 8i
- Oracle 11gR1 introduced the hierarchical PL/SQL profiler
PL/SQL Profiling: how does it work?
[Diagram: DBMS_PROFILER.start_profiler at the start of the runtime, DBMS_PROFILER.stop_profiler at the end]
Real Life example #1
- The night batch takes too long (and ends after 4 AM)
- Oracle 10g: classic profiler
- The culprit is the billing (sub-)batch (PL/SQL)
- Next step: profiling
Real Life example #1: results

    UNIT_NAME       LINE# TOTAL_OCCUR  TOTAL_SEC    MIN_SEC    MAX_SEC
    --------------- ----- ----------- ---------- ---------- ----------
    .../...
    XNPCK_XXTRACE      56   403076907    1001,42                  0,02
    XNPCK_XXTRACE      57   403076907     292,88                  0,02
    XNPCK_XXTRACE      58   403076907     280,86                  0,01
    XNPCK_XXTRACE      61   403076907     278,86                  0,04
    XNPCK_XXTRACE      67   403076907          0          0          0
    XNPCK_XXTRACE      69   403076907     275,28                  0,02
    XNPCK_XXTRACE      70           0          0          0          0
    XNPCK_XXTRACE      71           0          0          0          0
    XNPCK_XXTRACE      72           0          0          0          0
    XNPCK_XXTRACE      73           0          0          0          0
    XNPCK_XXTRACE      75   403076907     997,55                  0,02
    XNPCK_XXTRACE      76   403076907          0          0          0
    XNPCK_XXTRACE      79   403076907     658,23                  0,03
    .../...
    XNPCK_XXTRACE      88   403076907          0          0          0
    XNPCK_XXTRACE      90   403076907     540,78                  0,02
    XNPCK_XXTRACE      91           0          0          0          0
    .../...
    XNPCK_XXTRACE2     38   403076907     398,42                  0,01
    XNPCK_XXTRACE2     45   403076907      279,6                  0,01
    XNPCK_XXTRACE2     48   403076907     273,83                  0,01
    XNPCK_XXTRACE2     49           1          0          0          0
    XNPCK_XXTRACE2     50           1          0          0          0
    XNPCK_XXTRACE2     51           1          0          0          0
    XNPCK_XXTRACE2     53   403076906      270,2                  0,02
    XNPCK_XXTRACE2     54   403076906     298,52                  0,02
    XNPCK_XXTRACE2     56   403076907     315,83                  0,02
    XNPCK_XXTRACE2     59           0          0          0          0
    XNPCK_XXTRACE2     60   403076907     319,54                  0,03

- Line 56 of package body XNPCK_XXTRACE has been executed 403076907 times.
- Each execution took 0 to 0,02 sec.
- The global time for this step is 1001 seconds.
- What are these package names??? xxTRACExx

After having a look at the code near these line numbers, we found something like this:

    file1 := utl_file.fopen('utl_dir', 'debug.txt', 'w');
    utl_file.put_line(file1, 'Some information');
    utl_file.fclose(file1);

In a loop!!
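A quick way to read such a profile is to rank the lines by total time and compare that with the average cost of one execution. The sketch below uses a handful of rows copied from the results above (decimal commas converted to points); the function names are illustrative.

```python
# (unit_name, line, total_occur, total_sec), taken from the profiler output
rows = [
    ("XNPCK_XXTRACE", 56, 403076907, 1001.42),
    ("XNPCK_XXTRACE", 75, 403076907, 997.55),
    ("XNPCK_XXTRACE", 79, 403076907, 658.23),
    ("XNPCK_XXTRACE", 90, 403076907, 540.78),
    ("XNPCK_XXTRACE2", 38, 403076907, 398.42),
]


def top_lines(rows, n=3):
    """The n lines with the highest total time."""
    return sorted(rows, key=lambda r: r[3], reverse=True)[:n]


def avg_usec(row):
    """Average time of a single execution, in microseconds."""
    _, _, total_occur, total_sec = row
    return total_sec / total_occur * 1e6


for unit, line, occur, total in top_lines(rows):
    row = (unit, line, occur, total)
    print(f"{unit} line {line}: {total} s in {occur} executions "
          f"(~{avg_usec(row):.2f} µs each)")
```

Each execution costs only a few microseconds on average; the problem is the 403 million executions, which is why the fix is to stop calling the trace code in the loop rather than to tune the line itself.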
PL/SQL Hierarchical Profiler: how does it work?
[Diagram: DBMS_HPROF.start_profiling(location => dir, filename => 'profiler.txt') at the start of the runtime, DBMS_HPROF.stop_profiling at the end; the output file is profiler.txt]
Real Life example #2
- The customer complains about a very slow application
- The culprit is the database (11gR2)
- DB load is constantly above 4
- The customer bought new Oracle CPUs: application capacity increased
- After a while, the application became slow again: its capacity could not grow as fast as its popularity
- After code instrumentation, a problematic code path was identified
- Next step: (hierarchical) profiling
Real Life example #2: hierarchical profiler results

    LEVEL NAME                                                          LINE# TYPE  SUBTREE ELAPSED (usec) FUNCTION ELAPSED (usec)      CALLS
    ----- ------------------------------------------------------------- ----- ----- ---------------------- ----------------------- ----------
        1 MGUSER.PR_CALC_BULL_FIN_STC_AED                                   1 PLSQL               25203972                     309          1
    .../...
        2 MGUSER.PR_TRAITEGEN_MULT_DEPART_CDD                               1 PLSQL               25198639                    1123          1
    .../...
        3 MGUSER.PR_TRAITEGEN_LISTE_DEPART_CDD                              1 PLSQL               24703905                  168802          1
    .../...
        4 MGUSER.PR_TRAITEGEN_LISTE_DEPART_CDD.static_sql_exec_line509    509 SQL                   209948                  209948       3146
        4 MGUSER.PR_TRAITEGEN_LISTE_DEPART_CDD.static_sql_exec_line578    578 SQL                 11620988                11620988        239
        4 MGUSER.PR_TRAITEGEN_LISTE_DEPART_CDD.static_sql_exec_line613    613 SQL                  1133735                 1133735        239
        4 MGUSER.PR_TRAITEGEN_LISTE_DEPART_CDD.static_sql_exec_line668    668 SQL                   504919                  504919        239
        4 MGUSER.PR_TRAITEGEN_LISTE_DEPART_CDD.static_sql_exec_line802    802 SQL                  1272456                 1272456        239
        4 MGUSER.PR_TRAITEGEN_LISTE_DEPART_CDD.static_sql_exec_line843    843 SQL                   166459                  166459       3146
    .../...

- Level 3 (PR_TRAITEGEN_LISTE_DEPART_CDD): this call took 168802 µsec, but its subtree (sub-calls included) took 24703905 µsec, so we need to analyze the next level (4).
- Level 4, line 578: this call took 11620988 µsec. There is no further sub-execution because subtree time = function time (and because it's a SQL operation).
  - A single execution took 11620988 / 239 = 48623 µsec.
  - Compared to the other SQL statements at this level, it's the main time consumer.

    NAME                           TEXT                                           LINE
    ------------------------------ ---------------------------------------------- ----
    PR_TRAITEGEN_LISTE_DEPART_CDD  begin                                           576
    PR_TRAITEGEN_LISTE_DEPART_CDD  PR_PERF('MILI', '','','AED');                   577
    PR_TRAITEGEN_LISTE_DEPART_CDD  select /*+ index(pgd PK_PARA_GENE_DNAC) */      578
    PR_TRAITEGEN_LISTE_DEPART_CDD  id_para,                                        579
    PR_TRAITEGEN_LISTE_DEPART_CDD  'PRES' as status_aed,                           580
    PR_TRAITEGEN_LISTE_DEPART_CDD  date_gene,                                      581

Oh wait! A hint. We analyzed the SQL plan and tuned it by simply removing the hint. PROBLEM FIXED!!
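The reasoning on this slide (drill into the subtree, stop at a leaf, rank the leaves) can be written down as a short calculation. The sketch below uses the figures of the profile above; the variable names are illustrative and only the level-4 rows shown on the slide are included, so the elided rows are not accounted for.

```python
# {line: (subtree_usec, function_usec, calls)} for the level-4 SQL statements
level4 = {
    509: (209948, 209948, 3146),
    578: (11620988, 11620988, 239),
    613: (1133735, 1133735, 239),
    668: (504919, 504919, 239),
    802: (1272456, 1272456, 239),
    843: (166459, 166459, 3146),
}
# Level-3 procedure PR_TRAITEGEN_LISTE_DEPART_CDD
parent_subtree_usec, parent_function_usec = 24703905, 168802

# Time spent below the level-3 procedure: this is what has to be explained.
descendants_usec = parent_subtree_usec - parent_function_usec

# A leaf has subtree time == function time: nothing left to drill into.
leaves = {line: v for line, v in level4.items() if v[0] == v[1]}
top_line = max(leaves, key=lambda line: leaves[line][1])
function_usec, calls = leaves[top_line][1], leaves[top_line][2]
per_call_usec = function_usec // calls
share = function_usec / descendants_usec

print(f"Top consumer: line {top_line}")
print(f"Per execution: {per_call_usec} µsec over {calls} calls")
print(f"Share of the time spent below level 3: {share:.0%}")
```

Ranking by total function time, not by time per call, is what points at line 578: it is called far less often than lines 509 and 843 but costs much more overall.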
PL/SQL Hierarchical Profiler: Data Visualisation
Tools exist to visualise PL/SQL hierarchical profiles:
- SQL Developer
- Martin Büchi's tools
  - A set of packages (Java and PL/SQL) to display a PL/SQL profile in a web browser
  - Google Developer Tools: cpuprofile
  - Brendan Gregg's FlameGraph
PL/SQL Hierarchical Profiler: Data Visualisation with Flame Graph

    SQL> exec ora_hprof#.flatten('work_dir','profiling_4e0f4c0a96016c63e0537a1ea8c0113f_2202','profile_flat.txt');
    $ flamegraph.pl /var/tmp/profile_flat.txt > /var/tmp/profile_flat.svg

[Diagram label: profiler.txt]
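flamegraph.pl reads "folded" stacks: one line per call path, with frames separated by semicolons and followed by a value. The sketch below only illustrates that input format; it is not what ora_hprof#.flatten does internally. The call chain is an assumption derived from the LEVEL column of the earlier profile, and the values are the function (self) times in µsec.

```python
def to_folded(stacks):
    """stacks: iterable of (call_path, self_time_usec).

    Returns folded lines "frame1;frame2;frame3 value", skipping empty frames.
    """
    lines = []
    for path, self_usec in stacks:
        if self_usec > 0:
            lines.append(f"{';'.join(path)} {self_usec}")
    return lines


# Assumed call chain, with function (self) times from the profile.
root = ("PR_CALC_BULL_FIN_STC_AED",)
mult = root + ("PR_TRAITEGEN_MULT_DEPART_CDD",)
liste = mult + ("PR_TRAITEGEN_LISTE_DEPART_CDD",)
stacks = [
    (root, 309),
    (mult, 1123),
    (liste, 168802),
    (liste + ("static_sql_exec_line578",), 11620988),
]

for line in to_folded(stacks):
    print(line)
```

Self times are used rather than subtree times because the flame graph stacks children on top of their parent; feeding it subtree times would count the same time at every level.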
Conclusion
Addressing a performance issue:
- The key is time analysis
- Proceed from the general to the detail
- After identifying bottlenecks, use the right tool for the right job
  - Code instrumentation
  - PL/SQL profiling (and the hierarchical profiler)
  - Better when using graphical tools
Questions?