Performance and Code Tuning. CSCE 315 Programming Studio, Fall 2017 Tanzir Ahmed


1 Performance and Code Tuning CSCE 315 Programming Studio, Fall 2017 Tanzir Ahmed

2 Is Performance Important?
- Performance tends to improve with time
  - HW improvements
- Other things can be more important
  - Accuracy
  - Robustness
  - Code readability
- Worrying about it can cause problems
  - "More computing sins are committed in the name of efficiency (without necessarily achieving it) than for any other single reason, including blind stupidity." (William A. Wulf)

3 So Why Worry About It?
- Sometimes code performance is critical
  - Very large-scale problems
  - Inefficiencies that make things seem infeasible
- The gains from hardware improvement may be coming to an end
  - Or at least need more tricks to take advantage of
  - It is not straightforward to take full advantage of 48 cores or 1024 CPUs/GPUs
  - Problems include inter-core or inter-socket communication overhead and cache efficiency
- Performance engineering will likely be important in making long-term improvements

4 So, How Can We Improve Performance?
- First, there are ways to improve performance that don't involve code tuning
- Before doing code tuning, you need to know what to tune
  - Again, guesswork is not helpful
- Finally, you can tune your code
  - Carefully, measuring along the way

5 Performance Increases without Code Tuning
- Lower your Standards/Requirements (!!)
  - Performance tuning is expensive
  - May require hardware and software optimization
  - Performance is stated as a requirement far more often than it actually is a requirement

6 Performance Increases without Code Tuning
- Lower your Standards/Requirements
- High Level Design
  - The overall program structure can play a huge role

7 Performance Increases without Code Tuning
- Lower your Standards/Requirements
- High Level Design
- Class/Routine Design
  - Algorithms used have real differences
  - Can have the largest effect, especially asymptotically
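
To illustrate how the choice of algorithm can matter asymptotically, here is a minimal sketch. It assumes a hypothetical task that is not part of the slides: finding the values two lists have in common. The function names are illustrative only, and the linear version assumes the elements are hashable.

```python
def common_items_quadratic(a, b):
    """O(len(a) * len(b)): every 'in' test scans list b from the start."""
    return [x for x in a if x in b]


def common_items_linear(a, b):
    """O(len(a) + len(b)) on average: set membership is a hash lookup."""
    b_set = set(b)
    return [x for x in a if x in b_set]


if __name__ == "__main__":
    a = list(range(0, 2000, 2))
    b = list(range(0, 2000, 3))
    # Both versions return the same answer; only the running time differs.
    assert common_items_quadratic(a, b) == common_items_linear(a, b)
    print(len(common_items_linear(a, b)), "common values")
```

Both functions return the same result. The difference only becomes visible as the inputs grow, which is why it should be measured rather than assumed.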

8 Performance Increases without Code Tuning
- Lower your Standards/Requirements
- High Level Design
- Class/Routine Design
- Interactions with Operating System
  - Hidden OS calls within libraries: their performance affects the overall code
  - Just removing repeated printing to standard output may increase performance significantly for some programs
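
The point about repeated printing can be sketched as follows. The two helper functions are hypothetical, not from the slides. They produce identical output, one with a write call per value and one with a single write. Whether the batched form is actually faster depends on the stream's buffering and on the system, so treat this as something to measure.

```python
import io
import sys


def write_per_item(values, out):
    """One write call per value: many small writes to the stream."""
    for v in values:
        out.write(f"{v}\n")


def write_batched(values, out):
    """Build the text in memory, then hand it to the stream once."""
    out.write("".join(f"{v}\n" for v in values))


if __name__ == "__main__":
    # Check that both strategies emit exactly the same text.
    first, second = io.StringIO(), io.StringIO()
    write_per_item(range(5), first)
    write_batched(range(5), second)
    assert first.getvalue() == second.getvalue()
    write_batched(range(5), sys.stdout)
```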

9 Performance Increases without Code Tuning
- Lower your Standards/Requirements
- High Level Design
- Class/Routine Design
- Interactions with Operating System
- Compiler Optimizations
  - Automatic optimization
  - Getting better and better, but not perfect
  - Different compilers work differently

10 Performance Increases without Code Tuning
- Lower your Standards/Requirements
- High Level Design
- Class/Routine Design
- Interactions with Operating System
- Compiler Optimizations
- Upgrade Hardware
  - Straightforward, if possible

11 Code Profiling
- Pareto Rule
  - More than 80% of the time is spent in less than 20% of the code
  - In reality the proportion is often even more skewed (e.g., 5% of the code accounting for 90% of the time; just an example)
- Determine where code is spending time
  - No sense in optimizing where no time is spent
- Provide a measurement basis
  - Determine whether an "improvement" really improved anything
  - Need to take precise measurements
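
One way to get a repeatable measurement basis is sketched below, using Python's standard `timeit` module. The helper `measure` and the sample workload `sum_of_squares` are hypothetical names introduced for illustration. Taking the minimum over several repeats is a common convention for reducing noise, not something the slides prescribe.

```python
import timeit


def measure(func, *args, repeat=5, number=10):
    """Return the best per-call time in seconds over several repeats."""
    timer = timeit.Timer(lambda: func(*args))
    return min(timer.repeat(repeat=repeat, number=number)) / number


def sum_of_squares(n):
    """Sample workload: sum of i*i for i in 0..n-1."""
    return sum(i * i for i in range(n))


if __name__ == "__main__":
    print(f"{measure(sum_of_squares, 10_000):.6f} s per call")
```

Running the same harness before and after a change shows whether the change helped at all.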

12 Profiling Techniques
- Profiler: compile with profiling options, and run through the profiler
  - Gets a list of functions/routines, and the amount of time spent in each
- Use the system timer (e.g., $ time ./a.out)
  - Less ideal
  - Might need a test harness for functions
- Graph results for understanding
  - Multiple profile results: see how the profile changes for different input types
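
The slide describes profiling a compiled program. As a rough analogue, the sketch below uses Python's built-in `cProfile` and `pstats` modules to get a per-function time listing. The workload functions (`slow_part`, `fast_part`, `workload`) and the helper `profile_report` are hypothetical and exist only to give the profiler something to report.

```python
import cProfile
import io
import pstats


def slow_part(n):
    """Deliberately heavy nested loop."""
    total = 0
    for i in range(n):
        for j in range(50):
            total += i * j
    return total


def fast_part(n):
    return sum(range(n))


def workload():
    return slow_part(2_000) + fast_part(2_000)


def profile_report(func, top=5):
    """Run func under cProfile and return the top entries as text."""
    profiler = cProfile.Profile()
    profiler.enable()
    func()
    profiler.disable()
    buf = io.StringIO()
    stats = pstats.Stats(profiler, stream=buf)
    stats.sort_stats("cumulative").print_stats(top)
    return buf.getvalue()


if __name__ == "__main__":
    print(profile_report(workload))
```

The report lists each function with its call count and time. Here `slow_part` should account for most of the time, so that is where tuning effort would go.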

13 What Is Tuning?
- Making small-scale adjustments to correct code in order to improve performance
  - After code is written and working
  - Affects only a small scale: a few lines, or at most one routine
  - Examples: adjusting details of loops, expressions
- Code tuning can sometimes improve code efficiency tremendously
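
As one example of "adjusting details of loops", the sketch below moves a loop-invariant computation out of the loop. The normalization task and both function names are assumptions made for illustration. The change touches only a few lines of already-correct code, which is the scale of change the slide describes.

```python
import math


def normalize_untuned(values):
    """Recomputes the same norm on every loop iteration."""
    result = []
    for v in values:
        norm = math.sqrt(sum(x * x for x in values))  # loop-invariant work
        result.append(v / norm)
    return result


def normalize_tuned(values):
    """Computes the norm once, outside the loop."""
    norm = math.sqrt(sum(x * x for x in values))
    return [v / norm for v in values]


if __name__ == "__main__":
    data = [3.0, 4.0]
    # Tuning must not change the result, only the cost.
    assert normalize_untuned(data) == normalize_tuned(data)
    print(normalize_tuned(data))
```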

14 What Tuning is Not
- Reducing lines of code
  - Not an indicator of efficient code
- A guess at what might improve things
  - Know what you're trying, and measure results
- Optimizing as you go
  - Wait until finished, then go back to improve
  - Optimizing while programming is often a waste
- A first choice for improvement
  - Worry about other details/design first
- It is not Refactoring
  - Refactoring improves code readability and quality, while tuning often diminishes both

15 Common Inefficiencies
- Unnecessary I/O operations
  - File access especially slow
- Paging/Memory issues
  - Can vary by system
- System Calls
  - Require a context switch, which involves OS and scheduling overhead beyond the program's control
- Interpreted Languages
  - Instead of being compiled as a whole, each line is interpreted and converted to machine language individually
- Table Source: Code Complete book

16 Operation Costs
- Different operations take different times
  - Integer division takes longer than other ops
    - Much slower than bit-shifting to achieve the same result (when dividing by a power of two)
  - Transcendental functions (sin, sqrt, etc.) take even longer
- Knowing this can help when tuning
- Vary by language
  - In C++, private routine calls take about twice the time of an integer op, and in Java about half the time.
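
The division-versus-shift point can be checked with a small timing sketch like the one below. The function names are hypothetical. In an interpreted language such as Python, function-call overhead is likely to dominate, so the two timings may be nearly identical. That is consistent with the slide's remark that costs vary by language, and it is another reason to measure.

```python
import timeit


def divide_by_eight(x):
    return x // 8


def shift_by_three(x):
    return x >> 3  # same result as x // 8 for Python ints


if __name__ == "__main__":
    for func in (divide_by_eight, shift_by_three):
        seconds = timeit.timeit(lambda: func(123_456_789), number=200_000)
        print(f"{func.__name__}: {seconds:.4f} s")
```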

17 An Example
- Stealing an example from Charles Leiserson's Distinguished Lecture on Performance Engineering, February 8, 2017
  - Joint work with Bradley Kuszmaul and Tao Schardl
- Performance Engineering a 4K x 4K Matrix Multiplication
- Machine: Dual-Socket Intel Xeon E v3 (Haswell)
  - 18 cores
  - 2.9 GHz
  - 60 GB of DRAM

18-23 Performance Engineering: 4Kx4K Matrix Multiplication

Implementation                                                   Time (seconds)
Straightforward Python Implementation                            25, (> 7 hours)
Straightforward Java Implementation                              (~ 40 mins)
Straightforward C Implementation                                 (<10 mins)
Parallel Loops                                                   (~ 1 min)
Parallel Divide and Conquer                                      3.80
add Vectorization (streaming the parallel operations)
add AVX intrinsics (processor commands for vector operations)
Strassen's algorithm                                             0.38
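
For reference, a "straightforward" implementation is usually the textbook triple loop. The sketch below is an assumed illustration of that baseline, not the code used in the lecture. For n = 4096 it performs about 2n^3, roughly 1.4 x 10^11, multiply-add operations, which is why the interpreted version takes hours.

```python
def matmul(a, b):
    """Straightforward triple-loop product of two matrices (lists of rows)."""
    n, m, p = len(a), len(b), len(b[0])
    c = [[0] * p for _ in range(n)]
    for i in range(n):
        for j in range(p):
            for k in range(m):
                c[i][j] += a[i][k] * b[k][j]
    return c


if __name__ == "__main__":
    # Small sanity check; a 4096 x 4096 run would take far too long here.
    print(matmul([[1, 2], [3, 4]], [[5, 6], [7, 8]]))
```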

24 The Lesson from this Example
- Code performance engineering can make incredible improvements in performance
  - 67,243 times faster in this case!
  - It is very rewarding
- However, this effort is often meaningless and even harmful when done in a "tune as you go" style during development
  - Tends to decrease code readability and reusability
  - Often it is hard to identify the real bottleneck early on

25 Remember
- Code readability/maintainability/etc. is usually more important than efficiency
- Always start with well-written code, and only tune at the end
- Measure!
