Porting Financial Market Applications to the Cell Broadband Engine Architecture

1 Porting Financial Market Applications to the Cell Broadband Engine Architecture. John Easton, Ingo Meents, Olaf Stephen, Horst Zisgen, Sei Kato. Presented by: Kanik Sem, Dept. of Computer & Information Sciences, University of Delaware.

2 Outline: Why Cell B.E. for financial markets? Porting strategies for the Cell B.E. platform. Performance results. Mixed-precision workloads. Tying it all together. Conclusions.

3 Why Cell B.E. for financial markets? Potential for dramatic impact on financial applications. Application codes were ported to the Cell. The codes were optimized to fully exploit the Cell. Performance improvements of almost 40x.

4 A description of the application: The code prices a European option. The model is based on the Monte Carlo simulation technique. It needs a large number (200,000,000 in this case) of uniform pseudo-random numbers. The financial model is then executed using the generated random numbers.
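The pricing loop described on this slide can be sketched as follows. This is a minimal illustration, not the presenters' code: the function names, the parameter values, the Box-Muller transform, and the geometric Brownian motion model are all assumptions chosen for the example. Python's built-in generator happens to be a Mersenne Twister, which the slides discuss later.

```python
import math
import random

def black_scholes_call(s0, k, r, sigma, t):
    """Closed-form Black-Scholes price of a European call (reference value)."""
    d1 = (math.log(s0 / k) + (r + 0.5 * sigma ** 2) * t) / (sigma * math.sqrt(t))
    d2 = d1 - sigma * math.sqrt(t)
    cdf = lambda x: 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))
    return s0 * cdf(d1) - k * math.exp(-r * t) * cdf(d2)

def monte_carlo_call(s0, k, r, sigma, t, n_paths, seed=12345):
    """Monte Carlo estimate of the same price (hypothetical helper)."""
    rng = random.Random(seed)            # Python's generator is a Mersenne Twister
    drift = (r - 0.5 * sigma ** 2) * t
    vol = sigma * math.sqrt(t)
    total = 0.0
    for _ in range(n_paths):
        u1 = 1.0 - rng.random()          # shift [0, 1) to (0, 1] so log() is safe
        u2 = rng.random()
        # Box-Muller: turn two uniforms into one standard normal sample
        z = math.sqrt(-2.0 * math.log(u1)) * math.cos(2.0 * math.pi * u2)
        s_t = s0 * math.exp(drift + vol * z)   # terminal asset price
        total += max(s_t - k, 0.0)             # call payoff
    return math.exp(-r * t) * total / n_paths  # discounted average payoff
```

Comparing `monte_carlo_call(100, 100, 0.05, 0.2, 1.0, 100_000)` with `black_scholes_call(100, 100, 0.05, 0.2, 1.0)` gives a quick sanity check: the estimate should approach the closed-form value as the number of paths grows.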

5 Porting strategies for Cell: (1) Recompile the existing code for Cell; XLC was better than gcc. (2) Make some structural changes: a framework to start separate threads on each SPU, and splitting the RNG across all cores. (3) Make functional changes to the code: re-engineer functions to exploit vectorization on the SPU cores.
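The structural change (one thread per core, with the random-number work split across them) can be sketched in a few lines. This is only an illustration of the partitioning idea using ordinary Python threads, not SPU code; `split_evenly`, `worker_sum`, `parallel_mean`, and the seeding scheme are hypothetical names and choices made for the example.

```python
import random
from concurrent.futures import ThreadPoolExecutor

def split_evenly(total, workers):
    """Divide `total` evaluations into per-worker counts that differ by at most 1."""
    base, extra = divmod(total, workers)
    return [base + (1 if i < extra else 0) for i in range(workers)]

def worker_sum(args):
    """Each worker owns its generator and returns a partial sum."""
    worker_id, count, base_seed = args
    # Assumption: a distinct seed per worker. This is a simplistic scheme and
    # does not by itself guarantee statistically independent streams.
    rng = random.Random(base_seed + worker_id)
    return sum(rng.random() for _ in range(count))

def parallel_mean(total, workers, base_seed=2024):
    """Fork one job per worker, then reduce the partial sums on the caller."""
    counts = split_evenly(total, workers)
    jobs = [(i, c, base_seed) for i, c in enumerate(counts)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(worker_sum, jobs))
    return sum(partials) / total
```

The design point is that each worker needs only its count and its seed, so no data has to be exchanged until the final reduction.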

6 Analysis of the original code. [Profile table: % time, seconds, and calls per function; values not preserved.] Functions listed: getrandom(), simulateeuropeanoptionvalue(), hpcmontecarlo::random(), hpcblackscholes(). The SDK for Cell provides an optimized RNG. It can run 64 random-number generators at once on a Cell blade. Use the gettimeofday() function for timing.

7 Initial performance results. To run the performance tests, the following parameters were used: Compilers: spuxlc, ppuxlc. Compiler optimization setting: -O3 -qstrict. Random-number generation method: sdk. Precision: single. Number of evaluations: 200,000,000.

8 Initial performance results. Table: performance by number of SPUs (single precision). Columns: number of SPUs; elapsed time in seconds on a 2.4 GHz Cell/B.E. processor (measured); elapsed time in seconds on a 3.2 GHz Cell/B.E. processor (estimated); speedup.
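The slide does not say how the 3.2 GHz figures were estimated. One common approach, assumed here purely for illustration, is to scale the measured time linearly with clock frequency and to define speedup relative to a baseline run. The function names and the sample numbers below are hypothetical, not values from the presentation.

```python
def scale_to_clock(elapsed_s, measured_ghz, target_ghz):
    """Estimate run time at another clock speed, assuming time scales as 1/frequency."""
    return elapsed_s * measured_ghz / target_ghz

def speedup(baseline_s, elapsed_s):
    """Ratio of baseline run time to the run time being evaluated."""
    return baseline_s / elapsed_s

# Hypothetical example: a 4.0 s run measured at 2.4 GHz, projected to 3.2 GHz
estimated = scale_to_clock(4.0, 2.4, 3.2)
```

Linear scaling ignores memory and DMA effects, so an estimate made this way is best read as an optimistic bound.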

9 Initial performance results

10 Double precision: Organizations in financial markets require double-precision calculations. The initial target marketplace for Cell did not need this. The initial implementation of Cell provides limited double-precision support in hardware: single precision is fully pipelined; double precision is only partially pipelined.

11 Performance results. Table: performance by number of SPUs (double precision). Columns: number of SPUs; elapsed time in seconds on a 2.4 GHz Cell/B.E. processor (measured); elapsed time in seconds on a 3.2 GHz Cell/B.E. processor (estimated); speedup.

12 Mersenne-Twister. Run time with Mersenne-Twister (without optimization): 5 sec. Run time with the Cell/B.E. SDK: 4.1 sec. Mechanisms to improve the performance still further: optimize the Mersenne-Twister code for the threading framework; rewrite the code to utilize the SIMD capabilities of the SPUs. Table: performance comparison between the Cell/B.E. SDK and Mersenne-Twister random-number generators. Rows: single precision, double precision. Columns: run time in seconds with the SDK RNG (2.4 GHz); run time in seconds with the Mersenne-Twister RNG (2.4 GHz); run time in seconds with the Mersenne-Twister RNG at 3.2 GHz (estimated).

13 Mixed-precision workloads. Mixed precision: only those parts that actually need double precision are calculated in double precision. Disadvantage: a slight increase in programming effort, because you must identify the parts of the code that need double precision and make the appropriate changes to the code. Advantage: performance improvement.

14 Mixed-precision workloads. The two methods of applying mixed precision to our code are: (1) Concatenating two single-precision random variables. (2) Generating one single-precision random variable and then performing a double-precision division.
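The two methods above can be sketched as follows. The slide gives no implementation detail, so this is one plausible reading: "single-precision random variable" is taken to mean a 32-bit generator output, and the concatenation follows the bit layout used by the `genrand_res53` routine in the Mersenne Twister reference code. Function names are hypothetical.

```python
import random

TWO_32 = 4294967296.0          # 2**32
TWO_53 = 9007199254740992.0    # 2**53

def uniform_by_concatenation(rng):
    """Method (1): combine two 32-bit draws into one 53-bit double in [0, 1)."""
    hi = rng.getrandbits(32) >> 5   # keep the top 27 bits
    lo = rng.getrandbits(32) >> 6   # keep the top 26 bits
    return (hi * 67108864 + lo) / TWO_53   # 67108864 == 2**26

def uniform_by_division(rng):
    """Method (2): one 32-bit draw, then a single double-precision division."""
    return rng.getrandbits(32) / TWO_32
```

Method (1) fills the full 53-bit mantissa of a double at the cost of two generator calls; method (2) costs one call but yields only 32 bits of resolution, which is the trade-off the mixed-precision discussion turns on.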

15 Mixed-precision workloads. Table columns: # SPU, CC_DP_MT, CC_DP_SDK, M_DP_MT, SP_MT, SP_SDK. Legend: CC_DP_MT = Concatenation Double-Precision Mersenne-Twister; CC_DP_SDK = Concatenation Double-Precision SDK; M_DP_MT = Division Double-Precision Mersenne-Twister; SP_MT = Single-Precision Mersenne-Twister; SP_SDK = Single-Precision SDK.

16 Mixed-precision workloads

17 Mixed-precision workloads. Additional optimization techniques: Unrolling more parts of the Mersenne-Twister RNG. Additional software pipelining by parallelizing computation. Introducing new variables to eliminate dependencies. Pre-calculating some items, for example: a[0]=<something>; for (i=0;i<n;i++) {sinf4(a[0]); sinf4(a[i+1]);...}
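"Introducing new variables to eliminate dependencies" usually means replacing one serial accumulator with several independent ones so that consecutive operations no longer wait on each other. The sketch below shows only the shape of that transformation; Python itself gains nothing from it, and the function names are hypothetical. On an SPU the same restructuring would be applied to SIMD code.

```python
import math

def sum_sines_single(xs):
    """Baseline: every addition depends on the previous one."""
    acc = 0.0
    for x in xs:
        acc += math.sin(x)
    return acc

def sum_sines_four_accumulators(xs):
    """Unrolled by 4 with independent accumulators, plus a scalar tail."""
    a0 = a1 = a2 = a3 = 0.0
    n4 = len(xs) - len(xs) % 4          # largest multiple of 4 not exceeding len(xs)
    for i in range(0, n4, 4):
        a0 += math.sin(xs[i])           # the four updates have no mutual dependency
        a1 += math.sin(xs[i + 1])
        a2 += math.sin(xs[i + 2])
        a3 += math.sin(xs[i + 3])
    tail = sum(math.sin(x) for x in xs[n4:])   # leftover elements
    return (a0 + a1) + (a2 + a3) + tail
```

Because floating-point addition is not associative, the two versions can differ in the last few bits, so results should be compared with a tolerance rather than for exact equality.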

18 Intel optimizations. A master thread forks slave threads to perform RNG. The master thread corresponds to the part of the Cell/B.E. code that runs on the PPU; the slave threads correspond to the parts that run on the SPUs. Difference: work scheduled by the OpenMP runtime shares the same cores as the OS threads. The SPUs in the Cell/B.E. version do not run the operating system, so they can be used entirely to run the application code.

19 Intel optimizations. Table columns: System/CPU, Operating System, Compiler, No. of Threads (Cores), Speed (GHz). Rows: x3550 / 3.0, Red Hat Linux, Intel ICPC; x336 / 2.8, Red Hat Linux, Intel ICPC; HS21 / 2.33, Fedora Core 6, gcc.

20 Tying it all together

21 Future work: Results achieved so far are on a system that many view as unsuitable for financial markets users. Enhanced double-precision version of the Cell Broadband Engine technology. Systems based on Cell/B.E. technology are an excellent platform for financial markets applications.

22 Getting the most performance out of Cell/B.E. technology: Offload as much of the computation onto the SPUs as possible. Write the SIMD code yourself rather than relying on the compiler to do it. XLC provides auto-SIMDization, but its output may not be a good approximation of hand-written SIMD code. In certain situations, you might find that starting from scratch is a much quicker way to implement application code.

23 Conclusions. Reasons why general-purpose processors make up the majority of computational infrastructures: (1) Huge numbers of systems are based on these processors. (2) There is a large supply of skilled professionals, which leads to lower skills costs. (3) A lot of application development tooling is available. (4) Porting code to these platforms is relatively easy.

24 Conclusions. Esoteric technologies: They offer high performance for their chip area. They consume much less power per computation. Disadvantages: (1) The skills to program them are rare and, hence, expensive. (2) Lack of application development tooling. (3) The porting process is generally both slow and costly.

25 Conclusions. Advantages of Cell/B.E. technology: (1) Consumes less power, space, and cooling. (2) High computational power. (3) Better data movement and manipulation abilities. (4) A number of strong customer proof points. (5) Support from key independent software vendors. (6) Results of experiments such as this one.

26 Questions. Comments. Caveats.
