Calc: The challenges of scalable arithmetic - How threading can be challenging


1 Calc: The challenges of scalable arithmetic
How threading can be challenging
Michael Meeks, General Manager at Collabora Productivity
Skype - mmeeks, G+ - mejmeeks@gmail.com
"Stand at the crossroads and look; ask for the ancient paths, ask where the good way is, and walk in it, and you will find rest for your souls..." - Jeremiah 6:

2 Calc threading - Overview
LibreOffice 6.0 Calc
  Existing structure & parallelism
Why thread?
The initial solution & problems
  mis-factored code
  dependency issues
The group calculation piece
Profiling & optimizing
Future work & expansion
Disclaimer & Thanks: Almost all of this work was done by Tor Lillqvist & Dennis Francis, who can't be here today. Some great code reading & improvement.

3 LibreOffice 6.0 Calc...
A 30+ year old code-base
Primary data structures hugely improved recently
  Still some scope for improvement: FormulaGroup vs. FormulaCell, per-cell dependency records etc.
Calculation engine in need of love
  Some insights into how it works
  Some problems wrt. threading.

4 Core structures since 4.3 (mdds::multi_type_vector)
[Diagram: ScDocument > ScTable > ScColumn. Column contents: cell values (svl::SharedString block, double block, EditTextObject block, ScFormulaCell block), broadcasters, cell notes, text widths, script types. The ScFormulaCell block is marked "This bit".]

5 FormulaCellGroups
[Diagram: several ScFormulaCell instances share one ScFormulaCellGroup, which holds a ScTokenArray (tokens, RPN).]
Sample token types (StackVar):
  svSingleRef: A1
  svDoubleRef: A1:C3
  svExternalSingleRef etc.
  svDouble
  svString: hello world
  svByte: ocDiv, ocMacro

6 Normal formula interpreting
double ScFormulaCell::GetValue()
{
    MaybeInterpret();
    return GetRawValue();
}
void ScFormulaCell::Interpret()
{
    // amazing recursion flattening
    InterpretTail() // ie. ...
    {
        new ScInterpreter(this, pDocument, rContext, aPos, *pCode /* those tokens */);
        ->Interpret()
StackVar ScInterpreter::Interpret()
{
    // execute reverse-polish stack
    // execute functions
    // get cell values from references  <- Recursion++
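The "execute reverse-polish stack" step can be illustrated with a toy evaluator. This is only a sketch: Op, Token and evalRpn are hypothetical names for illustration, not LibreOffice's ScInterpreter, and it handles nothing but constants and four binary operators.

```cpp
#include <cassert>
#include <stdexcept>
#include <vector>

// Hypothetical token kinds: push a constant, or apply a binary operator.
enum class Op { Push, Add, Sub, Mul, Div };

struct Token {
    Op op;
    double value; // only meaningful when op == Op::Push
};

// Evaluate a token sequence that is already in reverse-polish order.
double evalRpn(const std::vector<Token>& rpn)
{
    std::vector<double> stack;
    for (const Token& t : rpn) {
        if (t.op == Op::Push) {
            stack.push_back(t.value);
            continue;
        }
        if (stack.size() < 2)
            throw std::runtime_error("malformed RPN: operator lacks operands");
        const double rhs = stack.back(); stack.pop_back();
        const double lhs = stack.back(); stack.pop_back();
        switch (t.op) {
            case Op::Add: stack.push_back(lhs + rhs); break;
            case Op::Sub: stack.push_back(lhs - rhs); break;
            case Op::Mul: stack.push_back(lhs * rhs); break;
            case Op::Div: stack.push_back(lhs / rhs); break;
            case Op::Push: assert(false && "Push is handled above"); break;
        }
    }
    if (stack.size() != 1)
        throw std::runtime_error("malformed RPN: leftover operands");
    return stack.back();
}
```

In the real interpreter a reference token would be resolved to a cell value at the push step, which is where the recursion noted on the slide comes from.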

7 InterpretFormulaGroup
[Diagram: ScFormulaCellGroup > ScTokenArray (tokens, RPN) > getvalues > collected to matrix]
Examine for safe cases
Interpret: OpenCL or software
Even the non-threaded software case is faster:
  Shares function input collection work.
  Aggregated / linearized doubles / strings in the matrix

8 Why Thread?

9 CPUs get wider, not faster
Sometimes CPUs get slower
  Process clocks stymied at 3-4 GHz
  IPC improvements ~stalled
Real IPC wins:
  Laptops: minimum 4 threads
  Mid-range: 8 threads.
  PC / Workstation: 8-16 threads, the new normal. Affordable too...
Many thanks to AMD for sponsoring this work.

10 2017 Crash reporting stats
Frustratingly: cores, not threads.
Reports from large core count machines.
[Chart: crash report % by CPU core count over time; y-axis from 0% to 60%]

11 Initial Solution...

12 Thread InterpretFormulaGroup
Attempt re-use of existing formula core
  Try to avoid special / sub-setting code-paths for existing formula-group conversion: a more generic solution.
Concept:
  Pre-calculate dependent cells to control recursion outside of threads.
  Protect invariants with assertions
  Black-list problematic functions...
  Parallelise using existing interpreter.

13 Parallelize existing interpreter
double ScFormulaCell::GetValue()
{
    MaybeInterpret();
    return GetRawValue();
}
Pre-fetch all dependent values and lock that down:
void ScFormulaCell::MaybeInterpret()
    ...
    assert(!pDocument->mbThreadedGroupCalcInProgress);
void ScFormulaCell::Interpret()
{
    // amazing recursion flattening
    InterpretTail() // ie. ...
    {
        new ScInterpreter(this, pDocument, rContext, aPos, *pCode /* those tokens */);
        ->Interpret()
StackVar ScInterpreter::Interpret()
{
    // execute reverse-polish stack
    // execute functions
    // get cell values from references  <- Pre-calculated: no recursion
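The two-phase idea (resolve dependencies single-threaded, then let workers only read) might look roughly like this. It is a minimal sketch under stated assumptions: Doc, maybeInterpret and calcGroup are hypothetical stand-ins, and a single flag plays the role of the "threaded calc in progress" assertion from the slide.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <functional>
#include <thread>
#include <vector>

// Hypothetical stand-in for a document: inputs are fully calculated
// before any worker starts, so workers only ever read them.
struct Doc {
    std::vector<double> inputs;
    bool threadedCalcInProgress = false;

    // Analogue of a lazy "maybe interpret" hook: it must never be
    // reached while worker threads are running.
    void maybeInterpret() const { assert(!threadedCalcInProgress); }
};

// Apply `formula` to every input, splitting the rows across workers.
std::vector<double> calcGroup(Doc& doc,
                              const std::function<double(double)>& formula,
                              unsigned nThreads)
{
    doc.maybeInterpret(); // phase 1: dependencies resolved, single-threaded

    std::vector<double> results(doc.inputs.size());
    nThreads = std::max(1u, nThreads);
    const std::size_t chunk = (doc.inputs.size() + nThreads - 1) / nThreads;

    doc.threadedCalcInProgress = true; // phase 2: read-only from here on
    std::vector<std::thread> workers;
    for (unsigned t = 0; t < nThreads; ++t) {
        const std::size_t begin = std::min(doc.inputs.size(), t * chunk);
        const std::size_t end = std::min(doc.inputs.size(), begin + chunk);
        workers.emplace_back([&doc, &results, &formula, begin, end] {
            for (std::size_t i = begin; i < end; ++i)
                results[i] = formula(doc.inputs[i]); // disjoint slots, no lock
        });
    }
    for (std::thread& w : workers)
        w.join();
    doc.threadedCalcInProgress = false;
    return results;
}
```

Each worker writes only to its own slice of the results, so no locking is needed as long as nothing mutates shared state during phase 2.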

14 ScInterpreter: calcs formulae
[Diagram: ScDocument (number format, link mgmt etc.) > ScTable > ScColumn > ScFormulaCell block > ScFormulaCellGroup > ScTokenArray (tokens, RPN) > ScInterpreter. Dependencies: Broadcasters, ScBroadcastAreaSlotMachine.]
Mutates!
  Dependencies: INDEX, OFFSET etc.
  Vlookup cache
  Macros
  Ext'ns
  Web fn's
  Cloud

15 ScInterpreter: some fixes
Basic iteration - broken:
  class FormulaTokenArray
      sal_uInt16 nIndex; // Current step index
      FormulaToken* FirstRPN() { nIndex = 0; return NextRPN(); }
  Now has an external iterator
  a man-week+ to un-wind this, and debug the last pieces that relied on this.
Added mutation guards:
  ScMutationGuard aGuard(this, ScMutationGuardFlags::CORE);
  In all likely-looking places: where core state is changed.
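Why an index stored inside the array breaks under threading, and what an external iterator changes, can be sketched as follows. TokenArray and TokenArrayIterator here are hypothetical, simplified types (tokens are plain ints), not the actual formula classes.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical token array: it owns the tokens but has no notion of a
// "current position", so it can be shared read-only between threads.
class TokenArray {
public:
    explicit TokenArray(std::vector<int> rpn) : maRpn(std::move(rpn)) {}
    std::size_t size() const { return maRpn.size(); }
    int at(std::size_t i) const { return maRpn[i]; }
private:
    std::vector<int> maRpn;
};

// The iteration state lives in the iterator, owned by whoever walks
// the array; two walkers can no longer disturb each other.
class TokenArrayIterator {
public:
    explicit TokenArrayIterator(const TokenArray& rArray) : mrArray(rArray) {}

    // Stores the next token in rToken and returns true, or returns
    // false once the end of the array has been reached.
    bool next(int& rToken)
    {
        if (mnIndex >= mrArray.size())
            return false;
        rToken = mrArray.at(mnIndex++);
        return true;
    }
private:
    const TokenArray& mrArray;
    std::size_t mnIndex = 0;
};
```

With the index inside the array, a second caller resetting it to 0 would silently restart the first caller's walk; with one iterator per caller that interaction disappears.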

16 Disabling nasties:
Dependency graph manipulation during calculation:
  Indirect, Offset, Match, Cell, ocTableOp
Other stuff
  Macros: disabled for now.
    Could detect pure, ie. non-mutating, functions
    Also parallelize the basic/ interpreter (?)
  Info: grab-bag of bits.
  ocExternal UNO extensions: currently in, but can do ~un-controlled mutation (?)

17 More nasties...
Several global variables
  No-where obvious to hang them
  Now some thread_local variables
    Calculation stack
    Current-document being calculated
    Matrix positions nc, nr
  Somewhat horrific: fix obsolete Mac toolchain.
ScInterpreterContext
  Added, passed through all functions.
  Impacts eg. GetValue though
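The two remedies named here, thread_local storage and an explicitly passed context, can be shown side by side in a small sketch. All names (tCurrentRow, InterpreterContext, cellValue, calcRows) are hypothetical and only illustrate the pattern.

```cpp
#include <cassert>
#include <thread>
#include <vector>

// Hypothetical per-thread state that used to be a plain global.
thread_local int tCurrentRow = -1;

// Hypothetical context object, handed down explicitly instead of
// being read from a global.
struct InterpreterContext {
    int sheet;
};

double cellValue(const InterpreterContext& ctx, int row)
{
    tCurrentRow = row; // each thread writes its own copy
    return ctx.sheet * 1000.0 + tCurrentRow;
}

// One worker per row, purely for illustration.
std::vector<double> calcRows(int sheet, int nRows)
{
    std::vector<double> out(nRows);
    std::vector<std::thread> workers;
    for (int row = 0; row < nRows; ++row)
        workers.emplace_back([&out, sheet, row] {
            InterpreterContext ctx{sheet}; // one context per worker
            out[row] = cellValue(ctx, row);
        });
    for (std::thread& w : workers)
        w.join();
    return out;
}
```

The calling thread never runs cellValue, so its own tCurrentRow keeps its initial value even though every worker has written to "the same" variable.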

18 How did that look: initially...
Re-calculating 100k formulae on 1m doubles
[Chart: seconds to calculate (lower is faster) vs. thread count, starting from a single thread; series Meeks/Linux and Ryzen/Win10]
Getting some nice speedups, ignoring the hyper-threadedness:
  8.5s to 2.5s with 4 threads: 3.4x
  ~5.5x with 8 threads

19 Up to this point:
Plain Old Calculation, single threaded (POC)
Group calculation
  A) Single Threaded Software Group calc (STSG)
  B) OpenCL: GPU parallelism after conversion
  C) New threaded calculation (NTC)
Then: C) slower than A) in some cases
  Collecting data from sheets, branching, type handling, etc. again and again for each formula cell
  Expensive: threading doesn't help.
  A) collects once and has some SSE2 goodness
  So add a threaded A) - simple & better
Weighting decision: POC vs.... based on complexity.

20 Improving performance...
Why don't we get 8x for 8 threads?
Terrible profiling tools on Windows.
Linux: used perf, looking for threading issues:
  sudo perf record --call-graph dwarf \
      --switch-events -c 1 # etc.
Looking for false-sharing
And other horrors.
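False sharing happens when threads update different variables that sit on the same cache line. One common mitigation is to pad per-thread data out to a full line, sketched below. The 64-byte line size is an assumption, PaddedCounter and countInParallel are hypothetical names, and the over-aligned vector element relies on C++17 aligned allocation.

```cpp
#include <cassert>
#include <cstddef>
#include <thread>
#include <vector>

// Assumed cache-line size; 64 bytes is common on x86-64 but not universal.
constexpr std::size_t kCacheLine = 64;

// Each counter is padded out to its own cache line, so threads that
// update neighbouring counters do not invalidate each other's lines.
struct alignas(kCacheLine) PaddedCounter {
    long value = 0;
};

// Every worker increments only its own counter; the totals are summed
// after all workers have joined.
long countInParallel(unsigned nThreads, long perThread)
{
    std::vector<PaddedCounter> counters(nThreads);
    std::vector<std::thread> workers;
    for (unsigned t = 0; t < nThreads; ++t)
        workers.emplace_back([&counters, t, perThread] {
            for (long i = 0; i < perThread; ++i)
                ++counters[t].value;
        });
    for (std::thread& w : workers)
        w.join();

    long total = 0;
    for (const PaddedCounter& c : counters)
        total += c.value;
    assert(total == static_cast<long>(nThreads) * perThread);
    return total;
}
```

Without the alignas, the result is identical but adjacent counters may share a line, which is the kind of slowdown a profile can expose.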

21 Horror: rampant heap thrash
RPN calculation is stack based:
  Tons of stack operations: pushing values etc.
  These do memory allocations & frees.
  Using the ancient / internal allocator, never intended for heavy parallel use.
    Drop the custom allocator
  Re-use tokens where possible too.
std::stack on deque, lists: horrible
  std::vector instead: far better, hugely faster
Re-using ScInterpreterContext
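Re-using tokens instead of allocating one per push could be sketched as a small fixed-size cache. This is illustrative only: DoubleToken and TokenCache are hypothetical, and the sketch assumes a caller never holds more than 16 tokens at once; a real cache would have to check that a slot is no longer referenced before recycling it.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <memory>

// Hypothetical token holding one double result.
struct DoubleToken {
    double value = 0.0;
};

// Ring of reusable tokens: a slot is heap-allocated the first time it
// is used and then recycled, so steady-state pushes do no allocation.
class TokenCache {
public:
    DoubleToken& acquire(double v)
    {
        std::unique_ptr<DoubleToken>& slot = maSlots[mnNext];
        mnNext = (mnNext + 1) % maSlots.size();
        if (!slot) {
            slot = std::make_unique<DoubleToken>();
            ++mnAllocations;
        }
        slot->value = v;
        return *slot;
    }

    // Number of heap allocations performed so far.
    std::size_t allocations() const { return mnAllocations; }

private:
    std::array<std::unique_ptr<DoubleToken>, 16> maSlots;
    std::size_t mnNext = 0;
    std::size_t mnAllocations = 0;
};
```

Declaring one such cache per thread (for example as a thread_local object) keeps it free of locking.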

22 Other issues...
Where GetDouble meets SfxItemSet...
fixed SvNumberFormatter thread safety.

23 Threading & optimizing story:
Note: this perf. regression from threading for some workloads came from avoiding the SoftwareGroupInterpreter
Benchmarking some of our sample sheets...
[Chart: benchmark result per optimization step; axis values not recoverable]
Steps:
  Baseline from recent master
  Thread Software Interpreter
  Group Interpreter work by Tor
  Avoid TokenArray thrash
  disable custom allocator
  halve the number of threads if HT is active
  make token cache thread local
  use a cache for FormulaDoubleToken allocation
  up the token cache size to 16
  halve threadcount if HT active for group interpreter too
  Use C++ threads
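The "halve the number of threads if HT is active" step amounts to a small policy function. In this sketch pickThreadCount and defaultThreadCount are hypothetical names, and detecting whether hyper-threading is active is left to the caller because standard C++ offers no portable query for it.

```cpp
#include <algorithm>
#include <cassert>
#include <thread>

// Choose a worker count from the number of hardware threads; with
// hyper-threading active, use one worker per physical core instead.
unsigned pickThreadCount(unsigned hardwareThreads, bool hyperThreading)
{
    unsigned n = std::max(1u, hardwareThreads); // never fewer than one
    if (hyperThreading)
        n = std::max(1u, n / 2);
    return n;
}

// std::thread::hardware_concurrency() may return 0 when the value is
// unknown; pickThreadCount() clamps that case to 1.
unsigned defaultThreadCount(bool hyperThreading)
{
    return pickThreadCount(std::thread::hardware_concurrency(), hyperThreading);
}
```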

24 Future work
Stop the crash-testing from asserting...
  Implicit intersection: killing us (again)
  Move RPN to have precise ranges
Extend threaded unit tests further
Move more global variables to ScInterpreterContext
Make FormulaCell a 1x item group
  Make POC calculation a forced-single-threaded calc
  Always thread SoftwareGroupInterpreter
De-bong the format-type use
  =J20 should not change format type if J20 changes format.
  A sheet-creation-time optimization
  Intersects with units work too.


25 Conclusions
Calculation can be threaded
Significant speedups are possible
Profiling & optimizing works
  "it is slow" == "not enough invested yet"
All problems are just economics
Many thanks to AMD for their support.
"Oh, that my words were recorded, that they were written on a scroll, that they were inscribed with an iron tool on lead, or engraved in rock for ever! I know that my Redeemer lives, and that in the end he will stand upon the earth. And though this body has been destroyed yet in my flesh I will see God, I myself will see him, with my own eyes - I and not another. How my heart yearns within me." - Job 19:
