Rectangles All The Way Down. Martin Thompson


1 Rectangles All The Way Down Martin Thompson

2 The most amazing achievement of the computer software industry is its continuing cancellation of the steady and staggering gains made by the computer hardware industry. - Henry Petroski

3 Fundamental Laws

4-9 CPU Performance Memory Lane

Spectre & Meltdown - Google
Retirement of Tick-Tock - Intel
The free lunch is over - Herb Sutter
CPUs double in speed every 18 months - David House
Transistor density doubles every 2 years - Gordon Moore
Transistor density doubles every year - Gordon Moore

10 Concurrency & Parallelism

12 Universal Scalability Law (USL)

$$C(N) = \frac{N}{1 + \alpha (N - 1) + \beta N (N - 1)}$$

C = capacity or throughput
N = number of processors
α = contention penalty
β = coherence penalty

13 Universal Scalability Law (USL) [Chart: speedup against number of processors, comparing Amdahl's Law with the USL]
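As a rough illustration of the USL formula, the sketch below evaluates it for a few processor counts and compares it with Amdahl's Law, which is the special case β = 0. The values chosen for α and β are made-up illustrative numbers, not measurements from the talk, and the function names are hypothetical.

```python
def usl_capacity(n, alpha, beta):
    """Relative capacity C(N) = N / (1 + alpha(N - 1) + beta * N * (N - 1))."""
    return n / (1 + alpha * (n - 1) + beta * n * (n - 1))


def amdahl_speedup(n, alpha):
    """Amdahl's Law: the USL with no coherence penalty (beta = 0)."""
    return usl_capacity(n, alpha, 0.0)


if __name__ == "__main__":
    ALPHA = 0.05   # assumed contention penalty (illustrative only)
    BETA = 0.001   # assumed coherence penalty (illustrative only)
    for n in (1, 2, 4, 8, 16, 32, 64):
        print(n,
              round(amdahl_speedup(n, ALPHA), 2),
              round(usl_capacity(n, ALPHA, BETA), 2))
```

With these assumed penalties the Amdahl curve keeps rising towards its ceiling, while the USL curve peaks and then falls as the coherence term grows with N * (N - 1).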

14 If concurrency is so difficult then what else can we do?

16 Queueing Theory [Chart: response time against utilisation]

17 Queueing Theory

$$r = \frac{s (2 - \rho)}{2 (1 - \rho)}$$

r = mean response time
s = service time
ρ = utilisation

Note: ρ = λ * s
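To get a feel for how sharply response time climbs with utilisation, here is a minimal sketch that evaluates the formula above. The service time of 1.0 is an arbitrary unit chosen for illustration, and the function name is hypothetical.

```python
def mean_response_time(service_time, utilisation):
    """r = s(2 - rho) / (2(1 - rho)), defined for 0 <= rho < 1."""
    if not 0 <= utilisation < 1:
        raise ValueError("utilisation must be in [0, 1)")
    return service_time * (2 - utilisation) / (2 * (1 - utilisation))


if __name__ == "__main__":
    # Service time of 1.0 is an assumed unit, so results read as multiples of s.
    for rho in (0.0, 0.5, 0.8, 0.9, 0.95, 0.99):
        print(rho, round(mean_response_time(1.0, rho), 2))
```

At 50% utilisation the mean response time is 1.5 times the service time; at 90% it is about 5.5 times.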

19-21 Little's Law

L = λw
WIP = Throughput * Cycle Time

Bandwidth Delay Product: Bytes in flight = Bandwidth * Latency

80 bytes / 100ns = 800 MB/s : 10 LFBs

22 Memory

23 Are all memory operations equal?

24 Sequential Access - Average time in ns/op to sum all longs in a 1GB array?

25 Access Pattern Benchmark

Benchmark              Score    Error    Units
============================================
sequential                      ±        ns/op

~1 ns/op

26 Really??? Less than 1ns per operation?

27 Instruction Level Parallelism

29-33 Access Pattern Benchmark

Benchmark              Score    Error    Units
============================================
sequential                      ±        ns/op
randompage                      ±        ns/op
dependentrandompage             ±        ns/op
randomheap                      ±        ns/op
dependentrandomheap             ±        ns/op

~90 ns/op
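The slides do not show the benchmark source, so the following is only one plausible reading of the "dependent" prefix: in the independent case every index is known up front, while in the dependent case the next index is derived from the value just loaded, so each load must wait for the previous one. The function names are hypothetical, and Python's interpreter overhead hides the hardware effect; the sketch shows the shape of the data dependency, not its cost.

```python
def independent_sum(data, indices):
    """Indices are known in advance, so the loads do not depend on each other."""
    total = 0
    for i in indices:
        total += data[i]
    return total


def dependent_sum(data, start, steps):
    """Each next index comes from the value just loaded (pointer chasing)."""
    total = 0
    i = start
    for _ in range(steps):
        value = data[i]
        total += value
        i = value % len(data)   # the next access depends on this load
    return total
```

In the first function a CPU can have several loads in flight at once; in the second, the address of each load is unknown until the previous load completes.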

35 A 100ns cache-miss is a lost opportunity to execute ~1000 instructions on CPU
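The slide does not say how the ~1000 figure is derived. One way to arrive at that order of magnitude is to multiply the miss latency by the clock rate and by the number of instructions the core can retire per cycle. The 2.5 GHz clock and 4 instructions per cycle below are assumptions picked for illustration, not figures from the talk, and the function name is hypothetical.

```python
def instructions_lost(miss_latency_ns, clock_ghz, instructions_per_cycle):
    """Instructions that could have executed while waiting on a cache miss."""
    cycles = miss_latency_ns * clock_ghz   # 1 GHz is one cycle per nanosecond
    return cycles * instructions_per_cycle


if __name__ == "__main__":
    # Assumed: 2.5 GHz clock, 4 instructions per cycle at best.
    print(instructions_lost(100, 2.5, 4))
```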

36 Algorithms & Data Structures

37-39 Little's Law

L = λw

Bandwidth Delay Product: Bytes in flight = Bandwidth * Latency

80 bytes / 100ns = 800 MB/s : 10 LFBs
80 bytes / 15ns = 5.3 GB/s : prefetch
640 bytes / 15ns = 42.6 GB/s : cachelines
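The three bandwidth figures follow from rearranging the bandwidth delay product as Bandwidth = Bytes in flight / Latency. The sketch below reproduces the arithmetic. Reading 80 bytes as ten 8-byte values in flight and 640 bytes as ten 64-byte cache lines is an interpretation of the slide, not something it states, and the function name is hypothetical.

```python
def bandwidth_bytes_per_sec(bytes_in_flight, latency_ns):
    """Little's Law rearranged: throughput = work in progress / cycle time."""
    return bytes_in_flight * 1e9 / latency_ns


if __name__ == "__main__":
    cases = (
        (80, 100),    # 10 LFBs, assumed 8 bytes used from each
        (80, 15),     # same bytes in flight, lower latency with prefetch
        (640, 15),    # assumed 10 whole 64-byte cache lines in flight
    )
    for bytes_in_flight, latency_ns in cases:
        gb_per_sec = bandwidth_bytes_per_sec(bytes_in_flight, latency_ns) / 1e9
        print(bytes_in_flight, latency_ns, round(gb_per_sec, 2), "GB/s")
```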

40 Arrays are the most efficient data structure to traverse

42 Functional data structures are like sausages, the more you see them being made, the less well you will sleep

43 Branches

45-47 Branch Benchmark

Benchmark              Score    Error    Units
=============================================
baseline                        ±        us/op
predictable                     ±        us/op
unpredictable                   ±        us/op

48 What can we do?

49 Count bits as Booleans
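The slide gives only the heading, so the following is one possible reading of it: store each Boolean as a single bit in a machine word and count the true values with a population count rather than a loop of branches. The function names are hypothetical.

```python
def pack_flags(flags):
    """Pack an iterable of Booleans into one integer; bit i holds flags[i]."""
    word = 0
    for i, flag in enumerate(flags):
        if flag:
            word |= 1 << i
    return word


def count_true(word):
    """Population count of a non-negative integer: the number of set bits."""
    return bin(word).count("1")


if __name__ == "__main__":
    word = pack_flags([True, False, True, True])
    print(bin(word), count_true(word))
```

In a compiled language the population count usually maps to a single instruction, so 64 Booleans can be counted without a single data-dependent branch.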

50 Wide Registers

51 Math, Data Dependencies, and Instruction Level Parallelism

53 Consider Sorting Arrays

55 It's a neat hack, and it's more useful now than it was then for two reasons. - Leslie Lamport (2011)

56 The obvious reason is that word size is larger now, with many computers having 64-bit words. - Leslie Lamport (2011)

57 The less obvious reason is that conditional operations are implemented with masking rather than branching. - Leslie Lamport (2011)

58 Branching is more costly on modern multi-issue computers than it was on the computers of the 70s. - Leslie Lamport (2011)
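As a small illustration of a conditional operation done with masking rather than branching, the sketch below selects the smaller of two integers without an if/else. It is a generic example of the idea, not the specific hack the quote refers to, and the function name is hypothetical.

```python
def select_min(a, b):
    """Return the smaller of two integers using a mask instead of if/else."""
    mask = -(a < b)                   # -1 (all bits set) when a < b, otherwise 0
    return (a & mask) | (b & ~mask)   # keeps a when mask is -1, b when mask is 0


if __name__ == "__main__":
    print(select_min(3, 7), select_min(7, 3), select_min(-5, 2))
```

Python still evaluates the comparison, so this shows only the shape of the technique; in a compiled language the same expression lets the CPU avoid a conditional jump that might be mispredicted.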

60 Work with your CPU caches

61-63 Memory Access Considerations

1. Temporal: group accesses in time
2. Spatial: group accesses in space
3. Pattern: create predictable patterns
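To make the spatial consideration concrete, the sketch below counts how many distinct cache lines a sequence of array accesses touches. It assumes 64-byte cache lines, 8-byte elements, and an array that starts on a cache-line boundary; these are common values but are assumptions here, and the function name is hypothetical.

```python
def cache_lines_touched(indices, element_size=8, line_size=64):
    """Number of distinct cache lines covered by the given element indices."""
    return len({(i * element_size) // line_size for i in indices})


if __name__ == "__main__":
    sequential = range(64)           # 64 neighbouring elements
    strided = range(0, 64 * 8, 8)    # 64 elements, one per cache line
    print(cache_lines_touched(sequential), cache_lines_touched(strided))
```

Both patterns read 64 elements, but the sequential walk touches 8 cache lines while the strided walk touches 64.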

64 Batching

65-66 Batching: Amortising Costs

[Chart: average overhead per item, or operation, in a batch; axis from 100% down to 0%. Annotation: Words, Cachelines, Pages, Blocks, Frames, etc.]
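A minimal sketch of the amortisation idea, assuming the simplest possible cost model: each batch pays one fixed overhead, shared equally by the items in it. The model and the function name are assumptions for illustration; real overheads rarely divide this cleanly.

```python
def overhead_per_item_pct(batch_size):
    """Share of one fixed per-batch overhead carried by each item, in percent."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    return 100.0 / batch_size


if __name__ == "__main__":
    for size in (1, 2, 4, 10, 100):
        print(size, overhead_per_item_pct(size))
```

Under this model most of the gain arrives early: going from 1 to 10 items per batch removes 90% of the per-item overhead, and going from 10 to 100 removes only a further 9%.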

67 In closing

68 Profile, profile, profile...

69 Eliminate Waste; Batch to Amortise; Access Memory in Patterns; Favour Math over Branches; Favour Predictable Branches

70 Consider Parallelism - ILP & Task

71 Is it really Turtles all the way down?

72 Rectangles all the way down

Networks: Frames
Operating Systems: Pages
File systems and storage: Blocks
DRAM memory: Banks and Row Buffers
CPU cache subsystems: Cache Lines
Applications use Arrays, and interesting data structures are made up of small Arrays

73 I don't care what data structure you use, nothing beats an array - a HFT Programmer

74 Questions? Travel is fatal to prejudice, bigotry, and narrow-mindedness, and many of our people need it sorely on these accounts. Broad, wholesome, charitable views of men and things cannot be acquired by vegetating in one little corner of the earth all one's lifetime. - Mark Twain
