Graphite Introduction and Overview: Goals, Architecture, and Performance
1 Graphite Introduction and Overview: Goals, Architecture, and Performance
2 The Future of Multicore. Computing has moved aggressively to multicore: up to 72 cores are available now (MIT Raw, Intel SCC, Sun UltraSPARC T2, IBM PowerXCell 8i), and chips with far more cores are expected by 2018 if trends continue. [Chart: number of cores vs. time]
3 Simulation in Multicore Research. Simulation is vital for exploring future architectures: experiment with new designs and technologies, abstract away details and focus on key elements, rapidly explore the design space, and enable early software development for upcoming architectures. The future of multicore simulation: the need to simulate 100s to 1000s of cores; massive quantities of computation; high-level architecture becoming more important than microarchitecture (on-chip networks, memory hierarchies, DRAM access, cache coherence).
4 Graphite At-a-Glance. A fast, high-level simulator for large-scale multicores. Application-level simulation, where application threads are mapped to target cores. Multi-machine distribution: leverages additional compute and memory, is invisible to the application, and runs off-the-shelf pthread apps. Relaxed synchronization scheme: trades some timing accuracy for performance while guaranteeing functional correctness. Integrated power models. [Diagram: application threads mapped to cores across multicore host machines]
5 Graphite Performance. Graphite performance on 8 host machines (64 cores total): min 3 MIPS, max 81 MIPS, median 14 MIPS. Typical slowdown for existing sequential simulators: 10,000x to 100,000x. Results from SPLASH-2 benchmarks on a 32-core target processor.
6 Graphite Trades Accuracy for Performance. Simulator performance is a major limiting factor: it limits the depth and breadth of studies and the size of benchmarks; too much detail slows simulation; and sequential simulators cannot simulate 1000s of cores. Most simulators are sequential; Graphite is parallel. Typical performance today: 10,000x to 100,000x slowdown per core. Our target performance: 20 MIPS (a 100x to 1000x slowdown). Performance vs. accuracy: cycle-accurate simulators are very accurate but slow; high-level simulators trade some accuracy for performance. For next year's chips you need cycle accuracy; for chips 5-10 years out you need performance.
7 Outline: Introduction; Graphite Architecture Overview; Multi-machine Distribution; Clock Synchronization; Results; Conclusions.
8 Graphite Overview. An application-level simulator based on dynamic binary translation (uses Intel's Pin). The app runs natively except for new features and modeled events; on a trap, Graphite models functionality, timing, and energy. A simulation consists of running an application on a target architecture specified by swappable models and runtime parameters: different architectures, accuracy vs. performance. Results: the application's output, simulated time to completion, statistics about processor events, and energy and power for various components.
9 Graphite Architecture. [Diagram: application threads running as host threads inside Graphite host processes, on host cores across several host machines] Application threads are mapped to target tiles; on a trap, the correct target tile's models are used. Tiles are distributed among host processes, and processes can be distributed across multiple host machines.
10 Simulated Architecture. [Diagram: per-tile processor core, cache hierarchy, DRAM controller and DRAM, and network switch, connected by an interconnection network] Swappable models for the processor, network, and memory-hierarchy components: explore different architectures, trade accuracy for performance. Cores may be homogeneous or heterogeneous.
11 Key Simulator Components. [Diagram: each host machine runs application threads plus LCP service threads, with one MCP; each thread's stack consists of the messaging API and memory system, the network model, the transport layer, and the physical transport]
12 Communication Stack. [Stack: application thread, then messaging API / memory system, then network model, then transport layer] Graphite implements a layered communication stack. The application thread communicates with other threads via messages, using either the Graphite messaging API or simulated shared memory. Messages are routed and timed by the target architecture's network model. The transport layer delivers messages to the destination target core: via host shared memory within the same host process, or TCP/IP between different host processes.
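The layering above can be illustrated with a small sketch. This is a conceptual model, not Graphite's actual code: the class and parameter names (`NetworkModel`, `TransportLayer`, `hop_latency`) are invented for illustration, and a simple in-process queue stands in for the shared-memory/TCP transport.

```python
from collections import deque

class NetworkModel:
    """Routes and times messages according to the target architecture
    (here a toy 1-D distance stands in for a mesh route)."""
    def __init__(self, hop_latency=5):
        self.hop_latency = hop_latency  # cycles per hop (assumed value)

    def route(self, src, dst, payload, send_time):
        hops = abs(src - dst)
        arrival = send_time + hops * self.hop_latency  # modeled delivery time
        return (arrival, payload)

class TransportLayer:
    """Delivers timed messages to the destination core's inbox."""
    def __init__(self, num_cores):
        self.inboxes = [deque() for _ in range(num_cores)]

    def deliver(self, dst, message):
        self.inboxes[dst].append(message)

net = NetworkModel()
transport = TransportLayer(num_cores=4)
msg = net.route(src=0, dst=3, payload="hello", send_time=100)
transport.deliver(3, msg)
print(transport.inboxes[3][0])  # (115, 'hello'): 3 hops * 5 cycles after cycle 100
```

The key point the sketch captures is the separation of concerns: the network model decides *when* a message arrives in simulated time, while the transport layer only decides *how* it physically gets to the destination host.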
13 Power Modeling. Static and dynamic power/energy modeling for the cores, caches, and network. Simulation-driven energy modeling (not trace-based). On-line availability of energy/power estimates enables dynamic power management in hardware and software, such as DVFS, which affects both power and performance. Uses third-party tools: McPAT and DSENT.
14 Outline: Introduction; Graphite Architecture Overview; Multi-machine Distribution; Clock Synchronization; Results; Conclusions.
15 Parallel Distribution Challenges. We wanted to support the standard pthreads model (allowing use of off-the-shelf apps) and to simulate coherent-shared-memory architectures. Graphite must therefore provide the illusion that all threads are running in a single process on a single machine: a single shared address space, thread spawning, and system calls.
16 Single Shared Address Space. All application threads run in a single simulated address space. The memory subsystem provides functionality as well as modeling: functionality is implemented as part of the target memory models, which eliminates redundant work and tests the correctness of the memory models. [Diagram: one simulated application address space spanning several host address spaces]
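The idea that the memory models supply both timing and data can be sketched as follows. This is a hypothetical illustration (the class name, latency value, and interface are invented, not Graphite's): every load and store goes through the target memory model, which returns the data *and* advances the requesting core's clock.

```python
class SimulatedMemory:
    """Toy target memory model: one backing store for the whole
    simulated address space, separate from any host address space."""
    def __init__(self):
        self.mem = {}            # simulated address -> value
        self.access_latency = 3  # assumed cycles per access

    def store(self, addr, value, clock):
        self.mem[addr] = value
        return clock + self.access_latency          # timing side effect

    def load(self, addr, clock):
        return self.mem.get(addr, 0), clock + self.access_latency

mem = SimulatedMemory()
clock = mem.store(0x1000, 42, clock=0)
value, clock = mem.load(0x1000, clock)
print(value, clock)  # 42 6
```

Because the model is also the source of truth for the data, a bug in the modeled cache hierarchy shows up as wrong application output rather than just wrong statistics, which is exactly the "test correctness of memory models" benefit the slide mentions.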
17 Thread Distribution. Graphite runs application threads across several host machines. Each host process must be initialized correctly; threads are automatically distributed by trapping threading calls. [Diagram: application threads mapped onto cores across multicore host machines]
18 System Calls. Many system calls need to be handled specially: those that pass memory operands to the kernel, synchronization and communication between threads, allocating and deallocating dynamic memory, file I/O operations, and those that reflect target architectural state (e.g., time). Other system calls can simply be allowed to fall through to the host.
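The handled-vs-fall-through split can be pictured as a dispatch table. The handler names and behavior below are illustrative assumptions, not Graphite's actual handlers; the point is only the control flow: listed syscalls are emulated against target state, everything else executes on the host.

```python
# Hypothetical handlers: these stand in for real emulation logic.
def handle_mmap(args):
    return "allocated in simulated address space"

def handle_clock_gettime(args):
    return "returned simulated time"

SPECIAL_SYSCALLS = {
    "mmap": handle_mmap,                    # dynamic memory allocation
    "clock_gettime": handle_clock_gettime,  # reflect target time, not host time
}

def on_syscall(name, args, host_execute):
    handler = SPECIAL_SYSCALLS.get(name)
    if handler:
        return handler(args)         # special handling against target state
    return host_execute(name, args)  # fall through to the host OS

result = on_syscall("getpid", (), lambda n, a: f"host handled {n}")
print(result)  # host handled getpid
```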
19 Lite Mode. Runs without the complexity needed for multi-machine simulations: it relies on the host system for correctness, with no special handling of system calls. Advantages: better compatibility with off-the-shelf applications, and faster simulations in some situations. Disadvantages: it can only run on a single machine, does not help debug target memory-system models, and may not work well with very large numbers of target cores. The power and performance models of the target architecture remain the same, including the core, memory subsystem, and network models.
20 Outline: Introduction; Graphite Architecture Overview; Multi-machine Distribution; Clock Synchronization; Results; Conclusions.
21 Clock Synchronization. Cores only interact through messages, and clocks are updated with message timestamps. [Diagram: messages exchanged between Core 1 and Core 2]
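The timestamp update rule the slide describes is essentially a Lamport-style clock advance; a minimal sketch (the function name is mine, not Graphite's):

```python
def on_message_receive(local_clock, msg_timestamp):
    """On receiving a message, a core's clock jumps forward to at least
    the message's timestamp: the receiver cannot logically observe the
    message before it was sent in simulated time."""
    return max(local_clock, msg_timestamp)

print(on_message_receive(local_clock=900, msg_timestamp=1000))   # 1000: lagging core catches up
print(on_message_receive(local_clock=1200, msg_timestamp=1000))  # 1200: a leading core never moves backward
```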
22 Clock Synchronization. Threads may run at different speeds, causing their clocks to deviate. Clocks are only used for timing; functional correctness is always preserved. Clocks must be synchronized on explicit interaction, but may differ on implicit interaction, which causes timing inaccuracy. Here, synchronization means managing the skew of the different target core clocks; this is not application synchronization! Graphite supports three synchronization schemes with different accuracy and performance trade-offs.
23 Synchronization Schemes. Lax: relies exclusively on application synchronization events to synchronize the tiles' local clocks; functionally, events may occur out of order with respect to simulated time; best performance, worst accuracy. LaxP2P: based on the observation that timing inaccuracy is due to a few outliers, every N cycles each target core randomly pairs with another, and if their cycle counts differ by too much, the core that is ahead in simulated time goes to sleep; good performance, good accuracy. LaxBar: every N cycles, all target cores wait on a barrier; this keeps the cores tightly synchronized and imitates cycle accuracy; worst performance, best accuracy.
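A sketch of the LaxP2P check, under stated assumptions: per-core clocks live in a list, the skew threshold (`slack`) and function name are invented, and Graphite's actual pairing policy may differ in detail.

```python
import random

def laxp2p_check(clocks, me, slack):
    """Core `me` pairs with a random peer; returns True if `me` has run
    so far ahead in simulated time that it should go to sleep."""
    peer = random.choice([c for c in range(len(clocks)) if c != me])
    return clocks[me] - clocks[peer] > slack

clocks = [50_000, 12_000, 11_500, 13_000]   # core 0 is the outlier
print(laxp2p_check(clocks, me=0, slack=10_000))  # True: core 0 sleeps, whichever peer is drawn
print(laxp2p_check(clocks, me=2, slack=10_000))  # False: core 2 is not ahead of anyone by > slack
```

Note why random pairing works well here: since inaccuracy comes from a few outliers, a runaway core is ahead of almost every possible peer, so it is caught quickly without any global coordination.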
24 Example Simulation (Lax). [Timeline: real time vs. simulated time for Cores 1-3 under Lax; clocks drift freely between application synchronization points until app exit]
25 Example Simulation (LaxP2P). [Timeline: same run under LaxP2P; periodic P2P checks put cores that have run ahead to sleep, keeping clocks closer together between application synchronization points]
26 Example Simulation (LaxBar). [Timeline: same run under LaxBar; periodic barriers keep all cores' clocks tightly aligned throughout]
27 Clock Skew Measurements. [Graphs: local clock value over time for Lax, LaxP2P, and LaxBar] The graphs show approximate clock skew for each scheme on the fmm benchmark; clock skew is the spread between the minimum and maximum clocks at any given point. (Spikes on the graphs are due to errors in the measurement method.) Lax has the largest skew (~2,000,000 cycles): application synchronization events are clearly visible, and fine-grained thread interactions can be missed or misrepresented. LaxP2P has much lower skew (~30,000 cycles, with an interval of 10,000 cycles); application synchronization events are slightly visible. LaxBar has low, constant skew (~4,000 cycles).
28 Outline: Introduction; Graphite Architecture; Results (experimental methodology, simulator performance and scaling, validation against cycle-level simulation); Conclusions.
29 Experimental Methodology. Target architecture: 64 / 1024 cores; private 32 KB L1-I/L1-D caches per tile; private 512 KB L2 cache per tile; full-map directory-based cache coherence; 2-D mesh interconnection network. Benchmarks: SPLASH-2 and PARSEC suites. All experimental results were collected on 8-core Xeon host machines running Linux.
30 Performance Scaling (64 Cores). [Chart: normalized simulator speed-up] Graphite scales if the application scales; even non-ideal speedup still reduces latency and design iteration time.
31 Performance Summary. [Table: Graphite simulator performance in MIPS (min, max, mean, median) for sequential (1 core), 1 host* (8 cores), and 8 hosts* (64 cores); *host machines are 8-core servers] Sequential simulator performance is unacceptable. Parallel simulator performance reaches as high as 81 MIPS and would continue to increase with larger targets and more hosts. Simulator overhead depends heavily on application characteristics, and there is still more room for optimization.
32 Performance Scaling (1024 Cores). [Chart: normalized simulated MIPS for radix, lu_contiguous, and ocean_contiguous] Performance increased by 4x on average going from 1 to 8 host machines. Each 8-core host has 2 sockets (Intel 4-core Xeon CPU X5460).
33 Cycle-Level Simulation. To verify Graphite's accuracy, a cycle-level mode synchronizes at cycle boundaries (the default Graphite operating modes synchronize at instruction boundaries). It simulates all architectural models on a cycle-by-cycle basis, modeling exact contention and synchronization delays. Events are globally ordered by cycle time and processed in order of occurrence: they are stored in priority queues and dequeued in the order of their timestamps.
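The priority-queue event ordering described above is the standard discrete-event pattern; a minimal sketch using Python's heapq (the event names are invented examples):

```python
import heapq

# Events are (cycle, description) pairs; the heap keeps them ordered by
# cycle time regardless of the order in which they were scheduled.
events = []
heapq.heappush(events, (150, "cache miss completes"))
heapq.heappush(events, (100, "network flit arrives"))
heapq.heappush(events, (125, "DRAM access issues"))

order = []
while events:
    cycle, what = heapq.heappop(events)  # always dequeues the earliest event
    order.append((cycle, what))

print(order)
# [(100, 'network flit arrives'), (125, 'DRAM access issues'), (150, 'cache miss completes')]
```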
34 Cycle-Level Simulation: Graphite Validation. [Chart: deviation (%)] Graphite (LaxP2P) is only 6.4% off from cycle-level simulation on average.
35 Summary. Graphite accelerates multicore simulation using multi-machine parallel distribution: it enables simulation of 1000s of cores, is invisible to the application, runs off-the-shelf pthread apps, and provides simultaneous performance and energy estimation. Graphite delivers fast, scalable performance: as high as 81 MIPS simulator performance, and up to 34x speedup on 64 host cores (across 8 machines).
Graphite: A Distributed Parallel Simulator for Multicores, by Jason E. Miller, Harshad Kasture, George Kurian, Charles Gruenwald III, Nathan Beckmann, Christopher Celio, Jonathan Eastep, and Anant Agarwal, Massachusetts Institute of Technology.
More informationCS427 Multicore Architecture and Parallel Computing
CS427 Multicore Architecture and Parallel Computing Lecture 6 GPU Architecture Li Jiang 2014/10/9 1 GPU Scaling A quiet revolution and potential build-up Calculation: 936 GFLOPS vs. 102 GFLOPS Memory Bandwidth:
More informationMIMD Overview. Intel Paragon XP/S Overview. XP/S Usage. XP/S Nodes and Interconnection. ! Distributed-memory MIMD multicomputer
MIMD Overview Intel Paragon XP/S Overview! MIMDs in the 1980s and 1990s! Distributed-memory multicomputers! Intel Paragon XP/S! Thinking Machines CM-5! IBM SP2! Distributed-memory multicomputers with hardware
More informationPractical Near-Data Processing for In-Memory Analytics Frameworks
Practical Near-Data Processing for In-Memory Analytics Frameworks Mingyu Gao, Grant Ayers, Christos Kozyrakis Stanford University http://mast.stanford.edu PACT Oct 19, 2015 Motivating Trends End of Dennard
More informationComputer and Information Sciences College / Computer Science Department CS 207 D. Computer Architecture. Lecture 9: Multiprocessors
Computer and Information Sciences College / Computer Science Department CS 207 D Computer Architecture Lecture 9: Multiprocessors Challenges of Parallel Processing First challenge is % of program inherently
More informationParallelism Marco Serafini
Parallelism Marco Serafini COMPSCI 590S Lecture 3 Announcements Reviews First paper posted on website Review due by this Wednesday 11 PM (hard deadline) Data Science Career Mixer (save the date!) November
More informationCloud Computing CS
Cloud Computing CS 15-319 Programming Models- Part I Lecture 4, Jan 25, 2012 Majd F. Sakr and Mohammad Hammoud Today Last 3 sessions Administrivia and Introduction to Cloud Computing Introduction to Cloud
More informationNikos Anastopoulos, Konstantinos Nikas, Georgios Goumas and Nectarios Koziris
Early Experiences on Accelerating Dijkstra s Algorithm Using Transactional Memory Nikos Anastopoulos, Konstantinos Nikas, Georgios Goumas and Nectarios Koziris Computing Systems Laboratory School of Electrical
More informationArchitectural Support for Operating Systems
Architectural Support for Operating Systems Today Computer system overview Next time OS components & structure Computer architecture and OS OS is intimately tied to the hardware it runs on The OS design
More informationParalleX. A Cure for Scaling Impaired Parallel Applications. Hartmut Kaiser
ParalleX A Cure for Scaling Impaired Parallel Applications Hartmut Kaiser (hkaiser@cct.lsu.edu) 2 Tianhe-1A 2.566 Petaflops Rmax Heterogeneous Architecture: 14,336 Intel Xeon CPUs 7,168 Nvidia Tesla M2050
More informationIntroduction to CELL B.E. and GPU Programming. Agenda
Introduction to CELL B.E. and GPU Programming Department of Electrical & Computer Engineering Rutgers University Agenda Background CELL B.E. Architecture Overview CELL B.E. Programming Environment GPU
More informationAgenda Process Concept Process Scheduling Operations on Processes Interprocess Communication 3.2
Lecture 3: Processes Agenda Process Concept Process Scheduling Operations on Processes Interprocess Communication 3.2 Process in General 3.3 Process Concept Process is an active program in execution; process
More informationIncorporating DMA into QoS Policies for Maximum Performance in Shared Memory Systems. Scott Marshall and Stephen Twigg
Incorporating DMA into QoS Policies for Maximum Performance in Shared Memory Systems Scott Marshall and Stephen Twigg 2 Problems with Shared Memory I/O Fairness Memory bandwidth worthless without memory
More informationScheduling FFT Computation on SMP and Multicore Systems Ayaz Ali, Lennart Johnsson & Jaspal Subhlok
Scheduling FFT Computation on SMP and Multicore Systems Ayaz Ali, Lennart Johnsson & Jaspal Subhlok Texas Learning and Computation Center Department of Computer Science University of Houston Outline Motivation
More informationBuffered Co-scheduling: A New Methodology for Multitasking Parallel Jobs on Distributed Systems
National Alamos Los Laboratory Buffered Co-scheduling: A New Methodology for Multitasking Parallel Jobs on Distributed Systems Fabrizio Petrini and Wu-chun Feng {fabrizio,feng}@lanl.gov Los Alamos National
More informationOS Design Approaches. Roadmap. OS Design Approaches. Tevfik Koşar. Operating System Design and Implementation
CSE 421/521 - Operating Systems Fall 2012 Lecture - II OS Structures Roadmap OS Design and Implementation Different Design Approaches Major OS Components!! Memory management! CPU Scheduling! I/O Management
More informationHigh Performance Java Remote Method Invocation for Parallel Computing on Clusters
High Performance Java Remote Method Invocation for Parallel Computing on Clusters Guillermo L. Taboada*, Carlos Teijeiro, Juan Touriño taboada@udc.es UNIVERSIDADE DA CORUÑA SPAIN IEEE Symposium on Computers
More informationCSEP 524: Parallel Computa3on (week 6) Brad Chamberlain Tuesdays 6:30 9:20 MGH 231
CSEP 524: Parallel Computa3on (week 6) Brad Chamberlain Tuesdays 6:30 9:20 MGH 231 Adding OpenMP to Our Categoriza3on (part 1) degree of voodoo level of abstracdon C+Pthreads Chapel OpenMP less voodoo
More informationLecture 11: SMT and Caching Basics. Today: SMT, cache access basics (Sections 3.5, 5.1)
Lecture 11: SMT and Caching Basics Today: SMT, cache access basics (Sections 3.5, 5.1) 1 Thread-Level Parallelism Motivation: a single thread leaves a processor under-utilized for most of the time by doubling
More informationHPMMAP: Lightweight Memory Management for Commodity Operating Systems. University of Pittsburgh
HPMMAP: Lightweight Memory Management for Commodity Operating Systems Brian Kocoloski Jack Lange University of Pittsburgh Lightweight Experience in a Consolidated Environment HPC applications need lightweight
More informationDistributed Systems. Peer- to- Peer. Rik Sarkar. University of Edinburgh Fall 2014
Distributed Systems Peer- to- Peer Rik Sarkar University of Edinburgh Fall 2014 Peer to Peer The common percepdon A system for distribudng (sharing?) files Using the computers of common users (instead
More informationECE 550D Fundamentals of Computer Systems and Engineering. Fall 2017
ECE 550D Fundamentals of Computer Systems and Engineering Fall 2017 The Operating System (OS) Prof. John Board Duke University Slides are derived from work by Profs. Tyler Bletsch and Andrew Hilton (Duke)
More information