Incorporating DMA into QoS Policies for Maximum Performance in Shared Memory Systems. Scott Marshall and Stephen Twigg


Problems with Shared Memory I/O Fairness
- Memory bandwidth worthless without memory access
- Network congestion relieves slowly
- Hard to define network guarantees
- Fairness vs. utilization tradeoff
- Fairness keeps nodes coarsely synchronized
- Limited buffer, transaction space risks priority inversion if attempt to slip in premature request

Our Goals
- Evaluate using full system models to avoid higher order priority inversion
  - Previous research message driven
  - Analogous to deadlock avoidance req.
- Incorporate DMA requests into system
  - DMA requests are high bandwidth
  - Analyze interaction with normal requests

Existing QoS Strategies
- Globally-Synchronized Frames
  - All packets assigned to incrementing frames based on QoS credit allowance
  - Early frame retirement allows BW recovery
  - Localized network congestion slows entire system
- Preemptive Virtual Clock
  - Increasing deadline counter based on packet issue
  - Completely localized BW recovery
  - Requires NACKing of packets during congestion

Evaluation of Existing Solutions
- Unclear how to handle response messages
  - Worse: response messages likely to be large
  - Who is penalized?
- Need localized recovery without NACKing
  - NACKs are wasteful
- Desire mechanism to reel in demanding nodes

Our Approach
- Age-based routing
  - Provides reasonable prioritization
  - Responses inherit request age
- Admission controlled network access
  - QoS gate between node->network tracks requests
  - Assess costs according to total network impact
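The age-based idea can be sketched in a few lines. This is a minimal illustration, not the authors' simulator code: the `Packet` fields, `make_response`, and `arbitrate` are hypothetical names, and "age" is assumed to be the cycle at which the original request entered the network.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Packet:
    src: int            # issuing node id (hypothetical field)
    dst: int            # destination node id (hypothetical field)
    birth_cycle: int    # cycle the original request entered the network
    is_response: bool = False


def make_response(request: Packet) -> Packet:
    """Build a response that inherits the request's age (same birth_cycle)."""
    return Packet(src=request.dst, dst=request.src,
                  birth_cycle=request.birth_cycle, is_response=True)


def arbitrate(candidates: List[Packet]) -> Optional[Packet]:
    """Pick the oldest packet (smallest birth_cycle); ties go to the first listed."""
    if not candidates:
        return None
    return min(candidates, key=lambda p: p.birth_cycle)
```

Because the response keeps the request's `birth_cycle`, a long-waiting request does not lose its priority on the return trip, which is the property the slide calls "responses inherit request age".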

Challenges
- What is Network Utilization/Congestion?
  - Buffer space vs. available BW
  - Appropriate reduction methods
  - Local versus global congestion
- DMA controller
  - Does it require a QoS gate?
  - Assessing costs to original requestor

Memory Simulator
- Realistic cache coherence protocol
  - Mem. controller has directory, transaction buffers
  - MSI cache states, mem. transaction buffers
  - DMA controller does uncached reads/writes
- Non-deadlocking wormhole routing
  - Mesh network with input buffers
  - Bidirectional links with split BW
  - 4 virtual networks (VN): DMAReq, MemReq, Invl., Resp.
- Randomized memory, DMA reads/writes
  - Multiple concurrent requests provide desired saturation
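The slides state that routing on the mesh is non-deadlocking but do not name the algorithm. One well-known deadlock-free choice for meshes is dimension-order (XY) routing; the sketch below assumes it purely for illustration, and `xy_route` is a hypothetical helper rather than part of the authors' simulator.

```python
from typing import List, Tuple

Coord = Tuple[int, int]  # (x, y) position of a router in the mesh


def xy_route(src: Coord, dst: Coord) -> List[Coord]:
    """Return the routers visited from src to dst, moving fully in X, then in Y."""
    x, y = src
    path = [src]
    step = 1 if dst[0] > x else -1
    while x != dst[0]:          # resolve the X dimension first
        x += step
        path.append((x, y))
    step = 1 if dst[1] > y else -1
    while y != dst[1]:          # then resolve the Y dimension
        y += step
        path.append((x, y))
    return path
```

Since no packet ever turns from the Y dimension back into X, the channel dependency graph has no cycles, which is why this scheme avoids routing deadlock within a single virtual network.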

QoS Gates
- How do they work?
  - QoS gate permits/denies requests from endpoints into the network using credits
  - Each cycle, credits are replenished according to the policy
- Desired traits: QoS credit replenishment policy
  - Incorporate notion of local and network-wide congestion → blend two metrics
  - Fairness rather than max utilization → non-linear relationship
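A credit-based gate of this kind might look roughly like the following. This is a sketch under stated assumptions: the class name `QoSGate`, the credit cap, the per-request `cost`, and `example_policy` are all hypothetical, and the example policy is only a stand-in non-linear blend of two congestion metrics normalized to [0, 1], not the authors' replenishment formula.

```python
from typing import Callable


class QoSGate:
    """Credit-based admission gate between one endpoint and the network."""

    def __init__(self, policy: Callable[[float, float], float],
                 max_credits: float = 16.0) -> None:
        self.policy = policy            # returns credits to add each cycle
        self.max_credits = max_credits  # hypothetical cap on banked credits
        self.credits = 0.0

    def tick(self, local_congestion: float, net_congestion: float) -> None:
        """Replenish credits once per cycle according to the policy."""
        gained = max(0.0, self.policy(local_congestion, net_congestion))
        self.credits = min(self.max_credits, self.credits + gained)

    def try_admit(self, cost: float) -> bool:
        """Admit a request if enough credits remain; otherwise deny it."""
        if cost <= self.credits:
            self.credits -= cost
            return True
        return False


def example_policy(local: float, net: float) -> float:
    """Hypothetical non-linear blend: fewer credits as either metric rises."""
    return 1.0 - min(1.0, (local ** 2 + net ** 2) ** 0.5)
```

Passing the policy in as a callable keeps the gate mechanism separate from the replenishment rule, so different blends of local and network-wide congestion can be compared without touching the admission logic.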

QoS Credit Policy At-a-Glance

$$\text{credits} = \text{local}\,\left(\text{local}^2 + \text{net}^2\right)$$

Measurement Strategy
- Two simulator platforms
  - gem5: open-source full-system simulator with complex Ruby memory subsystem
  - Our memory simulator: fast; only for synthetic workloads
- Two benchmark classes
  - Realistic applications
    - hdparm: cached and non-cached disk reads (DMA)
    - PARSEC: multithreaded application benchmark
  - Synthetic memory & DMA access

Measurement Configuration
- gem5
  - 8 CPU x86_64 running Linux
  - row mesh network interconnect, packet-based routing
  - Record disk throughput & application runtime
- Our Simulator
  - 8x8 mesh, 1 DMA controller
  - Packets are 2-10 flits, bisection BW: 80 flits/cycle
  - Record statistics (completed requests, latency, network utilization)
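As a rough sanity check on the 8x8 configuration, the stated bisection bandwidth can be broken down per link. The breakdown below rests on assumptions the slides do not confirm: that the 80 flits/cycle figure counts both directions across the cut, that the cut is made between the two middle columns, and that each bidirectional link splits its bandwidth evenly. The helper name is hypothetical.

```python
def mesh_bisection_links(side: int) -> int:
    """Bidirectional links severed when a side x side mesh is cut down the middle."""
    # A cut between the two middle columns severs exactly one link per row.
    return side


BISECTION_BW = 80                          # flits/cycle, as stated on the slide
links = mesh_bisection_links(8)            # 8x8 mesh
per_link = BISECTION_BW / links            # flits/cycle per bidirectional link
per_direction = per_link / 2               # assuming an even split per direction

print(links, per_link, per_direction)
```

Under these assumptions each link direction would carry 5 flits/cycle, so a maximum-size 10-flit packet would occupy a bisection channel for two cycles.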

Preliminary Results: How Many Requests Completed in 10 K Cycles?

Workload   Requests   Baseline   Age-Based   Age-Based + QoS
No DMA     total      -          -           -
           normal     -          -           -
           DMA        0          0           0
DMA Only   total      739        759         743
           normal     0          0           0
           DMA        739        759         743
Mixed      total      -          -           -
           normal     -          -           -
           DMA        445        524         659

(- : value not recoverable from the slide)
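The DMA completion counts reported for the Mixed workload (445, 524, and 659) can be turned into relative gains over the baseline. The snippet is just arithmetic on those three reported numbers; the helper name `percent_gain` is an illustrative choice.

```python
def percent_gain(baseline: int, value: int) -> float:
    """Relative improvement of value over baseline, in percent."""
    return 100.0 * (value - baseline) / baseline


# DMA requests completed in 10 K cycles, Mixed workload (from the results slide)
mixed_dma = {"Baseline": 445, "Age-Based": 524, "Age-Based + QoS": 659}

gains = {name: round(percent_gain(mixed_dma["Baseline"], n), 1)
         for name, n in mixed_dma.items()}
print(gains)  # {'Baseline': 0.0, 'Age-Based': 17.8, 'Age-Based + QoS': 48.1}
```

In other words, age-based routing alone completes about 18% more DMA requests than the baseline under mixed traffic, and adding the QoS gate raises that to about 48%.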

Preliminary Results: Distribution of Requests in the Network?
[Figure panels: Baseline; Age-Based + QoS]

Unexpected Problems
- gem5 Simulator
  - Instability: arbitrary deadlock crashes
  - Slow performance: Linux boot takes several hours (~24 hrs with our changes)
- Consequence: became the project's bottleneck
- Benchmark results will be in final paper

Conclusions
- The combination of age-based routing and QoS effectively distributes fairness throughout the mesh
- The complexity of gem5 is both its strength and its weakness

Lessons Learned
- Limit the scope of responsibilities for a full-system simulator

Future Work
- Complete gem5 benchmarks
- Incorporate message distance into QoS cost
- Knowledge of processor groups in QoS policy

THANK YOU
QUESTIONS?
