LCA14-104: GTS - A solution to support ARM's big.LITTLE technology. Mon-3-Mar, 11:15am, Mathieu Poirier

1 LCA14-104: GTS - A solution to support ARM's big.LITTLE technology. Mon-3-Mar, 11:15am, Mathieu Poirier

2 Today's Presentation: Things to know about Global Task Scheduling (GTS). MP patchset description and how the solution works. Configuration parameters at various levels. Continuous integration at Linaro.

3 Other Presentations on GTS: This presentation is the lighter version of the two presentations Linaro has on GTS. The other runs for about 75 minutes and goes much deeper into the solution. If you are interested in the in-depth version please contact Joe Bates:

4 What is the MP Patchset? A set of patches enacting Global Task Scheduling (GTS), developed by ARM Ltd. GTS modifies the Linux scheduler in order to place tasks on the best possible CPU. Advantages: takes full advantage of the asynchronous nature of the b.L architecture (maximum performance, minimum power consumption); better scores for thread-intensive benchmarks; increased responsiveness by spinning off new tasks on big CPUs; decreased power consumption, specifically with small-task packing.

5 Where to get it In a tarball from the release page: Always look for the latest vexpress-lsk release on release.linaro.org - ex. for January: February should look like: In the Linaro Stable Kernel:

6 Where to get it (continued) In the ARM big.LITTLE MP tree: ** Linaro doesn't rebase the MP patchset on kernels other than the Linaro Stable Kernel.

7 MP Patchset Description General Overview: The Linux kernel builds a hierarchy of scheduling domains at boot time. The order is (Linux convention): Sibling (for Hyperthreading), MC (multi-core), CPU (between clusters), NUMA. To understand how the kernel does this, enable CONFIG_SCHED_DEBUG and set sched_debug=1 on the kernel command line. In a pure SMP context, load balancing is done by spreading tasks evenly among all processors: maximisation of CPU resources; run-to-completion model.

8 Domain Load Balancing - no GTS [Diagram labels: CFS (CPU level), CFS (MC level), CPU0 to CPU4; platform: Vexpress (A7x3 + A15x2).]

9 How MP Works Classic load balancing between CPU domains (i.e. big and LITTLE) is disabled. A derivative of Paul Turner's load_avg_contrib metric is used to decide if a task should be moved to another HMP domain. Paul's work: Migration of tasks among the CPU domains is done by comparing their loads with migration thresholds. By default, all new user tasks are placed on the big cluster.

10 Domain Load Balancing - with GTS [Diagram labels: GTS, CFS (CPU level), CFS (MC level), CPU0 to CPU4; platform: Vexpress (A7x3 + A15x2).]

11 Load Average Contribution and Decay Plotting of the runnable_avg_sum metric introduced by Paul Turner

12 Per Entity Load Tracking Paul Turner introduced the load average contribution metric in his work on per-entity load tracking: load_avg_contrib = task->weight * runnable_average, where runnable_average = runnable_avg_sum / runnable_avg_period. Both runnable_avg_sum and runnable_avg_period are geometric series. load_avg_contrib is good for scheduling decisions but bad for task migration, i.e. weight scaling doesn't reflect the true time spent by a task in the runnable state.
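The relationship between these quantities can be sketched in Python. This is a floating-point illustration, not the kernel code: it assumes 1024 us accounting periods and a decay factor y with y^32 = 0.5, as in Paul Turner's per-entity load tracking work. The helper names `track` and `load_avg_contrib` (as a function) are hypothetical.

```python
# Illustrative sketch of per-entity load tracking (not the kernel implementation).
PERIOD_US = 1024         # assumed length of one accounting period, in microseconds
Y = 0.5 ** (1 / 32)      # assumed decay factor: a contribution halves every 32 periods


def track(history):
    """Accumulate the two geometric series.

    history: iterable of booleans, oldest period first;
             True means the task was runnable for that whole period.
    Returns (runnable_avg_sum, runnable_avg_period).
    """
    runnable_avg_sum = 0.0
    runnable_avg_period = 0.0
    for runnable in history:
        # Older contributions decay by Y each period, then the new period is added.
        runnable_avg_sum = runnable_avg_sum * Y + (PERIOD_US if runnable else 0)
        runnable_avg_period = runnable_avg_period * Y + PERIOD_US
    return runnable_avg_sum, runnable_avg_period


def load_avg_contrib(weight, history):
    """load_avg_contrib = weight * runnable_avg_sum / runnable_avg_period."""
    total, period = track(history)
    runnable_average = total / period if period else 0.0
    return weight * runnable_average
```

A task that is always runnable converges on its full weight, while one that stops running sees its runnable_avg_sum halve every 32 periods under these assumptions.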

13 Load Average Ratio The MP patchset introduces the load average ratio: load_avg_ratio = NICE_0_LOAD * runnable_average. The load average ratio allows tasks to be compared without their weight factor, giving the same perspective for all of them. At migration time the load average ratio is compared against two thresholds: hmp_up_threshold and hmp_down_threshold.
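A small Python sketch shows why dropping the weight matters. It assumes NICE_0_LOAD is 1024; the two task weights are illustrative values only, and the function names are hypothetical.

```python
NICE_0_LOAD = 1024  # assumed value of the nice-0 load constant


def load_avg_contrib(weight, runnable_average):
    # Weighted metric: depends on the task's priority-derived weight.
    return weight * runnable_average


def load_avg_ratio(runnable_average):
    # Unweighted metric: depends only on how long the task was runnable.
    return NICE_0_LOAD * runnable_average


# Two tasks that are both runnable half of the time, with different
# (illustrative) weights.
runnable_average = 0.5
contrib_default = load_avg_contrib(1024, runnable_average)
contrib_heavy = load_avg_contrib(3121, runnable_average)
ratio = load_avg_ratio(runnable_average)
```

The two contributions differ by the weight factor, whereas both tasks share the same ratio, which is what makes the ratio comparable against a single pair of thresholds.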

14 Up and Down Migration Thresholds * Source: ARM Ltd. A task's load is compared to the up and down migration thresholds during the MP domain balancing process.

15 What We've Learned So Far The Linux scheduler will separate CPUs into domains. Tasks are spread out among the domains as equally as possible. For GTS, load balancing at the CPU domain level is disabled. GTS will move tasks between CPU domains using a derivative of the load average contribution and a couple of thresholds. But when is GTS moving tasks between the CPU domains?

16 Task Migration Points There are 4 task migration points: when tasks are created (fork migration); at wakeup time (wakeup migration); with every scheduler tick (forced migration); when a CPU is about to become idle (idle pull).

17 Fork Migration When tasks are created (fork migration): Done by setting the task's load statistics to their maximum value. Tasks are placed on big CPUs unless they are kernel threads or forked from init (i.e. Android services). Android apps are forked from Zygote, hence go on big CPUs. Tasks are eventually migrated down if they aren't heavy enough.

18 Wakeup Migration At wakeup time (wakeup migration): When a task is to be placed on a CPU, the scheduler will normally prefer the previous CPU the task ran on, or one in the same package. For GTS, the decision is based on the load a task had before it was suspended: if load(task) > hmp_up_threshold, select the more powerful HMP domain; if load(task) < hmp_down_threshold, select the less powerful HMP domain. What happened in the past is likely to happen again.
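The wakeup decision can be sketched as a small Python function. This is an illustration under stated assumptions, not the patchset's code: the function name, the string labels for the domains and the threshold values in the example are hypothetical.

```python
def select_wakeup_domain(task_load, current_domain,
                         hmp_up_threshold, hmp_down_threshold):
    """Pick the HMP domain for a waking task from its pre-suspend load."""
    if current_domain == "LITTLE" and task_load > hmp_up_threshold:
        return "big"        # heavy task: move up to the more powerful domain
    if current_domain == "big" and task_load < hmp_down_threshold:
        return "LITTLE"     # light task: move down to the less powerful domain
    return current_domain   # between the thresholds: stay put


# Hypothetical threshold values, for illustration only.
UP, DOWN = 700, 256
heavy_choice = select_wakeup_domain(900, "LITTLE", UP, DOWN)
light_choice = select_wakeup_domain(100, "big", UP, DOWN)
middle_choice = select_wakeup_domain(500, "big", UP, DOWN)
```

Keeping a gap between the two thresholds gives hysteresis: a task whose load sits between them stays in whichever domain it last ran in.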

19 Forced Migration With every scheduler tick (forced migration): Every CPU in the system has a scheduler tick. With each tick (minimum interval of 1 jiffy) a CPU's runqueue is rebalanced if a rebalance event is due. Each time the load balancer runs, the MP code will inspect the runqueues of all CPUs in the system: if (LITTLE CPU), can a task be moved to the big cluster? if ((big CPU) && (CPU overloaded)), offload the lightest task. When offloading, always select an idle CPU to ensure CPU availability for the task. This allows tasks to be migrated as quickly as possible, since domains can otherwise stay balanced for a long time.
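One pass of this inspection might look like the following Python sketch. Several details are assumptions rather than facts from the slides: "overloaded" is modelled as a task count above a hypothetical limit, an up-migrating task goes to the least loaded big CPU, offloaded tasks go to an idle LITTLE CPU, and the function only plans moves from a snapshot instead of applying them.

```python
from dataclasses import dataclass, field


@dataclass
class Cpu:
    name: str
    domain: str                                 # "big" or "LITTLE"
    tasks: dict = field(default_factory=dict)   # task name -> load ratio


def plan_forced_migrations(cpus, up_threshold, max_tasks_per_big_cpu):
    """Return a list of (task, source CPU, destination CPU) moves."""
    bigs = [c for c in cpus if c.domain == "big"]
    littles = [c for c in cpus if c.domain == "LITTLE"]
    moves = []
    for cpu in cpus:
        if not cpu.tasks:
            continue
        if cpu.domain == "LITTLE":
            # Can the heaviest task be moved to the big cluster?
            task, load = max(cpu.tasks.items(), key=lambda kv: kv[1])
            if load > up_threshold and bigs:
                target = min(bigs, key=lambda c: sum(c.tasks.values()))
                moves.append((task, cpu.name, target.name))
        elif len(cpu.tasks) > max_tasks_per_big_cpu:
            # Overloaded big CPU: offload the lightest task to an idle CPU.
            task, _ = min(cpu.tasks.items(), key=lambda kv: kv[1])
            idle = [c for c in littles if not c.tasks]
            if idle:
                moves.append((task, cpu.name, idle[0].name))
    return moves


cpus = [
    Cpu("cpu0", "big", {"a": 900, "b": 100}),
    Cpu("cpu1", "big"),
    Cpu("cpu2", "LITTLE", {"c": 800}),
    Cpu("cpu3", "LITTLE"),
]
moves = plan_forced_migrations(cpus, up_threshold=700, max_tasks_per_big_cpu=1)
```

In this example the light task "b" is offloaded from the crowded big CPU to an idle LITTLE CPU, and the heavy task "c" is moved up to the idle big CPU.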

20 Idle Pull When a CPU is about to become idle (idle pull): When a CPU is about to go idle, the scheduler will attempt to pull tasks away from other CPUs in the same domain. This happens only if the CPU's average idle time is more than the estimated migration cost. Balancing within a domain is left to normal scheduler operation. If the scheduler didn't find any task to pull and the CPU is in the big cluster: go through the runqueues of all online CPUs in the LITTLE cluster; if a task's load is above the threshold, move it to a CPU in the big cluster. When moving a task, always look for the least loaded CPU.
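The big-cluster part of idle pull can be sketched in Python as follows. The function names and the data layout are hypothetical, and picking the heaviest qualifying task is an assumption; the slide only says the task's load must be above the threshold.

```python
def should_attempt_pull(avg_idle_ns, migration_cost_ns):
    """Only pull when the CPU expects to idle longer than a migration costs."""
    return avg_idle_ns > migration_cost_ns


def find_task_to_pull(little_runqueues, up_threshold):
    """Scan the LITTLE runqueues for a task worth moving to a big CPU.

    little_runqueues: dict of CPU name -> dict of task name -> load ratio.
    Returns (cpu, task) for the heaviest task above the threshold, or None.
    """
    best = None
    for cpu, tasks in little_runqueues.items():
        for task, load in tasks.items():
            if load > up_threshold and (best is None or load > best[2]):
                best = (cpu, task, load)
    return (best[0], best[1]) if best else None


runqueues = {
    "cpu2": {"ui": 300, "decoder": 820},
    "cpu3": {"logger": 50},
    "cpu4": {"game": 950},
}
choice = find_task_to_pull(runqueues, up_threshold=700)
```

Here the idle big CPU would pull "game" from cpu4, since it has the highest load among the tasks above the threshold.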

21 MP Migration Types * Source: ARM Ltd.

22 Small Task Packing The scheduler will try to fit as many small tasks on a single CPU as possible. A small task is <= 90% of NICE_0_LOAD, i.e. 921. Done on the LITTLE cluster only, to make sure tasks on the big cluster have all the CPU time they need. Takes place when a task is waking up, using the tracked load of CPU runqueues and tasks. A saturation threshold makes sure tasks offloaded from the big domain can keep being serviced. Effects of enabling small task packing: the CPU operating point may increase (the CPUfreq governor will kick in); the wakeup latency of tasks may increase (more tasks to run).
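A Python sketch of the packing decision follows. It assumes NICE_0_LOAD is 1024, which gives the 921 limit quoted on the slide. The saturation value in the example, the function name and the "busiest CPU that still has room" rule are illustrative assumptions, not the patchset's exact policy.

```python
NICE_0_LOAD = 1024
SMALL_TASK_LIMIT = NICE_0_LOAD * 90 // 100   # 921, as on the slide


def pick_packing_cpu(task_load, little_cpu_loads, saturation):
    """Choose a LITTLE CPU on which to pack a waking small task.

    little_cpu_loads: dict of CPU name -> tracked runqueue load.
    Returns a CPU name, or None if the task is not small or nothing has room.
    """
    if task_load > SMALL_TASK_LIMIT:
        return None   # not a small task: leave it to normal placement
    candidates = [
        (load, cpu)
        for cpu, load in little_cpu_loads.items()
        if load + task_load <= saturation   # keep the CPU below saturation
    ]
    if not candidates:
        return None
    # Prefer the busiest CPU that still has room, so other CPUs can stay idle.
    return max(candidates)[1]


loads = {"cpu2": 300, "cpu3": 0, "cpu4": 950}
packed_on = pick_packing_cpu(100, loads, saturation=1000)
```

With these numbers the task is packed onto cpu2: cpu4 would exceed the saturation limit, and cpu3 is left idle.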

23 Key Things to Remember Load balancing at the CPU domain level is disabled to favour the GTS scheme. GTS works by comparing a task's runnable load ratio against the migration thresholds and migrating the task to a different HMP domain if need be. There are 4 migration points: at creation time; at wakeup time; at every rebalance; when a CPU is about to become idle. Small task packing applies when CPU gating is possible.

24 One Last Remark GTS doesn't hotplug CPUs and is not concerned at all with hotplugging. When hotplugging, it takes too long to bring a CPU in and out of service: all smpboot threads need to be stopped; stop_machine threads suspend interrupts on all online CPUs; IRQs on the swapped CPU are diverted to another CPU; all processes in the swapped CPU's runqueue are migrated; the CPU is taken out of coherency. More CPUs means longer hotplug time per CPU. It is very expensive to make a CPU coherent with the domain hierarchy again. The system needs intelligence to determine when CPUs will be swapped in and out.

25 GTS Tuning The GTS solution itself has a number of parameters that can be tuned. Examples from /sys/kernel/hmp: up_threshold and down_threshold for task migration limits; load_avg_period_ms and frequency_invariant_load_scale. Examples from the code: runqueue saturation when doing small task packing; number of tasks on a runqueue to search when force migrating between domains.
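The sysfs entries named above can be inspected with a few lines of Python. This is a minimal sketch: the helper name is hypothetical, the file names are the ones listed on the slide, and it assumes each entry is a small text file under /sys/kernel/hmp on a kernel carrying the MP patchset. Entries that are missing or unreadable are reported as None.

```python
from pathlib import Path

# Tunables named on the slide; other entries may exist on a given kernel.
HMP_TUNABLES = (
    "up_threshold",
    "down_threshold",
    "load_avg_period_ms",
    "frequency_invariant_load_scale",
)


def read_hmp_tunables(base="/sys/kernel/hmp"):
    """Return a dict of tunable name -> raw text value (None if unavailable)."""
    values = {}
    for name in HMP_TUNABLES:
        try:
            values[name] = (Path(base) / name).read_text().strip()
        except OSError:
            # Entry missing (kernel without the patchset) or not readable.
            values[name] = None
    return values
```

Calling `read_hmp_tunables()` on a machine without the MP patchset simply returns None for every entry, which makes the helper safe to run anywhere.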

26 CPUFreq Governor Tuning Linaro and ARM have been using the interactive governor in their testing of the solution. Any governor can be used: the b.L CPUfreq driver makes the architecture seamless to the governor. Examples of interactive governor tuneables: hispeed_freq and go_hispeed_load; target_loads; timer_rate and min_sample_time; above_hispeed_delay. Governors will have tuneable parameters. Regardless of the governor used, there are parameters to adjust in order to yield the right behaviour. Default values are usually not what you want.

27 MP Testing at Linaro As Linaro assimilates MP patches in the LSK, continuous integration testing is done daily to catch possible regressions. We run bbench with an audio track in the background: a good average test case that exercises both big and LITTLE clusters. All automated in our LAVA environment, with results verified each day. Full WA regression tests with each monthly release. TC2 is the only b.L platform being tested at Linaro - we'd welcome other platforms.

28 Questions and Acknowledgements Special thanks to: Chris Redpath (ARM), Robin Randhawa (ARM), Vincent Guittot (Linaro).

29 More about Linaro Connect: More about Linaro: More about Linaro engineering: Linaro members:
