Thread Affinity Experiments


Marshall Mills
5 years ago


Power implications on Exynos

Introduction

The LPGPU2 Profiling Tool and API provide support for CPU thread affinity locking and logging. Although this functionality is not available on every device, it is a very powerful addition where it is.

Giving the developer the ability to decide which threads run where is more important than ever. Scheduling systems such as the Linux Completely Fair Scheduler (CFS) were designed for symmetric multi-processing (SMP) systems and are no longer entirely fit for purpose in heterogeneous multi-processing (HMP) environments, which are, by definition, unfair.

The HMP approach has been adopted by all major System-on-Chip (SoC) vendors, whose systems generally comprise two clusters of CPU cores (depicted in Figure 1): a high-performance (and high-power) big cluster, and a lower-performance (but lower-power) little cluster. The cores within each cluster are identical.

Figure 1 Common octa-core arrangement comprising a big and a little CPU cluster of four identical cores each

This situation raises a number of pertinent questions about what is best to do:

- Is it better to let an app run unrestricted, giving the scheduler free rein?
- Will better performance be achieved if the app is constrained to run on the big cluster?
- Will worse performance be seen if the app is constrained to run on the little cluster?
- If a single-threaded app is constrained to run on one core, perhaps saving on migration, would this be better, worse or no different than confining it to the cluster of that core?
- Is there a general rule, or is best practice strongly dependent on the characteristics of a given app or device?
- And so on.

This series of experiments explores some of these questions using the LPGPU2 Profiling Tool on the LPGPU2 Hypercube test app and a Samsung Galaxy S7 G930F. The purpose of these experiments is to investigate just how varied behaviour and performance can be in different CPU affinity locking scenarios, in the hope of shedding some light on the questions posed.

Experimental Device

The device chosen for these experiments is a Samsung Galaxy S7 G930F. It is based on the Exynos 8890 processor, and its basic spec is shown in Table 1.

Device       SM-G930F
Resolution   1440 x 2560
RAM          4 GB
Android      6.0 (Marshmallow)
Chipset      Exynos 8890 Octa
GPU          Mali-T880
CPU count    8
CPUs         4 x 2.3 GHz Mongoose, 4 x 1.6 GHz Cortex-A53

Table 1 Experimental device, Samsung Galaxy S7 G930F basic spec


Experimental App

The Hypercube app was chosen because it is very lightweight, offering a high frame rate which may, ironically, cause more work to be done on the CPU. Figure 2 shows some typical frames from the Hypercube app.

Figure 2 Hypercube tumbling in 4D

Analysis

The Hypercube app was extended to make setting the CPU affinity mask as simple as changing the value of an enum. The app was updated to report the CPU affinity of the main thread exactly once per frame. It was also updated to report frames per second (FPS) to User Counter 0.

The updated Hypercube app was installed on the device, and in this initial configuration the thread affinity was unrestricted. This was the first time we had directly observed the built-in thread migration behaviour of a device in LPGPU2, although we have seen hints of it many times in the live CPU Load counter profiles of almost all previous experiments: one core drops from full load to near zero just as another core ramps up, while performance remains unaffected. This common pattern could be explained by thread migration. In these experiments we would expect to see the behaviour explicitly.

In previous experiments we have also noted that it can take some minutes for a device to settle down after collection has begun. This can be especially problematic when trying to diagnose the asymptotic power usage characteristics of a particular app / device pairing. Because of this, Timer Mode was used for collection. In this mode, the user still starts a collection explicitly, but the collection then runs for a pre-set period and terminates automatically at the end of that period. In the example shown in Figure 4, collection is set for five minutes, the period used for collecting in these experiments.
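The enum-driven affinity mask described in the Analysis could look something like the following. This is only a minimal sketch and not the LPGPU2 API: the `AffinityMode` enum and both helper functions are hypothetical names, the core numbering assumes the layout reported for this device (cores 0 to 3 big, cores 4 to 7 little), and the request is made through Python's `os.sched_setaffinity`, which is only available on Linux-like platforms.

```python
import os
from enum import Enum


class AffinityMode(Enum):
    """Hypothetical presets; core numbers assume cores 0-3 big, 4-7 little."""
    UNRESTRICTED = frozenset(range(8))
    BIG_CLUSTER = frozenset({0, 1, 2, 3})
    LITTLE_CLUSTER = frozenset({4, 5, 6, 7})
    STRADDLE = frozenset({2, 3, 4, 5})


def to_bitmask(cores):
    """Convert a set of core indices into a cpuset-style bitmask."""
    mask = 0
    for core in cores:
        mask |= 1 << core
    return mask


def apply_affinity(mode, pid=0):
    """Request the preset for `pid` (0 means the calling process).

    Returns the set of cores actually granted, or None when the
    platform does not expose sched_setaffinity.
    """
    if not hasattr(os, "sched_setaffinity"):
        return None
    available = os.sched_getaffinity(pid)
    wanted = set(mode.value) & available
    if not wanted:
        # None of the requested cores exist here; leave affinity unchanged.
        return available
    os.sched_setaffinity(pid, wanted)
    # Read the mask back rather than assuming the request was honoured.
    return os.sched_getaffinity(pid)
```

Switching experiments then amounts to changing one enum value, for example `apply_affinity(AffinityMode.BIG_CLUSTER)`. Reading the mask back after setting it reflects the fact that the mask is a request which the system may or may not honour.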

Figure 4 Collection Mode selection panel showing Timer mode selected for 5 minutes

To further mitigate the unpredictable transient effects observed across all counter profiles, each experiment was conducted four times. This helps expose how repeatable any particular result is.

The CPU affinity results of the single-threaded Hypercube app for an unrestricted CPU affinity run are shown in Figure 5. They come from four independent experiments. Battery power, the most pertinent measure in the present experiments, is shown alongside the CPU affinity. Although each run is different, a number of features are common to the four profiles a-d.

Figure 5 Four experiments (a-d) showing CPU affinity and power consumption when CPU affinity is not restricted

First, it is clear from the profiles in Figure 5 that battery power reduces over time. There is a wide variation in the initial and final values of power consumption, but in every run power reduces to a fraction of its initial value. This is not due to anything within LPGPU2; it is simply the underlying black-box system responding to the shock of a collection being started. It does this by migrating processes and threads, adjusting process priorities and, no doubt, using many other proprietary tricks in order to reduce power while maintaining performance.

Secondly, it is clear that the process threads migrate very often, and do so across all eight cores of the device. Upon closer inspection, however, it becomes clear that the app spends more time running on the lower-numbered cores (0, 1, 2) than on the higher-numbered cores (5, 6, 7). The LPGPU2 Profiling Tool displays the instantaneous values of the CPU frequencies, which immediately reveals that cores 0 to 3 represent the big cluster and cores 4 to 7 represent the little cluster. It is interesting to note that, by default, the system prefers to run the app on the big cluster, but also that it does not do so exclusively.

The next sequence of experiments investigates power consumption when the app is tied to one of the clusters, first to the big cluster (cores 0, 1, 2 and 3) and then to the little cluster (cores 4, 5, 6 and 7). Figure 6 shows the result of four identical experiments profiling the Hypercube app when tied to the big cluster for exactly 5 minutes.
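The mapping of core indices to clusters was read off the frequency display of the Profiling Tool. As an illustration only, a similar grouping can be sketched from the standard Linux cpufreq sysfs files, assuming they are readable on the device. The function names are hypothetical, and grouping by maximum frequency is an assumption that holds only when the clusters have different maximum frequencies, as they do in Table 1.

```python
from collections import defaultdict
from pathlib import Path


def read_max_freqs(cpu_root="/sys/devices/system/cpu"):
    """Return {core index: max frequency in kHz} for cores exposing cpufreq."""
    freqs = {}
    for path in Path(cpu_root).glob("cpu[0-9]*/cpufreq/cpuinfo_max_freq"):
        core = int(path.parent.parent.name[3:])  # "cpu5" -> 5
        try:
            freqs[core] = int(path.read_text().strip())
        except (OSError, ValueError):
            continue  # unreadable or malformed entry: skip this core
    return freqs


def group_by_max_freq(freqs):
    """Group cores sharing a max frequency; the fastest group comes first."""
    clusters = defaultdict(list)
    for core, khz in freqs.items():
        clusters[khz].append(core)
    return [sorted(clusters[khz]) for khz in sorted(clusters, reverse=True)]
```

With the nominal frequencies from Table 1 (2.3 GHz for four cores, 1.6 GHz for the other four), `group_by_max_freq` returns the big cluster first and the little cluster second.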

Figure 6 Four experiments (a-d) with CPU mask tied to the big cluster (cores 0, 1, 2 and 3), showing power consumption and affinity

Firstly, it is clear that the behaviour of the app when constrained to the big cluster is very similar in form to the unrestricted affinity tests, in that power is initially high and then reduces over the period of the experiment. However, it should be noted that the results presented in Figure 6d are very odd and do not fit the pattern. No explanation can be given for this except to say that, with a system as complex as a modern Android device, it is simply not possible to know everything that is running, or why certain processes are spawned or woken at any given time. Such odd results and artefacts appear in profiles from time to time regardless of device or app. The only common factor is the operating system.

Secondly, it is clear from all four profiles that the system has honoured the request to lock the CPU affinity to the cpuset prescribed by the LPGPU2 API call. This is noteworthy because the cpuset bitmask is interpreted as a request; the system is not obliged to honour it.

Thirdly, it is clear from all profiles that the thread is migrated very often. This shows in the CPU Affinity counter profile, which is an almost solid blue bar covering the values 0, 1, 2 and 3, the indices of the cores exclusively requested.

Finally, it is most interesting to note that (with the exception of the strange profile 6d) the asymptotic power consumption is approximately 50% of that when the app is allowed to run unrestricted. This is an exciting result. An enormous power reduction has been achieved with trivial modifications to the code, but the reason is not immediately obvious. If the big cluster is more expensive (in power) than the little cluster, why does limiting execution to the higher-power cluster result in a power reduction?

An analysis of the Exynos architecture reveals that the cores of each cluster share an L2 cache: 2 MB for the big cluster and 256 KB for the little cluster. It could be that allowing the system to migrate the app between clusters invalidates these caches, incurring a cost on other subsystems such as memory and buses. It is easy to imagine how constraining an app to run on one cluster could reduce this cost.

If this phenomenon really is responsible for the power savings observed, then constraining the app to run on the little cluster may result in similar power reductions, perhaps even greater ones. The next experiment was designed to explore this, and Figure 7 shows the results of constraining the app to run exclusively on the little cluster.
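The comparison of asymptotic power between constrained and unrestricted runs was made by reading the profiles. One way such a comparison might be quantified, sketched here under stated assumptions, is to average the tail of each exported power (or current) trace and take the ratio. The function names and the choice of the final 20% of samples as the "settled" region are assumptions for illustration, not part of the LPGPU2 tool.

```python
from statistics import mean


def steady_state_mean(samples, tail_fraction=0.2):
    """Mean of the last `tail_fraction` of a trace, as a settled-value estimate."""
    if not samples:
        raise ValueError("empty trace")
    if not 0 < tail_fraction <= 1:
        raise ValueError("tail_fraction must be in (0, 1]")
    tail_len = max(1, int(len(samples) * tail_fraction))
    return mean(samples[-tail_len:])


def relative_power(constrained_runs, unrestricted_runs, tail_fraction=0.2):
    """Ratio of settled power, each side averaged over its repeated runs."""
    constrained = mean(
        steady_state_mean(run, tail_fraction) for run in constrained_runs
    )
    unrestricted = mean(
        steady_state_mean(run, tail_fraction) for run in unrestricted_runs
    )
    return constrained / unrestricted
```

A ratio near 0.5 would correspond to the roughly 50% figure quoted above. Outlier runs such as profile 6d would need to be excluded or reported separately, since a plain mean over runs is sensitive to them.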

Figure 7 Four experiments (a-d) with CPU mask tied to the little cluster (cores 4, 5, 6 and 7), showing power consumption and affinity

As for previous experiments, four collections were run and a similar pattern emerges. Power is high initially and reduces over the duration of the experiment. Coincidentally, the fourth experiment in the series, shown in Figure 7d, is again unusual, though it still represents an overall reduction in power with time.

It is interesting to note that thread migrations are much sparser on the little cluster and, furthermore, that they appear to be biased towards cores 4 and 5, a feature not visible (at least by eye) in the profile for constraining to the big cluster (cores 0 to 3).

The overall power reduction is still considerable compared with running unrestricted, but it is not noticeably (if at all) greater than the power reduction seen when constraining the app to run on the big cluster. This is consistent with the hypothesis that cache invalidations are responsible for the increased power consumption, due to the extra work required in populating a new cache. If hardware counters reporting cache invalidation were made available to the LPGPU2 Profiling Tool, the hypothesis could be tested more rigorously.

Further experiments

There is no end to the number of experiments that can be devised in an attempt to tease out the nature of the black-box algorithms responsible for thread migration on a given device. However, with the encouraging discovery that constraining an app to run on a single cluster seems to yield enormous power savings, another experiment presents itself.

In the next experiment, the app is once again constrained to run on only four of the cores, but this time those cores straddle the clusters. If cache invalidation is indeed responsible, then power consumption in this regime should be similar to the unconstrained experiment, or at least worse than when constrained to either cluster exclusively.

Four collections from the same Hypercube app constrained to cores 2, 3, 4 and 5 were run. Cores 2 and 3 reside in the big cluster, and cores 4 and 5 are in the little cluster, so this run straddles the clusters. Figure 8 shows the results of the experiment, and a familiar pattern is seen: power starts high and reduces over time. However, the final power value in each of the experiments is significantly greater than in the experiments with affinity tied exclusively to either cluster; the lowest current in this series is greater than 150 mA, compared with less than 100 mA for the previous two cluster-constrained experiments. The results of the present experiment are comparable with the unconstrained case.

The accompanying CPU affinity profiles of Figure 8 confirm that the app is indeed constrained to cores 2, 3, 4 and 5 and that it is being migrated between the clusters.
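Since the app reports the core of its main thread once per frame, the amount of cross-cluster migration in a run could in principle be counted directly from that trace. The following is a rough sketch of such a count; the function name and the trace format (a plain list of core indices, one per frame) are assumptions, and the default big-cluster set follows the core numbering reported for this device.

```python
def count_migrations(core_trace, big_cores=frozenset({0, 1, 2, 3})):
    """Count core changes between consecutive samples.

    Returns (within_cluster, cross_cluster). Any core not listed in
    `big_cores` is treated as belonging to the little cluster.
    """
    within = across = 0
    for prev, cur in zip(core_trace, core_trace[1:]):
        if prev == cur:
            continue  # thread stayed on the same core
        if (prev in big_cores) == (cur in big_cores):
            within += 1
        else:
            across += 1
    return within, across
```

For a cluster-constrained run the second count should be zero, whereas the straddling and unrestricted runs would be expected to show a non-zero cross-cluster count. Because sampling is once per frame, migrations that occur and revert between samples are not seen.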

Figure 8 Four experiments (a-d) showing power consumption and CPU affinity when CPU affinity is constrained to four cores spanning the clusters (cores 2, 3, 4 and 5)

CPU Affinity Patterns

The experiments for the Exynos device show a consistent preference for scheduling threads to lower-numbered cores. In particular, cores 0, 1, 2 and 3 are the most preferred, cores 4 and 5 are the next most popular, and cores 6 and 7 are the least popular. This means the scheduler prefers the big cluster over the little cluster for running the LPGPU2 test apps, and that when the little cluster is chosen, cores 4 and 5 are preferred over cores 6 and 7.

Figure 9 shows a temporal zoom of some selected experiments to reveal the finer-scale detail of the scheduling behaviour. Figure 9a manifests a pulse, regularly scheduling the thread to the little cluster. Although the time axis is not shown in these examples, the pulse frequency is approximately 1 Hz. Figure 9b suggests that there is no scheduling preference for cores within the big cluster, as no clear pattern can be seen. Figure 9c shows a similar experiment but constrained to the little cluster (cores 4, 5, 6 and 7), and there is a clear preference for cores 4 and 5 over cores 6 and 7. Figure 9d shows a zoomed section of a straddling experiment. It is a microcosm of the unconstrained experiment in that it reveals favoured scheduling of the big cluster. Not only is more time spent in the big cluster, but scheduling on the big cluster also happens on a much smaller timescale: the time slicing of big-cluster processes appears to be much shorter than that of little-cluster processes.
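The core preferences described above were judged by eye from the affinity profiles. A simple way to put numbers on them, shown here only as a sketch, is to compute the fraction of samples spent on each core from the per-frame affinity trace. The function name and the trace format (a list of core indices, one per frame) are assumptions.

```python
from collections import Counter


def core_residency(core_trace, core_count=8):
    """Fraction of samples observed on each core, indexed by core number."""
    total = len(core_trace)
    if total == 0:
        return [0.0] * core_count
    counts = Counter(core_trace)
    return [counts.get(core, 0) / total for core in range(core_count)]
```

Summing entries 0 to 3 and entries 4 to 7 of the result gives the share of time on the big and little clusters respectively, and comparing entries 4 and 5 with entries 6 and 7 would expose the bias within the little cluster.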

Figure 9 Temporal zoom of Exynos core affinity profiles revealing different scheduling patterns for different affinity masks: (a) unrestricted core affinity; (b) core affinity restricted to the big cluster (cores 0, 1, 2 and 3); (c) core affinity restricted to the little cluster (cores 4, 5, 6 and 7); (d) core affinity restricted to 4 cores straddling the clusters

Conclusion

These experiments with an octa-core dual-cluster device show that the ability to specify which threads are permitted to migrate between which CPU cores can be very powerful indeed. Significant power savings, of the order of 50%, are available for little, even trivial, development overhead.

This exciting result was achieved by constraining an important task thread to run within a single cluster. The scheduler was free to migrate the thread between the cores of the cluster, but not to migrate the thread to the other cluster. The choice of cluster, big or little, to which the thread was constrained was much less important than preventing thread migration between the clusters.

Further work is required to ascertain the generality of these results. Will other (potentially very different) apps benefit from the same approach, and will different devices respond in a similarly positive way?
