Message ID | 20231010083143.19593-7-mgorman@techsingularity.net (mailing list archive)
---|---
State | New |
Series | sched/numa: Complete scanning of partial and inactive VMAs
* Mel Gorman <mgorman@techsingularity.net> wrote:

> On a 2-socket Cascade Lake test machine, the time to complete the
> workload is as follows;
>
>                                        6.6.0-rc2             6.6.0-rc2
>                              sched-numabtrace-v1   sched-numabselective-v1
> Min      elsp-NUMA01_THREADLOCAL   174.22 (  0.00%)     117.64 ( 32.48%)
> Amean    elsp-NUMA01_THREADLOCAL   175.68 (  0.00%)     123.34 * 29.79%*
> Stddev   elsp-NUMA01_THREADLOCAL     1.20 (  0.00%)       4.06 (-238.20%)
> CoeffVar elsp-NUMA01_THREADLOCAL     0.68 (  0.00%)       3.29 (-381.70%)
> Max      elsp-NUMA01_THREADLOCAL   177.18 (  0.00%)     128.03 ( 27.74%)
>
> The time to complete the workload is reduced by almost 30%.
>
>                             6.6.0-rc2             6.6.0-rc2
>                   sched-numabtrace-v1   sched-numabselective-v1
> Duration User        91201.80              63506.64
> Duration System       2015.53               1819.78
> Duration Elapsed      1234.77                868.37
>
> In this specific case, system CPU time was not increased but it's not
> universally true.
>
> From vmstat, the NUMA scanning and fault activity is as follows;
>
>                                          6.6.0-rc2             6.6.0-rc2
>                                sched-numabtrace-v1   sched-numabselective-v1
> Ops NUMA base-page range updates      64272.00           26374386.00
> Ops NUMA PTE updates                  36624.00              55538.00
> Ops NUMA PMD updates                     54.00              51404.00
> Ops NUMA hint faults                  15504.00              75786.00
> Ops NUMA hint local faults %          14860.00              56763.00
> Ops NUMA hint local percent              95.85                 74.90
> Ops NUMA pages migrated                1629.00            6469222.00
>
> Both the number of PTE updates and hint faults is dramatically
> increased. While this is superficially unfortunate, it represents
> ranges that were simply skipped without the patch. As a result
> of the scanning and hinting faults, many more pages were also
> migrated but as the time to completion is reduced, the overhead
> is offset by the gain.

Nice! I've applied your series to tip:sched/core with a few non-functional
edits to comment/changelog formatting/clarity.

Btw., was any previous analysis done on the size of the pids_active[] hash
and the hash collision rate?

64 (BITS_PER_LONG) feels a bit small, especially on larger machines running
threaded workloads, and the kmalloc of numab_state likely allocates a full
cacheline anyway, so we could double the hash size from 8 bytes (2x1 longs)
to 32 bytes (2x2 longs) with very little real cost, and still have a long
field left to spare?

Thanks,

	Ingo
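For reference, the scheme under discussion folds the faulting task's PID
down to a single bit in the per-VMA pids_active[] words. A minimal sketch,
assuming the hash_32()-based hashing described later in the thread (the
helper name pid_to_slot() is purely illustrative):

  #include <linux/hash.h>
  #include <linux/log2.h>
  #include <linux/bitops.h>
  #include <linux/types.h>

  /*
   * Sketch only, not kernel code verbatim: with BITS_PER_LONG == 64 the PID
   * hashes down to a 6-bit slot, so at most 64 distinct slots exist and
   * unrelated PIDs can collide on the same bit of pids_active[].
   */
  static inline unsigned int pid_to_slot(pid_t pid)
  {
          return hash_32(pid, ilog2(BITS_PER_LONG));      /* 0..63 on 64-bit */
  }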
On Tue, Oct 10, 2023 at 11:23:00AM +0200, Ingo Molnar wrote:
>
> * Mel Gorman <mgorman@techsingularity.net> wrote:
>
> > On a 2-socket Cascade Lake test machine, the time to complete the
> > workload is as follows;
> >
[...]
> > Both the number of PTE updates and hint faults is dramatically
> > increased. While this is superficially unfortunate, it represents
> > ranges that were simply skipped without the patch. As a result
> > of the scanning and hinting faults, many more pages were also
> > migrated but as the time to completion is reduced, the overhead
> > is offset by the gain.
>
> Nice! I've applied your series to tip:sched/core with a few non-functional
> edits to comment/changelog formatting/clarity.
>

Thanks.

> Btw., was any previous analysis done on the size of the pids_active[] hash
> and the hash collision rate?
>

Not that I'm aware of, but I also think it would be difficult to design
something representative in terms of a benchmark. New PIDs are typically
sequential, so most benchmarks are not going to show many collisions unless
the hash algorithm ignores the lower bits. Maybe it does; I didn't actually
check the hash algorithm, and if it does, that is likely the patch
justification right there -- threads created at similar times are almost
certain to collide. As it was Peter that suggested the hash, I assumed he
considered collisions due to lower bits, but that is also lazy on my part.

If lower bits are used then it poses the question -- does it matter? The
intent of the bitmap is for threads to prefer updating PTEs within
task-active VMAs, but ultimately all VMAs should be scanned anyway, so some
useless overhead is unavoidable. While collisions may occur, it's still
better than scanning within VMAs that are definitely *not* of interest.

It suggests that a sensible direction would be to scan in passes, like load
balancing uses fbq_type in find_busiest_queue() to filter which types of
tasks should be considered for moving. So maybe the passes would look like

	1. Task-active
	2. Multiple tasks active
	3. Any task active
	4. Inactive

The objective would be that PTE updates are as relevant as possible and,
hopefully, by the time only inactive VMAs are considered, there is a
relatively small amount of wasted work.
> 64 (BITS_PER_LONG) feels a bit small, especially on larger machines running
> threaded workloads, and the kmalloc of numab_state likely allocates a full
> cacheline anyway, so we could double the hash size from 8 bytes (2x1 longs)
> to 32 bytes (2x2 longs) with very little real cost, and still have a long
> field left to spare?
>

You're right, we could, and it's relatively cheap. I would worry that, as
the storage overhead is per-VMA, workloads on large machines may also have
lots of VMAs that are not necessarily using threads. As I would struggle to
provide supporting data justifying the change, I would also be hesitant to
try merging it because, if I was reviewing the patch for someone else, the
first question I would ask is "is there any performance benefit that you
can show?". I would expect a first patch to provide some telemetry and the
follow-up patch to use it as justification.
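To make the pass idea above concrete, here is a rough, hypothetical sketch,
loosely modelled on how fbq_type narrows candidates in find_busiest_queue().
The enum and helper names are invented for illustration and it assumes the
pids_active[]/hash-based tracking discussed in this thread:

  #include <linux/mm.h>
  #include <linux/hash.h>
  #include <linux/log2.h>
  #include <linux/bitops.h>

  /* Hypothetical sketch, not kernel code: scan VMAs in relevance order. */
  enum vma_scan_pass {
          VMA_PASS_TASK_ACTIVE,   /* 1. current task recently faulted here */
          VMA_PASS_MULTI_ACTIVE,  /* 2. multiple tasks recently faulted here */
          VMA_PASS_ANY_ACTIVE,    /* 3. any task recently faulted here */
          VMA_PASS_INACTIVE,      /* 4. no recent activity at all */
  };

  static bool vma_matches_pass(struct vm_area_struct *vma, pid_t pid,
                               enum vma_scan_pass pass)
  {
          unsigned long active = vma->numab_state->pids_active[0] |
                                 vma->numab_state->pids_active[1];

          switch (pass) {
          case VMA_PASS_TASK_ACTIVE:
                  return test_bit(hash_32(pid, ilog2(BITS_PER_LONG)), &active);
          case VMA_PASS_MULTI_ACTIVE:
                  return hweight_long(active) > 1;
          case VMA_PASS_ANY_ACTIVE:
                  return active != 0;
          case VMA_PASS_INACTIVE:
                  return true;    /* last resort: everything qualifies */
          }
          return false;
  }

task_numa_work() would then run the cheaper passes first and only fall
through to inactive VMAs when nothing else remains, which is roughly what
the single forced-scan fallback in this patch approximates.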
On 10/10/2023 2:53 PM, Ingo Molnar wrote:
>
> * Mel Gorman <mgorman@techsingularity.net> wrote:
>
[...]
>> Both the number of PTE updates and hint faults is dramatically
>> increased. While this is superficially unfortunate, it represents
>> ranges that were simply skipped without the patch. As a result
>> of the scanning and hinting faults, many more pages were also
>> migrated but as the time to completion is reduced, the overhead
>> is offset by the gain.
>
> Nice! I've applied your series to tip:sched/core with a few non-functional
> edits to comment/changelog formatting/clarity.
>
> Btw., was any previous analysis done on the size of the pids_active[] hash
> and the hash collision rate?
>

Hello Ingo,

I did test how threaded workloads relate to pids_active[], and saw that the
bits set are more spread out etc., but not from a hash-collision point of
view, i.e. whether it works better for workloads that do not create
multiple threads quickly (and thus have sparse PIDs) as well as for the
normal case where PIDs are almost contiguous.

I also did not try to increase the size of an individual pids_active beyond
64 bits, for the same reason Mel gave: at worst we end up doing some cross
PTE updates. Perhaps I can experiment when I come across workloads that
have 512/1000s of threads, to refine these cross PTE updates further.

However, I have tried increasing the history (PeterZ's patch), i.e.
increasing the array size of pids_active[], to get a better VMA candidate
for PTE updates in

  https://lore.kernel.org/all/cover.1693287931.git.raghavendra.kt@amd.com/T/

to handle what Mel suggested in the other email, viz.,

	1. Task-active
	2. Multiple tasks active
	3. Any task active
	4. Inactive

Thanks and Regards
- Raghu
On 10/10/2023 2:01 PM, Mel Gorman wrote:
> VMAs are skipped if there is no recent fault activity but this represents
> a chicken-and-egg problem as there may be no fault activity if the PTEs
> are never updated to trap NUMA hints. There is an indirect reliance on
> scanning to be forced early in the lifetime of a task but this may fail
> to detect changes in phase behaviour. Force inactive VMAs to be scanned
> when all other eligible VMAs have been updated within the same scan
> sequence.
>
[...]
>
> @@ -3277,6 +3288,13 @@ static void task_numa_work(struct callback_head *work)
>  			/* Reset happens after 4 times scan delay of scan start */
>  			vma->numab_state->pids_active_reset = vma->numab_state->next_scan +
>  				msecs_to_jiffies(VMA_PID_RESET_PERIOD);
> +
> +			/*
> +			 * Ensure prev_scan_seq does not match numa_scan_seq
> +			 * to prevent VMAs being skipped prematurely on the
> +			 * first scan.
> +			 */
> +			vma->numab_state->prev_scan_seq = mm->numa_scan_seq - 1;

nit: Perhaps even vma->numab_state->prev_scan_seq = -1 would have worked,
but does not matter.

>  		}
* Mel Gorman <mgorman@techsingularity.net> wrote:

> > 64 (BITS_PER_LONG) feels a bit small, especially on larger machines running
> > threaded workloads, and the kmalloc of numab_state likely allocates a full
> > cacheline anyway, so we could double the hash size from 8 bytes (2x1 longs)

                                                             ^--- 16 bytes

> > to 32 bytes (2x2 longs) with very little real cost, and still have a long
> > field left to spare?
> >
>
> You're right, we could and it's relatively cheap. I would worry that as
> the storage overhead is per-VMA then workloads for large machines may
> also have lots of VMAs that are not necessarily using threads.

So I think there would be *zero* extra per-vma storage overhead, because
vma->numab_state is a pointer, with the structure kmalloc() allocated,
which should round the allocation to cacheline granularity anyway: 64 bytes
on NUMA systems that matter.

So with the current size of 'struct vma_numab_state' of 36 bytes, we can
extend it by 16 bytes with zero additional storage cost. And since there's
no cost, and less hash collisions are always good, the change wouldn't need
any other justification. :-)

[ Plus the resulting abstraction for the definition of a larger bitmask
  would probably make future extensions easier. ]

But ... it was just a suggestion.

Thanks,

	Ingo
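For concreteness, a sketch of the widened layout being suggested; the field
names are inferred from references elsewhere in this thread rather than
copied from the tree, so treat them as assumptions:

  /*
   * Hypothetical sketch of the suggestion above, not a patch. The current
   * layout (8 + 8 + 16 + 4 = 36 bytes, matching the size quoted above)
   * already occupies a 64-byte kmalloc object, so growing pids_active[]
   * from 2x1 to 2x2 longs (+16 bytes, 52 bytes total) stays within the
   * same allocation.
   */
  struct vma_numab_state {
          unsigned long next_scan;                /* when the next scan window opens */
          unsigned long pids_active_reset;        /* when pids_active[] is next cleared */
          unsigned long pids_active[2][2];        /* was [2]: 128 hash slots instead of 64 */
          int prev_scan_seq;                      /* last fully scanned sequence */
  };

The PID hash would then use ilog2(2 * BITS_PER_LONG) = 7 bits, with the
slot split as pids_active[i][slot / BITS_PER_LONG], bit slot % BITS_PER_LONG.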
On 10/10/2023 5:10 PM, Raghavendra K T wrote:
> On 10/10/2023 2:53 PM, Ingo Molnar wrote:
>>
>> * Mel Gorman <mgorman@techsingularity.net> wrote:
>>
> [...]
>>> Both the number of PTE updates and hint faults is dramatically
>>> increased. While this is superficially unfortunate, it represents
>>> ranges that were simply skipped without the patch. As a result
>>> of the scanning and hinting faults, many more pages were also
>>> migrated but as the time to completion is reduced, the overhead
>>> is offset by the gain.
>>
>> Nice! I've applied your series to tip:sched/core with a few
>> non-functional edits to comment/changelog formatting/clarity.
>>
>> Btw., was any previous analysis done on the size of the pids_active[]
>> hash and the hash collision rate?
>>

Hello Ingo,

I did complete the hash collision experiment you asked about a while back,
but did not report it promptly (year-end time), sorry for that. Here is the
summary:

Context:
========
In VMA scanning we currently hash the PID into a 6-bit value so that an
8-byte long can record which PIDs accessed the VMA, to optimize scanning
(the HASH32 method suggested by PeterZ). Function used:

  hash32(PID, ilog2(BITS_PER_LONG))

The alternative was to directly use the last 6 bits of the PID (the
PID_6BIT method). This experiment evaluates what the distribution and
collisions look like.

Experiment:
===========
Main thread:
- Allocates a large amount of memory.
- Creates n threads that sweep the allocated memory for a fixed number of
  iterations (n = 8, 16, 32, 64). All threads are created without delay.

Note down the hash32 values for the threads from the trace. Note down the
PIDs of the threads and extract the low 6 bits. Generate a
hashvalue-frequency list.

Notes:
1) The 8-thread experiment has 8 thread + 1 main thread hash values, and
   so on.
2) With a large number of threads, some threads may not get the chance to
   scan the VMA, so the total count may be less.

Observations:
=============
1) PID_6BIT generates hash values that are crowded together.
2) HASH32 generates very well distributed hash values (as expected).
3) There are no collisions when the total number of threads created is
   less than 64, in both cases.
4) With 64 threads we see more collisions with the HASH32 method, but
   false positives did not seem to be an issue in the experiments so far,
   and the number of collisions is not that high either. I think we got
   lucky in the PID_6BIT case.

Overall, hash32 serves the intended purpose.
(Ingo, I have experimented with extending total PID info stored from 64 - 128 on larger systems, will post it separately with the patch) ================== iter0/8-thread ================== PID_6BIT HASH32 (value-freq) (value-freq) 0 - 1 5 - 1 1 - 1 14 - 1 2 - 1 20 - 1 3 - 1 29 - 1 4 - 1 35 - 1 5 - 1 44 - 1 56 - 1 52 - 1 62 - 1 54 - 1 63 - 1 59 - 1 ================== iter0/16-thread ================== 0 - 1 3 - 1 1 - 1 9 - 1 2 - 1 12 - 1 3 - 1 14 - 1 4 - 1 18 - 1 5 - 1 24 - 1 6 - 1 27 - 1 7 - 1 33 - 1 8 - 1 37 - 1 9 - 1 39 - 1 10 - 1 42 - 1 11 - 1 48 - 1 59 - 1 52 - 1 60 - 1 54 - 1 61 - 1 58 - 1 62 - 1 61 - 1 63 - 1 63 - 1 ================== iter0/32-thread ================== 0 - 1 0 - 1 1 - 1 2 - 1 2 - 1 4 - 1 3 - 1 5 - 1 4 - 1 8 - 1 5 - 1 11 - 1 6 - 1 13 - 1 7 - 1 15 - 1 8 - 1 17 - 1 9 - 1 19 - 1 10 - 1 20 - 1 11 - 1 23 - 1 12 - 1 24 - 1 13 - 1 26 - 1 14 - 1 28 - 1 15 - 1 30 - 1 16 - 1 32 - 1 48 - 1 34 - 1 49 - 1 36 - 1 50 - 1 38 - 1 51 - 1 39 - 1 52 - 1 41 - 1 53 - 1 44 - 1 54 - 1 45 - 1 55 - 1 47 - 1 56 - 1 48 - 1 57 - 1 51 - 1 58 - 1 53 - 1 59 - 1 54 - 1 60 - 1 56 - 1 61 - 1 59 - 1 62 - 1 60 - 1 63 - 1 62 - 1 ================== iter0/64-thread ================== 0 - 1 0 - 1 1 - 1 1 - 1 2 - 1 2 - 1 3 - 1 4 - 1 4 - 1 6 - 1 5 - 1 7 - 1 6 - 1 8 - 1 7 - 1 9 - 1 8 - 1 10 - 1 9 - 1 12 - 1 10 - 1 13 - 1 11 - 1 15 - 1 12 - 1 16 - 1 15 - 1 17 - 2 16 - 1 19 - 1 18 - 1 20 - 1 20 - 1 21 - 1 22 - 1 22 - 1 23 - 1 23 - 1 24 - 1 25 - 2 25 - 2 27 - 1 26 - 1 29 - 1 27 - 1 30 - 1 28 - 1 31 - 1 29 - 1 32 - 1 30 - 1 33 - 1 31 - 1 34 - 1 32 - 1 35 - 1 33 - 1 36 - 1 34 - 1 37 - 1 35 - 1 40 - 2 36 - 1 41 - 1 37 - 1 42 - 1 38 - 1 43 - 1 39 - 1 44 - 1 40 - 1 45 - 1 41 - 1 46 - 1 42 - 1 47 - 1 43 - 1 48 - 1 44 - 1 49 - 1 45 - 1 50 - 1 48 - 1 53 - 2 49 - 1 55 - 1 50 - 1 56 - 2 51 - 1 57 - 1 52 - 1 58 - 1 53 - 1 59 - 1 54 - 1 61 - 2 55 - 1 62 - 1 56 - 1 63 - 1 57 - 1 60 - 1 61 - 1 62 - 1 63 - 1 ================== iter1/8-thread ================== 53 - 1 8 - 1 55 - 1 17 - 1 56 - 1 23 - 1 57 - 1 33 - 1 58 - 1 38 - 1 59 - 1 48 - 1 60 - 1 53 - 1 61 - 1 57 - 1 62 - 1 63 - 1 ================== iter1/16-thread ================== 4 - 1 0 - 1 5 - 1 6 - 1 7 - 1 9 - 1 8 - 1 15 - 1 9 - 1 25 - 1 10 - 1 30 - 1 11 - 1 36 - 1 12 - 1 40 - 1 13 - 1 43 - 1 14 - 1 45 - 1 15 - 1 49 - 1 16 - 1 55 - 1 18 - 1 58 - 1 20 - 1 61 - 1 ================== iter1/32-thread ================== 27 - 1 1 - 1 28 - 1 3 - 1 29 - 1 5 - 1 30 - 1 7 - 1 31 - 1 9 - 1 32 - 1 11 - 1 33 - 1 12 - 1 34 - 1 14 - 1 35 - 1 17 - 1 36 - 1 18 - 1 37 - 1 20 - 1 38 - 1 22 - 1 39 - 1 24 - 1 40 - 1 26 - 1 41 - 1 27 - 1 42 - 1 29 - 1 43 - 1 32 - 1 44 - 1 33 - 1 45 - 1 35 - 1 46 - 1 37 - 1 47 - 1 39 - 1 48 - 1 41 - 1 49 - 1 42 - 1 50 - 1 45 - 1 51 - 1 47 - 1 52 - 1 48 - 1 53 - 1 50 - 1 54 - 1 52 - 1 55 - 1 54 - 1 56 - 1 56 - 1 57 - 1 58 - 1 58 - 1 60 - 1 59 - 1 63 - 1 ================== iter1/64-thread ================== 0 - 1 0 - 1 1 - 1 1 - 1 2 - 1 2 - 1 3 - 1 3 - 1 4 - 1 4 - 2 5 - 1 6 - 1 6 - 1 7 - 2 7 - 1 8 - 1 10 - 1 9 - 1 12 - 1 10 - 1 14 - 1 12 - 1 15 - 1 13 - 1 16 - 2 14 - 1 18 - 1 15 - 1 19 - 1 16 - 1 20 - 1 17 - 1 21 - 1 19 - 1 22 - 1 20 - 1 23 - 1 22 - 1 24 - 1 23 - 1 25 - 1 24 - 1 26 - 1 25 - 1 27 - 1 26 - 1 28 - 1 27 - 1 31 - 1 30 - 1 33 - 1 31 - 1 34 - 1 32 - 2 35 - 1 34 - 1 36 - 1 35 - 1 37 - 1 36 - 1 38 - 1 37 - 1 39 - 1 40 - 1 40 - 1 41 - 1 41 - 1 42 - 1 42 - 1 44 - 1 43 - 1 45 - 1 44 - 1 46 - 1 45 - 1 47 - 1 46 - 1 48 - 1 48 - 1 49 - 1 49 - 1 50 - 1 50 - 1 51 - 1 51 - 1 52 - 1 52 - 1 55 - 1 54 - 1 57 - 1 55 - 1 58 - 1 56 - 1 59 - 1 57 - 1 60 - 1 58 - 1 62 - 
1 59 - 1 63 - 1 60 - 1 62 - 1 Thanks and Regards - Raghu
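The distribution comparison above can also be reproduced outside the
kernel. A small self-contained sketch follows; it assumes hash_32() behaves
as in include/linux/hash.h (multiply by GOLDEN_RATIO_32, keep the top bits)
and uses an arbitrary starting PID, so it only illustrates the method, not
Raghu's exact data:

  /*
   * Userspace sketch (assumption: mirrors hash_32() from include/linux/hash.h).
   * Compares HASH32 and PID_6BIT slot distribution for a burst of
   * consecutively allocated PIDs and prints any slot that collides.
   */
  #include <stdio.h>
  #include <stdint.h>

  #define GOLDEN_RATIO_32 0x61C88647u

  static unsigned int hash_32(uint32_t val, unsigned int bits)
  {
          return (val * GOLDEN_RATIO_32) >> (32 - bits);
  }

  int main(void)
  {
          unsigned int hist_hash[64] = { 0 }, hist_low[64] = { 0 };

          /* 64 threads created back to back typically get consecutive PIDs */
          for (uint32_t pid = 12345; pid < 12345 + 64; pid++) {
                  hist_hash[hash_32(pid, 6)]++;   /* HASH32 method */
                  hist_low[pid & 63]++;           /* PID_6BIT method */
          }

          for (int i = 0; i < 64; i++)
                  if (hist_hash[i] > 1 || hist_low[i] > 1)
                          printf("slot %2d: hash32=%u pid6bit=%u\n",
                                 i, hist_hash[i], hist_low[i]);
          return 0;
  }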
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 8cb1dec3e358..a123c1a58617 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -578,6 +578,12 @@ struct vma_numab_state {
 					 * VMA_PID_RESET_PERIOD
 					 * jiffies.
 					 */
+	int prev_scan_seq;		/* MM scan sequence ID when
+					 * the VMA was last completely
+					 * scanned. A VMA is not
+					 * eligible for scanning if
+					 * prev_scan_seq == numa_scan_seq
+					 */
 };
 
 /*
diff --git a/include/linux/sched/numa_balancing.h b/include/linux/sched/numa_balancing.h
index 7dcc0bdfddbb..b69afb8630db 100644
--- a/include/linux/sched/numa_balancing.h
+++ b/include/linux/sched/numa_balancing.h
@@ -22,6 +22,7 @@ enum numa_vmaskip_reason {
 	NUMAB_SKIP_SCAN_DELAY,
 	NUMAB_SKIP_PID_INACTIVE,
 	NUMAB_SKIP_IGNORE_PID,
+	NUMAB_SKIP_SEQ_COMPLETED,
 };
 
 #ifdef CONFIG_NUMA_BALANCING
diff --git a/include/trace/events/sched.h b/include/trace/events/sched.h
index 27b51c81b106..010ba1b7cb0e 100644
--- a/include/trace/events/sched.h
+++ b/include/trace/events/sched.h
@@ -671,7 +671,8 @@ DEFINE_EVENT(sched_numa_pair_template, sched_swap_numa,
 	EM( NUMAB_SKIP_INACCESSIBLE,		"inaccessible" )	\
 	EM( NUMAB_SKIP_SCAN_DELAY,		"scan_delay" )		\
 	EM( NUMAB_SKIP_PID_INACTIVE,		"pid_inactive" )	\
-	EMe(NUMAB_SKIP_IGNORE_PID,		"ignore_pid_inactive" )
+	EM( NUMAB_SKIP_IGNORE_PID,		"ignore_pid_inactive" )	\
+	EMe(NUMAB_SKIP_SEQ_COMPLETED,		"seq_completed" )
 
 /* Redefine for export. */
 #undef EM
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 150f01948ec6..72ef60f394ba 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -3175,6 +3175,8 @@ static void task_numa_work(struct callback_head *work)
 	unsigned long nr_pte_updates = 0;
 	long pages, virtpages;
 	struct vma_iterator vmi;
+	bool vma_pids_skipped;
+	bool vma_pids_forced = false;
 
 	SCHED_WARN_ON(p != container_of(work, struct task_struct, numa_work));
 
@@ -3217,7 +3219,6 @@ static void task_numa_work(struct callback_head *work)
 	 */
 	p->node_stamp += 2 * TICK_NSEC;
 
-	start = mm->numa_scan_offset;
 	pages = sysctl_numa_balancing_scan_size;
 	pages <<= 20 - PAGE_SHIFT; /* MB in pages */
 	virtpages = pages * 8;	   /* Scan up to this much virtual space */
@@ -3227,6 +3228,16 @@ static void task_numa_work(struct callback_head *work)
 
 	if (!mmap_read_trylock(mm))
 		return;
+
+	/*
+	 * VMAs are skipped if the current PID has not trapped a fault within
+	 * the VMA recently. Allow scanning to be forced if there is no
+	 * suitable VMA remaining.
+	 */
+	vma_pids_skipped = false;
+
+retry_pids:
+	start = mm->numa_scan_offset;
 	vma_iter_init(&vmi, mm, start);
 	vma = vma_next(&vmi);
 	if (!vma) {
@@ -3277,6 +3288,13 @@ static void task_numa_work(struct callback_head *work)
 			/* Reset happens after 4 times scan delay of scan start */
 			vma->numab_state->pids_active_reset = vma->numab_state->next_scan +
 				msecs_to_jiffies(VMA_PID_RESET_PERIOD);
+
+			/*
+			 * Ensure prev_scan_seq does not match numa_scan_seq
+			 * to prevent VMAs being skipped prematurely on the
+			 * first scan.
+			 */
+			vma->numab_state->prev_scan_seq = mm->numa_scan_seq - 1;
 		}
 
 		/*
@@ -3298,8 +3316,19 @@ static void task_numa_work(struct callback_head *work)
 			vma->numab_state->pids_active[1] = 0;
 		}
 
-		/* Do not scan the VMA if task has not accessed */
-		if (!vma_is_accessed(mm, vma)) {
+		/* Do not rescan VMAs twice within the same sequence. */
+		if (vma->numab_state->prev_scan_seq == mm->numa_scan_seq) {
+			mm->numa_scan_offset = vma->vm_end;
+			trace_sched_skip_vma_numa(mm, vma, NUMAB_SKIP_SEQ_COMPLETED);
+			continue;
+		}
+
+		/*
+		 * Do not scan the VMA if task has not accessed unless no other
+		 * VMA candidate exists.
+		 */
+		if (!vma_pids_forced && !vma_is_accessed(mm, vma)) {
+			vma_pids_skipped = true;
 			trace_sched_skip_vma_numa(mm, vma, NUMAB_SKIP_PID_INACTIVE);
 			continue;
 		}
@@ -3328,8 +3357,28 @@ static void task_numa_work(struct callback_head *work)
 
 			cond_resched();
 		} while (end != vma->vm_end);
+
+		/* VMA scan is complete, do not scan until next sequence. */
+		vma->numab_state->prev_scan_seq = mm->numa_scan_seq;
+
+		/*
+		 * Only force scan within one VMA at a time to limit the
+		 * cost of scanning a potentially uninteresting VMA.
+		 */
+		if (vma_pids_forced)
+			break;
 	} for_each_vma(vmi, vma);
 
+	/*
+	 * If no VMAs are remaining and VMAs were skipped due to the PID
+	 * not accessing the VMA previously then force a scan to ensure
+	 * forward progress.
+	 */
+	if (!vma && !vma_pids_forced && vma_pids_skipped) {
+		vma_pids_forced = true;
+		goto retry_pids;
+	}
+
 out:
 	/*
 	 * It is possible to reach the end of the VMA list but the last few
VMAs are skipped if there is no recent fault activity but this represents
a chicken-and-egg problem as there may be no fault activity if the PTEs
are never updated to trap NUMA hints. There is an indirect reliance on
scanning to be forced early in the lifetime of a task but this may fail
to detect changes in phase behaviour. Force inactive VMAs to be scanned
when all other eligible VMAs have been updated within the same scan
sequence.

Test results in general look good with some changes in performance, both
negative and positive, depending on whether the additional scanning and
faulting was beneficial or not to the workload. The autonuma benchmark
workload NUMA01_THREADLOCAL was picked for closer examination. The workload
creates two processes with numerous threads and thread-local storage that
is zero-filled in a loop. It exercises the corner case where unrelated
threads may skip VMAs that are thread-local to another thread and still
has some VMAs that are inactive while the workload executes.

The VMA skipping activity frequency with and without the patch is as
follows;

6.6.0-rc2-sched-numabtrace-v1
     649 reason=scan_delay
    9094 reason=unsuitable
   48915 reason=shared_ro
  143919 reason=inaccessible
  193050 reason=pid_inactive

6.6.0-rc2-sched-numabselective-v1
     146 reason=seq_completed
     622 reason=ignore_pid_inactive
     624 reason=scan_delay
    6570 reason=unsuitable
   16101 reason=shared_ro
   27608 reason=inaccessible
   41939 reason=pid_inactive

Note that with the patch applied, the PID activity is ignored
(ignore_pid_inactive) to ensure a VMA with some activity is completely
scanned. In addition, a small number of VMAs are scanned when no other
eligible VMA is available during a single scan window (seq_completed).
The number of times a VMA is skipped due to no PID activity from the
scanning task (pid_inactive) drops dramatically. It is expected that
this will increase the number of PTEs updated for NUMA hinting faults
as well as hinting faults but these represent PTEs that would otherwise
have been missed. The tradeoff is scan+fault overhead versus improving
locality due to migration.

On a 2-socket Cascade Lake test machine, the time to complete the
workload is as follows;

                                       6.6.0-rc2             6.6.0-rc2
                             sched-numabtrace-v1   sched-numabselective-v1
Min      elsp-NUMA01_THREADLOCAL   174.22 (  0.00%)     117.64 ( 32.48%)
Amean    elsp-NUMA01_THREADLOCAL   175.68 (  0.00%)     123.34 * 29.79%*
Stddev   elsp-NUMA01_THREADLOCAL     1.20 (  0.00%)       4.06 (-238.20%)
CoeffVar elsp-NUMA01_THREADLOCAL     0.68 (  0.00%)       3.29 (-381.70%)
Max      elsp-NUMA01_THREADLOCAL   177.18 (  0.00%)     128.03 ( 27.74%)

The time to complete the workload is reduced by almost 30%.

                            6.6.0-rc2             6.6.0-rc2
                  sched-numabtrace-v1   sched-numabselective-v1
Duration User        91201.80              63506.64
Duration System       2015.53               1819.78
Duration Elapsed      1234.77                868.37

In this specific case, system CPU time was not increased but it's not
universally true.

From vmstat, the NUMA scanning and fault activity is as follows;

                                         6.6.0-rc2             6.6.0-rc2
                               sched-numabtrace-v1   sched-numabselective-v1
Ops NUMA base-page range updates      64272.00           26374386.00
Ops NUMA PTE updates                  36624.00              55538.00
Ops NUMA PMD updates                     54.00              51404.00
Ops NUMA hint faults                  15504.00              75786.00
Ops NUMA hint local faults %          14860.00              56763.00
Ops NUMA hint local percent              95.85                 74.90
Ops NUMA pages migrated                1629.00            6469222.00

Both the number of PTE updates and hint faults are dramatically
increased. While this is superficially unfortunate, it represents
ranges that were simply skipped without the patch. As a result
of the scanning and hinting faults, many more pages were also
migrated but as the time to completion is reduced, the overhead
is offset by the gain.

Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
---
 include/linux/mm_types.h             |  6 +++
 include/linux/sched/numa_balancing.h |  1 +
 include/trace/events/sched.h         |  3 +-
 kernel/sched/fair.c                  | 55 ++++++++++++++++++++++++++--
 4 files changed, 61 insertions(+), 4 deletions(-)