
[RFC] NUMA balancing: reduce TLB flush via delaying mapping on hint page fault

Message ID 20210329062651.2487905-1-ying.huang@intel.com (mailing list archive)
State New, archived
Series [RFC] NUMA balancing: reduce TLB flush via delaying mapping on hint page fault

Commit Message

Huang, Ying March 29, 2021, 6:26 a.m. UTC
For NUMA balancing, in the hint page fault handler, the faulting
page is migrated to the accessing node if necessary.  During the
migration, the TLB is shot down on all CPUs that the process has
run on recently, because in the hint page fault handler the PTE is
made accessible before the migration is attempted.  The overhead of
the TLB shootdown is high, so it is better avoided if possible.  In
fact, it can be avoided if we delay mapping the page in the PTE
until after the migration is attempted.  That is what this patch does.
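
In pseudo-code, with make_pte_accessible() as a hypothetical
shorthand for the ptep_modify_prot_start()/ptep_modify_prot_commit()
sequence, the reordering is:

	/* Before: map first, then try to migrate */
	make_pte_accessible();
	migrate_misplaced_page(page, vma, target_nid);

	/* After: try to migrate first, map only if the page stays put */
	if (!migrate_misplaced_page(page, vma, target_nid))
		make_pte_accessible();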

We have tested the patch with the pmbench memory accessing benchmark
on a 2-socket Intel server, and found that the number of TLB
shootdown IPIs is reduced by up to 99% (from ~6.0e6 to ~2.3e4) when
NUMA balancing is triggered (~8.8e6 pages migrated).  The benchmark
score has no visible changes.

Known issues:

For multi-threaded applications, it is possible for the page to be
accessed by 2 threads at almost the same time.  In the original
implementation, the second thread may access the page directly
because the first thread has installed the accessible PTE.  With this
patch, there will be a window during which the second thread finds
the PTE still inaccessible.  But the difference in the accessible
window is small, because the page will soon be made inaccessible
anyway for the migration.

Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Peter Xu <peterx@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: "Matthew Wilcox" <willy@infradead.org>
Cc: Will Deacon <will@kernel.org>
Cc: Michel Lespinasse <walken@google.com>
Cc: Arjun Roy <arjunroy@google.com>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
---
 mm/memory.c | 54 +++++++++++++++++++++++++++++++----------------------
 1 file changed, 32 insertions(+), 22 deletions(-)

Comments

Mel Gorman March 30, 2021, 1:33 p.m. UTC | #1
On Mon, Mar 29, 2021 at 02:26:51PM +0800, Huang Ying wrote:
> For NUMA balancing, in the hint page fault handler, the faulting
> page is migrated to the accessing node if necessary.  During the
> migration, the TLB is shot down on all CPUs that the process has
> run on recently, because in the hint page fault handler the PTE is
> made accessible before the migration is attempted.  The overhead of
> the TLB shootdown is high, so it is better avoided if possible.  In
> fact, it can be avoided if we delay mapping the page in the PTE
> until after the migration is attempted.  That is what this patch does.
> 

Why would the overhead be high? It was previously inaccessible so it's
only parallel accesses making forward progress that trigger the need
for a flush. As your change notes -- "The benchmark score has no visible
changes". The patch was neither a win nor a loss for your target workload
but there are more fundamental issues to consider.

> We have tested the patch with the pmbench memory accessing benchmark
> on a 2-socket Intel server, and found that the number of TLB
> shootdown IPIs is reduced by up to 99% (from ~6.0e6 to ~2.3e4) when
> NUMA balancing is triggered (~8.8e6 pages migrated).  The benchmark
> score has no visible changes.
> 
> Known issues:
> 
> For multi-threaded applications, it is possible for the page to be
> accessed by 2 threads at almost the same time.  In the original
> implementation, the second thread may access the page directly
> because the first thread has installed the accessible PTE.  With this
> patch, there will be a window during which the second thread finds
> the PTE still inaccessible.  But the difference in the accessible
> window is small, because the page will soon be made inaccessible
> anyway for the migration.
> 

If multiple threads trap the hinting fault, only one potentially attempts
a migration as the others observe the PTE has changed when the PTL is
acquired and return to userspace. Such threads then have a short window to
make progress before the PTE *potentially* becomes a migration PTE and
during that window, the parallel access may not need the page any more
and never stall on the migration.
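
(For reference, the serialisation comes from the check at the top of
do_numa_page(), roughly:

	vmf->pte = pte_offset_map(vmf->pmd, vmf->address);
	spin_lock(vmf->ptl);
	if (unlikely(!pte_same(*vmf->pte, vmf->orig_pte))) {
		/* Another thread already handled the hinting fault. */
		pte_unmap_unlock(vmf->pte, vmf->ptl);
		goto out;
	}

The losing threads take the early exit and return to userspace.)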

That migration PTE may never be created if migrate_misplaced_page
chooses to ignore the PTE in which case there is minimal disruption.

If migration is attempted, then the time until the migration PTE is
created is variable. The page has to be isolated from the LRU so there
could be contention on the LRU lock, a new page has to be allocated and
that allocation potentially has to enter the page allocator slow path
etc. During that time, parallel threads make forward progress but with
the patch, multiple threads potentially attempt the allocation and fail
instead of doing real work.

You should consider the following question -- is the potential saving
of an IPI transmission enough to offset the cost of parallel accesses
not making forward progress while one migration is setup and having
different migration attempts collide?

I have tests running just in case but I think the answer may be "no".
So far only one useful test has completed (specjbb2005 with one VM per NUMA
node) and it showed a mix of small gains and losses but with *higher*
interrupts contrary to what was expected from the changelog. For some
thread counts, the results showed large differences in variability,
sometimes lower and sometimes much higher.

It makes me think that a workload should be identified where the IPI
savings really are enough to justify stalling parallel accesses that
could be making forward progress.

One nit below

> Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
> Cc: Peter Zijlstra <peterz@infradead.org>
> Cc: Mel Gorman <mgorman@suse.de>
> Cc: Peter Xu <peterx@redhat.com>
> Cc: Johannes Weiner <hannes@cmpxchg.org>
> Cc: Vlastimil Babka <vbabka@suse.cz>
> Cc: "Matthew Wilcox" <willy@infradead.org>
> Cc: Will Deacon <will@kernel.org>
> Cc: Michel Lespinasse <walken@google.com>
> Cc: Arjun Roy <arjunroy@google.com>
> Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
> ---
>  mm/memory.c | 54 +++++++++++++++++++++++++++++++----------------------
>  1 file changed, 32 insertions(+), 22 deletions(-)
> 
> diff --git a/mm/memory.c b/mm/memory.c
> index d3273bd69dbb..a9a8ed1ac06c 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -4148,29 +4148,17 @@ static vm_fault_t do_numa_page(struct vm_fault *vmf)
>  		goto out;
>  	}
>  
> -	/*
> -	 * Make it present again, Depending on how arch implementes non
> -	 * accessible ptes, some can allow access by kernel mode.
> -	 */
> -	old_pte = ptep_modify_prot_start(vma, vmf->address, vmf->pte);
> +	/* Get the normal PTE  */
> +	old_pte = ptep_get(vmf->pte);
>  	pte = pte_modify(old_pte, vma->vm_page_prot);
> -	pte = pte_mkyoung(pte);
> -	if (was_writable)
> -		pte = pte_mkwrite(pte);
> -	ptep_modify_prot_commit(vma, vmf->address, vmf->pte, old_pte, pte);
> -	update_mmu_cache(vma, vmf->address, vmf->pte);
>  
>  	page = vm_normal_page(vma, vmf->address, pte);
> -	if (!page) {
> -		pte_unmap_unlock(vmf->pte, vmf->ptl);
> -		return 0;
> -	}
> +	if (!page)
> +		goto out_map;
>  
>  	/* TODO: handle PTE-mapped THP */
> -	if (PageCompound(page)) {
> -		pte_unmap_unlock(vmf->pte, vmf->ptl);
> -		return 0;
> -	}
> +	if (PageCompound(page))
> +		goto out_map;
>  
>  	/*
>  	 * Avoid grouping on RO pages in general. RO pages shouldn't hurt as
> @@ -4180,7 +4168,7 @@ static vm_fault_t do_numa_page(struct vm_fault *vmf)
>  	 * pte_dirty has unpredictable behaviour between PTE scan updates,
>  	 * background writeback, dirty balancing and application behaviour.
>  	 */
> -	if (!pte_write(pte))
> +	if (was_writable)
>  		flags |= TNF_NO_GROUP;
>  
>  	/*

Not clear why this change was made.

> @@ -4194,23 +4182,45 @@ static vm_fault_t do_numa_page(struct vm_fault *vmf)
>  	page_nid = page_to_nid(page);
>  	target_nid = numa_migrate_prep(page, vma, vmf->address, page_nid,
>  			&flags);
> -	pte_unmap_unlock(vmf->pte, vmf->ptl);
>  	if (target_nid == NUMA_NO_NODE) {
>  		put_page(page);
> -		goto out;
> +		goto out_map;
>  	}
> +	pte_unmap_unlock(vmf->pte, vmf->ptl);
>  
>  	/* Migrate to the requested node */
>  	if (migrate_misplaced_page(page, vma, target_nid)) {
>  		page_nid = target_nid;
>  		flags |= TNF_MIGRATED;
> -	} else
> +	} else {
>  		flags |= TNF_MIGRATE_FAIL;
> +		vmf->pte = pte_offset_map(vmf->pmd, vmf->address);
> +		spin_lock(vmf->ptl);
> +		if (unlikely(!pte_same(*vmf->pte, vmf->orig_pte))) {
> +			pte_unmap_unlock(vmf->pte, vmf->ptl);
> +			goto out;
> +		}
> +		goto out_map;
> +	}
>  
>  out:
>  	if (page_nid != NUMA_NO_NODE)
>  		task_numa_fault(last_cpupid, page_nid, 1, flags);
>  	return 0;
> +out_map:
> +	/*
> +	 * Make it present again, Depending on how arch implementes non
> +	 * accessible ptes, some can allow access by kernel mode.
> +	 */
> +	old_pte = ptep_modify_prot_start(vma, vmf->address, vmf->pte);
> +	pte = pte_modify(old_pte, vma->vm_page_prot);
> +	pte = pte_mkyoung(pte);
> +	if (was_writable)
> +		pte = pte_mkwrite(pte);
> +	ptep_modify_prot_commit(vma, vmf->address, vmf->pte, old_pte, pte);
> +	update_mmu_cache(vma, vmf->address, vmf->pte);
> +	pte_unmap_unlock(vmf->pte, vmf->ptl);
> +	goto out;
>  }
>  
>  static inline vm_fault_t create_huge_pmd(struct vm_fault *vmf)
> -- 
> 2.30.2
>
Huang, Ying March 31, 2021, 11:20 a.m. UTC | #2
Mel Gorman <mgorman@suse.de> writes:

> On Mon, Mar 29, 2021 at 02:26:51PM +0800, Huang Ying wrote:
>> For NUMA balancing, in the hint page fault handler, the faulting
>> page is migrated to the accessing node if necessary.  During the
>> migration, the TLB is shot down on all CPUs that the process has
>> run on recently, because in the hint page fault handler the PTE is
>> made accessible before the migration is attempted.  The overhead of
>> the TLB shootdown is high, so it is better avoided if possible.  In
>> fact, it can be avoided if we delay mapping the page in the PTE
>> until after the migration is attempted.  That is what this patch does.
>> 
>
> Why would the overhead be high? It was previously inaccessible so it's
> only parallel accesses making forward progress that trigger the need
> for a flush.

Sorry, I don't understand this.  Although the page is inaccessible, the
threads may access other pages, so TLB flushing is still necessary.

> As your change notes -- "The benchmark score has no visible
> changes". The patch was neither a win nor a loss for your target workload
> but there are more fundamental issues to consider.
>
>> We have tested the patch with the pmbench memory accessing benchmark
>> on a 2-socket Intel server, and found that the number of TLB
>> shootdown IPIs is reduced by up to 99% (from ~6.0e6 to ~2.3e4) when
>> NUMA balancing is triggered (~8.8e6 pages migrated).  The benchmark
>> score has no visible changes.
>> 
>> Known issues:
>> 
>> For multi-threaded applications, it is possible for the page to be
>> accessed by 2 threads at almost the same time.  In the original
>> implementation, the second thread may access the page directly
>> because the first thread has installed the accessible PTE.  With this
>> patch, there will be a window during which the second thread finds
>> the PTE still inaccessible.  But the difference in the accessible
>> window is small, because the page will soon be made inaccessible
>> anyway for the migration.
>> 
>
> If multiple threads trap the hinting fault, only one potentially attempts
> a migration as the others observe the PTE has changed when the PTL is
> acquired and return to userspace. Such threads then have a short window to
> make progress before the PTE *potentially* becomes a migration PTE and
> during that window, the parallel access may not need the page any more
> and never stall on the migration.

Yes.

> That migration PTE may never be created if migrate_misplaced_page
> chooses to ignore the PTE in which case there is minimal disruption.

Yes.  And in the patched kernel, if numa_migrate_prep() returns
NUMA_NO_NODE or migrate_misplaced_page() returns 0, the PTE will be made
accessible too.

> If migration is attempted, then the time until the migration PTE is
> created is variable. The page has to be isolated from the LRU so there
> could be contention on the LRU lock, a new page has to be allocated and
> that allocation potentially has to enter the page allocator slow path
> etc. During that time, parallel threads make forward progress but with
> the patch, multiple threads potentially attempt the allocation and fail
> instead of doing real work.

If my understanding of the code is correct, only the first thread will
attempt the isolation and allocation.  Because TestClearPageLRU() is
called in

  migrate_misplaced_page()
    numamigrate_isolate_page()
      isolate_lru_page()

And migrate_misplaced_page() will return 0 immediately if
TestClearPageLRU() returns false.  Then the second thread will make the
page accessible and make forward progress.
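
Schematically (error handling and counters elided), that bail-out
path is roughly:

	/* migrate_misplaced_page() */
	isolated = numamigrate_isolate_page(pgdat, page);
	if (!isolated)		/* TestClearPageLRU() returned false */
		goto out;
	...
out:
	put_page(page);
	return 0;		/* the caller then remaps the page */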

But there's still some timing difference between the original and
patched kernel.  We have several choices to reduce the difference.

1. Check PageLRU() with PTL held in do_numa_page()

If PageLRU() returns false, do_numa_page() can make the page accessible
first.  So the second thread will make the page accessible earlier.

2. Try to lock the page with PTL held in do_numa_page()

If the try-locking succeeds, it's the first thread, so it can delay
mapping.  If the try-locking fails, it may be the second thread, so it
will make the page accessible first.  We need to teach
migrate_misplaced_page() to work with the page locked.  This will
enlarge the duration for which the page is locked.  Is that a problem?

3. Check page_count() with PTL held in do_numa_page()

The first thread will call get_page() in numa_migrate_prep().  So if the
second thread can detect that, it can make the page accessible first.
The difficulty is that it appears hard to identify the expected
page_count() for the file pages.  For anonymous pages, that is much
easier, so at least if a page passes the following test, we can delay
mapping,

    PageAnon(page) && page_count(page) == page_mapcount(page) + !!PageSwapCache(page)

This will disable the optimization for the file pages.  But it may be
good enough?

Which one do you think is better?  Maybe the first one is good enough?
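
For example, option 1 could be something like the following untested
sketch in do_numa_page(), after numa_migrate_prep() and with the PTL
still held:

	if (!PageLRU(page)) {
		/*
		 * The page has already been isolated from the LRU,
		 * most likely by another thread that is migrating it,
		 * so map it back in immediately instead of attempting
		 * a migration that is doomed to fail.
		 */
		put_page(page);
		goto out_map;
	}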

> You should consider the following question -- is the potential saving
> of an IPI transmission enough to offset the cost of parallel accesses
> not making forward progress while one migration is setup and having
> different migration attempts collide?
>
> I have tests running just in case but I think the answer may be "no".
> So far only one useful test as completed (specjbb2005 with one VM per NUMA
> node) and it showed a mix of small gains and losses but with *higher*
> interrupts contrary to what was expected from the changelog.

That is hard to understand.  Maybe it is caused by the bug you pointed
out (about was_writable)?

> For some
> thread counts, the results showed large differences in variability,
> sometimes lower and sometimes much higher.
>
> It makes me think that a workload should be identified where the IPI
> savings really are enough to justify stalling parallel accesses that
> could be making forward progress.
>
> One nit below
>
>> Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
>> Cc: Peter Zijlstra <peterz@infradead.org>
>> Cc: Mel Gorman <mgorman@suse.de>
>> Cc: Peter Xu <peterx@redhat.com>
>> Cc: Johannes Weiner <hannes@cmpxchg.org>
>> Cc: Vlastimil Babka <vbabka@suse.cz>
>> Cc: "Matthew Wilcox" <willy@infradead.org>
>> Cc: Will Deacon <will@kernel.org>
>> Cc: Michel Lespinasse <walken@google.com>
>> Cc: Arjun Roy <arjunroy@google.com>
>> Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
>> ---
>>  mm/memory.c | 54 +++++++++++++++++++++++++++++++----------------------
>>  1 file changed, 32 insertions(+), 22 deletions(-)
>> 
>> diff --git a/mm/memory.c b/mm/memory.c
>> index d3273bd69dbb..a9a8ed1ac06c 100644
>> --- a/mm/memory.c
>> +++ b/mm/memory.c
>> @@ -4148,29 +4148,17 @@ static vm_fault_t do_numa_page(struct vm_fault *vmf)
>>  		goto out;
>>  	}
>>  
>> -	/*
>> -	 * Make it present again, Depending on how arch implementes non
>> -	 * accessible ptes, some can allow access by kernel mode.
>> -	 */
>> -	old_pte = ptep_modify_prot_start(vma, vmf->address, vmf->pte);
>> +	/* Get the normal PTE  */
>> +	old_pte = ptep_get(vmf->pte);
>>  	pte = pte_modify(old_pte, vma->vm_page_prot);
>> -	pte = pte_mkyoung(pte);
>> -	if (was_writable)
>> -		pte = pte_mkwrite(pte);
>> -	ptep_modify_prot_commit(vma, vmf->address, vmf->pte, old_pte, pte);
>> -	update_mmu_cache(vma, vmf->address, vmf->pte);
>>  
>>  	page = vm_normal_page(vma, vmf->address, pte);
>> -	if (!page) {
>> -		pte_unmap_unlock(vmf->pte, vmf->ptl);
>> -		return 0;
>> -	}
>> +	if (!page)
>> +		goto out_map;
>>  
>>  	/* TODO: handle PTE-mapped THP */
>> -	if (PageCompound(page)) {
>> -		pte_unmap_unlock(vmf->pte, vmf->ptl);
>> -		return 0;
>> -	}
>> +	if (PageCompound(page))
>> +		goto out_map;
>>  
>>  	/*
>>  	 * Avoid grouping on RO pages in general. RO pages shouldn't hurt as
>> @@ -4180,7 +4168,7 @@ static vm_fault_t do_numa_page(struct vm_fault *vmf)
>>  	 * pte_dirty has unpredictable behaviour between PTE scan updates,
>>  	 * background writeback, dirty balancing and application behaviour.
>>  	 */
>> -	if (!pte_write(pte))
>> +	if (was_writable)
>>  		flags |= TNF_NO_GROUP;
>>  
>>  	/*
>
> Not clear why this change was made.

Sorry, this is a mistake.  It should be

        if (!was_writable)
  		flags |= TNF_NO_GROUP;

Thanks for pointing this out!  I will change this in the next version.

Best Regards,
Huang, Ying
Mel Gorman March 31, 2021, 1:16 p.m. UTC | #3
On Wed, Mar 31, 2021 at 07:20:09PM +0800, Huang, Ying wrote:
> Mel Gorman <mgorman@suse.de> writes:
> 
> > On Mon, Mar 29, 2021 at 02:26:51PM +0800, Huang Ying wrote:
> >> For NUMA balancing, in the hint page fault handler, the faulting
> >> page is migrated to the accessing node if necessary.  During the
> >> migration, the TLB is shot down on all CPUs that the process has
> >> run on recently, because in the hint page fault handler the PTE is
> >> made accessible before the migration is attempted.  The overhead of
> >> the TLB shootdown is high, so it is better avoided if possible.  In
> >> fact, it can be avoided if we delay mapping the page in the PTE
> >> until after the migration is attempted.  That is what this patch does.
> >> 
> >
> > Why would the overhead be high? It was previously inaccessible so it's
> > only parallel accesses making forward progress that trigger the need
> > for a flush.
> 
> Sorry, I don't understand this.  Although the page is inaccessible, the
> threads may access other pages, so TLB flushing is still necessary.
> 

You assert the overhead of TLB shootdown is high and yes, it can be
very high but you also said "the benchmark score has no visible changes"
indicating the TLB shootdown cost is not a major problem for the workload.
It does not mean we should ignore it though.

> > <SNIP the parts that are not problematic>
> >
> > If migration is attempted, then the time until the migration PTE is
> > created is variable. The page has to be isolated from the LRU so there
> > could be contention on the LRU lock, a new page has to be allocated and
> > that allocation potentially has to enter the page allocator slow path
> > etc. During that time, parallel threads make forward progress but with
> > the patch, multiple threads potentially attempt the allocation and fail
> > instead of doing real work.
> 
> If my understanding of the code is correct, only the first thread will
> attempt the isolation and allocation.  Because TestClearPageLRU() is
> called in
> 
>   migrate_misplaced_page()
>     numamigrate_isolate_page()
>       isolate_lru_page()
> 
> And migrate_misplaced_page() will return 0 immediately if
> TestClearPageLRU() returns false.  Then the second thread will make the
> page accessible and make forward progress.
> 

Ok, that's true. While additional work is done, the cost is reasonably
low -- lower than I initially imagined and with fewer side-effects.

> But there's still some timing difference between the original and
> patched kernel.  We have several choices to reduce the difference.
> 
> 1. Check PageLRU() with PTL held in do_numa_page()
> 
> If PageLRU() returns false, do_numa_page() can make the page accessible
> first.  So the second thread will make the page accessible earlier.
> 
> 2. Try to lock the page with PTL held in do_numa_page()
> 
> If the try-locking succeeds, it's the first thread, so it can delay
> mapping.  If the try-locking fails, it may be the second thread, so it
> will make the page accessible first.  We need to teach
> migrate_misplaced_page() to work with the page locked.  This will
> enlarge the duration for which the page is locked.  Is that a problem?
> 
> 3. Check page_count() with PTL held in do_numa_page()
> 
> The first thread will call get_page() in numa_migrate_prep().  So if the
> second thread can detect that, it can make the page accessible first.
> The difficulty is that it appears hard to identify the expected
> page_count() for the file pages.  For anonymous pages, that is much
> easier, so at least if a page passes the following test, we can delay
> mapping,
> 
>     PageAnon(page) && page_count(page) == page_mapcount(page) + !!PageSwapCache(page)
> 
> This will disable the optimization for the file pages.  But it may be
> good enough?
> 
> Which one do you think is better?  Maybe the first one is good enough?
> 

The first one is probably the most straightforward but it's more
important to figure out why interrupts were higher with at least one
workload when the exact opposite is expected. Investigating which of
options 1-3 is best and whether it's worth the duplicated check could
be done as a separate patch.

> > You should consider the following question -- is the potential saving
> > of an IPI transmission enough to offset the cost of parallel accesses
> > not making forward progress while one migration is setup and having
> > different migration attempts collide?
> >
> > I have tests running just in case but I think the answer may be "no".
>> > So far only one useful test has completed (specjbb2005 with one VM per NUMA
> > node) and it showed a mix of small gains and losses but with *higher*
> > interrupts contrary to what was expected from the changelog.
> 
>> That is hard to understand.  Maybe it is caused by the bug you pointed
>> out (about was_writable)?
> 

It's possible and I could not figure out what the rationale behind the
change was :/

Fix it and run it through your tests to make sure it works as you
expect. Assuming it passes your tests and it's posted, I'll read it again
and run it through a battery of tests. If it shows that interrupts are
lower and is either neutral or improves performance in enough cases then
I think it'll be ok. Even if it's only neutral in terms of performance
but interrupts are lower, it'll be acceptable.

Thanks!
Nadav Amit March 31, 2021, 4:36 p.m. UTC | #4
> On Mar 31, 2021, at 6:16 AM, Mel Gorman <mgorman@suse.de> wrote:
> 
> On Wed, Mar 31, 2021 at 07:20:09PM +0800, Huang, Ying wrote:
>> Mel Gorman <mgorman@suse.de> writes:
>> 
>>> On Mon, Mar 29, 2021 at 02:26:51PM +0800, Huang Ying wrote:
>>>> For NUMA balancing, in the hint page fault handler, the faulting
>>>> page is migrated to the accessing node if necessary.  During the
>>>> migration, the TLB is shot down on all CPUs that the process has
>>>> run on recently, because in the hint page fault handler the PTE is
>>>> made accessible before the migration is attempted.  The overhead of
>>>> the TLB shootdown is high, so it is better avoided if possible.  In
>>>> fact, it can be avoided if we delay mapping the page in the PTE
>>>> until after the migration is attempted.  That is what this patch does.
>>>> 
>>> 
>>> Why would the overhead be high? It was previously inaccessible so it's
>>> only parallel accesses making forward progress that trigger the need
>>> for a flush.
>> 
>> Sorry, I don't understand this.  Although the page is inaccessible, the
>> threads may access other pages, so TLB flushing is still necessary.
>> 
> 
> You assert the overhead of TLB shootdown is high and yes, it can be
> very high but you also said "the benchmark score has no visible changes"
> indicating the TLB shootdown cost is not a major problem for the workload.
> It does not mean we should ignore it though.

If you are looking for a benchmark that is negatively affected by NUMA
balancing, then IIRC Parsec’s dedup is such a workload. [1]

[1] https://parsec.cs.princeton.edu/
Huang, Ying April 1, 2021, 12:52 a.m. UTC | #5
Mel Gorman <mgorman@suse.de> writes:

> On Wed, Mar 31, 2021 at 07:20:09PM +0800, Huang, Ying wrote:
>> Mel Gorman <mgorman@suse.de> writes:
>> 
>> > On Mon, Mar 29, 2021 at 02:26:51PM +0800, Huang Ying wrote:
>> >> For NUMA balancing, in the hint page fault handler, the faulting
>> >> page is migrated to the accessing node if necessary.  During the
>> >> migration, the TLB is shot down on all CPUs that the process has
>> >> run on recently, because in the hint page fault handler the PTE is
>> >> made accessible before the migration is attempted.  The overhead of
>> >> the TLB shootdown is high, so it is better avoided if possible.  In
>> >> fact, it can be avoided if we delay mapping the page in the PTE
>> >> until after the migration is attempted.  That is what this patch does.
>> >> 
>> >
>> > Why would the overhead be high? It was previously inaccessible so it's
>> > only parallel accesses making forward progress that trigger the need
>> > for a flush.
>> 
>> Sorry, I don't understand this.  Although the page is inaccessible, the
>> threads may access other pages, so TLB flushing is still necessary.
>> 
>
> You assert the overhead of TLB shootdown is high and yes, it can be
> very high but you also said "the benchmark score has no visible changes"
> indicating the TLB shootdown cost is not a major problem for the workload.
> It does not mean we should ignore it though.
>
>> > <SNIP the parts that are not problematic>
>> >
>> > If migration is attempted, then the time until the migration PTE is
>> > created is variable. The page has to be isolated from the LRU so there
>> > could be contention on the LRU lock, a new page has to be allocated and
>> > that allocation potentially has to enter the page allocator slow path
>> > etc. During that time, parallel threads make forward progress but with
>> > the patch, multiple threads potentially attempt the allocation and fail
>> > instead of doing real work.
>> 
>> If my understanding of the code is correct, only the first thread will
>> attempt the isolation and allocation.  Because TestClearPageLRU() is
>> called in
>> 
>>   migrate_misplaced_page()
>>     numamigrate_isolate_page()
>>       isolate_lru_page()
>> 
>> And migrate_misplaced_page() will return 0 immediately if
>> TestClearPageLRU() returns false.  Then the second thread will make the
>> page accessible and make forward progress.
>> 
>
> Ok, that's true. While additional work is done, the cost is reasonably
> low -- lower than I initially imagined and with fewer side-effects.
>
>> But there's still some timing difference between the original and
>> patched kernel.  We have several choices to reduce the difference.
>> 
>> 1. Check PageLRU() with PTL held in do_numa_page()
>> 
>> If PageLRU() returns false, do_numa_page() can make the page accessible
>> first.  So the second thread will make the page accessible earlier.
>> 
>> 2. Try to lock the page with PTL held in do_numa_page()
>> 
>> If the try-locking succeeds, it's the first thread, so it can delay
>> mapping.  If the try-locking fails, it may be the second thread, so it
>> will make the page accessible first.  We need to teach
>> migrate_misplaced_page() to work with the page locked.  This will
>> enlarge the duration for which the page is locked.  Is that a problem?
>> 
>> 3. Check page_count() with PTL held in do_numa_page()
>> 
>> The first thread will call get_page() in numa_migrate_prep().  So if the
>> second thread can detect that, it can make the page accessible first.
>> The difficulty is that it appears hard to identify the expected
>> page_count() for the file pages.  For anonymous pages, that is much
>> easier, so at least if a page passes the following test, we can delay
>> mapping,
>> 
>>     PageAnon(page) && page_count(page) == page_mapcount(page) + !!PageSwapCache(page)
>> 
>> This will disable the optimization for the file pages.  But it may be
>> good enough?
>> 
>> Which one do you think is better?  Maybe the first one is good enough?
>> 
>
> The first one is probably the most straightforward but it's more
> important to figure out why interrupts were higher with at least one
> workload when the exact opposite is expected. Investigating which of
> options 1-3 is best and whether it's worth the duplicated check could
> be done as a separate patch.
>
>> > You should consider the following question -- is the potential saving
>> > of an IPI transmission enough to offset the cost of parallel accesses
>> > not making forward progress while one migration is setup and having
>> > different migration attempts collide?
>> >
>> > I have tests running just in case but I think the answer may be "no".
> >> > So far only one useful test has completed (specjbb2005 with one VM per NUMA
>> > node) and it showed a mix of small gains and losses but with *higher*
>> > interrupts contrary to what was expected from the changelog.
>> 
>> That is hard to understand.  Maybe it is caused by the bug you pointed
>> out (about was_writable)?
>> 
>
> It's possible and I could not figure out what the rationale behind the
> change was :/
>
> Fix it and run it through your tests to make sure it works as you
> expect. Assuming it passes your tests and it's posted, I'll read it again
> and run it through a battery of tests. If it shows that interrupts are
> lower and is either neutral or improves performance in enough cases then
> I think it'll be ok. Even if it's only neutral in terms of performance
> but interrupts are lower, it'll be acceptable.

Will do it.  Thanks a lot for your help!

Best Regards,
Huang, Ying
Mel Gorman April 1, 2021, 8:38 a.m. UTC | #6
On Wed, Mar 31, 2021 at 09:36:04AM -0700, Nadav Amit wrote:
> 
> 
> > On Mar 31, 2021, at 6:16 AM, Mel Gorman <mgorman@suse.de> wrote:
> > 
> > On Wed, Mar 31, 2021 at 07:20:09PM +0800, Huang, Ying wrote:
> >> Mel Gorman <mgorman@suse.de> writes:
> >> 
> >>> On Mon, Mar 29, 2021 at 02:26:51PM +0800, Huang Ying wrote:
> >>>> For NUMA balancing, in the hint page fault handler, the faulting
> >>>> page is migrated to the accessing node if necessary.  During the
> >>>> migration, the TLB is shot down on all CPUs that the process has
> >>>> run on recently, because in the hint page fault handler the PTE is
> >>>> made accessible before the migration is attempted.  The overhead of
> >>>> the TLB shootdown is high, so it is better avoided if possible.  In
> >>>> fact, it can be avoided if we delay mapping the page in the PTE
> >>>> until after the migration is attempted.  That is what this patch does.
> >>>> 
> >>> 
> >>> Why would the overhead be high? It was previously inaccessible so it's
> >>> only parallel accesses making forward progress that trigger the need
> >>> for a flush.
> >> 
> >> Sorry, I don't understand this.  Although the page is inaccessible, the
> >> threads may access other pages, so TLB flushing is still necessary.
> >> 
> > 
> > You assert the overhead of TLB shootdown is high and yes, it can be
> > very high but you also said "the benchmark score has no visible changes"
> > indicating the TLB shootdown cost is not a major problem for the workload.
> > It does not mean we should ignore it though.
> 
> If you are looking for a benchmark that is negatively affected by NUMA
> balancing, then IIRC Parsec’s dedup is such a workload. [1]
> 

A few questions:

Is Parsec impaired due to NUMA balancing in general or due to TLB
shootdowns specifically?

Are you using "gcc-pthreads" for parallelisation and the "native" size
for Parsec?

Is there any specific thread count that matters either in
absolute terms or as a percentage of online CPUs?
Nadav Amit April 1, 2021, 7:21 p.m. UTC | #7
> On Apr 1, 2021, at 1:38 AM, Mel Gorman <mgorman@suse.de> wrote:
> 
> On Wed, Mar 31, 2021 at 09:36:04AM -0700, Nadav Amit wrote:
>> 
>> 
>>> On Mar 31, 2021, at 6:16 AM, Mel Gorman <mgorman@suse.de> wrote:
>>> 
>>> On Wed, Mar 31, 2021 at 07:20:09PM +0800, Huang, Ying wrote:
>>>> Mel Gorman <mgorman@suse.de> writes:
>>>> 
>>>>> On Mon, Mar 29, 2021 at 02:26:51PM +0800, Huang Ying wrote:
>>>>>> For NUMA balancing, in the hint page fault handler, the faulting
>>>>>> page is migrated to the accessing node if necessary.  During the
>>>>>> migration, the TLB is shot down on all CPUs that the process has
>>>>>> run on recently, because in the hint page fault handler the PTE is
>>>>>> made accessible before the migration is attempted.  The overhead of
>>>>>> the TLB shootdown is high, so it is better avoided if possible.  In
>>>>>> fact, it can be avoided if we delay mapping the page in the PTE
>>>>>> until after the migration is attempted.  That is what this patch does.
>>>>>> 
>>>>> 
>>>>> Why would the overhead be high? It was previously inaccessible so it's
>>>>> only parallel accesses making forward progress that trigger the need
>>>>> for a flush.
>>>> 
>>>> Sorry, I don't understand this.  Although the page is inaccessible, the
>>>> threads may access other pages, so TLB flushing is still necessary.
>>>> 
>>> 
>>> You assert the overhead of TLB shootdown is high and yes, it can be
>>> very high but you also said "the benchmark score has no visible changes"
>>> indicating the TLB shootdown cost is not a major problem for the workload.
>>> It does not mean we should ignore it though.
>> 
>> If you are looking for a benchmark that is negatively affected by NUMA
>> balancing, then IIRC Parsec’s dedup is such a workload. [1]
>> 
> 
> A few questions:
> 
> Is Parsec impaired due to NUMA balancing in general or due to TLB
> shootdowns specifically?

TLB shootdowns specifically.

> 
> Are you using "gcc-pthreads" for parallelisation and the "native" size
> for Parsec?

Native, as it is the biggest workload, so the impact is most apparent
there.  I don’t remember whether I played with the threading model
parameters.

> 
> Is there any specific thread count that matters either in
> absolute terms or as a percentage of online CPUs?

IIRC, when the thread count matches the number of CPUs (or is perhaps
slightly lower), the impact is greatest.
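
(From memory, that corresponds to an invocation along the lines of

	parsecmgmt -a run -p dedup -c gcc-pthreads -i native -n <nr_cpus>

with <nr_cpus> set to the number of online CPUs.)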

Patch

diff --git a/mm/memory.c b/mm/memory.c
index d3273bd69dbb..a9a8ed1ac06c 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -4148,29 +4148,17 @@  static vm_fault_t do_numa_page(struct vm_fault *vmf)
 		goto out;
 	}
 
-	/*
-	 * Make it present again, Depending on how arch implementes non
-	 * accessible ptes, some can allow access by kernel mode.
-	 */
-	old_pte = ptep_modify_prot_start(vma, vmf->address, vmf->pte);
+	/* Get the normal PTE  */
+	old_pte = ptep_get(vmf->pte);
 	pte = pte_modify(old_pte, vma->vm_page_prot);
-	pte = pte_mkyoung(pte);
-	if (was_writable)
-		pte = pte_mkwrite(pte);
-	ptep_modify_prot_commit(vma, vmf->address, vmf->pte, old_pte, pte);
-	update_mmu_cache(vma, vmf->address, vmf->pte);
 
 	page = vm_normal_page(vma, vmf->address, pte);
-	if (!page) {
-		pte_unmap_unlock(vmf->pte, vmf->ptl);
-		return 0;
-	}
+	if (!page)
+		goto out_map;
 
 	/* TODO: handle PTE-mapped THP */
-	if (PageCompound(page)) {
-		pte_unmap_unlock(vmf->pte, vmf->ptl);
-		return 0;
-	}
+	if (PageCompound(page))
+		goto out_map;
 
 	/*
 	 * Avoid grouping on RO pages in general. RO pages shouldn't hurt as
@@ -4180,7 +4168,7 @@  static vm_fault_t do_numa_page(struct vm_fault *vmf)
 	 * pte_dirty has unpredictable behaviour between PTE scan updates,
 	 * background writeback, dirty balancing and application behaviour.
 	 */
-	if (!pte_write(pte))
+	if (was_writable)
 		flags |= TNF_NO_GROUP;
 
 	/*
@@ -4194,23 +4182,45 @@  static vm_fault_t do_numa_page(struct vm_fault *vmf)
 	page_nid = page_to_nid(page);
 	target_nid = numa_migrate_prep(page, vma, vmf->address, page_nid,
 			&flags);
-	pte_unmap_unlock(vmf->pte, vmf->ptl);
 	if (target_nid == NUMA_NO_NODE) {
 		put_page(page);
-		goto out;
+		goto out_map;
 	}
+	pte_unmap_unlock(vmf->pte, vmf->ptl);
 
 	/* Migrate to the requested node */
 	if (migrate_misplaced_page(page, vma, target_nid)) {
 		page_nid = target_nid;
 		flags |= TNF_MIGRATED;
-	} else
+	} else {
 		flags |= TNF_MIGRATE_FAIL;
+		vmf->pte = pte_offset_map(vmf->pmd, vmf->address);
+		spin_lock(vmf->ptl);
+		if (unlikely(!pte_same(*vmf->pte, vmf->orig_pte))) {
+			pte_unmap_unlock(vmf->pte, vmf->ptl);
+			goto out;
+		}
+		goto out_map;
+	}
 
 out:
 	if (page_nid != NUMA_NO_NODE)
 		task_numa_fault(last_cpupid, page_nid, 1, flags);
 	return 0;
+out_map:
+	/*
+	 * Make it present again, Depending on how arch implementes non
+	 * accessible ptes, some can allow access by kernel mode.
+	 */
+	old_pte = ptep_modify_prot_start(vma, vmf->address, vmf->pte);
+	pte = pte_modify(old_pte, vma->vm_page_prot);
+	pte = pte_mkyoung(pte);
+	if (was_writable)
+		pte = pte_mkwrite(pte);
+	ptep_modify_prot_commit(vma, vmf->address, vmf->pte, old_pte, pte);
+	update_mmu_cache(vma, vmf->address, vmf->pte);
+	pte_unmap_unlock(vmf->pte, vmf->ptl);
+	goto out;
 }
 
 static inline vm_fault_t create_huge_pmd(struct vm_fault *vmf)