
mm: mmu_gather: remove __tlb_reset_range() for force flush

Message ID 1557264889-109594-1-git-send-email-yang.shi@linux.alibaba.com (mailing list archive)
State New, archived
Series: mm: mmu_gather: remove __tlb_reset_range() for force flush

Commit Message

Yang Shi May 7, 2019, 9:34 p.m. UTC
A few new fields were added to mmu_gather to make TLB flushing smarter for
huge pages by tracking which levels of the page table have changed.

__tlb_reset_range() resets all of this page table state to "unchanged".
It is called by the TLB flush code when mapping changes happen in
parallel on the same range under a non-exclusive lock (i.e. read
mmap_sem).  Before commit dd2283f2605e ("mm: mmap: zap pages with read
mmap_sem in munmap"), MADV_DONTNEED was the only operation that could
zap pages in parallel, and it does not remove page tables.  But the
aforementioned commit may run munmap() under read mmap_sem and free page
tables.  This causes a bug [1] reported by Jan Stancek, since
__tlb_reset_range() may pass the wrong page table state to the
architecture-specific TLB flush operations.
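
For reference, the helper being removed from this path simply forgets all of
the gathered page-table state; at the time it looked roughly like the sketch
below (simplified, details may differ slightly between trees):

	static void __tlb_reset_range(struct mmu_gather *tlb)
	{
		if (tlb->fullmm) {
			tlb->start = tlb->end = ~0;
		} else {
			tlb->start = TASK_SIZE;
			tlb->end = 0;
		}
		/* forget which page-table levels were touched/freed */
		tlb->freed_tables = 0;
		tlb->cleared_ptes = 0;
		tlb->cleared_pmds = 0;
		tlb->cleared_puds = 0;
		tlb->cleared_p4ds = 0;
	}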

So, removing __tlb_reset_range() sounds sane.  This may cause more TLB
flushing for MADV_DONTNEED, but MADV_DONTNEED is not called very often,
hence the impact should be negligible.

The original proposed fix came from Jan Stancek, who did most of the
debugging of this issue; I just wrapped everything up together.

[1] https://lore.kernel.org/linux-mm/342bf1fd-f1bf-ed62-1127-e911b5032274@linux.alibaba.com/T/#m7a2ab6c878d5a256560650e56189cfae4e73217f

Reported-by: Jan Stancek <jstancek@redhat.com>
Tested-by: Jan Stancek <jstancek@redhat.com>
Cc: Will Deacon <will.deacon@arm.com>
Cc: stable@vger.kernel.org
Signed-off-by: Yang Shi <yang.shi@linux.alibaba.com>
Signed-off-by: Jan Stancek <jstancek@redhat.com>
---
 mm/mmu_gather.c | 7 ++++---
 1 file changed, 4 insertions(+), 3 deletions(-)

Comments

Will Deacon May 9, 2019, 8:37 a.m. UTC | #1
Hi all, [+Peter]

Apologies for the delay; I'm attending a conference this week so it's tricky
to keep up with email.

On Wed, May 08, 2019 at 05:34:49AM +0800, Yang Shi wrote:
> A few new fields were added to mmu_gather to make TLB flush smarter for
> huge page by telling what level of page table is changed.
> 
> __tlb_reset_range() is used to reset all these page table state to
> unchanged, which is called by TLB flush for parallel mapping changes for
> the same range under non-exclusive lock (i.e. read mmap_sem).  Before
> commit dd2283f2605e ("mm: mmap: zap pages with read mmap_sem in
> munmap"), MADV_DONTNEED is the only one who may do page zapping in
> parallel and it doesn't remove page tables.  But, the forementioned commit
> may do munmap() under read mmap_sem and free page tables.  This causes a
> bug [1] reported by Jan Stancek since __tlb_reset_range() may pass the
> wrong page table state to architecture specific TLB flush operations.

Yikes. Is it actually safe to run free_pgtables() concurrently for a given
mm?

> So, removing __tlb_reset_range() sounds sane.  This may cause more TLB
> flush for MADV_DONTNEED, but it should be not called very often, hence
> the impact should be negligible.
> 
> The original proposed fix came from Jan Stancek who mainly debugged this
> issue, I just wrapped up everything together.

I'm still paging the nested flush logic back in, but I have some comments on
the patch below.

> [1] https://lore.kernel.org/linux-mm/342bf1fd-f1bf-ed62-1127-e911b5032274@linux.alibaba.com/T/#m7a2ab6c878d5a256560650e56189cfae4e73217f
> 
> Reported-by: Jan Stancek <jstancek@redhat.com>
> Tested-by: Jan Stancek <jstancek@redhat.com>
> Cc: Will Deacon <will.deacon@arm.com>
> Cc: stable@vger.kernel.org
> Signed-off-by: Yang Shi <yang.shi@linux.alibaba.com>
> Signed-off-by: Jan Stancek <jstancek@redhat.com>
> ---
>  mm/mmu_gather.c | 7 ++++---
>  1 file changed, 4 insertions(+), 3 deletions(-)
> 
> diff --git a/mm/mmu_gather.c b/mm/mmu_gather.c
> index 99740e1..9fd5272 100644
> --- a/mm/mmu_gather.c
> +++ b/mm/mmu_gather.c
> @@ -249,11 +249,12 @@ void tlb_finish_mmu(struct mmu_gather *tlb,
>  	 * flush by batching, a thread has stable TLB entry can fail to flush

Urgh, we should rewrite this comment while we're here so that it makes sense...

>  	 * the TLB by observing pte_none|!pte_dirty, for example so flush TLB
>  	 * forcefully if we detect parallel PTE batching threads.
> +	 *
> +	 * munmap() may change mapping under non-excluse lock and also free
> +	 * page tables.  Do not call __tlb_reset_range() for it.
>  	 */
> -	if (mm_tlb_flush_nested(tlb->mm)) {
> -		__tlb_reset_range(tlb);
> +	if (mm_tlb_flush_nested(tlb->mm))
>  		__tlb_adjust_range(tlb, start, end - start);
> -	}

I don't think we can elide the call to __tlb_reset_range() entirely, since I
think we do want to clear the freed_pXX bits to ensure that we walk the
range with the smallest mapping granule that we have. Otherwise couldn't we
have a problem if we hit a PMD that had been cleared, but the TLB
invalidation for the PTEs that used to be linked below it was still pending?

Perhaps we should just set fullmm if we see that there's a concurrent
unmapper rather than do a worst-case range invalidation. Do you have a feeling
for how often mm_tlb_flush_nested() triggers in practice?
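
For illustration only, the alternative being floated here might look roughly
like the sketch below (untested; whether it is safe to force fullmm this late
in tlb_finish_mmu() is exactly the open question):

	if (mm_tlb_flush_nested(tlb->mm)) {
		/*
		 * A concurrent unmapper was active: give up on the gathered
		 * state and have the arch do a full-mm invalidation.
		 */
		tlb->fullmm = 1;
		__tlb_adjust_range(tlb, start, end - start);
	}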

Will
Peter Zijlstra May 9, 2019, 10:38 a.m. UTC | #2
On Thu, May 09, 2019 at 09:37:26AM +0100, Will Deacon wrote:
> Hi all, [+Peter]

Right, mm/mmu_gather.c has a MAINTAINERS entry; use it.

Also added Nadav and Minchan who've poked at this issue before. And Mel,
because he loves these things :-)

> Apologies for the delay; I'm attending a conference this week so it's tricky
> to keep up with email.
> 
> On Wed, May 08, 2019 at 05:34:49AM +0800, Yang Shi wrote:
> > A few new fields were added to mmu_gather to make TLB flush smarter for
> > huge page by telling what level of page table is changed.
> > 
> > __tlb_reset_range() is used to reset all these page table state to
> > unchanged, which is called by TLB flush for parallel mapping changes for
> > the same range under non-exclusive lock (i.e. read mmap_sem).  Before
> > commit dd2283f2605e ("mm: mmap: zap pages with read mmap_sem in
> > munmap"), MADV_DONTNEED is the only one who may do page zapping in
> > parallel and it doesn't remove page tables.  But, the forementioned commit
> > may do munmap() under read mmap_sem and free page tables.  This causes a
> > bug [1] reported by Jan Stancek since __tlb_reset_range() may pass the

Please don't _EVER_ refer to external sources to describe the actual bug
a patch is fixing. That is the primary purpose of the Changelog.

Worse, the email you reference does _NOT_ describe the actual problem.
Nor do you.

> > wrong page table state to architecture specific TLB flush operations.
> 
> Yikes. Is it actually safe to run free_pgtables() concurrently for a given
> mm?

Yeah.. sorta.. it's been a source of 'interesting' things. This really
isn't the first issue here.

Also, change_protection_range() is 'fun' too.

> > So, removing __tlb_reset_range() sounds sane.  This may cause more TLB
> > flush for MADV_DONTNEED, but it should be not called very often, hence
> > the impact should be negligible.
> > 
> > The original proposed fix came from Jan Stancek who mainly debugged this
> > issue, I just wrapped up everything together.
> 
> I'm still paging the nested flush logic back in, but I have some comments on
> the patch below.
> 
> > [1] https://lore.kernel.org/linux-mm/342bf1fd-f1bf-ed62-1127-e911b5032274@linux.alibaba.com/T/#m7a2ab6c878d5a256560650e56189cfae4e73217f
> > 
> > Reported-by: Jan Stancek <jstancek@redhat.com>
> > Tested-by: Jan Stancek <jstancek@redhat.com>
> > Cc: Will Deacon <will.deacon@arm.com>
> > Cc: stable@vger.kernel.org
> > Signed-off-by: Yang Shi <yang.shi@linux.alibaba.com>
> > Signed-off-by: Jan Stancek <jstancek@redhat.com>
> > ---
> >  mm/mmu_gather.c | 7 ++++---
> >  1 file changed, 4 insertions(+), 3 deletions(-)
> > 
> > diff --git a/mm/mmu_gather.c b/mm/mmu_gather.c
> > index 99740e1..9fd5272 100644
> > --- a/mm/mmu_gather.c
> > +++ b/mm/mmu_gather.c
> > @@ -249,11 +249,12 @@ void tlb_finish_mmu(struct mmu_gather *tlb,
> >  	 * flush by batching, a thread has stable TLB entry can fail to flush
> 
> Urgh, we should rewrite this comment while we're here so that it makes sense...

Yeah, that's atrocious. We should put the actual race in there.

> >  	 * the TLB by observing pte_none|!pte_dirty, for example so flush TLB
> >  	 * forcefully if we detect parallel PTE batching threads.
> > +	 *
> > +	 * munmap() may change mapping under non-excluse lock and also free
> > +	 * page tables.  Do not call __tlb_reset_range() for it.
> >  	 */
> > -	if (mm_tlb_flush_nested(tlb->mm)) {
> > -		__tlb_reset_range(tlb);
> > +	if (mm_tlb_flush_nested(tlb->mm))
> >  		__tlb_adjust_range(tlb, start, end - start);
> > -	}
> 
> I don't think we can elide the call __tlb_reset_range() entirely, since I
> think we do want to clear the freed_pXX bits to ensure that we walk the
> range with the smallest mapping granule that we have. Otherwise couldn't we
> have a problem if we hit a PMD that had been cleared, but the TLB
> invalidation for the PTEs that used to be linked below it was still pending?

That's tlb->cleared_p*, and yes agreed. That is, right until some
architecture has level dependent TLBI instructions, at which point we'll
need to have them all set instead of cleared.

> Perhaps we should just set fullmm if we see that here's a concurrent
> unmapper rather than do a worst-case range invalidation. Do you have a feeling
> for often the mm_tlb_flush_nested() triggers in practice?

Quite a bit for certain workloads I imagine, that was the whole point of
doing it.

Anyway; am I correct in understanding that the actual problem is that
we've cleared freed_tables and the ARM64 tlb_flush() will then not
invalidate the cache and badness happens?

Because so far nobody has actually provided a coherent description of
the actual problem we're trying to solve. But I'm thinking something
like the below ought to do.


diff --git a/mm/mmu_gather.c b/mm/mmu_gather.c
index 99740e1dd273..fe768f8d612e 100644
--- a/mm/mmu_gather.c
+++ b/mm/mmu_gather.c
@@ -244,15 +244,20 @@ void tlb_finish_mmu(struct mmu_gather *tlb,
 		unsigned long start, unsigned long end)
 {
 	/*
-	 * If there are parallel threads are doing PTE changes on same range
-	 * under non-exclusive lock(e.g., mmap_sem read-side) but defer TLB
-	 * flush by batching, a thread has stable TLB entry can fail to flush
-	 * the TLB by observing pte_none|!pte_dirty, for example so flush TLB
-	 * forcefully if we detect parallel PTE batching threads.
+	 * Sensible comment goes here..
 	 */
-	if (mm_tlb_flush_nested(tlb->mm)) {
-		__tlb_reset_range(tlb);
-		__tlb_adjust_range(tlb, start, end - start);
+	if (mm_tlb_flush_nested(tlb->mm) && !tlb->full_mm) {
+		/*
+		 * Since we're can't tell what we actually should have
+		 * flushed flush everything in the given range.
+		 */
+		tlb->start = start;
+		tlb->end = end;
+		tlb->freed_tables = 1;
+		tlb->cleared_ptes = 1;
+		tlb->cleared_pmds = 1;
+		tlb->cleared_puds = 1;
+		tlb->cleared_p4ds = 1;
 	}
 
 	tlb_flush_mmu(tlb);
Peter Zijlstra May 9, 2019, 10:54 a.m. UTC | #3
On Thu, May 09, 2019 at 12:38:13PM +0200, Peter Zijlstra wrote:

> That's tlb->cleared_p*, and yes agreed. That is, right until some
> architecture has level dependent TLBI instructions, at which point we'll
> need to have them all set instead of cleared.

> Anyway; am I correct in understanding that the actual problem is that
> we've cleared freed_tables and the ARM64 tlb_flush() will then not
> invalidate the cache and badness happens?
> 
> Because so far nobody has actually provided a coherent description of
> the actual problem we're trying to solve. But I'm thinking something
> like the below ought to do.

There's another 'fun' issue I think. For architectures like ARM that
have range invalidation and care about VM_EXEC for I$ invalidation, the
below doesn't quite work right either.

I suspect we also have to force: tlb->vma_exec = 1.

And I don't think there's an architecture that cares, but depending on
details I can construct cases where any setting of tlb->vm_hugetlb is
wrong, so that is _awesome_. But I suspect the sane thing for now is to
force it to 0.
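
Concretely, on top of the earlier sketch that forces the range and the
cleared_* bits, that would mean also doing something like the following
(field names as in the mmu_gather of that era; a sketch, not a tested fix):

		/* be pessimistic about the VMA attributes as well */
		tlb->vma_exec = 1;	/* a missed TLBI may have been for an executable VMA */
		tlb->vma_huge = 0;	/* don't pretend we know it was hugetlb */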

> diff --git a/mm/mmu_gather.c b/mm/mmu_gather.c
> index 99740e1dd273..fe768f8d612e 100644
> --- a/mm/mmu_gather.c
> +++ b/mm/mmu_gather.c
> @@ -244,15 +244,20 @@ void tlb_finish_mmu(struct mmu_gather *tlb,
>  		unsigned long start, unsigned long end)
>  {
>  	/*
> -	 * If there are parallel threads are doing PTE changes on same range
> -	 * under non-exclusive lock(e.g., mmap_sem read-side) but defer TLB
> -	 * flush by batching, a thread has stable TLB entry can fail to flush
> -	 * the TLB by observing pte_none|!pte_dirty, for example so flush TLB
> -	 * forcefully if we detect parallel PTE batching threads.
> +	 * Sensible comment goes here..
>  	 */
> -	if (mm_tlb_flush_nested(tlb->mm)) {
> -		__tlb_reset_range(tlb);
> -		__tlb_adjust_range(tlb, start, end - start);
> +	if (mm_tlb_flush_nested(tlb->mm) && !tlb->full_mm) {
> +		/*
> +		 * Since we're can't tell what we actually should have
> +		 * flushed flush everything in the given range.
> +		 */
> +		tlb->start = start;
> +		tlb->end = end;
> +		tlb->freed_tables = 1;
> +		tlb->cleared_ptes = 1;
> +		tlb->cleared_pmds = 1;
> +		tlb->cleared_puds = 1;
> +		tlb->cleared_p4ds = 1;
>  	}
>  
>  	tlb_flush_mmu(tlb);
Jan Stancek May 9, 2019, 12:44 p.m. UTC | #4
> > I don't think we can elide the call __tlb_reset_range() entirely, since I
> > think we do want to clear the freed_pXX bits to ensure that we walk the
> > range with the smallest mapping granule that we have. Otherwise couldn't we
> > have a problem if we hit a PMD that had been cleared, but the TLB
> > invalidation for the PTEs that used to be linked below it was still
> > pending?
> 
> That's tlb->cleared_p*, and yes agreed. That is, right until some
> architecture has level dependent TLBI instructions, at which point we'll
> need to have them all set instead of cleared.
> 
> > Perhaps we should just set fullmm if we see that here's a concurrent
> > unmapper rather than do a worst-case range invalidation. Do you have a
> > feeling
> > for often the mm_tlb_flush_nested() triggers in practice?
> 
> Quite a bit for certain workloads I imagine, that was the whole point of
> doing it.
> 
> Anyway; am I correct in understanding that the actual problem is that
> we've cleared freed_tables and the ARM64 tlb_flush() will then not
> invalidate the cache and badness happens?

That is my understanding: only the last level is flushed, which is not enough.

> 
> Because so far nobody has actually provided a coherent description of
> the actual problem we're trying to solve. But I'm thinking something
> like the below ought to do.

I applied it (and fixed a small typo: s/tlb->full_mm/tlb->fullmm/).
It fixes the problem for me.

> 
> 
> diff --git a/mm/mmu_gather.c b/mm/mmu_gather.c
> index 99740e1dd273..fe768f8d612e 100644
> --- a/mm/mmu_gather.c
> +++ b/mm/mmu_gather.c
> @@ -244,15 +244,20 @@ void tlb_finish_mmu(struct mmu_gather *tlb,
>  		unsigned long start, unsigned long end)
>  {
>  	/*
> -	 * If there are parallel threads are doing PTE changes on same range
> -	 * under non-exclusive lock(e.g., mmap_sem read-side) but defer TLB
> -	 * flush by batching, a thread has stable TLB entry can fail to flush
> -	 * the TLB by observing pte_none|!pte_dirty, for example so flush TLB
> -	 * forcefully if we detect parallel PTE batching threads.
> +	 * Sensible comment goes here..
>  	 */
> -	if (mm_tlb_flush_nested(tlb->mm)) {
> -		__tlb_reset_range(tlb);
> -		__tlb_adjust_range(tlb, start, end - start);
> +	if (mm_tlb_flush_nested(tlb->mm) && !tlb->full_mm) {
> +		/*
> +		 * Since we're can't tell what we actually should have
> +		 * flushed flush everything in the given range.
> +		 */
> +		tlb->start = start;
> +		tlb->end = end;
> +		tlb->freed_tables = 1;
> +		tlb->cleared_ptes = 1;
> +		tlb->cleared_pmds = 1;
> +		tlb->cleared_puds = 1;
> +		tlb->cleared_p4ds = 1;
>  	}
>  
>  	tlb_flush_mmu(tlb);
>
Nadav Amit May 9, 2019, 5:36 p.m. UTC | #5
> On May 9, 2019, at 3:38 AM, Peter Zijlstra <peterz@infradead.org> wrote:
> 
> On Thu, May 09, 2019 at 09:37:26AM +0100, Will Deacon wrote:
>> Hi all, [+Peter]
> 
> Right, mm/mmu_gather.c has a MAINTAINERS entry; use it.
> 
> Also added Nadav and Minchan who've poked at this issue before. And Mel,
> because he loves these things :-)
> 
>> Apologies for the delay; I'm attending a conference this week so it's tricky
>> to keep up with email.
>> 
>> On Wed, May 08, 2019 at 05:34:49AM +0800, Yang Shi wrote:
>>> A few new fields were added to mmu_gather to make TLB flush smarter for
>>> huge page by telling what level of page table is changed.
>>> 
>>> __tlb_reset_range() is used to reset all these page table state to
>>> unchanged, which is called by TLB flush for parallel mapping changes for
>>> the same range under non-exclusive lock (i.e. read mmap_sem).  Before
>>> commit dd2283f2605e ("mm: mmap: zap pages with read mmap_sem in
>>> munmap"), MADV_DONTNEED is the only one who may do page zapping in
>>> parallel and it doesn't remove page tables.  But, the forementioned commit
>>> may do munmap() under read mmap_sem and free page tables.  This causes a
>>> bug [1] reported by Jan Stancek since __tlb_reset_range() may pass the
> 
> Please don't _EVER_ refer to external sources to describe the actual bug
> a patch is fixing. That is the primary purpose of the Changelog.
> 
> Worse, the email you reference does _NOT_ describe the actual problem.
> Nor do you.
> 
>>> wrong page table state to architecture specific TLB flush operations.
>> 
>> Yikes. Is it actually safe to run free_pgtables() concurrently for a given
>> mm?
> 
> Yeah.. sorta.. it's been a source of 'interesting' things. This really
> isn't the first issue here.
> 
> Also, change_protection_range() is 'fun' too.
> 
>>> So, removing __tlb_reset_range() sounds sane.  This may cause more TLB
>>> flush for MADV_DONTNEED, but it should be not called very often, hence
>>> the impact should be negligible.
>>> 
>>> The original proposed fix came from Jan Stancek who mainly debugged this
>>> issue, I just wrapped up everything together.
>> 
>> I'm still paging the nested flush logic back in, but I have some comments on
>> the patch below.
>> 
>>> [1] https://lore.kernel.org/linux-mm/342bf1fd-f1bf-ed62-1127-e911b5032274@linux.alibaba.com/T/#m7a2ab6c878d5a256560650e56189cfae4e73217f
>>> 
>>> Reported-by: Jan Stancek <jstancek@redhat.com>
>>> Tested-by: Jan Stancek <jstancek@redhat.com>
>>> Cc: Will Deacon <will.deacon@arm.com>
>>> Cc: stable@vger.kernel.org
>>> Signed-off-by: Yang Shi <yang.shi@linux.alibaba.com>
>>> Signed-off-by: Jan Stancek <jstancek@redhat.com>
>>> ---
>>> mm/mmu_gather.c | 7 ++++---
>>> 1 file changed, 4 insertions(+), 3 deletions(-)
>>> 
>>> diff --git a/mm/mmu_gather.c b/mm/mmu_gather.c
>>> index 99740e1..9fd5272 100644
>>> --- a/mm/mmu_gather.c
>>> +++ b/mm/mmu_gather.c
>>> @@ -249,11 +249,12 @@ void tlb_finish_mmu(struct mmu_gather *tlb,
>>> 	 * flush by batching, a thread has stable TLB entry can fail to flush
>> 
>> Urgh, we should rewrite this comment while we're here so that it makes sense...
> 
> Yeah, that's atrocious. We should put the actual race in there.
> 
>>>  * the TLB by observing pte_none|!pte_dirty, for example so flush TLB
>>> 	 * forcefully if we detect parallel PTE batching threads.
>>> +	 *
>>> +	 * munmap() may change mapping under non-excluse lock and also free
>>> +	 * page tables.  Do not call __tlb_reset_range() for it.
>>> 	 */
>>> -	if (mm_tlb_flush_nested(tlb->mm)) {
>>> -		__tlb_reset_range(tlb);
>>> +	if (mm_tlb_flush_nested(tlb->mm))
>>> 		__tlb_adjust_range(tlb, start, end - start);
>>> -	}
>> 
>> I don't think we can elide the call __tlb_reset_range() entirely, since I
>> think we do want to clear the freed_pXX bits to ensure that we walk the
>> range with the smallest mapping granule that we have. Otherwise couldn't we
>> have a problem if we hit a PMD that had been cleared, but the TLB
>> invalidation for the PTEs that used to be linked below it was still pending?
> 
> That's tlb->cleared_p*, and yes agreed. That is, right until some
> architecture has level dependent TLBI instructions, at which point we'll
> need to have them all set instead of cleared.
> 
>> Perhaps we should just set fullmm if we see that here's a concurrent
>> unmapper rather than do a worst-case range invalidation. Do you have a feeling
>> for often the mm_tlb_flush_nested() triggers in practice?
> 
> Quite a bit for certain workloads I imagine, that was the whole point of
> doing it.
> 
> Anyway; am I correct in understanding that the actual problem is that
> we've cleared freed_tables and the ARM64 tlb_flush() will then not
> invalidate the cache and badness happens?
> 
> Because so far nobody has actually provided a coherent description of
> the actual problem we're trying to solve. But I'm thinking something
> like the below ought to do.
> 
> 
> diff --git a/mm/mmu_gather.c b/mm/mmu_gather.c
> index 99740e1dd273..fe768f8d612e 100644
> --- a/mm/mmu_gather.c
> +++ b/mm/mmu_gather.c
> @@ -244,15 +244,20 @@ void tlb_finish_mmu(struct mmu_gather *tlb,
> 		unsigned long start, unsigned long end)
> {
> 	/*
> -	 * If there are parallel threads are doing PTE changes on same range
> -	 * under non-exclusive lock(e.g., mmap_sem read-side) but defer TLB
> -	 * flush by batching, a thread has stable TLB entry can fail to flush
> -	 * the TLB by observing pte_none|!pte_dirty, for example so flush TLB
> -	 * forcefully if we detect parallel PTE batching threads.
> +	 * Sensible comment goes here..
> 	 */
> -	if (mm_tlb_flush_nested(tlb->mm)) {
> -		__tlb_reset_range(tlb);
> -		__tlb_adjust_range(tlb, start, end - start);
> +	if (mm_tlb_flush_nested(tlb->mm) && !tlb->full_mm) {
> +		/*
> +		 * Since we're can't tell what we actually should have
> +		 * flushed flush everything in the given range.
> +		 */
> +		tlb->start = start;
> +		tlb->end = end;
> +		tlb->freed_tables = 1;
> +		tlb->cleared_ptes = 1;
> +		tlb->cleared_pmds = 1;
> +		tlb->cleared_puds = 1;
> +		tlb->cleared_p4ds = 1;
> 	}
> 
> 	tlb_flush_mmu(tlb);

As a simple optimization, I think it is possible to hold multiple nesting
counters in the mm, similar to tlb_flush_pending, for freed_tables,
cleared_ptes, etc.

The first time you set tlb->freed_tables, you also atomically increase
mm->tlb_flush_freed_tables. Then, in tlb_flush_mmu(), you just use
mm->tlb_flush_freed_tables instead of tlb->freed_tables.
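
A rough sketch of that suggestion, with hypothetical field names (nothing
below exists in the kernel; it is only meant to make the idea concrete):

	/* hypothetical mm_struct addition, mirroring tlb_flush_pending */
	atomic_t tlb_flush_freed_tables;

	/* the first local transition bumps the mm-wide counter ... */
	if (!tlb->freed_tables) {
		tlb->freed_tables = 1;
		atomic_inc(&tlb->mm->tlb_flush_freed_tables);
	}

	/* ... which is dropped again when the gather finishes, and the
	 * flush side reads the mm-wide counter instead of tlb->freed_tables */
	if (tlb->freed_tables)
		atomic_dec(&tlb->mm->tlb_flush_freed_tables);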
Yang Shi May 9, 2019, 6:22 p.m. UTC | #6
On 5/9/19 3:38 AM, Peter Zijlstra wrote:
> On Thu, May 09, 2019 at 09:37:26AM +0100, Will Deacon wrote:
>> Hi all, [+Peter]
> Right, mm/mmu_gather.c has a MAINTAINERS entry; use it.

Sorry for that, I didn't realize we have mmu_gather maintainers. I
should have run get_maintainer.pl.

>
> Also added Nadav and Minchan who've poked at this issue before. And Mel,
> because he loves these things :-)
>
>> Apologies for the delay; I'm attending a conference this week so it's tricky
>> to keep up with email.
>>
>> On Wed, May 08, 2019 at 05:34:49AM +0800, Yang Shi wrote:
>>> A few new fields were added to mmu_gather to make TLB flush smarter for
>>> huge page by telling what level of page table is changed.
>>>
>>> __tlb_reset_range() is used to reset all these page table state to
>>> unchanged, which is called by TLB flush for parallel mapping changes for
>>> the same range under non-exclusive lock (i.e. read mmap_sem).  Before
>>> commit dd2283f2605e ("mm: mmap: zap pages with read mmap_sem in
>>> munmap"), MADV_DONTNEED is the only one who may do page zapping in
>>> parallel and it doesn't remove page tables.  But, the forementioned commit
>>> may do munmap() under read mmap_sem and free page tables.  This causes a
>>> bug [1] reported by Jan Stancek since __tlb_reset_range() may pass the
> Please don't _EVER_ refer to external sources to describe the actual bug
> a patch is fixing. That is the primary purpose of the Changelog.
>
> Worse, the email you reference does _NOT_ describe the actual problem.
> Nor do you.

Sure, will articulate the real bug in the commit log.

>
>>> wrong page table state to architecture specific TLB flush operations.
>> Yikes. Is it actually safe to run free_pgtables() concurrently for a given
>> mm?
> Yeah.. sorta.. it's been a source of 'interesting' things. This really
> isn't the first issue here.
>
> Also, change_protection_range() is 'fun' too.
>
>>> So, removing __tlb_reset_range() sounds sane.  This may cause more TLB
>>> flush for MADV_DONTNEED, but it should be not called very often, hence
>>> the impact should be negligible.
>>>
>>> The original proposed fix came from Jan Stancek who mainly debugged this
>>> issue, I just wrapped up everything together.
>> I'm still paging the nested flush logic back in, but I have some comments on
>> the patch below.
>>
>>> [1] https://lore.kernel.org/linux-mm/342bf1fd-f1bf-ed62-1127-e911b5032274@linux.alibaba.com/T/#m7a2ab6c878d5a256560650e56189cfae4e73217f
>>>
>>> Reported-by: Jan Stancek <jstancek@redhat.com>
>>> Tested-by: Jan Stancek <jstancek@redhat.com>
>>> Cc: Will Deacon <will.deacon@arm.com>
>>> Cc: stable@vger.kernel.org
>>> Signed-off-by: Yang Shi <yang.shi@linux.alibaba.com>
>>> Signed-off-by: Jan Stancek <jstancek@redhat.com>
>>> ---
>>>   mm/mmu_gather.c | 7 ++++---
>>>   1 file changed, 4 insertions(+), 3 deletions(-)
>>>
>>> diff --git a/mm/mmu_gather.c b/mm/mmu_gather.c
>>> index 99740e1..9fd5272 100644
>>> --- a/mm/mmu_gather.c
>>> +++ b/mm/mmu_gather.c
>>> @@ -249,11 +249,12 @@ void tlb_finish_mmu(struct mmu_gather *tlb,
>>>   	 * flush by batching, a thread has stable TLB entry can fail to flush
>> Urgh, we should rewrite this comment while we're here so that it makes sense...
> Yeah, that's atrocious. We should put the actual race in there.
>
>>>   	 * the TLB by observing pte_none|!pte_dirty, for example so flush TLB
>>>   	 * forcefully if we detect parallel PTE batching threads.
>>> +	 *
>>> +	 * munmap() may change mapping under non-excluse lock and also free
>>> +	 * page tables.  Do not call __tlb_reset_range() for it.
>>>   	 */
>>> -	if (mm_tlb_flush_nested(tlb->mm)) {
>>> -		__tlb_reset_range(tlb);
>>> +	if (mm_tlb_flush_nested(tlb->mm))
>>>   		__tlb_adjust_range(tlb, start, end - start);
>>> -	}
>> I don't think we can elide the call __tlb_reset_range() entirely, since I
>> think we do want to clear the freed_pXX bits to ensure that we walk the
>> range with the smallest mapping granule that we have. Otherwise couldn't we
>> have a problem if we hit a PMD that had been cleared, but the TLB
>> invalidation for the PTEs that used to be linked below it was still pending?
> That's tlb->cleared_p*, and yes agreed. That is, right until some
> architecture has level dependent TLBI instructions, at which point we'll
> need to have them all set instead of cleared.
>
>> Perhaps we should just set fullmm if we see that here's a concurrent
>> unmapper rather than do a worst-case range invalidation. Do you have a feeling
>> for often the mm_tlb_flush_nested() triggers in practice?
> Quite a bit for certain workloads I imagine, that was the whole point of
> doing it.
>
> Anyway; am I correct in understanding that the actual problem is that
> we've cleared freed_tables and the ARM64 tlb_flush() will then not
> invalidate the cache and badness happens?

Yes.

>
> Because so far nobody has actually provided a coherent description of
> the actual problem we're trying to solve. But I'm thinking something
> like the below ought to do.

Thanks for the suggestion, will do in v2.

>
>
> diff --git a/mm/mmu_gather.c b/mm/mmu_gather.c
> index 99740e1dd273..fe768f8d612e 100644
> --- a/mm/mmu_gather.c
> +++ b/mm/mmu_gather.c
> @@ -244,15 +244,20 @@ void tlb_finish_mmu(struct mmu_gather *tlb,
>   		unsigned long start, unsigned long end)
>   {
>   	/*
> -	 * If there are parallel threads are doing PTE changes on same range
> -	 * under non-exclusive lock(e.g., mmap_sem read-side) but defer TLB
> -	 * flush by batching, a thread has stable TLB entry can fail to flush
> -	 * the TLB by observing pte_none|!pte_dirty, for example so flush TLB
> -	 * forcefully if we detect parallel PTE batching threads.
> +	 * Sensible comment goes here..
>   	 */
> -	if (mm_tlb_flush_nested(tlb->mm)) {
> -		__tlb_reset_range(tlb);
> -		__tlb_adjust_range(tlb, start, end - start);
> +	if (mm_tlb_flush_nested(tlb->mm) && !tlb->full_mm) {
> +		/*
> +		 * Since we're can't tell what we actually should have
> +		 * flushed flush everything in the given range.
> +		 */
> +		tlb->start = start;
> +		tlb->end = end;
> +		tlb->freed_tables = 1;
> +		tlb->cleared_ptes = 1;
> +		tlb->cleared_pmds = 1;
> +		tlb->cleared_puds = 1;
> +		tlb->cleared_p4ds = 1;
>   	}
>   
>   	tlb_flush_mmu(tlb);
Peter Zijlstra May 9, 2019, 6:24 p.m. UTC | #7
On Thu, May 09, 2019 at 05:36:29PM +0000, Nadav Amit wrote:
> > On May 9, 2019, at 3:38 AM, Peter Zijlstra <peterz@infradead.org> wrote:

> > diff --git a/mm/mmu_gather.c b/mm/mmu_gather.c
> > index 99740e1dd273..fe768f8d612e 100644
> > --- a/mm/mmu_gather.c
> > +++ b/mm/mmu_gather.c
> > @@ -244,15 +244,20 @@ void tlb_finish_mmu(struct mmu_gather *tlb,
> > 		unsigned long start, unsigned long end)
> > {
> > 	/*
> > -	 * If there are parallel threads are doing PTE changes on same range
> > -	 * under non-exclusive lock(e.g., mmap_sem read-side) but defer TLB
> > -	 * flush by batching, a thread has stable TLB entry can fail to flush
> > -	 * the TLB by observing pte_none|!pte_dirty, for example so flush TLB
> > -	 * forcefully if we detect parallel PTE batching threads.
> > +	 * Sensible comment goes here..
> > 	 */
> > -	if (mm_tlb_flush_nested(tlb->mm)) {
> > -		__tlb_reset_range(tlb);
> > -		__tlb_adjust_range(tlb, start, end - start);
> > +	if (mm_tlb_flush_nested(tlb->mm) && !tlb->full_mm) {
> > +		/*
> > +		 * Since we're can't tell what we actually should have
> > +		 * flushed flush everything in the given range.
> > +		 */
> > +		tlb->start = start;
> > +		tlb->end = end;
> > +		tlb->freed_tables = 1;
> > +		tlb->cleared_ptes = 1;
> > +		tlb->cleared_pmds = 1;
> > +		tlb->cleared_puds = 1;
> > +		tlb->cleared_p4ds = 1;
> > 	}
> > 
> > 	tlb_flush_mmu(tlb);
> 
> As a simple optimization, I think it is possible to hold multiple nesting
> counters in the mm, similar to tlb_flush_pending, for freed_tables,
> cleared_ptes, etc.
> 
> The first time you set tlb->freed_tables, you also atomically increase
> mm->tlb_flush_freed_tables. Then, in tlb_flush_mmu(), you just use
> mm->tlb_flush_freed_tables instead of tlb->freed_tables.

That sounds fraught with races and expensive; I would much prefer to not
go there for this arguably rare case.

Consider such fun cases as where CPU-0 sees and clears a PTE, CPU-1
races and doesn't see that PTE. Therefore CPU-0 sets and counts
cleared_ptes. Then if CPU-1 flushes while CPU-0 is still in mmu_gather,
it will see cleared_ptes count increased and flush that granularity,
OTOH if CPU-1 flushes after CPU-0 completes, it will not and potentially
miss an invalidate it should have had.

This whole concurrent mmu_gather stuff is horrible.

  /me ponders more....

So I think the fundamental race here is this:

	CPU-0				CPU-1

	tlb_gather_mmu(.start=1,	tlb_gather_mmu(.start=2,
		       .end=3);			       .end=4);

	ptep_get_and_clear_full(2)
	tlb_remove_tlb_entry(2);
	__tlb_remove_page();
					if (pte_present(2)) // nope

					tlb_finish_mmu();

					// continue without TLBI(2)
					// whoopsie

	tlb_finish_mmu();
	  tlb_flush()		->	TLBI(2)


And we can fix that by having tlb_finish_mmu() sync up. Never let a
concurrent tlb_finish_mmu() complete until all concurrent mmu_gathers
have completed.

This should not be too hard to make happen.
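
A minimal sketch of that sync-up, along the lines of the dec_tlb_flush_pending()
change posted later in this thread (untested; uses wait_var_event()/wake_up_var(),
and the unconditional wait/wakeup is exactly the cost debated below):

	static inline void dec_tlb_flush_pending(struct mm_struct *mm)
	{
		/*
		 * The last gather out wakes the waiters; everyone else waits
		 * until all concurrent mmu_gathers on this mm have finished.
		 */
		if (atomic_dec_and_test(&mm->tlb_flush_pending)) {
			wake_up_var(&mm->tlb_flush_pending);
		} else {
			wait_var_event(&mm->tlb_flush_pending,
				       !atomic_read_acquire(&mm->tlb_flush_pending));
		}
	}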
Yang Shi May 9, 2019, 6:35 p.m. UTC | #8
On 5/9/19 3:54 AM, Peter Zijlstra wrote:
> On Thu, May 09, 2019 at 12:38:13PM +0200, Peter Zijlstra wrote:
>
>> That's tlb->cleared_p*, and yes agreed. That is, right until some
>> architecture has level dependent TLBI instructions, at which point we'll
>> need to have them all set instead of cleared.
>> Anyway; am I correct in understanding that the actual problem is that
>> we've cleared freed_tables and the ARM64 tlb_flush() will then not
>> invalidate the cache and badness happens?
>>
>> Because so far nobody has actually provided a coherent description of
>> the actual problem we're trying to solve. But I'm thinking something
>> like the below ought to do.
> There's another 'fun' issue I think. For architectures like ARM that
> have range invalidation and care about VM_EXEC for I$ invalidation, the
> below doesn't quite work right either.
>
> I suspect we also have to force: tlb->vma_exec = 1.

Isn't the below code in tlb_flush enough to guarantee this?

...
} else if (tlb->end) {
                struct vm_area_struct vma = {
                        .vm_mm = tlb->mm,
                        .vm_flags = (tlb->vma_exec ? VM_EXEC    : 0) |
                                    (tlb->vma_huge ? VM_HUGETLB : 0),
                };
...

>
> And I don't think there's an architecture that cares, but depending on
> details I can construct cases where any setting of tlb->vm_hugetlb is
> wrong, so that is _awesome_. But I suspect the sane thing for now is to
> force it 0.
>
>> diff --git a/mm/mmu_gather.c b/mm/mmu_gather.c
>> index 99740e1dd273..fe768f8d612e 100644
>> --- a/mm/mmu_gather.c
>> +++ b/mm/mmu_gather.c
>> @@ -244,15 +244,20 @@ void tlb_finish_mmu(struct mmu_gather *tlb,
>>   		unsigned long start, unsigned long end)
>>   {
>>   	/*
>> -	 * If there are parallel threads are doing PTE changes on same range
>> -	 * under non-exclusive lock(e.g., mmap_sem read-side) but defer TLB
>> -	 * flush by batching, a thread has stable TLB entry can fail to flush
>> -	 * the TLB by observing pte_none|!pte_dirty, for example so flush TLB
>> -	 * forcefully if we detect parallel PTE batching threads.
>> +	 * Sensible comment goes here..
>>   	 */
>> -	if (mm_tlb_flush_nested(tlb->mm)) {
>> -		__tlb_reset_range(tlb);
>> -		__tlb_adjust_range(tlb, start, end - start);
>> +	if (mm_tlb_flush_nested(tlb->mm) && !tlb->full_mm) {
>> +		/*
>> +		 * Since we're can't tell what we actually should have
>> +		 * flushed flush everything in the given range.
>> +		 */
>> +		tlb->start = start;
>> +		tlb->end = end;
>> +		tlb->freed_tables = 1;
>> +		tlb->cleared_ptes = 1;
>> +		tlb->cleared_pmds = 1;
>> +		tlb->cleared_puds = 1;
>> +		tlb->cleared_p4ds = 1;
>>   	}
>>   
>>   	tlb_flush_mmu(tlb);
Peter Zijlstra May 9, 2019, 6:40 p.m. UTC | #9
On Thu, May 09, 2019 at 11:35:55AM -0700, Yang Shi wrote:
> 
> 
> On 5/9/19 3:54 AM, Peter Zijlstra wrote:
> > On Thu, May 09, 2019 at 12:38:13PM +0200, Peter Zijlstra wrote:
> > 
> > > That's tlb->cleared_p*, and yes agreed. That is, right until some
> > > architecture has level dependent TLBI instructions, at which point we'll
> > > need to have them all set instead of cleared.
> > > Anyway; am I correct in understanding that the actual problem is that
> > > we've cleared freed_tables and the ARM64 tlb_flush() will then not
> > > invalidate the cache and badness happens?
> > > 
> > > Because so far nobody has actually provided a coherent description of
> > > the actual problem we're trying to solve. But I'm thinking something
> > > like the below ought to do.
> > There's another 'fun' issue I think. For architectures like ARM that
> > have range invalidation and care about VM_EXEC for I$ invalidation, the
> > below doesn't quite work right either.
> > 
> > I suspect we also have to force: tlb->vma_exec = 1.
> 
> Isn't the below code in tlb_flush enough to guarantee this?
> 
> ...
> } else if (tlb->end) {
>                struct vm_area_struct vma = {
>                        .vm_mm = tlb->mm,
>                        .vm_flags = (tlb->vma_exec ? VM_EXEC    : 0) |
>                                    (tlb->vma_huge ? VM_HUGETLB : 0),
>                };

Only when vma_exec is actually set... and there is no guarantee of that
in the concurrent path (the last VMA we iterate might not be executable,
but the TLBI we've missed might have been).

More specifically, the 'fun' case is when we have no present pages in the
whole executable mapping: in that case tlb->end == 0 and we never call into
the arch code, never giving it a chance to flush the I$.
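
For context, the generic code of that era bails out before ever reaching the
architecture when nothing was gathered; roughly (simplified sketch of
tlb_flush_mmu_tlbonly() in mm/mmu_gather.c):

	static void tlb_flush_mmu_tlbonly(struct mmu_gather *tlb)
	{
		/*
		 * Nothing gathered (tlb->end == 0): return without calling
		 * tlb_flush(), so the arch never gets a chance to do the
		 * I$ maintenance either.
		 */
		if (!tlb->end)
			return;

		tlb_flush(tlb);
		__tlb_reset_range(tlb);
	}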
Yang Shi May 9, 2019, 7:10 p.m. UTC | #10
On 5/9/19 11:24 AM, Peter Zijlstra wrote:
> On Thu, May 09, 2019 at 05:36:29PM +0000, Nadav Amit wrote:
>>> On May 9, 2019, at 3:38 AM, Peter Zijlstra <peterz@infradead.org> wrote:
>>> diff --git a/mm/mmu_gather.c b/mm/mmu_gather.c
>>> index 99740e1dd273..fe768f8d612e 100644
>>> --- a/mm/mmu_gather.c
>>> +++ b/mm/mmu_gather.c
>>> @@ -244,15 +244,20 @@ void tlb_finish_mmu(struct mmu_gather *tlb,
>>> 		unsigned long start, unsigned long end)
>>> {
>>> 	/*
>>> -	 * If there are parallel threads are doing PTE changes on same range
>>> -	 * under non-exclusive lock(e.g., mmap_sem read-side) but defer TLB
>>> -	 * flush by batching, a thread has stable TLB entry can fail to flush
>>> -	 * the TLB by observing pte_none|!pte_dirty, for example so flush TLB
>>> -	 * forcefully if we detect parallel PTE batching threads.
>>> +	 * Sensible comment goes here..
>>> 	 */
>>> -	if (mm_tlb_flush_nested(tlb->mm)) {
>>> -		__tlb_reset_range(tlb);
>>> -		__tlb_adjust_range(tlb, start, end - start);
>>> +	if (mm_tlb_flush_nested(tlb->mm) && !tlb->full_mm) {
>>> +		/*
>>> +		 * Since we're can't tell what we actually should have
>>> +		 * flushed flush everything in the given range.
>>> +		 */
>>> +		tlb->start = start;
>>> +		tlb->end = end;
>>> +		tlb->freed_tables = 1;
>>> +		tlb->cleared_ptes = 1;
>>> +		tlb->cleared_pmds = 1;
>>> +		tlb->cleared_puds = 1;
>>> +		tlb->cleared_p4ds = 1;
>>> 	}
>>>
>>> 	tlb_flush_mmu(tlb);
>> As a simple optimization, I think it is possible to hold multiple nesting
>> counters in the mm, similar to tlb_flush_pending, for freed_tables,
>> cleared_ptes, etc.
>>
>> The first time you set tlb->freed_tables, you also atomically increase
>> mm->tlb_flush_freed_tables. Then, in tlb_flush_mmu(), you just use
>> mm->tlb_flush_freed_tables instead of tlb->freed_tables.
> That sounds fraught with races and expensive; I would much prefer to not
> go there for this arguably rare case.
>
> Consider such fun cases as where CPU-0 sees and clears a PTE, CPU-1
> races and doesn't see that PTE. Therefore CPU-0 sets and counts
> cleared_ptes. Then if CPU-1 flushes while CPU-0 is still in mmu_gather,
> it will see cleared_ptes count increased and flush that granularity,
> OTOH if CPU-1 flushes after CPU-0 completes, it will not and potentiall
> miss an invalidate it should have had.
>
> This whole concurrent mmu_gather stuff is horrible.
>
>    /me ponders more....
>
> So I think the fundamental race here is this:
>
> 	CPU-0				CPU-1
>
> 	tlb_gather_mmu(.start=1,	tlb_gather_mmu(.start=2,
> 		       .end=3);			       .end=4);
>
> 	ptep_get_and_clear_full(2)
> 	tlb_remove_tlb_entry(2);
> 	__tlb_remove_page();
> 					if (pte_present(2)) // nope
>
> 					tlb_finish_mmu();
>
> 					// continue without TLBI(2)
> 					// whoopsie
>
> 	tlb_finish_mmu();
> 	  tlb_flush()		->	TLBI(2)

I'm not quite sure if this is the case Jan actually hit. But, according to
his testing, once tlb->freed_tables and tlb->cleared_* are set correctly, his
test works well.

>
>
> And we can fix that by having tlb_finish_mmu() sync up. Never let a
> concurrent tlb_finish_mmu() complete until all concurrenct mmu_gathers
> have completed.

Not sure if this will scale well.

>
> This should not be too hard to make happen.
Peter Zijlstra May 9, 2019, 7:56 p.m. UTC | #11
On Thu, May 09, 2019 at 12:38:13PM +0200, Peter Zijlstra wrote:

> diff --git a/mm/mmu_gather.c b/mm/mmu_gather.c
> index 99740e1dd273..fe768f8d612e 100644
> --- a/mm/mmu_gather.c
> +++ b/mm/mmu_gather.c
> @@ -244,15 +244,20 @@ void tlb_finish_mmu(struct mmu_gather *tlb,
>  		unsigned long start, unsigned long end)
>  {
>  	/*
> -	 * If there are parallel threads are doing PTE changes on same range
> -	 * under non-exclusive lock(e.g., mmap_sem read-side) but defer TLB
> -	 * flush by batching, a thread has stable TLB entry can fail to flush
> -	 * the TLB by observing pte_none|!pte_dirty, for example so flush TLB
> -	 * forcefully if we detect parallel PTE batching threads.
> +	 * Sensible comment goes here..
>  	 */
> -	if (mm_tlb_flush_nested(tlb->mm)) {
> -		__tlb_reset_range(tlb);
> -		__tlb_adjust_range(tlb, start, end - start);
> +	if (mm_tlb_flush_nested(tlb->mm) && !tlb->full_mm) {
> +		/*
> +		 * Since we're can't tell what we actually should have
> +		 * flushed flush everything in the given range.
> +		 */
> +		tlb->start = start;
> +		tlb->end = end;
> +		tlb->freed_tables = 1;
> +		tlb->cleared_ptes = 1;
> +		tlb->cleared_pmds = 1;
> +		tlb->cleared_puds = 1;
> +		tlb->cleared_p4ds = 1;
>  	}
>  
>  	tlb_flush_mmu(tlb);

So PPC-radix has page-size dependent TLBI, but the above doesn't work
for them, because they use the tlb_change_page_size() interface and
don't look at tlb->cleared_p*().

Consequently, they have their own special magic :/

Nick, how about you use the tlb_change_page_size() interface to
find/flush on the page-size boundaries, but otherwise use the
tlb->cleared_p* flags to select which actual sizes to flush?

AFAICT that should work just fine for you guys. Maybe something like so?

(fwiw, there are an awful lot of almost identical functions there)

---

diff --git a/arch/powerpc/mm/tlb-radix.c b/arch/powerpc/mm/tlb-radix.c
index 6a23b9ebd2a1..efc99ef78db6 100644
--- a/arch/powerpc/mm/tlb-radix.c
+++ b/arch/powerpc/mm/tlb-radix.c
@@ -692,7 +692,7 @@ static unsigned long tlb_local_single_page_flush_ceiling __read_mostly = POWER9_
 
 static inline void __radix__flush_tlb_range(struct mm_struct *mm,
 					unsigned long start, unsigned long end,
-					bool flush_all_sizes)
+					bool pflush, bool hflush, bool gflush)
 
 {
 	unsigned long pid;
@@ -734,14 +734,9 @@ static inline void __radix__flush_tlb_range(struct mm_struct *mm,
 				_tlbie_pid(pid, RIC_FLUSH_TLB);
 		}
 	} else {
-		bool hflush = flush_all_sizes;
-		bool gflush = flush_all_sizes;
 		unsigned long hstart, hend;
 		unsigned long gstart, gend;
 
-		if (IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE))
-			hflush = true;
-
 		if (hflush) {
 			hstart = (start + PMD_SIZE - 1) & PMD_MASK;
 			hend = end & PMD_MASK;
@@ -758,7 +753,9 @@ static inline void __radix__flush_tlb_range(struct mm_struct *mm,
 
 		asm volatile("ptesync": : :"memory");
 		if (local) {
-			__tlbiel_va_range(start, end, pid, page_size, mmu_virtual_psize);
+			if (pflush)
+				__tlbiel_va_range(start, end, pid,
+						page_size, mmu_virtual_psize);
 			if (hflush)
 				__tlbiel_va_range(hstart, hend, pid,
 						PMD_SIZE, MMU_PAGE_2M);
@@ -767,7 +764,9 @@ static inline void __radix__flush_tlb_range(struct mm_struct *mm,
 						PUD_SIZE, MMU_PAGE_1G);
 			asm volatile("ptesync": : :"memory");
 		} else {
-			__tlbie_va_range(start, end, pid, page_size, mmu_virtual_psize);
+			if (pflush)
+				__tlbie_va_range(start, end, pid,
+						page_size, mmu_virtual_psize);
 			if (hflush)
 				__tlbie_va_range(hstart, hend, pid,
 						PMD_SIZE, MMU_PAGE_2M);
@@ -785,12 +784,17 @@ void radix__flush_tlb_range(struct vm_area_struct *vma, unsigned long start,
 		     unsigned long end)
 
 {
+	bool hflush = false;
+
 #ifdef CONFIG_HUGETLB_PAGE
 	if (is_vm_hugetlb_page(vma))
 		return radix__flush_hugetlb_tlb_range(vma, start, end);
 #endif
 
-	__radix__flush_tlb_range(vma->vm_mm, start, end, false);
+	if (IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE))
+		hflush = true;
+
+	__radix__flush_tlb_range(vma->vm_mm, start, end, true, hflush, false);
 }
 EXPORT_SYMBOL(radix__flush_tlb_range);
 
@@ -881,49 +885,14 @@ void radix__tlb_flush(struct mmu_gather *tlb)
 	 */
 	if (tlb->fullmm) {
 		__flush_all_mm(mm, true);
-#if defined(CONFIG_TRANSPARENT_HUGEPAGE) || defined(CONFIG_HUGETLB_PAGE)
-	} else if (mm_tlb_flush_nested(mm)) {
-		/*
-		 * If there is a concurrent invalidation that is clearing ptes,
-		 * then it's possible this invalidation will miss one of those
-		 * cleared ptes and miss flushing the TLB. If this invalidate
-		 * returns before the other one flushes TLBs, that can result
-		 * in it returning while there are still valid TLBs inside the
-		 * range to be invalidated.
-		 *
-		 * See mm/memory.c:tlb_finish_mmu() for more details.
-		 *
-		 * The solution to this is ensure the entire range is always
-		 * flushed here. The problem for powerpc is that the flushes
-		 * are page size specific, so this "forced flush" would not
-		 * do the right thing if there are a mix of page sizes in
-		 * the range to be invalidated. So use __flush_tlb_range
-		 * which invalidates all possible page sizes in the range.
-		 *
-		 * PWC flush probably is not be required because the core code
-		 * shouldn't free page tables in this path, but accounting
-		 * for the possibility makes us a bit more robust.
-		 *
-		 * need_flush_all is an uncommon case because page table
-		 * teardown should be done with exclusive locks held (but
-		 * after locks are dropped another invalidate could come
-		 * in), it could be optimized further if necessary.
-		 */
-		if (!tlb->need_flush_all)
-			__radix__flush_tlb_range(mm, start, end, true);
-		else
-			radix__flush_all_mm(mm);
-#endif
-	} else if ( (psize = radix_get_mmu_psize(page_size)) == -1) {
-		if (!tlb->need_flush_all)
-			radix__flush_tlb_mm(mm);
-		else
-			radix__flush_all_mm(mm);
 	} else {
 		if (!tlb->need_flush_all)
-			radix__flush_tlb_range_psize(mm, start, end, psize);
+			__radix__flush_tlb_range(mm, start, end,
+					tlb->cleared_pte,
+				        tlb->cleared_pmd,
+					tlb->cleared_pud);
 		else
-			radix__flush_tlb_pwc_range_psize(mm, start, end, psize);
+			radix__flush_all_mm(mm);
 	}
 	tlb->need_flush_all = 0;
 }
Jan Stancek May 9, 2019, 9:06 p.m. UTC | #12
----- Original Message -----
> 
> 
> On 5/9/19 11:24 AM, Peter Zijlstra wrote:
> > On Thu, May 09, 2019 at 05:36:29PM +0000, Nadav Amit wrote:
> >>> On May 9, 2019, at 3:38 AM, Peter Zijlstra <peterz@infradead.org> wrote:
> >>> diff --git a/mm/mmu_gather.c b/mm/mmu_gather.c
> >>> index 99740e1dd273..fe768f8d612e 100644
> >>> --- a/mm/mmu_gather.c
> >>> +++ b/mm/mmu_gather.c
> >>> @@ -244,15 +244,20 @@ void tlb_finish_mmu(struct mmu_gather *tlb,
> >>> 		unsigned long start, unsigned long end)
> >>> {
> >>> 	/*
> >>> -	 * If there are parallel threads are doing PTE changes on same range
> >>> -	 * under non-exclusive lock(e.g., mmap_sem read-side) but defer TLB
> >>> -	 * flush by batching, a thread has stable TLB entry can fail to flush
> >>> -	 * the TLB by observing pte_none|!pte_dirty, for example so flush TLB
> >>> -	 * forcefully if we detect parallel PTE batching threads.
> >>> +	 * Sensible comment goes here..
> >>> 	 */
> >>> -	if (mm_tlb_flush_nested(tlb->mm)) {
> >>> -		__tlb_reset_range(tlb);
> >>> -		__tlb_adjust_range(tlb, start, end - start);
> >>> +	if (mm_tlb_flush_nested(tlb->mm) && !tlb->full_mm) {
> >>> +		/*
> >>> +		 * Since we're can't tell what we actually should have
> >>> +		 * flushed flush everything in the given range.
> >>> +		 */
> >>> +		tlb->start = start;
> >>> +		tlb->end = end;
> >>> +		tlb->freed_tables = 1;
> >>> +		tlb->cleared_ptes = 1;
> >>> +		tlb->cleared_pmds = 1;
> >>> +		tlb->cleared_puds = 1;
> >>> +		tlb->cleared_p4ds = 1;
> >>> 	}
> >>>
> >>> 	tlb_flush_mmu(tlb);
> >> As a simple optimization, I think it is possible to hold multiple nesting
> >> counters in the mm, similar to tlb_flush_pending, for freed_tables,
> >> cleared_ptes, etc.
> >>
> >> The first time you set tlb->freed_tables, you also atomically increase
> >> mm->tlb_flush_freed_tables. Then, in tlb_flush_mmu(), you just use
> >> mm->tlb_flush_freed_tables instead of tlb->freed_tables.
> > That sounds fraught with races and expensive; I would much prefer to not
> > go there for this arguably rare case.
> >
> > Consider such fun cases as where CPU-0 sees and clears a PTE, CPU-1
> > races and doesn't see that PTE. Therefore CPU-0 sets and counts
> > cleared_ptes. Then if CPU-1 flushes while CPU-0 is still in mmu_gather,
> > it will see cleared_ptes count increased and flush that granularity,
> > OTOH if CPU-1 flushes after CPU-0 completes, it will not and potentiall
> > miss an invalidate it should have had.
> >
> > This whole concurrent mmu_gather stuff is horrible.
> >
> >    /me ponders more....
> >
> > So I think the fundamental race here is this:
> >
> > 	CPU-0				CPU-1
> >
> > 	tlb_gather_mmu(.start=1,	tlb_gather_mmu(.start=2,
> > 		       .end=3);			       .end=4);
> >
> > 	ptep_get_and_clear_full(2)
> > 	tlb_remove_tlb_entry(2);
> > 	__tlb_remove_page();
> > 					if (pte_present(2)) // nope
> >
> > 					tlb_finish_mmu();
> >
> > 					// continue without TLBI(2)
> > 					// whoopsie
> >
> > 	tlb_finish_mmu();
> > 	  tlb_flush()		->	TLBI(2)
> 
> I'm not quite sure if this is the case Jan really met. But, according to
> his test, once correct tlb->freed_tables and tlb->cleared_* are set, his
> test works well.

My theory was following sequence:

t1: map_write_unmap()                 t2: dummy()

  map_address = mmap()
  map_address[i] = 'b'
  munmap(map_address)
  downgrade_write(&mm->mmap_sem);
  unmap_region()
  tlb_gather_mmu()
    inc_tlb_flush_pending(tlb->mm);
  free_pgtables()
    tlb->freed_tables = 1
    tlb->cleared_pmds = 1

                                        pthread_exit()
                                        madvise(thread_stack, 8M, MADV_DONTNEED)
                                          zap_page_range()
                                            tlb_gather_mmu()
                                              inc_tlb_flush_pending(tlb->mm);

  tlb_finish_mmu()
    if (mm_tlb_flush_nested(tlb->mm))
      __tlb_reset_range()
        tlb->freed_tables = 0
        tlb->cleared_pmds = 0
    __flush_tlb_range(last_level = 0)
  ...
  map_address = mmap()
    map_address[i] = 'b'
      <page fault loop>
      # PTE appeared valid to me,
      # so I suspected stale TLB entry at higher level as result of "freed_tables = 0"


I'm happy to apply/run any debug patches to get more data that would help.

> 
> >
> >
> > And we can fix that by having tlb_finish_mmu() sync up. Never let a
> > concurrent tlb_finish_mmu() complete until all concurrenct mmu_gathers
> > have completed.
> 
> Not sure if this will scale well.
> 
> >
> > This should not be too hard to make happen.
> 
>
Nadav Amit May 9, 2019, 9:21 p.m. UTC | #13
[ Restoring the recipients after mistakenly pressing reply instead of
reply-all ]

> On May 9, 2019, at 12:11 PM, Peter Zijlstra <peterz@infradead.org> wrote:
> 
> On Thu, May 09, 2019 at 06:50:00PM +0000, Nadav Amit wrote:
>>> On May 9, 2019, at 11:24 AM, Peter Zijlstra <peterz@infradead.org> wrote:
>>> 
>>> On Thu, May 09, 2019 at 05:36:29PM +0000, Nadav Amit wrote:
> 
>>>> As a simple optimization, I think it is possible to hold multiple nesting
>>>> counters in the mm, similar to tlb_flush_pending, for freed_tables,
>>>> cleared_ptes, etc.
>>>> 
>>>> The first time you set tlb->freed_tables, you also atomically increase
>>>> mm->tlb_flush_freed_tables. Then, in tlb_flush_mmu(), you just use
>>>> mm->tlb_flush_freed_tables instead of tlb->freed_tables.
>>> 
>>> That sounds fraught with races and expensive; I would much prefer to not
>>> go there for this arguably rare case.
>>> 
>>> Consider such fun cases as where CPU-0 sees and clears a PTE, CPU-1
>>> races and doesn't see that PTE. Therefore CPU-0 sets and counts
>>> cleared_ptes. Then if CPU-1 flushes while CPU-0 is still in mmu_gather,
>>> it will see cleared_ptes count increased and flush that granularity,
>>> OTOH if CPU-1 flushes after CPU-0 completes, it will not and potentiall
>>> miss an invalidate it should have had.
>> 
>> CPU-0 would send a TLB shootdown request to CPU-1 when it is done, so I
>> don’t see the problem. The TLB shootdown mechanism is independent of the
>> mmu_gather for the matter.
> 
> Duh.. I still don't like those unconditional mm wide atomic counters.
> 
>>> This whole concurrent mmu_gather stuff is horrible.
>>> 
>>> /me ponders more....
>>> 
>>> So I think the fundamental race here is this:
>>> 
>>> 	CPU-0				CPU-1
>>> 
>>> 	tlb_gather_mmu(.start=1,	tlb_gather_mmu(.start=2,
>>> 		       .end=3);			       .end=4);
>>> 
>>> 	ptep_get_and_clear_full(2)
>>> 	tlb_remove_tlb_entry(2);
>>> 	__tlb_remove_page();
>>> 					if (pte_present(2)) // nope
>>> 
>>> 					tlb_finish_mmu();
>>> 
>>> 					// continue without TLBI(2)
>>> 					// whoopsie
>>> 
>>> 	tlb_finish_mmu();
>>> 	  tlb_flush()		->	TLBI(2)
>>> 
>>> 
>>> And we can fix that by having tlb_finish_mmu() sync up. Never let a
>>> concurrent tlb_finish_mmu() complete until all concurrenct mmu_gathers
>>> have completed.
>>> 
>>> This should not be too hard to make happen.
>> 
>> This synchronization sounds much more expensive than what I proposed. But I
>> agree that cache-lines that move from one CPU to another might become an
>> issue. But I think that the scheme I suggested would minimize this overhead.
> 
> Well, it would have a lot more unconditional atomic ops. My scheme only
> waits when there is actual concurrency.

Well, something has to give. I didn't think the atomic op would be too
expensive if the same core does it.

> I _think_ something like the below ought to work, but its not even been
> near a compiler. The only problem is the unconditional wakeup; we can
> play games to avoid that if we want to continue with this.
> 
> Ideally we'd only do this when there's been actual overlap, but I've not
> found a sensible way to detect that.
> 
> diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
> index 4ef4bbe78a1d..b70e35792d29 100644
> --- a/include/linux/mm_types.h
> +++ b/include/linux/mm_types.h
> @@ -590,7 +590,12 @@ static inline void dec_tlb_flush_pending(struct mm_struct *mm)
> 	 *
> 	 * Therefore we must rely on tlb_flush_*() to guarantee order.
> 	 */
> -	atomic_dec(&mm->tlb_flush_pending);
> +	if (atomic_dec_and_test(&mm->tlb_flush_pending)) {
> +		wake_up_var(&mm->tlb_flush_pending);
> +	} else {
> +		wait_event_var(&mm->tlb_flush_pending,
> +			       !atomic_read_acquire(&mm->tlb_flush_pending));
> +	}
> }

It still seems very expensive to me, at least for certain workloads (e.g.,
Apache with multithreaded MPM).

It may be possible to avoid false-positive nesting indications (when the
flushes do not overlap) by creating a new struct mmu_gather_pending, with
something like:

  struct mmu_gather_pending {
 	u64 start;
	u64 end;
	struct mmu_gather_pending *next;
  }

tlb_finish_mmu() would then iterate over the mm->mmu_gather_pending
(pointing to the linked list) and find whether there is any overlap. This
would still require synchronization (acquiring a lock when allocating and
deallocating or something fancier).
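
A rough sketch of that overlap walk, with hypothetical names matching the
struct above (the locking is hand-waved, exactly as noted):

	/* hypothetical: does any other pending gather overlap [start, end)? */
	static bool mm_gather_overlaps(struct mm_struct *mm, u64 start, u64 end)
	{
		struct mmu_gather_pending *p;
		bool overlap = false;

		spin_lock(&mm->mmu_gather_pending_lock);	/* hypothetical lock */
		for (p = mm->mmu_gather_pending; p; p = p->next) {
			if (p->start < end && start < p->end) {
				overlap = true;
				break;
			}
		}
		spin_unlock(&mm->mmu_gather_pending_lock);

		return overlap;
	}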
Yang Shi May 9, 2019, 9:48 p.m. UTC | #14
On 5/9/19 2:06 PM, Jan Stancek wrote:
> ----- Original Message -----
>>
>> On 5/9/19 11:24 AM, Peter Zijlstra wrote:
>>> On Thu, May 09, 2019 at 05:36:29PM +0000, Nadav Amit wrote:
>>>>> On May 9, 2019, at 3:38 AM, Peter Zijlstra <peterz@infradead.org> wrote:
>>>>> diff --git a/mm/mmu_gather.c b/mm/mmu_gather.c
>>>>> index 99740e1dd273..fe768f8d612e 100644
>>>>> --- a/mm/mmu_gather.c
>>>>> +++ b/mm/mmu_gather.c
>>>>> @@ -244,15 +244,20 @@ void tlb_finish_mmu(struct mmu_gather *tlb,
>>>>> 		unsigned long start, unsigned long end)
>>>>> {
>>>>> 	/*
>>>>> -	 * If there are parallel threads are doing PTE changes on same range
>>>>> -	 * under non-exclusive lock(e.g., mmap_sem read-side) but defer TLB
>>>>> -	 * flush by batching, a thread has stable TLB entry can fail to flush
>>>>> -	 * the TLB by observing pte_none|!pte_dirty, for example so flush TLB
>>>>> -	 * forcefully if we detect parallel PTE batching threads.
>>>>> +	 * Sensible comment goes here..
>>>>> 	 */
>>>>> -	if (mm_tlb_flush_nested(tlb->mm)) {
>>>>> -		__tlb_reset_range(tlb);
>>>>> -		__tlb_adjust_range(tlb, start, end - start);
>>>>> +	if (mm_tlb_flush_nested(tlb->mm) && !tlb->full_mm) {
>>>>> +		/*
>>>>> +		 * Since we're can't tell what we actually should have
>>>>> +		 * flushed flush everything in the given range.
>>>>> +		 */
>>>>> +		tlb->start = start;
>>>>> +		tlb->end = end;
>>>>> +		tlb->freed_tables = 1;
>>>>> +		tlb->cleared_ptes = 1;
>>>>> +		tlb->cleared_pmds = 1;
>>>>> +		tlb->cleared_puds = 1;
>>>>> +		tlb->cleared_p4ds = 1;
>>>>> 	}
>>>>>
>>>>> 	tlb_flush_mmu(tlb);
>>>> As a simple optimization, I think it is possible to hold multiple nesting
>>>> counters in the mm, similar to tlb_flush_pending, for freed_tables,
>>>> cleared_ptes, etc.
>>>>
>>>> The first time you set tlb->freed_tables, you also atomically increase
>>>> mm->tlb_flush_freed_tables. Then, in tlb_flush_mmu(), you just use
>>>> mm->tlb_flush_freed_tables instead of tlb->freed_tables.
>>> That sounds fraught with races and expensive; I would much prefer to not
>>> go there for this arguably rare case.
>>>
>>> Consider such fun cases as where CPU-0 sees and clears a PTE, CPU-1
>>> races and doesn't see that PTE. Therefore CPU-0 sets and counts
>>> cleared_ptes. Then if CPU-1 flushes while CPU-0 is still in mmu_gather,
>>> it will see cleared_ptes count increased and flush that granularity,
>>> OTOH if CPU-1 flushes after CPU-0 completes, it will not and potentiall
>>> miss an invalidate it should have had.
>>>
>>> This whole concurrent mmu_gather stuff is horrible.
>>>
>>>     /me ponders more....
>>>
>>> So I think the fundamental race here is this:
>>>
>>> 	CPU-0				CPU-1
>>>
>>> 	tlb_gather_mmu(.start=1,	tlb_gather_mmu(.start=2,
>>> 		       .end=3);			       .end=4);
>>>
>>> 	ptep_get_and_clear_full(2)
>>> 	tlb_remove_tlb_entry(2);
>>> 	__tlb_remove_page();
>>> 					if (pte_present(2)) // nope
>>>
>>> 					tlb_finish_mmu();
>>>
>>> 					// continue without TLBI(2)
>>> 					// whoopsie
>>>
>>> 	tlb_finish_mmu();
>>> 	  tlb_flush()		->	TLBI(2)
>> I'm not quite sure if this is the case Jan really met. But, according to
>> his test, once correct tlb->freed_tables and tlb->cleared_* are set, his
>> test works well.
> My theory was following sequence:
>
> t1: map_write_unmap()                 t2: dummy()
>
>    map_address = mmap()
>    map_address[i] = 'b'
>    munmap(map_address)
>    downgrade_write(&mm->mmap_sem);
>    unmap_region()
>    tlb_gather_mmu()
>      inc_tlb_flush_pending(tlb->mm);
>    free_pgtables()
>      tlb->freed_tables = 1
>      tlb->cleared_pmds = 1
>
>                                          pthread_exit()
>                                          madvise(thread_stack, 8M, MADV_DONTNEED)

I'm not quite familiar with the implementation details of pthread_exit(): 
does pthread_exit() call MADV_DONTNEED all the time? I don't see your 
test calling it. If so this pattern is definitely possible.

>                                            zap_page_range()
>                                              tlb_gather_mmu()
>                                                inc_tlb_flush_pending(tlb->mm);
>
>    tlb_finish_mmu()
>      if (mm_tlb_flush_nested(tlb->mm))
>        __tlb_reset_range()
>          tlb->freed_tables = 0
>          tlb->cleared_pmds = 0
>      __flush_tlb_range(last_level = 0)
>    ...
>    map_address = mmap()
>      map_address[i] = 'b'
>        <page fault loop>
>        # PTE appeared valid to me,
>        # so I suspected stale TLB entry at higher level as result of "freed_tables = 0"
>
>
> I'm happy to apply/run any debug patches to get more data that would help.
>
>>>
>>> And we can fix that by having tlb_finish_mmu() sync up. Never let a
>>> concurrent tlb_finish_mmu() complete until all concurrenct mmu_gathers
>>> have completed.
>> Not sure if this will scale well.
>>
>>> This should not be too hard to make happen.
>>
Jan Stancek May 9, 2019, 10:12 p.m. UTC | #15
----- Original Message -----
> 
> 
> On 5/9/19 2:06 PM, Jan Stancek wrote:
> > ----- Original Message -----
> >>
> >> On 5/9/19 11:24 AM, Peter Zijlstra wrote:
> >>> On Thu, May 09, 2019 at 05:36:29PM +0000, Nadav Amit wrote:
> >>>>> On May 9, 2019, at 3:38 AM, Peter Zijlstra <peterz@infradead.org>
> >>>>> wrote:
> >>>>> diff --git a/mm/mmu_gather.c b/mm/mmu_gather.c
> >>>>> index 99740e1dd273..fe768f8d612e 100644
> >>>>> --- a/mm/mmu_gather.c
> >>>>> +++ b/mm/mmu_gather.c
> >>>>> @@ -244,15 +244,20 @@ void tlb_finish_mmu(struct mmu_gather *tlb,
> >>>>> 		unsigned long start, unsigned long end)
> >>>>> {
> >>>>> 	/*
> >>>>> -	 * If there are parallel threads are doing PTE changes on same range
> >>>>> -	 * under non-exclusive lock(e.g., mmap_sem read-side) but defer TLB
> >>>>> -	 * flush by batching, a thread has stable TLB entry can fail to flush
> >>>>> -	 * the TLB by observing pte_none|!pte_dirty, for example so flush TLB
> >>>>> -	 * forcefully if we detect parallel PTE batching threads.
> >>>>> +	 * Sensible comment goes here..
> >>>>> 	 */
> >>>>> -	if (mm_tlb_flush_nested(tlb->mm)) {
> >>>>> -		__tlb_reset_range(tlb);
> >>>>> -		__tlb_adjust_range(tlb, start, end - start);
> >>>>> +	if (mm_tlb_flush_nested(tlb->mm) && !tlb->full_mm) {
> >>>>> +		/*
> >>>>> +		 * Since we're can't tell what we actually should have
> >>>>> +		 * flushed flush everything in the given range.
> >>>>> +		 */
> >>>>> +		tlb->start = start;
> >>>>> +		tlb->end = end;
> >>>>> +		tlb->freed_tables = 1;
> >>>>> +		tlb->cleared_ptes = 1;
> >>>>> +		tlb->cleared_pmds = 1;
> >>>>> +		tlb->cleared_puds = 1;
> >>>>> +		tlb->cleared_p4ds = 1;
> >>>>> 	}
> >>>>>
> >>>>> 	tlb_flush_mmu(tlb);
> >>>> As a simple optimization, I think it is possible to hold multiple
> >>>> nesting
> >>>> counters in the mm, similar to tlb_flush_pending, for freed_tables,
> >>>> cleared_ptes, etc.
> >>>>
> >>>> The first time you set tlb->freed_tables, you also atomically increase
> >>>> mm->tlb_flush_freed_tables. Then, in tlb_flush_mmu(), you just use
> >>>> mm->tlb_flush_freed_tables instead of tlb->freed_tables.
> >>> That sounds fraught with races and expensive; I would much prefer to not
> >>> go there for this arguably rare case.
> >>>
> >>> Consider such fun cases as where CPU-0 sees and clears a PTE, CPU-1
> >>> races and doesn't see that PTE. Therefore CPU-0 sets and counts
> >>> cleared_ptes. Then if CPU-1 flushes while CPU-0 is still in mmu_gather,
> >>> it will see cleared_ptes count increased and flush that granularity,
> >>> OTOH if CPU-1 flushes after CPU-0 completes, it will not and potentially
> >>> miss an invalidate it should have had.
> >>>
> >>> This whole concurrent mmu_gather stuff is horrible.
> >>>
> >>>     /me ponders more....
> >>>
> >>> So I think the fundamental race here is this:
> >>>
> >>> 	CPU-0				CPU-1
> >>>
> >>> 	tlb_gather_mmu(.start=1,	tlb_gather_mmu(.start=2,
> >>> 		       .end=3);			       .end=4);
> >>>
> >>> 	ptep_get_and_clear_full(2)
> >>> 	tlb_remove_tlb_entry(2);
> >>> 	__tlb_remove_page();
> >>> 					if (pte_present(2)) // nope
> >>>
> >>> 					tlb_finish_mmu();
> >>>
> >>> 					// continue without TLBI(2)
> >>> 					// whoopsie
> >>>
> >>> 	tlb_finish_mmu();
> >>> 	  tlb_flush()		->	TLBI(2)
> >> I'm not quite sure if this is the case Jan really met. But, according to
> >> his test, once correct tlb->freed_tables and tlb->cleared_* are set, his
> >> test works well.
> > My theory was following sequence:
> >
> > t1: map_write_unmap()                 t2: dummy()
> >
> >    map_address = mmap()
> >    map_address[i] = 'b'
> >    munmap(map_address)
> >    downgrade_write(&mm->mmap_sem);
> >    unmap_region()
> >    tlb_gather_mmu()
> >      inc_tlb_flush_pending(tlb->mm);
> >    free_pgtables()
> >      tlb->freed_tables = 1
> >      tlb->cleared_pmds = 1
> >
> >                                          pthread_exit()
> >                                          madvise(thread_stack, 8M,
> >                                          MADV_DONTNEED)
> 
> I'm not quite familiar with the implementation details of pthread_exit():
> does pthread_exit() call MADV_DONTNEED all the time? I don't see your
> test calling it.

It's called by glibc:
  https://sourceware.org/git/?p=glibc.git;a=blob;f=nptl/allocatestack.c;h=fcbc46f0d796abce8d58970d4a1d3df685981e33;hb=refs/heads/master#l380
  https://sourceware.org/git/?p=glibc.git;a=blob;f=nptl/pthread_create.c;h=18b7bbe7659c027dfd7b0ce3b0c83f54a6f15b18;hb=refs/heads/master#l569

(gdb) bt
#0  madvise () at ../sysdeps/unix/syscall-template.S:78
#1  0x0000ffffbe7679f8 in advise_stack_range (guardsize=<optimized out>, pd=281474976706191, size=<optimized out>, mem=0xffffbddd0000)
    at allocatestack.c:392
#2  start_thread (arg=0xffffffffee8f) at pthread_create.c:576
#3  0x0000ffffbe6b157c in thread_start () at ../sysdeps/unix/sysv/linux/aarch64/clone.S:78

Dump of assembler code for function madvise:
=> 0x0000ffffbe6adaf0 <+0>:     mov     x8, #0xe9                       // #233
   0x0000ffffbe6adaf4 <+4>:     svc     #0x0
   0x0000ffffbe6adaf8 <+8>:     cmn     x0, #0xfff
   0x0000ffffbe6adafc <+12>:    b.cs    0xffffbe6adb04 <madvise+20>  // b.hs, b.nlast
   0x0000ffffbe6adb00 <+16>:    ret
   0x0000ffffbe6adb04 <+20>:    b       0xffffbe600e18 <__GI___syscall_error>


> If so this pattern is definitely possible.
> 
> >                                            zap_page_range()
> >                                              tlb_gather_mmu()
> >                                                inc_tlb_flush_pending(tlb->mm);
> >
> >    tlb_finish_mmu()
> >      if (mm_tlb_flush_nested(tlb->mm))
> >        __tlb_reset_range()
> >          tlb->freed_tables = 0
> >          tlb->cleared_pmds = 0
> >      __flush_tlb_range(last_level = 0)
> >    ...
> >    map_address = mmap()
> >      map_address[i] = 'b'
> >        <page fault loop>
> >        # PTE appeared valid to me,
> >        # so I suspected stale TLB entry at higher level as result of
> >        "freed_tables = 0"
> >
> >
> > I'm happy to apply/run any debug patches to get more data that would help.
> >
> >>>
> >>> And we can fix that by having tlb_finish_mmu() sync up. Never let a
> >>> concurrent tlb_finish_mmu() complete until all concurrenct mmu_gathers
> >>> have completed.
> >> Not sure if this will scale well.
> >>
> >>> This should not be too hard to make happen.
> >>
> 
>
Peter Zijlstra May 13, 2019, 8:36 a.m. UTC | #16
On Thu, May 09, 2019 at 09:21:35PM +0000, Nadav Amit wrote:

> >>> And we can fix that by having tlb_finish_mmu() sync up. Never let a
> >>> concurrent tlb_finish_mmu() complete until all concurrenct mmu_gathers
> >>> have completed.
> >>> 
> >>> This should not be too hard to make happen.
> >> 
> >> This synchronization sounds much more expensive than what I proposed. But I
> >> agree that cache-lines that move from one CPU to another might become an
> >> issue. But I think that the scheme I suggested would minimize this overhead.
> > 
> > Well, it would have a lot more unconditional atomic ops. My scheme only
> > waits when there is actual concurrency.
> 
> Well, something has to give. I didn’t think that if the same core does the
> atomic op it would be too expensive.

They're still at least 20 cycles a pop, uncontended.

> > I _think_ something like the below ought to work, but its not even been
> > near a compiler. The only problem is the unconditional wakeup; we can
> > play games to avoid that if we want to continue with this.
> > 
> > Ideally we'd only do this when there's been actual overlap, but I've not
> > found a sensible way to detect that.
> > 
> > diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
> > index 4ef4bbe78a1d..b70e35792d29 100644
> > --- a/include/linux/mm_types.h
> > +++ b/include/linux/mm_types.h
> > @@ -590,7 +590,12 @@ static inline void dec_tlb_flush_pending(struct mm_struct *mm)
> > 	 *
> > 	 * Therefore we must rely on tlb_flush_*() to guarantee order.
> > 	 */
> > -	atomic_dec(&mm->tlb_flush_pending);
> > +	if (atomic_dec_and_test(&mm->tlb_flush_pending)) {
> > +		wake_up_var(&mm->tlb_flush_pending);
> > +	} else {
> > +		wait_event_var(&mm->tlb_flush_pending,
> > +			       !atomic_read_acquire(&mm->tlb_flush_pending));
> > +	}
> > }
> 
> It still seems very expensive to me, at least for certain workloads (e.g.,
> Apache with multithreaded MPM).

Is that Apache-MPM workload triggering this lots? Having a known
benchmark for this stuff is good for when someone has time to play with
things.

> It may be possible to avoid false-positive nesting indications (when the
> flushes do not overlap) by creating a new struct mmu_gather_pending, with
> something like:
> 
>   struct mmu_gather_pending {
>  	u64 start;
> 	u64 end;
> 	struct mmu_gather_pending *next;
>   }
> 
> tlb_finish_mmu() would then iterate over the mm->mmu_gather_pending
> (pointing to the linked list) and find whether there is any overlap. This
> would still require synchronization (acquiring a lock when allocating and
> deallocating or something fancier).

We have an interval_tree for this, and yes, that's how far I got :/

The other thing I was thinking of is trying to detect overlap through
the page-tables themselves, but we have a distinct lack of storage
there.

The thing is, if this threaded monster runs on all CPUs (busy front-end
server) and does a ton of invalidation due to all the short-lived
request crud, then all the extra invalidations will add up too. Having
to do process-wide (machine-wide in this case) invalidations is expensive,
and having to do more of them surely isn't cheap either.

So there might be something to win here.
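
For reference, the kernel's generic interval tree (include/linux/interval_tree.h) could track in-flight flush ranges roughly as below; the per-mm root, the lock, and the function name are made up for the sketch, only the interval_tree_* API is real:

#include <linux/interval_tree.h>
#include <linux/spinlock.h>

/*
 * Sketch only: check whether [start, end) overlaps any range already
 * registered in a hypothetical per-mm tree of pending TLB flushes.
 * interval_tree_iter_first() takes an inclusive 'last', hence end - 1.
 */
static bool pending_flush_overlaps(struct rb_root_cached *pending,
				   spinlock_t *lock,
				   unsigned long start, unsigned long end)
{
	struct interval_tree_node *it;
	bool overlap;

	spin_lock(lock);
	it = interval_tree_iter_first(pending, start, end - 1);
	overlap = it != NULL;
	spin_unlock(lock);

	return overlap;
}
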
Nadav Amit May 13, 2019, 9:11 a.m. UTC | #17
> On May 13, 2019, at 1:36 AM, Peter Zijlstra <peterz@infradead.org> wrote:
> 
> On Thu, May 09, 2019 at 09:21:35PM +0000, Nadav Amit wrote:
> 
>>>>> And we can fix that by having tlb_finish_mmu() sync up. Never let a
>>>>> concurrent tlb_finish_mmu() complete until all concurrenct mmu_gathers
>>>>> have completed.
>>>>> 
>>>>> This should not be too hard to make happen.
>>>> 
>>>> This synchronization sounds much more expensive than what I proposed. But I
>>>> agree that cache-lines that move from one CPU to another might become an
>>>> issue. But I think that the scheme I suggested would minimize this overhead.
>>> 
>>> Well, it would have a lot more unconditional atomic ops. My scheme only
>>> waits when there is actual concurrency.
>> 
>> Well, something has to give. I didn’t think that if the same core does the
>> atomic op it would be too expensive.
> 
> They're still at least 20 cycles a pop, uncontended.
> 
>>> I _think_ something like the below ought to work, but its not even been
>>> near a compiler. The only problem is the unconditional wakeup; we can
>>> play games to avoid that if we want to continue with this.
>>> 
>>> Ideally we'd only do this when there's been actual overlap, but I've not
>>> found a sensible way to detect that.
>>> 
>>> diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
>>> index 4ef4bbe78a1d..b70e35792d29 100644
>>> --- a/include/linux/mm_types.h
>>> +++ b/include/linux/mm_types.h
>>> @@ -590,7 +590,12 @@ static inline void dec_tlb_flush_pending(struct mm_struct *mm)
>>> 	 *
>>> 	 * Therefore we must rely on tlb_flush_*() to guarantee order.
>>> 	 */
>>> -	atomic_dec(&mm->tlb_flush_pending);
>>> +	if (atomic_dec_and_test(&mm->tlb_flush_pending)) {
>>> +		wake_up_var(&mm->tlb_flush_pending);
>>> +	} else {
>>> +		wait_event_var(&mm->tlb_flush_pending,
>>> +			       !atomic_read_acquire(&mm->tlb_flush_pending));
>>> +	}
>>> }
>> 
>> It still seems very expensive to me, at least for certain workloads (e.g.,
>> Apache with multithreaded MPM).
> 
> Is that Apache-MPM workload triggering this lots? Having a known
> benchmark for this stuff is good for when someone has time to play with
> things.

Setting Apache2 with mpm_worker causes every request to go through
mmap-writev-munmap flow on every thread. I didn’t run this workload after
the patches that downgrade the mmap_sem to read before the page-table
zapping were introduced. I presume these patches would allow the page-table
zapping to be done concurrently, and therefore would hit this flow.
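
For context, the per-request pattern being described is roughly the following userspace sketch (not Apache's actual code; the function and parameter names are illustrative):

#include <string.h>
#include <sys/mman.h>
#include <sys/uio.h>
#include <unistd.h>

/* Each worker thread does this for every request it serves. */
static void handle_request(int fd, const char *hdr, size_t hdr_len,
			   size_t body_len)
{
	/* Fresh per-request buffer: one mmap()/munmap() pair per request. */
	char *body = mmap(NULL, body_len, PROT_READ | PROT_WRITE,
			  MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (body == MAP_FAILED)
		return;

	memset(body, 'x', body_len);	/* stand-in for building the response */

	struct iovec iov[2] = {
		{ .iov_base = (void *)hdr, .iov_len = hdr_len },
		{ .iov_base = body,        .iov_len = body_len },
	};
	writev(fd, iov, 2);

	/* munmap() takes mmap_sem and hits the TLB flush paths discussed here. */
	munmap(body, body_len);
}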

>> It may be possible to avoid false-positive nesting indications (when the
>> flushes do not overlap) by creating a new struct mmu_gather_pending, with
>> something like:
>> 
>>  struct mmu_gather_pending {
>> 	u64 start;
>> 	u64 end;
>> 	struct mmu_gather_pending *next;
>>  }
>> 
>> tlb_finish_mmu() would then iterate over the mm->mmu_gather_pending
>> (pointing to the linked list) and find whether there is any overlap. This
>> would still require synchronization (acquiring a lock when allocating and
>> deallocating or something fancier).
> 
> We have an interval_tree for this, and yes, that's how far I got :/
> 
> The other thing I was thinking of is trying to detect overlap through
> the page-tables themselves, but we have a distinct lack of storage
> there.

I tried to think about saving some generation info somewhere in the
page struct, but I could not come up with a reasonable solution that
would not require traversing all the page tables again once the TLB
flush is done.

> The thing is, if this threaded monster runs on all CPUs (busy front-end
> server) and does a ton of invalidation due to all the short-lived
> request crud, then all the extra invalidations will add up too. Having
> to do process-wide (machine-wide in this case) invalidations is expensive,
> and having to do more of them surely isn't cheap either.
> 
> So there might be something to win here.

Yes. I remember that these full TLB flushes leave their mark.

BTW: sometimes you don’t see the effect of these full TLB flushes as much in
VMs. I encountered a strange phenomenon at the time - INVLPG for an
arbitrary page caused my Haswell machine to flush the entire TLB when the
INVLPG was issued inside a VM. It took me quite some time to analyze this
problem. Eventually Intel told me that’s part of what is called “page
fracturing” - if the host uses 4k pages in the EPT, they (usually) need to
flush the entire TLB for any INVLPG. That happens since they don’t know
the size of the flushed page.

I really need to finish my blog-post about it some day.
Peter Zijlstra May 13, 2019, 9:12 a.m. UTC | #18
On Mon, May 13, 2019 at 10:36:06AM +0200, Peter Zijlstra wrote:
> On Thu, May 09, 2019 at 09:21:35PM +0000, Nadav Amit wrote:
> > It may be possible to avoid false-positive nesting indications (when the
> > flushes do not overlap) by creating a new struct mmu_gather_pending, with
> > something like:
> > 
> >   struct mmu_gather_pending {
> >  	u64 start;
> > 	u64 end;
> > 	struct mmu_gather_pending *next;
> >   }
> > 
> > tlb_finish_mmu() would then iterate over the mm->mmu_gather_pending
> > (pointing to the linked list) and find whether there is any overlap. This
> > would still require synchronization (acquiring a lock when allocating and
> > deallocating or something fancier).
> 
> We have an interval_tree for this, and yes, that's how far I got :/
> 
> The other thing I was thinking of is trying to detect overlap through
> the page-tables themselves, but we have a distinct lack of storage
> there.

We might just use some state in the pmd, there's still 2 _pt_pad_[12] in
struct page to 'use'. So we could come up with some tlb generation
scheme that would detect conflict.
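
Purely as a sketch of the generation idea (the structure, its placement, and the helper are hypothetical; this is not an existing kernel mechanism):

#include <linux/types.h>

/*
 * Hypothetical scheme: each page-table page records the mm-wide flush
 * generation at which it was last modified.  A gather that started at an
 * older generation knows another thread touched the table concurrently
 * and must flush conservatively.
 */
struct pt_page_state {
	unsigned long tlb_gen;	/* generation of the last modification */
};

static inline bool pt_modified_since(struct pt_page_state *pt,
				     unsigned long gather_start_gen)
{
	return pt->tlb_gen > gather_start_gen;
}
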
Nadav Amit May 13, 2019, 9:21 a.m. UTC | #19
> On May 13, 2019, at 2:12 AM, Peter Zijlstra <peterz@infradead.org> wrote:
> 
> On Mon, May 13, 2019 at 10:36:06AM +0200, Peter Zijlstra wrote:
>> On Thu, May 09, 2019 at 09:21:35PM +0000, Nadav Amit wrote:
>>> It may be possible to avoid false-positive nesting indications (when the
>>> flushes do not overlap) by creating a new struct mmu_gather_pending, with
>>> something like:
>>> 
>>>  struct mmu_gather_pending {
>>> 	u64 start;
>>> 	u64 end;
>>> 	struct mmu_gather_pending *next;
>>>  }
>>> 
>>> tlb_finish_mmu() would then iterate over the mm->mmu_gather_pending
>>> (pointing to the linked list) and find whether there is any overlap. This
>>> would still require synchronization (acquiring a lock when allocating and
>>> deallocating or something fancier).
>> 
>> We have an interval_tree for this, and yes, that's how far I got :/
>> 
>> The other thing I was thinking of is trying to detect overlap through
>> the page-tables themselves, but we have a distinct lack of storage
>> there.
> 
> We might just use some state in the pmd, there's still 2 _pt_pad_[12] in
> struct page to 'use'. So we could come up with some tlb generation
> scheme that would detect conflict.

It is rather easy to come up with a scheme (and I did similar things) if you
flush the table while you hold the page-tables lock. But if you batch across
page-tables it becomes harder.

Thinking about it while typing, perhaps it is simpler than I think - if you
need to flush a range that runs across more than a single table, you are very
likely to flush a range of more than 33 entries, so anyhow you are likely to
do a full TLB flush.

So perhaps just avoiding the batching if only entries from a single table
are flushed would be enough.
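
A one-line version of the "single page table" test being suggested might look like this (the helper name is hypothetical):

#include <linux/types.h>
#include <asm/pgtable.h>

/*
 * A range is covered by a single last-level page table iff its first and
 * last byte fall into the same PMD-sized region.
 */
static inline bool tlb_range_in_single_table(unsigned long start,
					     unsigned long end)
{
	return (start & PMD_MASK) == ((end - 1) & PMD_MASK);
}
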
Peter Zijlstra May 13, 2019, 11:27 a.m. UTC | #20
On Mon, May 13, 2019 at 09:21:01AM +0000, Nadav Amit wrote:
> > On May 13, 2019, at 2:12 AM, Peter Zijlstra <peterz@infradead.org> wrote:

> >> The other thing I was thinking of is trying to detect overlap through
> >> the page-tables themselves, but we have a distinct lack of storage
> >> there.
> > 
> > We might just use some state in the pmd, there's still 2 _pt_pad_[12] in
> > struct page to 'use'. So we could come up with some tlb generation
> > scheme that would detect conflict.
> 
> It is rather easy to come up with a scheme (and I did similar things) if you
> flush the table while you hold the page-tables lock. But if you batch across
> page-tables it becomes harder.

Yeah; finding that out now. I keep finding holes :/

> Thinking about it while typing, perhaps it is simpler than I think - if you
> need to flush a range that runs across more than a single table, you are very
> likely to flush a range of more than 33 entries, so anyhow you are likely to
> do a full TLB flush.

We can't rely on the 33, that's x86 specific. Other architectures can have
another (or no) limit on that.

> So perhaps just avoiding the batching if only entries from a single table
> are flushed would be enough.

That's near to what Will suggested initially, just flush the entire
thing when there's a conflict.
Peter Zijlstra May 13, 2019, 11:30 a.m. UTC | #21
On Mon, May 13, 2019 at 09:11:38AM +0000, Nadav Amit wrote:
> BTW: sometimes you don’t see the effect of these full TLB flushes as much in
> VMs. I encountered a strange phenomenon at the time - INVLPG for an
> arbitrary page caused my Haswell machine to flush the entire TLB when the
> INVLPG was issued inside a VM. It took me quite some time to analyze this
> problem. Eventually Intel told me that’s part of what is called “page
> fracturing” - if the host uses 4k pages in the EPT, they (usually) need to
> flush the entire TLB for any INVLPG. That happens since they don’t know
> the size of the flushed page.

Cute... if only they'd given us an interface to tell them... :-)
Will Deacon May 13, 2019, 4:37 p.m. UTC | #22
On Mon, May 13, 2019 at 09:11:38AM +0000, Nadav Amit wrote:
> > On May 13, 2019, at 1:36 AM, Peter Zijlstra <peterz@infradead.org> wrote:
> > 
> > On Thu, May 09, 2019 at 09:21:35PM +0000, Nadav Amit wrote:
> > 
> >>>>> And we can fix that by having tlb_finish_mmu() sync up. Never let a
> >>>>> concurrent tlb_finish_mmu() complete until all concurrenct mmu_gathers
> >>>>> have completed.
> >>>>> 
> >>>>> This should not be too hard to make happen.
> >>>> 
> >>>> This synchronization sounds much more expensive than what I proposed. But I
> >>>> agree that cache-lines that move from one CPU to another might become an
> >>>> issue. But I think that the scheme I suggested would minimize this overhead.
> >>> 
> >>> Well, it would have a lot more unconditional atomic ops. My scheme only
> >>> waits when there is actual concurrency.
> >> 
> >> Well, something has to give. I didn’t think that if the same core does the
> >> atomic op it would be too expensive.
> > 
> > They're still at least 20 cycles a pop, uncontended.
> > 
> >>> I _think_ something like the below ought to work, but its not even been
> >>> near a compiler. The only problem is the unconditional wakeup; we can
> >>> play games to avoid that if we want to continue with this.
> >>> 
> >>> Ideally we'd only do this when there's been actual overlap, but I've not
> >>> found a sensible way to detect that.
> >>> 
> >>> diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
> >>> index 4ef4bbe78a1d..b70e35792d29 100644
> >>> --- a/include/linux/mm_types.h
> >>> +++ b/include/linux/mm_types.h
> >>> @@ -590,7 +590,12 @@ static inline void dec_tlb_flush_pending(struct mm_struct *mm)
> >>> 	 *
> >>> 	 * Therefore we must rely on tlb_flush_*() to guarantee order.
> >>> 	 */
> >>> -	atomic_dec(&mm->tlb_flush_pending);
> >>> +	if (atomic_dec_and_test(&mm->tlb_flush_pending)) {
> >>> +		wake_up_var(&mm->tlb_flush_pending);
> >>> +	} else {
> >>> +		wait_event_var(&mm->tlb_flush_pending,
> >>> +			       !atomic_read_acquire(&mm->tlb_flush_pending));
> >>> +	}
> >>> }
> >> 
> >> It still seems very expensive to me, at least for certain workloads (e.g.,
> >> Apache with multithreaded MPM).
> > 
> > Is that Apache-MPM workload triggering this lots? Having a known
> > benchmark for this stuff is good for when someone has time to play with
> > things.
> 
> Setting Apache2 with mpm_worker causes every request to go through
> mmap-writev-munmap flow on every thread. I didn’t run this workload after
> the patches that downgrade the mmap_sem to read before the page-table
> zapping were introduced. I presume these patches would allow the page-table
> zapping to be done concurrently, and therefore would hit this flow.

Hmm, I don't think so: munmap() still has to take the semaphore for write
initially, so it will be serialised against other munmap() threads even
after they've downgraded afaict.

The initial bug report was about concurrent madvise() vs munmap().

Will
Nadav Amit May 13, 2019, 5:06 p.m. UTC | #23
> On May 13, 2019, at 9:37 AM, Will Deacon <will.deacon@arm.com> wrote:
> 
> On Mon, May 13, 2019 at 09:11:38AM +0000, Nadav Amit wrote:
>>> On May 13, 2019, at 1:36 AM, Peter Zijlstra <peterz@infradead.org> wrote:
>>> 
>>> On Thu, May 09, 2019 at 09:21:35PM +0000, Nadav Amit wrote:
>>> 
>>>>>>> And we can fix that by having tlb_finish_mmu() sync up. Never let a
>>>>>>> concurrent tlb_finish_mmu() complete until all concurrenct mmu_gathers
>>>>>>> have completed.
>>>>>>> 
>>>>>>> This should not be too hard to make happen.
>>>>>> 
>>>>>> This synchronization sounds much more expensive than what I proposed. But I
>>>>>> agree that cache-lines that move from one CPU to another might become an
>>>>>> issue. But I think that the scheme I suggested would minimize this overhead.
>>>>> 
>>>>> Well, it would have a lot more unconditional atomic ops. My scheme only
>>>>> waits when there is actual concurrency.
>>>> 
>>>> Well, something has to give. I didn’t think that if the same core does the
>>>> atomic op it would be too expensive.
>>> 
>>> They're still at least 20 cycles a pop, uncontended.
>>> 
>>>>> I _think_ something like the below ought to work, but its not even been
>>>>> near a compiler. The only problem is the unconditional wakeup; we can
>>>>> play games to avoid that if we want to continue with this.
>>>>> 
>>>>> Ideally we'd only do this when there's been actual overlap, but I've not
>>>>> found a sensible way to detect that.
>>>>> 
>>>>> diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
>>>>> index 4ef4bbe78a1d..b70e35792d29 100644
>>>>> --- a/include/linux/mm_types.h
>>>>> +++ b/include/linux/mm_types.h
>>>>> @@ -590,7 +590,12 @@ static inline void dec_tlb_flush_pending(struct mm_struct *mm)
>>>>> 	 *
>>>>> 	 * Therefore we must rely on tlb_flush_*() to guarantee order.
>>>>> 	 */
>>>>> -	atomic_dec(&mm->tlb_flush_pending);
>>>>> +	if (atomic_dec_and_test(&mm->tlb_flush_pending)) {
>>>>> +		wake_up_var(&mm->tlb_flush_pending);
>>>>> +	} else {
>>>>> +		wait_event_var(&mm->tlb_flush_pending,
>>>>> +			       !atomic_read_acquire(&mm->tlb_flush_pending));
>>>>> +	}
>>>>> }
>>>> 
>>>> It still seems very expensive to me, at least for certain workloads (e.g.,
>>>> Apache with multithreaded MPM).
>>> 
>>> Is that Apache-MPM workload triggering this lots? Having a known
>>> benchmark for this stuff is good for when someone has time to play with
>>> things.
>> 
>> Setting Apache2 with mpm_worker causes every request to go through
>> mmap-writev-munmap flow on every thread. I didn’t run this workload after
>> the patches that downgrade the mmap_sem to read before the page-table
>> zapping were introduced. I presume these patches would allow the page-table
>> zapping to be done concurrently, and therefore would hit this flow.
> 
> Hmm, I don't think so: munmap() still has to take the semaphore for write
> initially, so it will be serialised against other munmap() threads even
> after they've downgraded afaict.
> 
> The initial bug report was about concurrent madvise() vs munmap().

I guess you are right (and I’m wrong).

Short search suggests that ebizzy might be affected (a thread by Mel
Gorman): https://lkml.org/lkml/2015/2/2/493
Nadav Amit May 13, 2019, 5:41 p.m. UTC | #24
> On May 13, 2019, at 4:27 AM, Peter Zijlstra <peterz@infradead.org> wrote:
> 
> On Mon, May 13, 2019 at 09:21:01AM +0000, Nadav Amit wrote:
>>> On May 13, 2019, at 2:12 AM, Peter Zijlstra <peterz@infradead.org> wrote:
> 
>>>> The other thing I was thinking of is trying to detect overlap through
>>>> the page-tables themselves, but we have a distinct lack of storage
>>>> there.
>>> 
>>> We might just use some state in the pmd, there's still 2 _pt_pad_[12] in
>>> struct page to 'use'. So we could come up with some tlb generation
>>> scheme that would detect conflict.
>> 
>> It is rather easy to come up with a scheme (and I did similar things) if you
>> flush the table while you hold the page-tables lock. But if you batch across
>> page-tables it becomes harder.
> 
> Yeah; finding that out now. I keep finding holes :/

You can use Uhlig’s dissertation for inspiration (Section 4.4).

[1] https://www.researchgate.net/publication/36450482_Scalability_of_microkernel-based_systems/download

> 
>> Thinking about it while typing, perhaps it is simpler than I think - if you
>> need to flush a range that runs across more than a single table, you are very
>> likely to flush a range of more than 33 entries, so anyhow you are likely to
>> do a full TLB flush.
> 
> We can't rely on the 33, that's x86 specific. Other architectures can have
> another (or no) limit on that.

I wonder whether there are architectures that do not invalidate the TLB
entry by entry and thus experience this kind of overhead.

>> So perhaps just avoiding the batching if only entries from a single table
>> are flushed would be enough.
> 
> That's near to what Will suggested initially, just flush the entire
> thing when there's a conflict.

One question is how you define a conflict. IIUC, Will suggests that the
same mm marks a conflict. In addition, I suggest that if you only remove a
single entry (or a few), you would just skip the batching and do the
flushing while holding the page-table lock.
Mel Gorman May 14, 2019, 8:58 a.m. UTC | #25
On Mon, May 13, 2019 at 05:06:03PM +0000, Nadav Amit wrote:
> > On May 13, 2019, at 9:37 AM, Will Deacon <will.deacon@arm.com> wrote:
> > 
> > On Mon, May 13, 2019 at 09:11:38AM +0000, Nadav Amit wrote:
> >>> On May 13, 2019, at 1:36 AM, Peter Zijlstra <peterz@infradead.org> wrote:
> >>> 
> >>> On Thu, May 09, 2019 at 09:21:35PM +0000, Nadav Amit wrote:
> >>> 
> >>>>>>> And we can fix that by having tlb_finish_mmu() sync up. Never let a
> >>>>>>> concurrent tlb_finish_mmu() complete until all concurrenct mmu_gathers
> >>>>>>> have completed.
> >>>>>>> 
> >>>>>>> This should not be too hard to make happen.
> >>>>>> 
> >>>>>> This synchronization sounds much more expensive than what I proposed. But I
> >>>>>> agree that cache-lines that move from one CPU to another might become an
> >>>>>> issue. But I think that the scheme I suggested would minimize this overhead.
> >>>>> 
> >>>>> Well, it would have a lot more unconditional atomic ops. My scheme only
> >>>>> waits when there is actual concurrency.
> >>>> 
> >>>> Well, something has to give. I didn’t think that if the same core does the
> >>>> atomic op it would be too expensive.
> >>> 
> >>> They're still at least 20 cycles a pop, uncontended.
> >>> 
> >>>>> I _think_ something like the below ought to work, but its not even been
> >>>>> near a compiler. The only problem is the unconditional wakeup; we can
> >>>>> play games to avoid that if we want to continue with this.
> >>>>> 
> >>>>> Ideally we'd only do this when there's been actual overlap, but I've not
> >>>>> found a sensible way to detect that.
> >>>>> 
> >>>>> diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
> >>>>> index 4ef4bbe78a1d..b70e35792d29 100644
> >>>>> --- a/include/linux/mm_types.h
> >>>>> +++ b/include/linux/mm_types.h
> >>>>> @@ -590,7 +590,12 @@ static inline void dec_tlb_flush_pending(struct mm_struct *mm)
> >>>>> 	 *
> >>>>> 	 * Therefore we must rely on tlb_flush_*() to guarantee order.
> >>>>> 	 */
> >>>>> -	atomic_dec(&mm->tlb_flush_pending);
> >>>>> +	if (atomic_dec_and_test(&mm->tlb_flush_pending)) {
> >>>>> +		wake_up_var(&mm->tlb_flush_pending);
> >>>>> +	} else {
> >>>>> +		wait_event_var(&mm->tlb_flush_pending,
> >>>>> +			       !atomic_read_acquire(&mm->tlb_flush_pending));
> >>>>> +	}
> >>>>> }
> >>>> 
> >>>> It still seems very expensive to me, at least for certain workloads (e.g.,
> >>>> Apache with multithreaded MPM).
> >>> 
> >>> Is that Apache-MPM workload triggering this lots? Having a known
> >>> benchmark for this stuff is good for when someone has time to play with
> >>> things.
> >> 
> >> Setting Apache2 with mpm_worker causes every request to go through
> >> mmap-writev-munmap flow on every thread. I didn’t run this workload after
> >> the patches that downgrade the mmap_sem to read before the page-table
> >> zapping were introduced. I presume these patches would allow the page-table
> >> zapping to be done concurrently, and therefore would hit this flow.
> > 
> > Hmm, I don't think so: munmap() still has to take the semaphore for write
> > initially, so it will be serialised against other munmap() threads even
> > after they've downgraded afaict.
> > 
> > The initial bug report was about concurrent madvise() vs munmap().
> 
> I guess you are right (and I’m wrong).
> 
> Short search suggests that ebizzy might be affected (a thread by Mel
> Gorman): https://lkml.org/lkml/2015/2/2/493
> 

Glibc has since been fixed to be less munmap/mmap intensive, and the
system CPU usage of ebizzy is generally negligible unless it is
specifically configured to use mmap/munmap instead of malloc/free, which
is unrealistic for good application behaviour.
diff mbox series

Patch

diff --git a/mm/mmu_gather.c b/mm/mmu_gather.c
index 99740e1..9fd5272 100644
--- a/mm/mmu_gather.c
+++ b/mm/mmu_gather.c
@@ -249,11 +249,12 @@  void tlb_finish_mmu(struct mmu_gather *tlb,
 	 * flush by batching, a thread has stable TLB entry can fail to flush
 	 * the TLB by observing pte_none|!pte_dirty, for example so flush TLB
 	 * forcefully if we detect parallel PTE batching threads.
+	 *
+	 * munmap() may change mapping under non-exclusive lock and also free
+	 * page tables.  Do not call __tlb_reset_range() for it.
 	 */
-	if (mm_tlb_flush_nested(tlb->mm)) {
-		__tlb_reset_range(tlb);
+	if (mm_tlb_flush_nested(tlb->mm))
 		__tlb_adjust_range(tlb, start, end - start);
-	}
 
 	tlb_flush_mmu(tlb);