diff mbox series

arm64: hugetlb: Avoid unnecessary clearing in huge_ptep_set_access_flags

Message ID 20180919174437.1866-1-steve.capper@arm.com (mailing list archive)
State New, archived
Headers show
Series arm64: hugetlb: Avoid unnecessary clearing in huge_ptep_set_access_flags | expand

Commit Message

Steve Capper Sept. 19, 2018, 5:44 p.m. UTC
For contiguous hugetlb, huge_ptep_set_access_flags performs a
get_clear_flush (which then flushes the TLBs) even when no change of ptes
is necessary.

Unfortunately, this behaviour can lead to back-to-back page faults being
generated when running with multiple threads that access the same
contiguous huge page.

Thread 1                     |  Thread 2
-----------------------------+------------------------------
hugetlb_fault                |
huge_ptep_set_access_flags   |
  -> invalidate pte range    | hugetlb_fault
continue processing          | wait for hugetlb fault mutex
release mutex and return     | huge_ptep_set_access_flags
                             |   -> invalidate pte range
hugetlb_fault
...

This patch changes huge_ptep_set_access_flags s.t. we first read the
contiguous range of ptes (whilst preserving dirty information); the pte
range is only then invalidated where necessary and this prevents further
spurious page faults.

Fixes: d8bdcff28764 ("arm64: hugetlb: Add break-before-make logic for contiguous entries")
Reported-by: Lei Zhang <zhang.lei@jp.fujitsu.com>
Signed-off-by: Steve Capper <steve.capper@arm.com>

---

I was unable to test this on any hardware as I'm away from the office.

Can you please test this Lei Zhang?

Cheers,
--
Steve
---
 arch/arm64/mm/hugetlbpage.c | 36 ++++++++++++++++++++++++++++++++----
 1 file changed, 32 insertions(+), 4 deletions(-)

Comments

Will Deacon Sept. 20, 2018, 9:39 a.m. UTC | #1
Hi Steve,

On Wed, Sep 19, 2018 at 06:44:37PM +0100, Steve Capper wrote:
> For contiguous hugetlb, huge_ptep_set_access_flags performs a
> get_clear_flush (which then flushes the TLBs) even when no change of ptes
> is necessary.
> 
> Unfortunately, this behaviour can lead to back-to-back page faults being
> generated when running with multiple threads that access the same
> contiguous huge page.
> 
> Thread 1                     |  Thread 2
> -----------------------------+------------------------------
> hugetlb_fault                |
> huge_ptep_set_access_flags   |
>   -> invalidate pte range    | hugetlb_fault
> continue processing          | wait for hugetlb fault mutex
> release mutex and return     | huge_ptep_set_access_flags
>                              |   -> invalidate pte range
> hugetlb_fault

Is this serialised by a mutex of the mmap_sem? (or both?)

> This patch changes huge_ptep_set_access_flags s.t. we first read the
> contiguous range of ptes (whilst preserving dirty information); the pte
> range is only then invalidated where necessary and this prevents further
> spurious page faults.
> 
> Fixes: d8bdcff28764 ("arm64: hugetlb: Add break-before-make logic for contiguous entries")
> Reported-by: Lei Zhang <zhang.lei@jp.fujitsu.com>
> Signed-off-by: Steve Capper <steve.capper@arm.com>
> 
> ---
> 
> I was unable to test this on any hardware as I'm away from the office.
> 
> Can you please test this Lei Zhang?
> 
> Cheers,
> --
> Steve
> ---
>  arch/arm64/mm/hugetlbpage.c | 36 ++++++++++++++++++++++++++++++++----
>  1 file changed, 32 insertions(+), 4 deletions(-)
> 
> diff --git a/arch/arm64/mm/hugetlbpage.c b/arch/arm64/mm/hugetlbpage.c
> index 192b3ba07075..76d229eb6ba1 100644
> --- a/arch/arm64/mm/hugetlbpage.c
> +++ b/arch/arm64/mm/hugetlbpage.c
> @@ -131,6 +131,27 @@ static pte_t get_clear_flush(struct mm_struct *mm,
>  	return orig_pte;
>  }
>  
> +static pte_t get_contig_pte(pte_t *ptep, unsigned long pgsize,
> +		unsigned long ncontig)
> +{
> +	unsigned long i;
> +	pte_t orig_pte = huge_ptep_get(ptep);
> +
> +	for (i = 0; i < ncontig; i++, ptep++) {
> +		pte_t pte = huge_ptep_get(ptep);
> +
> +		/*
> +		 * If HW_AFDBM is enabled, then the HW could turn on
> +		 * the dirty bit for any page in the set, so check
> +		 * them all.  All hugetlb entries are already young.

Maybe mention that young implies that AF is set?
Having said that, are you sure that these entries are already young?
If so, why does hugetlb_fault() call pte_mkyoung() on the pte before
passing it into huge_ptep_set_access_flags()? Also, since
huge_ptep_set_access_flags() can only ever relax the permissions for
an entry, if everything was young then it would imply that the only
operation it ever does is to set dirty. In which case, we could
assert pte_dirty(pte) and parts of the current code are redundant.

What's going on here?

> +		 */
> +		if (pte_dirty(pte))
> +			orig_pte = pte_mkdirty(orig_pte);
> +	}
> +
> +	return orig_pte;
> +}
> +
>  /*
>   * Changing some bits of contiguous entries requires us to follow a
>   * Break-Before-Make approach, breaking the whole contiguous set
> @@ -324,7 +345,7 @@ int huge_ptep_set_access_flags(struct vm_area_struct *vma,
>  			       unsigned long addr, pte_t *ptep,
>  			       pte_t pte, int dirty)
>  {
> -	int ncontig, i, changed = 0;
> +	int ncontig, i;
>  	size_t pgsize = 0;
>  	unsigned long pfn = pte_pfn(pte), dpfn;
>  	pgprot_t hugeprot;
> @@ -336,9 +357,16 @@ int huge_ptep_set_access_flags(struct vm_area_struct *vma,
>  	ncontig = find_num_contig(vma->vm_mm, addr, ptep, &pgsize);
>  	dpfn = pgsize >> PAGE_SHIFT;
>  
> +	orig_pte = get_contig_pte(ptep, pgsize, ncontig);
> +	if (pte_same(orig_pte, pte))
> +		return 0;

Hmm, I don't understand how this solves the other problem we found. Imagine
we have a contiguous range of ptes [pte0 ... pteN] and we have a CPU that
doesn't support the contiguous hint and also doesn't support HW_AFDBM.
Further, imagine that another CPU in the system *does* support HW_AFDBM.

In this case, it would be possible for the first CPU to take a fault on
pteN as a result of trying to write to a clean page. If the other CPU had
already set the dirty bit for pte0, then your get_contig_pte() function
will return a dirty orig_pte and the pte_same() check will pass for the
faulting CPU, causing it to get stuck in a recursive fault.

Ideally, we'd solve this by only performing the pte_same() check on the
pte that is offset by the faulting address, but unfortunately the address
we get passed here has already been rounded down. That's why I bailed and
simply tried to check pte_same() on every pte. Really, we just to check
the access and dirty flags -- another diff below.

> +	/*
> +	 * we need to get our orig_pte again as HW DBM may have happened since
> +	 * above. get_clear_flush will ultimately cmpxchg with 0 to ensure

s/cmpxchg/xchg/

Will

--->8

diff --git a/arch/arm64/mm/hugetlbpage.c b/arch/arm64/mm/hugetlbpage.c
index 192b3ba07075..0b7738d76f14 100644
--- a/arch/arm64/mm/hugetlbpage.c
+++ b/arch/arm64/mm/hugetlbpage.c
@@ -324,7 +324,7 @@ int huge_ptep_set_access_flags(struct vm_area_struct *vma,
 			       unsigned long addr, pte_t *ptep,
 			       pte_t pte, int dirty)
 {
-	int ncontig, i, changed = 0;
+	int ncontig, i;
 	size_t pgsize = 0;
 	unsigned long pfn = pte_pfn(pte), dpfn;
 	pgprot_t hugeprot;
@@ -336,9 +336,20 @@ int huge_ptep_set_access_flags(struct vm_area_struct *vma,
 	ncontig = find_num_contig(vma->vm_mm, addr, ptep, &pgsize);
 	dpfn = pgsize >> PAGE_SHIFT;
 
+	for (i = 0; i < ncontig; i++) {
+		orig_pte = huge_ptep_get(ptep + i);
+
+		if (pte_dirty(orig_pte) != pte_dirty(pte))
+			break;
+
+		if (pte_young(orig_pte) != pte_young(pte))
+			break;
+	}
+
+	if (i == ncontig)
+		return 0;
+
 	orig_pte = get_clear_flush(vma->vm_mm, addr, ptep, pgsize, ncontig);
-	if (!pte_same(orig_pte, pte))
-		changed = 1;
 
 	/* Make sure we don't lose the dirty state */
 	if (pte_dirty(orig_pte))
@@ -348,7 +359,7 @@ int huge_ptep_set_access_flags(struct vm_area_struct *vma,
 	for (i = 0; i < ncontig; i++, ptep++, addr += pgsize, pfn += dpfn)
 		set_pte_at(vma->vm_mm, addr, ptep, pfn_pte(pfn, hugeprot));
 
-	return changed;
+	return 1;
 }
 
 void huge_ptep_set_wrprotect(struct mm_struct *mm,
Steve Capper Sept. 20, 2018, 4:34 p.m. UTC | #2
On Thu, Sep 20, 2018 at 10:39:01AM +0100, Will Deacon wrote:
> Hi Steve,

Hi Will,

>
> On Wed, Sep 19, 2018 at 06:44:37PM +0100, Steve Capper wrote:
> > For contiguous hugetlb, huge_ptep_set_access_flags performs a
> > get_clear_flush (which then flushes the TLBs) even when no change of ptes
> > is necessary.
> >
> > Unfortunately, this behaviour can lead to back-to-back page faults being
> > generated when running with multiple threads that access the same
> > contiguous huge page.
> >
> > Thread 1                     |  Thread 2
> > -----------------------------+------------------------------
> > hugetlb_fault                |
> > huge_ptep_set_access_flags   |
> >   -> invalidate pte range    | hugetlb_fault
> > continue processing          | wait for hugetlb fault mutex
> > release mutex and return     | huge_ptep_set_access_flags
> >                              |   -> invalidate pte range
> > hugetlb_fault
>
> Is this serialised by a mutex of the mmap_sem? (or both?)
>

The serialisation point I was looking at was hugetlb_fault_mutex_hash.
I will make this clearer in the commit log.

> > This patch changes huge_ptep_set_access_flags s.t. we first read the
> > contiguous range of ptes (whilst preserving dirty information); the pte
> > range is only then invalidated where necessary and this prevents further
> > spurious page faults.
> >
> > Fixes: d8bdcff28764 ("arm64: hugetlb: Add break-before-make logic for contiguous entries")
> > Reported-by: Lei Zhang <zhang.lei@jp.fujitsu.com>
> > Signed-off-by: Steve Capper <steve.capper@arm.com>
> >
> > ---
> >
> > I was unable to test this on any hardware as I'm away from the office.
> >
> > Can you please test this Lei Zhang?
> >
> > Cheers,
> > --
> > Steve
> > ---
> >  arch/arm64/mm/hugetlbpage.c | 36 ++++++++++++++++++++++++++++++++----
> >  1 file changed, 32 insertions(+), 4 deletions(-)
> >
> > diff --git a/arch/arm64/mm/hugetlbpage.c b/arch/arm64/mm/hugetlbpage.c
> > index 192b3ba07075..76d229eb6ba1 100644
> > --- a/arch/arm64/mm/hugetlbpage.c
> > +++ b/arch/arm64/mm/hugetlbpage.c
> > @@ -131,6 +131,27 @@ static pte_t get_clear_flush(struct mm_struct *mm,
> >  return orig_pte;
> >  }
> >
> > +static pte_t get_contig_pte(pte_t *ptep, unsigned long pgsize,
> > +unsigned long ncontig)
> > +{
> > +unsigned long i;
> > +pte_t orig_pte = huge_ptep_get(ptep);
> > +
> > +for (i = 0; i < ncontig; i++, ptep++) {
> > +pte_t pte = huge_ptep_get(ptep);
> > +
> > +/*
> > + * If HW_AFDBM is enabled, then the HW could turn on
> > + * the dirty bit for any page in the set, so check
> > + * them all.  All hugetlb entries are already young.
>
> Maybe mention that young implies that AF is set?
> Having said that, are you sure that these entries are already young?
> If so, why does hugetlb_fault() call pte_mkyoung() on the pte before
> passing it into huge_ptep_set_access_flags()? Also, since
> huge_ptep_set_access_flags() can only ever relax the permissions for
> an entry, if everything was young then it would imply that the only
> operation it ever does is to set dirty. In which case, we could
> assert pte_dirty(pte) and parts of the current code are redundant.
>
> What's going on here?

I'll scrutinise the young path a bit more. In make_huge_pte a young pte
is created right away, I suspect the call to pte_mkyoung in hugetlb
fault is to prevent the call to huge_ptep_set_access_flags from removing
the young information.

>
> > + */
> > +if (pte_dirty(pte))
> > +orig_pte = pte_mkdirty(orig_pte);
> > +}
> > +
> > +return orig_pte;
> > +}
> > +
> >  /*
> >   * Changing some bits of contiguous entries requires us to follow a
> >   * Break-Before-Make approach, breaking the whole contiguous set
> > @@ -324,7 +345,7 @@ int huge_ptep_set_access_flags(struct vm_area_struct *vma,
> >         unsigned long addr, pte_t *ptep,
> >         pte_t pte, int dirty)
> >  {
> > -int ncontig, i, changed = 0;
> > +int ncontig, i;
> >  size_t pgsize = 0;
> >  unsigned long pfn = pte_pfn(pte), dpfn;
> >  pgprot_t hugeprot;
> > @@ -336,9 +357,16 @@ int huge_ptep_set_access_flags(struct vm_area_struct *vma,
> >  ncontig = find_num_contig(vma->vm_mm, addr, ptep, &pgsize);
> >  dpfn = pgsize >> PAGE_SHIFT;
> >
> > +orig_pte = get_contig_pte(ptep, pgsize, ncontig);
> > +if (pte_same(orig_pte, pte))
> > +return 0;
>
> Hmm, I don't understand how this solves the other problem we found. Imagine
> we have a contiguous range of ptes [pte0 ... pteN] and we have a CPU that
> doesn't support the contiguous hint and also doesn't support HW_AFDBM.
> Further, imagine that another CPU in the system *does* support HW_AFDBM.
>

:-(

> In this case, it would be possible for the first CPU to take a fault on
> pteN as a result of trying to write to a clean page. If the other CPU had
> already set the dirty bit for pte0, then your get_contig_pte() function
> will return a dirty orig_pte and the pte_same() check will pass for the
> faulting CPU, causing it to get stuck in a recursive fault.
>
> Ideally, we'd solve this by only performing the pte_same() check on the
> pte that is offset by the faulting address, but unfortunately the address
> we get passed here has already been rounded down. That's why I bailed and
> simply tried to check pte_same() on every pte. Really, we just to check
> the access and dirty flags -- another diff below.

Thanks, I understand, apologies for missing that earlier. I'll have
another think about huge_ptep_set_access_flags and will post a V2.

>
> > +/*
> > + * we need to get our orig_pte again as HW DBM may have happened since
> > + * above. get_clear_flush will ultimately cmpxchg with 0 to ensure
>
> s/cmpxchg/xchg/

Gah, thanks :-)

>
> Will
>
> --->8
>
> diff --git a/arch/arm64/mm/hugetlbpage.c b/arch/arm64/mm/hugetlbpage.c
> index 192b3ba07075..0b7738d76f14 100644
> --- a/arch/arm64/mm/hugetlbpage.c
> +++ b/arch/arm64/mm/hugetlbpage.c
> @@ -324,7 +324,7 @@ int huge_ptep_set_access_flags(struct vm_area_struct *vma,
>         unsigned long addr, pte_t *ptep,
>         pte_t pte, int dirty)
>  {
> -int ncontig, i, changed = 0;
> +int ncontig, i;
>  size_t pgsize = 0;
>  unsigned long pfn = pte_pfn(pte), dpfn;
>  pgprot_t hugeprot;
> @@ -336,9 +336,20 @@ int huge_ptep_set_access_flags(struct vm_area_struct *vma,
>  ncontig = find_num_contig(vma->vm_mm, addr, ptep, &pgsize);
>  dpfn = pgsize >> PAGE_SHIFT;
>
> +for (i = 0; i < ncontig; i++) {
> +orig_pte = huge_ptep_get(ptep + i);
> +
> +if (pte_dirty(orig_pte) != pte_dirty(pte))
> +break;
> +
> +if (pte_young(orig_pte) != pte_young(pte))
> +break;
> +}
> +
> +if (i == ncontig)
> +return 0;
> +
>  orig_pte = get_clear_flush(vma->vm_mm, addr, ptep, pgsize, ncontig);
> -if (!pte_same(orig_pte, pte))
> -changed = 1;
>
>  /* Make sure we don't lose the dirty state */
>  if (pte_dirty(orig_pte))
> @@ -348,7 +359,7 @@ int huge_ptep_set_access_flags(struct vm_area_struct *vma,
>  for (i = 0; i < ncontig; i++, ptep++, addr += pgsize, pfn += dpfn)
>  set_pte_at(vma->vm_mm, addr, ptep, pfn_pte(pfn, hugeprot));
>
> -return changed;
> +return 1;
>  }
>
>  void huge_ptep_set_wrprotect(struct mm_struct *mm,
IMPORTANT NOTICE: The contents of this email and any attachments are confidential and may also be privileged. If you are not the intended recipient, please notify the sender immediately and do not disclose the contents to any other person, use it for any purpose, or store or copy the information in any medium. Thank you.
Steve Capper Sept. 20, 2018, 4:40 p.m. UTC | #3
On Thu, Sep 20, 2018 at 10:39:01AM +0100, Will Deacon wrote:
> Hi Steve,

Hi Will,

> 
> On Wed, Sep 19, 2018 at 06:44:37PM +0100, Steve Capper wrote:
> > For contiguous hugetlb, huge_ptep_set_access_flags performs a
> > get_clear_flush (which then flushes the TLBs) even when no change of ptes
> > is necessary.
> > 
> > Unfortunately, this behaviour can lead to back-to-back page faults being
> > generated when running with multiple threads that access the same
> > contiguous huge page.
> > 
> > Thread 1                     |  Thread 2
> > -----------------------------+------------------------------
> > hugetlb_fault                |
> > huge_ptep_set_access_flags   |
> >   -> invalidate pte range    | hugetlb_fault
> > continue processing          | wait for hugetlb fault mutex
> > release mutex and return     | huge_ptep_set_access_flags
> >                              |   -> invalidate pte range
> > hugetlb_fault
> 
> Is this serialised by a mutex of the mmap_sem? (or both?)

The serialisation point I was looking at was from hugetlb_fault_mutex_table,
I will make this clearer in the commit log.

> 
> > This patch changes huge_ptep_set_access_flags s.t. we first read the
> > contiguous range of ptes (whilst preserving dirty information); the pte
> > range is only then invalidated where necessary and this prevents further
> > spurious page faults.
> > 
> > Fixes: d8bdcff28764 ("arm64: hugetlb: Add break-before-make logic for contiguous entries")
> > Reported-by: Lei Zhang <zhang.lei@jp.fujitsu.com>
> > Signed-off-by: Steve Capper <steve.capper@arm.com>
> > 
> > ---
> > 
> > I was unable to test this on any hardware as I'm away from the office.
> > 
> > Can you please test this Lei Zhang?
> > 
> > Cheers,
> > --
> > Steve
> > ---
> >  arch/arm64/mm/hugetlbpage.c | 36 ++++++++++++++++++++++++++++++++----
> >  1 file changed, 32 insertions(+), 4 deletions(-)
> > 
> > diff --git a/arch/arm64/mm/hugetlbpage.c b/arch/arm64/mm/hugetlbpage.c
> > index 192b3ba07075..76d229eb6ba1 100644
> > --- a/arch/arm64/mm/hugetlbpage.c
> > +++ b/arch/arm64/mm/hugetlbpage.c
> > @@ -131,6 +131,27 @@ static pte_t get_clear_flush(struct mm_struct *mm,
> >  	return orig_pte;
> >  }
> >  
> > +static pte_t get_contig_pte(pte_t *ptep, unsigned long pgsize,
> > +		unsigned long ncontig)
> > +{
> > +	unsigned long i;
> > +	pte_t orig_pte = huge_ptep_get(ptep);
> > +
> > +	for (i = 0; i < ncontig; i++, ptep++) {
> > +		pte_t pte = huge_ptep_get(ptep);
> > +
> > +		/*
> > +		 * If HW_AFDBM is enabled, then the HW could turn on
> > +		 * the dirty bit for any page in the set, so check
> > +		 * them all.  All hugetlb entries are already young.
> 
> Maybe mention that young implies that AF is set?
> Having said that, are you sure that these entries are already young?
> If so, why does hugetlb_fault() call pte_mkyoung() on the pte before
> passing it into huge_ptep_set_access_flags()? Also, since
> huge_ptep_set_access_flags() can only ever relax the permissions for
> an entry, if everything was young then it would imply that the only
> operation it ever does is to set dirty. In which case, we could
> assert pte_dirty(pte) and parts of the current code are redundant.
> 
> What's going on here?
> 

Huge ptes are created young in make_huge_pte, I suspect the call to
pte_mkyoung in hugetlb_fault is to prevent huge_ptep_set_access_flags
from losing the young information. I will scrutinise the young logic
further.

> > +		 */
> > +		if (pte_dirty(pte))
> > +			orig_pte = pte_mkdirty(orig_pte);
> > +	}
> > +
> > +	return orig_pte;
> > +}
> > +
> >  /*
> >   * Changing some bits of contiguous entries requires us to follow a
> >   * Break-Before-Make approach, breaking the whole contiguous set
> > @@ -324,7 +345,7 @@ int huge_ptep_set_access_flags(struct vm_area_struct *vma,
> >  			       unsigned long addr, pte_t *ptep,
> >  			       pte_t pte, int dirty)
> >  {
> > -	int ncontig, i, changed = 0;
> > +	int ncontig, i;
> >  	size_t pgsize = 0;
> >  	unsigned long pfn = pte_pfn(pte), dpfn;
> >  	pgprot_t hugeprot;
> > @@ -336,9 +357,16 @@ int huge_ptep_set_access_flags(struct vm_area_struct *vma,
> >  	ncontig = find_num_contig(vma->vm_mm, addr, ptep, &pgsize);
> >  	dpfn = pgsize >> PAGE_SHIFT;
> >  
> > +	orig_pte = get_contig_pte(ptep, pgsize, ncontig);
> > +	if (pte_same(orig_pte, pte))
> > +		return 0;
> 
> Hmm, I don't understand how this solves the other problem we found. Imagine
> we have a contiguous range of ptes [pte0 ... pteN] and we have a CPU that
> doesn't support the contiguous hint and also doesn't support HW_AFDBM.
> Further, imagine that another CPU in the system *does* support HW_AFDBM.
> 

:-(

> In this case, it would be possible for the first CPU to take a fault on
> pteN as a result of trying to write to a clean page. If the other CPU had
> already set the dirty bit for pte0, then your get_contig_pte() function
> will return a dirty orig_pte and the pte_same() check will pass for the
> faulting CPU, causing it to get stuck in a recursive fault.
> 
> Ideally, we'd solve this by only performing the pte_same() check on the
> pte that is offset by the faulting address, but unfortunately the address
> we get passed here has already been rounded down. That's why I bailed and
> simply tried to check pte_same() on every pte. Really, we just to check
> the access and dirty flags -- another diff below.
> 

Thanks I understand, apologies for missing this before (I'll blame
jetlag ;-) ). I'll have a think about this and will post a V2.

> > +	/*
> > +	 * we need to get our orig_pte again as HW DBM may have happened since
> > +	 * above. get_clear_flush will ultimately cmpxchg with 0 to ensure
> 
> s/cmpxchg/xchg/

Gah, thanks :-)

> 
> Will
> 
> --->8
> 
> diff --git a/arch/arm64/mm/hugetlbpage.c b/arch/arm64/mm/hugetlbpage.c
> index 192b3ba07075..0b7738d76f14 100644
> --- a/arch/arm64/mm/hugetlbpage.c
> +++ b/arch/arm64/mm/hugetlbpage.c
> @@ -324,7 +324,7 @@ int huge_ptep_set_access_flags(struct vm_area_struct *vma,
>  			       unsigned long addr, pte_t *ptep,
>  			       pte_t pte, int dirty)
>  {
> -	int ncontig, i, changed = 0;
> +	int ncontig, i;
>  	size_t pgsize = 0;
>  	unsigned long pfn = pte_pfn(pte), dpfn;
>  	pgprot_t hugeprot;
> @@ -336,9 +336,20 @@ int huge_ptep_set_access_flags(struct vm_area_struct *vma,
>  	ncontig = find_num_contig(vma->vm_mm, addr, ptep, &pgsize);
>  	dpfn = pgsize >> PAGE_SHIFT;
>  
> +	for (i = 0; i < ncontig; i++) {
> +		orig_pte = huge_ptep_get(ptep + i);
> +
> +		if (pte_dirty(orig_pte) != pte_dirty(pte))
> +			break;
> +
> +		if (pte_young(orig_pte) != pte_young(pte))
> +			break;
> +	}
> +
> +	if (i == ncontig)
> +		return 0;
> +
>  	orig_pte = get_clear_flush(vma->vm_mm, addr, ptep, pgsize, ncontig);
> -	if (!pte_same(orig_pte, pte))
> -		changed = 1;
>  
>  	/* Make sure we don't lose the dirty state */
>  	if (pte_dirty(orig_pte))
> @@ -348,7 +359,7 @@ int huge_ptep_set_access_flags(struct vm_area_struct *vma,
>  	for (i = 0; i < ncontig; i++, ptep++, addr += pgsize, pfn += dpfn)
>  		set_pte_at(vma->vm_mm, addr, ptep, pfn_pte(pfn, hugeprot));
>  
> -	return changed;
> +	return 1;
>  }
>  
>  void huge_ptep_set_wrprotect(struct mm_struct *mm,

Cheers,
Zhang, Lei Nov. 6, 2018, 1:16 p.m. UTC | #4
Hi, Steve

Sorry to late.
Thanks for your patch, and I will test your patch.

Best Regards,
Lei Zhang

> -----Original Message-----
> From: linux-arm-kernel
> [mailto:linux-arm-kernel-bounces@lists.infradead.org] On Behalf Of Steve
> Capper
> Sent: Thursday, September 20, 2018 2:45 AM
> To: linux-arm-kernel@lists.infradead.org
> Cc: catalin.marinas@arm.com; Steve Capper; will.deacon@arm.com; Zhang, Lei/
> 張 雷
> Subject: [PATCH] arm64: hugetlb: Avoid unnecessary clearing in
> huge_ptep_set_access_flags
> 
> For contiguous hugetlb, huge_ptep_set_access_flags performs a
> get_clear_flush (which then flushes the TLBs) even when no change of ptes
> is necessary.
> 
> Unfortunately, this behaviour can lead to back-to-back page faults being
> generated when running with multiple threads that access the same
> contiguous huge page.
> 
> Thread 1                     |  Thread 2
> -----------------------------+------------------------------
> hugetlb_fault                |
> huge_ptep_set_access_flags   |
>   -> invalidate pte range    | hugetlb_fault
> continue processing          | wait for hugetlb fault mutex
> release mutex and return     | huge_ptep_set_access_flags
>                              |   -> invalidate pte range
> hugetlb_fault
> ...
> 
> This patch changes huge_ptep_set_access_flags s.t. we first read the
> contiguous range of ptes (whilst preserving dirty information); the pte
> range is only then invalidated where necessary and this prevents further
> spurious page faults.
> 
> Fixes: d8bdcff28764 ("arm64: hugetlb: Add break-before-make logic for
> contiguous entries")
> Reported-by: Lei Zhang <zhang.lei@jp.fujitsu.com>
> Signed-off-by: Steve Capper <steve.capper@arm.com>
> 
> ---
> 
> I was unable to test this on any hardware as I'm away from the office.
> 
> Can you please test this Lei Zhang?
> 
> Cheers,
> --
> Steve
> ---
>  arch/arm64/mm/hugetlbpage.c | 36 ++++++++++++++++++++++++++++++++----
>  1 file changed, 32 insertions(+), 4 deletions(-)
> 
> diff --git a/arch/arm64/mm/hugetlbpage.c b/arch/arm64/mm/hugetlbpage.c
> index 192b3ba07075..76d229eb6ba1 100644
> --- a/arch/arm64/mm/hugetlbpage.c
> +++ b/arch/arm64/mm/hugetlbpage.c
> @@ -131,6 +131,27 @@ static pte_t get_clear_flush(struct mm_struct *mm,
>  	return orig_pte;
>  }
> 
> +static pte_t get_contig_pte(pte_t *ptep, unsigned long pgsize,
> +		unsigned long ncontig)
> +{
> +	unsigned long i;
> +	pte_t orig_pte = huge_ptep_get(ptep);
> +
> +	for (i = 0; i < ncontig; i++, ptep++) {
> +		pte_t pte = huge_ptep_get(ptep);
> +
> +		/*
> +		 * If HW_AFDBM is enabled, then the HW could turn on
> +		 * the dirty bit for any page in the set, so check
> +		 * them all.  All hugetlb entries are already young.
> +		 */
> +		if (pte_dirty(pte))
> +			orig_pte = pte_mkdirty(orig_pte);
> +	}
> +
> +	return orig_pte;
> +}
> +
>  /*
>   * Changing some bits of contiguous entries requires us to follow a
>   * Break-Before-Make approach, breaking the whole contiguous set
> @@ -324,7 +345,7 @@ int huge_ptep_set_access_flags(struct vm_area_struct
> *vma,
>  			       unsigned long addr, pte_t *ptep,
>  			       pte_t pte, int dirty)
>  {
> -	int ncontig, i, changed = 0;
> +	int ncontig, i;
>  	size_t pgsize = 0;
>  	unsigned long pfn = pte_pfn(pte), dpfn;
>  	pgprot_t hugeprot;
> @@ -336,9 +357,16 @@ int huge_ptep_set_access_flags(struct vm_area_struct
> *vma,
>  	ncontig = find_num_contig(vma->vm_mm, addr, ptep, &pgsize);
>  	dpfn = pgsize >> PAGE_SHIFT;
> 
> +	orig_pte = get_contig_pte(ptep, pgsize, ncontig);
> +	if (pte_same(orig_pte, pte))
> +		return 0;
> +
> +	/*
> +	 * we need to get our orig_pte again as HW DBM may have happened
> since
> +	 * above. get_clear_flush will ultimately cmpxchg with 0 to ensure
> +	 * that we can't lose any dirty information.
> +	 */
>  	orig_pte = get_clear_flush(vma->vm_mm, addr, ptep, pgsize,
> ncontig);
> -	if (!pte_same(orig_pte, pte))
> -		changed = 1;
> 
>  	/* Make sure we don't lose the dirty state */
>  	if (pte_dirty(orig_pte))
> @@ -348,7 +376,7 @@ int huge_ptep_set_access_flags(struct vm_area_struct
> *vma,
>  	for (i = 0; i < ncontig; i++, ptep++, addr += pgsize, pfn += dpfn)
>  		set_pte_at(vma->vm_mm, addr, ptep, pfn_pte(pfn,
> hugeprot));
> 
> -	return changed;
> +	return 1;
>  }
> 
>  void huge_ptep_set_wrprotect(struct mm_struct *mm,
> --
> 2.11.0
> 
> 
> _______________________________________________
> linux-arm-kernel mailing list
> linux-arm-kernel@lists.infradead.org
> http://lists.infradead.org/mailman/listinfo/linux-arm-kernel
Will Deacon Nov. 6, 2018, 5:14 p.m. UTC | #5
Hi Lei Zhang,

On Tue, Nov 06, 2018 at 01:16:01PM +0000, Zhang, Lei wrote:
> Sorry to late.
> Thanks for your patch, and I will test your patch.

Probably best to test the version that landed in mainline, as I think we
went through a few revisions of this:

http://git.kernel.org/linus/031e6e6b4e12

Thanks!

Will
Zhang, Lei Dec. 13, 2018, 12:25 a.m. UTC | #6
Hi, Will Deacon

> Probably best to test the version that landed in mainline, as I think we
> went through a few revisions of this:
Sorry for late.

I did test on linux-4.18, but even without
your patch, test program has been passed.
The test program links some libraries, and the
libraries's version is different from linux-4.18
environment and linxu-4.16 environment which I
found this problem.

I backport commits as below to linux-4.16,
and my test program passed.

Thanks a lot for your help.

commit 469ed9d823b7d240d6b9574f061ded7c3834c167 
and
commit 031e6e6b4e1277e76e73a6ab209095ad9bf3ce52

Test environment
CPU: A64FX
Kernel version: v4.16.0

Tested-by: Lei Zhang <zhang.lei@jp.fujitsu.com>

Best Regards,
Lei, Zhang
zhang.lei@jp.fujitsu.com


> -----Original Message-----
> From: Will Deacon [mailto:will.deacon@arm.com]
> Sent: Wednesday, November 07, 2018 2:15 AM
> To: Zhang, Lei/張 雷
> Cc: 'Steve Capper'; linux-arm-kernel@lists.infradead.org;
> catalin.marinas@arm.com
> Subject: Re: [PATCH] arm64: hugetlb: Avoid unnecessary clearing in
> huge_ptep_set_access_flags
> 
> Hi Lei Zhang,
> 
> On Tue, Nov 06, 2018 at 01:16:01PM +0000, Zhang, Lei wrote:
> > Sorry to late.
> > Thanks for your patch, and I will test your patch.
> 
> Probably best to test the version that landed in mainline, as I think we
> went through a few revisions of this:
> 
> http://git.kernel.org/linus/031e6e6b4e12
> 
> Thanks!
> 
> Will
diff mbox series

Patch

diff --git a/arch/arm64/mm/hugetlbpage.c b/arch/arm64/mm/hugetlbpage.c
index 192b3ba07075..76d229eb6ba1 100644
--- a/arch/arm64/mm/hugetlbpage.c
+++ b/arch/arm64/mm/hugetlbpage.c
@@ -131,6 +131,27 @@  static pte_t get_clear_flush(struct mm_struct *mm,
 	return orig_pte;
 }
 
+static pte_t get_contig_pte(pte_t *ptep, unsigned long pgsize,
+		unsigned long ncontig)
+{
+	unsigned long i;
+	pte_t orig_pte = huge_ptep_get(ptep);
+
+	for (i = 0; i < ncontig; i++, ptep++) {
+		pte_t pte = huge_ptep_get(ptep);
+
+		/*
+		 * If HW_AFDBM is enabled, then the HW could turn on
+		 * the dirty bit for any page in the set, so check
+		 * them all.  All hugetlb entries are already young.
+		 */
+		if (pte_dirty(pte))
+			orig_pte = pte_mkdirty(orig_pte);
+	}
+
+	return orig_pte;
+}
+
 /*
  * Changing some bits of contiguous entries requires us to follow a
  * Break-Before-Make approach, breaking the whole contiguous set
@@ -324,7 +345,7 @@  int huge_ptep_set_access_flags(struct vm_area_struct *vma,
 			       unsigned long addr, pte_t *ptep,
 			       pte_t pte, int dirty)
 {
-	int ncontig, i, changed = 0;
+	int ncontig, i;
 	size_t pgsize = 0;
 	unsigned long pfn = pte_pfn(pte), dpfn;
 	pgprot_t hugeprot;
@@ -336,9 +357,16 @@  int huge_ptep_set_access_flags(struct vm_area_struct *vma,
 	ncontig = find_num_contig(vma->vm_mm, addr, ptep, &pgsize);
 	dpfn = pgsize >> PAGE_SHIFT;
 
+	orig_pte = get_contig_pte(ptep, pgsize, ncontig);
+	if (pte_same(orig_pte, pte))
+		return 0;
+
+	/*
+	 * we need to get our orig_pte again as HW DBM may have happened since
+	 * above. get_clear_flush will ultimately cmpxchg with 0 to ensure
+	 * that we can't lose any dirty information.
+	 */
 	orig_pte = get_clear_flush(vma->vm_mm, addr, ptep, pgsize, ncontig);
-	if (!pte_same(orig_pte, pte))
-		changed = 1;
 
 	/* Make sure we don't lose the dirty state */
 	if (pte_dirty(orig_pte))
@@ -348,7 +376,7 @@  int huge_ptep_set_access_flags(struct vm_area_struct *vma,
 	for (i = 0; i < ncontig; i++, ptep++, addr += pgsize, pfn += dpfn)
 		set_pte_at(vma->vm_mm, addr, ptep, pfn_pte(pfn, hugeprot));
 
-	return changed;
+	return 1;
 }
 
 void huge_ptep_set_wrprotect(struct mm_struct *mm,