[03/11] mm: page_vma_mapped_walk(): use pmd_read_atomic()

Message ID 594c1f0-d396-5346-1f36-606872cddb18@google.com (mailing list archive)
State New, archived
Series mm: page_vma_mapped_walk() cleanup and THP fixes

Commit Message

Hugh Dickins June 10, 2021, 6:38 a.m. UTC
page_vma_mapped_walk() cleanup: use pmd_read_atomic() with barrier()
instead of READ_ONCE() for pmde: some architectures (e.g. i386 with PAE)
have a multi-word pmd entry, for which READ_ONCE() is not good enough.

Signed-off-by: Hugh Dickins <hughd@google.com>
Cc: <stable@vger.kernel.org>
---
 mm/page_vma_mapped.c | 5 ++++-
 1 file changed, 4 insertions(+), 1 deletion(-)

Comments

Kirill A. Shutemov June 10, 2021, 9:06 a.m. UTC | #1
On Wed, Jun 09, 2021 at 11:38:11PM -0700, Hugh Dickins wrote:
> page_vma_mapped_walk() cleanup: use pmd_read_atomic() with barrier()
> instead of READ_ONCE() for pmde: some architectures (e.g. i386 with PAE)
> have a multi-word pmd entry, for which READ_ONCE() is not good enough.
> 
> Signed-off-by: Hugh Dickins <hughd@google.com>
> Cc: <stable@vger.kernel.org>
> ---
>  mm/page_vma_mapped.c | 5 ++++-
>  1 file changed, 4 insertions(+), 1 deletion(-)
> 
> diff --git a/mm/page_vma_mapped.c b/mm/page_vma_mapped.c
> index 7c0504641fb8..973c3c4e72cc 100644
> --- a/mm/page_vma_mapped.c
> +++ b/mm/page_vma_mapped.c
> @@ -182,13 +182,16 @@ bool page_vma_mapped_walk(struct page_vma_mapped_walk *pvmw)
>  	pud = pud_offset(p4d, pvmw->address);
>  	if (!pud_present(*pud))
>  		return false;
> +
>  	pvmw->pmd = pmd_offset(pud, pvmw->address);
>  	/*
>  	 * Make sure the pmd value isn't cached in a register by the
>  	 * compiler and used as a stale value after we've observed a
>  	 * subsequent update.
>  	 */
> -	pmde = READ_ONCE(*pvmw->pmd);
> +	pmde = pmd_read_atomic(pvmw->pmd);
> +	barrier();
> +

Hm. It makes me wonder if barrier() has to be part of pmd_read_atomic().
mm/hmm.c uses the same pattern as you are and I tend to think that the
rest of pmd_read_atomic() users may be broken.

Am I wrong?
Jason Gunthorpe June 10, 2021, 12:15 p.m. UTC | #2
On Thu, Jun 10, 2021 at 12:06:17PM +0300, Kirill A. Shutemov wrote:
> On Wed, Jun 09, 2021 at 11:38:11PM -0700, Hugh Dickins wrote:
> > page_vma_mapped_walk() cleanup: use pmd_read_atomic() with barrier()
> > instead of READ_ONCE() for pmde: some architectures (e.g. i386 with PAE)
> > have a multi-word pmd entry, for which READ_ONCE() is not good enough.
> > 
> > Signed-off-by: Hugh Dickins <hughd@google.com>
> > Cc: <stable@vger.kernel.org>
> >  mm/page_vma_mapped.c | 5 ++++-
> >  1 file changed, 4 insertions(+), 1 deletion(-)
> > 
> > diff --git a/mm/page_vma_mapped.c b/mm/page_vma_mapped.c
> > index 7c0504641fb8..973c3c4e72cc 100644
> > +++ b/mm/page_vma_mapped.c
> > @@ -182,13 +182,16 @@ bool page_vma_mapped_walk(struct page_vma_mapped_walk *pvmw)
> >  	pud = pud_offset(p4d, pvmw->address);
> >  	if (!pud_present(*pud))
> >  		return false;
> > +
> >  	pvmw->pmd = pmd_offset(pud, pvmw->address);
> >  	/*
> >  	 * Make sure the pmd value isn't cached in a register by the
> >  	 * compiler and used as a stale value after we've observed a
> >  	 * subsequent update.
> >  	 */
> > -	pmde = READ_ONCE(*pvmw->pmd);
> > +	pmde = pmd_read_atomic(pvmw->pmd);
> > +	barrier();
> > +
> 
> Hm. It makes me wonder if barrier() has to be part of pmd_read_atomic().
> mm/hmm.c uses the same pattern as you are and I tend to think that the
> rest of pmd_read_atomic() users may be broken.
> 
> Am I wrong?

I agree with you, something called _atomic should not require the
caller to provide barriers.

I think the issue is simply that the two implementations of
pmd_read_atomic() should use READ_ONCE() internally, no?

Jason
Jason Gunthorpe June 11, 2021, 3:36 p.m. UTC | #3
On Thu, Jun 10, 2021 at 11:37:14PM -0700, Hugh Dickins wrote:
> On Thu, 10 Jun 2021, Jason Gunthorpe wrote:
> > On Thu, Jun 10, 2021 at 12:06:17PM +0300, Kirill A. Shutemov wrote:
> > > On Wed, Jun 09, 2021 at 11:38:11PM -0700, Hugh Dickins wrote:
> > > > page_vma_mapped_walk() cleanup: use pmd_read_atomic() with barrier()
> > > > instead of READ_ONCE() for pmde: some architectures (e.g. i386 with PAE)
> > > > have a multi-word pmd entry, for which READ_ONCE() is not good enough.
> > > > 
> > > > Signed-off-by: Hugh Dickins <hughd@google.com>
> > > > Cc: <stable@vger.kernel.org>
> > > >  mm/page_vma_mapped.c | 5 ++++-
> > > >  1 file changed, 4 insertions(+), 1 deletion(-)
> > > > 
> > > > diff --git a/mm/page_vma_mapped.c b/mm/page_vma_mapped.c
> > > > index 7c0504641fb8..973c3c4e72cc 100644
> > > > +++ b/mm/page_vma_mapped.c
> > > > @@ -182,13 +182,16 @@ bool page_vma_mapped_walk(struct page_vma_mapped_walk *pvmw)
> > > >  	pud = pud_offset(p4d, pvmw->address);
> > > >  	if (!pud_present(*pud))
> > > >  		return false;
> > > > +
> > > >  	pvmw->pmd = pmd_offset(pud, pvmw->address);
> > > >  	/*
> > > >  	 * Make sure the pmd value isn't cached in a register by the
> > > >  	 * compiler and used as a stale value after we've observed a
> > > >  	 * subsequent update.
> > > >  	 */
> > > > -	pmde = READ_ONCE(*pvmw->pmd);
> > > > +	pmde = pmd_read_atomic(pvmw->pmd);
> > > > +	barrier();
> > > > +
> > > 
> > > Hm. It makes me wonder if barrier() has to be part of pmd_read_atomic().
> > > mm/hmm.c uses the same pattern as you are and I tend to think that the
> > > rest of pmd_read_atomic() users may be broken.
> > > 
> > > Am I wrong?
> > 
> > I agree with you, something called _atomic should not require the
> > caller to provide barriers.
> > 
> > I think the issue is simply that the two implementations of
> > pmd_read_atomic() should use READ_ONCE() internally, no?
> 
> I've had great difficulty coming up with answers for you.
> 
> This patch was based on two notions I've had lodged in my mind
> for several years:
> 
> 1) that pmd_read_atomic() is the creme-de-la-creme for reading a pmd_t
> atomically (even if the pmd_t spans multiple words); but reading again
> after all this time the comment above it, it seems to be more specialized
> than I'd thought (biggest selling point being for when you want to check
> pmd_none(), which we don't).  And has no READ_ONCE() or barrier() inside,
> so really needs that barrier() to be as safe as the previous READ_ONCE().

IMHO it is a simple bug that the 64 bit version of this is not marked
with READ_ONCE() internally. It is reading data without a lock, and AFAIK
our modern understanding of the memory model says that requires
READ_ONCE().

diff --git a/arch/x86/include/asm/pgtable-3level.h b/arch/x86/include/asm/pgtable-3level.h
index e896ebef8c24cb..0bf1fdec928e71 100644
--- a/arch/x86/include/asm/pgtable-3level.h
+++ b/arch/x86/include/asm/pgtable-3level.h
@@ -75,7 +75,7 @@ static inline void native_set_pte(pte_t *ptep, pte_t pte)
 static inline pmd_t pmd_read_atomic(pmd_t *pmdp)
 {
 	pmdval_t ret;
-	u32 *tmp = (u32 *)pmdp;
+	u32 *tmp = READ_ONCE((u32 *)pmdp);
 
 	ret = (pmdval_t) (*tmp);
 	if (ret) {
@@ -84,7 +84,7 @@ static inline pmd_t pmd_read_atomic(pmd_t *pmdp)
 		 * or we can end up with a partial pmd.
 		 */
 		smp_rmb();
-		ret |= ((pmdval_t)*(tmp + 1)) << 32;
+		ret |= READ_ONCE((pmdval_t)*(tmp + 1)) << 32;
 	}
 
 	return (pmd_t) { ret };
diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
index 46b13780c2c8c9..c8c7a3307d2773 100644
--- a/include/linux/pgtable.h
+++ b/include/linux/pgtable.h
@@ -1228,7 +1228,7 @@ static inline pmd_t pmd_read_atomic(pmd_t *pmdp)
 	 * only going to work, if the pmdval_t isn't larger than
 	 * an unsigned long.
 	 */
-	return *pmdp;
+	return READ_ONCE(*pmdp);
 }
 #endif
 

> 2) the barrier() in mm_find_pmd(), that replaced an earlier READ_ONCE(),
> because READ_ONCE() did not work (did not give the necessary guarantee? or
> did not build?) on architectures with multiple word pmd_ts e.g. i386 PAE.

This is really interesting, the git history e37c69827063 ("mm: replace
ACCESS_ONCE with READ_ONCE or barriers") says the READ_ONCE was
dropped here "because it doesn't work on non-scalar types" due to a
(now 8 year old) gcc bug.

According to the gcc bug READ_ONCE() on anything that is a scalar
sized struct triggers GCC to ignore the READ_ONCEyness. To work around
this bug, READ_ONCE can never be used on any of the struct
protected page table elements. While I am not 100% sure, it looks like
this is a pre gcc 4.9 bug, and since gcc 4.9 is now the minimum
required compiler this is all just cruft. Use READ_ONCE() here too...

> But I've now come across some changes that Will Deacon made last year:
> the include/asm-generic/rwonce.h READ_ONCE() now appears to allow for
> native word type *or* type sizeof(long long) (e.g. i386 PAE) - given
> "a strong prevailing wind" anyway :)  And 8e958839e4b9 ("sparc32: mm:
> Restructure sparc32 MMU page-table layout") put an end to sparc32's
> typedef struct { unsigned long pmdv[16]; } pmd_t.

Indeed, that is good news

> As to your questions about pmd_read_atomic() usage elsewhere:
> please don't force me to think so hard!  (And you've set me half-
> wondering, whether there are sneaky THP transitions, perhaps of the
> "unstable" kind, that page_vma_mapped_walk() should be paying more
> attention to: but for sanity's sake I won't go there, not now.)

If I recall, (and I didn't recheck this right now) the only thing
pmd_read_atomic() provides is the special property that if the pmd's
flags are observed to point to a pte table then pmd_read_atomic() will
reliably return the pte table pointer.

Otherwise it returns the flags and a garbage pointer because under the
THP protocol a PMD pointing at a page can be converted to a PTE table
if you hold the mmap sem in read mode. Upgrading a PTE table to a PMD
page requires mmap sem in write mode so once a PTE table is observed
'locklessly' the value is stable.. Or at least so says the documentation

pmd_read_atomic() is badly named, tricky to use, and missing the
READ_ONCE() because it is so old..
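
Very roughly, that property is meant to let a lockless walker do
something like this (a sketch only, not code from anywhere in mm/; the
function name is made up and the huge/devmap/bad cases are elided):

	static pte_t *lockless_pte_lookup(pmd_t *pmdp, unsigned long address)
	{
		pmd_t pmdval = pmd_read_atomic(pmdp);

		barrier();	/* today's callers still add this by hand */

		if (pmd_none(pmdval) || pmd_trans_huge(pmdval))
			return NULL;	/* no stable pte table to walk */

		/*
		 * pmdval was seen pointing to a pte table; collapsing that
		 * back into a huge pmd needs mmap sem held for write, so
		 * walking from the local copy is safe without the pmd lock.
		 */
		return pte_offset_map(&pmdval, address);
	}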

As far as is page_vma_mapped_walk correct.. Everything except
is_pmd_migration_entry() is fine to my eye, and I simply don't know
the rules around migration entries to know right/wrong.

I suspect it probably is required to manipulate a migration entry
while holding the mmap_sem in write mode??

There is some list of changes to the page table that require holding
the mmap sem in write mode that I've never seen documented for..

Jason
Hugh Dickins June 11, 2021, 7:05 p.m. UTC | #4
On Fri, 11 Jun 2021, Jason Gunthorpe wrote:
> On Thu, Jun 10, 2021 at 11:37:14PM -0700, Hugh Dickins wrote:
> > On Thu, 10 Jun 2021, Jason Gunthorpe wrote:
> > > On Thu, Jun 10, 2021 at 12:06:17PM +0300, Kirill A. Shutemov wrote:
> > > > On Wed, Jun 09, 2021 at 11:38:11PM -0700, Hugh Dickins wrote:
> > > > > page_vma_mapped_walk() cleanup: use pmd_read_atomic() with barrier()
> > > > > instead of READ_ONCE() for pmde: some architectures (e.g. i386 with PAE)
> > > > > have a multi-word pmd entry, for which READ_ONCE() is not good enough.
> > > > > 
> > > > > Signed-off-by: Hugh Dickins <hughd@google.com>
> > > > > Cc: <stable@vger.kernel.org>
> > > > >  mm/page_vma_mapped.c | 5 ++++-
> > > > >  1 file changed, 4 insertions(+), 1 deletion(-)
> > > > > 
> > > > > diff --git a/mm/page_vma_mapped.c b/mm/page_vma_mapped.c
> > > > > index 7c0504641fb8..973c3c4e72cc 100644
> > > > > +++ b/mm/page_vma_mapped.c
> > > > > @@ -182,13 +182,16 @@ bool page_vma_mapped_walk(struct page_vma_mapped_walk *pvmw)
> > > > >  	pud = pud_offset(p4d, pvmw->address);
> > > > >  	if (!pud_present(*pud))
> > > > >  		return false;
> > > > > +
> > > > >  	pvmw->pmd = pmd_offset(pud, pvmw->address);
> > > > >  	/*
> > > > >  	 * Make sure the pmd value isn't cached in a register by the
> > > > >  	 * compiler and used as a stale value after we've observed a
> > > > >  	 * subsequent update.
> > > > >  	 */
> > > > > -	pmde = READ_ONCE(*pvmw->pmd);
> > > > > +	pmde = pmd_read_atomic(pvmw->pmd);
> > > > > +	barrier();
> > > > > +
> > > > 
> > > > Hm. It makes me wonder if barrier() has to be part of pmd_read_atomic().
> > > > mm/hmm.c uses the same pattern as you are and I tend to think that the
> > > > rest of pmd_read_atomic() users may be broken.
> > > > 
> > > > Am I wrong?
> > > 
> > > I agree with you, something called _atomic should not require the
> > > caller to provide barriers.
> > > 
> > > I think the issue is simply that the two implementations of
> > > pmd_read_atomic() should use READ_ONCE() internally, no?
> > 
> > I've had great difficulty coming up with answers for you.
> > 
> > This patch was based on two notions I've had lodged in my mind
> > for several years:
> > 
> > 1) that pmd_read_atomic() is the creme-de-la-creme for reading a pmd_t
> > atomically (even if the pmd_t spans multiple words); but reading again
> > after all this time the comment above it, it seems to be more specialized
> > than I'd thought (biggest selling point being for when you want to check
> > pmd_none(), which we don't).  And has no READ_ONCE() or barrier() inside,
> > so really needs that barrier() to be as safe as the previous READ_ONCE().
> 
> IMHO it is a simple bug that the 64 bit version of this is not marked
> with READ_ONCE() internally. It is reading data without a lock, AFAIK
> our modern understanding of the memory model is that requires
> READ_ONCE().
> 
> diff --git a/arch/x86/include/asm/pgtable-3level.h b/arch/x86/include/asm/pgtable-3level.h
> index e896ebef8c24cb..0bf1fdec928e71 100644
> --- a/arch/x86/include/asm/pgtable-3level.h
> +++ b/arch/x86/include/asm/pgtable-3level.h
> @@ -75,7 +75,7 @@ static inline void native_set_pte(pte_t *ptep, pte_t pte)
>  static inline pmd_t pmd_read_atomic(pmd_t *pmdp)
>  {
>  	pmdval_t ret;
> -	u32 *tmp = (u32 *)pmdp;
> +	u32 *tmp = READ_ONCE((u32 *)pmdp);
>  
>  	ret = (pmdval_t) (*tmp);
>  	if (ret) {
> @@ -84,7 +84,7 @@ static inline pmd_t pmd_read_atomic(pmd_t *pmdp)
>  		 * or we can end up with a partial pmd.
>  		 */
>  		smp_rmb();
> -		ret |= ((pmdval_t)*(tmp + 1)) << 32;
> +		ret |= READ_ONCE((pmdval_t)*(tmp + 1)) << 32;
>  	}

Maybe that.  Or maybe now (since Will's changes) it can just do
one READ_ONCE() of the whole, then adjust its local copy.

>  
>  	return (pmd_t) { ret };
> diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
> index 46b13780c2c8c9..c8c7a3307d2773 100644
> --- a/include/linux/pgtable.h
> +++ b/include/linux/pgtable.h
> @@ -1228,7 +1228,7 @@ static inline pmd_t pmd_read_atomic(pmd_t *pmdp)
>  	 * only going to work, if the pmdval_t isn't larger than
>  	 * an unsigned long.
>  	 */
> -	return *pmdp;
> +	return READ_ONCE(*pmdp);
>  }
>  #endif

And it should now be possible to delete the #ifdef THP barrier() in
function_with_long_name_I_didn't_look_up() which calls pmd_read_atomic().

>  
> 
> > 2) the barrier() in mm_find_pmd(), that replaced an earlier READ_ONCE(),
> > because READ_ONCE() did not work (did not give the necessary guarantee? or
> > did not build?) on architectures with multiple word pmd_ts e.g. i386 PAE.
> 
> This is really interesting, the git history e37c69827063 ("mm: replace
> ACCESS_ONCE with READ_ONCE or barriers") says the READ_ONCE was
> dropped here "because it doesn't work on non-scalar types" due to a
> (now 8 year old) gcc bug.
> 
> According to the gcc bug READ_ONCE() on anything that is a scalar
> sized struct triggers GCC to ignore the READ_ONCEyness. To work around
> this bug then READ_ONCE can never be used on any of the struct
> protected page table elements. While I am not 100% sure, it looks like
> this is a pre gcc 4.9 bug, and since gcc 4.9 is now the minimum
> required compiler this is all just cruft. Use READ_ONCE() here too...

Oh, thanks particularly for looking into the gcc end of it:
I never knew that part of the story.

> 
> > But I've now come across some changes that Will Deacon made last year:
> > the include/asm-generic/rwonce.h READ_ONCE() now appears to allow for
> > native word type *or* type sizeof(long long) (e.g. i386 PAE) - given
> > "a strong prevailing wind" anyway :)  And 8e958839e4b9 ("sparc32: mm:
> > Restructure sparc32 MMU page-table layout") put an end to sparc32's
> > typedef struct { unsigned long pmdv[16]; } pmd_t.
> 
> Indeed, that is good news
> 
> > As to your questions about pmd_read_atomic() usage elsewhere:
> > please don't force me to think so hard!  (And you've set me half-
> > wondering, whether there are sneaky THP transitions, perhaps of the
> > "unstable" kind, that page_vma_mapped_walk() should be paying more
> > attention to: but for sanity's sake I won't go there, not now.)
> 
> If I recall, (and I didn't recheck this right now) the only thing
> pmd_read_atomic() provides is the special property that if the pmd's
> flags are observed to point to a pte table then pmd_read_atomic() will
> reliably return the pte table pointer.

I expect you're right, I haven't rechecked.  But it does also provide that
special guarantee on matching pmd_none() when half matches pmd_none():
which one or more callers want, but irrelevant where I added it.

> 
> Otherwise it returns the flags and a garbage pointer because under the
> THP protocol a PMD pointing at a page can be converted to a PTE table
> if you hold the mmap sem in read mode. Upgrading a PTE table to a PMD
> page requires mmap sem in write mode so once a PTE table is observed
> 'locklessly' the value is stable.. Or at least so says the documentation
> 
> pmd_read_atomic() is badly named, tricky to use, and missing the
> READ_ONCE() because it is so old..

I think yes to each of those three points.

> 
> As far as is page_vma_mapped_walk correct.. Everything except
> is_pmd_migration_entry() is fine to my eye, and I simply don't know
> the rules around migration entries to know right/wrong.

So long as the swap "type" is entirely in the same word as the pte
flags would be, I think is_pmd_migration_entry() should be fine.
There it's just looking for a hint as to whether it is worth taking
pmd_lock() to obtain a reliable pmde.

> 
> I suspect it probably is required to manipulate a migration entry
> while holding the mmap_sem in write mode??

I don't think in terms of mmap_sem/mmap_lock here: this is on the
rmap lookup path, and typically mmap_lock is not held (whereas I
was surprised to find the comment above pmd_read_atomic() saying a
lot about it - another reason it's inappropriate in pvmw I guess).

The important lock here is pmd_lock() a.k.a. pmd_trans_huge_lock():
get it when it's necessary, but (the tricky part) try to avoid the
overhead and contention in getting it when not necessary.  Before
THP it was easy, but various THP transitions make it hard.
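
In pvmw that works out to roughly this shape (paraphrasing the code
this patch touches, see the hunk at the end of the page; the comments
here are mine):

	pmde = READ_ONCE(*pvmw->pmd);		/* lockless, may be stale */

	if (pmd_trans_huge(pmde) || is_pmd_migration_entry(pmde)) {
		/* worth paying for the lock: take it and re-read under it */
		pvmw->ptl = pmd_lock(mm, pvmw->pmd);
		if (likely(pmd_trans_huge(*pvmw->pmd))) {
			/* the huge pmd is now stable until we unlock */
			...
		}
	} else {
		/* looks like a pte table: go to the pte level, no pmd_lock */
	}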

> 
> There is some list of changes to the page table that require holding
> the mmap sem in write mode that I've never seen documented for..

That is a subtler subject than I dare get into at the moment.
But if you're just doing lookups, pmd_lock() is enough.

I'll direct a further mail in this thread to Andrew now,
asking him to drop this patch, when convenient.

Hugh
Hugh Dickins June 11, 2021, 7:33 p.m. UTC | #5
On Fri, 11 Jun 2021, Jason Gunthorpe wrote:
> On Thu, Jun 10, 2021 at 11:37:14PM -0700, Hugh Dickins wrote:
> > On Thu, 10 Jun 2021, Jason Gunthorpe wrote:
> > > On Thu, Jun 10, 2021 at 12:06:17PM +0300, Kirill A. Shutemov wrote:
> > > > On Wed, Jun 09, 2021 at 11:38:11PM -0700, Hugh Dickins wrote:
> > > > > page_vma_mapped_walk() cleanup: use pmd_read_atomic() with barrier()
> > > > > instead of READ_ONCE() for pmde: some architectures (e.g. i386 with PAE)
> > > > > have a multi-word pmd entry, for which READ_ONCE() is not good enough.
> > > > > 
> > > > > Signed-off-by: Hugh Dickins <hughd@google.com>
> > > > > Cc: <stable@vger.kernel.org>
> > > > >  mm/page_vma_mapped.c | 5 ++++-
> > > > >  1 file changed, 4 insertions(+), 1 deletion(-)
> > > > > 
> > > > > diff --git a/mm/page_vma_mapped.c b/mm/page_vma_mapped.c
> > > > > index 7c0504641fb8..973c3c4e72cc 100644
> > > > > +++ b/mm/page_vma_mapped.c
> > > > > @@ -182,13 +182,16 @@ bool page_vma_mapped_walk(struct page_vma_mapped_walk *pvmw)
> > > > >  	pud = pud_offset(p4d, pvmw->address);
> > > > >  	if (!pud_present(*pud))
> > > > >  		return false;
> > > > > +
> > > > >  	pvmw->pmd = pmd_offset(pud, pvmw->address);
> > > > >  	/*
> > > > >  	 * Make sure the pmd value isn't cached in a register by the
> > > > >  	 * compiler and used as a stale value after we've observed a
> > > > >  	 * subsequent update.
> > > > >  	 */
> > > > > -	pmde = READ_ONCE(*pvmw->pmd);
> > > > > +	pmde = pmd_read_atomic(pvmw->pmd);
> > > > > +	barrier();
> > > > > +

Andrew, please drop this patch from your queue: it's harmless,
but pretending to do some good when it does not.  The situation has
changed since I originally wrote it: the gcc minimum version has been
raised, Will Deacon corrected the applicability of READ_ONCE() to pmd_t
in his April 2020 commits, pmd_read_atomic() is not as magic as I thought
(it has good uses, but not here), and the commit comment is no longer right.

However... if you're nearing the point of a fresh mmotm, probably best
to leave it in for now and we sort out the mess later.  Because although
it's functionally independent from the other patches in the series,
there is of course the "indentation" patch, and this falls in the middle
of what's indented a level there.

I don't imagine that will tax your abilities to their limit, but after
that there's the main bugfix patch, which expects a blank line I added
in this one, a line that's no longer there once this is reverted.  Enough
to make me sigh, and just write to say "drop, when you have time".

Let me know if you'd prefer a resend of the series (today? not sure).

(Discussion of pmd_read_atomic() and barrier() and mm_find_pmd() and
suchlike may continue in this thread, but this 03/11 should just be
dropped.  Whether it's in or out should not affect Alistair's series.)

Thanks,
Hugh
Jason Gunthorpe June 11, 2021, 7:42 p.m. UTC | #6
On Fri, Jun 11, 2021 at 12:05:42PM -0700, Hugh Dickins wrote:
> > diff --git a/arch/x86/include/asm/pgtable-3level.h b/arch/x86/include/asm/pgtable-3level.h
> > index e896ebef8c24cb..0bf1fdec928e71 100644
> > +++ b/arch/x86/include/asm/pgtable-3level.h
> > @@ -75,7 +75,7 @@ static inline void native_set_pte(pte_t *ptep, pte_t pte)
> >  static inline pmd_t pmd_read_atomic(pmd_t *pmdp)
> >  {
> >  	pmdval_t ret;
> > -	u32 *tmp = (u32 *)pmdp;
> > +	u32 *tmp = READ_ONCE((u32 *)pmdp);
> >  
> >  	ret = (pmdval_t) (*tmp);
> >  	if (ret) {
> > @@ -84,7 +84,7 @@ static inline pmd_t pmd_read_atomic(pmd_t *pmdp)
> >  		 * or we can end up with a partial pmd.
> >  		 */
> >  		smp_rmb();
> > -		ret |= ((pmdval_t)*(tmp + 1)) << 32;
> > +		ret |= READ_ONCE((pmdval_t)*(tmp + 1)) << 32;
> >  	}
> 
> Maybe that.  Or maybe now (since Will's changes) it can just do
> one READ_ONCE() of the whole, then adjust its local copy.

I think the smp_rmb() is critical here to ensure a PTE table pointer
is coherent, READ_ONCE is not a substitute, unless I am
misunderstanding what Will's changes are???

> > @@ -1228,7 +1228,7 @@ static inline pmd_t pmd_read_atomic(pmd_t *pmdp)
> >  	 * only going to work, if the pmdval_t isn't larger than
> >  	 * an unsigned long.
> >  	 */
> > -	return *pmdp;
> > +	return READ_ONCE(*pmdp);
> >  }
> >  #endif
> 
> And it should now be possible to delete the #ifdef THP barrier() in
> function_with_long_name_I_didn't_look_up() which calls pmd_read_atomic().

Yes - I think I know what you mean :)
 
> > If I recall, (and I didn't recheck this right now) the only thing
> > pmd_read_atomic() provides is the special property that if the pmd's
> > flags are observed to point to a pte table then pmd_read_atomic() will
> > reliably return the pte table pointer.
> 
> I expect you're right, I haven't rechecked.  But it does also provide that
> special guarantee on matching pmd_none() when half matches pmd_none():
> which one or more callers want, but irrelevant where I added it.

Ah, this I don't know much about.. Hum, I'd have to think about it way
too much to have an opinion on whether the races around pmd_none are
meaningful enough to resolve with _atomic vs READ_ONCE()

> > As far as is page_vma_mapped_walk correct.. Everything except
> > is_pmd_migration_entry() is fine to my eye, and I simply don't know
> > the rules around migration entries to know right/wrong.
> 
> So long as the swap "type" is entirely in the same word as the pte
> flags would be, I think is_pmd_migration_entry() should be fine.
> There it's just looking for a hint, is it worth taking pmd_lock()
> to obtain a reliable pmde.

I wonder if anyone knows to guarantee that in the arches?

> > I suspect it probably is required to manipulate a migration entry
> > while holding the mmap_sem in write mode??
> 
> I don't think in terms of mmap_sem/mmap_lock here: this is on the
> rmap lookup path, and typically mmap_lock is not held (whereas I
> was surprised to find the comment above pmd_read_atomic() saying a
> lot about it - another reason it's inappropriate in pvmw I guess).

Ah.. Honestly I'm not familiar with all the ways to lock a VMA so that
the page tables it spans are guaranteed not to be freed.

I just saw the _offset() ladder without any page table spinlocks and
knew this was one of the lockless walkers that, somehow, relies on
another lock to ensure the page table level memory is not concurrently
freed.

In that case I take it back, I have no idea if this is correct or not
because it calls map_pte() which does pte_offset_map() based on that
READ_ONCE.

pte_offset_map() without holding the pmd_lock is only OK if you know
that the pmd can't be upgraded to a THP - and the only locking for
that I have memorized is the mmap lock :\ 

I can't tell what other lock is protecting the page tables here or if
that lock is held during the THP upgrade process.. Sorry

> > There is some list of changes to the page table that require
> > holding the mmap sem in write mode that I've never seen documented
> > for..
>
> That is a subtler subject than I dare get into at the moment.
> But if you're just doing lookups, pmd_lock() is enough.

It is enough if you take it, but I don't see page_vma_mapped_walk() taking
it in the map_pte() flow :\

Jason
Will Deacon June 15, 2021, 9:46 a.m. UTC | #7
On Fri, Jun 11, 2021 at 04:42:49PM -0300, Jason Gunthorpe wrote:
> On Fri, Jun 11, 2021 at 12:05:42PM -0700, Hugh Dickins wrote:
> > > diff --git a/arch/x86/include/asm/pgtable-3level.h b/arch/x86/include/asm/pgtable-3level.h
> > > index e896ebef8c24cb..0bf1fdec928e71 100644
> > > +++ b/arch/x86/include/asm/pgtable-3level.h
> > > @@ -75,7 +75,7 @@ static inline void native_set_pte(pte_t *ptep, pte_t pte)
> > >  static inline pmd_t pmd_read_atomic(pmd_t *pmdp)
> > >  {
> > >  	pmdval_t ret;
> > > -	u32 *tmp = (u32 *)pmdp;
> > > +	u32 *tmp = READ_ONCE((u32 *)pmdp);
> > >  
> > >  	ret = (pmdval_t) (*tmp);
> > >  	if (ret) {
> > > @@ -84,7 +84,7 @@ static inline pmd_t pmd_read_atomic(pmd_t *pmdp)
> > >  		 * or we can end up with a partial pmd.
> > >  		 */
> > >  		smp_rmb();
> > > -		ret |= ((pmdval_t)*(tmp + 1)) << 32;
> > > +		ret |= READ_ONCE((pmdval_t)*(tmp + 1)) << 32;
> > >  	}
> > 
> > Maybe that.  Or maybe now (since Will's changes) it can just do
> > one READ_ONCE() of the whole, then adjust its local copy.
> 
> I think the smp_rmb() is critical here to ensure a PTE table pointer
> is coherent, READ_ONCE is not a substitute, unless I am
> misunderstanding what Will's changes are???

Yes, I agree that the barrier is needed here for x86 PAE. I would really
have liked to enforce native-sized access in READ_ONCE(), but unfortunately
there is plenty of code out there which is resilient to a 64-bit access
being split into two separate 32-bit accesses and so I wasn't able to go
that far.

That being said, pmd_read_atomic() probably _should_ be using READ_ONCE()
because using it inconsistently can give rise to broken codegen, e.g. if
you do:

	pmdval_t x, y, z;

	x = *pmdp;			// Invalid
	y = READ_ONCE(*pmdp);		// Valid
	if (pmd_valid(y))
		z = *pmdp;		// Invalid again!

Then the compiler can allocate the same register for x and z, but will
issue an additional load for y. If a concurrent update takes place to the
pmd which transitions from Invalid -> Valid, then it will look as though
things went back in time, because z will be stale. We actually hit this
on arm64 in practice [1].

Will

[1] https://lore.kernel.org/lkml/20171003114244.430374928@linuxfoundation.org/
Jason Gunthorpe June 16, 2021, 12:42 a.m. UTC | #8
On Tue, Jun 15, 2021 at 10:46:39AM +0100, Will Deacon wrote:

> Then the compiler can allocate the same register for x and z, but will
> issue an additional load for y. If a concurrent update takes place to the
> pmd which transitions from Invalid -> Valid, then it will look as though
> things went back in time, because z will be stale. We actually hit this
> on arm64 in practice [1].

The fact you actually hit this in the real world just seems to confirm
my thinking that the mm's lax use of the memory model is something
that deserves addressing.

Honestly I'm not sure the fix to stick a READ_ONCE in the macros is
very robust. I prefer the gup_fast pattern of:

  pmd_t pmd = READ_ONCE(*pmdp);
  pte_offset_phys(&pmd, addr);

To correctly force the READ_ONCE under unlocked access and then
consistently use the single read of the unstable data.

It seems more maintainable 'hey look at me, I have no locks!' and has
fewer possibilities for obscure order related bugs to creep in.
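
Spelled out a little more (a from-memory sketch of the gup_fast shape,
not a copy of mm/gup.c; pte_range() here is a made-up placeholder):

	pmd_t pmd = READ_ONCE(*pmdp);	/* one snapshot of the unstable entry */

	if (pmd_none(pmd) || pmd_trans_huge(pmd))
		return 0;
	/*
	 * Every later test and the pte walk key off the local copy,
	 * never *pmdp, so nothing can change between check and use.
	 */
	return pte_range(&pmd, addr, end);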

Jason
Will Deacon June 16, 2021, 10:27 a.m. UTC | #9
On Tue, Jun 15, 2021 at 09:42:07PM -0300, Jason Gunthorpe wrote:
> On Tue, Jun 15, 2021 at 10:46:39AM +0100, Will Deacon wrote:
> 
> > Then the compiler can allocate the same register for x and z, but will
> > issue an additional load for y. If a concurrent update takes place to the
> > pmd which transitions from Invalid -> Valid, then it will look as though
> > things went back in time, because z will be stale. We actually hit this
> > on arm64 in practice [1].
> 
> The fact you actually hit this in the real world just seems to confirm
> my thinking that the mm's lax use of the memory model is something
> that deserves addressing.
> 
> Honestly I'm not sure the fix to stick a READ_ONCE in the macros is
> very robust. I prefer the gup_fast pattern of:
> 
>   pmd_t pmd = READ_ONCE(*pmdp);
>   pte_offset_phys(&pmd, addr);
> 
> To correctly force the READ_ONCE under unlocked access and then
> consistently use the single read of the unstable data.
> 
> It seems more maintainable 'hey look at me, I have no locks!' and has
> fewer possibilities for obscure order related bugs to creep in.

Oh, no objection to cleaning this up. It was a "issuing msync(2) causes
data loss argh!" issue, so adding READ_ONCE() to all the macros was the
most straightforward way to solve the immediate problem.

Generally speaking, I think all accesses to live page-tables should be
using READ_ONCE(), as there are also hardware updates from the CPU table
walker to contend with. If that's done in the caller and the macros are
changed to operate on the loaded value, all the better (although this
probably doesn't work so well once you get into rmw operations such as
ptep_test_and_clear_young()).
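
For reference, the rmw case looks roughly like the generic fallback
below (quoted from memory, so treat it as a sketch; arches override it
with real atomics):

	static inline int ptep_test_and_clear_young(struct vm_area_struct *vma,
						    unsigned long address, pte_t *ptep)
	{
		pte_t pte = *ptep;

		if (!pte_young(pte))
			return 0;
		/* the write-back must target the live pte, not a local copy */
		set_pte_at(vma->vm_mm, address, ptep, pte_mkold(pte));
		return 1;
	}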

Will

Patch

diff --git a/mm/page_vma_mapped.c b/mm/page_vma_mapped.c
index 7c0504641fb8..973c3c4e72cc 100644
--- a/mm/page_vma_mapped.c
+++ b/mm/page_vma_mapped.c
@@ -182,13 +182,16 @@  bool page_vma_mapped_walk(struct page_vma_mapped_walk *pvmw)
 	pud = pud_offset(p4d, pvmw->address);
 	if (!pud_present(*pud))
 		return false;
+
 	pvmw->pmd = pmd_offset(pud, pvmw->address);
 	/*
 	 * Make sure the pmd value isn't cached in a register by the
 	 * compiler and used as a stale value after we've observed a
 	 * subsequent update.
 	 */
-	pmde = READ_ONCE(*pvmw->pmd);
+	pmde = pmd_read_atomic(pvmw->pmd);
+	barrier();
+
 	if (pmd_trans_huge(pmde) || is_pmd_migration_entry(pmde)) {
 		pvmw->ptl = pmd_lock(mm, pvmw->pmd);
 		if (likely(pmd_trans_huge(*pvmw->pmd))) {