[RFC,v2,00/12] mm/hugetlb: Make huge_pte_offset() thread-safe for pmd unshare

Message ID 20221118011025.2178986-1-peterx@redhat.com (mailing list archive)

Message

Peter Xu Nov. 18, 2022, 1:10 a.m. UTC
Based on latest mm-unstable (96aa38b69507).

This can be seen as a follow-up series to Mike's recent hugetlb vma lock
series for pmd unsharing, so this series also depends on that one.
Hopefully this series can make it a more complete resolution for pmd
unsharing.

PS: so far no one has strongly ACKed this, so let me keep the RFC tag.  But
I think I'm already more confident in it than in many of the RFCs I posted.

PS2: there are a lot of changes compared to rfcv1, so I'm just not adding
the changelog.  The whole idea is still the same, though.

Problem
=======

huge_pte_offset() is a major helper used by hugetlb code paths to walk a
hugetlb pgtable.  It's used almost everywhere, since the walk is needed even
before taking the pgtable lock.

huge_pte_offset() is always called with the mmap lock held, for either read
or write.

For normal memory types that's enough, since any pgtable removal requires
the mmap write lock (e.g. munmap or mm destruction).  However, hugetlb has
the pmd unshare feature, which means that not only can the pgtable page be
gone from under us while we're walking, but so can the pgtable page we're
walking (even after it's unshared; in this case it can only be the huge PUD
page which contains 512 huge pmd entries, with the vma mapped VM_SHARED).
It's possible because, even though freeing the pgtable page requires the
mmap write lock, that doesn't help us when we're walking another mm's
pgtable, so we're still at risk even if we hold current->mm's mmap lock.

The recent work from Mike on the vma lock can resolve most of this already.
It does so by forbidding pmd unsharing while the lock is held, so there's no
further risk of the pgtable page being freed.  It means that if we can take
the vma lock around all huge_pte_offset() callers, it'll be safe.
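
As a rough illustration of why holding the vma lock makes the walk safe,
here is a minimal single-threaded userspace model (not kernel code).  All
names in it (model_vma, model_vma_tryread(), model_try_unshare()) are made
up for this sketch, and it assumes that the unsharer needs the same lock in
write mode.  A real writer would sleep until readers drain rather than fail.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model of a per-vma reader/writer lock guarding a shared pmd page. */
struct model_vma {
	int  lock_readers;	/* walkers currently holding the lock for read */
	bool lock_writer;	/* an unsharer currently holds it for write */
	bool pmd_shared;	/* the vma still points at the shared pmd page */
};

/* Walker side: take the lock for read before walking the pgtable. */
static bool model_vma_tryread(struct model_vma *vma)
{
	if (vma->lock_writer)
		return false;
	vma->lock_readers++;
	return true;
}

static void model_vma_unlock_read(struct model_vma *vma)
{
	assert(vma->lock_readers > 0);
	vma->lock_readers--;
}

/*
 * Unsharer side: unsharing is only allowed with the lock held for write,
 * which is impossible while any walker holds it for read.
 */
static bool model_try_unshare(struct model_vma *vma)
{
	if (vma->lock_readers > 0 || vma->lock_writer)
		return false;	/* the real lock would make us sleep here */
	vma->lock_writer = true;
	vma->pmd_shared = false;	/* the shared pmd page may now go away */
	vma->lock_writer = false;
	return true;
}
```

In this model, a walker that succeeded in model_vma_tryread() can rely on
pmd_shared staying unchanged until it calls model_vma_unlock_read().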

As of the latest mm-unstable we already do that for a bunch of them, but
there are also quite a few others where we don't, for various reasons.  E.g.
it may not be applicable in not-allowed-to-sleep contexts like FOLL_NOWAIT.
Or, huge_pmd_share() is actually a tricky user of huge_pte_offset(), because
even if we took the vma lock, we're walking another mm's vma!  Taking the
vma lock for all the vmas is probably not going to work.

I have no report at all showing that such a race can be triggered, but
code-wise I don't see anything that stops the race from happening.  This
series is trying to resolve that problem.

Resolution
==========

What this series proposes, besides using the vma lock, is that we can also
use other ways to protect the pgtable page from being freed from under us
in huge_pte_offset() context.  The idea is similar to RCU fast-gup.  Note
that fast-gup was safe against pmd unsharing even before the vma lock,
because fast-gup relies on RCU to protect the walk of any pgtable page,
including another mm's.  So fast-gup will never hit a freed page even if
pmd sharing is possible.

Applying the same idea to huge_pte_offset() means that, with proper RCU
protection, the pte_t* pointer returned from huge_pte_offset() is also
always safe to access and dereference, along with the pgtable lock that is
bound to the pgtable page.  Note that RCU only works to protect pgtables if
MMU_GATHER_RCU_TABLE_FREE=y.  For the rest we need to disable irqs.  Of
course, the whole locking idea is not needed if pmd sharing is not possible
at all, or on private hugetlb mappings.
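
The choice described above can be sketched as a small userspace model (not
the code from patch 4).  The names walker_lock(), walker_unlock() and the
model_* stand-ins are hypothetical; the stand-ins only record state so the
selection logic can be followed, and MODEL_RCU_TABLE_FREE stands in for
CONFIG_MMU_GATHER_RCU_TABLE_FREE.

```c
#include <assert.h>
#include <stdbool.h>

/* Stand-in for CONFIG_MMU_GATHER_RCU_TABLE_FREE; set to 0 for the irq-off path. */
#ifndef MODEL_RCU_TABLE_FREE
#define MODEL_RCU_TABLE_FREE 1
#endif

/* Userspace stand-ins for the kernel primitives; they only record state. */
static int  model_rcu_depth;		/* nesting depth of the RCU read side */
static bool model_irqs_disabled;	/* whether local irqs are "off" */

static void model_rcu_read_lock(void)     { model_rcu_depth++; }
static void model_rcu_read_unlock(void)   { assert(model_rcu_depth > 0); model_rcu_depth--; }
static void model_local_irq_disable(void) { model_irqs_disabled = true; }
static void model_local_irq_enable(void)  { model_irqs_disabled = false; }

/* Hypothetical walker lock: pick a primitive that delays freeing of pgtables. */
static void walker_lock(bool pmd_sharing_possible)
{
	if (!pmd_sharing_possible)
		return;		/* private mapping or no pmd sharing: nothing to do */
	if (MODEL_RCU_TABLE_FREE)
		model_rcu_read_lock();
	else
		model_local_irq_disable();
}

static void walker_unlock(bool pmd_sharing_possible)
{
	if (!pmd_sharing_possible)
		return;
	if (MODEL_RCU_TABLE_FREE)
		model_rcu_read_unlock();
	else
		model_local_irq_enable();
}

/* True while one of the two protections is in effect. */
static bool walker_protected(void)
{
	return model_rcu_depth > 0 || model_irqs_disabled;
}
```

A caller would wrap the huge_pte_offset() walk and every dereference of the
returned pointer between walker_lock() and walker_unlock().  Note the model
does not cover the stale-data aspect: unsharing can still run in parallel.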

Patch Layout
============

Patch 1-3:         cleanup, or dependency of the follow up patches
Patch 4:           the core patch to introduce hugetlb walker lock
Patch 5-11:        each patch resolves one possible race condition
Patch 12:          introduce hugetlb_walk() to replace huge_pte_offset()

Tests
=====

Only lightly tested on hugetlb kselftests including uffd.

Comments welcomed, thanks.

Peter Xu (12):
  mm/hugetlb: Let vma_offset_start() to return start
  mm/hugetlb: Move swap entry handling into vma lock for fault
  mm/hugetlb: Don't wait for migration entry during follow page
  mm/hugetlb: Add pgtable walker lock
  mm/hugetlb: Make userfaultfd_huge_must_wait() safe to pmd unshare
  mm/hugetlb: Protect huge_pmd_share() with walker lock
  mm/hugetlb: Use hugetlb walker lock in hugetlb_follow_page_mask()
  mm/hugetlb: Use hugetlb walker lock in follow_hugetlb_page()
  mm/hugetlb: Use hugetlb walker lock in hugetlb_vma_maps_page()
  mm/hugetlb: Use hugetlb walker lock in walk_hugetlb_range()
  mm/hugetlb: Use hugetlb walker lock in page_vma_mapped_walk()
  mm/hugetlb: Introduce hugetlb_walk()

 arch/s390/mm/gmap.c      |   2 +
 fs/hugetlbfs/inode.c     |  41 +++++++-------
 fs/proc/task_mmu.c       |   2 +
 fs/userfaultfd.c         |  24 ++++++---
 include/linux/hugetlb.h  | 112 +++++++++++++++++++++++++++++++++++++++
 include/linux/pagewalk.h |   9 +++-
 include/linux/rmap.h     |   4 ++
 mm/hugetlb.c             |  97 +++++++++++++++++----------------
 mm/page_vma_mapped.c     |   7 ++-
 mm/pagewalk.c            |   6 +--
 10 files changed, 224 insertions(+), 80 deletions(-)

Comments

David Hildenbrand Nov. 23, 2022, 9:40 a.m. UTC | #1
On 18.11.22 02:10, Peter Xu wrote:
> Based on latest mm-unstable (96aa38b69507).
> 
> This can be seen as a follow-up series to Mike's recent hugetlb vma lock
> series for pmd unsharing, so this series also depends on that one.
> Hopefully this series can make it a more complete resolution for pmd
> unsharing.
> 
> PS: so far no one strongly ACKed this, let me keep the RFC tag.  But I
> think I'm already more confident than many of the RFCs I posted.
> 
> PS2: there're a lot of changes comparing to rfcv1, so I'm just not adding
> the changelog.  The whole idea is still the same, though.
> 
> Problem
> =======
> 
> huge_pte_offset() is a major helper used by hugetlb code paths to walk a
> hugetlb pgtable.  It's used mostly everywhere since that's needed even
> before taking the pgtable lock.
> 
> huge_pte_offset() is always called with mmap lock held with either read or
> write.
> 
> For normal memory types that's far enough, since any pgtable removal
> requires mmap write lock (e.g. munmap or mm destructions).  However hugetlb
> has the pmd unshare feature, it means not only the pgtable page can be gone
> from under us when we're doing a walking, but also the pgtable page we're
> walking (even after unshared, in this case it can only be the huge PUD page
> which contains 512 huge pmd entries, with the vma VM_SHARED mapped).  It's
> possible because even though freeing the pgtable page requires mmap write
> lock, it doesn't help us when we're walking on another mm's pgtable, so
> it's still on risk even if we're with the current->mm's mmap lock.
> 
> The recent work from Mike on vma lock can resolve most of this already.
> It's achieved by forbidden pmd unsharing during the lock being taken, so no
> further risk of the pgtable page being freed.  It means if we can take the
> vma lock around all huge_pte_offset() callers it'll be safe.
> 
> There're already a bunch of them that we did as per the latest mm-unstable,
> but also quite a few others that we didn't for various reasons.  E.g. it
> may not be applicable for not-allow-to-sleep contexts like FOLL_NOWAIT.
> Or, huge_pmd_share() is actually a tricky user of huge_pte_offset(),
> because even if we took the vma lock, we're walking on another mm's vma!
> Taking vma lock for all the vmas are probably not gonna work.
> 
> I have totally no report showing that I can trigger such a race, but from
> code wise I never see anything that stops the race from happening.  This
> series is trying to resolve that problem.

Let me try understand the basic problem first:

hugetlb walks page tables semi-lockless: while we hold the mmap lock, we 
don't grab the page table locks. That's very hugetlb specific handling 
and I assume hugetlb uses different mechanisms to sync against 
MADV_DONTNEED, concurrent page faults... but that's no news. hugetlb is
weird in many ways :)

So, IIUC, you want a mechanism to synchronize against PMD unsharing. 
Can't we use some very basic locking for that?

Using RCU / disabling local irqs seems a bit excessive because we *are* 
holding the mmap lock and only care about concurrent unsharing
Peter Xu Nov. 23, 2022, 3:09 p.m. UTC | #2
Hi, David,

Thanks for taking a look.

On Wed, Nov 23, 2022 at 10:40:40AM +0100, David Hildenbrand wrote:
> On 18.11.22 02:10, Peter Xu wrote:
> > Based on latest mm-unstable (96aa38b69507).
> > 
> > This can be seen as a follow-up series to Mike's recent hugetlb vma lock
> > series for pmd unsharing, so this series also depends on that one.
> > Hopefully this series can make it a more complete resolution for pmd
> > unsharing.
> > 
> > PS: so far no one strongly ACKed this, let me keep the RFC tag.  But I
> > think I'm already more confident than many of the RFCs I posted.
> > 
> > PS2: there're a lot of changes comparing to rfcv1, so I'm just not adding
> > the changelog.  The whole idea is still the same, though.
> > 
> > Problem
> > =======
> > 
> > huge_pte_offset() is a major helper used by hugetlb code paths to walk a
> > hugetlb pgtable.  It's used mostly everywhere since that's needed even
> > before taking the pgtable lock.
> > 
> > huge_pte_offset() is always called with mmap lock held with either read or
> > write.
> > 
> > For normal memory types that's far enough, since any pgtable removal
> > requires mmap write lock (e.g. munmap or mm destructions).  However hugetlb
> > has the pmd unshare feature, it means not only the pgtable page can be gone
> > from under us when we're doing a walking, but also the pgtable page we're
> > walking (even after unshared, in this case it can only be the huge PUD page
> > which contains 512 huge pmd entries, with the vma VM_SHARED mapped).  It's
> > possible because even though freeing the pgtable page requires mmap write
> > lock, it doesn't help us when we're walking on another mm's pgtable, so
> > it's still on risk even if we're with the current->mm's mmap lock.
> > 
> > The recent work from Mike on vma lock can resolve most of this already.
> > It's achieved by forbidden pmd unsharing during the lock being taken, so no
> > further risk of the pgtable page being freed.  It means if we can take the
> > vma lock around all huge_pte_offset() callers it'll be safe.

[1]

> > 
> > There're already a bunch of them that we did as per the latest mm-unstable,
> > but also quite a few others that we didn't for various reasons.  E.g. it
> > may not be applicable for not-allow-to-sleep contexts like FOLL_NOWAIT.
> > Or, huge_pmd_share() is actually a tricky user of huge_pte_offset(),

[2]

> > because even if we took the vma lock, we're walking on another mm's vma!
> > Taking vma lock for all the vmas are probably not gonna work.
> > 
> > I have totally no report showing that I can trigger such a race, but from
> > code wise I never see anything that stops the race from happening.  This
> > series is trying to resolve that problem.
> 
> Let me try understand the basic problem first:
> 
> hugetlb walks page tables semi-lockless: while we hold the mmap lock, we
> don't grab the page table locks. That's very hugetlb specific handling and I
> assume hugetlb uses different mechanisms to sync against MADV_DONTNEED,
> concurrent page faults... but that's no news. hugetlb is weird in many ways
> :)
> 
> So, IIUC, you want a mechanism to synchronize against PMD unsharing. Can't
> we use some very basic locking for that?

Yes we can in most cases.  Please refer to paragraph [1] above, where I
referred to Mike's recent work on the vma lock.  That's the basic locking we
need so far to protect against pmd unsharing.  I'll attach the link too in
the next post, which is here:

https://lore.kernel.org/r/20220914221810.95771-1-mike.kravetz@oracle.com

> 
> Using RCU / disabling local irqs seems a bit excessive because we *are*
> holding the mmap lock and only care about concurrent unsharing

The series wants to address the places where the vma lock is not easy to
take.  It originated when I was reading Mike's other patch; I forgot why I
was doing that, but I noticed there are some code paths where we may not
want to take a sleepable lock, e.g. in the follow page code.

The other one is huge_pmd_share() where we may have the mmap lock for
current mm but we're fundamentally walking another mm.  It'll be tricky to
take a sleepable lock in such condition too.

I mentioned these cases in the other paragraph above [2].  Let me try to
expand that in my next post too.

It's debatable whether all the remaining places can only work with either
RCU or irqs disabled, but the idea is that it should at least speed up those
paths where we can.  Here, irqoff might be a bit heavy, but the RCU lock
should always be superior to the vma lock when possible.  The trade-off is
that we may still see stale pgtable data (since unsharing can still happen
in parallel), while that can be completely avoided when we take the vma
lock.

Thanks,
Mike Kravetz Nov. 23, 2022, 6:21 p.m. UTC | #3
On 11/23/22 10:09, Peter Xu wrote:
> On Wed, Nov 23, 2022 at 10:40:40AM +0100, David Hildenbrand wrote:
> > Let me try understand the basic problem first:
> > 
> > hugetlb walks page tables semi-lockless: while we hold the mmap lock, we
> > don't grab the page table locks. That's very hugetlb specific handling and I
> > assume hugetlb uses different mechanisms to sync against MADV_DONTNEED,
> > concurrent page faults... but that's no news. hugetlb is weird in many ways
> > :)
> > 
> > So, IIUC, you want a mechanism to synchronize against PMD unsharing. Can't
> > we use some very basic locking for that?
> 
> Yes we can in most cases.  Please refer to above paragraph [1] where I
> referred Mike's recent work on vma lock.  That's the basic locking we need
> so far to protect pmd unsharing.  I'll attach the link too in the next
> post, which is here:
> 
> https://lore.kernel.org/r/20220914221810.95771-1-mike.kravetz@oracle.com
> 
> > 
> > Using RCU / disabling local irqs seems a bit excessive because we *are*
> > holding the mmap lock and only care about concurrent unsharing
> 
> The series wanted to address where the vma lock is not easy to take.  It
> originates from when I was reading Mike's other patch, I forgot why I did
> that but I just noticed there's some code path that we may not want to take
> a sleepable lock, e.g. in follow page code.

Yes, it was the patch suggested by David,

https://lore.kernel.org/linux-mm/20221030225825.40872-1-mike.kravetz@oracle.com/

The issue was that FOLL_NOWAIT could be passed into follow_page_mask.  If so,
then we do not want to potentially sleep on the mutex.

Since you both are on this thread, I thought of/noticed a related issue.  In
follow_hugetlb_page, it looks like we can call hugetlb_fault if FOLL_NOWAIT
is set.  hugetlb_fault certainly has the potential for sleeping.  Is this also
a similar issue?
Peter Xu Nov. 23, 2022, 6:56 p.m. UTC | #4
On Wed, Nov 23, 2022 at 10:21:30AM -0800, Mike Kravetz wrote:
> On 11/23/22 10:09, Peter Xu wrote:
> > On Wed, Nov 23, 2022 at 10:40:40AM +0100, David Hildenbrand wrote:
> > > Let me try understand the basic problem first:
> > > 
> > > hugetlb walks page tables semi-lockless: while we hold the mmap lock, we
> > > don't grab the page table locks. That's very hugetlb specific handling and I
> > > assume hugetlb uses different mechanisms to sync against MADV_DONTNEED,
> > > concurrent page faults... but that's no news. hugetlb is weird in many ways
> > > :)
> > > 
> > > So, IIUC, you want a mechanism to synchronize against PMD unsharing. Can't
> > > we use some very basic locking for that?
> > 
> > Yes we can in most cases.  Please refer to above paragraph [1] where I
> > referred Mike's recent work on vma lock.  That's the basic locking we need
> > so far to protect pmd unsharing.  I'll attach the link too in the next
> > post, which is here:
> > 
> > https://lore.kernel.org/r/20220914221810.95771-1-mike.kravetz@oracle.com
> > 
> > > 
> > > Using RCU / disabling local irqs seems a bit excessive because we *are*
> > > holding the mmap lock and only care about concurrent unsharing
> > 
> > The series wanted to address where the vma lock is not easy to take.  It
> > originates from when I was reading Mike's other patch, I forgot why I did
> > that but I just noticed there's some code path that we may not want to take
> > a sleepable lock, e.g. in follow page code.
> 
> Yes, it was the patch suggested by David,
> 
> https://lore.kernel.org/linux-mm/20221030225825.40872-1-mike.kravetz@oracle.com/
> 
> The issue was that FOLL_NOWAIT could be passed into follow_page_mask.  If so,
> then we do not want potentially sleep on the mutex.
> 
> Since you both are on this thread, I thought of/noticed a related issue.  In
> follow_hugetlb_page, it looks like we can call hugetlb_fault if FOLL_NOWAIT
> is set.  hugetlb_fault certainly has the potential for sleeping.  Is this also
> a similar issue?

Yeah, maybe the clean way to do this is that when FAULT_FLAG_RETRY_NOWAIT is
set we should always try not to sleep at all.

But maybe that's also not urgently needed.  So far I don't see any real
non-sleepable caller of it - the only one (kvm) can actually sleep.

It's definitely not wanted, as kvm only attaches NOWAIT for an async fault,
so ideally any wait should be offloaded into async threads.  Now, with the
hugetlb code being able to sleep with NOWAIT, the waiting time will be
accounted to the real fault time of the vcpu and partly invalidate async
page fault handling.  That said, it also means no immediate fault would
trigger either.  It's just that for the pmd unshare we can start to at least
use the non-sleeping version of the locks.

Now I'm more concerned with huge_pmd_share(), which seems to have no good
option but only the RCU approach.

One other thing I noticed is I cannot quickly figure out whether
follow_hugetlb_page() is needed anymore, since follow_page_mask() seems to
be also fine with walking hugetlb pgtables.

follow_hugetlb_page() can be traced back to the git initial commit, I had a
feeling that the old version of follow_page_mask() doesn't support hugetlb,
but now after it's supported maybe we can drop follow_hugetlb_page() as a
whole?
David Hildenbrand Nov. 23, 2022, 7:31 p.m. UTC | #5
On 23.11.22 19:56, Peter Xu wrote:
> On Wed, Nov 23, 2022 at 10:21:30AM -0800, Mike Kravetz wrote:
>> On 11/23/22 10:09, Peter Xu wrote:
>>> On Wed, Nov 23, 2022 at 10:40:40AM +0100, David Hildenbrand wrote:
>>>> Let me try understand the basic problem first:
>>>>
>>>> hugetlb walks page tables semi-lockless: while we hold the mmap lock, we
>>>> don't grab the page table locks. That's very hugetlb specific handling and I
>>>> assume hugetlb uses different mechanisms to sync against MADV_DONTNEED,
>>>> concurrent page faults... but that's no news. hugetlb is weird in many ways
>>>> :)
>>>>
>>>> So, IIUC, you want a mechanism to synchronize against PMD unsharing. Can't
>>>> we use some very basic locking for that?
>>>
>>> Yes we can in most cases.  Please refer to above paragraph [1] where I
>>> referred Mike's recent work on vma lock.  That's the basic locking we need
>>> so far to protect pmd unsharing.  I'll attach the link too in the next
>>> post, which is here:
>>>
>>> https://lore.kernel.org/r/20220914221810.95771-1-mike.kravetz@oracle.com
>>>
>>>>
>>>> Using RCU / disabling local irqs seems a bit excessive because we *are*
>>>> holding the mmap lock and only care about concurrent unsharing
>>>
>>> The series wanted to address where the vma lock is not easy to take.  It
>>> originates from when I was reading Mike's other patch, I forgot why I did
>>> that but I just noticed there's some code path that we may not want to take
>>> a sleepable lock, e.g. in follow page code.
>>
>> Yes, it was the patch suggested by David,
>>
>> https://lore.kernel.org/linux-mm/20221030225825.40872-1-mike.kravetz@oracle.com/
>>
>> The issue was that FOLL_NOWAIT could be passed into follow_page_mask.  If so,
>> then we do not want potentially sleep on the mutex.
>>
>> Since you both are on this thread, I thought of/noticed a related issue.  In
>> follow_hugetlb_page, it looks like we can call hugetlb_fault if FOLL_NOWAIT
>> is set.  hugetlb_fault certainly has the potential for sleeping.  Is this also
>> a similar issue?
> 
> Yeah maybe the clean way to do this is when FAULT_FLAG_RETRY_NOWAIT is set
> we should always try to not sleep at all.

hva_to_pfn_slow() that sets FOLL_NOWAIT calls get_user_pages_unlocked(), 
which will just do a straight mmap_read_lock().

The interpretation of FOLL_NOWAIT should not be "don't take any 
sleepable locks" but instead more like "don't wait for a page to get 
swapped in".

#define FOLL_NOWAIT  0x20  /* if a disk transfer is needed, start the IO


I did not read the full replies yet (sorry, busy hacking :) ) but *any* 
code path that already takes the mmap_read_lock() can just take whatever 
other lock we want -- IMHO. No need to over-complicate our code trying 
to avoid locks in that case.
David Hildenbrand Nov. 25, 2022, 9:43 a.m. UTC | #6
On 23.11.22 16:09, Peter Xu wrote:
> Hi, David,
> 
> Thanks for taking a look.
> 
> On Wed, Nov 23, 2022 at 10:40:40AM +0100, David Hildenbrand wrote:
>> On 18.11.22 02:10, Peter Xu wrote:
>>> Based on latest mm-unstable (96aa38b69507).
>>>
>>> This can be seen as a follow-up series to Mike's recent hugetlb vma lock
>>> series for pmd unsharing, so this series also depends on that one.
>>> Hopefully this series can make it a more complete resolution for pmd
>>> unsharing.
>>>
>>> PS: so far no one strongly ACKed this, let me keep the RFC tag.  But I
>>> think I'm already more confident than many of the RFCs I posted.
>>>
>>> PS2: there're a lot of changes comparing to rfcv1, so I'm just not adding
>>> the changelog.  The whole idea is still the same, though.
>>>
>>> Problem
>>> =======
>>>
>>> huge_pte_offset() is a major helper used by hugetlb code paths to walk a
>>> hugetlb pgtable.  It's used mostly everywhere since that's needed even
>>> before taking the pgtable lock.
>>>
>>> huge_pte_offset() is always called with mmap lock held with either read or
>>> write.
>>>
>>> For normal memory types that's far enough, since any pgtable removal
>>> requires mmap write lock (e.g. munmap or mm destructions).  However hugetlb
>>> has the pmd unshare feature, it means not only the pgtable page can be gone
>>> from under us when we're doing a walking, but also the pgtable page we're
>>> walking (even after unshared, in this case it can only be the huge PUD page
>>> which contains 512 huge pmd entries, with the vma VM_SHARED mapped).  It's
>>> possible because even though freeing the pgtable page requires mmap write
>>> lock, it doesn't help us when we're walking on another mm's pgtable, so
>>> it's still on risk even if we're with the current->mm's mmap lock.
>>>
>>> The recent work from Mike on vma lock can resolve most of this already.
>>> It's achieved by forbidden pmd unsharing during the lock being taken, so no
>>> further risk of the pgtable page being freed.  It means if we can take the
>>> vma lock around all huge_pte_offset() callers it'll be safe.
> 
> [1]
> 
>>>
>>> There're already a bunch of them that we did as per the latest mm-unstable,
>>> but also quite a few others that we didn't for various reasons.  E.g. it
>>> may not be applicable for not-allow-to-sleep contexts like FOLL_NOWAIT.
>>> Or, huge_pmd_share() is actually a tricky user of huge_pte_offset(),
> 
> [2]
> 
>>> because even if we took the vma lock, we're walking on another mm's vma!
>>> Taking vma lock for all the vmas are probably not gonna work.
>>>
>>> I have totally no report showing that I can trigger such a race, but from
>>> code wise I never see anything that stops the race from happening.  This
>>> series is trying to resolve that problem.
>>
>> Let me try understand the basic problem first:
>>
>> hugetlb walks page tables semi-lockless: while we hold the mmap lock, we
>> don't grab the page table locks. That's very hugetlb specific handling and I
>> assume hugetlb uses different mechanisms to sync against MADV_DONTNEED,
>> concurrent page faults... but that's no news. hugetlb is weird in many ways
>> :)
>>
>> So, IIUC, you want a mechanism to synchronize against PMD unsharing. Can't
>> we use some very basic locking for that?
> 

Sorry for the delay, finally found time to look into this again. :)

> Yes we can in most cases.  Please refer to above paragraph [1] where I
> referred Mike's recent work on vma lock.  That's the basic locking we need
> so far to protect pmd unsharing.  I'll attach the link too in the next
> post, which is here:
> 
> https://lore.kernel.org/r/20220914221810.95771-1-mike.kravetz@oracle.com
> 
>>
>> Using RCU / disabling local irqs seems a bit excessive because we *are*
>> holding the mmap lock and only care about concurrent unsharing
> 
> The series wanted to address where the vma lock is not easy to take.  It
> originates from when I was reading Mike's other patch, I forgot why I did
> that but I just noticed there's some code path that we may not want to take
> a sleepable lock, e.g. in follow page code.

As I stated, whenever we already take the (expensive) mmap lock, the 
least thing we should have to worry about is taking another sleepable 
lock IMHO. Please correct me if I'm wrong.

> 
> The other one is huge_pmd_share() where we may have the mmap lock for
> current mm but we're fundamentally walking another mm.  It'll be tricky to
> take a sleepable lock in such condition too.

We're already grabbing the i_mmap_lock_read(mapping), and the VMAs
should be stable in that interval tree IIUC. So I wonder if taking VMA
locks would really be problematic here. Anything obvious I am missing?

> 
> I mentioned these cases in the other paragraph above [2].  Let me try to
> expand that in my next post too.

That would be great. I yet have to dedicate more time to understand all 
that complexity.

> 
> It's debatable whether all the rest places can only work with either RCU or
> irq disabled, but the idea is at least it should speed up those paths when
> we still can.  Here, irqoff might be a bit heavy, but RCU lock should be
> always superior to vma lock when possible, the payoff is we may still see
> stale pgtable data (since unsharing can still happen in parallel), while
> that can be completely avoided when we take the vma lock.

IRQ disabled is frowned upon by RT folks, that's why I'd like to 
understand if this is really required. Also, adding RCU to an already 
complex mechanism doesn't necessarily make it easier :)

Let me dedicate some time today to dig into some details.
Peter Xu Nov. 25, 2022, 1:55 p.m. UTC | #7
On Fri, Nov 25, 2022 at 10:43:43AM +0100, David Hildenbrand wrote:
> On 23.11.22 16:09, Peter Xu wrote:
> > Hi, David,
> > 
> > Thanks for taking a look.
> > 
> > On Wed, Nov 23, 2022 at 10:40:40AM +0100, David Hildenbrand wrote:
> > > On 18.11.22 02:10, Peter Xu wrote:
> > > > Based on latest mm-unstable (96aa38b69507).
> > > > 
> > > > This can be seen as a follow-up series to Mike's recent hugetlb vma lock
> > > > series for pmd unsharing, so this series also depends on that one.
> > > > Hopefully this series can make it a more complete resolution for pmd
> > > > unsharing.
> > > > 
> > > > PS: so far no one strongly ACKed this, let me keep the RFC tag.  But I
> > > > think I'm already more confident than many of the RFCs I posted.
> > > > 
> > > > PS2: there're a lot of changes comparing to rfcv1, so I'm just not adding
> > > > the changelog.  The whole idea is still the same, though.
> > > > 
> > > > Problem
> > > > =======
> > > > 
> > > > huge_pte_offset() is a major helper used by hugetlb code paths to walk a
> > > > hugetlb pgtable.  It's used mostly everywhere since that's needed even
> > > > before taking the pgtable lock.
> > > > 
> > > > huge_pte_offset() is always called with mmap lock held with either read or
> > > > write.
> > > > 
> > > > For normal memory types that's far enough, since any pgtable removal
> > > > requires mmap write lock (e.g. munmap or mm destructions).  However hugetlb
> > > > has the pmd unshare feature, it means not only the pgtable page can be gone
> > > > from under us when we're doing a walking, but also the pgtable page we're
> > > > walking (even after unshared, in this case it can only be the huge PUD page
> > > > which contains 512 huge pmd entries, with the vma VM_SHARED mapped).  It's
> > > > possible because even though freeing the pgtable page requires mmap write
> > > > lock, it doesn't help us when we're walking on another mm's pgtable, so
> > > > it's still on risk even if we're with the current->mm's mmap lock.
> > > > 
> > > > The recent work from Mike on vma lock can resolve most of this already.
> > > > It's achieved by forbidden pmd unsharing during the lock being taken, so no
> > > > further risk of the pgtable page being freed.  It means if we can take the
> > > > vma lock around all huge_pte_offset() callers it'll be safe.
> > 
> > [1]
> > 
> > > > 
> > > > There're already a bunch of them that we did as per the latest mm-unstable,
> > > > but also quite a few others that we didn't for various reasons.  E.g. it
> > > > may not be applicable for not-allow-to-sleep contexts like FOLL_NOWAIT.
> > > > Or, huge_pmd_share() is actually a tricky user of huge_pte_offset(),
> > 
> > [2]
> > 
> > > > because even if we took the vma lock, we're walking on another mm's vma!
> > > > Taking vma lock for all the vmas are probably not gonna work.
> > > > 
> > > > I have totally no report showing that I can trigger such a race, but from
> > > > code wise I never see anything that stops the race from happening.  This
> > > > series is trying to resolve that problem.
> > > 
> > > Let me try understand the basic problem first:
> > > 
> > > hugetlb walks page tables semi-lockless: while we hold the mmap lock, we
> > > don't grab the page table locks. That's very hugetlb specific handling and I
> > > assume hugetlb uses different mechanisms to sync against MADV_DONTNEED,
> > > > concurrent page faults... but that's no news. hugetlb is weird in many ways
> > > :)
> > > 
> > > So, IIUC, you want a mechanism to synchronize against PMD unsharing. Can't
> > > we use some very basic locking for that?
> > 
> 
> Sorry for the delay, finally found time to look into this again. :)
> 
> > Yes we can in most cases.  Please refer to above paragraph [1] where I
> > referred Mike's recent work on vma lock.  That's the basic locking we need
> > so far to protect pmd unsharing.  I'll attach the link too in the next
> > post, which is here:
> > 
> > https://lore.kernel.org/r/20220914221810.95771-1-mike.kravetz@oracle.com
> > 
> > > 
> > > Using RCU / disabling local irqs seems a bit excessive because we *are*
> > > holding the mmap lock and only care about concurrent unsharing
> > 
> > The series wanted to address where the vma lock is not easy to take.  It
> > originates from when I was reading Mike's other patch, I forgot why I did
> > that but I just noticed there's some code path that we may not want to take
> > a sleepable lock, e.g. in follow page code.
> 
> As I stated, whenever we already take the (expensive) mmap lock, the least
> thing we should have to worry about is taking another sleepable lock IMHO.
> Please correct me if I'm wrong.

Yes, that's not a major concern.  But I still think the follow page path
should sleep as little as possible.  For example, non-hugetlb doesn't sleep
now.  If we can do it lockless with the RCU lock, then why not?

The same applies to patch 3 of this patchset - I would think it beneficial
to have even without a new lock type being introduced, because it still
makes the follow page path cleaner, and makes hugetlb and non-hugetlb match.

> 
> > 
> > The other one is huge_pmd_share() where we may have the mmap lock for
> > current mm but we're fundamentally walking another mm.  It'll be tricky to
> > take a sleepable lock in such condition too.
> 
> We're already grabbing the i_mmap_lock_read(mapping), and the VMAs
> should be stable in that interval tree IIUC. So I wonder if taking VMA locks
> would really be problematic here. Anything obvious I am missing?

No, I think you're right, and I found that myself just yesterday when I was
writing a reproducer.  huge_pmd_share() is safe here, so at least that
patch in this patchset can be dropped.

> 
> > 
> > I mentioned these cases in the other paragraph above [2].  Let me try to
> > expand that in my next post too.
> 
> That would be great. I have yet to dedicate more time to understand all that
> complexity.
> 
> > 
> > It's debatable whether all the remaining places can only work with either
> > RCU or irqs disabled, but the idea is that it should at least speed up
> > those paths where we still can.  Here, irqoff might be a bit heavy, but
> > the RCU lock should always be superior to the vma lock when possible.  The
> > trade-off is that we may still see stale pgtable data (since unsharing
> > can still happen in parallel), while that can be completely avoided when
> > we take the vma lock.
> 
> IRQ disabled is frowned upon by RT folks, that's why I'd like to understand
> if this is really required. Also, adding RCU to an already complex mechanism
> doesn't necessarily make it easier :)

I've posted it before, let me copy that over:

arch/arm64/Kconfig:     select MMU_GATHER_RCU_TABLE_FREE     
arch/x86/Kconfig:       select MMU_GATHER_RCU_TABLE_FREE        if PARAVIRT

arch/arm64/Kconfig:     select ARCH_WANT_HUGE_PMD_SHARE if ARM64_4K_PAGES || (ARM64_16K_PAGES && !ARM64_VA_BITS_36)
arch/riscv/Kconfig:     select ARCH_WANT_HUGE_PMD_SHARE if 64BIT
arch/x86/Kconfig:       select ARCH_WANT_HUGE_PMD_SHARE

The irqoff thing is definitely unfortunate, but it only happens on riscv,
or on x86 when !PARAVIRT.  I would suppose PARAVIRT is a common config, so
it's mostly a riscv-only problem.  If someday riscv can support rcu table
free, we can remove it and enforce rcu table free for pmd sharing if
wanted.
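
To make the Kconfig split above a bit more concrete, here is a rough
userspace sketch of how a walker lock could pick between the two modes.
The names hugetlb_walker_lock() and hugetlb_walker_unlock() are assumptions
for illustration only, hugetlb_walker_locked() is modelled as a plain
counter check, and the rcu/irq functions are stubs standing in for the real
kernel APIs of the same names:

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Userspace stand-ins so the sketch builds outside the kernel.  They only
 * track state with counters; the real functions are kernel APIs.
 */
static int rcu_depth, irqoff_depth;

static inline void rcu_read_lock(void)     { rcu_depth++; }
static inline void rcu_read_unlock(void)   { rcu_depth--; }
static inline void local_irq_disable(void) { irqoff_depth++; }
static inline void local_irq_enable(void)  { irqoff_depth--; }

/* Modelled as a counter check for this sketch only. */
static inline bool hugetlb_walker_locked(void)
{
	return rcu_depth > 0 || irqoff_depth > 0;
}

/* Hypothetical helper: enter a section where pgtable pages stay allocated. */
static inline void hugetlb_walker_lock(void)
{
#ifdef CONFIG_MMU_GATHER_RCU_TABLE_FREE
	rcu_read_lock();	/* pgtable pages are freed via RCU */
#else
	local_irq_disable();	/* assumes freeing waits on an IPI to this CPU */
#endif
}

/* Hypothetical helper: leave the section entered above. */
static inline void hugetlb_walker_unlock(void)
{
	assert(hugetlb_walker_locked());	/* unbalanced unlock is a bug */
#ifdef CONFIG_MMU_GATHER_RCU_TABLE_FREE
	rcu_read_unlock();
#else
	local_irq_enable();
#endif
}
```

Built without -DCONFIG_MMU_GATHER_RCU_TABLE_FREE this takes the irqoff
branch, which is the riscv (and x86 !PARAVIRT) case discussed above.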

That said, the "new lock" part is definitely not the core of the patchset
(even if it might read like that..).  The core part is to first identify
the issue with the overlooked usage of huge_pte_offset() and how to make it
always safe.

Let me also update a few more things while at it.

Since it'll be very hard to reproduce the race discussed in this series, I
didn't try to write a reproducer until yesterday.  It needs some kernel
delays to trigger; only with those can I trigger a use-after-free.  So I
think the problem is confirmed.  What remains is how to resolve it, and
whether the vma lock is good enough.

One other thing to mention is that I overlooked one important thing about
the huge pgtable lock: it is actually not protected by RCU (as it's part of
pmd_free()).  IOW, a hugetlb walker that wants to do huge_pte_lock() after
huge_pte_offset() won't be guarded by RCU, hence the new lock won't easily
work for it.  That's another very unfortunate thing.  I'm not sure whether
it's okay to move that part into the page free rcu callback, but it
definitely needs more thought as pmd_free() is an arch API.

Debatably, irqoff might work if the arch needs an IPI for tlb flush, so
that irqoff protects not only the pgtable page but also the
huge_pte_lock().  But I don't think it's wise to extend irqoff to generic
archs, and "whether tlb flush requires IPI" is arch-specific, which makes
this overcomplicated and unnecessary.

So firstly, I think I need to rework the patchset so that more places take
the vma lock (where the pgtable lock is needed).  I tend to keep the RCU
lock because it's lighter, at least for !riscv.  If so, the rule for using
huge_pte_offset() will look like this:

/*
 * huge_pte_offset(): Walk the hugetlb pgtable until the last level PTE.
 * Returns the pte_t* if found, or NULL if the address is not mapped.
 *
 * IMPORTANT: we should normally not directly call this function, instead
 * this is only a common interface to implement arch-specific walkers.
 * Please consider using the hugetlb_walk() helper to make sure the
 * correct locking is satisfied.
 *
 * Since this function will walk all the pgtable pages (including not only
 * high-level pgtable page, but also PUD entry that can be unshared
 * concurrently for VM_SHARED), the caller of this function should be
 * responsible for its thread safety.  One can follow this rule:
 *
 *  (1) For private mappings: pmd unsharing is not possible, so it'll
 *      always be safe if we're with the mmap sem for either read or write.
 *      This is normally always the case, IOW we don't need to do anything
 *      special.
 *
 *  (2) For shared mappings: pmd unsharing is possible (so the PUD-ranged
 *      pgtable page can go away from under us!  It can be done by a pmd
 *      unshare with a follow up munmap() on the other process), then we
 *      need either:
 *
 *     (2.1) hugetlb vma lock read or write held, to make sure pmd unshare
 *           won't happen upon the range (it also makes sure the pte_t we
 *           read is the right and stable one), or,
 *
 *     (2.2) hugetlb mapping i_mmap_rwsem lock held read or write, to make
 *           sure even if unshare happened the racy unmap() will wait until
 *           i_mmap_rwsem is released, or,
 *
 *     (2.3) pgtable walker lock, to make sure that even if pmd unsharing
 *           happened, the old shared PUD page won't get freed from under
 *           us.  In this case, the pteval can be obsolete, but at least
 *           it's still always safe to access the page (e.g.,
 *           de-referencing pte_t* would not cause use-after-free).  Note,
 *           it's not safe to access the pgtable lock with this lock.  If
 *           huge_pte_lock() is needed, look for (2.1) or (2.2).
 *
 * Option (2.3) is the lightest, but it means pmd unshare can still happen
 * so the pte got from the pgtable walk can be stale; also, only page data
 * is safe to access, not anything else (e.g. the pgtable lock).  Option
 * (2.1) is the safest, which guarantees pte stability until the vma lock
 * is released, but it is heavier than (2.3).
 */

Where hugetlb_walk() will be a new wrapper I'll introduce just to assert
the lock protections:

static inline pte_t *
hugetlb_walk(struct vm_area_struct *vma, unsigned long addr, unsigned long sz)
{
#ifdef CONFIG_LOCKDEP
	/* lockdep_is_held() only defined with CONFIG_LOCKDEP */
	if (vma->vm_flags & VM_MAYSHARE) {
		struct hugetlb_vma_lock *vma_lock = vma->vm_private_data;

		/*
		 * Here, taking any of the three locks will guarantee
		 * safety of hugetlb pgtable walk.
		 *
		 * For more information on the locking rules of using
		 * huge_pte_offset(), please see the comment above
		 * huge_pte_offset() in the header file.
		 */
		WARN_ON_ONCE(!hugetlb_walker_locked() &&
			     !lockdep_is_held(&vma_lock->rw_sema) &&
			     !lockdep_is_held(
				 &vma->vm_file->f_mapping->i_mmap_rwsem));
	}
#endif

	return huge_pte_offset(vma->vm_mm, addr, sz);
}
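
For readers who want to play with the rule outside the kernel, the locking
rules (1) and (2.1)-(2.3) above can be modelled as two small predicates.
This is only a userspace sketch: the HELD_* flags and the function names
are made up for illustration and are not kernel identifiers, and the mmap
lock is assumed to be held in all cases.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical flags for this model only. */
enum held_lock {
	HELD_VMA_LOCK     = 1 << 0,	/* (2.1) hugetlb vma lock, read or write */
	HELD_I_MMAP_RWSEM = 1 << 1,	/* (2.2) i_mmap_rwsem, read or write */
	HELD_WALKER_LOCK  = 1 << 2,	/* (2.3) pgtable walker lock */
};

/* Is it safe to call huge_pte_offset() and dereference the result? */
static bool walk_is_safe(bool shared, unsigned int held)
{
	if (!shared)
		return true;	/* (1): pmd unsharing is not possible */
	return (held & (HELD_VMA_LOCK | HELD_I_MMAP_RWSEM |
			HELD_WALKER_LOCK)) != 0;
}

/* May the walker go on to take huge_pte_lock() afterwards? */
static bool may_take_pgtable_lock(bool shared, unsigned int held)
{
	if (!shared)
		return true;
	/* The walker lock alone does not keep the pgtable lock valid. */
	return (held & (HELD_VMA_LOCK | HELD_I_MMAP_RWSEM)) != 0;
}

/* Model of the wrapper: abort where the kernel version would WARN. */
static void model_hugetlb_walk(bool shared, unsigned int held)
{
	assert(walk_is_safe(shared, held));
	/* the real wrapper would call huge_pte_offset() here */
}
```

For example, model_hugetlb_walk(true, HELD_WALKER_LOCK) passes the check,
while may_take_pgtable_lock(true, HELD_WALKER_LOCK) is false, which is the
case where the walker has to fall back to (2.1) or (2.2).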