
[18/41] mm/khugepaged: write-lock VMA while collapsing a huge page

Message ID 20230109205336.3665937-19-surenb@google.com (mailing list archive)
State New
Series Per-VMA locks

Commit Message

Suren Baghdasaryan Jan. 9, 2023, 8:53 p.m. UTC
Protect VMA from concurrent page fault handler while collapsing a huge
page. Page fault handler needs a stable PMD to use PTL and relies on
per-VMA lock to prevent concurrent PMD changes. pmdp_collapse_flush(),
set_huge_pmd() and collapse_and_free_pmd() can modify a PMD, which will
not be detected by a page fault handler without proper locking.

Signed-off-by: Suren Baghdasaryan <surenb@google.com>
---
 mm/khugepaged.c | 5 +++++
 1 file changed, 5 insertions(+)

Comments

Michal Hocko Jan. 17, 2023, 3:25 p.m. UTC | #1
On Mon 09-01-23 12:53:13, Suren Baghdasaryan wrote:
> Protect VMA from concurrent page fault handler while collapsing a huge
> page. Page fault handler needs a stable PMD to use PTL and relies on
> per-VMA lock to prevent concurrent PMD changes. pmdp_collapse_flush(),
> set_huge_pmd() and collapse_and_free_pmd() can modify a PMD, which will
> not be detected by a page fault handler without proper locking.

I am struggling with this changelog. Maybe because my recollection of
the THP collapsing subtleties is weak. But aren't you just trying to say
that the current #PF handling and THP collapsing need to be mutually
exclusive currently so in order to keep that assumption you have to mark
the vma write locked?

Also it is not really clear to me how that handles other vmas which can
share the same thp?
Jann Horn Jan. 17, 2023, 8:28 p.m. UTC | #2
On Tue, Jan 17, 2023 at 4:25 PM Michal Hocko <mhocko@suse.com> wrote:
> On Mon 09-01-23 12:53:13, Suren Baghdasaryan wrote:
> > Protect VMA from concurrent page fault handler while collapsing a huge
> > page. Page fault handler needs a stable PMD to use PTL and relies on
> > per-VMA lock to prevent concurrent PMD changes. pmdp_collapse_flush(),
> > set_huge_pmd() and collapse_and_free_pmd() can modify a PMD, which will
> > not be detected by a page fault handler without proper locking.
>
> I am struggling with this changelog. Maybe because my recollection of
> the THP collapsing subtleties is weak. But aren't you just trying to say
> that the current #PF handling and THP collapsing need to be mutually
> exclusive currently so in order to keep that assumption you have to mark
> the vma write locked?
>
> Also it is not really clear to me how that handles other vmas which can
> share the same thp?

It's not about the hugepage itself, it's about how the THP collapse
operation frees page tables.

Before this series, page tables can be walked under any one of the
mmap lock, the mapping lock, and the anon_vma lock; so when khugepaged
unlinks and frees page tables, it must ensure that all of those either
are locked or don't exist. This series adds a fourth lock under which
page tables can be traversed, and so khugepaged must also lock out that one.

There is a codepath in khugepaged that iterates through all mappings
of a file to zap page tables (retract_page_tables()), which locks each
visited mm with mmap_write_trylock() and now also does
vma_write_lock().
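
The rule described above can be sketched as a small userspace model. This is only an illustration under stated assumptions: the enum values, `safe_to_free_page_table()` and `model_free_page_table()` are hypothetical names invented here, not kernel APIs, and each lock is reduced to a single bit. A page table may be freed only when every lock that exists for the VMA, and under which a walker could be traversing it, is held exclusively by the collapsing thread.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model: one bit per lock under which page tables may be walked. */
enum walk_lock {
	WALK_MMAP_LOCK     = 1 << 0, /* stands in for the mmap lock */
	WALK_MAPPING_LOCK  = 1 << 1, /* stands in for the mapping (i_mmap) lock */
	WALK_ANON_VMA_LOCK = 1 << 2, /* stands in for the anon_vma lock */
	WALK_VMA_LOCK      = 1 << 3, /* stands in for the per-VMA lock this series adds */
};

/*
 * existing: locks that exist for this VMA (a lock that does not exist
 *           cannot admit a walker).
 * held:     locks the collapsing thread currently holds exclusively.
 * Returns true when no walker can be inside the page table.
 */
bool safe_to_free_page_table(unsigned int existing, unsigned int held)
{
	/* Any existing lock that is not held leaves a path open for a walker. */
	return (existing & ~held) == 0;
}

/* Model of the free step: refuses (via assert) to run without full exclusion. */
void model_free_page_table(unsigned int existing, unsigned int held, bool *present)
{
	assert(safe_to_free_page_table(existing, held));
	*present = false;
}
```

In this model, holding the three older locks but not `WALK_VMA_LOCK` makes `safe_to_free_page_table()` return false, which corresponds to the case the patch closes by adding vma_write_lock().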


I think one aspect of this patch that might cause trouble later on, if
support for non-anonymous VMAs is added, is that retract_page_tables()
now does vma_write_lock() while holding the mapping lock; the page
fault handling path would probably take the locks the other way
around, leading to a deadlock? So the vma_write_lock() in
retract_page_tables() might have to become a trylock later on.
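
A minimal userspace sketch of the trylock idea, assuming such a lock inversion ever became reachable. Everything here is hypothetical: `struct model_lock` and the `model_*` functions are invented for illustration, model only exclusive locking with a C11 `atomic_flag`, and are not the kernel's vma_write_lock() or i_mmap locking. The point shown is that the second lock is only tried while the first is held, so the holder of the first lock never blocks waiting for the second.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a kernel lock; only exclusive mode is modelled. */
struct model_lock {
	atomic_flag held;
};

bool model_trylock(struct model_lock *lock)
{
	/* test_and_set returns the previous value: false means we took the lock. */
	return !atomic_flag_test_and_set_explicit(&lock->held, memory_order_acquire);
}

void model_lock_spin(struct model_lock *lock)
{
	while (!model_trylock(lock))
		; /* busy-wait until the lock is free */
}

void model_unlock(struct model_lock *lock)
{
	atomic_flag_clear_explicit(&lock->held, memory_order_release);
}

/*
 * Models one iteration of a retract-style loop: the mapping lock is taken
 * first, then the per-VMA lock is only tried. Returns true if the VMA was
 * processed, false if it was skipped because its lock was busy.
 */
bool model_retract_one(struct model_lock *mapping, struct model_lock *vma)
{
	bool done = false;

	assert(mapping != NULL && vma != NULL);

	model_lock_spin(mapping);
	if (model_trylock(vma)) {
		/* ... the page table would be unlinked and freed here ... */
		done = true;
		model_unlock(vma);
	}
	/* On trylock failure we skip this VMA instead of waiting, so no cycle forms. */
	model_unlock(mapping);
	return done;
}
```

The cost of this design is that a busy VMA is skipped for this pass rather than retried.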

Related: Please add the new VMA lock to the big lock ordering comments
at the top of mm/rmap.c. (And maybe later mm/filemap.c, if/when you
add file VMA support.)
Suren Baghdasaryan Jan. 17, 2023, 9:05 p.m. UTC | #3
On Tue, Jan 17, 2023 at 12:28 PM Jann Horn <jannh@google.com> wrote:
>
> On Tue, Jan 17, 2023 at 4:25 PM Michal Hocko <mhocko@suse.com> wrote:
> > On Mon 09-01-23 12:53:13, Suren Baghdasaryan wrote:
> > > Protect VMA from concurrent page fault handler while collapsing a huge
> > > page. Page fault handler needs a stable PMD to use PTL and relies on
> > > per-VMA lock to prevent concurrent PMD changes. pmdp_collapse_flush(),
> > > set_huge_pmd() and collapse_and_free_pmd() can modify a PMD, which will
> > > not be detected by a page fault handler without proper locking.
> >
> > I am struggling with this changelog. Maybe because my recollection of
> > the THP collapsing subtleties is weak. But aren't you just trying to say
> > that the current #PF handling and THP collapsing need to be mutually
> > exclusive currently so in order to keep that assumption you have to mark
> > the vma write locked?
> >
> > Also it is not really clear to me how that handles other vmas which can
> > share the same thp?
>
> It's not about the hugepage itself, it's about how the THP collapse
> operation frees page tables.
>
> Before this series, page tables can be walked under any one of the
> mmap lock, the mapping lock, and the anon_vma lock; so when khugepaged
> unlinks and frees page tables, it must ensure that all of those either
> are locked or don't exist. This series adds a fourth lock under which
> page tables can be traversed, and so khugepaged must also lock out that one.
>
> There is a codepath in khugepaged that iterates through all mappings
> of a file to zap page tables (retract_page_tables()), which locks each
> visited mm with mmap_write_trylock() and now also does
> vma_write_lock().
>
>
> I think one aspect of this patch that might cause trouble later on, if
> support for non-anonymous VMAs is added, is that retract_page_tables()
> now does vma_write_lock() while holding the mapping lock; the page
> fault handling path would probably take the locks the other way
> around, leading to a deadlock? So the vma_write_lock() in
> retract_page_tables() might have to become a trylock later on.
>
> Related: Please add the new VMA lock to the big lock ordering comments
> at the top of mm/rmap.c. (And maybe later mm/filemap.c, if/when you
> add file VMA support.)

Thanks for the clarifications and the warning. I'll add appropriate
comments and will take this deadlocking scenario into account when
later implementing support for file-backed page faults.
Michal Hocko Jan. 18, 2023, 9:40 a.m. UTC | #4
On Tue 17-01-23 21:28:06, Jann Horn wrote:
> On Tue, Jan 17, 2023 at 4:25 PM Michal Hocko <mhocko@suse.com> wrote:
> > On Mon 09-01-23 12:53:13, Suren Baghdasaryan wrote:
> > > Protect VMA from concurrent page fault handler while collapsing a huge
> > > page. Page fault handler needs a stable PMD to use PTL and relies on
> > > per-VMA lock to prevent concurrent PMD changes. pmdp_collapse_flush(),
> > > set_huge_pmd() and collapse_and_free_pmd() can modify a PMD, which will
> > > not be detected by a page fault handler without proper locking.
> >
> > I am struggling with this changelog. Maybe because my recollection of
> > the THP collapsing subtleties is weak. But aren't you just trying to say
> > that the current #PF handling and THP collapsing need to be mutually
> > exclusive currently so in order to keep that assumption you have to mark
> > the vma write locked?
> >
> > Also it is not really clear to me how that handles other vmas which can
> > share the same thp?
> 
> It's not about the hugepage itself, it's about how the THP collapse
> operation frees page tables.
> 
> Before this series, page tables can be walked under any one of the
> mmap lock, the mapping lock, and the anon_vma lock; so when khugepaged
> unlinks and frees page tables, it must ensure that all of those either
> are locked or don't exist. This series adds a fourth lock under which
> page tables can be traversed, and so khugepaged must also lock out that one.
> 
> There is a codepath in khugepaged that iterates through all mappings
> of a file to zap page tables (retract_page_tables()), which locks each
> visited mm with mmap_write_trylock() and now also does
> vma_write_lock().

OK, I see. This would be a great addendum to the changelog.
 
> I think one aspect of this patch that might cause trouble later on, if
> support for non-anonymous VMAs is added, is that retract_page_tables()
> now does vma_write_lock() while holding the mapping lock; the page
> fault handling path would probably take the locks the other way
> around, leading to a deadlock? So the vma_write_lock() in
> retract_page_tables() might have to become a trylock later on.

This, right?
#PF			retract_page_tables
vma_read_lock
			i_mmap_lock_write
i_mmap_lock_read
			vma_write_lock


I might be missing something but I have only found huge_pmd_share to be
called from the #PF path. That one should be safe as it cannot be a
target for THP. Not that it would matter much because such a dependency
chain would be really subtle.
Jann Horn Jan. 18, 2023, 12:38 p.m. UTC | #5
On Wed, Jan 18, 2023 at 10:40 AM Michal Hocko <mhocko@suse.com> wrote:
> On Tue 17-01-23 21:28:06, Jann Horn wrote:
> > On Tue, Jan 17, 2023 at 4:25 PM Michal Hocko <mhocko@suse.com> wrote:
> > > On Mon 09-01-23 12:53:13, Suren Baghdasaryan wrote:
> > > > Protect VMA from concurrent page fault handler while collapsing a huge
> > > > page. Page fault handler needs a stable PMD to use PTL and relies on
> > > > per-VMA lock to prevent concurrent PMD changes. pmdp_collapse_flush(),
> > > > set_huge_pmd() and collapse_and_free_pmd() can modify a PMD, which will
> > > > not be detected by a page fault handler without proper locking.
> > >
> > > I am struggling with this changelog. Maybe because my recollection of
> > > the THP collapsing subtleties is weak. But aren't you just trying to say
> > > that the current #PF handling and THP collapsing need to be mutually
> > > exclusive currently so in order to keep that assumption you have to mark
> > > the vma write locked?
> > >
> > > Also it is not really clear to me how that handles other vmas which can
> > > share the same thp?
> >
> > It's not about the hugepage itself, it's about how the THP collapse
> > operation frees page tables.
> >
> > Before this series, page tables can be walked under any one of the
> > mmap lock, the mapping lock, and the anon_vma lock; so when khugepaged
> > unlinks and frees page tables, it must ensure that all of those either
> > are locked or don't exist. This series adds a fourth lock under which
> > page tables can be traversed, and so khugepaged must also lock out that one.
> >
> > There is a codepath in khugepaged that iterates through all mappings
> > of a file to zap page tables (retract_page_tables()), which locks each
> > visited mm with mmap_write_trylock() and now also does
> > vma_write_lock().
>
> OK, I see. This would be a great addendum to the changelog.
>
> > I think one aspect of this patch that might cause trouble later on, if
> > support for non-anonymous VMAs is added, is that retract_page_tables()
> > now does vma_write_lock() while holding the mapping lock; the page
> > fault handling path would probably take the locks the other way
> > around, leading to a deadlock? So the vma_write_lock() in
> > retract_page_tables() might have to become a trylock later on.
>
> This, right?
> #PF                     retract_page_tables
> vma_read_lock
>                         i_mmap_lock_write
> i_mmap_lock_read
>                         vma_write_lock
>
>
> I might be missing something but I have only found huge_pmd_share to be
> called from the #PF path. That one should be safe as it cannot be a
> target for THP. Not that it would matter much because such a dependency
> chain would be really subtle.

Oops, yeah. Now that I'm looking closer I also don't see a path from
the #PF path to i_mmap_lock_read. Sorry for sending you on a wild
goose chase.
Suren Baghdasaryan Jan. 18, 2023, 5:41 p.m. UTC | #6
On Wed, Jan 18, 2023 at 1:40 AM Michal Hocko <mhocko@suse.com> wrote:
>
> On Tue 17-01-23 21:28:06, Jann Horn wrote:
> > On Tue, Jan 17, 2023 at 4:25 PM Michal Hocko <mhocko@suse.com> wrote:
> > > On Mon 09-01-23 12:53:13, Suren Baghdasaryan wrote:
> > > > Protect VMA from concurrent page fault handler while collapsing a huge
> > > > page. Page fault handler needs a stable PMD to use PTL and relies on
> > > > per-VMA lock to prevent concurrent PMD changes. pmdp_collapse_flush(),
> > > > set_huge_pmd() and collapse_and_free_pmd() can modify a PMD, which will
> > > > not be detected by a page fault handler without proper locking.
> > >
> > > I am struggling with this changelog. Maybe because my recollection of
> > > the THP collapsing subtleties is weak. But aren't you just trying to say
> > > that the current #PF handling and THP collapsing need to be mutually
> > > exclusive currently so in order to keep that assumption you have to mark
> > > the vma write locked?
> > >
> > > Also it is not really clear to me how that handles other vmas which can
> > > share the same thp?
> >
> > It's not about the hugepage itself, it's about how the THP collapse
> > operation frees page tables.
> >
> > Before this series, page tables can be walked under any one of the
> > mmap lock, the mapping lock, and the anon_vma lock; so when khugepaged
> > unlinks and frees page tables, it must ensure that all of those either
> > are locked or don't exist. This series adds a fourth lock under which
> > page tables can be traversed, and so khugepaged must also lock out that one.
> >
> > There is a codepath in khugepaged that iterates through all mappings
> > of a file to zap page tables (retract_page_tables()), which locks each
> > visited mm with mmap_write_trylock() and now also does
> > vma_write_lock().
>
> OK, I see. This would be a great addendum to the changelog.

I'll add Jann's description in the changelog. Thanks Jann!

>
> > I think one aspect of this patch that might cause trouble later on, if
> > support for non-anonymous VMAs is added, is that retract_page_tables()
> > now does vma_write_lock() while holding the mapping lock; the page
> > fault handling path would probably take the locks the other way
> > around, leading to a deadlock? So the vma_write_lock() in
> > retract_page_tables() might have to become a trylock later on.
>
> This, right?
> #PF                     retract_page_tables
> vma_read_lock
>                         i_mmap_lock_write
> i_mmap_lock_read
>                         vma_write_lock
>
>
> I might be missing something but I have only found huge_pmd_share to be
> called from the #PF path. That one should be safe as it cannot be a
> target for THP. Not that it would matter much because such a dependency
> chain would be really subtle.
> --
> Michal Hocko
> SUSE Labs

Patch

diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index 5376246a3052..d8d0647f0c2c 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -1032,6 +1032,7 @@  static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
 	if (result != SCAN_SUCCEED)
 		goto out_up_write;
 
+	vma_write_lock(vma);
 	anon_vma_lock_write(vma->anon_vma);
 
 	mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, NULL, mm,
@@ -1503,6 +1504,9 @@  int collapse_pte_mapped_thp(struct mm_struct *mm, unsigned long addr,
 		goto drop_hpage;
 	}
 
+	/* Lock the vma before taking i_mmap and page table locks */
+	vma_write_lock(vma);
+
 	/*
 	 * We need to lock the mapping so that from here on, only GUP-fast and
 	 * hardware page walks can access the parts of the page tables that
@@ -1690,6 +1694,7 @@  static int retract_page_tables(struct address_space *mapping, pgoff_t pgoff,
 				result = SCAN_PTE_UFFD_WP;
 				goto unlock_next;
 			}
+			vma_write_lock(vma);
 			collapse_and_free_pmd(mm, vma, addr, pmd);
 			if (!cc->is_khugepaged && is_target)
 				result = set_huge_pmd(vma, addr, pmd, hpage);