diff mbox series

[RFC,24/37] mm: implement speculative handling in __do_fault()

Message ID 20210407014502.24091-25-michel@lespinasse.org (mailing list archive)
State New, archived
Headers show
Series [RFC,01/37] mmap locking API: mmap_lock_is_contended returns a bool | expand

Commit Message

Michel Lespinasse April 7, 2021, 1:44 a.m. UTC
In the speculative case, call the vm_ops->fault() method from within
an rcu read locked section, and verify the mmap sequence lock at the
start of the section. A match guarantees that the original vma is still
valid at that time, and that the associated vma->vm_file stays valid
while the vm_ops->fault() method is running.

Note that this implies that speculative faults can not sleep within
the vm_ops->fault method. We will only attempt to fetch existing pages
from the page cache during speculative faults; any miss (or prefetch)
will be handled by falling back to non-speculative fault handling.

The speculative handling case also does not preallocate page tables,
as it is always called with a pre-existing page table.

Signed-off-by: Michel Lespinasse <michel@lespinasse.org>
---
 mm/memory.c | 63 +++++++++++++++++++++++++++++++++++------------------
 1 file changed, 42 insertions(+), 21 deletions(-)

Comments

Matthew Wilcox April 7, 2021, 2:35 a.m. UTC | #1
On Tue, Apr 06, 2021 at 06:44:49PM -0700, Michel Lespinasse wrote:
> In the speculative case, call the vm_ops->fault() method from within
> an rcu read locked section, and verify the mmap sequence lock at the
> start of the section. A match guarantees that the original vma is still
> valid at that time, and that the associated vma->vm_file stays valid
> while the vm_ops->fault() method is running.
> 
> Note that this implies that speculative faults can not sleep within
> the vm_ops->fault method. We will only attempt to fetch existing pages
> from the page cache during speculative faults; any miss (or prefetch)
> will be handled by falling back to non-speculative fault handling.
> 
> The speculative handling case also does not preallocate page tables,
> as it is always called with a pre-existing page table.

I still don't understand why you want to do this.  The speculative
fault that doesn't do I/O is already here, and it's called ->map_pages
(which I see you also do later).  So what's the point of this patch?
Michel Lespinasse April 7, 2021, 2:53 a.m. UTC | #2
On Wed, Apr 07, 2021 at 03:35:27AM +0100, Matthew Wilcox wrote:
> On Tue, Apr 06, 2021 at 06:44:49PM -0700, Michel Lespinasse wrote:
> > In the speculative case, call the vm_ops->fault() method from within
> > an rcu read locked section, and verify the mmap sequence lock at the
> > start of the section. A match guarantees that the original vma is still
> > valid at that time, and that the associated vma->vm_file stays valid
> > while the vm_ops->fault() method is running.
> > 
> > Note that this implies that speculative faults can not sleep within
> > the vm_ops->fault method. We will only attempt to fetch existing pages
> > from the page cache during speculative faults; any miss (or prefetch)
> > will be handled by falling back to non-speculative fault handling.
> > 
> > The speculative handling case also does not preallocate page tables,
> > as it is always called with a pre-existing page table.
> 
> I still don't understand why you want to do this.  The speculative
> fault that doesn't do I/O is already here, and it's called ->map_pages
> (which I see you also do later).  So what's the point of this patch?

I have to admit I did not give much tought about which path would be
generally most common here.

The speculative vm_ops->fault path would be used:
- for private mapping write faults,
- when fault-around is disabled (probably an uncommon case in general,
  but actually common at Google).

That said, I do think your point makes sense in general, espicially if
this could help avoid the per-filesystem enable bit.
Matthew Wilcox April 7, 2021, 3:01 a.m. UTC | #3
On Tue, Apr 06, 2021 at 07:53:20PM -0700, Michel Lespinasse wrote:
> On Wed, Apr 07, 2021 at 03:35:27AM +0100, Matthew Wilcox wrote:
> > On Tue, Apr 06, 2021 at 06:44:49PM -0700, Michel Lespinasse wrote:
> > > In the speculative case, call the vm_ops->fault() method from within
> > > an rcu read locked section, and verify the mmap sequence lock at the
> > > start of the section. A match guarantees that the original vma is still
> > > valid at that time, and that the associated vma->vm_file stays valid
> > > while the vm_ops->fault() method is running.
> > > 
> > > Note that this implies that speculative faults can not sleep within
> > > the vm_ops->fault method. We will only attempt to fetch existing pages
> > > from the page cache during speculative faults; any miss (or prefetch)
> > > will be handled by falling back to non-speculative fault handling.
> > > 
> > > The speculative handling case also does not preallocate page tables,
> > > as it is always called with a pre-existing page table.
> > 
> > I still don't understand why you want to do this.  The speculative
> > fault that doesn't do I/O is already here, and it's called ->map_pages
> > (which I see you also do later).  So what's the point of this patch?
> 
> I have to admit I did not give much tought about which path would be
> generally most common here.
> 
> The speculative vm_ops->fault path would be used:
> - for private mapping write faults,
> - when fault-around is disabled (probably an uncommon case in general,
>   but actually common at Google).

Why is it disabled?  The PTE table is already being allocated and filled
... why not quickly check the page cache to see if there are other pages
within this 2MB range and fill in their PTEs too?  Even if only one
of them is ever hit, the reduction in page faults is surely worth it.
Obviously if your workload has such non-locality that you hit only one
page in a 2MB range and then no other, it'll lose ... but then you have
a really badly designed workload!

> That said, I do think your point makes sense in general, espicially if
> this could help avoid the per-filesystem enable bit.
Peter Zijlstra April 7, 2021, 2:40 p.m. UTC | #4
On Tue, Apr 06, 2021 at 06:44:49PM -0700, Michel Lespinasse wrote:
> In the speculative case, call the vm_ops->fault() method from within
> an rcu read locked section, and verify the mmap sequence lock at the
> start of the section. A match guarantees that the original vma is still
> valid at that time, and that the associated vma->vm_file stays valid
> while the vm_ops->fault() method is running.
> 
> Note that this implies that speculative faults can not sleep within
> the vm_ops->fault method. We will only attempt to fetch existing pages
> from the page cache during speculative faults; any miss (or prefetch)
> will be handled by falling back to non-speculative fault handling.
> 
> The speculative handling case also does not preallocate page tables,
> as it is always called with a pre-existing page table.

So what's wrong with SRCU ? Laurent mumbled something about frequent
SRCU kthread activity being a problem; is that still so and is that
fundamentally unfixable?

Because to me it seems a much more natural solution to the whole thing.
Michel Lespinasse April 7, 2021, 9:20 p.m. UTC | #5
On Wed, Apr 07, 2021 at 04:40:34PM +0200, Peter Zijlstra wrote:
> On Tue, Apr 06, 2021 at 06:44:49PM -0700, Michel Lespinasse wrote:
> > In the speculative case, call the vm_ops->fault() method from within
> > an rcu read locked section, and verify the mmap sequence lock at the
> > start of the section. A match guarantees that the original vma is still
> > valid at that time, and that the associated vma->vm_file stays valid
> > while the vm_ops->fault() method is running.
> > 
> > Note that this implies that speculative faults can not sleep within
> > the vm_ops->fault method. We will only attempt to fetch existing pages
> > from the page cache during speculative faults; any miss (or prefetch)
> > will be handled by falling back to non-speculative fault handling.
> > 
> > The speculative handling case also does not preallocate page tables,
> > as it is always called with a pre-existing page table.
> 
> So what's wrong with SRCU ? Laurent mumbled something about frequent
> SRCU kthread activity being a problem; is that still so and is that
> fundamentally unfixable?
> 
> Because to me it seems a much more natural solution to the whole thing.

The short answer is that I did not try SRCU. My thought process was,
page cache already uses an RCU read lock, I just need to expand its
scope a little.

Using SRCU might allow us to hit disk during speculative faults; OTOH
we may need to switch to a more robust validation mechanism than the
global counter to reap any benefits.
Matthew Wilcox April 7, 2021, 9:27 p.m. UTC | #6
On Wed, Apr 07, 2021 at 02:20:27PM -0700, Michel Lespinasse wrote:
> On Wed, Apr 07, 2021 at 04:40:34PM +0200, Peter Zijlstra wrote:
> > On Tue, Apr 06, 2021 at 06:44:49PM -0700, Michel Lespinasse wrote:
> > > In the speculative case, call the vm_ops->fault() method from within
> > > an rcu read locked section, and verify the mmap sequence lock at the
> > > start of the section. A match guarantees that the original vma is still
> > > valid at that time, and that the associated vma->vm_file stays valid
> > > while the vm_ops->fault() method is running.
> > > 
> > > Note that this implies that speculative faults can not sleep within
> > > the vm_ops->fault method. We will only attempt to fetch existing pages
> > > from the page cache during speculative faults; any miss (or prefetch)
> > > will be handled by falling back to non-speculative fault handling.
> > > 
> > > The speculative handling case also does not preallocate page tables,
> > > as it is always called with a pre-existing page table.
> > 
> > So what's wrong with SRCU ? Laurent mumbled something about frequent
> > SRCU kthread activity being a problem; is that still so and is that
> > fundamentally unfixable?
> > 
> > Because to me it seems a much more natural solution to the whole thing.
> 
> The short answer is that I did not try SRCU. My thought process was,
> page cache already uses an RCU read lock, I just need to expand its
> scope a little.
> 
> Using SRCU might allow us to hit disk during speculative faults; OTOH
> we may need to switch to a more robust validation mechanism than the
> global counter to reap any benefits.

Why would you want to do I/O under SRCU?!  The benefit of SRCU is that
you can allocate page tables under SRCU.

Doing I/O without any lock held already works; it just uses the file
refcount.  It would be better to use a vma refcount, as I already said.
Peter Zijlstra April 8, 2021, 7 a.m. UTC | #7
On Wed, Apr 07, 2021 at 10:27:12PM +0100, Matthew Wilcox wrote:
> Doing I/O without any lock held already works; it just uses the file
> refcount.  It would be better to use a vma refcount, as I already said.

The original workload that I developed SPF for (waaaay back when) was
prefaulting a single huge vma. Using a vma refcount was a total loss
because it resulted in the same cacheline contention that down_read()
was having.

As such, I'm always incredibly sad to see mention of vma refcounts.
They're fundamentally not solving the problem :/
Matthew Wilcox April 8, 2021, 7:13 a.m. UTC | #8
On Thu, Apr 08, 2021 at 09:00:26AM +0200, Peter Zijlstra wrote:
> On Wed, Apr 07, 2021 at 10:27:12PM +0100, Matthew Wilcox wrote:
> > Doing I/O without any lock held already works; it just uses the file
> > refcount.  It would be better to use a vma refcount, as I already said.
> 
> The original workload that I developed SPF for (waaaay back when) was
> prefaulting a single huge vma. Using a vma refcount was a total loss
> because it resulted in the same cacheline contention that down_read()
> was having.
> 
> As such, I'm always incredibly sad to see mention of vma refcounts.
> They're fundamentally not solving the problem :/

OK, let me outline my locking scheme because I think it's rather better
than Michel's.  The vma refcount is the slow path.

1. take the RCU read lock
2. walk the pgd/p4d/pud/pmd
3. allocate page tables if necessary.  *handwave GFP flags*.
4. walk the vma tree
5. call ->map_pages
6. take ptlock
7. insert page(s)
8. drop ptlock
if this all worked out, we're done, drop the RCU read lock and return.
9. increment vma refcount
10. drop RCU read lock
11. call ->fault
12. decrement vma refcount

Compared to today, where we bump the refcount on the file underlying the
vma, this is _better_ scalability -- different mappings of the same file
will not contend on the file's refcount.

I suspect your huge VMA was anon, and that wouldn't need a vma refcount
as faulting in new pages doesn't need to do I/O, just drop the RCU
lock, allocate and retry.
Peter Zijlstra April 8, 2021, 8:18 a.m. UTC | #9
On Thu, Apr 08, 2021 at 08:13:43AM +0100, Matthew Wilcox wrote:
> On Thu, Apr 08, 2021 at 09:00:26AM +0200, Peter Zijlstra wrote:
> > On Wed, Apr 07, 2021 at 10:27:12PM +0100, Matthew Wilcox wrote:
> > > Doing I/O without any lock held already works; it just uses the file
> > > refcount.  It would be better to use a vma refcount, as I already said.
> > 
> > The original workload that I developed SPF for (waaaay back when) was
> > prefaulting a single huge vma. Using a vma refcount was a total loss
> > because it resulted in the same cacheline contention that down_read()
> > was having.
> > 
> > As such, I'm always incredibly sad to see mention of vma refcounts.
> > They're fundamentally not solving the problem :/
> 
> OK, let me outline my locking scheme because I think it's rather better
> than Michel's.  The vma refcount is the slow path.
> 
> 1. take the RCU read lock
> 2. walk the pgd/p4d/pud/pmd
> 3. allocate page tables if necessary.  *handwave GFP flags*.

The problem with allocating page-tables was that you can race with
zap_page_range() if you're not holding mmap_sem, and as such can install
a page-table after, in which case it leaks.

IIRC that was solvable, but it did need a bit of care.

> 4. walk the vma tree
> 5. call ->map_pages

I can't remember ->map_pages().. I think that's 'new'. git-blame tells
me that's 2014, and I did the original SPF in 2010.

Yes, that looks like a useful thing to have, it does the non-blocking
part of ->fault().

I suppose the thing missing here is that if ->map_pages() does not
return a page, we have:

  goto 9

> 6. take ptlock
> 7. insert page(s)
> 8. drop ptlock
> if this all worked out, we're done, drop the RCU read lock and return.

> 9. increment vma refcount
> 10. drop RCU read lock
> 11. call ->fault
> 12. decrement vma refcount

And here we do 6-8 again, right?

> Compared to today, where we bump the refcount on the file underlying the
> vma, this is _better_ scalability -- different mappings of the same file
> will not contend on the file's refcount.
>
> I suspect your huge VMA was anon, and that wouldn't need a vma refcount
> as faulting in new pages doesn't need to do I/O, just drop the RCU
> lock, allocate and retry.

IIRC yes, it was either a huge matrix setup or some database thing, I
can't remember. But the thing was, we didn't have that ->map_pages(), so
we had to call ->fault(), which can sleep, so I had to use SRCU across
the whole thing (or rather, I hacked up preemptible-rcu, because SRCU
was super primitive back then). It did kick start significant SRCU
rework IIRC. Anyway, that's all ancient history.
Michel Lespinasse April 8, 2021, 8:37 a.m. UTC | #10
On Thu, Apr 08, 2021 at 08:13:43AM +0100, Matthew Wilcox wrote:
> On Thu, Apr 08, 2021 at 09:00:26AM +0200, Peter Zijlstra wrote:
> > On Wed, Apr 07, 2021 at 10:27:12PM +0100, Matthew Wilcox wrote:
> > > Doing I/O without any lock held already works; it just uses the file
> > > refcount.  It would be better to use a vma refcount, as I already said.
> > 
> > The original workload that I developed SPF for (waaaay back when) was
> > prefaulting a single huge vma. Using a vma refcount was a total loss
> > because it resulted in the same cacheline contention that down_read()
> > was having.
> > 
> > As such, I'm always incredibly sad to see mention of vma refcounts.
> > They're fundamentally not solving the problem :/
> 
> OK, let me outline my locking scheme because I think it's rather better
> than Michel's.  The vma refcount is the slow path.
> 
> 1. take the RCU read lock
> 2. walk the pgd/p4d/pud/pmd
> 3. allocate page tables if necessary.  *handwave GFP flags*.
> 4. walk the vma tree
> 5. call ->map_pages
> 6. take ptlock
> 7. insert page(s)
> 8. drop ptlock
> if this all worked out, we're done, drop the RCU read lock and return.
> 9. increment vma refcount
> 10. drop RCU read lock
> 11. call ->fault
> 12. decrement vma refcount

Note that most of your proposed steps seem similar in principle to mine.
Looking at the fast path (steps 1-8):
- step 2 sounds like the speculative part of __handle_mm_fault()
- (step 3 not included in my proposal)
- step 4 is basically the lookup I currently have in the arch fault handler
- step 6 sounds like the speculative part of map_pte_lock()

I have working implementations for each step, while your proposal
summarizes each as a point item. It's not clear to me what to make of it;
presumably you would be "filling in the blanks" in a different way
than I have but you are not explaining how. Are you suggesting that
the precautions taken in each step to avoid races with mmap writers
would not be necessary in your proposal ? if that is the case, what is
the alternative mechanism would you use to handle such races ?

Going back to the source of this, you suggested not copying the VMA,
what is your proposed alternative ? Do you suggest that fault handlers
should deal with the vma potentially mutating under them ? Or should
mmap writers consider vmas as immutable and copy them whenever they
want to change them ? or are you implying a locking mechanism that would
prevent mmap writers from executing while the fault is running ?

> Compared to today, where we bump the refcount on the file underlying the
> vma, this is _better_ scalability -- different mappings of the same file
> will not contend on the file's refcount.
> 
> I suspect your huge VMA was anon, and that wouldn't need a vma refcount
> as faulting in new pages doesn't need to do I/O, just drop the RCU
> lock, allocate and retry.
Matthew Wilcox April 8, 2021, 11:28 a.m. UTC | #11
On Thu, Apr 08, 2021 at 01:37:34AM -0700, Michel Lespinasse wrote:
> On Thu, Apr 08, 2021 at 08:13:43AM +0100, Matthew Wilcox wrote:
> > On Thu, Apr 08, 2021 at 09:00:26AM +0200, Peter Zijlstra wrote:
> > > On Wed, Apr 07, 2021 at 10:27:12PM +0100, Matthew Wilcox wrote:
> > > > Doing I/O without any lock held already works; it just uses the file
> > > > refcount.  It would be better to use a vma refcount, as I already said.
> > > 
> > > The original workload that I developed SPF for (waaaay back when) was
> > > prefaulting a single huge vma. Using a vma refcount was a total loss
> > > because it resulted in the same cacheline contention that down_read()
> > > was having.
> > > 
> > > As such, I'm always incredibly sad to see mention of vma refcounts.
> > > They're fundamentally not solving the problem :/
> > 
> > OK, let me outline my locking scheme because I think it's rather better
> > than Michel's.  The vma refcount is the slow path.
> > 
> > 1. take the RCU read lock
> > 2. walk the pgd/p4d/pud/pmd
> > 3. allocate page tables if necessary.  *handwave GFP flags*.
> > 4. walk the vma tree
> > 5. call ->map_pages
> > 6. take ptlock
> > 7. insert page(s)
> > 8. drop ptlock
> > if this all worked out, we're done, drop the RCU read lock and return.
> > 9. increment vma refcount
> > 10. drop RCU read lock
> > 11. call ->fault
> > 12. decrement vma refcount
> 
> Note that most of your proposed steps seem similar in principle to mine.
> Looking at the fast path (steps 1-8):
> - step 2 sounds like the speculative part of __handle_mm_fault()
> - (step 3 not included in my proposal)
> - step 4 is basically the lookup I currently have in the arch fault handler
> - step 6 sounds like the speculative part of map_pte_lock()
> 
> I have working implementations for each step, while your proposal
> summarizes each as a point item. It's not clear to me what to make of it;
> presumably you would be "filling in the blanks" in a different way
> than I have but you are not explaining how. Are you suggesting that
> the precautions taken in each step to avoid races with mmap writers
> would not be necessary in your proposal ? if that is the case, what is
> the alternative mechanism would you use to handle such races ?

I don't know if you noticed, I've been a little busy with memory folios.
I did tell you that on the call, but you don't seem to retain anything
I tell you on the call, so maybe I shouldn't bother calling in any more.

> Going back to the source of this, you suggested not copying the VMA,
> what is your proposed alternative ? Do you suggest that fault handlers
> should deal with the vma potentially mutating under them ? Or should
> mmap writers consider vmas as immutable and copy them whenever they
> want to change them ? or are you implying a locking mechanism that would
> prevent mmap writers from executing while the fault is running ?

The VMA should be immutable, as I explained to you before.
diff mbox series

Patch

diff --git a/mm/memory.c b/mm/memory.c
index 6eddd7b4e89c..7139004c624d 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3709,29 +3709,50 @@  static vm_fault_t __do_fault(struct vm_fault *vmf)
 	struct vm_area_struct *vma = vmf->vma;
 	vm_fault_t ret;
 
-	/*
-	 * Preallocate pte before we take page_lock because this might lead to
-	 * deadlocks for memcg reclaim which waits for pages under writeback:
-	 *				lock_page(A)
-	 *				SetPageWriteback(A)
-	 *				unlock_page(A)
-	 * lock_page(B)
-	 *				lock_page(B)
-	 * pte_alloc_one
-	 *   shrink_page_list
-	 *     wait_on_page_writeback(A)
-	 *				SetPageWriteback(B)
-	 *				unlock_page(B)
-	 *				# flush A, B to clear the writeback
-	 */
-	if (pmd_none(*vmf->pmd) && !vmf->prealloc_pte) {
-		vmf->prealloc_pte = pte_alloc_one(vma->vm_mm);
-		if (!vmf->prealloc_pte)
-			return VM_FAULT_OOM;
-		smp_wmb(); /* See comment in __pte_alloc() */
+#ifdef CONFIG_SPECULATIVE_PAGE_FAULT
+	if (vmf->flags & FAULT_FLAG_SPECULATIVE) {
+		rcu_read_lock();
+		if (!mmap_seq_read_check(vmf->vma->vm_mm, vmf->seq)) {
+			ret = VM_FAULT_RETRY;
+		} else {
+			/*
+			 * The mmap sequence count check guarantees that the
+			 * vma we fetched at the start of the fault was still
+			 * current at that point in time. The rcu read lock
+			 * ensures vmf->vma->vm_file stays valid.
+			 */
+			ret = vma->vm_ops->fault(vmf);
+		}
+		rcu_read_unlock();
+	} else
+#endif
+	{
+		/*
+		 * Preallocate pte before we take page_lock because
+		 * this might lead to deadlocks for memcg reclaim
+		 * which waits for pages under writeback:
+		 *				lock_page(A)
+		 *				SetPageWriteback(A)
+		 *				unlock_page(A)
+		 * lock_page(B)
+		 *				lock_page(B)
+		 * pte_alloc_one
+		 *   shrink_page_list
+		 *     wait_on_page_writeback(A)
+		 *				SetPageWriteback(B)
+		 *				unlock_page(B)
+		 *				# flush A, B to clear writeback
+		 */
+		if (pmd_none(*vmf->pmd) && !vmf->prealloc_pte) {
+			vmf->prealloc_pte = pte_alloc_one(vma->vm_mm);
+			if (!vmf->prealloc_pte)
+				return VM_FAULT_OOM;
+			smp_wmb(); /* See comment in __pte_alloc() */
+		}
+
+		ret = vma->vm_ops->fault(vmf);
 	}
 
-	ret = vma->vm_ops->fault(vmf);
 	if (unlikely(ret & (VM_FAULT_ERROR | VM_FAULT_NOPAGE | VM_FAULT_RETRY |
 			    VM_FAULT_DONE_COW)))
 		return ret;