[RFC,11/37] x86/mm: attempt speculative mm faults first

Message ID 20210407014502.24091-12-michel@lespinasse.org (mailing list archive)
State New, archived
Series [RFC,01/37] mmap locking API: mmap_lock_is_contended returns a bool

Commit Message

Michel Lespinasse April 7, 2021, 1:44 a.m. UTC
Attempt speculative mm fault handling first, and fall back to the
existing (non-speculative) code if that fails.

The speculative handling closely mirrors the non-speculative logic.
This includes some x86-specific bits, such as the access_error() call.
This is why we chose to implement the speculative handling in arch/x86
rather than in common code.

The vma is first looked up and copied, under protection of the RCU
read lock. The mmap lock sequence count is used to verify the
integrity of the copied vma, and is passed to do_handle_mm_fault() to
check for races with mmap writers when finalizing the fault.
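
In outline, the speculative path added by the patch below follows this
pattern (condensed from the arch/x86/mm/fault.c hunk; the
mmap_seq_read_start()/mmap_seq_read_check() helpers are introduced
earlier in this series):

	seq = mmap_seq_read_start(mm);	/* snapshot the mmap_lock seqcount */
	if (seq & 1)			/* odd count: an mmap writer is active */
		goto spf_abort;
	rcu_read_lock();
	vma = find_vma(mm, address);	/* vma lookup without taking mmap_lock */
	if (!vma || vma->vm_start > address) {
		rcu_read_unlock();
		goto spf_abort;
	}
	pvma = *vma;			/* copy the vma to the stack */
	rcu_read_unlock();
	if (!mmap_seq_read_check(mm, seq))
		goto spf_abort;		/* a writer raced us; fall back */
	/* the fault is then handled against the stack copy, &pvma */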

Signed-off-by: Michel Lespinasse <michel@lespinasse.org>
---
 arch/x86/mm/fault.c           | 36 +++++++++++++++++++++++++++++++++++
 include/linux/vm_event_item.h |  4 ++++
 mm/vmstat.c                   |  4 ++++
 3 files changed, 44 insertions(+)

Comments

Peter Zijlstra April 7, 2021, 2:48 p.m. UTC | #1
On Tue, Apr 06, 2021 at 06:44:36PM -0700, Michel Lespinasse wrote:
> --- a/arch/x86/mm/fault.c
> +++ b/arch/x86/mm/fault.c
> @@ -1219,6 +1219,8 @@ void do_user_addr_fault(struct pt_regs *regs,
>  	struct mm_struct *mm;
>  	vm_fault_t fault;
>  	unsigned int flags = FAULT_FLAG_DEFAULT;
> +	struct vm_area_struct pvma;

That's 200 bytes on-stack... I suppose that's just about acceptable, but
perhaps we need a comment in struct vm_area_struct to make people aware
this thing lives on-stack and size really is an issue now.
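
For illustration, the kind of comment Peter is suggesting might read
something like this (the wording and placement are only a sketch, not
part of the patch):

	/*
	 * NOTE: with speculative page faults, do_user_addr_fault() puts a
	 * copy of a vm_area_struct on the kernel stack. Size is now a real
	 * concern; think twice before adding fields to this structure.
	 */
	struct vm_area_struct {
		unsigned long vm_start;	/* our start address within vm_mm */
		unsigned long vm_end;	/* first byte after our end address */
		/* ... remaining fields unchanged ... */
	};
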
Matthew Wilcox April 7, 2021, 3:35 p.m. UTC | #2
On Wed, Apr 07, 2021 at 04:48:44PM +0200, Peter Zijlstra wrote:
> On Tue, Apr 06, 2021 at 06:44:36PM -0700, Michel Lespinasse wrote:
> > --- a/arch/x86/mm/fault.c
> > +++ b/arch/x86/mm/fault.c
> > @@ -1219,6 +1219,8 @@ void do_user_addr_fault(struct pt_regs *regs,
> >  	struct mm_struct *mm;
> >  	vm_fault_t fault;
> >  	unsigned int flags = FAULT_FLAG_DEFAULT;
> > +	struct vm_area_struct pvma;
> 
> That's 200 bytes on-stack... I suppose that's just about acceptable, but
> perhaps we need a comment in struct vm_area_struct to make people aware
> this thing lives on-stack and size really is an issue now.

Michel's gone off and done his own thing here.

The rest of us (Laurent, Liam & I) are working on top of the maple tree,
which shrinks vm_area_struct by five pointers (40 bytes on 64-bit), so
just 160 bytes.

Also, our approach doesn't involve copying VMAs in order to handle a fault.
Michel Lespinasse April 7, 2021, 8:14 p.m. UTC | #3
On Wed, Apr 07, 2021 at 04:48:44PM +0200, Peter Zijlstra wrote:
> On Tue, Apr 06, 2021 at 06:44:36PM -0700, Michel Lespinasse wrote:
> > --- a/arch/x86/mm/fault.c
> > +++ b/arch/x86/mm/fault.c
> > @@ -1219,6 +1219,8 @@ void do_user_addr_fault(struct pt_regs *regs,
> >  	struct mm_struct *mm;
> >  	vm_fault_t fault;
> >  	unsigned int flags = FAULT_FLAG_DEFAULT;
> > +	struct vm_area_struct pvma;
> 
> That's 200 bytes on-stack... I suppose that's just about acceptable, but
> perhaps we need a comment in struct vm_area_struct to make people aware
> this thing lives on-stack and size really is an issue now.

Right, I agree that having the vma copy on-stack is not ideal.

I think what really should be done is to copy just the attributes of
the vma that will be needed during the page fault: things like vm_mm,
vm_page_prot, vm_flags, vm_ops, vm_pgoff, vm_file, vm_private_data,
vm_policy. We definitely do not need rbtree and rmap fields such as
vm_prev, vm_next, vm_rb, rb_subtree_gap, shared, anon_vma_chain, etc.

The reason I did things this way is that changing the entire fault
handler to use attributes stored in struct vm_fault, rather than in
the original vma, would be quite intrusive. I think it would be a
reasonable step to consider once there is agreement on the rest of the
speculative fault patch set, but it's not practical to do before then.
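
A minimal sketch of that attribute copy, assuming the field list above
(the struct name spf_vma_attrs and its exact layout are hypothetical,
not something from the patchset):

	/* Hypothetical: only the vma fields the fault handler reads. */
	struct spf_vma_attrs {
		struct mm_struct *vm_mm;			/* owning address space */
		pgprot_t vm_page_prot;				/* access permissions */
		unsigned long vm_flags;				/* VM_READ, VM_WRITE, ... */
		const struct vm_operations_struct *vm_ops;	/* fault callbacks */
		unsigned long vm_pgoff;				/* offset in vm_file */
		struct file *vm_file;				/* backing file, if any */
		void *vm_private_data;				/* driver-private data */
	#ifdef CONFIG_NUMA
		struct mempolicy *vm_policy;			/* NUMA policy for the vma */
	#endif
	};
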
Michel Lespinasse April 7, 2021, 8:32 p.m. UTC | #4
On Wed, Apr 07, 2021 at 04:35:28PM +0100, Matthew Wilcox wrote:
> On Wed, Apr 07, 2021 at 04:48:44PM +0200, Peter Zijlstra wrote:
> > On Tue, Apr 06, 2021 at 06:44:36PM -0700, Michel Lespinasse wrote:
> > > --- a/arch/x86/mm/fault.c
> > > +++ b/arch/x86/mm/fault.c
> > > @@ -1219,6 +1219,8 @@ void do_user_addr_fault(struct pt_regs *regs,
> > >  	struct mm_struct *mm;
> > >  	vm_fault_t fault;
> > >  	unsigned int flags = FAULT_FLAG_DEFAULT;
> > > +	struct vm_area_struct pvma;
> > 
> > That's 200 bytes on-stack... I suppose that's just about acceptable, but
> > perhaps we need a comment in struct vm_area_struct to make people aware
> > this thing lives on-stack and size really is an issue now.
> 
> Michel's gone off and done his own thing here.

I don't think that is an entirely fair representation. First, we are
both aware of each other's work; there is no working in dark caves here.
But also, I don't even consider this patchset to be entirely my own
thing; most of the main building blocks come from proposals that
predate mine.

> The rest of us (Laurent, Liam & I) are working on top of the maple tree,
> which shrinks vm_area_struct by five pointers (40 bytes on 64-bit), so
> just 160 bytes.

The idea of evaluating the maple tree and speculative faults as a
bundle actually worries me. I think both ideas are interesting and
worth looking into on their own, but I'm not convinced that they have
much to do with each other.

> Also, our approach doesn't involve copying VMAs in order to handle a fault.

See my other reply to Peter's message - copying VMAs is a convenient
way to reduce the size of the patchset, but it's not fundamental to
the approach at all.

Patch

diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c
index a73347e2cdfc..f8c8e325af77 100644
--- a/arch/x86/mm/fault.c
+++ b/arch/x86/mm/fault.c
@@ -1219,6 +1219,8 @@ void do_user_addr_fault(struct pt_regs *regs,
 	struct mm_struct *mm;
 	vm_fault_t fault;
 	unsigned int flags = FAULT_FLAG_DEFAULT;
+	struct vm_area_struct pvma;
+	unsigned long seq;
 
 	tsk = current;
 	mm = tsk->mm;
@@ -1316,6 +1318,39 @@ void do_user_addr_fault(struct pt_regs *regs,
 	}
 #endif
 
+	count_vm_event(SPF_ATTEMPT);
+	seq = mmap_seq_read_start(mm);
+	if (seq & 1)
+		goto spf_abort;
+	rcu_read_lock();
+	vma = find_vma(mm, address);
+	if (!vma || vma->vm_start > address) {
+		rcu_read_unlock();
+		goto spf_abort;
+	}
+	pvma = *vma;
+	rcu_read_unlock();
+	if (!mmap_seq_read_check(mm, seq))
+		goto spf_abort;
+	vma = &pvma;
+	if (unlikely(access_error(error_code, vma)))
+		goto spf_abort;
+	fault = do_handle_mm_fault(vma, address,
+				   flags | FAULT_FLAG_SPECULATIVE, seq, regs);
+
+	/* Quick path to respond to signals */
+	if (fault_signal_pending(fault, regs)) {
+		if (!user_mode(regs))
+			kernelmode_fixup_or_oops(regs, error_code, address,
+						 SIGBUS, BUS_ADRERR);
+		return;
+	}
+	if (!(fault & VM_FAULT_RETRY))
+		goto done;
+
+spf_abort:
+	count_vm_event(SPF_ABORT);
+
 	/*
 	 * Kernel-mode access to the user address space should only occur
 	 * on well-defined single instructions listed in the exception
@@ -1412,6 +1447,7 @@ void do_user_addr_fault(struct pt_regs *regs,
 	}
 
 	mmap_read_unlock(mm);
+done:
 	if (likely(!(fault & VM_FAULT_ERROR)))
 		return;
 
diff --git a/include/linux/vm_event_item.h b/include/linux/vm_event_item.h
index 18e75974d4e3..cc4f8d14e43f 100644
--- a/include/linux/vm_event_item.h
+++ b/include/linux/vm_event_item.h
@@ -120,6 +120,10 @@ enum vm_event_item { PGPGIN, PGPGOUT, PSWPIN, PSWPOUT,
 #ifdef CONFIG_SWAP
 		SWAP_RA,
 		SWAP_RA_HIT,
+#endif
+#ifdef CONFIG_SPECULATIVE_PAGE_FAULT
+		SPF_ATTEMPT,
+		SPF_ABORT,
 #endif
 		NR_VM_EVENT_ITEMS
 };
diff --git a/mm/vmstat.c b/mm/vmstat.c
index 74b2c374b86c..9ae1c27a549e 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -1365,6 +1365,10 @@ const char * const vmstat_text[] = {
 	"swap_ra",
 	"swap_ra_hit",
 #endif
+#ifdef CONFIG_SPECULATIVE_PAGE_FAULT
+	"spf_attempt",
+	"spf_abort",
+#endif
 #endif /* CONFIG_VM_EVENT_COUNTERS || CONFIG_MEMCG */
 };
 #endif /* CONFIG_PROC_FS || CONFIG_SYSFS || CONFIG_NUMA || CONFIG_MEMCG */