
[resend,mm,net-next,2/3] mm: Add vm_insert_pages().

Message ID 20200128025958.43490-2-arjunroy.kdev@gmail.com (mailing list archive)
State New, archived
Series [resend,mm,net-next,1/3] mm: Refactor insert_page to prepare for batched-lock insert.

Commit Message

Arjun Roy Jan. 28, 2020, 2:59 a.m. UTC
From: Arjun Roy <arjunroy@google.com>

Add the ability to insert multiple pages at once into a user VM with
fewer PTE spinlock operations.

The intention of this patch-set is to reduce atomic ops for
tcp zerocopy receives, which normally hit the same spinlock multiple
times consecutively.

Signed-off-by: Arjun Roy <arjunroy@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com>

---
 include/linux/mm.h |   2 +
 mm/memory.c        | 111 ++++++++++++++++++++++++++++++++++++++++++++-
 2 files changed, 111 insertions(+), 2 deletions(-)
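
A minimal caller-side sketch of the new call, following the in/out semantics
of num spelled out in the kernel-doc added by the patch below (vma,
start_addr, my_pages and nr_pages_to_map are hypothetical names, not part of
the patch):

	unsigned long nr = nr_pages_to_map;
	int err = vm_insert_pages(vma, start_addr, my_pages, &nr);

	if (err) {
		/* nr now holds the number of pages that were *not* mapped;
		 * the first (nr_pages_to_map - nr) pages may already be
		 * mapped and the caller has to account for them.
		 */
		pr_warn("mapped %lu of %lu pages, err %d\n",
			nr_pages_to_map - nr, nr_pages_to_map, err);
	}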

Comments

Andrew Morton Feb. 13, 2020, 2:41 a.m. UTC | #1
On Mon, 27 Jan 2020 18:59:57 -0800 Arjun Roy <arjunroy.kdev@gmail.com> wrote:

> Add the ability to insert multiple pages at once into a user VM with
> fewer PTE spinlock operations.
> 
> The intention of this patch-set is to reduce atomic ops for
> tcp zerocopy receives, which normally hit the same spinlock multiple
> times consecutively.

Seems sensible, thanks.  Some other vm_insert_page() callers might want
to know about this, but I can't immediately spot any which appear to be
high bandwidth.

Is there much point in keeping the vm_insert_page() implementation
around?  Replace it with

static inline int
vm_insert_page(struct vm_area_struct *vma, unsigned long addr,
	       struct page *page)
{
	return vm_insert_pages(vma, addr, &page, 1);
}

?

Also, vm_insert_page() does

	if (!page_count(page))
		return -EINVAL;

and this was not carried over into vm_insert_pages().  How come?

I don't know what that test does - it was added by Linus in the
original commit a145dd411eb28c83.  It's only been 15 years so I'm sure
he remembers ;)
Arjun Roy Feb. 13, 2020, 5:09 p.m. UTC | #2
I think, at least to start, it probably makes sense to keep the regular
vm_insert_page() around - smaller stack usage and fewer branches if you
know you just need one page - though I'm not sure whether gcc would err
towards the smaller binary when compiling.

Regarding the page_count() check - as far as I can tell it's just
checking that at least *someone* holds a reference to the page before
inserting it; in the zerocopy case we most definitely do, but I guess a
bad caller could pass a bad page argument, and this would guard against
that.

Actually, I appear to have fat-fingered it - I intended for this check
to be in there but seem to have forgotten (per the comment "/* Defer
page refcount checking till we're about to map that page. */" but with
no actual check). So that check should go inside
insert_page_in_batch_locked(), right before the
validate_page_before_insert() check. I'll send an updated fixup diff
shortly.

-Arjun
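
A sketch of what that fixup could look like, assuming the check is simply
placed ahead of the validate_page_before_insert() call (this is not the
actual follow-up diff):

static int insert_page_in_batch_locked(struct mm_struct *mm, pmd_t *pmd,
			unsigned long addr, struct page *page, pgprot_t prot)
{
	int err;

	/* Same guard as vm_insert_page(): reject pages that nobody holds
	 * a reference on.
	 */
	if (!page_count(page))
		return -EINVAL;
	err = validate_page_before_insert(page);
	return err ? err : insert_page_into_pte_locked(
		mm, pte_offset_map(pmd, addr), addr, page, prot);
}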

On Wed, Feb 12, 2020 at 6:41 PM Andrew Morton <akpm@linux-foundation.org> wrote:
>
> On Mon, 27 Jan 2020 18:59:57 -0800 Arjun Roy <arjunroy.kdev@gmail.com> wrote:
>
> > Add the ability to insert multiple pages at once into a user VM with
> > fewer PTE spinlock operations.
> >
> > The intention of this patch-set is to reduce atomic ops for
> > tcp zerocopy receives, which normally hit the same spinlock multiple
> > times consecutively.
>
> Seems sensible, thanks.  Some other vm_insert_page() callers might want
> to know about this, but I can't immediately spot any which appear to be
> high bandwidth.
>
> Is there much point in keeping the vm_insert_page() implementation
> around?  Replace it with
>
> static inline int
> vm_insert_page(struct vm_area_struct *vma, unsigned long addr,
>                struct page *page)
> {
>         return vm_insert_pages(vma, addr, &page, 1);
> }
>
> ?
>
> Also, vm_insert_page() does
>
>         if (!page_count(page))
>                 return -EINVAL;
>
> and this was not carried over into vm_insert_pages().  How come?
>
> I don't know what that test does - it was added by Linus in the
> original commit a145dd411eb28c83.  It's only been 15 years so I'm sure
> he remembers ;)
Linus Torvalds Feb. 13, 2020, 9:37 p.m. UTC | #3
On Wed, Feb 12, 2020 at 6:41 PM Andrew Morton <akpm@linux-foundation.org> wrote:
>
> Also, vm_insert_page() does
>
>         if (!page_count(page))
>                 return -EINVAL;
>
> and this was not carried over into vm_insert_pages().  How come?

Sounds like that was just a mistake.

> I don't know what that test does - it was added by Linus in the
> original commit a145dd411eb28c83.  It's only been 15 years so I'm sure
> he remembers ;)

Oh, sure.

No, I have absolutely no memory of the details, but I think the commit
message is actually the big hint: the difference between
vm_insert_page() and some of the more random "insert any pfn" cases we
have is exactly that:

    The page you insert needs to be a nice clean kernel allocation, so you
    can't insert arbitrary page mappings with this, but that's not what
    people want.

thing. The comment above it also kind of hints at it.

We *historically* had interfaces to insert random reserved pages (for
IO mappings, but also the zero page etc), but the whole point of that
vm_insert_page() is that it's now an interface for drivers to insert
the pages they maintain into the page tables.

But that also means that we very much didn't allow just random pages
accessed by doing pfn lookups (that might not be in use at all).

Is "page_count()" a great test? No. But it's at least _a_ test of
that. No reserved pages or other magic need apply.

         Linus
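
As an illustrative sketch of that distinction (vma, addr and err are
hypothetical; the point is only the alloc_page()/page_count() relationship):

	/* The intended caller owns the page it inserts, e.g.: */
	struct page *page = alloc_page(GFP_KERNEL);	/* refcount is 1 here */

	if (!page)
		return -ENOMEM;
	err = vm_insert_page(vma, addr, page);	/* page_count() check passes */

	/* A page found by a raw pfn walk, by contrast, may not be in use at
	 * all (page_count() == 0); the check turns that into -EINVAL rather
	 * than attempting to map it.
	 */
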
Matthew Wilcox Feb. 13, 2020, 9:54 p.m. UTC | #4
On Mon, Jan 27, 2020 at 06:59:57PM -0800, Arjun Roy wrote:
>  int vm_insert_page(struct vm_area_struct *, unsigned long addr, struct page *);
> +int vm_insert_pages(struct vm_area_struct *vma, unsigned long addr,
> +			struct page **pages, unsigned long *num);
>  int vm_insert_pfn(struct vm_area_struct *vma, unsigned long addr,
>  			unsigned long pfn);

Sorry I didn't notice these patches earlier.  I'm not thrilled about
the addition of a new vm_insert_* operation; we're moving towards a
vmf_insert_* API.  There are almost no users left of vm_insert_page
(10, at a quick count).  Once they're all gone, we can switch the
underlying primitives over to a vm_fault_t return type and get rid of the
errno-to-vm-fault translation step that currently goes on.

So ... is this called in the fault path?  Do you have a struct vm_fault
around?  Can you handle a vm_fault_t return value instead of an errno?
Arjun Roy Feb. 13, 2020, 10:06 p.m. UTC | #5
On Thu, Feb 13, 2020 at 1:54 PM Matthew Wilcox <willy@infradead.org> wrote:
>
> On Mon, Jan 27, 2020 at 06:59:57PM -0800, Arjun Roy wrote:
> >  int vm_insert_page(struct vm_area_struct *, unsigned long addr, struct page *);
> > +int vm_insert_pages(struct vm_area_struct *vma, unsigned long addr,
> > +                     struct page **pages, unsigned long *num);
> >  int vm_insert_pfn(struct vm_area_struct *vma, unsigned long addr,
> >                       unsigned long pfn);
>
> Sorry I didn't notice these patches earlier.  I'm not thrilled about
> the addition of a new vm_insert_* operation; we're moving towards a
> vmf_insert_* API.  There are almost no users left of vm_insert_page
> (10, at a quick count).  Once they're all gone, we can switch the
> underlying primitives over to a vm_fault_t return type and get rid of the
> errno-to-vm-fault translation step that currently goes on.
>
> So ... is this called in the fault path?  Do you have a struct vm_fault
> around?  Can you handle a vm_fault_t return value instead of an errno?

This is not a page fault, really. This customer of vm_insert_page() is
the TCP receive zerocopy code, which is remapping pages from the NIC
into the userspace process (in lieu of sys_recvmsg()'s copy). See:
tcp_zerocopy_receive() in net/ipv4/tcp.c .

I took a peek at vmf_insert_page(). I think that hides the presence of
EBUSY, which would be a necessary signal for us. If that was exposed I
think vm_fault_t could be fine, *but* I shall defer to Eric for
actually deciding on it.

-Arjun
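
For context, the errno-to-vm_fault_t translation Willy refers to follows
roughly this pattern (a sketch only, with a hypothetical helper name rather
than a quote of the in-tree code); it is where -EBUSY gets folded away:

static vm_fault_t zc_vmf_insert_page(struct vm_area_struct *vma,
			unsigned long addr, struct page *page)
{
	int err = vm_insert_page(vma, addr, page);

	if (err == -ENOMEM)
		return VM_FAULT_OOM;
	if (err < 0 && err != -EBUSY)
		return VM_FAULT_SIGBUS;
	/* Both success and -EBUSY land here, so the caller cannot tell
	 * whether a PTE was already present at that address.
	 */
	return VM_FAULT_NOPAGE;
}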

Patch

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 29c928ba6b42..a3ac40fbe8fd 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -2563,6 +2563,8 @@  struct vm_area_struct *find_extend_vma(struct mm_struct *, unsigned long addr);
 int remap_pfn_range(struct vm_area_struct *, unsigned long addr,
 			unsigned long pfn, unsigned long size, pgprot_t);
 int vm_insert_page(struct vm_area_struct *, unsigned long addr, struct page *);
+int vm_insert_pages(struct vm_area_struct *vma, unsigned long addr,
+			struct page **pages, unsigned long *num);
 int vm_insert_pfn(struct vm_area_struct *vma, unsigned long addr,
 			unsigned long pfn);
 int vm_insert_pfn_prot(struct vm_area_struct *vma, unsigned long addr,
diff --git a/mm/memory.c b/mm/memory.c
index 7e23a9275386..1655c5feaf32 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1670,8 +1670,7 @@  int zap_vma_ptes(struct vm_area_struct *vma, unsigned long address,
 }
 EXPORT_SYMBOL_GPL(zap_vma_ptes);
 
-pte_t *__get_locked_pte(struct mm_struct *mm, unsigned long addr,
-			spinlock_t **ptl)
+static pmd_t *walk_to_pmd(struct mm_struct *mm, unsigned long addr)
 {
 	pgd_t *pgd;
 	p4d_t *p4d;
@@ -1690,6 +1689,16 @@  pte_t *__get_locked_pte(struct mm_struct *mm, unsigned long addr,
 		return NULL;
 
 	VM_BUG_ON(pmd_trans_huge(*pmd));
+	return pmd;
+}
+
+pte_t *__get_locked_pte(struct mm_struct *mm, unsigned long addr,
+			spinlock_t **ptl)
+{
+	pmd_t *pmd = walk_to_pmd(mm, addr);
+
+	if (!pmd)
+		return NULL;
 	return pte_alloc_map_lock(mm, pmd, addr, ptl);
 }
 
@@ -1714,6 +1723,15 @@  static int insert_page_into_pte_locked(struct mm_struct *mm, pte_t *pte,
 	return 0;
 }
 
+static int insert_page_in_batch_locked(struct mm_struct *mm, pmd_t *pmd,
+			unsigned long addr, struct page *page, pgprot_t prot)
+{
+	const int err = validate_page_before_insert(page);
+
+	return err ? err : insert_page_into_pte_locked(
+		mm, pte_offset_map(pmd, addr), addr, page, prot);
+}
+
 /*
  * This is the old fallback for page remapping.
  *
@@ -1742,6 +1760,95 @@  static int insert_page(struct vm_area_struct *vma, unsigned long addr,
 	return retval;
 }
 
+/* insert_pages() amortizes the cost of spinlock operations
+ * when inserting pages in a loop.
+ */
+static int insert_pages(struct vm_area_struct *vma, unsigned long addr,
+			struct page **pages, unsigned long *num, pgprot_t prot)
+{
+	pmd_t *pmd = NULL;
+	spinlock_t *pte_lock = NULL;
+	struct mm_struct *const mm = vma->vm_mm;
+	unsigned long curr_page_idx = 0;
+	unsigned long remaining_pages_total = *num;
+	unsigned long pages_to_write_in_pmd;
+	int ret;
+more:
+	ret = -EFAULT;
+	pmd = walk_to_pmd(mm, addr);
+	if (!pmd)
+		goto out;
+
+	pages_to_write_in_pmd = min_t(unsigned long,
+		remaining_pages_total, PTRS_PER_PTE - pte_index(addr));
+
+	/* Allocate the PTE if necessary; takes PMD lock once only. */
+	ret = -ENOMEM;
+	if (pte_alloc(mm, pmd, addr))
+		goto out;
+	pte_lock = pte_lockptr(mm, pmd);
+
+	while (pages_to_write_in_pmd) {
+		int pte_idx = 0;
+		const int batch_size = min_t(int, pages_to_write_in_pmd, 8);
+
+		spin_lock(pte_lock);
+		for (; pte_idx < batch_size; ++pte_idx) {
+			int err = insert_page_in_batch_locked(mm, pmd,
+				addr, pages[curr_page_idx], prot);
+			if (unlikely(err)) {
+				spin_unlock(pte_lock);
+				ret = err;
+				remaining_pages_total -= pte_idx;
+				goto out;
+			}
+			addr += PAGE_SIZE;
+			++curr_page_idx;
+		}
+		spin_unlock(pte_lock);
+		pages_to_write_in_pmd -= batch_size;
+		remaining_pages_total -= batch_size;
+	}
+	if (remaining_pages_total)
+		goto more;
+	ret = 0;
+out:
+	*num = remaining_pages_total;
+	return ret;
+}
+
+/**
+ * vm_insert_pages - insert multiple pages into user vma, batching the pmd lock.
+ * @vma: user vma to map to
+ * @addr: target start user address of these pages
+ * @pages: source kernel pages
+ * @num: in: number of pages to map. out: number of pages that were *not*
+ * mapped. (0 means all pages were successfully mapped).
+ *
+ * Preferred over vm_insert_page() when inserting multiple pages.
+ *
+ * In case of error, we may have mapped a subset of the provided
+ * pages. It is the caller's responsibility to account for this case.
+ *
+ * The same restrictions apply as in vm_insert_page().
+ */
+int vm_insert_pages(struct vm_area_struct *vma, unsigned long addr,
+			struct page **pages, unsigned long *num)
+{
+	const unsigned long end_addr = addr + (*num * PAGE_SIZE) - 1;
+
+	if (addr < vma->vm_start || end_addr >= vma->vm_end)
+		return -EFAULT;
+	if (!(vma->vm_flags & VM_MIXEDMAP)) {
+		BUG_ON(down_read_trylock(&vma->vm_mm->mmap_sem));
+		BUG_ON(vma->vm_flags & VM_PFNMAP);
+		vma->vm_flags |= VM_MIXEDMAP;
+	}
+	/* Defer page refcount checking till we're about to map that page. */
+	return insert_pages(vma, addr, pages, num, vma->vm_page_prot);
+}
+EXPORT_SYMBOL(vm_insert_pages);
+
 /**
  * vm_insert_page - insert single page into user vma
  * @vma: user vma to map to