Message ID | 20200128025958.43490-2-arjunroy.kdev@gmail.com (mailing list archive)
---|---
State | New, archived
Series | [resend,mm,net-next,1/3] mm: Refactor insert_page to prepare for batched-lock insert.
On Mon, 27 Jan 2020 18:59:57 -0800 Arjun Roy <arjunroy.kdev@gmail.com> wrote:

> Add the ability to insert multiple pages at once to a user VM with
> lower PTE spinlock operations.
>
> The intention of this patch-set is to reduce atomic ops for
> tcp zerocopy receives, which normally hits the same spinlock multiple
> times consecutively.

Seems sensible, thanks. Some other vm_insert_page() callers might want
to know about this, but I can't immediately spot any which appear to be
high bandwidth.

Is there much point in keeping the vm_insert_page() implementation
around? Replace it with

static inline int
vm_insert_page(struct vm_area_struct *vma, unsigned long addr,
		struct page *page)
{
	return vm_insert_pages(vma, addr, &page, 1);
}

?

Also, vm_insert_page() does

	if (!page_count(page))
		return -EINVAL;

and this was not carried over into vm_insert_pages(). How come?

I don't know what that test does - it was added by Linus in the
original commit a145dd411eb28c83. It's only been 15 years so I'm sure
he remembers ;)
I think at least to start it probably makes sense to keep regular
vm_insert_page() around - smaller stack used, less branches, if you know
you just need one page - not sure if gcc would err towards smaller binary
or not when compiling.

Regarding the page_count() check - as far as I can tell that's just
checking to make sure that at least *someone* has a reference to the page
before inserting it; in the zerocopy case we most definitely do, but I
guess a bad caller could call it with a bad page argument and this would
guard against that.

Actually, I appear to have fat fingered it - I intended for this check to
be in there but seem to have forgotten (per the comment "/* Defer page
refcount checking till we're about to map that page. */" but with no
actual check). So that check should go inside
insert_page_in_batch_locked(), right before the
validate_page_before_insert() check. I'll send an updated fixup diff
shortly.

-Arjun

On Wed, Feb 12, 2020 at 6:41 PM Andrew Morton <akpm@linux-foundation.org> wrote:
>
> On Mon, 27 Jan 2020 18:59:57 -0800 Arjun Roy <arjunroy.kdev@gmail.com> wrote:
>
> > Add the ability to insert multiple pages at once to a user VM with
> > lower PTE spinlock operations.
> >
> > The intention of this patch-set is to reduce atomic ops for
> > tcp zerocopy receives, which normally hits the same spinlock multiple
> > times consecutively.
>
> Seems sensible, thanks. Some other vm_insert_page() callers might want
> to know about this, but I can't immediately spot any which appear to be
> high bandwidth.
>
> Is there much point in keeping the vm_insert_page() implementation
> around? Replace it with
>
> static inline int
> vm_insert_page(struct vm_area_struct *vma, unsigned long addr,
>		struct page *page)
> {
>	return vm_insert_pages(vma, addr, &page, 1);
> }
>
> ?
>
> Also, vm_insert_page() does
>
>	if (!page_count(page))
>		return -EINVAL;
>
> and this was not carried over into vm_insert_pages(). How come?
>
> I don't know what that test does - it was added by Linus in the
> original commit a145dd411eb28c83. It's only been 15 years so I'm sure
> he remembers ;)
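[For illustration, a minimal sketch of what the fixup described above might
look like, reusing the helper shape from the patch at the end of this thread.
This is a sketch only, not the posted fixup diff:]

static int insert_page_in_batch_locked(struct mm_struct *mm, pmd_t *pmd,
			unsigned long addr, struct page *page, pgprot_t prot)
{
	int err;

	/* Sketch: re-add the page_count() sanity check from
	 * vm_insert_page(), right before validate_page_before_insert(),
	 * as described above.
	 */
	if (!page_count(page))
		return -EINVAL;
	err = validate_page_before_insert(page);
	if (err)
		return err;
	return insert_page_into_pte_locked(mm, pte_offset_map(pmd, addr),
					   addr, page, prot);
}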
On Wed, Feb 12, 2020 at 6:41 PM Andrew Morton <akpm@linux-foundation.org> wrote:
>
> Also, vm_insert_page() does
>
>	if (!page_count(page))
>		return -EINVAL;
>
> and this was not carried over into vm_insert_pages(). How come?

Sounds like that was just a mistake.

> I don't know what that test does - it was added by Linus in the
> original commit a145dd411eb28c83. It's only been 15 years so I'm sure
> he remembers ;)

Oh, sure. No, I have absolutely no memory of the details, but I think
the commit message is actually the big hint: the difference between
vm_insert_page() and some of the more random "insert any pfn" cases we
have is exactly that:

    "The page you insert needs to be a nice clean kernel allocation,
     so you can't insert arbitrary page mappings with this, but that's
     not what people want."

thing. The comment above it also kind of hints at it.

We *historically* had interfaces to insert random reserved pages (for
IO mappings, but also the zero page etc), but the whole point of that
vm_insert_page() is that it's now an interface for drivers to insert
the pages they maintain into the page tables.

But that also means that we very much didn't allow just random pages
accessed by doing pfn lookups (that might not be in use at all).

Is "page_count()" a great test? No. But it's at least _a_ test of
that. No reserved pages or other magic need apply.

              Linus
On Mon, Jan 27, 2020 at 06:59:57PM -0800, Arjun Roy wrote:
>  int vm_insert_page(struct vm_area_struct *, unsigned long addr, struct page *);
> +int vm_insert_pages(struct vm_area_struct *vma, unsigned long addr,
> +			struct page **pages, unsigned long *num);
>  int vm_insert_pfn(struct vm_area_struct *vma, unsigned long addr,
>  			unsigned long pfn);

Sorry I didn't notice these patches earlier. I'm not thrilled about
the addition of a new vm_insert_* operation; we're moving towards a
vmf_insert_* API. There are almost no users left of vm_insert_page
(10, at a quick count). Once they're all gone, we can switch the
underlying primitives over to a vm_fault_t return type and get rid of
the errno-to-vm-fault translation step that currently goes on.

So ... is this called in the fault path? Do you have a struct vm_fault
around? Can you handle a vm_fault_t return value instead of an errno?
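[For context, the errno-to-vm-fault translation referred to above is roughly
the following shape - a sketch from memory of what a vmf_insert_*() style
wrapper does around the errno-returning primitive, not a verbatim copy of any
particular tree:]

static vm_fault_t errno_to_vmf(int err)
{
	/* Sketch of the translation step: -ENOMEM and hard errors get
	 * distinct fault codes, while success and -EBUSY ("something is
	 * already mapped here") both collapse into VM_FAULT_NOPAGE.
	 */
	if (err == -ENOMEM)
		return VM_FAULT_OOM;
	if (err < 0 && err != -EBUSY)
		return VM_FAULT_SIGBUS;
	return VM_FAULT_NOPAGE;
}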
On Thu, Feb 13, 2020 at 1:54 PM Matthew Wilcox <willy@infradead.org> wrote:
>
> On Mon, Jan 27, 2020 at 06:59:57PM -0800, Arjun Roy wrote:
> >  int vm_insert_page(struct vm_area_struct *, unsigned long addr, struct page *);
> > +int vm_insert_pages(struct vm_area_struct *vma, unsigned long addr,
> > +			struct page **pages, unsigned long *num);
> >  int vm_insert_pfn(struct vm_area_struct *vma, unsigned long addr,
> >  			unsigned long pfn);
>
> Sorry I didn't notice these patches earlier. I'm not thrilled about
> the addition of a new vm_insert_* operation; we're moving towards a
> vmf_insert_* API. There are almost no users left of vm_insert_page
> (10, at a quick count). Once they're all gone, we can switch the
> underlying primitives over to a vm_fault_t return type and get rid of
> the errno-to-vm-fault translation step that currently goes on.
>
> So ... is this called in the fault path? Do you have a struct vm_fault
> around? Can you handle a vm_fault_t return value instead of an errno?

This is not a page fault, really. This customer of vm_insert_page() is
the TCP receive zerocopy code, which is remapping pages from the NIC
into the userspace process (in lieu of sys_recvmsg()'s copy). See:
tcp_zerocopy_receive() in net/ipv4/tcp.c .

I took a peek at vmf_insert_page(). I think that hides the presence of
EBUSY, which would be a necessary signal for us. If that was exposed I
think vm_fault_t could be fine, *but* I shall defer to Eric for
actually deciding on it.

-Arjun
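[Illustrative sketch only, not taken from the TCP code and with an invented
function name: a caller outside the fault path that needs the raw errno so it
can react differently to -EBUSY than to a hard failure - the distinction that
the vm_fault_t conversion sketched earlier folds away:]

static int try_insert_page(struct vm_area_struct *vma, unsigned long addr,
			   struct page *page)
{
	int err = vm_insert_page(vma, addr, page);

	if (err == -EBUSY) {
		/* Something is already mapped at this address; the
		 * caller may want to treat this as "skip", not "fail".
		 */
		return 1;
	}
	return err;	/* 0 on success, negative errno otherwise */
}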
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 29c928ba6b42..a3ac40fbe8fd 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -2563,6 +2563,8 @@ struct vm_area_struct *find_extend_vma(struct mm_struct *, unsigned long addr);
 int remap_pfn_range(struct vm_area_struct *, unsigned long addr,
 			unsigned long pfn, unsigned long size, pgprot_t);
 int vm_insert_page(struct vm_area_struct *, unsigned long addr, struct page *);
+int vm_insert_pages(struct vm_area_struct *vma, unsigned long addr,
+			struct page **pages, unsigned long *num);
 int vm_insert_pfn(struct vm_area_struct *vma, unsigned long addr,
 			unsigned long pfn);
 int vm_insert_pfn_prot(struct vm_area_struct *vma, unsigned long addr,
diff --git a/mm/memory.c b/mm/memory.c
index 7e23a9275386..1655c5feaf32 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1670,8 +1670,7 @@ int zap_vma_ptes(struct vm_area_struct *vma, unsigned long address,
 }
 EXPORT_SYMBOL_GPL(zap_vma_ptes);
 
-pte_t *__get_locked_pte(struct mm_struct *mm, unsigned long addr,
-			spinlock_t **ptl)
+static pmd_t *walk_to_pmd(struct mm_struct *mm, unsigned long addr)
 {
 	pgd_t *pgd;
 	p4d_t *p4d;
@@ -1690,6 +1689,16 @@ pte_t *__get_locked_pte(struct mm_struct *mm, unsigned long addr,
 		return NULL;
 
 	VM_BUG_ON(pmd_trans_huge(*pmd));
+	return pmd;
+}
+
+pte_t *__get_locked_pte(struct mm_struct *mm, unsigned long addr,
+			spinlock_t **ptl)
+{
+	pmd_t *pmd = walk_to_pmd(mm, addr);
+
+	if (!pmd)
+		return NULL;
 	return pte_alloc_map_lock(mm, pmd, addr, ptl);
 }
 
@@ -1714,6 +1723,15 @@ static int insert_page_into_pte_locked(struct mm_struct *mm, pte_t *pte,
 	return 0;
 }
 
+static int insert_page_in_batch_locked(struct mm_struct *mm, pmd_t *pmd,
+			unsigned long addr, struct page *page, pgprot_t prot)
+{
+	const int err = validate_page_before_insert(page);
+
+	return err ? err : insert_page_into_pte_locked(
+		mm, pte_offset_map(pmd, addr), addr, page, prot);
+}
+
 /*
  * This is the old fallback for page remapping.
  *
@@ -1742,6 +1760,95 @@ static int insert_page(struct vm_area_struct *vma, unsigned long addr,
 	return retval;
 }
 
+/* insert_pages() amortizes the cost of spinlock operations
+ * when inserting pages in a loop.
+ */
+static int insert_pages(struct vm_area_struct *vma, unsigned long addr,
+			struct page **pages, unsigned long *num, pgprot_t prot)
+{
+	pmd_t *pmd = NULL;
+	spinlock_t *pte_lock = NULL;
+	struct mm_struct *const mm = vma->vm_mm;
+	unsigned long curr_page_idx = 0;
+	unsigned long remaining_pages_total = *num;
+	unsigned long pages_to_write_in_pmd;
+	int ret;
+more:
+	ret = -EFAULT;
+	pmd = walk_to_pmd(mm, addr);
+	if (!pmd)
+		goto out;
+
+	pages_to_write_in_pmd = min_t(unsigned long,
+		remaining_pages_total, PTRS_PER_PTE - pte_index(addr));
+
+	/* Allocate the PTE if necessary; takes PMD lock once only. */
+	ret = -ENOMEM;
+	if (pte_alloc(mm, pmd, addr))
+		goto out;
+	pte_lock = pte_lockptr(mm, pmd);
+
+	while (pages_to_write_in_pmd) {
+		int pte_idx = 0;
+		const int batch_size = min_t(int, pages_to_write_in_pmd, 8);
+
+		spin_lock(pte_lock);
+		for (; pte_idx < batch_size; ++pte_idx) {
+			int err = insert_page_in_batch_locked(mm, pmd,
+				addr, pages[curr_page_idx], prot);
+			if (unlikely(err)) {
+				spin_unlock(pte_lock);
+				ret = err;
+				remaining_pages_total -= pte_idx;
+				goto out;
+			}
+			addr += PAGE_SIZE;
+			++curr_page_idx;
+		}
+		spin_unlock(pte_lock);
+		pages_to_write_in_pmd -= batch_size;
+		remaining_pages_total -= batch_size;
+	}
+	if (remaining_pages_total)
+		goto more;
+	ret = 0;
+out:
+	*num = remaining_pages_total;
+	return ret;
+}
+
+/**
+ * vm_insert_pages - insert multiple pages into user vma, batching the pmd lock.
+ * @vma: user vma to map to
+ * @addr: target start user address of these pages
+ * @pages: source kernel pages
+ * @num: in: number of pages to map. out: number of pages that were *not*
+ * mapped. (0 means all pages were successfully mapped).
+ *
+ * Preferred over vm_insert_page() when inserting multiple pages.
+ *
+ * In case of error, we may have mapped a subset of the provided
+ * pages. It is the caller's responsibility to account for this case.
+ *
+ * The same restrictions apply as in vm_insert_page().
+ */
+int vm_insert_pages(struct vm_area_struct *vma, unsigned long addr,
+			struct page **pages, unsigned long *num)
+{
+	const unsigned long end_addr = addr + (*num * PAGE_SIZE) - 1;
+
+	if (addr < vma->vm_start || end_addr >= vma->vm_end)
+		return -EFAULT;
+	if (!(vma->vm_flags & VM_MIXEDMAP)) {
+		BUG_ON(down_read_trylock(&vma->vm_mm->mmap_sem));
+		BUG_ON(vma->vm_flags & VM_PFNMAP);
+		vma->vm_flags |= VM_MIXEDMAP;
+	}
+	/* Defer page refcount checking till we're about to map that page. */
+	return insert_pages(vma, addr, pages, num, vma->vm_page_prot);
+}
+EXPORT_SYMBOL(vm_insert_pages);
+
 /**
  * vm_insert_page - insert single page into user vma
  * @vma: user vma to map to
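[A hypothetical caller-side sketch - the function name is invented for
illustration - showing the in/out contract of @num in the new interface: on
return it holds the number of pages that were *not* mapped, so a partial
failure can be accounted for by the caller:]

static int map_page_batch(struct vm_area_struct *vma, unsigned long addr,
			  struct page **pages, unsigned long nr_pages)
{
	unsigned long not_mapped = nr_pages;	/* in: pages to map */
	int err = vm_insert_pages(vma, addr, pages, &not_mapped);

	if (err) {
		/* out: not_mapped pages remain; the first
		 * nr_pages - not_mapped pages were already mapped and
		 * must be accounted for (e.g. unmapped or kept) by the
		 * caller.
		 */
		pr_debug("mapped %lu of %lu pages before error %d\n",
			 nr_pages - not_mapped, nr_pages, err);
	}
	return err;
}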