Message ID | 20230113163538.23412-1-fengwei.yin@intel.com (mailing list archive)
---|---
State | New
Series | [RFC] mm: populate multiple PTEs if file page is large folio
On Sat, Jan 14, 2023 at 12:35:38AM +0800, Yin Fengwei wrote:
> The page fault number can be reduced by batched PTEs population.
> The batch size of PTEs population is not allowed to cross:
> - page table boundaries
> - vma range
> - large folio size
> - fault_around_bytes

I find this patch very interesting.  But is it really worth it?  Most
file-backed page faults are resolved through the ->map_pages() path
which is almost always filemap_map_pages(), which does something
fairly similar to this already.  Do you have any performance numbers?
On 1/14/2023 2:13 AM, Matthew Wilcox wrote:
> I find this patch very interesting.  But is it really worth it?  Most
> file-backed page faults are resolved through the ->map_pages() path
> which is almost always filemap_map_pages(), which does something
> fairly similar to this already.  Do you have any performance numbers?

Yes. The real benefit this patch brings could be very minor. I will try
to get some performance data against this patch. Thanks.

Regards
Yin, Fengwei
On 1/14/2023 2:13 AM, Matthew Wilcox wrote:
> On Sat, Jan 14, 2023 at 12:35:38AM +0800, Yin Fengwei wrote:
>> The page fault number can be reduced by batched PTEs population.
>> The batch size of PTEs population is not allowed to cross:
>> - page table boundaries
>> - vma range
>> - large folio size
>> - fault_around_bytes
>
> I find this patch very interesting.  But is it really worth it?  Most
> file-backed page faults are resolved through the ->map_pages() path
> which is almost always filemap_map_pages(), which does something
> fairly similar to this already.  Do you have any performance numbers?
>
I tried the will-it-scale page_fault3 test:
https://github.com/antonblanchard/will-it-scale/blob/master/tests/page_fault3.c
with 96 processes on a test box with 48C/86T.

The test result got about 3.75X better, with a 4.1X lower page fault
count, with this patch.

But it's a micro-benchmark which shows an extremely friendly case for
this patch.

I didn't see an observable performance gain with other workloads. I
suppose shared file write operations may not be common operations?
Thanks.

Regards
Yin, Fengwei
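[Editorial note: the page_fault3 workload referenced above can be sketched in
user space roughly as below — mmap a shared file and dirty every page, so each
write can take a shared-file page fault. This is a reduced, single-iteration,
single-process illustration; the file path, size, and function name are
illustrative, not taken from the benchmark.]

#include <stdlib.h>
#include <sys/mman.h>
#include <unistd.h>

#define MEMSIZE (128 * 4096UL) /* 128 pages; the real test uses far more */

int touch_shared_file_pages(void)
{
	char path[] = "/tmp/page_fault3_XXXXXX";
	int fd = mkstemp(path);

	if (fd < 0)
		return -1;
	unlink(path);
	if (ftruncate(fd, MEMSIZE) < 0) {
		close(fd);
		return -1;
	}

	char *c = mmap(NULL, MEMSIZE, PROT_READ | PROT_WRITE,
		       MAP_SHARED, fd, 0);
	if (c == MAP_FAILED) {
		close(fd);
		return -1;
	}

	/* Each first write to a page takes a fault; batched PTE population
	 * would resolve several neighbouring PTEs in one fault. */
	for (unsigned long i = 0; i < MEMSIZE; i += getpagesize())
		c[i] = i & 0xff;

	munmap(c, MEMSIZE);
	close(fd);
	return 0;
}

Every process in the real benchmark runs this loop repeatedly, which is why
the fault-heavy shared-write pattern is the extreme-friendly case for the
patch.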
On 17.01.23 10:19, Yin, Fengwei wrote:
> On 1/14/2023 2:13 AM, Matthew Wilcox wrote:
>> On Sat, Jan 14, 2023 at 12:35:38AM +0800, Yin Fengwei wrote:
>>> The page fault number can be reduced by batched PTEs population.
>>> The batch size of PTEs population is not allowed to cross:
>>> - page table boundaries
>>> - vma range
>>> - large folio size
>>> - fault_around_bytes
>>
>> I find this patch very interesting.  But is it really worth it?  Most
>> file-backed page faults are resolved through the ->map_pages() path
>> which is almost always filemap_map_pages(), which does something
>> fairly similar to this already.  Do you have any performance numbers?
>>
> I tried the will-it-scale page_fault3:
> https://github.com/antonblanchard/will-it-scale/blob/master/tests/page_fault3.c
> with 96 processes on a test box with 48C/86T.
>
> The test result got about 3.75X better with 4.1X less page fault number
> with this patch.
>
> But It's a micro benchmark which shows extreme friendly case to this patch.
>
> I didn't see observed performance gain with other workloads. I suppose
> shared file write operations may not be common operations? Thanks.

One question I have after reading "which does something fairly similar
to this already" is whether both paths could be unified.
On Tue, Jan 17, 2023 at 11:37:05AM +0100, David Hildenbrand wrote:
> On 17.01.23 10:19, Yin, Fengwei wrote:
> > On 1/14/2023 2:13 AM, Matthew Wilcox wrote:
> > > On Sat, Jan 14, 2023 at 12:35:38AM +0800, Yin Fengwei wrote:
> > > > The page fault number can be reduced by batched PTEs population.
> > > > The batch size of PTEs population is not allowed to cross:
> > > > - page table boundaries
> > > > - vma range
> > > > - large folio size
> > > > - fault_around_bytes
> > >
> > > I find this patch very interesting.  But is it really worth it?  Most
> > > file-backed page faults are resolved through the ->map_pages() path
> > > which is almost always filemap_map_pages(), which does something
> > > fairly similar to this already.  Do you have any performance numbers?
> > >
> > I tried the will-it-scale page_fault3:
> > https://github.com/antonblanchard/will-it-scale/blob/master/tests/page_fault3.c
> > with 96 processes on a test box with 48C/86T.
> >
> > The test result got about 3.75X better with 4.1X less page fault number
> > with this patch.
> >
> > But It's a micro benchmark which shows extreme friendly case to this patch.
> >
> > I didn't see observed performance gain with other workloads. I suppose
> > shared file write operations may not be common operations? Thanks.
>
> One question I have after reading "which does something fairly similar to
> this already", if both paths could be unified.

I've been thinking about this already; not so much in terms of "unifying
these two implementations" but rather "What's the right API for mapping
the parts of a folio that fit into userspace in response to a fault".
I haven't got quite as far as drafting code, but I'm thinking there should
be an API where we pass in the vmf, a folio and some other information,
and that function takes care of mapping however many pages from this
folio that it can.

And the reason to split it up like that is to batch as many page
operations into this folio as possible.  eg filemap_map_pages() was
written "to get it working", not "to be efficient", so it does stupid
things like call folio_ref_inc() every time it maps a page instead of
counting how many pages it maps and calling folio_ref_add() at the end.
Similar optimisations should be done for mapcount, which implies some
kind of batched equivalent of page_add_*_rmap().
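[Editorial note: the refcount batching described above can be illustrated in
user space as below. An atomic_int stands in for the folio refcount; the
function names are illustrative, not kernel API. Both variants end with the
same count, but the batched one issues a single atomic read-modify-write for
the whole folio instead of one per page.]

#include <stdatomic.h>

/* one atomic RMW per mapped page: the folio_ref_inc() pattern */
void ref_per_page(atomic_int *refcount, int nr_pages)
{
	for (int i = 0; i < nr_pages; i++)
		atomic_fetch_add(refcount, 1);
}

/* batched: count locally, one atomic RMW at the end, the
 * folio_ref_add() pattern */
void ref_batched(atomic_int *refcount, int nr_pages)
{
	int mapped = 0;

	for (int i = 0; i < nr_pages; i++)
		mapped++;	/* map page i, bump only the local count */
	atomic_fetch_add(refcount, mapped);
}

On a contended cache line (96 processes faulting on the same file, as in the
benchmark above), collapsing the per-page atomics into one is where the win
comes from.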
On 1/17/2023 6:37 PM, David Hildenbrand wrote:
> On 17.01.23 10:19, Yin, Fengwei wrote:
>> On 1/14/2023 2:13 AM, Matthew Wilcox wrote:
>>> On Sat, Jan 14, 2023 at 12:35:38AM +0800, Yin Fengwei wrote:
>>>> The page fault number can be reduced by batched PTEs population.
>>>> The batch size of PTEs population is not allowed to cross:
>>>> - page table boundaries
>>>> - vma range
>>>> - large folio size
>>>> - fault_around_bytes
>>>
>>> I find this patch very interesting.  But is it really worth it?  Most
>>> file-backed page faults are resolved through the ->map_pages() path
>>> which is almost always filemap_map_pages(), which does something
>>> fairly similar to this already.  Do you have any performance numbers?
>>>
>> I tried the will-it-scale page_fault3:
>> https://github.com/antonblanchard/will-it-scale/blob/master/tests/page_fault3.c
>> with 96 processes on a test box with 48C/86T.
>>
>> The test result got about 3.75X better with 4.1X less page fault number
>> with this patch.
>>
>> But It's a micro benchmark which shows extreme friendly case to this patch.
>>
>> I didn't see observed performance gain with other workloads. I suppose
>> shared file write operations may not be common operations? Thanks.
>
> One question I have after reading "which does something fairly similar to
> this already", if both paths could be unified.

Thanks for the suggestion. I will see what I can do for it.

Regards
Yin, Fengwei
On 1/17/2023 10:46 PM, Matthew Wilcox wrote:
> On Tue, Jan 17, 2023 at 11:37:05AM +0100, David Hildenbrand wrote:
>> On 17.01.23 10:19, Yin, Fengwei wrote:
>>> On 1/14/2023 2:13 AM, Matthew Wilcox wrote:
>>>> On Sat, Jan 14, 2023 at 12:35:38AM +0800, Yin Fengwei wrote:
>>>>> The page fault number can be reduced by batched PTEs population.
>>>>> The batch size of PTEs population is not allowed to cross:
>>>>> - page table boundaries
>>>>> - vma range
>>>>> - large folio size
>>>>> - fault_around_bytes
>>>>
>>>> I find this patch very interesting.  But is it really worth it?  Most
>>>> file-backed page faults are resolved through the ->map_pages() path
>>>> which is almost always filemap_map_pages(), which does something
>>>> fairly similar to this already.  Do you have any performance numbers?
>>>>
>>> I tried the will-it-scale page_fault3:
>>> https://github.com/antonblanchard/will-it-scale/blob/master/tests/page_fault3.c
>>> with 96 processes on a test box with 48C/86T.
>>>
>>> The test result got about 3.75X better with 4.1X less page fault number
>>> with this patch.
>>>
>>> But It's a micro benchmark which shows extreme friendly case to this patch.
>>>
>>> I didn't see observed performance gain with other workloads. I suppose
>>> shared file write operations may not be common operations? Thanks.
>>
>> One question I have after reading "which does something fairly similar to
>> this already", if both paths could be unified.
>
> I've been thinking about this already; not so much in terms of "unifying
> these two implementations" but rather "What's the right API for mapping
> the parts of a folio that fit into userspace in response to a fault".
> I haven't got quite as far as drafting code, but I'm thinking there should
> be an API where we pass in the vmf, a folio and some other information,
> and that function takes care of mapping however many pages from this
> folio that it can.

That would be lovely.

> And the reason to split it up like that is to batch as many page
> operations into this folio as possible.  eg filemap_map_pages() was
> written "to get it working", not "to be efficient", so it does stupid
> things like call folio_ref_inc() every time it maps a page instead of
> counting how many pages it maps and calling folio_ref_add() at the end.
> Similar optimisations should be done for mapcount, which implies some
> kind of batched equivalent of page_add_*_rmap().

I tried to do a batched mapcount when I worked on this patch. But
page_add_file_rmap() only supports compound PMD_SIZE folios (maybe it is
time to remove that limitation?).

Regards
Yin, Fengwei
On 1/17/2023 10:46 PM, Matthew Wilcox wrote:
> On Tue, Jan 17, 2023 at 11:37:05AM +0100, David Hildenbrand wrote:
>> On 17.01.23 10:19, Yin, Fengwei wrote:
>>> On 1/14/2023 2:13 AM, Matthew Wilcox wrote:
>>>> On Sat, Jan 14, 2023 at 12:35:38AM +0800, Yin Fengwei wrote:
>>>>> The page fault number can be reduced by batched PTEs population.
>>>>> The batch size of PTEs population is not allowed to cross:
>>>>> - page table boundaries
>>>>> - vma range
>>>>> - large folio size
>>>>> - fault_around_bytes
>>>>
>>>> I find this patch very interesting.  But is it really worth it?  Most
>>>> file-backed page faults are resolved through the ->map_pages() path
>>>> which is almost always filemap_map_pages(), which does something
>>>> fairly similar to this already.  Do you have any performance numbers?
>>>>
>>> I tried the will-it-scale page_fault3:
>>> https://github.com/antonblanchard/will-it-scale/blob/master/tests/page_fault3.c
>>> with 96 processes on a test box with 48C/86T.
>>>
>>> The test result got about 3.75X better with 4.1X less page fault number
>>> with this patch.
>>>
>>> But It's a micro benchmark which shows extreme friendly case to this patch.
>>>
>>> I didn't see observed performance gain with other workloads. I suppose
>>> shared file write operations may not be common operations? Thanks.
>>
>> One question I have after reading "which does something fairly similar to
>> this already", if both paths could be unified.
>
> I've been thinking about this already; not so much in terms of "unifying
> these two implementations" but rather "What's the right API for mapping
> the parts of a folio that fit into userspace in response to a fault".
> I haven't got quite as far as drafting code, but I'm thinking there should
> be an API where we pass in the vmf, a folio and some other information,
> and that function takes care of mapping however many pages from this
> folio that it can.
>
> And the reason to split it up like that is to batch as many page
> operations into this folio as possible.  eg filemap_map_pages() was
> written "to get it working", not "to be efficient", so it does stupid
> things like call folio_ref_inc() every time it maps a page instead of
> counting how many pages it maps and calling folio_ref_add() at the end.
> Similar optimisations should be done for mapcount, which implies some
> kind of batched equivalent of page_add_*_rmap().

I did the following experiment (just boot testing in QEMU). I think
another gain is that there is no folio_trylock()/folio_unlock() for each
sub-page of a large folio; we only need to trylock/unlock the entire
folio once.

Batching the mapcount may need more changes, like checking in advance
that there is no !pte_none case and that the entire large folio is to be
mapped.

diff --git a/mm/filemap.c b/mm/filemap.c
index c4d4ace9cc70..a79fd4766d2f 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -3353,6 +3353,41 @@ static inline struct folio *next_map_page(struct address_space *mapping,
 					  mapping, xas, end_pgoff);
 }
 
+static vm_fault_t __filemap_map_folio(struct vm_fault *vmf,
+		struct folio *folio, unsigned long addr, pgoff_t idx, int count)
+{
+	struct vm_area_struct *vma = vmf->vma;
+	int i = 0, nr_pages = folio_nr_pages(folio);
+	int len = ((count + idx) > nr_pages) ? (nr_pages - idx) : count;
+	int ref_count = 0;
+	vm_fault_t ret = 0;
+
+	do {
+		struct page *page;
+
+		page = folio_page(folio, i + idx);
+		if (PageHWPoison(page))
+			continue;
+
+		if (!pte_none(*vmf->pte))
+			continue;
+
+		if (vmf->address == addr)
+			ret = VM_FAULT_NOPAGE;
+
+		ref_count++;
+
+		do_set_pte(vmf, page, addr);
+		update_mmu_cache(vma, addr, vmf->pte);
+	} while (vmf->pte++, addr += PAGE_SIZE, ++i < len);
+
+	vmf->pte -= len;
+	folio_ref_add(folio, ref_count);
+
+	return ret;
+}
+
 vm_fault_t filemap_map_pages(struct vm_fault *vmf,
 			     pgoff_t start_pgoff, pgoff_t end_pgoff)
 {
@@ -3363,9 +3398,9 @@ vm_fault_t filemap_map_pages(struct vm_fault *vmf,
 	unsigned long addr;
 	XA_STATE(xas, &mapping->i_pages, start_pgoff);
 	struct folio *folio;
-	struct page *page;
 	unsigned int mmap_miss = READ_ONCE(file->f_ra.mmap_miss);
 	vm_fault_t ret = 0;
+	int idx;
 
 	rcu_read_lock();
 	folio = first_map_page(mapping, &xas, end_pgoff);
@@ -3380,45 +3415,21 @@ vm_fault_t filemap_map_pages(struct vm_fault *vmf,
 	addr = vma->vm_start + ((start_pgoff - vma->vm_pgoff) << PAGE_SHIFT);
 	vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd, addr, &vmf->ptl);
 	do {
-again:
-		page = folio_file_page(folio, xas.xa_index);
-		if (PageHWPoison(page))
-			goto unlock;
-
-		if (mmap_miss > 0)
-			mmap_miss--;
+		int count;
 
+		idx = xas.xa_index & (folio_nr_pages(folio) - 1);
 		addr += (xas.xa_index - last_pgoff) << PAGE_SHIFT;
 		vmf->pte += xas.xa_index - last_pgoff;
 		last_pgoff = xas.xa_index;
 
-		/*
-		 * NOTE: If there're PTE markers, we'll leave them to be
-		 * handled in the specific fault path, and it'll prohibit the
-		 * fault-around logic.
-		 */
-		if (!pte_none(*vmf->pte))
-			goto unlock;
+		count = end_pgoff - last_pgoff + 1;
 
-		/* We're about to handle the fault */
-		if (vmf->address == addr)
+		if (VM_FAULT_NOPAGE ==
+			__filemap_map_folio(vmf, folio, addr, idx, count))
 			ret = VM_FAULT_NOPAGE;
 
-		do_set_pte(vmf, page, addr);
-		/* no need to invalidate: a not-present page won't be cached */
-		update_mmu_cache(vma, addr, vmf->pte);
-		if (folio_more_pages(folio, xas.xa_index, end_pgoff)) {
-			xas.xa_index++;
-			folio_ref_inc(folio);
-			goto again;
-		}
-		folio_unlock(folio);
-		continue;
-unlock:
-		if (folio_more_pages(folio, xas.xa_index, end_pgoff)) {
-			xas.xa_index++;
-			goto again;
-		}
+		xas.xa_index += folio_nr_pages(folio) - idx;
+
+		folio_unlock(folio);
 		folio_put(folio);
 	} while ((folio = next_map_page(mapping, &xas, end_pgoff)) != NULL);

NOTE: the PoC code doesn't cover file->f_ra.mmap_miss.

Thanks.

Regards
Yin, Fengwei
diff --git a/mm/memory.c b/mm/memory.c
index 56b571c83a0e..755e6e590481 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -104,6 +104,10 @@ EXPORT_SYMBOL(mem_map);
 #endif
 
 static vm_fault_t do_fault(struct vm_fault *vmf);
+static inline bool allowed_batched_set_ptes(struct vm_fault *vmf,
+		struct page *page);
+static void do_set_multi_ptes(struct vm_fault *vmf, struct page *page,
+		unsigned long addr);
 
 /*
  * A number of key systems in x86 including ioremap() rely on the assumption
@@ -4359,10 +4363,16 @@ vm_fault_t finish_fault(struct vm_fault *vmf)
 
 	/* Re-check under ptl */
 	if (likely(!vmf_pte_changed(vmf))) {
-		do_set_pte(vmf, page, vmf->address);
+		if (allowed_batched_set_ptes(vmf, page))
+			do_set_multi_ptes(vmf, page, vmf->address);
+		else {
+			do_set_pte(vmf, page, vmf->address);
 
-		/* no need to invalidate: a not-present page won't be cached */
-		update_mmu_cache(vma, vmf->address, vmf->pte);
+			/* no need to invalidate: a not-present page
+			 * won't be cached
+			 */
+			update_mmu_cache(vma, vmf->address, vmf->pte);
+		}
 
 		ret = 0;
 	} else {
@@ -4476,6 +4486,92 @@ static inline bool should_fault_around(struct vm_fault *vmf)
 	return fault_around_bytes >> PAGE_SHIFT > 1;
 }
 
+/* Return true if we should do fault-around for file fault, false otherwise */
+static inline bool allowed_batched_set_ptes(struct vm_fault *vmf,
+		struct page *page)
+{
+	struct folio *folio = page_folio(page);
+
+	if (uffd_disable_fault_around(vmf->vma))
+		return false;
+
+	if (!folio_test_large(folio))
+		return false;
+
+	/* TODO: Will revise after anon mapping support folio */
+	if ((vmf->flags & FAULT_FLAG_WRITE) &&
+			!(vmf->vma->vm_flags & VM_SHARED))
+		return false;
+
+	return fault_around_bytes >> PAGE_SHIFT > 1;
+}
+
+static void do_set_multi_ptes(struct vm_fault *vmf, struct page *pg,
+		unsigned long addr)
+{
+	struct folio *folio = page_folio(pg);
+	struct vm_area_struct *vma = vmf->vma;
+	unsigned long size, mask, start, end, folio_start, folio_end;
+	int dist, first_idx, i = 0;
+	pte_t *pte;
+
+	/* in page table range */
+	start = ALIGN_DOWN(addr, PMD_SIZE);
+	end = ALIGN(addr, PMD_SIZE);
+
+	/* in fault_around_bytes range */
+	size = READ_ONCE(fault_around_bytes);
+	mask = ~(size - 1) & PAGE_MASK;
+
+	/* in vma range */
+	start = max3(start, (addr & mask), vma->vm_start);
+	end = min3(end, (addr & mask) + size, vma->vm_end);
+
+	/*
+	 * folio is locked and referenced. It will not be split or
+	 * removed from page cache in this function.
+	 */
+	folio_start = addr - (folio_page_idx(folio, pg) << PAGE_SHIFT);
+	folio_end = folio_start + (folio_nr_pages(folio) << PAGE_SHIFT);
+
+	/* in folio size range */
+	start = max(start, folio_start);
+	end = min(end, folio_end);
+
+	dist = (addr - start) >> PAGE_SHIFT;
+	first_idx = folio_page_idx(folio, pg) - dist;
+	pte = vmf->pte - dist;
+
+	do {
+		struct page *page = folio_page(folio, first_idx + i);
+		bool write = vmf->flags & FAULT_FLAG_WRITE;
+		bool prefault = page != pg;
+		pte_t entry;
+
+		if (!pte_none(*pte))
+			continue;
+
+		flush_icache_page(vma, page);
+		entry = mk_pte(page, vma->vm_page_prot);
+
+		if (prefault)
+			folio_get(folio);
+
+		if (prefault && arch_wants_old_prefaulted_pte())
+			entry = pte_mkold(entry);
+		else
+			entry = pte_sw_mkyoung(entry);
+
+		if (write)
+			entry = maybe_mkwrite(pte_mkdirty(entry), vma);
+
+		inc_mm_counter(vma->vm_mm, mm_counter_file(&folio->page));
+		page_add_file_rmap(page, vma, false);
+
+		set_pte_at(vma->vm_mm, start, pte, entry);
+		update_mmu_cache(vma, start, pte);
+	} while (pte++, start += PAGE_SIZE, i++, start < end);
+}
+
 static vm_fault_t do_read_fault(struct vm_fault *vmf)
 {
 	vm_fault_t ret = 0;
The page fault number can be reduced by batched PTEs population. The
batch size of PTEs population is not allowed to cross:
- page table boundaries
- vma range
- large folio size
- fault_around_bytes

fault_around_bytes allows the user to control the batch size if they
have the intention to do so.

Signed-off-by: Yin Fengwei <fengwei.yin@intel.com>
---
* based on next-20230112

 mm/memory.c | 102 ++++++++++++++++++++++++++++++++++++++++++++++++++--
 1 file changed, 99 insertions(+), 3 deletions(-)
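[Editorial note: the four clamping constraints listed in the changelog can be
sketched in user space as below. All constants, macro definitions, and the
helper name batch_window() are illustrative; the kernel patch performs the
equivalent computation with max3()/min3() inside do_set_multi_ptes().]

#define PAGE_SIZE	4096UL
#define PMD_SIZE	(512 * PAGE_SIZE)	/* x86-64: 2MB */

#define ALIGN_DOWN(x, a)	((x) & ~((a) - 1))
#define ALIGN_UP(x, a)		ALIGN_DOWN((x) + (a) - 1, (a))
#define MAX(a, b)		((a) > (b) ? (a) : (b))
#define MIN(a, b)		((a) < (b) ? (a) : (b))

/* Returns the batch length in bytes; *startp receives the clamped start. */
unsigned long batch_window(unsigned long addr,
			   unsigned long fault_around_bytes,
			   unsigned long vma_start, unsigned long vma_end,
			   unsigned long folio_start, unsigned long folio_end,
			   unsigned long *startp)
{
	/* 1. stay inside one page table (PMD) */
	unsigned long start = ALIGN_DOWN(addr, PMD_SIZE);
	unsigned long end = ALIGN_UP(addr + 1, PMD_SIZE);

	/* 2. stay inside the fault_around_bytes window (a power of two)
	 * and 3. inside the VMA */
	unsigned long mask = ~(fault_around_bytes - 1);

	start = MAX(start, MAX(addr & mask, vma_start));
	end = MIN(end, MIN((addr & mask) + fault_around_bytes, vma_end));

	/* 4. stay inside the large folio */
	start = MAX(start, folio_start);
	end = MIN(end, folio_end);

	*startp = start;
	return end - start;
}

For example, with a fault at 0x201000, fault_around_bytes of 64KB, a VMA of
[0x1F0000, 0x400000), and a 64-page folio mapped at 0x200000, the window
clamps to [0x200000, 0x210000): 16 pages populated for one fault.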