[RFC] mm: populate multiple PTEs if file page is large folio

Message ID 20230113163538.23412-1-fengwei.yin@intel.com (mailing list archive)
State New

Commit Message

Yin, Fengwei Jan. 13, 2023, 4:35 p.m. UTC
The page fault number can be reduced by batched PTEs population.
The batch size of PTEs population is not allowed to cross:
  - page table boundaries
  - vma range
  - large folio size
  - fault_around_bytes

fault_around_bytes allows the user to control the batch size if they
want to do so.
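
For illustration, the resulting window is clamped roughly like this (a
condensed sketch of the do_set_multi_ptes() logic in the patch below;
folio_start/folio_end stand for the virtual address range covered by the
large folio):

	unsigned long size, mask, start, end;

	/* stay within one page table */
	start = ALIGN_DOWN(addr, PMD_SIZE);
	end = ALIGN(addr, PMD_SIZE);

	/* stay within the fault_around_bytes window and the vma */
	size = READ_ONCE(fault_around_bytes);
	mask = ~(size - 1) & PAGE_MASK;
	start = max3(start, addr & mask, vma->vm_start);
	end = min3(end, (addr & mask) + size, vma->vm_end);

	/* stay within the large folio */
	start = max(start, folio_start);
	end = min(end, folio_end);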

Signed-off-by: Yin Fengwei <fengwei.yin@intel.com>
---
* based on next-20230112

 mm/memory.c | 102 ++++++++++++++++++++++++++++++++++++++++++++++++++--
 1 file changed, 99 insertions(+), 3 deletions(-)

Comments

Matthew Wilcox Jan. 13, 2023, 6:13 p.m. UTC | #1
On Sat, Jan 14, 2023 at 12:35:38AM +0800, Yin Fengwei wrote:
> The page fault number can be reduced by batched PTEs population.
> The batch size of PTEs population is not allowed to cross:
>   - page table boundaries
>   - vma range
>   - large folio size
>   - fault_around_bytes

I find this patch very interesting.  But is it really worth it?  Most
file-backed page faults are resolved through the ->map_pages() path
which is almost always filemap_map_pages(), which does something
fairly similar to this already.  Do you have any performance numbers?
Yin, Fengwei Jan. 14, 2023, 12:58 a.m. UTC | #2
On 1/14/2023 2:13 AM, Matthew Wilcox wrote:
> I find this patch very interesting.  But is it really worth it?  Most
> file-backed page faults are resolved through the ->map_pages() path
> which is almost always filemap_map_pages(), which does something
> fairly similar to this already.  Do you have any performance numbers?

Yes. The real benefit this patch brings could be very minor. I will try
to get some performance data for this patch. Thanks.


Regards
Yin, Fengwei
Yin, Fengwei Jan. 17, 2023, 9:19 a.m. UTC | #3
On 1/14/2023 2:13 AM, Matthew Wilcox wrote:
> On Sat, Jan 14, 2023 at 12:35:38AM +0800, Yin Fengwei wrote:
>> The page fault number can be reduced by batched PTEs population.
>> The batch size of PTEs population is not allowed to cross:
>>   - page table boundaries
>>   - vma range
>>   - large folio size
>>   - fault_around_bytes
> 
> I find this patch very interesting.  But is it really worth it?  Most
> file-backed page faults are resolved through the ->map_pages() path
> which is almost always filemap_map_pages(), which does something
> fairly similar to this already.  Do you have any performance numbers?
> 
I tried the will-it-scale page_fault3:
https://github.com/antonblanchard/will-it-scale/blob/master/tests/page_fault3.c
with 96 processes on a test box with 48C/86T.
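
For reference, the test can be run roughly like this with the stock
will-it-scale runner (the exact invocation may differ between versions
of the benchmark):

	# build the benchmarks, then sweep the worker count for page_fault3
	make
	./runtest.py page_fault3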

With this patch, the test result improved by about 3.75X, with 4.1X
fewer page faults.

But it's a micro-benchmark that shows an extremely friendly case for this patch.

I didn't observe a performance gain with other workloads. I suppose
shared file write operations may not be that common? Thanks.

Regards
Yin, Fengwei
David Hildenbrand Jan. 17, 2023, 10:37 a.m. UTC | #4
On 17.01.23 10:19, Yin, Fengwei wrote:
> 
> 
> On 1/14/2023 2:13 AM, Matthew Wilcox wrote:
>> On Sat, Jan 14, 2023 at 12:35:38AM +0800, Yin Fengwei wrote:
>>> The page fault number can be reduced by batched PTEs population.
>>> The batch size of PTEs population is not allowed to cross:
>>>    - page table boundaries
>>>    - vma range
>>>    - large folio size
>>>    - fault_around_bytes
>>
>> I find this patch very interesting.  But is it really worth it?  Most
>> file-backed page faults are resolved through the ->map_pages() path
>> which is almost always filemap_map_pages(), which does something
>> fairly similar to this already.  Do you have any performance numbers?
>>
> I tried the will-it-scale page_fault3:
> https://github.com/antonblanchard/will-it-scale/blob/master/tests/page_fault3.c
> with 96 processes on a test box with 48C/86T.
> 
> With this patch, the test result improved by about 3.75X, with 4.1X
> fewer page faults.
> 
> But it's a micro-benchmark that shows an extremely friendly case for this patch.
> 
> I didn't observe a performance gain with other workloads. I suppose
> shared file write operations may not be that common? Thanks.

One question I have after reading "which does something fairly similar 
to this already" is whether both paths could be unified.
Matthew Wilcox Jan. 17, 2023, 2:46 p.m. UTC | #5
On Tue, Jan 17, 2023 at 11:37:05AM +0100, David Hildenbrand wrote:
> On 17.01.23 10:19, Yin, Fengwei wrote:
> > 
> > 
> > On 1/14/2023 2:13 AM, Matthew Wilcox wrote:
> > > On Sat, Jan 14, 2023 at 12:35:38AM +0800, Yin Fengwei wrote:
> > > > The page fault number can be reduced by batched PTEs population.
> > > > The batch size of PTEs population is not allowed to cross:
> > > >    - page table boundaries
> > > >    - vma range
> > > >    - large folio size
> > > >    - fault_around_bytes
> > > 
> > > I find this patch very interesting.  But is it really worth it?  Most
> > > file-backed page faults are resolved through the ->map_pages() path
> > > which is almost always filemap_map_pages(), which does something
> > > fairly similar to this already.  Do you have any performance numbers?
> > > 
> > I tried the will-it-scale page_fault3:
> > https://github.com/antonblanchard/will-it-scale/blob/master/tests/page_fault3.c
> > with 96 processes on a test box with 48C/86T.
> > 
> > With this patch, the test result improved by about 3.75X, with 4.1X
> > fewer page faults.
> > 
> > But it's a micro-benchmark that shows an extremely friendly case for this patch.
> > 
> > I didn't observe a performance gain with other workloads. I suppose
> > shared file write operations may not be that common? Thanks.
> 
> One question I have after reading "which does something fairly similar to
> this already" is whether both paths could be unified.

I've been thinking about this already; not so much in terms of "unifying
these two implementations" but rather "What's the right API for mapping
the parts of a folio that fit into userspace in response to a fault".
I haven't got quite as far as drafting code, but I'm thinking there should
be an API where we pass in the vmf, a folio and some other information,
and that function takes care of mapping however many pages from this
folio that it can.

And the reason to split it up like that is to batch as many page
operations into this folio as possible.  eg filemap_map_pages() was
written "to get it working", not "to be efficient", so it does stupid
things like call folio_ref_inc() every time it maps a page instead of
counting how many pages it maps and calling folio_ref_add() at the end.
Similar optimisations should be done for mapcount, which implies some
kind of batched equivalent of page_add_*_rmap().
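
To sketch the shape I have in mind (names and parameters here are purely
illustrative; no such function exists yet):

	/*
	 * Illustrative only: map however many pages of @folio fit around
	 * the faulting address, taking the extra folio references once
	 * with folio_ref_add() and doing one batched rmap update instead
	 * of per-page calls.  Returns VM_FAULT_NOPAGE if the page for the
	 * faulting address itself was populated.
	 */
	vm_fault_t filemap_map_folio_range(struct vm_fault *vmf,
			struct folio *folio, unsigned long addr,
			pgoff_t first_pgoff, unsigned int nr_pages);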
Yin, Fengwei Jan. 18, 2023, 12:58 a.m. UTC | #6
On 1/17/2023 6:37 PM, David Hildenbrand wrote:
> On 17.01.23 10:19, Yin, Fengwei wrote:
>>
>>
>> On 1/14/2023 2:13 AM, Matthew Wilcox wrote:
>>> On Sat, Jan 14, 2023 at 12:35:38AM +0800, Yin Fengwei wrote:
>>>> The page fault number can be reduced by batched PTEs population.
>>>> The batch size of PTEs population is not allowed to cross:
>>>>    - page table boundaries
>>>>    - vma range
>>>>    - large folio size
>>>>    - fault_around_bytes
>>>
>>> I find this patch very interesting.  But is it really worth it?  Most
>>> file-backed page faults are resolved through the ->map_pages() path
>>> which is almost always filemap_map_pages(), which does something
>>> fairly similar to this already.  Do you have any performance numbers?
>>>
>> I tried the will-it-scale page_fault3:
>> https://github.com/antonblanchard/will-it-scale/blob/master/tests/page_fault3.c
>> with 96 processes on a test box with 48C/86T.
>>
>> With this patch, the test result improved by about 3.75X, with 4.1X
>> fewer page faults.
>>
>> But it's a micro-benchmark that shows an extremely friendly case for this patch.
>>
>> I didn't observe a performance gain with other workloads. I suppose
>> shared file write operations may not be that common? Thanks.
> 
> One question I have after reading "which does something fairly similar to this already" is whether both paths could be unified.
Thanks for the suggestion. I will see what I can do for it.

Regards
Yin, Fengwei

>
Yin, Fengwei Jan. 18, 2023, 1:41 a.m. UTC | #7
On 1/17/2023 10:46 PM, Matthew Wilcox wrote:
> On Tue, Jan 17, 2023 at 11:37:05AM +0100, David Hildenbrand wrote:
>> On 17.01.23 10:19, Yin, Fengwei wrote:
>>>
>>>
>>> On 1/14/2023 2:13 AM, Matthew Wilcox wrote:
>>>> On Sat, Jan 14, 2023 at 12:35:38AM +0800, Yin Fengwei wrote:
>>>>> The page fault number can be reduced by batched PTEs population.
>>>>> The batch size of PTEs population is not allowed to cross:
>>>>>    - page table boundaries
>>>>>    - vma range
>>>>>    - large folio size
>>>>>    - fault_around_bytes
>>>>
>>>> I find this patch very interesting.  But is it really worth it?  Most
>>>> file-backed page faults are resolved through the ->map_pages() path
>>>> which is almost always filemap_map_pages(), which does something
>>>> fairly similar to this already.  Do you have any performance numbers?
>>>>
>>> I tried the will-it-scale page_fault3:
>>> https://github.com/antonblanchard/will-it-scale/blob/master/tests/page_fault3.c
>>> with 96 processes on a test box with 48C/86T.
>>>
>>> With this patch, the test result improved by about 3.75X, with 4.1X
>>> fewer page faults.
>>>
>>> But it's a micro-benchmark that shows an extremely friendly case for this patch.
>>>
>>> I didn't observe a performance gain with other workloads. I suppose
>>> shared file write operations may not be that common? Thanks.
>>
>> One question I have after reading "which does something fairly similar to
>> this already" is whether both paths could be unified.
> 
> I've been thinking about this already; not so much in terms of "unifying
> these two implementations" but rather "What's the right API for mapping
> the parts of a folio that fit into userspace in response to a fault".
> I haven't got quite as far as drafting code, but I'm thinking there should
> be an API where we pass in the vmf, a folio and some other information,
> and that function takes care of mapping however many pages from this
> folio that it can.
That would be lovely.

> 
> And the reason to split it up like that is to batch as many page
> operations into this folio as possible.  eg filemap_map_pages() was
> written "to get it working", not "to be efficient", so it does stupid
> things like call folio_ref_inc() every time it maps a page instead of
> counting how many pages it maps and calling folio_ref_add() at the end.
> Similar optimisations should be done for mapcount, which implies some
> kind of batched equivalent of page_add_*_rmap().
I tried to batch the mapcount updates when I worked on this patch.
But page_add_file_rmap() only supports compound PMD_SIZE folios (maybe
it is time to remove that limitation?).
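
Something like the following range variant would be needed (hypothetical
sketch only; no such helper exists today):

	/*
	 * Hypothetical batched variant of page_add_file_rmap(): account
	 * @nr consecutive sub-pages of @folio, starting at @page, as
	 * file-mapped -- bump each sub-page's _mapcount but update the
	 * NR_FILE_MAPPED statistics once for the whole batch.
	 */
	void folio_add_file_rmap_range(struct folio *folio, struct page *page,
			unsigned int nr, struct vm_area_struct *vma);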

Regards
Yin, Fengwei
Yin, Fengwei Jan. 18, 2023, 2:05 p.m. UTC | #8
On 1/17/2023 10:46 PM, Matthew Wilcox wrote:
> On Tue, Jan 17, 2023 at 11:37:05AM +0100, David Hildenbrand wrote:
>> On 17.01.23 10:19, Yin, Fengwei wrote:
>>>
>>>
>>> On 1/14/2023 2:13 AM, Matthew Wilcox wrote:
>>>> On Sat, Jan 14, 2023 at 12:35:38AM +0800, Yin Fengwei wrote:
>>>>> The page fault number can be reduced by batched PTEs population.
>>>>> The batch size of PTEs population is not allowed to cross:
>>>>>    - page table boundaries
>>>>>    - vma range
>>>>>    - large folio size
>>>>>    - fault_around_bytes
>>>>
>>>> I find this patch very interesting.  But is it really worth it?  Most
>>>> file-backed page faults are resolved through the ->map_pages() path
>>>> which is almost always filemap_map_pages(), which does something
>>>> fairly similar to this already.  Do you have any performance numbers?
>>>>
>>> I tried the will-it-scale page_fault3:
>>> https://github.com/antonblanchard/will-it-scale/blob/master/tests/page_fault3.c
>>> with 96 processes on a test box with 48C/86T.
>>>
>>> With this patch, the test result improved by about 3.75X, with 4.1X
>>> fewer page faults.
>>>
>>> But it's a micro-benchmark that shows an extremely friendly case for this patch.
>>>
>>> I didn't observe a performance gain with other workloads. I suppose
>>> shared file write operations may not be that common? Thanks.
>>
>> One question I have after reading "which does something fairly similar to
>> this already" is whether both paths could be unified.
> 
> I've been thinking about this already; not so much in terms of "unifying
> these two implementations" but rather "What's the right API for mapping
> the parts of a folio that fit into userspace in response to a fault".
> I haven't got quite as far as drafting code, but I'm thinking there should
> be an API where we pass in the vmf, a folio and some other information,
> and that function takes care of mapping however many pages from this
> folio that it can.
> 
> And the reason to split it up like that is to batch as many page
> operations into this folio as possible.  eg filemap_map_pages() was
> written "to get it working", not "to be efficient", so it does stupid
> things like call folio_ref_inc() every time it maps a page instead of
> counting how many pages it maps and calling folio_ref_add() at the end.
> Similar optimisations should be done for mapcount, which implies some
> kind of batched equivalent of page_add_*_rmap().

I did the following experiment (only boot-tested in QEMU). I think another
gain is that there is no folio_trylock/folio_unlock for each sub-page of a
large folio; we only need to trylock/unlock the entire folio once.

Batching the mapcount updates may need more changes, like checking in
advance that there is no !pte_none case and that the entire large folio
is being mapped.

diff --git a/mm/filemap.c b/mm/filemap.c
index c4d4ace9cc70..a79fd4766d2f 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -3353,6 +3353,41 @@ static inline struct folio *next_map_page(struct address_space *mapping,
 				  mapping, xas, end_pgoff);
 }
 
+
+static vm_fault_t __filemap_map_folio(struct vm_fault *vmf,
+	struct folio *folio, unsigned long addr, pgoff_t idx, int count)
+{
+	struct vm_area_struct *vma = vmf->vma;
+	int i = 0, nr_pages = folio_nr_pages(folio);
+	int len = ((count + idx) > nr_pages) ? (nr_pages - idx) : count;
+	int ref_count = 0;
+	vm_fault_t ret = 0;
+
+	do {
+		struct page *page;
+
+		page = folio_page(folio, i + idx);
+		if (PageHWPoison(page))
+			continue;
+
+		if (!pte_none(*vmf->pte))
+			continue;
+
+		if (vmf->address == addr)
+			ret = VM_FAULT_NOPAGE;
+
+		ref_count++;
+
+		do_set_pte(vmf, page, addr);
+		update_mmu_cache(vma, addr, vmf->pte);
+	} while (vmf->pte++, addr += PAGE_SIZE, ++i < len);
+
+	vmf->pte -= len;
+	folio_ref_add(folio, ref_count);
+
+	return ret;
+}
+
 vm_fault_t filemap_map_pages(struct vm_fault *vmf,
 			     pgoff_t start_pgoff, pgoff_t end_pgoff)
 {
@@ -3363,9 +3398,9 @@ vm_fault_t filemap_map_pages(struct vm_fault *vmf,
 	unsigned long addr;
 	XA_STATE(xas, &mapping->i_pages, start_pgoff);
 	struct folio *folio;
-	struct page *page;
 	unsigned int mmap_miss = READ_ONCE(file->f_ra.mmap_miss);
 	vm_fault_t ret = 0;
+	int idx;
 
 	rcu_read_lock();
 	folio = first_map_page(mapping, &xas, end_pgoff);
@@ -3380,45 +3415,21 @@ vm_fault_t filemap_map_pages(struct vm_fault *vmf,
 	addr = vma->vm_start + ((start_pgoff - vma->vm_pgoff) << PAGE_SHIFT);
 	vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd, addr, &vmf->ptl);
 	do {
-again:
-		page = folio_file_page(folio, xas.xa_index);
-		if (PageHWPoison(page))
-			goto unlock;
-
-		if (mmap_miss > 0)
-			mmap_miss--;
+		int count;
 
+		idx = xas.xa_index & (folio_nr_pages(folio) - 1);
 		addr += (xas.xa_index - last_pgoff) << PAGE_SHIFT;
 		vmf->pte += xas.xa_index - last_pgoff;
 		last_pgoff = xas.xa_index;
 
-		/*
-		 * NOTE: If there're PTE markers, we'll leave them to be
-		 * handled in the specific fault path, and it'll prohibit the
-		 * fault-around logic.
-		 */
-		if (!pte_none(*vmf->pte))
-			goto unlock;
+		count = end_pgoff - last_pgoff + 1;
 
-		/* We're about to handle the fault */
-		if (vmf->address == addr)
+		if (VM_FAULT_NOPAGE ==
+			__filemap_map_folio(vmf, folio, addr, idx, count))
 			ret = VM_FAULT_NOPAGE;
 
-		do_set_pte(vmf, page, addr);
-		/* no need to invalidate: a not-present page won't be cached */
-		update_mmu_cache(vma, addr, vmf->pte);
-		if (folio_more_pages(folio, xas.xa_index, end_pgoff)) {
-			xas.xa_index++;
-			folio_ref_inc(folio);
-			goto again;
-		}
-		folio_unlock(folio);
-		continue;
-unlock:
-		if (folio_more_pages(folio, xas.xa_index, end_pgoff)) {
-			xas.xa_index++;
-			goto again;
-		}
+		xas.xa_index += folio_nr_pages(folio) - idx;
+
 		folio_unlock(folio);
 		folio_put(folio);
 	} while ((folio = next_map_page(mapping, &xas, end_pgoff)) != NULL);

NOTE: the PoC code doesn't handle file->f_ra.mmap_miss. Thanks.
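
One possible way to cover it later (sketch only, untested): have
__filemap_map_folio() report back how many pages it handled, and adjust
the counter in filemap_map_pages() before the existing
WRITE_ONCE(file->f_ra.mmap_miss, mmap_miss) at the end:

	/* 'handled' is the hypothetical per-folio page count reported
	 * back by __filemap_map_folio()
	 */
	mmap_miss -= min(mmap_miss, (unsigned int)handled);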

Regards
Yin, Fengwei

Patch

diff --git a/mm/memory.c b/mm/memory.c
index 56b571c83a0e..755e6e590481 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -104,6 +104,10 @@  EXPORT_SYMBOL(mem_map);
 #endif
 
 static vm_fault_t do_fault(struct vm_fault *vmf);
+static inline bool allowed_batched_set_ptes(struct vm_fault *vmf,
+		struct page *page);
+static void do_set_multi_ptes(struct vm_fault *vmf, struct page *page,
+		unsigned long addr);
 
 /*
  * A number of key systems in x86 including ioremap() rely on the assumption
@@ -4359,10 +4363,16 @@  vm_fault_t finish_fault(struct vm_fault *vmf)
 
 	/* Re-check under ptl */
 	if (likely(!vmf_pte_changed(vmf))) {
-		do_set_pte(vmf, page, vmf->address);
+		if (allowed_batched_set_ptes(vmf, page))
+			do_set_multi_ptes(vmf, page, vmf->address);
+		else {
+			do_set_pte(vmf, page, vmf->address);
 
-		/* no need to invalidate: a not-present page won't be cached */
-		update_mmu_cache(vma, vmf->address, vmf->pte);
+			/* no need to invalidate: a not-present page
+			 * won't be cached
+			 */
+			update_mmu_cache(vma, vmf->address, vmf->pte);
+		}
 
 		ret = 0;
 	} else {
@@ -4476,6 +4486,92 @@  static inline bool should_fault_around(struct vm_fault *vmf)
 	return fault_around_bytes >> PAGE_SHIFT > 1;
 }
 
+/* Return true if we should populate PTEs in batch for this file fault */
+static inline bool allowed_batched_set_ptes(struct vm_fault *vmf,
+		struct page *page)
+{
+	struct folio *folio = page_folio(page);
+
+	if (uffd_disable_fault_around(vmf->vma))
+		return false;
+
+	if (!folio_test_large(folio))
+		return false;
+
+	/* TODO: revisit once anon mappings support large folios */
+	if ((vmf->flags & FAULT_FLAG_WRITE) &&
+			!(vmf->vma->vm_flags & VM_SHARED))
+		return false;
+
+	return fault_around_bytes >> PAGE_SHIFT > 1;
+}
+
+static void do_set_multi_ptes(struct vm_fault *vmf, struct page *pg,
+		unsigned long addr)
+{
+	struct folio *folio = page_folio(pg);
+	struct vm_area_struct *vma = vmf->vma;
+	unsigned long size, mask, start, end, folio_start, folio_end;
+	int dist, first_idx, i = 0;
+	pte_t *pte;
+
+	/* in page table range */
+	start = ALIGN_DOWN(addr, PMD_SIZE);
+	end = ALIGN(addr, PMD_SIZE);
+
+	/* in fault_around_bytes range */
+	size = READ_ONCE(fault_around_bytes);
+	mask = ~(size - 1) & PAGE_MASK;
+
+	/* in vma range */
+	start = max3(start, (addr & mask), vma->vm_start);
+	end = min3(end, (addr & mask) + size, vma->vm_end);
+
+	/* folio is locked and referenced. It will not be split or
+	 * removed from page cache in this function.
+	 */
+	folio_start = addr - (folio_page_idx(folio, pg) << PAGE_SHIFT);
+	folio_end = folio_start + (folio_nr_pages(folio) << PAGE_SHIFT);
+
+	/* in folio size range */
+	start = max(start, folio_start);
+	end = min(end, folio_end);
+
+	dist = (addr - start) >> PAGE_SHIFT;
+	first_idx = folio_page_idx(folio, pg) - dist;
+	pte = vmf->pte - dist;
+
+	do {
+		struct page *page = folio_page(folio, first_idx + i);
+		bool write = vmf->flags & FAULT_FLAG_WRITE;
+		bool prefault = page != pg;
+		pte_t entry;
+
+		if (!pte_none(*pte))
+			continue;
+
+		flush_icache_page(vma, page);
+		entry = mk_pte(page, vma->vm_page_prot);
+
+		if (prefault)
+			folio_get(folio);
+
+		if (prefault && arch_wants_old_prefaulted_pte())
+			entry = pte_mkold(entry);
+		else
+			entry = pte_sw_mkyoung(entry);
+
+		if (write)
+			entry = maybe_mkwrite(pte_mkdirty(entry), vma);
+
+		inc_mm_counter(vma->vm_mm, mm_counter_file(&folio->page));
+		page_add_file_rmap(page, vma, false);
+
+		set_pte_at(vma->vm_mm, start, pte, entry);
+		update_mmu_cache(vma, start, pte);
+	} while (pte++, start += PAGE_SIZE, i++, start < end);
+}
+
 static vm_fault_t do_read_fault(struct vm_fault *vmf)
 {
 	vm_fault_t ret = 0;