mbox series

[RFC,0/6] variable-order, large folios for anonymous memory

Message ID 20230317105802.2634004-1-ryan.roberts@arm.com (mailing list archive)
Headers show
Series variable-order, large folios for anonymous memory | expand

Message

Ryan Roberts March 17, 2023, 10:57 a.m. UTC
Hi All,

This is an RFC for an initial, _very_ limited stab at implementing support for
using variable-order, large folios for anonymous memory. It intends to be the
minimal change, upon which additions can be made incrementally. That said, with
just this change, I achive a 4% performance improvement when compiling the
kernel (more on that later).

My motivation for posting the RFC now is twofold:

- Get feedback on the approach I'm taking before I go too far down the path;
  does this fit with the direction of the community? Are there any bear traps
  that I've not considered (due to my being fairly new to mm and not having a
  complete understanding of its entirety)?

- Seek support for a bug I'm encountering when MADV_FREE is attempting to
  split_folio() one of these new variable-order anon folios. I've been pouring
  through the source and can't find the root cause. For now I have a work
  around, but hopefully someone can give me some pointers as to where the
  problem is likely to be. (see details below).

The patches apply on top of v6.3-rc1 + patches 1-31 of [4] (which needed one
minor conflict resolution). And I have a tree at [5].

See [1], [2], [3] for more background.

Approach
========

For now, I'm only modifying the allocation path (do_anonymous_page()). I'm not
touching the CoW path. First, I determine the order of the folio to allocate for
the given fault. This is determined by:

  - Folio must be naturally aligned within VA space
  - Folio must not breach boundaries of vma
  - Folio must be fully contained inside one pmd entry
  - Folio must not overlap any non-none ptes
  - Order must not be higher than a provided starting order

Where the "provided starting order" is currently hardcoded to 4, but the idea is
that this would eventually be a per-vma value that gets dynamically tuned.

We then try to allocate a large folio of the determined order, and keep trying
to allocate with successively smaller orders until we succeed. Once the folio is
allocated we can take the PTL and re-check that all the covered PTE entries are
still none. If not, we decrement the order and start again.

Next, the folio is added to the rmap using a new API,
folio_add_new_anon_rmap_range(), which is similar to Yin, Fengwei's
folio_add_file_rmap_range() at [4]. And finally set the ptes using Matthew
Wilcox's new set_ptes() API, also at [4].

Folio/page refcounts and mapcounts are managed in the same way as Yin, Fengwei
is doing in folio_add_file_rmap_range(); A reference is taken on the folio for
each pte, _mapcount is incremented on each page by 1, and
folio->_nr_pages_mapped is set to the number of pages in the folio (since every
page is initially mapped).

It is my assumption that the mm should be able to deal with these folios
correctly for CoW and reclaim etc. although perhaps not as optimally as we would
(eventually) like.

Bug(s)
======

When I run this code without the last (workaround) patch, with DEBUG_VM et al,
PROVE_LOCKING and KASAN enabled, I see occasional oopses. Mostly these are
relating to invalid kernel addresses (which usually look like either NULL +
small offset or mostly zeros with a few mid-order bits set + a small offset) or
lockdep complaining about a bad unlock balance. Call stacks are often in
madvise_free_pte_range(), but I've seen them in filesystem code too. (I can
email example oopses out separately if anyone wants to review them). My hunch is
that struct pages adjacent to the folio are being corrupted, but don't have hard
evidence.

When adding the workaround patch, which prevents madvise_free_pte_range() from
attempting to split a large folio, I never see any issues. Although I'm not
putting the system under memory pressure so guess I might see the same types of
problem crop up under swap, etc.

I've reviewed most of the code within split_folio() and can't find any smoking
gun, but I wonder if there are implicit assumptions about the large folio being
PMD sized that I'm obviously breaking now?

The code in madvise_free_pte_range():

	if (folio_test_large(folio)) {
		if (folio_mapcount(folio) != 1)
			goto out;
		folio_get(folio);
		if (!folio_trylock(folio)) {
			folio_put(folio);
			goto out;
		}
		pte_unmap_unlock(orig_pte, ptl);
		if (split_folio(folio)) {
			folio_unlock(folio);
			folio_put(folio);
			orig_pte = pte_offset_map_lock(mm, pmd, addr, &ptl);
			goto out;
		}
		...
	}

Will normally skip my large folios because they have a mapcount > 1, due to
incrementing mapcount for each pte, unlike PMD mapped pages. But on occasion it
will see a mapcount of 1 and proceed. So I guess this is racing against reclaim
or CoW in this case?

I also see its doing a dance to take the folio lock and drop the ptl. Perhaps my
large anon folio is not using the folio lock in the same way as a THP would and
we are therefore not getting the expected serialization?

I'd really appreciate any suggestions for how to pregress here!

Performance
===========

With the above bug worked around, I'm benchmarking kernel compilation, which is
known to be heavy on anonymous page faults. Overall, I see a reduction in
wall-time by 4%. This is inline with my predictions based on earlier experiments
summarised at [1]. I beleive there is scope for future improvement on the CoW
and reclaim paths. I'd also expect to see performance improvements due to
reduced TLB pressure on CPUs that support HPA (I'm running on Ampere Altra where
HPA is not enabled).

Of the 4%, all of it is (obviously) in the kernel; overall kernel execution time
has reduced by 34%, more than halving the time spent servicing data faults, and
significantly speeding up sys_exit_group().

Thanks,
Ryan

[1] https://lore.kernel.org/linux-mm/4c991dcb-c5bb-86bb-5a29-05df24429607@arm.com/
[2] https://lore.kernel.org/linux-mm/a7cd938e-a86f-e3af-f56c-433c92ac69c2@arm.com/
[3] https://lore.kernel.org/linux-mm/Y%2FblF0GIunm+pRIC@casper.infradead.org/
[4] https://lore.kernel.org/linux-mm/20230315051444.3229621-1-willy@infradead.org/
[5] https://gitlab.arm.com/linux-arm/linux-rr/-/tree/features/granule_perf/anon_folio-lkml-rfc


Ryan Roberts (6):
  mm: Expose clear_huge_page() unconditionally
  mm: pass gfp flags and order to vma_alloc_zeroed_movable_folio()
  mm: Introduce try_vma_alloc_zeroed_movable_folio()
  mm: Implement folio_add_new_anon_rmap_range()
  mm: Allocate large folios for anonymous memory
  WORKAROUND: Don't split large folios on madvise

 arch/alpha/include/asm/page.h   |   5 +-
 arch/arm64/include/asm/page.h   |   3 +-
 arch/arm64/mm/fault.c           |   7 +-
 arch/ia64/include/asm/page.h    |   5 +-
 arch/m68k/include/asm/page_no.h |   7 +-
 arch/s390/include/asm/page.h    |   5 +-
 arch/x86/include/asm/page.h     |   5 +-
 include/linux/highmem.h         |  23 +++--
 include/linux/mm.h              |   3 +-
 include/linux/rmap.h            |   2 +
 mm/madvise.c                    |   8 ++
 mm/memory.c                     | 167 ++++++++++++++++++++++++++++----
 mm/rmap.c                       |  43 ++++++++
 13 files changed, 239 insertions(+), 44 deletions(-)

--
2.25.1

Comments

Ryan Roberts March 22, 2023, 12:03 p.m. UTC | #1
Hi Matthew,

On 17/03/2023 10:57, Ryan Roberts wrote:
> Hi All,
> 
> [...]
> 
> Bug(s)
> ======
> 
> When I run this code without the last (workaround) patch, with DEBUG_VM et al,
> PROVE_LOCKING and KASAN enabled, I see occasional oopses. Mostly these are
> relating to invalid kernel addresses (which usually look like either NULL +
> small offset or mostly zeros with a few mid-order bits set + a small offset) or
> lockdep complaining about a bad unlock balance. Call stacks are often in
> madvise_free_pte_range(), but I've seen them in filesystem code too. (I can
> email example oopses out separately if anyone wants to review them). My hunch is
> that struct pages adjacent to the folio are being corrupted, but don't have hard
> evidence.
> 
> When adding the workaround patch, which prevents madvise_free_pte_range() from
> attempting to split a large folio, I never see any issues. Although I'm not
> putting the system under memory pressure so guess I might see the same types of
> problem crop up under swap, etc.
> 
> I've reviewed most of the code within split_folio() and can't find any smoking
> gun, but I wonder if there are implicit assumptions about the large folio being
> PMD sized that I'm obviously breaking now?
> 
> The code in madvise_free_pte_range():
> 
> 	if (folio_test_large(folio)) {
> 		if (folio_mapcount(folio) != 1)
> 			goto out;
> 		folio_get(folio);
> 		if (!folio_trylock(folio)) {
> 			folio_put(folio);
> 			goto out;
> 		}
> 		pte_unmap_unlock(orig_pte, ptl);
> 		if (split_folio(folio)) {
> 			folio_unlock(folio);
> 			folio_put(folio);
> 			orig_pte = pte_offset_map_lock(mm, pmd, addr, &ptl);
> 			goto out;
> 		}
> 		...
> 	}

I've noticed that its folio_split() with a folio order of 1 that causes my
problems. And I also see that the page cache code always explicitly never
allocates order-1 folios:

void page_cache_ra_order(struct readahead_control *ractl,
		struct file_ra_state *ra, unsigned int new_order)
{
	...

	while (index <= limit) {
		unsigned int order = new_order;

		/* Align with smaller pages if needed */
		if (index & ((1UL << order) - 1)) {
			order = __ffs(index);
			if (order == 1)
				order = 0;
		}
		/* Don't allocate pages past EOF */
		while (index + (1UL << order) - 1 > limit) {
			if (--order == 1)
				order = 0;
		}
		err = ra_alloc_folio(ractl, index, mark, order, gfp);
		if (err)
			break;
		index += 1UL << order;
	}

	...
}

Matthew, what is the reason for this? I suspect its guarding against the same
problem I'm seeing.

If I explicitly prevent order-1 allocations for anon pages, I'm unable to cause
any oops/panic/etc. I'd just like to understand the root cause.

Thanks,
Ryan



> 
> Will normally skip my large folios because they have a mapcount > 1, due to
> incrementing mapcount for each pte, unlike PMD mapped pages. But on occasion it
> will see a mapcount of 1 and proceed. So I guess this is racing against reclaim
> or CoW in this case?
> 
> I also see its doing a dance to take the folio lock and drop the ptl. Perhaps my
> large anon folio is not using the folio lock in the same way as a THP would and
> we are therefore not getting the expected serialization?
> 
> I'd really appreciate any suggestions for how to pregress here!
>
Yin Fengwei March 22, 2023, 1:36 p.m. UTC | #2
On 3/22/2023 8:03 PM, Ryan Roberts wrote:
> Hi Matthew,
> 
> On 17/03/2023 10:57, Ryan Roberts wrote:
>> Hi All,
>>
>> [...]
>>
>> Bug(s)
>> ======
>>
>> When I run this code without the last (workaround) patch, with DEBUG_VM et al,
>> PROVE_LOCKING and KASAN enabled, I see occasional oopses. Mostly these are
>> relating to invalid kernel addresses (which usually look like either NULL +
>> small offset or mostly zeros with a few mid-order bits set + a small offset) or
>> lockdep complaining about a bad unlock balance. Call stacks are often in
>> madvise_free_pte_range(), but I've seen them in filesystem code too. (I can
>> email example oopses out separately if anyone wants to review them). My hunch is
>> that struct pages adjacent to the folio are being corrupted, but don't have hard
>> evidence.
>>
>> When adding the workaround patch, which prevents madvise_free_pte_range() from
>> attempting to split a large folio, I never see any issues. Although I'm not
>> putting the system under memory pressure so guess I might see the same types of
>> problem crop up under swap, etc.
>>
>> I've reviewed most of the code within split_folio() and can't find any smoking
>> gun, but I wonder if there are implicit assumptions about the large folio being
>> PMD sized that I'm obviously breaking now?
>>
>> The code in madvise_free_pte_range():
>>
>> 	if (folio_test_large(folio)) {
>> 		if (folio_mapcount(folio) != 1)
>> 			goto out;
>> 		folio_get(folio);
>> 		if (!folio_trylock(folio)) {
>> 			folio_put(folio);
>> 			goto out;
>> 		}
>> 		pte_unmap_unlock(orig_pte, ptl);
>> 		if (split_folio(folio)) {
>> 			folio_unlock(folio);
>> 			folio_put(folio);
>> 			orig_pte = pte_offset_map_lock(mm, pmd, addr, &ptl);
>> 			goto out;
>> 		}
>> 		...
>> 	}
> 
> I've noticed that its folio_split() with a folio order of 1 that causes my
> problems. And I also see that the page cache code always explicitly never
> allocates order-1 folios:
> 
> void page_cache_ra_order(struct readahead_control *ractl,
> 		struct file_ra_state *ra, unsigned int new_order)
> {
> 	...
> 
> 	while (index <= limit) {
> 		unsigned int order = new_order;
> 
> 		/* Align with smaller pages if needed */
> 		if (index & ((1UL << order) - 1)) {
> 			order = __ffs(index);
> 			if (order == 1)
> 				order = 0;
> 		}
> 		/* Don't allocate pages past EOF */
> 		while (index + (1UL << order) - 1 > limit) {
> 			if (--order == 1)
> 				order = 0;
> 		}
> 		err = ra_alloc_folio(ractl, index, mark, order, gfp);
> 		if (err)
> 			break;
> 		index += 1UL << order;
> 	}
> 
> 	...
> }
> 
> Matthew, what is the reason for this? I suspect its guarding against the same
> problem I'm seeing.
> 
> If I explicitly prevent order-1 allocations for anon pages, I'm unable to cause
> any oops/panic/etc. I'd just like to understand the root cause.
Checked the struct folio definition. The _deferred_list is in third page struct.
My understanding is to support folio split, the folio order must >= 2. Thanks.


Regards
Yin, Fengwei

> 
> Thanks,
> Ryan
> 
> 
> 
>>
>> Will normally skip my large folios because they have a mapcount > 1, due to
>> incrementing mapcount for each pte, unlike PMD mapped pages. But on occasion it
>> will see a mapcount of 1 and proceed. So I guess this is racing against reclaim
>> or CoW in this case?
>>
>> I also see its doing a dance to take the folio lock and drop the ptl. Perhaps my
>> large anon folio is not using the folio lock in the same way as a THP would and
>> we are therefore not getting the expected serialization?
>>
>> I'd really appreciate any suggestions for how to pregress here!
>>
>
Ryan Roberts March 22, 2023, 2:25 p.m. UTC | #3
On 22/03/2023 13:36, Yin, Fengwei wrote:
> 
> 
> On 3/22/2023 8:03 PM, Ryan Roberts wrote:
>> Hi Matthew,
>>
>> On 17/03/2023 10:57, Ryan Roberts wrote:
>>> Hi All,
>>>
>>> [...]
>>>
>>> Bug(s)
>>> ======
>>>
>>> When I run this code without the last (workaround) patch, with DEBUG_VM et al,
>>> PROVE_LOCKING and KASAN enabled, I see occasional oopses. Mostly these are
>>> relating to invalid kernel addresses (which usually look like either NULL +
>>> small offset or mostly zeros with a few mid-order bits set + a small offset) or
>>> lockdep complaining about a bad unlock balance. Call stacks are often in
>>> madvise_free_pte_range(), but I've seen them in filesystem code too. (I can
>>> email example oopses out separately if anyone wants to review them). My hunch is
>>> that struct pages adjacent to the folio are being corrupted, but don't have hard
>>> evidence.
>>>
>>> When adding the workaround patch, which prevents madvise_free_pte_range() from
>>> attempting to split a large folio, I never see any issues. Although I'm not
>>> putting the system under memory pressure so guess I might see the same types of
>>> problem crop up under swap, etc.
>>>
>>> I've reviewed most of the code within split_folio() and can't find any smoking
>>> gun, but I wonder if there are implicit assumptions about the large folio being
>>> PMD sized that I'm obviously breaking now?
>>>
>>> The code in madvise_free_pte_range():
>>>
>>> 	if (folio_test_large(folio)) {
>>> 		if (folio_mapcount(folio) != 1)
>>> 			goto out;
>>> 		folio_get(folio);
>>> 		if (!folio_trylock(folio)) {
>>> 			folio_put(folio);
>>> 			goto out;
>>> 		}
>>> 		pte_unmap_unlock(orig_pte, ptl);
>>> 		if (split_folio(folio)) {
>>> 			folio_unlock(folio);
>>> 			folio_put(folio);
>>> 			orig_pte = pte_offset_map_lock(mm, pmd, addr, &ptl);
>>> 			goto out;
>>> 		}
>>> 		...
>>> 	}
>>
>> I've noticed that its folio_split() with a folio order of 1 that causes my
>> problems. And I also see that the page cache code always explicitly never
>> allocates order-1 folios:
>>
>> void page_cache_ra_order(struct readahead_control *ractl,
>> 		struct file_ra_state *ra, unsigned int new_order)
>> {
>> 	...
>>
>> 	while (index <= limit) {
>> 		unsigned int order = new_order;
>>
>> 		/* Align with smaller pages if needed */
>> 		if (index & ((1UL << order) - 1)) {
>> 			order = __ffs(index);
>> 			if (order == 1)
>> 				order = 0;
>> 		}
>> 		/* Don't allocate pages past EOF */
>> 		while (index + (1UL << order) - 1 > limit) {
>> 			if (--order == 1)
>> 				order = 0;
>> 		}
>> 		err = ra_alloc_folio(ractl, index, mark, order, gfp);
>> 		if (err)
>> 			break;
>> 		index += 1UL << order;
>> 	}
>>
>> 	...
>> }
>>
>> Matthew, what is the reason for this? I suspect its guarding against the same
>> problem I'm seeing.
>>
>> If I explicitly prevent order-1 allocations for anon pages, I'm unable to cause
>> any oops/panic/etc. I'd just like to understand the root cause.
> Checked the struct folio definition. The _deferred_list is in third page struct.
> My understanding is to support folio split, the folio order must >= 2. Thanks.

Yep, looks like we have found the root cause - thanks for your help!

I've updated calc_anonymous_folio_order() to only use non-0 order if THP is
available and in that case, never allocate order-1. I think that both fixes the
problem and manages the dependency we have on THP:

static void calc_anonymous_folio_order(struct vm_fault *vmf,
				       int *order_out,
				       unsigned long *addr_out)
{
	/*
	 * The aim here is to determine what size of folio we should allocate
	 * for this fault. Factors include:
	 * - Folio must be naturally aligned within VA space
	 * - Folio must not breach boundaries of vma
	 * - Folio must be fully contained inside one pmd entry
	 * - Folio must not overlap any non-none ptes
	 * - Order must not be higher than *order_out upon entry
	 *
	 * Additionally, we do not allow order-1 since this breaks assumptions
	 * elsewhere in the mm; THP pages must be at least order-2 (since they
	 * store state up to the 3rd struct page subpage), and these pages must
	 * be THP in order to correctly use pre-existing THP infrastructure such
	 * as folio_split().
	 *
	 * As a consequence of relying on the THP infrastructure, if the system
	 * does not support THP, we always fallback to order-0.
	 *
	 * Note that the caller may or may not choose to lock the pte. If
	 * unlocked, the calculation should be considered an estimate that will
	 * need to be validated under the lock.
	 */

	struct vm_area_struct *vma = vmf->vma;
	int nr;
	int order;
	unsigned long addr;
	pte_t *pte;
	pte_t *first_set = NULL;
	int ret;

	if (has_transparent_hugepage()) {
		order = min(*order_out, PMD_SHIFT - PAGE_SHIFT);

		for (; order > 1; order--) {
			nr = 1 << order;
			addr = ALIGN_DOWN(vmf->address, nr * PAGE_SIZE);
			pte = vmf->pte - ((vmf->address - addr) >> PAGE_SHIFT);

			/* Check vma bounds. */
			if (addr < vma->vm_start ||
			    addr + nr * PAGE_SIZE > vma->vm_end)
				continue;

			/* Ptes covered by order already known to be none. */
			if (pte + nr <= first_set)
				break;

			/* Already found set pte in range covered by order. */
			if (pte <= first_set)
				continue;

			/* Need to check if all the ptes are none. */
			ret = check_all_ptes_none(pte, nr);
			if (ret == nr)
				break;

			first_set = pte + ret;
		}

		if (order == 1)
			order = 0;
	} else {
		order = 0;
	}

	*order_out = order;
	*addr_out = order > 0 ? addr : vmf->address;
}



> 
> 
> Regards
> Yin, Fengwei
> 
>>
>> Thanks,
>> Ryan
>>
>>
>>
>>>
>>> Will normally skip my large folios because they have a mapcount > 1, due to
>>> incrementing mapcount for each pte, unlike PMD mapped pages. But on occasion it
>>> will see a mapcount of 1 and proceed. So I guess this is racing against reclaim
>>> or CoW in this case?
>>>
>>> I also see its doing a dance to take the folio lock and drop the ptl. Perhaps my
>>> large anon folio is not using the folio lock in the same way as a THP would and
>>> we are therefore not getting the expected serialization?
>>>
>>> I'd really appreciate any suggestions for how to pregress here!
>>>
>>