[RFC,00/14] Rearrange batched folio freeing

Message ID 20230825135918.4164671-1-willy@infradead.org

Message

Matthew Wilcox Aug. 25, 2023, 1:59 p.m. UTC
Other than the obvious "remove calls to compound_head" changes, the
fundamental belief here is that iterating a linked list is much slower
than iterating an array (5-15x slower in my testing).  There's also
an associated belief that since we iterate the batch of folios three
times, we do better when the array is small (ie 15 entries) than we do
with a batch that is hundreds of entries long, which only gives us the
opportunity for the first pages to fall out of cache by the time we get
to the end.

The one place where that probably falls down is "Free folios in a batch
in shrink_folio_list()" where we'll flush the TLB once per batch instead
of at the end.  That's going to take some benchmarking.
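
As a rough illustration of the first claim, a minimal userspace sketch
(hypothetical; not part of this series) is below.  The list nodes are
shuffled so the pointer chase takes the cache misses a long kernel list
would, while the array walk streams through memory; numbers will vary
by machine:

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

struct node { struct node *next; long long payload; };
#define N (1 << 20)

static long long now_ns(void)
{
	struct timespec ts;

	clock_gettime(CLOCK_MONOTONIC, &ts);
	return ts.tv_sec * 1000000000LL + ts.tv_nsec;
}

int main(void)
{
	struct node *nodes = calloc(N, sizeof(*nodes));
	struct node **ptrs = malloc(N * sizeof(*ptrs));
	long long t, sum = 0;
	int i;

	for (i = 0; i < N; i++) {
		nodes[i].payload = i;
		ptrs[i] = &nodes[i];
	}

	/* Fisher-Yates shuffle so list order != memory order */
	for (i = N - 1; i > 0; i--) {
		int j = rand() % (i + 1);
		struct node *tmp = ptrs[i];

		ptrs[i] = ptrs[j];
		ptrs[j] = tmp;
	}
	for (i = 0; i < N - 1; i++)
		ptrs[i]->next = ptrs[i + 1];
	ptrs[N - 1]->next = NULL;

	t = now_ns();
	for (struct node *n = ptrs[0]; n; n = n->next)
		sum += n->payload;
	printf("list:  %lld ns (sum %lld)\n", now_ns() - t, sum);

	t = now_ns();
	for (i = 0; i < N; i++)
		sum += nodes[i].payload;
	printf("array: %lld ns (sum %lld)\n", now_ns() - t, sum);

	free(ptrs);
	free(nodes);
	return 0;
}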

Matthew Wilcox (Oracle) (14):
  mm: Make folios_put() the basis of release_pages()
  mm: Convert free_unref_page_list() to use folios
  mm: Add free_unref_folios()
  mm: Use folios_put() in __folio_batch_release()
  memcg: Add mem_cgroup_uncharge_folios()
  mm: Remove use of folio list from folios_put()
  mm: Use free_unref_folios() in put_pages_list()
  mm: use __page_cache_release() in folios_put()
  mm: Handle large folios in free_unref_folios()
  mm: Allow non-hugetlb large folios to be batch processed
  mm: Free folios in a batch in shrink_folio_list()
  mm: Free folios directly in move_folios_to_lru()
  memcg: Remove mem_cgroup_uncharge_list()
  mm: Remove free_unref_page_list()

 include/linux/memcontrol.h |  24 ++---
 include/linux/mm.h         |  19 +---
 mm/internal.h              |   4 +-
 mm/memcontrol.c            |  16 ++--
 mm/mlock.c                 |   3 +-
 mm/page_alloc.c            |  74 ++++++++-------
 mm/swap.c                  | 180 ++++++++++++++++++++-----------------
 mm/vmscan.c                |  51 +++++------
 8 files changed, 181 insertions(+), 190 deletions(-)

Comments

Ryan Roberts Sept. 4, 2023, 1:25 p.m. UTC | #1
On 25/08/2023 14:59, Matthew Wilcox (Oracle) wrote:
> Other than the obvious "remove calls to compound_head" changes, the
> fundamental belief here is that iterating a linked list is much slower
> than iterating an array (5-15x slower in my testing).  There's also
> an associated belief that since we iterate the batch of folios three
> times, we do better when the array is small (ie 15 entries) than we do
> with a batch that is hundreds of entries long, which only gives us the
> opportunity for the first pages to fall out of cache by the time we get
> to the end.
> 
> The one place where that probably falls down is "Free folios in a batch
> in shrink_folio_list()" where we'll flush the TLB once per batch instead
> of at the end.  That's going to take some benchmarking.
> 
> Matthew Wilcox (Oracle) (14):
>   mm: Make folios_put() the basis of release_pages()
>   mm: Convert free_unref_page_list() to use folios
>   mm: Add free_unref_folios()
>   mm: Use folios_put() in __folio_batch_release()
>   memcg: Add mem_cgroup_uncharge_folios()
>   mm: Remove use of folio list from folios_put()
>   mm: Use free_unref_folios() in put_pages_list()
>   mm: use __page_cache_release() in folios_put()
>   mm: Handle large folios in free_unref_folios()
>   mm: Allow non-hugetlb large folios to be batch processed
>   mm: Free folios in a batch in shrink_folio_list()
>   mm: Free folios directly in move_folios_to_lru()
>   memcg: Remove mem_cgroup_uncharge_list()
>   mm: Remove free_unref_page_list()
> 
>  include/linux/memcontrol.h |  24 ++---
>  include/linux/mm.h         |  19 +---
>  mm/internal.h              |   4 +-
>  mm/memcontrol.c            |  16 ++--
>  mm/mlock.c                 |   3 +-
>  mm/page_alloc.c            |  74 ++++++++-------
>  mm/swap.c                  | 180 ++++++++++++++++++++-----------------
>  mm/vmscan.c                |  51 +++++------
>  8 files changed, 181 insertions(+), 190 deletions(-)
> 

Hi Matthew,

I've been doing some benchmarking of this series, as promised, but have hit an oops. It doesn't appear to be easily reproducible, and I'm struggling to figure out the root cause, so thought I would share in case you have any ideas?


--->8---

[  205.942771] ================================================================================
[  205.951201] UBSAN: array-index-out-of-bounds in mm/page_alloc.c:668:46
[  205.957716] index 10283 is out of range for type 'list_head [6]'
[  205.963774] ================================================================================
[  205.972200] Unable to handle kernel paging request at virtual address 006808190c06670f
[  205.980103] Mem abort info:
[  205.982884]   ESR = 0x0000000096000044
[  205.986620]   EC = 0x25: DABT (current EL), IL = 32 bits
[  205.991918]   SET = 0, FnV = 0
[  205.994959]   EA = 0, S1PTW = 0
[  205.998088]   FSC = 0x04: level 0 translation fault
[  206.002952] Data abort info:
[  206.005819]   ISV = 0, ISS = 0x00000044, ISS2 = 0x00000000
[  206.011291]   CM = 0, WnR = 1, TnD = 0, TagAccess = 0
[  206.016329]   GCS = 0, Overlay = 0, DirtyBit = 0, Xs = 0
[  206.021627] [006808190c06670f] address between user and kernel address ranges
[  206.028749] Internal error: Oops: 0000000096000044 [#1] SMP
[  206.034310] Modules linked in: nfs lockd grace sunrpc fscache netfs nls_iso8859_1 scsi_dh_rdac scsi_dh_emc scsi_dh_alua drm xfs btrfs blake2b_generic raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor xor_neon raid6_pq libcrc32c raid1 raid0 multipath linear mlx5_ib ib_uverbs ib_core mlx5_core mlxfw pci_hyperv_intf crct10dif_ce ghash_ce sha2_ce sha256_arm64 sha1_ce tls nvme psample nvme_core aes_neon_bs aes_neon_blk aes_ce_blk aes_ce_cipher
[  206.075579] CPU: 43 PID: 159703 Comm: git Not tainted 6.5.0-rc4-ryarob01-all #1
[  206.082875] Hardware name: WIWYNN Mt.Jade Server System B81.03001.0005/Mt.Jade Motherboard, BIOS 1.08.20220218 (SCP: 1.08.20220218) 2022/02/18
[  206.095638] pstate: 004000c9 (nzcv daIF +PAN -UAO -TCO -DIT -SSBS BTYPE=--)
[  206.102587] pc : free_pcppages_bulk+0x330/0x7f8
[  206.107104] lr : free_pcppages_bulk+0x7a8/0x7f8
[  206.111622] sp : ffff8000aeef3680
[  206.114923] x29: ffff8000aeef3680 x28: 000000000000282b x27: 00000000000000fc
[  206.122046] x26: 000000008015a39a x25: ffff08181ef9e840 x24: ffff0818836caf80
[  206.129169] x23: 0000000000000001 x22: 0000000000000000 x21: ffff08181ef9e850
[  206.136292] x20: fffffc200368e680 x19: fffffc200368e6c0 x18: 0000000000000000
[  206.143414] x17: 3d3d3d3d3d3d3d3d x16: 3d3d3d3d3d3d3d3d x15: 3d3d3d3d3d3d3d3d
[  206.150537] x14: 3d3d3d3d3d3d3d3d x13: 3d3d3d3d3d3d3d3d x12: 3d3d3d3d3d3d3d3d
[  206.157659] x11: 3d3d3d3d3d3d3d3d x10: 3d3d3d3d3d3d3d3d x9 : fffffc200368e688
[  206.164782] x8 : fffffc200368e680 x7 : 205d343737333639 x6 : ffff08181dee0000
[  206.171904] x5 : ffff0818836caf80 x4 : 0000000000000000 x3 : 0000000000000001
[  206.179027] x2 : ffff0818836f3330 x1 : ffff0818836f3230 x0 : 006808190c066707
[  206.186149] Call trace:
[  206.188582]  free_pcppages_bulk+0x330/0x7f8
[  206.192752]  free_unref_page_commit+0x15c/0x250
[  206.197270]  free_unref_folios+0x37c/0x4a8
[  206.201353]  release_unref_folios+0xac/0xf8
[  206.205524]  folios_put+0xe0/0x1f0
[  206.208914]  __folio_batch_release+0x34/0x88
[  206.213170]  truncate_inode_pages_range+0x160/0x540
[  206.218035]  truncate_inode_pages_final+0x58/0x90
[  206.222726]  ext4_evict_inode+0x164/0x900
[  206.226722]  evict+0xac/0x160
[  206.229678]  iput+0x170/0x228
[  206.232633]  do_unlinkat+0x1d0/0x290
[  206.236196]  __arm64_sys_unlinkat+0x48/0x98
[  206.240367]  invoke_syscall+0x74/0xf8
[  206.244016]  el0_svc_common.constprop.0+0x58/0x130
[  206.248793]  do_el0_svc+0x40/0xa8
[  206.252095]  el0_svc+0x2c/0xb8
[  206.255137]  el0t_64_sync_handler+0xc0/0xc8
[  206.259308]  el0t_64_sync+0x1a8/0x1b0
[  206.262958] Code: 8b0110a1 8b050305 8b010301 f9408020 (f9000409) 
[  206.269039] ---[ end trace 0000000000000000 ]---

--->8---


UBSAN is complaining about migratetype being out of range here:

/* Used for pages not on another list */
static inline void add_to_free_list(struct page *page, struct zone *zone,
				    unsigned int order, int migratetype)
{
	struct free_area *area = &zone->free_area[order];

	list_add(&page->buddy_list, &area->free_list[migratetype]);
	area->nr_free++;
}

And I think that is called from __free_one_page(), which is called from free_pcppages_bulk() at the top of the stack trace. migratetype originates from get_pcppage_migratetype(page), which returns page->index. But I can't see where this might be getting corrupted, or how your changes or mine could affect this.
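
For reference, the helpers in question look roughly like this in v6.5
(paraphrased from mm/page_alloc.c):

/* The migratetype is cached in page->index while a page is on a pcp list */
static inline int get_pcppage_migratetype(struct page *page)
{
	return page->index;
}

static inline void set_pcppage_migratetype(struct page *page, int migratetype)
{
	page->index = migratetype;
}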




Config: This particular config actually has my large anon folios and mmu_gather series as well as this series from you. The issue is happening for file-backed memory, though, so I'm hoping it's not related to LAF.

I have modified your series in a couple of relevant ways though:

 - I'm using `pcp_allowed_order(order)` instead of a direct comparison against PAGE_ALLOC_COSTLY_ORDER in free_unref_folios() (as I suggested in the review).
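
   For reference, pcp_allowed_order() looks roughly like this in v6.5, so
   pageblock_order (THP-sized) folios also qualify for the pcp lists:

static inline bool pcp_allowed_order(unsigned int order)
{
	if (order <= PAGE_ALLOC_COSTLY_ORDER)
		return true;
#ifdef CONFIG_TRANSPARENT_HUGEPAGE
	if (order == pageblock_order)
		return true;
#endif
	return false;
}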

 - I've refactored folios_put() as you suggested in another thread, to simplify the implementation of my free_folio_regions_and_swap_cache():


/*
 * Common tail of the freeing path: remove each folio from the LRU via
 * __page_cache_release(), uncharge the batch from its memcgs, then hand
 * it back to the page allocator.  Callers have already dropped the last
 * reference to every folio in the batch.
 */
static void release_unref_folios(struct folio_batch *folios)
{
	struct lruvec *lruvec = NULL;
	unsigned long flags = 0;
	int i;

	for (i = 0; i < folios->nr; i++)
		__page_cache_release(folios->folios[i], &lruvec, &flags);

	if (lruvec)
		unlock_page_lruvec_irqrestore(lruvec, flags);

	mem_cgroup_uncharge_folios(folios);
	free_unref_folios(folios);
}

/**
 * free_folio_regions_and_swap_cache - free swap cache and put refs
 * @folios: array of `struct folio_region`s to release
 * @nr: number of folio regions in array
 *
 * Each `struct folio_region` describes the start and end pfn of a region within
 * a single folio. The folio's swap cache is freed, then the folio reference
 * count is decremented once for each pfn in the region. If it falls to zero,
 * remove the folio from the LRU and free it.
 */
void free_folio_regions_and_swap_cache(struct folio_region *folios, int nr)
{
	struct folio_batch fbatch;
	unsigned int i;

	folio_batch_init(&fbatch);
	lru_add_drain();

	for (i = 0; i < nr; i++) {
		struct folio *folio = pfn_folio(folios[i].pfn_start);
		int refs = folios[i].pfn_end - folios[i].pfn_start;

		free_swap_cache(folio);

		if (is_huge_zero_page(&folio->page))
			continue;

		if (folio_is_zone_device(folio)) {
			if (put_devmap_managed_page_refs(&folio->page, refs))
				continue;
			if (folio_ref_sub_and_test(folio, refs))
				free_zone_device_page(&folio->page);
			continue;
		}

		if (!folio_ref_sub_and_test(folio, refs))
			continue;

		/* hugetlb has its own memcg */
		if (folio_test_hugetlb(folio)) {
			free_huge_folio(folio);
			continue;
		}

		if (folio_batch_add(&fbatch, folio) == 0)
			release_unref_folios(&fbatch);
	}

	if (fbatch.nr)
		release_unref_folios(&fbatch);
}

/**
 * folios_put - Decrement the reference count on a batch of folios.
 * @folios: The folios.
 *
 * Like folio_put(), but for a batch of folios.  This is more efficient
 * than writing the loop yourself as it will optimise the locks which need
 * to be taken if the folios are freed.  The folios batch is returned
 * empty and ready to be reused for another batch; there is no need to
 * reinitialise it.
 *
 * Context: May be called in process or interrupt context, but not in NMI
 * context.  May be called while holding a spinlock.
 */
void folios_put(struct folio_batch *folios)
{
	int i, j;

	for (i = 0, j = 0; i < folios->nr; i++) {
		struct folio *folio = folios->folios[i];

		if (is_huge_zero_page(&folio->page))
			continue;

		if (folio_is_zone_device(folio)) {
			if (put_devmap_managed_page(&folio->page))
				continue;
			if (folio_put_testzero(folio))
				free_zone_device_page(&folio->page);
			continue;
		}

		if (!folio_put_testzero(folio))
			continue;

		/* hugetlb has its own memcg */
		if (folio_test_hugetlb(folio)) {
			free_huge_folio(folio);
			continue;
		}

		if (j != i)
			folios->folios[j] = folio;
		j++;
	}

	folios->nr = j;
	if (!j)
		return;
	release_unref_folios(folios);
}
EXPORT_SYMBOL(folios_put);
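
For completeness, with this refactoring __folio_batch_release() reduces to
something like the below (a sketch based on your "mm: Use folios_put() in
__folio_batch_release()" patch; since folios_put() returns the batch empty,
no explicit folio_batch_reinit() is needed):

void __folio_batch_release(struct folio_batch *fbatch)
{
	if (!fbatch->percpu_pvec_drained) {
		lru_add_drain();
		fbatch->percpu_pvec_drained = true;
	}
	folios_put(fbatch);
}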


Let me know if you have any thoughts.

Thanks,
Ryan
Matthew Wilcox Sept. 5, 2023, 1:15 p.m. UTC | #2
On Mon, Sep 04, 2023 at 02:25:41PM +0100, Ryan Roberts wrote:
> I've been doing some benchmarking of this series, as promised, but have hit an oops. It doesn't appear to be easily reproducible, and I'm struggling to figure out the root cause, so thought I would share in case you have any ideas?

I didn't hit that with my testing.  Admittedly I was using xfs rather
than ext4, but ...

>  UBSAN: array-index-out-of-bounds in mm/page_alloc.c:668:46
>  index 10283 is out of range for type 'list_head [6]'
>  pstate: 004000c9 (nzcv daIF +PAN -UAO -TCO -DIT -SSBS BTYPE=--)
>  pc : free_pcppages_bulk+0x330/0x7f8
>  lr : free_pcppages_bulk+0x7a8/0x7f8
>  sp : ffff8000aeef3680
>  x29: ffff8000aeef3680 x28: 000000000000282b x27: 00000000000000fc
>  x26: 000000008015a39a x25: ffff08181ef9e840 x24: ffff0818836caf80
>  x23: 0000000000000001 x22: 0000000000000000 x21: ffff08181ef9e850
>  x20: fffffc200368e680 x19: fffffc200368e6c0 x18: 0000000000000000
>  x17: 3d3d3d3d3d3d3d3d x16: 3d3d3d3d3d3d3d3d x15: 3d3d3d3d3d3d3d3d
>  x14: 3d3d3d3d3d3d3d3d x13: 3d3d3d3d3d3d3d3d x12: 3d3d3d3d3d3d3d3d
>  x11: 3d3d3d3d3d3d3d3d x10: 3d3d3d3d3d3d3d3d x9 : fffffc200368e688
>  x8 : fffffc200368e680 x7 : 205d343737333639 x6 : ffff08181dee0000
>  x5 : ffff0818836caf80 x4 : 0000000000000000 x3 : 0000000000000001
>  x2 : ffff0818836f3330 x1 : ffff0818836f3230 x0 : 006808190c066707
>  Call trace:
>   free_pcppages_bulk+0x330/0x7f8
>   free_unref_page_commit+0x15c/0x250
>   free_unref_folios+0x37c/0x4a8
>   release_unref_folios+0xac/0xf8
>   folios_put+0xe0/0x1f0
>   __folio_batch_release+0x34/0x88
>   truncate_inode_pages_range+0x160/0x540
>   truncate_inode_pages_final+0x58/0x90
>   ext4_evict_inode+0x164/0x900
>   evict+0xac/0x160
>   iput+0x170/0x228
>   do_unlinkat+0x1d0/0x290
>   __arm64_sys_unlinkat+0x48/0x98
> 
> UBSAN is complaining about migratetype being out of range here:
> 
> /* Used for pages not on another list */
> static inline void add_to_free_list(struct page *page, struct zone *zone,
> 				    unsigned int order, int migratetype)
> {
> 	struct free_area *area = &zone->free_area[order];
> 
> 	list_add(&page->buddy_list, &area->free_list[migratetype]);
> 	area->nr_free++;
> }
> 
> And I think that is called from __free_one_page(), which is called
> from free_pcppages_bulk() at the top of the stack trace. migratetype
> originates from get_pcppage_migratetype(page), which is page->index. But
> I can't see where this might be getting corrupted, or how yours or my
> changes could affect this.

Agreed with your analysis.

My best guess is that page->index still contains the file index from
when this page was in the page cache instead of being overwritten with
the migratetype.  This is ext4, so large folios aren't in use.
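
The overwrite is supposed to happen when the page is first freed to the
pcp list; in v6.5 that's roughly:

static bool free_unref_page_prepare(struct page *page, unsigned long pfn,
							unsigned int order)
{
	int migratetype;

	if (!free_pages_prepare(page, order, FPI_NONE))
		return false;

	/* page->index should lose its file index here */
	migratetype = get_pfnblock_migratetype(page, pfn);
	set_pcppage_migratetype(page, migratetype);
	return true;
}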

I'll look more later, but I don't immediately see the problem.
Ryan Roberts Sept. 5, 2023, 1:26 p.m. UTC | #3
On 05/09/2023 14:15, Matthew Wilcox wrote:
> On Mon, Sep 04, 2023 at 02:25:41PM +0100, Ryan Roberts wrote:
>> I've been doing some benchmarking of this series, as promised, but have hit an oops. It doesn't appear to be easily reproducible, and I'm struggling to figure out the root cause, so thought I would share in case you have any ideas?
> 
> I didn't hit that with my testing.  Admittedly I was using xfs rather
> than ext4, but ...

I've only seen it once.

I have a bit of a hybrid setup - my rootfs is xfs (and using large folios), but
the linux tree (which is being built during the benchmark) is on an ext4
partition. Large anon folios is enabled in this config, so there will be plenty
of large folios in the system.

I'm not sure if the fact that this fired from the ext4 path is too relevant -
the page with the dodgy index is already on the PCP list so may or may not be large.

> 
>>  UBSAN: array-index-out-of-bounds in mm/page_alloc.c:668:46
>>  index 10283 is out of range for type 'list_head [6]'
>>  pstate: 004000c9 (nzcv daIF +PAN -UAO -TCO -DIT -SSBS BTYPE=--)
>>  pc : free_pcppages_bulk+0x330/0x7f8
>>  lr : free_pcppages_bulk+0x7a8/0x7f8
>>  sp : ffff8000aeef3680
>>  x29: ffff8000aeef3680 x28: 000000000000282b x27: 00000000000000fc
>>  x26: 000000008015a39a x25: ffff08181ef9e840 x24: ffff0818836caf80
>>  x23: 0000000000000001 x22: 0000000000000000 x21: ffff08181ef9e850
>>  x20: fffffc200368e680 x19: fffffc200368e6c0 x18: 0000000000000000
>>  x17: 3d3d3d3d3d3d3d3d x16: 3d3d3d3d3d3d3d3d x15: 3d3d3d3d3d3d3d3d
>>  x14: 3d3d3d3d3d3d3d3d x13: 3d3d3d3d3d3d3d3d x12: 3d3d3d3d3d3d3d3d
>>  x11: 3d3d3d3d3d3d3d3d x10: 3d3d3d3d3d3d3d3d x9 : fffffc200368e688
>>  x8 : fffffc200368e680 x7 : 205d343737333639 x6 : ffff08181dee0000
>>  x5 : ffff0818836caf80 x4 : 0000000000000000 x3 : 0000000000000001
>>  x2 : ffff0818836f3330 x1 : ffff0818836f3230 x0 : 006808190c066707
>>  Call trace:
>>   free_pcppages_bulk+0x330/0x7f8
>>   free_unref_page_commit+0x15c/0x250
>>   free_unref_folios+0x37c/0x4a8
>>   release_unref_folios+0xac/0xf8
>>   folios_put+0xe0/0x1f0
>>   __folio_batch_release+0x34/0x88
>>   truncate_inode_pages_range+0x160/0x540
>>   truncate_inode_pages_final+0x58/0x90
>>   ext4_evict_inode+0x164/0x900
>>   evict+0xac/0x160
>>   iput+0x170/0x228
>>   do_unlinkat+0x1d0/0x290
>>   __arm64_sys_unlinkat+0x48/0x98
>>
>> UBSAN is complaining about migratetype being out of range here:
>>
>> /* Used for pages not on another list */
>> static inline void add_to_free_list(struct page *page, struct zone *zone,
>> 				    unsigned int order, int migratetype)
>> {
>> 	struct free_area *area = &zone->free_area[order];
>>
>> 	list_add(&page->buddy_list, &area->free_list[migratetype]);
>> 	area->nr_free++;
>> }
>>
>> And I think that is called from __free_one_page(), which is called
>> from free_pcppages_bulk() at the top of the stack trace. migratetype
>> originates from get_pcppage_migratetype(page), which is page->index. But
>> I can't see where this might be getting corrupted, or how yours or my
>> changes could affect this.
> 
> Agreed with your analysis.
> 
> My best guess is that page->index still contains the file index from
> when this page was in the page cache instead of being overwritten with
> the migratetype.  

Yeah that was my guess too. But I couldn't see how that was possible. So started
thinking it could be some separate corruption somehow...

> This is ext4, so large folios aren't in use.
> 
> I'll look more later, but I don't immediately see the problem.
>
Matthew Wilcox Sept. 5, 2023, 2 p.m. UTC | #4
On Tue, Sep 05, 2023 at 02:26:54PM +0100, Ryan Roberts wrote:
> On 05/09/2023 14:15, Matthew Wilcox wrote:
> > On Mon, Sep 04, 2023 at 02:25:41PM +0100, Ryan Roberts wrote:
> >> I've been doing some benchmarking of this series, as promised, but have hit an oops. It doesn't appear to be easily reproducible, and I'm struggling to figure out the root cause, so thought I would share in case you have any ideas?
> > 
> > I didn't hit that with my testing.  Admittedly I was using xfs rather
> > than ext4, but ...
> 
> I've only seen it once.
> 
> I have a bit of a hybrid setup - my rootfs is xfs (and using large folios), but
> the linux tree (which is being built during the benchmark) is on an ext4
> partition. Large anon folios is enabled in this config, so there will be plenty
> of large folios in the system.
> 
> I'm not sure if the fact that this fired from the ext4 path is too relevant -
> the page with the dodgy index is already on the PCP list so may or may not be large.

Indeed.  I have a suspicion that this may be more common, but if pages
are commonly freed to and allocated from the PCP list without ever being
transferred to the free list, we'll never see it.  Perhaps adding a
check when pages are added to the PCP list that page->index is less
than 8 would catch the miscreant relatively quickly?

> > My best guess is that page->index still contains the file index from
> > when this page was in the page cache instead of being overwritten with
> > the migratetype.  
> 
> Yeah that was my guess too. But I couldn't see how that was possible. So started
> thinking it could be some separate corruption somehow...

It could be, but a value around 10,000 would put the page at an offset
of 40MB into the file, which is entirely plausible.  But yes, it could
have been some entirely separate path that freed it.

VM_BUG_ON_PAGE() would dump the entire struct page, which will give us
a lot more clues, including the order of the page.
Matthew Wilcox Sept. 6, 2023, 3:48 a.m. UTC | #5
On Tue, Sep 05, 2023 at 03:00:51PM +0100, Matthew Wilcox wrote:
> On Tue, Sep 05, 2023 at 02:26:54PM +0100, Ryan Roberts wrote:
> > On 05/09/2023 14:15, Matthew Wilcox wrote:
> > > On Mon, Sep 04, 2023 at 02:25:41PM +0100, Ryan Roberts wrote:
> > >> I've been doing some benchmarking of this series, as promised, but have hit an oops. It doesn't appear to be easily reproducible, and I'm struggling to figure out the root cause, so thought I would share in case you have any ideas?
> > > 
> > > I didn't hit that with my testing.  Admittedly I was using xfs rather
> > > than ext4, but ...
> > 
> > I've only seen it once.
> > 
> > I have a bit of a hybrid setup - my rootfs is xfs (and using large folios), but
> > the linux tree (which is being built during the benchmark) is on an ext4
> > partition. Large anon folios is enabled in this config, so there will be plenty
> > of large folios in the system.
> > 
> > I'm not sure if the fact that this fired from the ext4 path is too relevant -
> > the page with the dodgy index is already on the PCP list so may or may not be large.
> 
> Indeed.  I have a suspicion that this may be more common, but if pages
> are commonly freed to and allocated from the PCP list without ever being
> transferred to the free list, we'll never see it.  Perhaps adding a
> check when pages are added to the PCP list that page->index is less
> than 8 would catch the miscreant relatively quickly?

Somehow my qemu setup started working again.  This stuff is black magic.

Anyway, I did this:

+++ b/mm/page_alloc.c
@@ -2405,6 +2405,7 @@ static void free_unref_page_commit(struct zone *zone, struct per_cpu_pages *pcp,

        __count_vm_events(PGFREE, 1 << order);
        pindex = order_to_pindex(migratetype, order);
+       VM_BUG_ON_PAGE(page->index > 7, page);
        list_add(&page->pcp_list, &pcp->lists[pindex]);
        pcp->count += 1 << order;


but I haven't caught a wascally wabbit yet after an hour of running
xfstests.  I think that's the only place we add a page to the
pcp->lists.
Ryan Roberts Sept. 6, 2023, 10:23 a.m. UTC | #6
On 06/09/2023 04:48, Matthew Wilcox wrote:
> On Tue, Sep 05, 2023 at 03:00:51PM +0100, Matthew Wilcox wrote:
>> On Tue, Sep 05, 2023 at 02:26:54PM +0100, Ryan Roberts wrote:
>>> On 05/09/2023 14:15, Matthew Wilcox wrote:
>>>> On Mon, Sep 04, 2023 at 02:25:41PM +0100, Ryan Roberts wrote:
>>>>> I've been doing some benchmarking of this series, as promised, but have hit an oops. It doesn't appear to be easily reproducible, and I'm struggling to figure out the root cause, so thought I would share in case you have any ideas?
>>>>
>>>> I didn't hit that with my testing.  Admittedly I was using xfs rather
>>>> than ext4, but ...
>>>
>>> I've only seen it once.
>>>
>>> I have a bit of a hybrid setup - my rootfs is xfs (and using large folios), but
>>> the linux tree (which is being built during the benchmark) is on an ext4
>>> partition. Large anon folios is enabled in this config, so there will be plenty
>>> of large folios in the system.
>>>
>>> I'm not sure if the fact that this fired from the ext4 path is too relevant -
>>> the page with the dodgy index is already on the PCP list so may or may not be large.
>>
>> Indeed.  I have a suspicion that this may be more common, but if pages
>> are commonly freed to and allocated from the PCP list without ever being
>> transferred to the free list, we'll never see it.  Perhaps adding a
>> check when pages are added to the PCP list that page->index is less
>> than 8 would catch the miscreant relatively quickly?
> 
> Somehow my qemu setup started working again.  This stuff is black magic.
> 
> Anyway, I did this:
> 
> +++ b/mm/page_alloc.c
> @@ -2405,6 +2405,7 @@ static void free_unref_page_commit(struct zone *zone, struct per_cpu_pages *pcp,
> 
>         __count_vm_events(PGFREE, 1 << order);
>         pindex = order_to_pindex(migratetype, order);
> +       VM_BUG_ON_PAGE(page->index > 7, page);
>         list_add(&page->pcp_list, &pcp->lists[pindex]);
>         pcp->count += 1 << order;
> 
> 
> but I haven't caught a wascally wabbit yet after an hour of running
> xfstests.  I think that's the only place we add a page to the
> pcp->lists.


I added a smattering of VM_BUG_ON_PAGE(page->index > 5, page) to the places where the page is added to and removed from the pcp lists. One triggered on removing the page from the list (the same place I saw the UBSAN oops previously). But there is no page info dumped! I've enabled CONFIG_DEBUG_VM (and friends). I can't see how it's possible to get the BUG report but not the dump_page() bit - what am I doing wrong?
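
The dequeue-side check sits in free_pcppages_bulk(), along these lines (a
sketch; the exact placement in my tree may differ):

	/* inside the per-pindex loop of free_pcppages_bulk() */
	page = list_last_entry(list, struct page, pcp_list);
	VM_BUG_ON_PAGE(page->index > 5, page);
	mt = get_pcppage_migratetype(page);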

Anyway, the fact that it did not trigger on insertion into the list suggests this is a corruption issue? I'll keep trying...



[  334.307831] kernel BUG at mm/page_alloc.c:1217!
[  334.312351] Internal error: Oops - BUG: 00000000f2000800 [#1] SMP
[  334.318433] Modules linked in: nfs lockd grace sunrpc fscache netfs nls_iso8859_1 scsi_dh_rdac scsi_dh_emc scsi_dh_alua drm xfs btrfs blake2b_generic raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor xor_neon raid6_pq libcrc32c raid1 raid0 multipath linear mlx5_ib ib_uverbs ib_core mlx5_core mlxfw pci_hyperv_intf crct10dif_ce ghash_ce sha2_ce sha256_arm64 sha1_ce tls nvme psample nvme_core aes_neon_bs aes_neon_blk aes_ce_blk aes_ce_cipher
[  334.359704] CPU: 26 PID: 260858 Comm: git Not tainted 6.5.0-rc4-ryarob01-all-debug #1
[  334.367521] Hardware name: WIWYNN Mt.Jade Server System B81.03001.0005/Mt.Jade Motherboard, BIOS 1.08.20220218 (SCP: 1.08.20220218) 2022/02/18
[  334.380285] pstate: 604000c9 (nZCv daIF +PAN -UAO -TCO -DIT -SSBS BTYPE=--)
[  334.387233] pc : free_pcppages_bulk+0x1b0/0x2d0
[  334.391753] lr : free_pcppages_bulk+0x1b0/0x2d0
[  334.396270] sp : ffff80008fe9b810
[  334.399571] x29: ffff80008fe9b810 x28: 0000000000000001 x27: 00000000000000e7
[  334.406694] x26: fffffc2007e07780 x25: ffff08181ed8f840 x24: ffff08181ed8f868
[  334.413817] x23: 0000000000000000 x22: ffff0818836caf80 x21: ffff800081bbf008
[  334.420939] x20: 0000000000000001 x19: ffff08181ed8f850 x18: 0000000000000000
[  334.428061] x17: 6666373066666666 x16: 2066666666666666 x15: 6632303030303030
[  334.435184] x14: 0000000000000000 x13: 2935203e20746d28 x12: 454741505f4e4f5f
[  334.442306] x11: 4755425f4d56203a x10: 6573756163656220 x9 : ffff80008014ef40
[  334.449429] x8 : 5f4d56203a657375 x7 : 6163656220646570 x6 : ffff08181dee0000
[  334.456551] x5 : ffff08181ed78d88 x4 : 0000000000000000 x3 : ffff80008fe9b538
[  334.463674] x2 : 0000000000000000 x1 : 0000000000000000 x0 : 000000000000002b
[  334.470796] Call trace:
[  334.473230]  free_pcppages_bulk+0x1b0/0x2d0
[  334.477401]  free_unref_page_commit+0x124/0x2a8
[  334.481918]  free_unref_folios+0x3b4/0x4e8
[  334.486003]  release_unref_folios+0xac/0xf8
[  334.490175]  folios_put+0x100/0x228
[  334.493651]  __folio_batch_release+0x34/0x88
[  334.497908]  truncate_inode_pages_range+0x168/0x690
[  334.502773]  truncate_inode_pages_final+0x58/0x90
[  334.507464]  ext4_evict_inode+0x164/0x900
[  334.511463]  evict+0xac/0x160
[  334.514419]  iput+0x170/0x228
[  334.517375]  do_unlinkat+0x1d0/0x290
[  334.520938]  __arm64_sys_unlinkat+0x48/0x98
[  334.525108]  invoke_syscall+0x74/0xf8
[  334.528758]  el0_svc_common.constprop.0+0x58/0x130
[  334.533536]  do_el0_svc+0x40/0xa8
[  334.536837]  el0_svc+0x2c/0xb8
[  334.539881]  el0t_64_sync_handler+0xc0/0xc8
[  334.544052]  el0t_64_sync+0x1a8/0x1b0
[  334.547703] Code: aa1a03e0 90009dc1 91072021 97ff1097 (d4210000) 
[  334.553783] ---[ end trace 0000000000000000 ]---