Message ID | 20230317105802.2634004-1-ryan.roberts@arm.com (mailing list archive)
---|---
Series | variable-order, large folios for anonymous memory
Hi Matthew,

On 17/03/2023 10:57, Ryan Roberts wrote:
> Hi All,
>
> [...]
>
> Bug(s)
> ======
>
> When I run this code without the last (workaround) patch, with DEBUG_VM et al,
> PROVE_LOCKING and KASAN enabled, I see occasional oopses. Mostly these relate
> to invalid kernel addresses (which usually look like either NULL + a small
> offset, or mostly zeros with a few mid-order bits set + a small offset), or
> lockdep complaining about a bad unlock balance. Call stacks are often in
> madvise_free_pte_range(), but I've seen them in filesystem code too. (I can
> email example oopses out separately if anyone wants to review them.) My hunch
> is that struct pages adjacent to the folio are being corrupted, but I don't
> have hard evidence.
>
> When adding the workaround patch, which prevents madvise_free_pte_range() from
> attempting to split a large folio, I never see any issues. I'm not putting the
> system under memory pressure, though, so I guess I might see the same types of
> problem crop up under swap, etc.
>
> I've reviewed most of the code within split_folio() and can't find any smoking
> gun, but I wonder if there are implicit assumptions about the large folio being
> PMD-sized that I'm now obviously breaking?
>
> The code in madvise_free_pte_range():
>
> if (folio_test_large(folio)) {
>     if (folio_mapcount(folio) != 1)
>         goto out;
>     folio_get(folio);
>     if (!folio_trylock(folio)) {
>         folio_put(folio);
>         goto out;
>     }
>     pte_unmap_unlock(orig_pte, ptl);
>     if (split_folio(folio)) {
>         folio_unlock(folio);
>         folio_put(folio);
>         orig_pte = pte_offset_map_lock(mm, pmd, addr, &ptl);
>         goto out;
>     }
>     ...
> }

I've noticed that it's folio_split() with a folio order of 1 that causes my
problems. And I also see that the page cache code explicitly never allocates
order-1 folios:

void page_cache_ra_order(struct readahead_control *ractl,
        struct file_ra_state *ra, unsigned int new_order)
{
    ...

    while (index <= limit) {
        unsigned int order = new_order;

        /* Align with smaller pages if needed */
        if (index & ((1UL << order) - 1)) {
            order = __ffs(index);
            if (order == 1)
                order = 0;
        }
        /* Don't allocate pages past EOF */
        while (index + (1UL << order) - 1 > limit) {
            if (--order == 1)
                order = 0;
        }
        err = ra_alloc_folio(ractl, index, mark, order, gfp);
        if (err)
            break;
        index += 1UL << order;
    }

    ...
}

Matthew, what is the reason for this? I suspect it's guarding against the same
problem I'm seeing.

If I explicitly prevent order-1 allocations for anon pages, I'm unable to cause
any oops/panic/etc. I'd just like to understand the root cause.

Thanks,
Ryan

> Will normally skip my large folios because they have a mapcount > 1, due to
> incrementing mapcount for each pte, unlike PMD-mapped pages. But on occasion it
> will see a mapcount of 1 and proceed. So I guess this is racing against reclaim
> or CoW in this case?
>
> I also see it's doing a dance to take the folio lock and drop the ptl. Perhaps
> my large anon folio is not using the folio lock in the same way as a THP would
> and we are therefore not getting the expected serialization?
>
> I'd really appreciate any suggestions for how to progress here!
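For illustration only: the order-1 avoidance Ryan points at above, both in
page_cache_ra_order() and in his "explicitly prevent order-1 allocations for
anon pages" experiment, amounts to a simple clamp. The helper below is a
hypothetical sketch; the name is made up and it is not code from the kernel or
the patch series.

/*
 * Hypothetical sketch, not from the patch series: mirror the order-1
 * avoidance seen in page_cache_ra_order() above. An order-1 folio is
 * never allocated; fall back to a single order-0 page instead.
 */
static inline unsigned int clamp_folio_order(unsigned int order)
{
    return order == 1 ? 0 : order;
}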
On 3/22/2023 8:03 PM, Ryan Roberts wrote:
> Hi Matthew,
>
> On 17/03/2023 10:57, Ryan Roberts wrote:
>> Hi All,
>>
>> [...]
>
> [...]
>
> I've noticed that it's folio_split() with a folio order of 1 that causes my
> problems. And I also see that the page cache code explicitly never allocates
> order-1 folios:
>
> [...]
>
> Matthew, what is the reason for this? I suspect it's guarding against the same
> problem I'm seeing.
>
> If I explicitly prevent order-1 allocations for anon pages, I'm unable to cause
> any oops/panic/etc. I'd just like to understand the root cause.

I checked the struct folio definition: _deferred_list is in the third page's
struct page. My understanding is that to support folio split, the folio order
must be >= 2.

Thanks.

Regards
Yin, Fengwei
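To illustrate Yin's point with a hypothetical helper (this is not an existing
kernel function): because _deferred_list is overlaid on the folio's third
struct page, only a folio spanning at least four pages, i.e. order >= 2, has
space for it at all, so an order-1 folio cannot go through the deferred-split
machinery.

#include <linux/mm.h>

/*
 * Hypothetical sketch, not an existing kernel helper: _deferred_list lives
 * in the folio's third struct page, so only folios of order >= 2 have room
 * for it. An order-1 folio has just two struct pages and therefore cannot
 * participate in deferred splitting.
 */
static inline bool folio_has_deferred_list(struct folio *folio)
{
    return folio_order(folio) >= 2;
}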
On 22/03/2023 13:36, Yin, Fengwei wrote:
> On 3/22/2023 8:03 PM, Ryan Roberts wrote:
>> [...]
>>
>> Matthew, what is the reason for this? I suspect it's guarding against the
>> same problem I'm seeing.
>>
>> If I explicitly prevent order-1 allocations for anon pages, I'm unable to
>> cause any oops/panic/etc. I'd just like to understand the root cause.
>
> I checked the struct folio definition: _deferred_list is in the third page's
> struct page. My understanding is that to support folio split, the folio order
> must be >= 2.

Yep, looks like we have found the root cause - thanks for your help!

I've updated calc_anonymous_folio_order() to only use a non-0 order if THP is
available and, in that case, never allocate order-1. I think that both fixes
the problem and manages the dependency we have on THP:

static void calc_anonymous_folio_order(struct vm_fault *vmf,
                                       int *order_out,
                                       unsigned long *addr_out)
{
    /*
     * The aim here is to determine what size of folio we should allocate
     * for this fault. Factors include:
     * - Folio must be naturally aligned within VA space
     * - Folio must not breach boundaries of vma
     * - Folio must be fully contained inside one pmd entry
     * - Folio must not overlap any non-none ptes
     * - Order must not be higher than *order_out upon entry
     *
     * Additionally, we do not allow order-1 since this breaks assumptions
     * elsewhere in the mm; THP pages must be at least order-2 (since they
     * store state up to the 3rd struct page subpage), and these pages must
     * be THP in order to correctly use pre-existing THP infrastructure such
     * as folio_split().
     *
     * As a consequence of relying on the THP infrastructure, if the system
     * does not support THP, we always fall back to order-0.
     *
     * Note that the caller may or may not choose to lock the pte. If
     * unlocked, the calculation should be considered an estimate that will
     * need to be validated under the lock.
     */
    struct vm_area_struct *vma = vmf->vma;
    int nr;
    int order;
    unsigned long addr;
    pte_t *pte;
    pte_t *first_set = NULL;
    int ret;

    if (has_transparent_hugepage()) {
        order = min(*order_out, PMD_SHIFT - PAGE_SHIFT);

        for (; order > 1; order--) {
            nr = 1 << order;
            addr = ALIGN_DOWN(vmf->address, nr * PAGE_SIZE);
            pte = vmf->pte - ((vmf->address - addr) >> PAGE_SHIFT);

            /* Check vma bounds. */
            if (addr < vma->vm_start ||
                addr + nr * PAGE_SIZE > vma->vm_end)
                continue;

            /* Ptes covered by order already known to be none. */
            if (pte + nr <= first_set)
                break;

            /* Already found set pte in range covered by order. */
            if (pte <= first_set)
                continue;

            /* Need to check if all the ptes are none. */
            ret = check_all_ptes_none(pte, nr);
            if (ret == nr)
                break;

            first_set = pte + ret;
        }

        if (order == 1)
            order = 0;
    } else {
        order = 0;
    }

    *order_out = order;
    *addr_out = order > 0 ? addr : vmf->address;
}
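The function above calls check_all_ptes_none(), which is not shown anywhere in
the thread. Judging from how its return value is used (ret == nr means the
whole range is clear, and first_set = pte + ret points at the first populated
entry), a plausible implementation might look like the sketch below. This is an
inference for illustration, not code from the patches.

#include <linux/pgtable.h>

/*
 * Inferred sketch, not from the patch series: scan nr ptes starting at pte
 * and return the index of the first non-none entry, or nr if every pte in
 * the range is none. The caller is responsible for any locking.
 */
static int check_all_ptes_none(pte_t *pte, int nr)
{
    int i;

    for (i = 0; i < nr; i++) {
        if (!pte_none(ptep_get(pte + i)))
            break;
    }

    return i;
}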