Message ID | d10037699391c42a4943f578c2a0bca890640f30.1742478846.git.lorenzo.stoakes@oracle.com (mailing list archive) |
---|---|
State | New |
Series | mm/mremap: introduce more mergeable mremap via MREMAP_RELOCATE_ANON |
On Fri, Mar 21, 2025 at 10:54 PM Lorenzo Stoakes <lorenzo.stoakes@oracle.com> wrote: > diff --git a/mm/mremap.c b/mm/mremap.c > index 0865387531ed..bb67562a0114 100644 > --- a/mm/mremap.c > +++ b/mm/mremap.c [...] > +/* > + * If the folio mapped at the specified pte entry can have its index and mapping > + * relocated, then do so. > + * > + * Returns the number of pages we have traversed, or 0 if the operation failed. > + */ > +static unsigned long relocate_anon(struct pagetable_move_control *pmc, > + unsigned long old_addr, unsigned long new_addr, pte_t pte, > + bool undo) > +{ > + struct page *page; > + struct folio *folio; > + struct vm_area_struct *old, *new; > + pgoff_t new_index; > + unsigned long ret = 1; > + > + old = pmc->old; > + new = pmc->new; > + > + /* Ensure we have truly got an anon folio. */ > + page = vm_normal_page(old, old_addr, pte); > + if (!page) > + return ret; > + folio = page_folio(page); > + folio_lock(folio); > + > + /* no-op. */ > + if (!folio_test_anon(folio) || folio_test_ksm(folio)) > + goto out; > + > + /* > + * This should not happen as we explicitly disallow this, but check > + * anyway. > + */ > + if (folio_test_large(folio)) { > + ret = 0; > + goto out; > + } Do I understand correctly that you assume here that the page is exclusively mapped? Maybe we could at least WARN_ON(folio_mapcount(folio) != 1) or something like that? (I was also wondering if the PageAnonExclusive bit is somehow relevant, but we should probably not look at or touch that here, unless we want to think about cases where we _used to_ have a child from which the page may have been GUP'd...) > + if (!undo) > + new_index = linear_page_index(new, new_addr); > + else > + new_index = linear_page_index(old, old_addr); > + > + /* > + * The PTL should keep us safe from unmapping, and the fact the folio is > + * a PTE keeps the folio referenced. > + * > + * The mmap/VMA locks should keep us safe from fork and other processes. > + * > + * The rmap locks should keep us safe from anything happening to the > + * VMA/anon_vma. "The rmap locks"? I only see that we're holding the rmap lock on the new anon_vma; are we also holding a lock on the old anon_vma? > + * The folio lock should keep us safe from reclaim, migration, etc. > + */ > + folio_move_anon_rmap(folio, undo ? old : new); > + WRITE_ONCE(folio->index, new_index); > + > +out: > + folio_unlock(folio); > + return ret; > +} [...] > +static bool relocate_anon_ptes(struct pagetable_move_control *pmc, > + unsigned long extent, pmd_t *pmd, bool undo) > +{ > + struct mm_struct *mm = current->mm; > + struct pte_state state = { > + .old_addr = pmc->old_addr, > + .new_addr = pmc->new_addr, > + .old_end = pmc->old_addr + extent, > + }; > + spinlock_t *ptl; > + pte_t *ptep_start; > + bool ret; > + unsigned long nr_pages; > + > + ptep_start = pte_offset_map_lock(mm, pmd, pmc->old_addr, &ptl); > + /* > + * We prevent faults with mmap write lock, hold the rmap lock and should > + * not fail to obtain this lock. Just give up if we can't. 
> + */ > + if (!ptep_start) > + return false; > + > + state.ptep = ptep_start; > + > + for (; !pte_done(&state); pte_next(&state, nr_pages)) { > + pte_t pte = ptep_get(state.ptep); > + > + if (pte_none(pte) || !pte_present(pte)) { > + nr_pages = 1; > + continue; > + } > + > + nr_pages = relocate_anon(pmc, state.old_addr, state.new_addr, pte, undo); > + if (!nr_pages) { > + ret = false; > + goto out; > + } > + } > + > + ret = true; > +out: > + pte_unmap_unlock(ptep_start, ptl); > + return ret; > +} Just to make sure I understand correctly: This function is changing the ->pgoff and ->mapping of pages while they are still mapped in the old VMA, right? And if that fails midway through for whatever reason, we go and change all the already-changed ->pgoff and ->mapping values back? [...] > @@ -1132,6 +1347,42 @@ static void unmap_source_vma(struct vma_remap_struct *vrm) > } > } > > +/* > + * Should we attempt to relocate anonymous folios to the location that the VMA > + * is being moved to by updating index and mapping fields accordingly? > + */ > +static bool should_relocate_anon(struct vma_remap_struct *vrm, > + struct pagetable_move_control *pmc) > +{ > + struct vm_area_struct *old = vrm->vma; > + > + /* Currently we only do this if requested. */ > + if (!(vrm->flags & MREMAP_RELOCATE_ANON)) > + return false; > + > + /* We can't deal with special or hugetlb mappings. */ > + if (old->vm_flags & (VM_SPECIAL | VM_HUGETLB)) > + return false; > + > + /* We only support anonymous mappings. */ > + if (!vma_is_anonymous(old)) > + return false; > + > + /* If no folios are mapped, then no need to attempt this. */ > + if (!old->anon_vma) > + return false; > + > + /* > + * If the old VMA is a child (i.e. has been forked), then the index > + * references multiple VMAs, we have to bail. > + */ > + if (!list_is_singular(&old->anon_vma_chain)) > + return false; I think this is wrong: list_is_singular(&old->anon_vma_chain) tells you whether pages in the VMA might be shared due to this mm_struct being forked from a parent mm_struct; but it won't tell you whether pages in the VMA might be shared due to this mm_struct being the parent of another mm_struct (other way around). I guess checking old->anon_vma->root->num_children could maybe work... > + > + /* Otherwise, we're good to go! */ > + return true; > +} > + > /* > * Copy vrm->vma over to vrm->new_addr possibly adjusting size as part of the > * process. Additionally handle an error occurring on moving of page tables, > @@ -1151,9 +1402,11 @@ static int copy_vma_and_data(struct vma_remap_struct *vrm, > struct vm_area_struct *new_vma; > int err = 0; > PAGETABLE_MOVE(pmc, NULL, NULL, vrm->addr, vrm->new_addr, vrm->old_len); > + bool relocate_anon = should_relocate_anon(vrm, &pmc); > > +again: > new_vma = copy_vma(&vma, vrm->new_addr, vrm->new_len, new_pgoff, > - &pmc.need_rmap_locks); > + &pmc.need_rmap_locks, &relocate_anon); > if (!new_vma) { > vrm_uncharge(vrm); > *new_vma_ptr = NULL; > @@ -1163,12 +1416,66 @@ static int copy_vma_and_data(struct vma_remap_struct *vrm, > pmc.old = vma; > pmc.new = new_vma; > > + if (relocate_anon) { > + /* > + * We have a new VMA to reassign folios to. We take a lock on > + * its anon_vma so reclaim doesn't fail to unmap mappings. > + * > + * We have acquired a VMA write lock by now (in vma_link()), so > + * we do not have to worry about racing faults. 
> + */ > + anon_vma_lock_write(new_vma->anon_vma); > + pmc.relocate_locked = new_vma; > + > + if (!relocate_anon_folios(&pmc, /* undo= */false)) { > + unsigned long start = new_vma->vm_start; > + unsigned long size = new_vma->vm_end - start; > + > + /* Undo if fails. */ > + relocate_anon_folios(&pmc, /* undo= */true); This relocate_anon_folios() has to make sure to only operate on the subset of PTEs that we already edited successfully, right? > + vrm_stat_account(vrm, vrm->new_len); > + > + anon_vma_unlock_write(new_vma->anon_vma); > + pmc.relocate_locked = NULL; > + > + do_munmap(current->mm, start, size, NULL); > + relocate_anon = false; > + goto again; > + } > + } [...] > diff --git a/mm/vma.c b/mm/vma.c > index 5418eef3a852..09027448753f 100644 > --- a/mm/vma.c > +++ b/mm/vma.c [...] > @@ -1821,6 +1834,14 @@ struct vm_area_struct *copy_vma(struct vm_area_struct **vmap, > new_vma->vm_ops->open(new_vma); > if (vma_link(mm, new_vma)) > goto out_vma_link; > + /* > + * If we're attempting to relocate anonymous VMAs, we > + * don't want to reuse an anon_vma as set by > + * vm_area_dup(), or copy anon_vma_chain or anything > + * like this. > + */ > + if (*relocate_anon && __anon_vma_prepare(new_vma)) > + goto out_vma_link; Is this bailout really okay? We go to the same label as if vma_link() failed, but we're in a very different state: For example, the VMA has already been inserted into the VMA tree with vma_iter_store() as part of vma_link() (unless that changed in one of the prerequisite patches). > *need_rmap_locks = false; > } > return new_vma;
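A minimal sketch of the sanity check suggested here, assuming a small folio already held under the folio lock and PTL; the helper name is illustrative and not part of the posted patch:

static bool anon_folio_looks_exclusive(struct folio *folio)
{
	/* Caller holds the folio lock and the PTE lock. */
	if (!folio_test_anon(folio) || folio_test_ksm(folio))
		return false;
	/* This version of the patch rejects large folios outright. */
	if (folio_test_large(folio))
		return false;
	/* Relocation assumes a single mapping; make that assumption loud. */
	if (WARN_ON_ONCE(folio_mapcount(folio) != 1))
		return false;
	return true;
}

Whether to warn or simply bail, and whether PageAnonExclusive should be consulted instead, is exactly what the replies below discuss.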
On 22.03.25 01:14, Jann Horn wrote: > On Fri, Mar 21, 2025 at 10:54 PM Lorenzo Stoakes > <lorenzo.stoakes@oracle.com> wrote: >> diff --git a/mm/mremap.c b/mm/mremap.c >> index 0865387531ed..bb67562a0114 100644 >> --- a/mm/mremap.c >> +++ b/mm/mremap.c > [...] >> +/* >> + * If the folio mapped at the specified pte entry can have its index and mapping >> + * relocated, then do so. >> + * >> + * Returns the number of pages we have traversed, or 0 if the operation failed. >> + */ >> +static unsigned long relocate_anon(struct pagetable_move_control *pmc, >> + unsigned long old_addr, unsigned long new_addr, pte_t pte, >> + bool undo) >> +{ >> + struct page *page; >> + struct folio *folio; >> + struct vm_area_struct *old, *new; >> + pgoff_t new_index; >> + unsigned long ret = 1; >> + >> + old = pmc->old; >> + new = pmc->new; >> + >> + /* Ensure we have truly got an anon folio. */ >> + page = vm_normal_page(old, old_addr, pte); >> + if (!page) >> + return ret; >> + folio = page_folio(page); >> + folio_lock(folio); >> + >> + /* no-op. */ >> + if (!folio_test_anon(folio) || folio_test_ksm(folio)) >> + goto out; >> + >> + /* >> + * This should not happen as we explicitly disallow this, but check >> + * anyway. >> + */ >> + if (folio_test_large(folio)) { >> + ret = 0; >> + goto out; >> + } > > Do I understand correctly that you assume here that the page is > exclusively mapped? Maybe we could at least > WARN_ON(folio_mapcount(folio) != 1) or something like that? > > (I was also wondering if the PageAnonExclusive bit is somehow > relevant, but we should probably not look at or touch that here, > unless we want to think about cases where we _used to_ have a child > from which the page may have been GUP'd...) UFFDIO_MOVE implements something similar. Right now we keep it simple: if (folio_test_large(src_folio) || folio_maybe_dma_pinned(src_folio) || !PageAnonExclusive(&src_folio->page)) { err = -EBUSY; goto out; } Whereby we a) Make sure we cover all PTEs (-> small folio, single PTE). Large PTE-mapped folios are split. b) Make sure there are no GUP pins (maybe not required here?) c) The folio is exclusive to this process In general, things I can reason about with confidence are: a) As alternative to PageAnonExclusive(), we can check folio_mapcount()==1 under the folio lock for small folios / PMD-mapped folios. As you (Jann) say, there might be unexpected references on the folio from other processes. b) For large (pte-mapped) folios, we could try batching multiple PTEs (folio_pte_batch() etc.). We'd be processing all mappings with folio_lock + folio_mapcount() == #PTEs. c) In -next, there is now be the option to use folio lock + folio_maybe_mapped_shared() == false. But it doesn't tell you into how many VMAs a large folio is mapped into. In the following case: [ folio ] [ VMA#1 ] [ VMA#2 ] c) would not tell you if you are fine modifying the folio when moving VMA#2.
On 22.03.25 06:33, David Hildenbrand wrote: > On 22.03.25 01:14, Jann Horn wrote: >> On Fri, Mar 21, 2025 at 10:54 PM Lorenzo Stoakes >> <lorenzo.stoakes@oracle.com> wrote: >>> diff --git a/mm/mremap.c b/mm/mremap.c >>> index 0865387531ed..bb67562a0114 100644 >>> --- a/mm/mremap.c >>> +++ b/mm/mremap.c >> [...] >>> +/* >>> + * If the folio mapped at the specified pte entry can have its index and mapping >>> + * relocated, then do so. >>> + * >>> + * Returns the number of pages we have traversed, or 0 if the operation failed. >>> + */ >>> +static unsigned long relocate_anon(struct pagetable_move_control *pmc, >>> + unsigned long old_addr, unsigned long new_addr, pte_t pte, >>> + bool undo) >>> +{ >>> + struct page *page; >>> + struct folio *folio; >>> + struct vm_area_struct *old, *new; >>> + pgoff_t new_index; >>> + unsigned long ret = 1; >>> + >>> + old = pmc->old; >>> + new = pmc->new; >>> + >>> + /* Ensure we have truly got an anon folio. */ >>> + page = vm_normal_page(old, old_addr, pte); >>> + if (!page) >>> + return ret; >>> + folio = page_folio(page); >>> + folio_lock(folio); >>> + >>> + /* no-op. */ >>> + if (!folio_test_anon(folio) || folio_test_ksm(folio)) >>> + goto out; >>> + >>> + /* >>> + * This should not happen as we explicitly disallow this, but check >>> + * anyway. >>> + */ >>> + if (folio_test_large(folio)) { >>> + ret = 0; >>> + goto out; >>> + } >> >> Do I understand correctly that you assume here that the page is >> exclusively mapped? Maybe we could at least >> WARN_ON(folio_mapcount(folio) != 1) or something like that? >> >> (I was also wondering if the PageAnonExclusive bit is somehow >> relevant, but we should probably not look at or touch that here, >> unless we want to think about cases where we _used to_ have a child >> from which the page may have been GUP'd...) > > UFFDIO_MOVE implements something similar. Right now we keep it simple: > > if (folio_test_large(src_folio) || > folio_maybe_dma_pinned(src_folio) || > !PageAnonExclusive(&src_folio->page)) { > err = -EBUSY; > goto out; > } > > Whereby we > > a) Make sure we cover all PTEs (-> small folio, single PTE). Large > PTE-mapped folios are split. > > b) Make sure there are no GUP pins (maybe not required here?) > > c) The folio is exclusive to this process On additional note as my memory comes back: if PAE is set, there cannot be other (inactive) mappings from the swapcache. So whenever we use folio lock + mapcount data, the possibility of the swapcache (having inactive mappings from other processes etc.) must be considered.
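Taken together, the conservative test being converged on here might look roughly like the following. This is a sketch only: the swapcache handling is refined further down with folio_free_swap(), and the GUP-pin check may not be required here, as noted above.

static bool can_relocate_small_anon_folio(struct folio *folio)
{
	/* Caller holds the folio lock. */
	if (folio_test_large(folio))
		return false;
	/* Possibly unnecessary here, but mirrors the UFFDIO_MOVE check. */
	if (folio_maybe_dma_pinned(folio))
		return false;
	/*
	 * A folio still in the swapcache may have inactive mappings in other
	 * processes, so the mapcount alone is not enough.
	 */
	if (folio_test_swapcache(folio))
		return false;
	/* Exclusively mapped by the VMA being moved. */
	return folio_mapcount(folio) == 1;
}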
Thanks for the early review :) I gather you're not coming to LSF? Which is a huge pity, would have loved to see you there. But this is much appreciated! :) On Sat, Mar 22, 2025 at 01:14:39AM +0100, Jann Horn wrote: > On Fri, Mar 21, 2025 at 10:54 PM Lorenzo Stoakes > <lorenzo.stoakes@oracle.com> wrote: > > diff --git a/mm/mremap.c b/mm/mremap.c > > index 0865387531ed..bb67562a0114 100644 > > --- a/mm/mremap.c > > +++ b/mm/mremap.c > [...] > > +/* > > + * If the folio mapped at the specified pte entry can have its index and mapping > > + * relocated, then do so. > > + * > > + * Returns the number of pages we have traversed, or 0 if the operation failed. > > + */ > > +static unsigned long relocate_anon(struct pagetable_move_control *pmc, > > + unsigned long old_addr, unsigned long new_addr, pte_t pte, > > + bool undo) > > +{ > > + struct page *page; > > + struct folio *folio; > > + struct vm_area_struct *old, *new; > > + pgoff_t new_index; > > + unsigned long ret = 1; > > + > > + old = pmc->old; > > + new = pmc->new; > > + > > + /* Ensure we have truly got an anon folio. */ > > + page = vm_normal_page(old, old_addr, pte); > > + if (!page) > > + return ret; > > + folio = page_folio(page); > > + folio_lock(folio); > > + > > + /* no-op. */ > > + if (!folio_test_anon(folio) || folio_test_ksm(folio)) > > + goto out; > > + > > + /* > > + * This should not happen as we explicitly disallow this, but check > > + * anyway. > > + */ > > + if (folio_test_large(folio)) { > > + ret = 0; > > + goto out; > > + } > > Do I understand correctly that you assume here that the page is > exclusively mapped? Maybe we could at least > WARN_ON(folio_mapcount(folio) != 1) or something like that? Ack and agreed, will add! > > (I was also wondering if the PageAnonExclusive bit is somehow > relevant, but we should probably not look at or touch that here, > unless we want to think about cases where we _used to_ have a child > from which the page may have been GUP'd...) Yeah I was thinking the same re: this flag, but given locks we hold this should not be the case, however you later do note the point that I need to check anon_vma->num_children == 1 (i.e. self-parented) to ensure that we aren't looking at a mapping in a parent of a child which would imply a folio that maybe isn't. > > > + if (!undo) > > + new_index = linear_page_index(new, new_addr); > > + else > > + new_index = linear_page_index(old, old_addr); > > + > > + /* > > + * The PTL should keep us safe from unmapping, and the fact the folio is > > + * a PTE keeps the folio referenced. > > + * > > + * The mmap/VMA locks should keep us safe from fork and other processes. > > + * > > + * The rmap locks should keep us safe from anything happening to the > > + * VMA/anon_vma. > > "The rmap locks"? I only see that we're holding the rmap lock on the > new anon_vma; are we also holding a lock on the old anon_vma? I should rephrase (this is partly because I kept changing how I did the locking - hey I did warn in preamble this is early :P) I don't believe we need to hold the _source_ rmap lock, because we establish a folio lock prior to adjusting the folio so it shouldn't be possible for an rmap walk to go terribly wrong here. But do correct me if you feel this is an invalid assumption. > > > + * The folio lock should keep us safe from reclaim, migration, etc. > > + */ > > + folio_move_anon_rmap(folio, undo ? old : new); > > + WRITE_ONCE(folio->index, new_index); > > + > > +out: > > + folio_unlock(folio); > > + return ret; > > +} > [...] 
> > +static bool relocate_anon_ptes(struct pagetable_move_control *pmc, > > + unsigned long extent, pmd_t *pmd, bool undo) > > +{ > > + struct mm_struct *mm = current->mm; > > + struct pte_state state = { > > + .old_addr = pmc->old_addr, > > + .new_addr = pmc->new_addr, > > + .old_end = pmc->old_addr + extent, > > + }; > > + spinlock_t *ptl; > > + pte_t *ptep_start; > > + bool ret; > > + unsigned long nr_pages; > > + > > + ptep_start = pte_offset_map_lock(mm, pmd, pmc->old_addr, &ptl); > > + /* > > + * We prevent faults with mmap write lock, hold the rmap lock and should > > + * not fail to obtain this lock. Just give up if we can't. > > + */ > > + if (!ptep_start) > > + return false; > > + > > + state.ptep = ptep_start; > > + > > + for (; !pte_done(&state); pte_next(&state, nr_pages)) { > > + pte_t pte = ptep_get(state.ptep); > > + > > + if (pte_none(pte) || !pte_present(pte)) { > > + nr_pages = 1; > > + continue; > > + } > > + > > + nr_pages = relocate_anon(pmc, state.old_addr, state.new_addr, pte, undo); > > + if (!nr_pages) { > > + ret = false; > > + goto out; > > + } > > + } > > + > > + ret = true; > > +out: > > + pte_unmap_unlock(ptep_start, ptl); > > + return ret; > > +} > > Just to make sure I understand correctly: > This function is changing the ->pgoff and ->mapping of pages while > they are still mapped in the old VMA, right? And if that fails midway > through for whatever reason, we go and change all the already-changed > ->pgoff and ->mapping values back? Yup. Write lock is held on both VMAs, and rmap lock for new VMA. > > [...] > > @@ -1132,6 +1347,42 @@ static void unmap_source_vma(struct vma_remap_struct *vrm) > > } > > } > > > > +/* > > + * Should we attempt to relocate anonymous folios to the location that the VMA > > + * is being moved to by updating index and mapping fields accordingly? > > + */ > > +static bool should_relocate_anon(struct vma_remap_struct *vrm, > > + struct pagetable_move_control *pmc) > > +{ > > + struct vm_area_struct *old = vrm->vma; > > + > > + /* Currently we only do this if requested. */ > > + if (!(vrm->flags & MREMAP_RELOCATE_ANON)) > > + return false; > > + > > + /* We can't deal with special or hugetlb mappings. */ > > + if (old->vm_flags & (VM_SPECIAL | VM_HUGETLB)) > > + return false; > > + > > + /* We only support anonymous mappings. */ > > + if (!vma_is_anonymous(old)) > > + return false; > > + > > + /* If no folios are mapped, then no need to attempt this. */ > > + if (!old->anon_vma) > > + return false; > > + > > + /* > > + * If the old VMA is a child (i.e. has been forked), then the index > > + * references multiple VMAs, we have to bail. > > + */ > > + if (!list_is_singular(&old->anon_vma_chain)) > > + return false; > > I think this is wrong: list_is_singular(&old->anon_vma_chain) tells > you whether pages in the VMA might be shared due to this mm_struct > being forked from a parent mm_struct; but it won't tell you whether > pages in the VMA might be shared due to this mm_struct being the > parent of another mm_struct (other way around). I guess checking > old->anon_vma->root->num_children could maybe work... Yeah you're completely right, this is entirely an oversight on my part. I don't think we need to be looking at old->anon_vma->root though? As for anon_vma->root != anon_vma here we'd need to be a child, and we just asserted we're not right? (could be an additional check though). But definitely need an old->anon_vma->num_children check. 
Holding old vma lock should prevent further forks, so we could just take the rmap lock - lock - rmap unlock here? > > > + > > + /* Otherwise, we're good to go! */ > > + return true; > > +} > > + > > /* > > * Copy vrm->vma over to vrm->new_addr possibly adjusting size as part of the > > * process. Additionally handle an error occurring on moving of page tables, > > @@ -1151,9 +1402,11 @@ static int copy_vma_and_data(struct vma_remap_struct *vrm, > > struct vm_area_struct *new_vma; > > int err = 0; > > PAGETABLE_MOVE(pmc, NULL, NULL, vrm->addr, vrm->new_addr, vrm->old_len); > > + bool relocate_anon = should_relocate_anon(vrm, &pmc); > > > > +again: > > new_vma = copy_vma(&vma, vrm->new_addr, vrm->new_len, new_pgoff, > > - &pmc.need_rmap_locks); > > + &pmc.need_rmap_locks, &relocate_anon); > > if (!new_vma) { > > vrm_uncharge(vrm); > > *new_vma_ptr = NULL; > > @@ -1163,12 +1416,66 @@ static int copy_vma_and_data(struct vma_remap_struct *vrm, > > pmc.old = vma; > > pmc.new = new_vma; > > > > + if (relocate_anon) { > > + /* > > + * We have a new VMA to reassign folios to. We take a lock on > > + * its anon_vma so reclaim doesn't fail to unmap mappings. > > + * > > + * We have acquired a VMA write lock by now (in vma_link()), so > > + * we do not have to worry about racing faults. > > + */ > > + anon_vma_lock_write(new_vma->anon_vma); > > + pmc.relocate_locked = new_vma; > > + > > + if (!relocate_anon_folios(&pmc, /* undo= */false)) { > > + unsigned long start = new_vma->vm_start; > > + unsigned long size = new_vma->vm_end - start; > > + > > + /* Undo if fails. */ > > + relocate_anon_folios(&pmc, /* undo= */true); > > This relocate_anon_folios() has to make sure to only operate on the > subset of PTEs that we already edited successfully, right? Assumption is that we'd hit the failure again at the end of the subset, we always start from the beginning. This may or may not be a safe one :) > > > + vrm_stat_account(vrm, vrm->new_len); > > + > > + anon_vma_unlock_write(new_vma->anon_vma); > > + pmc.relocate_locked = NULL; > > + > > + do_munmap(current->mm, start, size, NULL); > > + relocate_anon = false; > > + goto again; > > + } > > + } > [...] > > diff --git a/mm/vma.c b/mm/vma.c > > index 5418eef3a852..09027448753f 100644 > > --- a/mm/vma.c > > +++ b/mm/vma.c > [...] > > @@ -1821,6 +1834,14 @@ struct vm_area_struct *copy_vma(struct vm_area_struct **vmap, > > new_vma->vm_ops->open(new_vma); > > if (vma_link(mm, new_vma)) > > goto out_vma_link; > > + /* > > + * If we're attempting to relocate anonymous VMAs, we > > + * don't want to reuse an anon_vma as set by > > + * vm_area_dup(), or copy anon_vma_chain or anything > > + * like this. > > + */ > > + if (*relocate_anon && __anon_vma_prepare(new_vma)) > > + goto out_vma_link; > > Is this bailout really okay? We go to the same label as if vma_link() > failed, but we're in a very different state: For example, the VMA has > already been inserted into the VMA tree with vma_iter_store() as part > of vma_link() (unless that changed in one of the prerequisite > patches). Yeah I think you're right, this is an oversight (hey - error paths - fun right??) - will correct. > > > *need_rmap_locks = false; > > } > > return new_vma;
Thanks for early review, pints_owed++ :>) On Sat, Mar 22, 2025 at 06:33:22AM +0100, David Hildenbrand wrote: > On 22.03.25 01:14, Jann Horn wrote: > > On Fri, Mar 21, 2025 at 10:54 PM Lorenzo Stoakes > > <lorenzo.stoakes@oracle.com> wrote: > > > diff --git a/mm/mremap.c b/mm/mremap.c > > > index 0865387531ed..bb67562a0114 100644 > > > --- a/mm/mremap.c > > > +++ b/mm/mremap.c > > [...] > > > +/* > > > + * If the folio mapped at the specified pte entry can have its index and mapping > > > + * relocated, then do so. > > > + * > > > + * Returns the number of pages we have traversed, or 0 if the operation failed. > > > + */ > > > +static unsigned long relocate_anon(struct pagetable_move_control *pmc, > > > + unsigned long old_addr, unsigned long new_addr, pte_t pte, > > > + bool undo) > > > +{ > > > + struct page *page; > > > + struct folio *folio; > > > + struct vm_area_struct *old, *new; > > > + pgoff_t new_index; > > > + unsigned long ret = 1; > > > + > > > + old = pmc->old; > > > + new = pmc->new; > > > + > > > + /* Ensure we have truly got an anon folio. */ > > > + page = vm_normal_page(old, old_addr, pte); > > > + if (!page) > > > + return ret; > > > + folio = page_folio(page); > > > + folio_lock(folio); > > > + > > > + /* no-op. */ > > > + if (!folio_test_anon(folio) || folio_test_ksm(folio)) > > > + goto out; > > > + > > > + /* > > > + * This should not happen as we explicitly disallow this, but check > > > + * anyway. > > > + */ > > > + if (folio_test_large(folio)) { > > > + ret = 0; > > > + goto out; > > > + } > > > > Do I understand correctly that you assume here that the page is > > exclusively mapped? Maybe we could at least > > WARN_ON(folio_mapcount(folio) != 1) or something like that? > > > > (I was also wondering if the PageAnonExclusive bit is somehow > > relevant, but we should probably not look at or touch that here, > > unless we want to think about cases where we _used to_ have a child > > from which the page may have been GUP'd...) > > UFFDIO_MOVE implements something similar. Right now we keep it simple: > > if (folio_test_large(src_folio) || > folio_maybe_dma_pinned(src_folio) || > !PageAnonExclusive(&src_folio->page)) { > err = -EBUSY; > goto out; > } > > Whereby we > > a) Make sure we cover all PTEs (-> small folio, single PTE). Large > PTE-mapped folios are split. > > b) Make sure there are no GUP pins (maybe not required here?) > > c) The folio is exclusive to this process Yeah, later I actually add handling for large folios :) but this is something separate. The maybe dma pinned thing is a thing and probably I need to add this. Will do so. > > > In general, things I can reason about with confidence are: > > a) As alternative to PageAnonExclusive(), we can check folio_mapcount()==1 > under the folio lock for small folios / PMD-mapped folios. As you (Jann) > say, there might be unexpected references on the folio from other processes. Ack for sure will add. > > b) For large (pte-mapped) folios, we could try batching multiple PTEs > (folio_pte_batch() etc.). We'd be processing all mappings with folio_lock + > folio_mapcount() == #PTEs. Interesting, hadn't thought about this, maybe we can discuss at LSF? > > c) In -next, there is now be the option to use folio lock + > folio_maybe_mapped_shared() == false. But it doesn't tell you into how many > VMAs a large folio is mapped into. > > In the following case: > > [ folio ] > [ VMA#1 ] [ VMA#2 ] > > c) would not tell you if you are fine modifying the folio when moving VMA#2. 
Right, I feel like the prior checks made should assert this is not the case, however? But the mapcount check should be a last-ditch assurance? (actually at least one of the 'prior checks' for large folios is added in a later commit but still :P) > > > -- > Cheers, > > David / dhildenb >
On Sat, Mar 22, 2025 at 07:17:05AM +0100, David Hildenbrand wrote: > On 22.03.25 06:33, David Hildenbrand wrote: > > On 22.03.25 01:14, Jann Horn wrote: > > > On Fri, Mar 21, 2025 at 10:54 PM Lorenzo Stoakes > > > <lorenzo.stoakes@oracle.com> wrote: > > > > diff --git a/mm/mremap.c b/mm/mremap.c > > > > index 0865387531ed..bb67562a0114 100644 > > > > --- a/mm/mremap.c > > > > +++ b/mm/mremap.c > > > [...] > > > > +/* > > > > + * If the folio mapped at the specified pte entry can have its index and mapping > > > > + * relocated, then do so. > > > > + * > > > > + * Returns the number of pages we have traversed, or 0 if the operation failed. > > > > + */ > > > > +static unsigned long relocate_anon(struct pagetable_move_control *pmc, > > > > + unsigned long old_addr, unsigned long new_addr, pte_t pte, > > > > + bool undo) > > > > +{ > > > > + struct page *page; > > > > + struct folio *folio; > > > > + struct vm_area_struct *old, *new; > > > > + pgoff_t new_index; > > > > + unsigned long ret = 1; > > > > + > > > > + old = pmc->old; > > > > + new = pmc->new; > > > > + > > > > + /* Ensure we have truly got an anon folio. */ > > > > + page = vm_normal_page(old, old_addr, pte); > > > > + if (!page) > > > > + return ret; > > > > + folio = page_folio(page); > > > > + folio_lock(folio); > > > > + > > > > + /* no-op. */ > > > > + if (!folio_test_anon(folio) || folio_test_ksm(folio)) > > > > + goto out; > > > > + > > > > + /* > > > > + * This should not happen as we explicitly disallow this, but check > > > > + * anyway. > > > > + */ > > > > + if (folio_test_large(folio)) { > > > > + ret = 0; > > > > + goto out; > > > > + } > > > > > > Do I understand correctly that you assume here that the page is > > > exclusively mapped? Maybe we could at least > > > WARN_ON(folio_mapcount(folio) != 1) or something like that? > > > > > > (I was also wondering if the PageAnonExclusive bit is somehow > > > relevant, but we should probably not look at or touch that here, > > > unless we want to think about cases where we _used to_ have a child > > > from which the page may have been GUP'd...) > > > > UFFDIO_MOVE implements something similar. Right now we keep it simple: > > > > if (folio_test_large(src_folio) || > > folio_maybe_dma_pinned(src_folio) || > > !PageAnonExclusive(&src_folio->page)) { > > err = -EBUSY; > > goto out; > > } > > > > Whereby we > > > > a) Make sure we cover all PTEs (-> small folio, single PTE). Large > > PTE-mapped folios are split. > > > > b) Make sure there are no GUP pins (maybe not required here?) > > > > c) The folio is exclusive to this process > > On additional note as my memory comes back: if PAE is set, there cannot be > other (inactive) mappings from the swapcache. So whenever we use folio lock > + mapcount data, the possibility of the swapcache (having inactive mappings > from other processes etc.) must be considered. Ack, do you have a feel for how such a check would work? > > -- > Cheers, > > David / dhildenb > An aside in general: I realise all of this is very fiddly, this (albeit early) RFC is several revisions deep and fully expect considerably more fiddly cases to come up :) I plan to do some deeper testing on real iron running very heavy workloads to encourage reclaim and races btw. One test I've done in past is to just hack in forced-on MREMAP_RELOCATE_ANON so _all_ mremap()'s that move do it, which has been a good way of finding issues, but also in tests I add in series I intentionally trigger reclaim via MADV_PAGEOUT. 
So in general - I'm going to proactively try to flush out any and all weird edge cases where possible before more seriously pushing this series forward. Thanks again for the early review, it's much appreciated! :) and see you at the topic I'm giving on this, and the pub after... ;)
>> >> c) In -next, there is now be the option to use folio lock + >> folio_maybe_mapped_shared() == false. But it doesn't tell you into how many >> VMAs a large folio is mapped into. >> >> In the following case: >> >> [ folio ] >> [ VMA#1 ] [ VMA#2 ] >> >> c) would not tell you if you are fine modifying the folio when moving VMA#2. > > Right, I feel like prior checks made should assert this is not the case, > however? But mapcount check should be a last ditch assurance? Something nice might be hiding in c) that might be able to handle a single folio being covered by multiple vmas. I was thinking about the following: [ folio0 ] [ VMA#0 ] Then we do a partial (old-school) mremap() [ folio0 ] [ folio0 ] [ VMA#1 ] [ VMA#2 ] To then extend VMA#1 and fault in pages [ folio0 ][ folio1 ] [ folio0 ] [ VMA#1 ] [ VMA#2 ] If that is possible (did not try!, maybe something prevents us from extending VMA#1) mremap(MREMAP_RELOCATE_ANON) of VMA#1 / VMA#2 cannot work. We'd have to detect that scenario (partial mremap). You might be doing that with the anon-vma magic, something different might be: Assume we flag large folios if they were partially mremapped in any process. Then (with folio lock only) 1) folio_maybe_mapped_shared() == false: mapped into single process 2) folio_maybe_partially_mremaped() == false: not scattered in virtual address space It would be sufficient to check if the folio fully falls into the memap() range to decide if we can adjust the folio index etc. We *might* be able to use that in the COW-reuse path for large folios to perform a folio_move_anon_rmap(), which we currently only perform for small folios / PMD-mapped folios (single mapping). Not sure yet if actually multiple VMAs are involved. Just throwing it out there ... > > (actually at least one of the 'prior checks' for large folios are added in a > later commit but still :P) Yeah, I'm looking at the bigger picture; small folios are easy :P
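For what it's worth, the containment test implied by that idea could be sketched as below, to be used alongside the hypothetical folio_maybe_partially_mremaped() flag proposed above. It assumes the large folio is mapped linearly within the VMA, which is what the relocation would require anyway:

static bool folio_fully_in_range(struct folio *folio, struct vm_area_struct *vma,
		unsigned long addr, unsigned long start, unsigned long end)
{
	/* Virtual address at which the folio's first page is mapped. */
	unsigned long first = addr -
		((linear_page_index(vma, addr) - folio->index) << PAGE_SHIFT);
	unsigned long last = first + (folio_nr_pages(folio) << PAGE_SHIFT);

	/* Only adjust the folio if the moved range covers all of it. */
	return first >= start && last <= end;
}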
On 22.03.25 08:21, Lorenzo Stoakes wrote: > On Sat, Mar 22, 2025 at 07:17:05AM +0100, David Hildenbrand wrote: >> On 22.03.25 06:33, David Hildenbrand wrote: >>> On 22.03.25 01:14, Jann Horn wrote: >>>> On Fri, Mar 21, 2025 at 10:54 PM Lorenzo Stoakes >>>> <lorenzo.stoakes@oracle.com> wrote: >>>>> diff --git a/mm/mremap.c b/mm/mremap.c >>>>> index 0865387531ed..bb67562a0114 100644 >>>>> --- a/mm/mremap.c >>>>> +++ b/mm/mremap.c >>>> [...] >>>>> +/* >>>>> + * If the folio mapped at the specified pte entry can have its index and mapping >>>>> + * relocated, then do so. >>>>> + * >>>>> + * Returns the number of pages we have traversed, or 0 if the operation failed. >>>>> + */ >>>>> +static unsigned long relocate_anon(struct pagetable_move_control *pmc, >>>>> + unsigned long old_addr, unsigned long new_addr, pte_t pte, >>>>> + bool undo) >>>>> +{ >>>>> + struct page *page; >>>>> + struct folio *folio; >>>>> + struct vm_area_struct *old, *new; >>>>> + pgoff_t new_index; >>>>> + unsigned long ret = 1; >>>>> + >>>>> + old = pmc->old; >>>>> + new = pmc->new; >>>>> + >>>>> + /* Ensure we have truly got an anon folio. */ >>>>> + page = vm_normal_page(old, old_addr, pte); >>>>> + if (!page) >>>>> + return ret; >>>>> + folio = page_folio(page); >>>>> + folio_lock(folio); >>>>> + >>>>> + /* no-op. */ >>>>> + if (!folio_test_anon(folio) || folio_test_ksm(folio)) >>>>> + goto out; >>>>> + >>>>> + /* >>>>> + * This should not happen as we explicitly disallow this, but check >>>>> + * anyway. >>>>> + */ >>>>> + if (folio_test_large(folio)) { >>>>> + ret = 0; >>>>> + goto out; >>>>> + } >>>> >>>> Do I understand correctly that you assume here that the page is >>>> exclusively mapped? Maybe we could at least >>>> WARN_ON(folio_mapcount(folio) != 1) or something like that? >>>> >>>> (I was also wondering if the PageAnonExclusive bit is somehow >>>> relevant, but we should probably not look at or touch that here, >>>> unless we want to think about cases where we _used to_ have a child >>>> from which the page may have been GUP'd...) >>> >>> UFFDIO_MOVE implements something similar. Right now we keep it simple: >>> >>> if (folio_test_large(src_folio) || >>> folio_maybe_dma_pinned(src_folio) || >>> !PageAnonExclusive(&src_folio->page)) { >>> err = -EBUSY; >>> goto out; >>> } >>> >>> Whereby we >>> >>> a) Make sure we cover all PTEs (-> small folio, single PTE). Large >>> PTE-mapped folios are split. >>> >>> b) Make sure there are no GUP pins (maybe not required here?) >>> >>> c) The folio is exclusive to this process >> >> On additional note as my memory comes back: if PAE is set, there cannot be >> other (inactive) mappings from the swapcache. So whenever we use folio lock >> + mapcount data, the possibility of the swapcache (having inactive mappings >> from other processes etc.) must be considered. > > Ack, do you have a feel for how such a check would work? Likely under folio lock if (folio_test_swapcache(folio) && !folio_free_swap(folio)) { /* unable to move. */ folio_unlock(folio) return -ENOTGOINGTOHAPPEN; }
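Slotted into relocate_anon() from the patch below, that suggestion would sit just after the existing anon/KSM test, roughly as follows (a sketch using the patch's convention that 0 signals failure):

	folio_lock(folio);

	/* no-op. */
	if (!folio_test_anon(folio) || folio_test_ksm(folio))
		goto out;

	/*
	 * A folio still in the swapcache may have inactive mappings from
	 * other processes. folio_free_swap() requires the folio lock, which
	 * we hold here; if we cannot drop the swapcache entry, give up.
	 */
	if (folio_test_swapcache(folio) && !folio_free_swap(folio)) {
		ret = 0;
		goto out;
	}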
diff --git a/include/uapi/linux/mman.h b/include/uapi/linux/mman.h index e89d00528f2f..d0542f872e0c 100644 --- a/include/uapi/linux/mman.h +++ b/include/uapi/linux/mman.h @@ -9,6 +9,7 @@ #define MREMAP_MAYMOVE 1 #define MREMAP_FIXED 2 #define MREMAP_DONTUNMAP 4 +#define MREMAP_RELOCATE_ANON 8 #define OVERCOMMIT_GUESS 0 #define OVERCOMMIT_ALWAYS 1 diff --git a/mm/internal.h b/mm/internal.h index 286520a424fe..7c120a6b0018 100644 --- a/mm/internal.h +++ b/mm/internal.h @@ -46,6 +46,7 @@ struct folio_batch; struct pagetable_move_control { struct vm_area_struct *old; /* Source VMA. */ struct vm_area_struct *new; /* Destination VMA. */ + struct vm_area_struct *relocate_locked; /* VMA which is rmap locked. */ unsigned long old_addr; /* Address from which the move begins. */ unsigned long old_end; /* Exclusive address at which old range ends. */ unsigned long new_addr; /* Address to move page tables to. */ diff --git a/mm/mremap.c b/mm/mremap.c index 0865387531ed..bb67562a0114 100644 --- a/mm/mremap.c +++ b/mm/mremap.c @@ -71,6 +71,14 @@ struct vma_remap_struct { unsigned long charged; /* If VM_ACCOUNT, # pages to account. */ }; +/* Represents local PTE state. */ +struct pte_state { + unsigned long old_addr; + unsigned long new_addr; + unsigned long old_end; + pte_t *ptep; +}; + static pud_t *get_old_pud(struct mm_struct *mm, unsigned long addr) { pgd_t *pgd; @@ -139,18 +147,42 @@ static pmd_t *alloc_new_pmd(struct mm_struct *mm, unsigned long addr) return pmd; } -static void take_rmap_locks(struct vm_area_struct *vma) +static void maybe_take_rmap_locks(struct pagetable_move_control *pmc) { + struct vm_area_struct *vma; + struct anon_vma *anon_vma; + + if (!pmc->need_rmap_locks) + return; + + vma = pmc->old; + anon_vma = vma->anon_vma; if (vma->vm_file) i_mmap_lock_write(vma->vm_file->f_mapping); - if (vma->anon_vma) - anon_vma_lock_write(vma->anon_vma); + if (anon_vma) { + struct vm_area_struct *relocate_vma = pmc->relocate_locked; + + if (!relocate_vma || relocate_vma->anon_vma != anon_vma) + anon_vma_lock_write(anon_vma); + } } -static void drop_rmap_locks(struct vm_area_struct *vma) +static void maybe_drop_rmap_locks(struct pagetable_move_control *pmc) { - if (vma->anon_vma) - anon_vma_unlock_write(vma->anon_vma); + struct vm_area_struct *vma; + struct anon_vma *anon_vma; + + if (!pmc->need_rmap_locks) + return; + + vma = pmc->old; + anon_vma = vma->anon_vma; + if (anon_vma) { + struct vm_area_struct *relocate_vma = pmc->relocate_locked; + + if (!relocate_vma || relocate_vma->anon_vma != anon_vma) + anon_vma_unlock_write(anon_vma); + } if (vma->vm_file) i_mmap_unlock_write(vma->vm_file->f_mapping); } @@ -204,8 +236,7 @@ static int move_ptes(struct pagetable_move_control *pmc, * serialize access to individual ptes, but only rmap traversal * order guarantees that we won't miss both the old and new ptes). 
*/ - if (pmc->need_rmap_locks) - take_rmap_locks(vma); + maybe_take_rmap_locks(pmc); /* * We don't have to worry about the ordering of src and dst @@ -254,6 +285,7 @@ static int move_ptes(struct pagetable_move_control *pmc, */ if (pte_present(pte)) force_flush = true; + pte = move_pte(pte, old_addr, new_addr); pte = move_soft_dirty_pte(pte); @@ -278,8 +310,7 @@ static int move_ptes(struct pagetable_move_control *pmc, pte_unmap(new_pte - 1); pte_unmap_unlock(old_pte - 1, old_ptl); out: - if (pmc->need_rmap_locks) - drop_rmap_locks(vma); + maybe_drop_rmap_locks(pmc); return err; } @@ -537,15 +568,14 @@ static __always_inline unsigned long get_extent(enum pgt_entry entry, * Should move_pgt_entry() acquire the rmap locks? This is either expressed in * the PMC, or overridden in the case of normal, larger page tables. */ -static bool should_take_rmap_locks(struct pagetable_move_control *pmc, - enum pgt_entry entry) +static bool should_take_rmap_locks(enum pgt_entry entry) { switch (entry) { case NORMAL_PMD: case NORMAL_PUD: return true; default: - return pmc->need_rmap_locks; + return false; } } @@ -557,11 +587,15 @@ static bool move_pgt_entry(struct pagetable_move_control *pmc, enum pgt_entry entry, void *old_entry, void *new_entry) { bool moved = false; - bool need_rmap_locks = should_take_rmap_locks(pmc, entry); + bool override_locks = false; - /* See comment in move_ptes() */ - if (need_rmap_locks) - take_rmap_locks(pmc->old); + if (!pmc->need_rmap_locks && should_take_rmap_locks(entry)) { + override_locks = true; + + pmc->need_rmap_locks = true; + /* See comment in move_ptes() */ + maybe_take_rmap_locks(pmc); + } switch (entry) { case NORMAL_PMD: @@ -585,8 +619,9 @@ static bool move_pgt_entry(struct pagetable_move_control *pmc, break; } - if (need_rmap_locks) - drop_rmap_locks(pmc->old); + maybe_drop_rmap_locks(pmc); + if (override_locks) + pmc->need_rmap_locks = false; return moved; } @@ -752,6 +787,186 @@ static unsigned long pmc_progress(struct pagetable_move_control *pmc) return old_addr < orig_old_addr ? 0 : old_addr - orig_old_addr; } +/* + * If the folio mapped at the specified pte entry can have its index and mapping + * relocated, then do so. + * + * Returns the number of pages we have traversed, or 0 if the operation failed. + */ +static unsigned long relocate_anon(struct pagetable_move_control *pmc, + unsigned long old_addr, unsigned long new_addr, pte_t pte, + bool undo) +{ + struct page *page; + struct folio *folio; + struct vm_area_struct *old, *new; + pgoff_t new_index; + unsigned long ret = 1; + + old = pmc->old; + new = pmc->new; + + /* Ensure we have truly got an anon folio. */ + page = vm_normal_page(old, old_addr, pte); + if (!page) + return ret; + + folio = page_folio(page); + folio_lock(folio); + + /* no-op. */ + if (!folio_test_anon(folio) || folio_test_ksm(folio)) + goto out; + + /* + * This should not happen as we explicitly disallow this, but check + * anyway. + */ + if (folio_test_large(folio)) { + ret = 0; + goto out; + } + + if (!undo) + new_index = linear_page_index(new, new_addr); + else + new_index = linear_page_index(old, old_addr); + + /* + * The PTL should keep us safe from unmapping, and the fact the folio is + * a PTE keeps the folio referenced. + * + * The mmap/VMA locks should keep us safe from fork and other processes. + * + * The rmap locks should keep us safe from anything happening to the + * VMA/anon_vma. + * + * The folio lock should keep us safe from reclaim, migration, etc. + */ + folio_move_anon_rmap(folio, undo ? 
old : new); + WRITE_ONCE(folio->index, new_index); + +out: + folio_unlock(folio); + return ret; +} + +static bool pte_done(struct pte_state *state) +{ + return state->old_addr >= state->old_end; +} + +static void pte_next(struct pte_state *state, unsigned long nr_pages) +{ + state->old_addr += nr_pages * PAGE_SIZE; + state->new_addr += nr_pages * PAGE_SIZE; + state->ptep += nr_pages; +} + +static bool relocate_anon_ptes(struct pagetable_move_control *pmc, + unsigned long extent, pmd_t *pmd, bool undo) +{ + struct mm_struct *mm = current->mm; + struct pte_state state = { + .old_addr = pmc->old_addr, + .new_addr = pmc->new_addr, + .old_end = pmc->old_addr + extent, + }; + spinlock_t *ptl; + pte_t *ptep_start; + bool ret; + unsigned long nr_pages; + + ptep_start = pte_offset_map_lock(mm, pmd, pmc->old_addr, &ptl); + /* + * We prevent faults with mmap write lock, hold the rmap lock and should + * not fail to obtain this lock. Just give up if we can't. + */ + if (!ptep_start) + return false; + + state.ptep = ptep_start; + + for (; !pte_done(&state); pte_next(&state, nr_pages)) { + pte_t pte = ptep_get(state.ptep); + + if (pte_none(pte) || !pte_present(pte)) { + nr_pages = 1; + continue; + } + + nr_pages = relocate_anon(pmc, state.old_addr, state.new_addr, pte, undo); + if (!nr_pages) { + ret = false; + goto out; + } + } + + ret = true; +out: + pte_unmap_unlock(ptep_start, ptl); + return ret; +} + +static bool __relocate_anon_folios(struct pagetable_move_control *pmc, bool undo) +{ + pud_t *pudp; + pmd_t *pmdp; + unsigned long extent; + struct mm_struct *mm = current->mm; + + if (!pmc->len_in) + return true; + + for (; !pmc_done(pmc); pmc_next(pmc, extent)) { + pmd_t pmd; + pud_t pud; + + extent = get_extent(NORMAL_PUD, pmc); + + pudp = get_old_pud(mm, pmc->old_addr); + if (!pudp) + continue; + pud = pudp_get(pudp); + + if (pud_trans_huge(pud) || pud_devmap(pud)) + return false; + + extent = get_extent(NORMAL_PMD, pmc); + pmdp = get_old_pmd(mm, pmc->old_addr); + if (!pmdp) + continue; + pmd = pmdp_get(pmdp); + + if (is_swap_pmd(pmd) || pmd_trans_huge(pmd) || + pmd_devmap(pmd)) + return false; + + if (pmd_none(pmd)) + continue; + + if (!relocate_anon_ptes(pmc, extent, pmdp, undo)) + return false; + } + + return true; +} + +static bool relocate_anon_folios(struct pagetable_move_control *pmc, bool undo) +{ + unsigned long old_addr = pmc->old_addr; + unsigned long new_addr = pmc->new_addr; + bool ret; + + ret = __relocate_anon_folios(pmc, undo); + + /* Reset state ready for retry. */ + pmc->old_addr = old_addr; + pmc->new_addr = new_addr; + + return ret; +} + unsigned long move_page_tables(struct pagetable_move_control *pmc) { unsigned long extent; @@ -1132,6 +1347,42 @@ static void unmap_source_vma(struct vma_remap_struct *vrm) } } +/* + * Should we attempt to relocate anonymous folios to the location that the VMA + * is being moved to by updating index and mapping fields accordingly? + */ +static bool should_relocate_anon(struct vma_remap_struct *vrm, + struct pagetable_move_control *pmc) +{ + struct vm_area_struct *old = vrm->vma; + + /* Currently we only do this if requested. */ + if (!(vrm->flags & MREMAP_RELOCATE_ANON)) + return false; + + /* We can't deal with special or hugetlb mappings. */ + if (old->vm_flags & (VM_SPECIAL | VM_HUGETLB)) + return false; + + /* We only support anonymous mappings. */ + if (!vma_is_anonymous(old)) + return false; + + /* If no folios are mapped, then no need to attempt this. */ + if (!old->anon_vma) + return false; + + /* + * If the old VMA is a child (i.e. 
has been forked), then the index + * references multiple VMAs, we have to bail. + */ + if (!list_is_singular(&old->anon_vma_chain)) + return false; + + /* Otherwise, we're good to go! */ + return true; +} + /* * Copy vrm->vma over to vrm->new_addr possibly adjusting size as part of the * process. Additionally handle an error occurring on moving of page tables, @@ -1151,9 +1402,11 @@ static int copy_vma_and_data(struct vma_remap_struct *vrm, struct vm_area_struct *new_vma; int err = 0; PAGETABLE_MOVE(pmc, NULL, NULL, vrm->addr, vrm->new_addr, vrm->old_len); + bool relocate_anon = should_relocate_anon(vrm, &pmc); +again: new_vma = copy_vma(&vma, vrm->new_addr, vrm->new_len, new_pgoff, - &pmc.need_rmap_locks); + &pmc.need_rmap_locks, &relocate_anon); if (!new_vma) { vrm_uncharge(vrm); *new_vma_ptr = NULL; @@ -1163,12 +1416,66 @@ static int copy_vma_and_data(struct vma_remap_struct *vrm, pmc.old = vma; pmc.new = new_vma; + if (relocate_anon) { + /* + * We have a new VMA to reassign folios to. We take a lock on + * its anon_vma so reclaim doesn't fail to unmap mappings. + * + * We have acquired a VMA write lock by now (in vma_link()), so + * we do not have to worry about racing faults. + */ + anon_vma_lock_write(new_vma->anon_vma); + pmc.relocate_locked = new_vma; + + if (!relocate_anon_folios(&pmc, /* undo= */false)) { + unsigned long start = new_vma->vm_start; + unsigned long size = new_vma->vm_end - start; + + /* Undo if fails. */ + relocate_anon_folios(&pmc, /* undo= */true); + vrm_stat_account(vrm, vrm->new_len); + + anon_vma_unlock_write(new_vma->anon_vma); + pmc.relocate_locked = NULL; + + do_munmap(current->mm, start, size, NULL); + relocate_anon = false; + goto again; + } + } + moved_len = move_page_tables(&pmc); if (moved_len < vrm->old_len) err = -ENOMEM; else if (vma->vm_ops && vma->vm_ops->mremap) err = vma->vm_ops->mremap(new_vma); + if (unlikely(err && relocate_anon)) { + relocate_anon_folios(&pmc, /* undo= */true); + anon_vma_unlock_write(new_vma->anon_vma); + pmc.relocate_locked = NULL; + } else if (relocate_anon /* && !err */) { + unsigned long addr = vrm->new_addr; + unsigned long end = addr + vrm->new_len; + VMA_ITERATOR(vmi, vma->vm_mm, addr); + VMG_VMA_STATE(vmg, &vmi, NULL, new_vma, addr, end); + struct vm_area_struct *merged; + + /* + * Now we have successfully copied page tables and set up + * folios, we can safely drop the anon_vma lock. + */ + anon_vma_unlock_write(new_vma->anon_vma); + pmc.relocate_locked = NULL; + + /* Let's try merge again... */ + vmg.prev = vma_prev(&vmi); + vma_next(&vmi); + merged = vma_merge_existing_range(&vmg); + if (merged) + new_vma = merged; + } + if (unlikely(err)) { PAGETABLE_MOVE(pmc_revert, new_vma, vma, vrm->new_addr, vrm->addr, moved_len); @@ -1486,7 +1793,8 @@ static unsigned long check_mremap_params(struct vma_remap_struct *vrm) unsigned long flags = vrm->flags; /* Ensure no unexpected flag values. */ - if (flags & ~(MREMAP_FIXED | MREMAP_MAYMOVE | MREMAP_DONTUNMAP)) + if (flags & ~(MREMAP_FIXED | MREMAP_MAYMOVE | MREMAP_DONTUNMAP | + MREMAP_RELOCATE_ANON)) return -EINVAL; /* Start address must be page-aligned. */ @@ -1501,6 +1809,10 @@ static unsigned long check_mremap_params(struct vma_remap_struct *vrm) if (!PAGE_ALIGN(vrm->new_len)) return -EINVAL; + /* We can't relocate without allowing a move. */ + if ((flags & MREMAP_RELOCATE_ANON) && !(flags & MREMAP_MAYMOVE)) + return -EINVAL; + /* Remainder of checks are for cases with specific new_addr. 
*/ if (!vrm_implies_new_addr(vrm)) return 0; diff --git a/mm/vma.c b/mm/vma.c index 5418eef3a852..09027448753f 100644 --- a/mm/vma.c +++ b/mm/vma.c @@ -774,8 +774,7 @@ static bool can_merge_remove_vma(struct vm_area_struct *vma) * - The caller must hold a WRITE lock on the mm_struct->mmap_lock. * - vmi must be positioned within [@vmg->middle->vm_start, @vmg->middle->vm_end). */ -static __must_check struct vm_area_struct *vma_merge_existing_range( - struct vma_merge_struct *vmg) +struct vm_area_struct *vma_merge_existing_range(struct vma_merge_struct *vmg) { struct vm_area_struct *middle = vmg->middle; struct vm_area_struct *prev = vmg->prev; @@ -1756,7 +1755,7 @@ int vma_link(struct mm_struct *mm, struct vm_area_struct *vma) */ struct vm_area_struct *copy_vma(struct vm_area_struct **vmap, unsigned long addr, unsigned long len, pgoff_t pgoff, - bool *need_rmap_locks) + bool *need_rmap_locks, bool *relocate_anon) { struct vm_area_struct *vma = *vmap; unsigned long vma_start = vma->vm_start; @@ -1782,7 +1781,19 @@ struct vm_area_struct *copy_vma(struct vm_area_struct **vmap, vmg.middle = NULL; /* New VMA range. */ vmg.pgoff = pgoff; vmg.next = vma_iter_next_rewind(&vmi, NULL); + new_vma = vma_merge_new_range(&vmg); + if (*relocate_anon) { + /* + * If merge succeeds, no need to relocate. Otherwise, reset + * pgoff for newly established VMA which we will relocate folios + * to. + */ + if (new_vma) + *relocate_anon = false; + else + pgoff = addr >> PAGE_SHIFT; + } if (new_vma) { /* @@ -1813,7 +1824,9 @@ struct vm_area_struct *copy_vma(struct vm_area_struct **vmap, vma_set_range(new_vma, addr, addr + len, pgoff); if (vma_dup_policy(vma, new_vma)) goto out_free_vma; - if (anon_vma_clone(new_vma, vma)) + if (*relocate_anon) + new_vma->anon_vma = NULL; + else if (anon_vma_clone(new_vma, vma)) goto out_free_mempol; if (new_vma->vm_file) get_file(new_vma->vm_file); @@ -1821,6 +1834,14 @@ struct vm_area_struct *copy_vma(struct vm_area_struct **vmap, new_vma->vm_ops->open(new_vma); if (vma_link(mm, new_vma)) goto out_vma_link; + /* + * If we're attempting to relocate anonymous VMAs, we + * don't want to reuse an anon_vma as set by + * vm_area_dup(), or copy anon_vma_chain or anything + * like this. 
+ */ + if (*relocate_anon && __anon_vma_prepare(new_vma)) + goto out_vma_link; *need_rmap_locks = false; } return new_vma; diff --git a/mm/vma.h b/mm/vma.h index 7356ca5a22d3..fc4f8352ec6f 100644 --- a/mm/vma.h +++ b/mm/vma.h @@ -260,6 +260,9 @@ __must_check struct vm_area_struct __must_check struct vm_area_struct *vma_merge_new_range(struct vma_merge_struct *vmg); +__must_check struct vm_area_struct +*vma_merge_existing_range(struct vma_merge_struct *vmg); + __must_check struct vm_area_struct *vma_merge_extend(struct vma_iterator *vmi, struct vm_area_struct *vma, @@ -280,7 +283,7 @@ int vma_link(struct mm_struct *mm, struct vm_area_struct *vma); struct vm_area_struct *copy_vma(struct vm_area_struct **vmap, unsigned long addr, unsigned long len, pgoff_t pgoff, - bool *need_rmap_locks); + bool *need_rmap_locks, bool *relocate_anon); struct anon_vma *find_mergeable_anon_vma(struct vm_area_struct *vma); diff --git a/tools/testing/vma/vma.c b/tools/testing/vma/vma.c index 7cfd6e31db10..3d19df8fa17b 100644 --- a/tools/testing/vma/vma.c +++ b/tools/testing/vma/vma.c @@ -1543,13 +1543,14 @@ static bool test_copy_vma(void) unsigned long flags = VM_READ | VM_WRITE | VM_MAYREAD | VM_MAYWRITE; struct mm_struct mm = {}; bool need_locks = false; + bool relocate_anon = false; VMA_ITERATOR(vmi, &mm, 0); struct vm_area_struct *vma, *vma_new, *vma_next; /* Move backwards and do not merge. */ vma = alloc_and_link_vma(&mm, 0x3000, 0x5000, 3, flags); - vma_new = copy_vma(&vma, 0, 0x2000, 0, &need_locks); + vma_new = copy_vma(&vma, 0, 0x2000, 0, &need_locks, &relocate_anon); ASSERT_NE(vma_new, vma); ASSERT_EQ(vma_new->vm_start, 0); ASSERT_EQ(vma_new->vm_end, 0x2000); @@ -1562,7 +1563,7 @@ static bool test_copy_vma(void) vma = alloc_and_link_vma(&mm, 0, 0x2000, 0, flags); vma_next = alloc_and_link_vma(&mm, 0x6000, 0x8000, 6, flags); - vma_new = copy_vma(&vma, 0x4000, 0x2000, 4, &need_locks); + vma_new = copy_vma(&vma, 0x4000, 0x2000, 4, &need_locks, &relocate_anon); vma_assert_attached(vma_new); ASSERT_EQ(vma_new, vma_next);
When mremap() moves a mapping around in memory, it goes to great lengths to avoid having to walk page tables, as this is expensive and time-consuming. Rather, if the VMA was faulted (that is, vma->anon_vma != NULL), the virtual page offset stored in the VMA at vma->vm_pgoff will remain the same, as will the indexes of the folios pointing at the associated anon_vma object. This means the VMA and page tables can simply be moved and this effects the change (and if we can move page tables at a higher page table level, this is even faster).

While this is efficient, it does lead to big problems with VMA merging - in essence it causes faulted anonymous VMAs to not be mergeable under many circumstances once moved. This is limiting and leads both to a proliferation of unreclaimable, unmovable kernel metadata (VMAs, anon_vma's, anon_vma_chain's) and to an impact on further use of mremap(), which requires that the range moved (which can also be a partial range within a VMA) span only a single VMA. This makes the mergeability or otherwise of VMAs in effect a uAPI concern.

In some use cases, users may wish to accept the overhead of actually going to the trouble of updating VMAs and folios to effect mremap() moves. Let's provide them with the choice.

This patch adds a new MREMAP_RELOCATE_ANON flag to do just that, which attempts to perform such an operation. If it is unable to do so, it cleanly falls back to the usual method. It carefully takes the rmap locks such that at no time will a racing rmap user encounter incorrect or missing VMAs. It is also designed to interact cleanly with the existing mremap() error fallback mechanism (inverting the remap should the page table move fail).

Also, if we could merge cleanly without such an operation, we do so, avoiding its overhead when it is not required. In the instance that no merge is possible when the move is performed, we still perform the folio and VMA updates to ensure that future mremap() or mprotect() calls will result in merges.

In this implementation, we simply give up if we encounter large folios. A subsequent commit will extend the functionality to allow for these cases. We restrict this flag to purely anonymous memory only.

Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
---
 include/uapi/linux/mman.h |   1 +
 mm/internal.h             |   1 +
 mm/mremap.c               | 354 +++++++++++++++++++++++++++++++++++---
 mm/vma.c                  |  29 +++-
 mm/vma.h                  |   5 +-
 tools/testing/vma/vma.c   |   5 +-
 6 files changed, 367 insertions(+), 28 deletions(-)
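As a usage illustration (not part of the patch), a caller opting into the new behaviour would look something like the sketch below. MREMAP_RELOCATE_ANON is defined locally because it only exists in this series' uapi header; the value 8 matches the proposed change to include/uapi/linux/mman.h.

#define _GNU_SOURCE
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>

#ifndef MREMAP_RELOCATE_ANON
#define MREMAP_RELOCATE_ANON 8	/* value proposed by this series */
#endif

int main(void)
{
	size_t len = 16 * 4096;
	char *src, *dst, *moved;

	src = mmap(NULL, len, PROT_READ | PROT_WRITE,
		   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (src == MAP_FAILED)
		return 1;
	/* Fault the pages in so vma->anon_vma and folio indexes exist. */
	memset(src, 0xaa, len);

	/* Reserve a destination range to force a move into. */
	dst = mmap(NULL, len, PROT_NONE, MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (dst == MAP_FAILED)
		return 1;

	/*
	 * Move the mapping and ask the kernel to also rewrite folio
	 * index/mapping fields so the relocated VMA remains mergeable.
	 * Kernels without this series reject the unknown flag with EINVAL.
	 */
	moved = mremap(src, len, len,
		       MREMAP_MAYMOVE | MREMAP_FIXED | MREMAP_RELOCATE_ANON,
		       dst);
	if (moved == MAP_FAILED) {
		perror("mremap");
		return 1;
	}
	printf("moved %zu bytes to %p\n", len, (void *)moved);
	return 0;
}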