
mm: use READ/WRITE_ONCE() for vma->vm_flags on migrate, mprotect

Message ID 20250207172442.78836-1-lorenzo.stoakes@oracle.com (mailing list archive)
State New
Series mm: use READ/WRITE_ONCE() for vma->vm_flags on migrate, mprotect

Commit Message

Lorenzo Stoakes Feb. 7, 2025, 5:24 p.m. UTC
According to the syzbot report referenced here, it is possible to encounter
a race between mprotect() writing to the vma->vm_flags field and migration
checking whether the VMA is locked.

There is no real problem with timing here per se, only that torn
reads/writes may occur. Therefore, as a proximate fix, ensure both
operations use READ_ONCE() and WRITE_ONCE() to avoid this.

This race is possible because VMAs can be looked up via the rmap, which is
what migration does in this case; the rmap lookup takes no mmap or VMA lock
and therefore does not preclude a concurrent operation modifying the VMA.

When the final update of VMA flags is performed by mprotect, this will
cause the rmap lock to be taken while the VMA is inserted on split/merge.

However, the way we perform splits/merges in the kernel is to carry out the
split/merge operation on the VMA, acquiring/releasing locks as needed, and
only then, after having done so, to modify fields.

We should carefully examine and determine whether we can combine the two
operations so as to avoid such races, and whether it might be possible to
otherwise annotate these rmap field accesses.
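
To make the tearing concern concrete, here is a minimal userspace-style
sketch of the access pattern (illustration only; the READ_ONCE()/WRITE_ONCE()
definitions below are simplified volatile-cast stand-ins, not the real macros
from <linux/compiler.h>):

#include <pthread.h>
#include <stdio.h>

/* Simplified stand-ins for the kernel macros; illustration only. */
#define READ_ONCE(x)		(*(const volatile typeof(x) *)&(x))
#define WRITE_ONCE(x, v)	(*(volatile typeof(x) *)&(x) = (v))

#define VM_LOCKED	0x00002000UL	/* flag bit used for this sketch */

static unsigned long vm_flags;		/* stands in for vma->vm_flags */

/* rmap-style reader: holds no mmap/VMA lock, so a plain load may be torn. */
static void *reader(void *arg)
{
	if (READ_ONCE(vm_flags) & VM_LOCKED)
		puts("VMA is mlock()ed");
	return NULL;
}

int main(void)
{
	pthread_t t;

	pthread_create(&t, NULL, reader, NULL);
	/* mprotect-style writer: a plain store may likewise be torn. */
	WRITE_ONCE(vm_flags, vm_flags | VM_LOCKED);
	pthread_join(t, NULL);
	return 0;
}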

Reported-by: syzbot+c2e5712cbb14c95d4847@syzkaller.appspotmail.com
Closes: https://lore.kernel.org/all/67a34e60.050a0220.50516.0040.GAE@google.com/
Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
---
 mm/migrate.c  | 2 +-
 mm/mprotect.c | 6 +++---
 2 files changed, 4 insertions(+), 4 deletions(-)

Comments

Andrew Morton Feb. 8, 2025, 2:50 a.m. UTC | #1
On Fri,  7 Feb 2025 17:24:42 +0000 Lorenzo Stoakes <lorenzo.stoakes@oracle.com> wrote:

> According to the syzbot report referenced here, it is possible to encounter
> a race between mprotect() writing to the vma->vm_flags field and migration
> checking whether the VMA is locked.
> 
> There is no real problem with timing here per se, only that torn
> reads/writes may occur. Therefore, as a proximate fix, ensure both
> operations use READ_ONCE() and WRITE_ONCE() to avoid this.
> 
> This race is possible because VMAs can be looked up via the rmap, which is
> what migration does in this case; the rmap lookup takes no mmap or VMA lock
> and therefore does not preclude a concurrent operation modifying the VMA.
> 
> When the final update of VMA flags is performed by mprotect, this will
> cause the rmap lock to be taken while the VMA is inserted on split/merge.
> 
> However, the way we perform splits/merges in the kernel is to carry out the
> split/merge operation on the VMA, acquiring/releasing locks as needed, and
> only then, after having done so, to modify fields.
> 
> We should carefully examine and determine whether we can combine the two
> operations so as to avoid such races, and whether it might be possible to
> otherwise annotate these rmap field accesses.

Thanks.

If some poor person reads this code and wonders "why is it using
READ_ONCE", what's our answer?  I guess it's "poke around with
git-blame".

And I guess we can live with that - it doesn't seem practical to paste
changelog text into every READ_ONCE() site.

Probably most people won't bother and READ_ONCEs of ->vm_flags will get
pasted into other places where unneeded.

I do wonder if we can do better.
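
One way to do a bit better, purely as a possibility and not part of this
patch, would be a short comment at the READ_ONCE() site itself so the answer
doesn't live only in git-blame; hypothetically, something like:

	/*
	 * This VMA may have been reached via the rmap with no mmap or VMA
	 * lock held, so a racing mprotect() can rewrite ->vm_flags under
	 * us.  Timing doesn't matter here; READ_ONCE() only avoids a torn
	 * load.
	 */
	if (READ_ONCE(vma->vm_flags) & VM_LOCKED)
		mlock_drain_local();
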
Lorenzo Stoakes Feb. 8, 2025, 9:44 a.m. UTC | #2
On Fri, Feb 07, 2025 at 06:50:14PM -0800, Andrew Morton wrote:
> On Fri,  7 Feb 2025 17:24:42 +0000 Lorenzo Stoakes <lorenzo.stoakes@oracle.com> wrote:
>
> > [...]
>
> Thanks.
>
> If some poor person reads this code and wonders "why is it using
> READ_ONCE", what's our answer?  I guess it's "poke around with
> git-blame".
>
> And I guess we can live with that - it doesn't seem practical to paste
> changelog text into every READ_ONCE() site.
>
> Probably most people won't bother and READ_ONCEs of ->vm_flags will get
> pasted into other places where unneeded.
>
> I do wonder if we can do better.
>

Indeed, I was thinking the same, and we were discussing more broadly on the
thread (and separately too) - ideally we'd have a way of expressing these things
more concretely and explicitly making it clear that 'hey rmap might allow access
here'.

It's one I'm going to have to sit down to think about somewhat, as we also need
to audit accesses like this. It's on my list and on the whiteboard which means I
will get to it eventually... :)
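
A purely hypothetical sketch of what such an annotation might look like
(nothing like this exists in the tree) is a dedicated accessor that both
documents the locking context and centralises the READ_ONCE(), so it doesn't
get cargo-culted elsewhere:

#include <linux/mm.h>

/*
 * Hypothetical helper, illustration only: read ->vm_flags on a VMA that was
 * reached via the rmap, i.e. with neither the mmap lock nor the VMA lock
 * held.  Concurrent updates are tolerated; READ_ONCE() only avoids a torn
 * load.
 */
static inline vm_flags_t vma_flags_read_rmap(const struct vm_area_struct *vma)
{
	return READ_ONCE(vma->vm_flags);
}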

Patch

diff --git a/mm/migrate.c b/mm/migrate.c
index fb19a18892c8..365c6daa8d1b 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -328,7 +328,7 @@  static bool remove_migration_pte(struct folio *folio,
 				folio_add_file_rmap_pte(folio, new, vma);
 			set_pte_at(vma->vm_mm, pvmw.address, pvmw.pte, pte);
 		}
-		if (vma->vm_flags & VM_LOCKED)
+		if (READ_ONCE(vma->vm_flags) & VM_LOCKED)
 			mlock_drain_local();
 
 		trace_remove_migration_pte(pvmw.address, pte_val(pte),
diff --git a/mm/mprotect.c b/mm/mprotect.c
index 516b1d847e2c..84cf7689c5eb 100644
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -607,7 +607,7 @@  mprotect_fixup(struct vma_iterator *vmi, struct mmu_gather *tlb,
 	       unsigned long start, unsigned long end, unsigned long newflags)
 {
 	struct mm_struct *mm = vma->vm_mm;
-	unsigned long oldflags = vma->vm_flags;
+	unsigned long oldflags = READ_ONCE(vma->vm_flags);
 	long nrpages = (end - start) >> PAGE_SHIFT;
 	unsigned int mm_cp_flags = 0;
 	unsigned long charged = 0;
@@ -627,7 +627,7 @@  mprotect_fixup(struct vma_iterator *vmi, struct mmu_gather *tlb,
 	 * uncommon case, so doesn't need to be very optimized.
 	 */
 	if (arch_has_pfn_modify_check() &&
-	    (vma->vm_flags & (VM_PFNMAP|VM_MIXEDMAP)) &&
+	    (oldflags & (VM_PFNMAP|VM_MIXEDMAP)) &&
 	    (newflags & VM_ACCESS_FLAGS) == 0) {
 		pgprot_t new_pgprot = vm_get_page_prot(newflags);
 
@@ -676,7 +676,7 @@  mprotect_fixup(struct vma_iterator *vmi, struct mmu_gather *tlb,
 	 * held in write mode.
 	 */
 	vma_start_write(vma);
-	vm_flags_reset(vma, newflags);
+	vm_flags_reset_once(vma, newflags);
 	if (vma_wants_manual_pte_write_upgrade(vma))
 		mm_cp_flags |= MM_CP_TRY_CHANGE_WRITABLE;
 	vma_set_page_prot(vma);