
[v6,1/2] rust: mm: add abstractions for mm_struct and vm_area_struct

Message ID 20241010-vma-v6-1-d89039b6f573@google.com (mailing list archive)
State New
Series Rust support for mm_struct, vm_area_struct, and mmap for miscdevice

Commit Message

Alice Ryhl Oct. 10, 2024, 12:56 p.m. UTC
These abstractions allow you to manipulate vmas. Rust Binder will use
these in a few different ways.

In the mmap implementation, a VmAreaNew will be provided to the mmap
call which allows it to modify the vma in ways that are only okay during
initial setup. This is the case where the most methods are available.

However, Rust Binder needs to insert and remove pages from the vma as
time passes. When incoming messages arrive, pages may need to be
inserted if space is missing, and in this case that is done by using a
stashed ARef<Mm> and calling mmget_not_zero followed by mmap_write_lock
followed by vma_lookup followed by vm_insert_page. In this case, since
mmap_write_lock is used, the VmAreaMut type will be in use.

Another use-case is the shrinker, where the mmap read lock is taken
instead, and zap_page_range_single is used to remove pages from the vma.
In this case, only the read lock is taken, so the VmAreaRef type will be
in use.

Future extensions could involve a VmAreaRcuRef for accessing vma methods
that are okay to use when holding just the rcu read lock. However, these
methods are not needed from Rust yet.

This uses shared references even for VmAreaMut. This is preferable to
using pinned mutable references because those are pretty inconvenient
due to the lack of reborrowing. However, this means that VmAreaMut
cannot be Sync. I think it is an acceptable trade-off.

This patch is based on Wedson's implementation on the old rust branch,
but has been changed significantly. All mistakes are Alice's.

Co-developed-by: Wedson Almeida Filho <wedsonaf@gmail.com>
Signed-off-by: Wedson Almeida Filho <wedsonaf@gmail.com>
Signed-off-by: Alice Ryhl <aliceryhl@google.com>
---
 rust/helpers/helpers.c |   1 +
 rust/helpers/mm.c      |  55 ++++++++
 rust/kernel/lib.rs     |   1 +
 rust/kernel/mm.rs      | 344 +++++++++++++++++++++++++++++++++++++++++++++++++
 rust/kernel/mm/virt.rs | 264 +++++++++++++++++++++++++++++++++++++
 5 files changed, 665 insertions(+)

Comments

Boqun Feng Oct. 13, 2024, 12:16 a.m. UTC | #1
On Thu, Oct 10, 2024 at 12:56:35PM +0000, Alice Ryhl wrote:
> These abstractions allow you to manipulate vmas. Rust Binder will uses
> these in a few different ways.
> 
> In the mmap implementation, a VmAreaNew will be provided to the mmap
> call which allows it to modify the vma in ways that are only okay during
> initial setup. This is the case where the most methods are available.
> 
> However, Rust Binder needs to insert and remove pages from the vma as
> time passes. When incoming messages arrive, pages may need to be
> inserted if space is missing, and in this case that is done by using a
> stashed ARef<Mm> and calling mmget_not_zero followed by mmap_write_lock
> followed by vma_lookup followed by vm_insert_page. In this case, since
> mmap_write_lock is used, the VmAreaMut type will be in use.
> 
> Another use-case is the shrinker, where the mmap read lock is taken
> instead, and zap_page_range_single is used to remove pages from the vma.
> In this case, only the read lock is taken, so the VmAreaRef type will be
> in use.
> 
> Future extensions could involve a VmAreaRcuRef for accessing vma methods
> that are okay to use when holding just the rcu read lock. However, these
> methods are not needed from Rust yet.
> 
> This uses shared references even for VmAreaMut. This is preferable to
> using pinned mutable references because those are pretty inconvenient
> due to the lack of reborrowing. However, this means that VmAreaMut
> cannot be Sync. I think it is an acceptable trade-off.
> 

Interesting ;-) I agree it's better than using Pin.

> This patch is based on Wedson's implementation on the old rust branch,
> but has been changed significantly. All mistakes are Alice's.
> 
> Co-developed-by: Wedson Almeida Filho <wedsonaf@gmail.com>
> Signed-off-by: Wedson Almeida Filho <wedsonaf@gmail.com>
> Signed-off-by: Alice Ryhl <aliceryhl@google.com>
> ---
[...]
> +/// A wrapper for the kernel's `struct mm_struct`.
> +///
> +/// This type is like [`Mm`], but with non-zero `mm_users`. It can only be used when `mm_users` can
> +/// be proven to be non-zero at compile-time, usually because the relevant code holds an `mmget`
> +/// refcount. It can be used to access the associated address space.
> +///
> +/// The `ARef<MmWithUser>` smart pointer holds an `mmget` refcount. Its destructor may sleep.
> +///
> +/// # Invariants
> +///
> +/// Values of this type are always refcounted using `mmget`. The value of `mm_users` is non-zero.
> +#[repr(transparent)]
> +pub struct MmWithUser {
> +    mm: Mm,
> +}
> +
> +// SAFETY: It is safe to call `mmput` on another thread than where `mmget` was called.
> +unsafe impl Send for MmWithUser {}
> +// SAFETY: All methods on `MmWithUser` can be called in parallel from several threads.
> +unsafe impl Sync for MmWithUser {}
> +
[...]
> +
> +/// A guard for the mmap read lock.
> +///
> +/// # Invariants
> +///
> +/// This `MmapReadLock` guard owns the mmap read lock.
> +pub struct MmapReadLock<'a> {
> +    mm: &'a MmWithUser,

Since `MmWithUser` is `Sync`, `MmapReadLock<'a>` ends up being `Send`?
However, it cannot be `Send`, because the lock must be released by the
same thread: although ->mmap_lock is a read-write *semaphore*,
rw_semaphore by default has strict owner semantics (see
Documentation/locking/locktypes.rst).

Also given this type is really a lock guard, maybe name it
something like MmapReadGuard or MmapReadLockGuard?

Same `Send` issue and name suggestion for `MmapWriteLock`.

> +}
> +
> +impl<'a> MmapReadLock<'a> {
> +    /// Look up a vma at the given address.
> +    #[inline]
> +    pub fn vma_lookup(&self, vma_addr: usize) -> Option<&virt::VmAreaRef> {
> +        // SAFETY: We hold a reference to the mm, so the pointer must be valid. Any value is okay
> +        // for `vma_addr`.
> +        let vma = unsafe { bindings::vma_lookup(self.mm.as_raw(), vma_addr as _) };
> +
> +        if vma.is_null() {
> +            None
> +        } else {
> +            // SAFETY: We just checked that a vma was found, so the pointer is valid. Furthermore,
> +            // the returned area will borrow from this read lock guard, so it can only be used
> +            // while the read lock is still held.
> +            unsafe { Some(virt::VmAreaRef::from_raw(vma)) }
> +        }
> +    }
> +}
> +
> +impl Drop for MmapReadLock<'_> {
> +    #[inline]
> +    fn drop(&mut self) {
> +        // SAFETY: We hold the read lock by the type invariants.
> +        unsafe { bindings::mmap_read_unlock(self.mm.as_raw()) };
> +    }
> +}
> +
> +/// A guard for the mmap write lock.
> +///
> +/// # Invariants
> +///
> +/// This `MmapWriteLock` guard owns the mmap write lock.
> +pub struct MmapWriteLock<'a> {
> +    mm: &'a MmWithUser,
> +}
> +
> +impl<'a> MmapWriteLock<'a> {
> +    /// Look up a vma at the given address.
> +    #[inline]
> +    pub fn vma_lookup(&mut self, vma_addr: usize) -> Option<&virt::VmAreaMut> {
> +        // SAFETY: We hold a reference to the mm, so the pointer must be valid. Any value is okay
> +        // for `vma_addr`.
> +        let vma = unsafe { bindings::vma_lookup(self.mm.as_raw(), vma_addr as _) };
> +
> +        if vma.is_null() {
> +            None
> +        } else {
> +            // SAFETY: We just checked that a vma was found, so the pointer is valid. Furthermore,
> +            // the returned area will borrow from this write lock guard, so it can only be used
> +            // while the write lock is still held.
> +            unsafe { Some(virt::VmAreaMut::from_raw(vma)) }
> +        }
> +    }
> +}
> +
> +impl Drop for MmapWriteLock<'_> {
> +    #[inline]
> +    fn drop(&mut self) {
> +        // SAFETY: We hold the write lock by the type invariants.
> +        unsafe { bindings::mmap_write_unlock(self.mm.as_raw()) };
> +    }
> +}
> diff --git a/rust/kernel/mm/virt.rs b/rust/kernel/mm/virt.rs
> new file mode 100644
> index 000000000000..7c09813e22f9
> --- /dev/null
> +++ b/rust/kernel/mm/virt.rs
> @@ -0,0 +1,264 @@
> +// SPDX-License-Identifier: GPL-2.0
> +
> +// Copyright (C) 2024 Google LLC.
> +
> +//! Virtual memory.
[...]
> +impl VmAreaRef {
> +    /// Access a virtual memory area given a raw pointer.
> +    ///
> +    /// # Safety
> +    ///
> +    /// Callers must ensure that `vma` is valid for the duration of 'a, and that the mmap read lock
> +    /// (or stronger) is held for at least the duration of 'a.
> +    #[inline]
> +    pub unsafe fn from_raw<'a>(vma: *const bindings::vm_area_struct) -> &'a Self {

Unrelated to this patch, but since we have so many `from_raw`s, I want
to suggest that we should look into a #[derive(FromRaw)] ;-) For
example:

    pub trait FromRaw {
        type RawType;
        unsafe fn from_raw<'a>(raw: *const Self::RawType) -> &'a Self;
    }
and

    #[derive(FromRaw)]
    #[repr(transparent)] // repr(transparent) is mandatory.
    struct VmAreaRef {
        vma: Opaque<bindings::vm_area_struct> // Opaque is also mandatory.
    }

Regards,
Boqun

> +        // SAFETY: The caller ensures that the invariants are satisfied for the duration of 'a.
> +        unsafe { &*vma.cast() }
> +    }
[...]
Alice Ryhl Oct. 14, 2024, 9:33 a.m. UTC | #2
On Sun, Oct 13, 2024 at 2:16 AM Boqun Feng <boqun.feng@gmail.com> wrote:
>
> On Thu, Oct 10, 2024 at 12:56:35PM +0000, Alice Ryhl wrote:
> > These abstractions allow you to manipulate vmas. Rust Binder will uses
> > these in a few different ways.
> >
> > In the mmap implementation, a VmAreaNew will be provided to the mmap
> > call which allows it to modify the vma in ways that are only okay during
> > initial setup. This is the case where the most methods are available.
> >
> > However, Rust Binder needs to insert and remove pages from the vma as
> > time passes. When incoming messages arrive, pages may need to be
> > inserted if space is missing, and in this case that is done by using a
> > stashed ARef<Mm> and calling mmget_not_zero followed by mmap_write_lock
> > followed by vma_lookup followed by vm_insert_page. In this case, since
> > mmap_write_lock is used, the VmAreaMut type will be in use.
> >
> > Another use-case is the shrinker, where the mmap read lock is taken
> > instead, and zap_page_range_single is used to remove pages from the vma.
> > In this case, only the read lock is taken, so the VmAreaRef type will be
> > in use.
> >
> > Future extensions could involve a VmAreaRcuRef for accessing vma methods
> > that are okay to use when holding just the rcu read lock. However, these
> > methods are not needed from Rust yet.
> >
> > This uses shared references even for VmAreaMut. This is preferable to
> > using pinned mutable references because those are pretty inconvenient
> > due to the lack of reborrowing. However, this means that VmAreaMut
> > cannot be Sync. I think it is an acceptable trade-off.
> >
>
> Interesting ;-) I agree it's better than using Pin.
>
> > This patch is based on Wedson's implementation on the old rust branch,
> > but has been changed significantly. All mistakes are Alice's.
> >
> > Co-developed-by: Wedson Almeida Filho <wedsonaf@gmail.com>
> > Signed-off-by: Wedson Almeida Filho <wedsonaf@gmail.com>
> > Signed-off-by: Alice Ryhl <aliceryhl@google.com>
> > ---
> [...]
> > +/// A wrapper for the kernel's `struct mm_struct`.
> > +///
> > +/// This type is like [`Mm`], but with non-zero `mm_users`. It can only be used when `mm_users` can
> > +/// be proven to be non-zero at compile-time, usually because the relevant code holds an `mmget`
> > +/// refcount. It can be used to access the associated address space.
> > +///
> > +/// The `ARef<MmWithUser>` smart pointer holds an `mmget` refcount. Its destructor may sleep.
> > +///
> > +/// # Invariants
> > +///
> > +/// Values of this type are always refcounted using `mmget`. The value of `mm_users` is non-zero.
> > +/// #[repr(transparent)]
> > +pub struct MmWithUser {
> > +    mm: Mm,
> > +}
> > +
> > +// SAFETY: It is safe to call `mmput` on another thread than where `mmget` was called.
> > +unsafe impl Send for MmWithUser {}
> > +// SAFETY: All methods on `MmWithUser` can be called in parallel from several threads.
> > +unsafe impl Sync for MmWithUser {}
> > +
> [...]
> > +
> > +/// A guard for the mmap read lock.
> > +///
> > +/// # Invariants
> > +///
> > +/// This `MmapReadLock` guard owns the mmap read lock.
> > +pub struct MmapReadLock<'a> {
> > +    mm: &'a MmWithUser,
>
> Since `MmWithUser` is `Sync`, so `MmapReadLock<'a>` is `Send`? However,
> it cannot be a `Send` because the lock must be released by the same
> thread: although ->mmap_lock is a read-write *semaphore*, but
> rw_semaphore by default has strict owner semantics (see
> Documentation/locking/locktypes.rst).
>
> Also given this type is really a lock guard, maybe name it
> something like MmapReadGuard or MmapReadLockGuard?
>
> Same `Send` issue and name suggestion for `MmapWriteLock`.

Fixed in v7.

https://lore.kernel.org/lkml/20241014-vma-v7-0-01e32f861195@google.com/

> > +impl VmAreaRef {
> > +    /// Access a virtual memory area given a raw pointer.
> > +    ///
> > +    /// # Safety
> > +    ///
> > +    /// Callers must ensure that `vma` is valid for the duration of 'a, and that the mmap read lock
> > +    /// (or stronger) is held for at least the duration of 'a.
> > +    #[inline]
> > +    pub unsafe fn from_raw<'a>(vma: *const bindings::vm_area_struct) -> &'a Self {
>
> Unrelated to this patch, but since we have so many `from_raw`s, I want
> to suggest that we should look into a #[derive(FromRaw)] ;-) For
> example:
>
>     pub trait FromRaw {
>         type RawType;
>         unsafe fn from_raw<'a>(raw: *const Self::RawType) -> &'a Self;
>     }
> and
>
>     #[derive(FromRaw)]
>     #[repr(transparent)] // repr(transparent) is mandatory.
>     struct VmAreaRef {
>         vma: Opaque<bindings::vm_area_struct> // Opaque is also mandatory.
>     }

Well, perhaps. But it's not a trait, so is a derive the right way?
Either way, it's unrelated to this patch.

Alice
Jann Horn Nov. 8, 2024, 7:01 p.m. UTC | #3
On Thu, Oct 10, 2024 at 2:56 PM Alice Ryhl <aliceryhl@google.com> wrote:
> These abstractions allow you to manipulate vmas. Rust Binder will uses
> these in a few different ways.
>
> In the mmap implementation, a VmAreaNew will be provided to the mmap
> call which allows it to modify the vma in ways that are only okay during
> initial setup. This is the case where the most methods are available.
>
> However, Rust Binder needs to insert and remove pages from the vma as
> time passes. When incoming messages arrive, pages may need to be
> inserted if space is missing, and in this case that is done by using a
> stashed ARef<Mm> and calling mmget_not_zero followed by mmap_write_lock
> followed by vma_lookup followed by vm_insert_page. In this case, since
> mmap_write_lock is used, the VmAreaMut type will be in use.

FYI, the way the C binder implementation uses vma_lookup() and
vm_insert_page() is not very idiomatic. The standard way of
implementing binder_alloc_free_page() would be to use something like
unmap_mapping_range() instead of using
vma_lookup()+zap_page_range_single(); though to set that up, you'd
have to create one inode per binder file, maybe with something like
the DRM subsystem's drm_fs_inode_new(). And instead of having
binder_install_single_page(), the standard way would be to let
binder_vm_fault() install PTEs lazily on fault. That way you'd never
have to take mmap locks or grab MM references yourself.

I guess the one benefit of the current implementation of C binder over
the more idiomatic approach is that it avoids taking a fault after
binder_install_buffer_pages(), but that's probably not such a hot path
that it matters...

> Another use-case is the shrinker, where the mmap read lock is taken
> instead, and zap_page_range_single is used to remove pages from the vma.
> In this case, only the read lock is taken, so the VmAreaRef type will be
> in use.

I think this might also use unmap_mapping_range() in a more idiomatic
implementation.

> Future extensions could involve a VmAreaRcuRef for accessing vma methods
> that are okay to use when holding just the rcu read lock. However, these
> methods are not needed from Rust yet.

You're probably talking about the VMA read lock here? There's not much
you can do under just RCU.

> +    /// Unmap pages in the given page range.
> +    #[inline]
> +    pub fn zap_page_range_single(&self, address: usize, size: usize) {
> +        // SAFETY: By the type invariants, the caller holds at least the mmap read lock, so this
> +        // access is okay. Any value of `address` and `size` is allowed.

FWIW, if you passed in a size that is big enough to make the end
address wrap around the end of the address space, I think you could at
least hit a BUG_ON() in unmap_page_range(). I don't know what level of
safety kernel Rust is going for though - I guess that might be
acceptable?

> +        unsafe {
> +            bindings::zap_page_range_single(
> +                self.as_ptr(),
> +                address as _,
> +                size as _,
> +                core::ptr::null_mut(),
> +            )
> +        };
> +    }
> +}
> +
> +impl VmAreaMut {
> +    /// Access a virtual memory area given a raw pointer.
> +    ///
> +    /// # Safety
> +    ///
> +    /// Callers must ensure that `vma` is valid for the duration of 'a, and that the mmap write lock
> +    /// (or stronger) is held for at least the duration of 'a.

A mutable VMA should probably also come with the vma write lock held
(meaning vma_start_write() has been called on it) - that's needed for
almost all operations that actually mutate the VMA. (Though with the
comments below, maybe you don't actually have any use for VmAreaMut
for now.)

> +    #[inline]
> +    pub unsafe fn from_raw<'a>(vma: *const bindings::vm_area_struct) -> &'a Self {
> +        // SAFETY: The caller ensures that the invariants are satisfied for the duration of 'a.
> +        unsafe { &*vma.cast() }
> +    }
> +
> +    /// Sets the flags associated with the virtual memory area.
> +    ///
> +    /// The possible flags are a combination of the constants in [`flags`].
> +    #[inline]
> +    pub fn set_flags(&self, flags: usize) {
> +        // SAFETY: By the type invariants, the caller holds the mmap write lock, so this access is
> +        // not a data race.
> +        unsafe { (*self.as_ptr()).__bindgen_anon_2.vm_flags = flags as _ };

It can be a data race, for example you could race with pte_mkwrite()
called from remove_migration_pte(). (But that's an existing issue, the
same data race currently occurs with existing vm_flags updates in
other core kernel paths.)

Separately from that: If you want to update the flags of existing
VMAs, you need to call vma_start_write() first. The vm_flags_set()
helper is designed to enforce this automatically.

But really I think drivers normally don't change the flags of existing
VMAs anyway, only the flags of newly created VMAs; only core MM code
should be changing the flags of existing VMAs. You should maybe move
this method over into VmAreaNew.

There are ways to set VMA flags that are dangerous and invalid; I
think VM_PFNMAP|VM_MIXEDMAP would be an example (though that might not
actually lead to any actual bad consequences with the way the code is
currently written). Clearing VM_MIXEDMAP after calling
vm_insert_page() might be another one.

VM_HUGETLB is also a flag you really shouldn't be touching from driver
code, that would cause type confusion through a bad cast in
unmap_single_vma -> __unmap_hugepage_range -> hstate_vma ->
hstate_file -> hstate_inode.

I'm not sure how safe against programmer error this is supposed to be;
if you want the API to be safe even against silly blunders, I think
you'd have to at least allowlist which bits the driver may clear and
which bits it may set, and additionally deny at least the bad
combination VM_MIXEDMAP|VM_PFNMAP. Though you might want to introduce
higher-level helpers instead, so that a driver author doesn't blunder
by checking VM_WRITE without also clearing VM_MAYWRITE, which is a
somewhat common mistake that IIRC has led to several security bugs in
the past.

I think the reasonable ways in which drivers might want to change
flags on a VmAreaNew are:

1. clear VM_MAYWRITE after checking VM_WRITE is unset
2. (you could do the same with the VM_EXEC/VM_MAYEXEC bits)
3. set VM_PFNMAP (only valid if not VM_MIXEDMAP and no PTEs have been
installed yet)
4. set VM_MIXEDMAP (only valid if not VM_PFNMAP)
5. set VM_IO/VM_DONTEXPAND
6. set VM_DONTCOPY/VM_DONTDUMP (only permanently effective if VM_IO is also set)

> +    }
> +
> +    /// Maps a single page at the given address within the virtual memory area.
> +    ///
> +    /// This operation does not take ownership of the page.
> +    #[inline]
> +    pub fn vm_insert_page(&self, address: usize, page: &Page) -> Result {
> +        // SAFETY: By the type invariants, the caller holds the mmap write lock, so this access is
> +        // not a data race. The page is guaranteed to be valid and of order 0. The range of
> +        // `address` is already checked by `vm_insert_page`.
> +        to_result(unsafe { bindings::vm_insert_page(self.as_ptr(), address as _, page.as_ptr()) })

vm_insert_page() has a kinda weird contract because there are two
contexts from which you can call it cleanly:

1. You can call it on a VmAreaRef (just like zap_page_range_single(),
you only need to hold an mmap read lock or VMA read lock, no write
lock is required); however, you must ensure that the VMA is already
marked VM_MIXEDMAP. This is the API contract under which you'd call
this from a fault handler.

2. You can call it on a VmAreaNew (and it will take care of setting
VM_MIXEDMAP for you).

I think nothing would immediately crash if you called vm_insert_page()
on a VMA that does not yet have VM_MIXEDMAP while holding the mmap
lock in write mode; but that would permit weird scenarios where you
could, for example, have a userfaultfd context associated with a VMA
which becomes VM_MIXEDMAP, while normally you can't attach userfaultfd
contexts to VM_MIXEDMAP VMAs. I don't think if that actually leads to
anything bad, but it would be a weird state that probably shouldn't be
permitted.

There are also safety requirements for the page being installed, but I
guess the checks that vm_insert_page() already does via
validate_page_before_insert() might be enough to make this safe...

> +    }
> +}
> +
> +impl VmAreaNew {
> +    /// Access a virtual memory area given a raw pointer.
> +    ///
> +    /// # Safety
> +    ///
> +    /// Callers must ensure that `vma` is valid for the duration of 'a, and that the vma is
> +    /// currently undergoing initial VMA setup.
> +    #[inline]
> +    pub unsafe fn from_raw<'a>(vma: *const bindings::vm_area_struct) -> &'a Self {
> +        // SAFETY: The caller ensures that the invariants are satisfied for the duration of 'a.
> +        unsafe { &*vma.cast() }
> +    }
> +
> +    /// Make this vma anonymous.
> +    #[inline]
> +    pub fn set_anonymous(&self) {
> +        // SAFETY: The vma is not yet shared.
> +        unsafe { (*self.as_ptr()).vm_ops = core::ptr::null() };
> +    }

What would you use this for? I don't think drivers are supposed to
create anonymous VMAs. (Also, if you called this after inserting some
non-anonymous pages into the VMA, that might confuse some core MM
code, though I don't know if it can actually cause anything really
bad.)
Jann Horn Nov. 8, 2024, 8:05 p.m. UTC | #4
On Fri, Nov 8, 2024 at 8:01 PM Jann Horn <jannh@google.com> wrote:
> On Thu, Oct 10, 2024 at 2:56 PM Alice Ryhl <aliceryhl@google.com> wrote:
> > These abstractions allow you to manipulate vmas. Rust Binder will uses
> > these in a few different ways.
> >
> > In the mmap implementation, a VmAreaNew will be provided to the mmap
> > call which allows it to modify the vma in ways that are only okay during
> > initial setup. This is the case where the most methods are available.
> >
> > However, Rust Binder needs to insert and remove pages from the vma as
> > time passes. When incoming messages arrive, pages may need to be
> > inserted if space is missing, and in this case that is done by using a
> > stashed ARef<Mm> and calling mmget_not_zero followed by mmap_write_lock
> > followed by vma_lookup followed by vm_insert_page. In this case, since
> > mmap_write_lock is used, the VmAreaMut type will be in use.
>
> FYI, the way the C binder implementation uses vma_lookup() and
> vm_insert_page() is not very idiomatic. The standard way of
> implementing binder_alloc_free_page() would be to use something like
> unmap_mapping_range() instead of using
> vma_lookup()+zap_page_range_single(); though to set that up, you'd
> have to create one inode per binder file, maybe with something like
> the DRM subsystem's drm_fs_inode_new(). And instead of having
> binder_install_single_page(), the standard way would be to let
> binder_vm_fault() install PTEs lazily on fault. That way you'd never
> have to take mmap locks or grab MM references yourself.

Let me know if you think it would be helpful to see a prototype of
that in C - I think I can cobble that together, but doing it nicely
will require some work to convert at least some of the binder_alloc
locking to mutexes, and some more work to switch the ->f_mapping of
the binder file or something like that. (I guess modeling that in Rust
might be a bit of a pain though...)
Alice Ryhl Nov. 11, 2024, 11:41 a.m. UTC | #5
Thanks a lot for your review!

Note that there is a v7 already:
https://lore.kernel.org/all/20241014-vma-v7-0-01e32f861195@google.com/

On Fri, Nov 8, 2024 at 8:02 PM Jann Horn <jannh@google.com> wrote:
> On Thu, Oct 10, 2024 at 2:56 PM Alice Ryhl <aliceryhl@google.com> wrote:
>> These abstractions allow you to manipulate vmas. Rust Binder will uses
>> these in a few different ways.
>>
>> In the mmap implementation, a VmAreaNew will be provided to the mmap
>> call which allows it to modify the vma in ways that are only okay during
>> initial setup. This is the case where the most methods are available.
>>
>> However, Rust Binder needs to insert and remove pages from the vma as
>> time passes. When incoming messages arrive, pages may need to be
>> inserted if space is missing, and in this case that is done by using a
>> stashed ARef<Mm> and calling mmget_not_zero followed by mmap_write_lock
>> followed by vma_lookup followed by vm_insert_page. In this case, since
>> mmap_write_lock is used, the VmAreaMut type will be in use.
>
> FYI, the way the C binder implementation uses vma_lookup() and
> vm_insert_page() is not very idiomatic. The standard way of
> implementing binder_alloc_free_page() would be to use something like
> unmap_mapping_range() instead of using
> vma_lookup()+zap_page_range_single(); though to set that up, you'd
> have to create one inode per binder file, maybe with something like
> the DRM subsystem's drm_fs_inode_new(). And instead of having
> binder_install_single_page(), the standard way would be to let
> binder_vm_fault() install PTEs lazily on fault. That way you'd never
> have to take mmap locks or grab MM references yourself.

Would that actually work? All writes happen from kernel space, so it's
not clear to me that we can trigger the fault handler from there. We
can't use copy_to_user because the write happens from a different
process than the one that owns the vma.

That said, one alternative is to let Binder store an array of `struct
page` and just write directly to those pages when writing from kernel
space. Then binder_vm_fault() can look up the relevant page from the
array when userspace attempts to read the data. Though we could also
just vm_insert_page() right before returning the address to userspace,
since we know that userspace will read it right away after the syscall
returns.

> I guess the one benefit of the current implementation of C binder over
> the more idiomatic approach is that it avoids taking a fault after
> binder_install_buffer_pages(), but that's probably not such a hot path
> that it matters...

Saying "probably not a hot path" is dangerous when it comes to binder's
allocator. :)

> Let me know if you think it would be helpful to see a prototype of
> that in C - I think I can cobble that together, but doing it nicely
> will require some work to convert at least some of the binder_alloc
> locking to mutexes, and some more work to switch the ->f_mapping of
> the binder file or something like that. (I guess modeling that in Rust
> might be a bit of a pain though...)

It would be useful to hear about what the advantages of
unmap_mapping_range() are. I don't need a full prototype, I should be
able to understand given a rough description of what the required
changes are.

>> Another use-case is the shrinker, where the mmap read lock is taken
>> instead, and zap_page_range_single is used to remove pages from the vma.
>> In this case, only the read lock is taken, so the VmAreaRef type will be
>> in use.
>
> I think this might also use unmap_mapping_range() in a more idiomatic
> implementation.
>
>> Future extensions could involve a VmAreaRcuRef for accessing vma methods
>> that are okay to use when holding just the rcu read lock. However, these
>> methods are not needed from Rust yet.
>
> You're probably talking about the VMA read lock here? There's not much
> you can do under just RCU.

I'm not sure. At Plumbers I was explaining the idea of splitting the
Rust abstractions into VmAreaRef, VmAreaMut, VmAreaNew to some mm folks
and I was told that there is a fourth option involving RCU by ... I
think it was Lorenzo. I haven't looked into this aspect further since
that conversation.

>> +    /// Unmap pages in the given page range.
>> +    #[inline]
>> +    pub fn zap_page_range_single(&self, address: usize, size: usize) {
>> +        // SAFETY: By the type invariants, the caller holds at least the mmap read lock, so this
>> +        // access is okay. Any value of `address` and `size` is allowed.
>
> FWIW, if you passed in a size that is big enough to make the end
> address wrap around the end of the address space, I think you could at
> least hit a BUG_ON() in unmap_page_range(). I don't know what level of
> safety kernel Rust is going for though - I guess that might be
> acceptable?

Hitting a BUG_ON is acceptable.

>> +        unsafe {
>> +            bindings::zap_page_range_single(
>> +                self.as_ptr(),
>> +                address as _,
>> +                size as _,
>> +                core::ptr::null_mut(),
>> +            )
>> +        };
>> +    }
>> +}
>> +
>> +impl VmAreaMut {
>> +    /// Access a virtual memory area given a raw pointer.
>> +    ///
>> +    /// # Safety
>> +    ///
>> +    /// Callers must ensure that `vma` is valid for the duration of 'a, and that the mmap write lock
>> +    /// (or stronger) is held for at least the duration of 'a.
>
> A mutable VMA should probably also come with the vma write lock held
> (meaning vma_start_write() has been called on it) - that's needed for
> almost all operations that actually mutate the VMA. (Though with the
> comments below, maybe you don't actually have any use for VmAreaMut
> for now.)

Agreed. See:
https://lore.kernel.org/r/CAH5fLggpEezhR_o+8RPmYix-JLZ47HwQLQM2OUzKQg3i7UYu5Q@mail.gmail.com

>> +    /// Sets the flags associated with the virtual memory area.
>> +    ///
>> +    /// The possible flags are a combination of the constants in [`flags`].
>> +    #[inline]
>> +    pub fn set_flags(&self, flags: usize) {
>> +        // SAFETY: By the type invariants, the caller holds the mmap write lock, so this access is
>> +        // not a data race.
>> +        unsafe { (*self.as_ptr()).__bindgen_anon_2.vm_flags = flags as _ };
>
> It can be a data race, for example you could race with pte_mkwrite()
> called from remove_migration_pte(). (But that's an existing issue, the
> same data race currently occurs with existing vm_flags updates in
> other core kernel paths.)
>
> Separately from that: If you want to update the flags of existing
> VMAs, you need to call vma_start_write() first. The vm_flags_set()
> helper is designed to enforce this automatically.
>
> But really I think drivers normally don't change the flags of existing
> VMAs anyway, only the flags of newly created VMAs; only core MM code
> should be changing the flags of existing VMAs. You should maybe move
> this method over into VmAreaNew.

Yeah. Based on what you're saying, moving this method to VmAreaNew
sounds like a good idea.

Just to confirm, vma_start_write() is not necessary if I move it to
VmAreaNew?

> There are ways to set VMA flags that are dangerous and invalid; I
> think VM_PFNMAP|VM_MIXEDMAP would be an example (though that might not
> actually lead to any actual bad consequences with the way the code is
> currently written). Clearing VM_MIXEDMAP after calling
> vm_insert_page() might be another one.
>
> VM_HUGETLB is also a flag you really shouldn't be touching from driver
> code, that would cause type confusion through a bad cast in
> unmap_single_vma -> __unmap_hugepage_range -> hstate_vma ->
> hstate_file -> hstate_inode.
>
> I'm not sure how safe against programmer error this is supposed to be;
> if you want the API to be safe even against silly blunders, I think
> you'd have to at least allowlist which bits the driver may clear and
> which bits it may set, and additionally deny at least the bad
> combination VM_MIXEDMAP|VM_PFNMAP. Though you might want to introduce
> higher-level helpers instead, so that a driver author doesn't blunder
> by checking VM_WRITE without also clearing VM_MAYWRITE, which is a
> somewhat common mistake that IIRC has led to several security bugs in
> the past.

In general, the goal is to write an abstraction such that you cannot
trigger memory safety issues, even if triggering it involves a silly
blunder. But only memory safety issues *must* be prevented; other issues
such as BUG_ON or deadlocks or resource leaks need not be prevented,
though those are still a nice-to-have when possible.

> I think the reasonable ways in which drivers might want to change
> flags on a VmAreaNew are:
>
> 1. clear VM_MAYWRITE after checking VM_WRITE is unset
> 2. (you could do the same with the VM_EXEC/VM_MAYEXEC bits)
> 3. set VM_PFNMAP (only valid if not VM_MIXEDMAP and no PTEs have been
> installed yet)
> 4. set VM_MIXEDMAP (only valid if not VM_PFNMAP)
> 5. set VM_IO/VM_DONTEXPAND
> 6. set VM_DONTCOPY/VM_DONTDUMP (only permanently effective if VM_IO is also set)

Okay, thanks for the overview. Comparing with the drivers I'm familiar
with, they do the following:

Ashmem: Clear VM_MAYREAD/VM_MAYWRITE/VM_MAYEXEC after checking
VM_READ/VM_WRITE/VM_EXEC. This is allowed by what you say. (You did not
mention VM_READ but I assume the same applies here.)

Binder: Clears VM_MAYWRITE after checking VM_WRITE. Ok. Binder also sets
the flags VM_DONTCOPY and VM_MIXEDMAP. Is a VM_PFNMAP check missing? But
I guess that flag only gets set if you do so yourself, so the check is
probably not necessary?
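The check-then-clear pattern described above can be sketched like this, in plain Rust with the VM_WRITE/VM_MAYWRITE bit values hard-coded for illustration (the real code would use the bindings' constants):

```rust
const VM_WRITE: usize = 0x0000_0002;
const VM_MAYWRITE: usize = 0x0000_0020;

/// Sketch of the mmap-time pattern: refuse a writable mapping outright,
/// and clear VM_MAYWRITE so mprotect() cannot make it writable later.
fn restrict_writable(flags: usize) -> Result<usize, &'static str> {
    if flags & VM_WRITE != 0 {
        return Err("writable mapping not allowed");
    }
    // VM_WRITE is clear; also drop VM_MAYWRITE to make that permanent.
    Ok(flags & !VM_MAYWRITE)
}
```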

>> +    /// Maps a single page at the given address within the virtual memory area.
>> +    ///
>> +    /// This operation does not take ownership of the page.
>> +    #[inline]
>> +    pub fn vm_insert_page(&self, address: usize, page: &Page) -> Result {
>> +        // SAFETY: By the type invariants, the caller holds the mmap write lock, so this access is
>> +        // not a data race. The page is guaranteed to be valid and of order 0. The range of
>> +        // `address` is already checked by `vm_insert_page`.
>> +        to_result(unsafe { bindings::vm_insert_page(self.as_ptr(), address as _, page.as_ptr()) })
>
> vm_insert_page() has a kinda weird contract because there are two
> contexts from which you can call it cleanly:
>
> 1. You can call it on a VmAreaRef (just like zap_page_range_single(),
> you only need to hold an mmap read lock or VMA read lock, no write
> lock is required); however, you must ensure that the VMA is already
> marked VM_MIXEDMAP. This is the API contract under which you'd call
> this from a fault handler.
>
> 2. You can call it on a VmAreaNew (and it will take care of setting
> VM_MIXEDMAP for you).
>
> I think nothing would immediately crash if you called vm_insert_page()
> on a VMA that does not yet have VM_MIXEDMAP while holding the mmap
> lock in write mode; but that would permit weird scenarios where you
> could, for example, have a userfaultfd context associated with a VMA
> which becomes VM_MIXEDMAP, while normally you can't attach userfaultfd
> contexts to VM_MIXEDMAP VMAs. I don't know if that actually leads to
> anything bad, but it would be a weird state that probably shouldn't be
> permitted.
>
> There are also safety requirements for the page being installed, but I
> guess the checks that vm_insert_page() already does via
> validate_page_before_insert() might be enough to make this safe...

One way to handle this is to make a VmAreaRef::check_mixedmap that
returns a VmAreaMixedMapRef after checking the flag. That type can then
have a vm_insert_page method.
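That typestate idea could look roughly like the following, modeled here with plain-Rust stand-ins rather than the real kernel bindings (RawVma, the stub vm_insert_page, and the wrapper types are all invented for the sketch; only the VM_MIXEDMAP bit value matches the kernel's):

```rust
const VM_MIXEDMAP: usize = 0x1000_0000;

// Stand-in for the raw C vm_area_struct.
struct RawVma {
    vm_flags: usize,
}

#[repr(transparent)]
struct VmAreaRef(RawVma);

#[repr(transparent)]
struct VmAreaMixedMapRef(RawVma);

impl VmAreaRef {
    /// Hands out the more capable reference only if VM_MIXEDMAP is set,
    /// so vm_insert_page is unreachable on other vmas.
    fn check_mixedmap(&self) -> Option<&VmAreaMixedMapRef> {
        if self.0.vm_flags & VM_MIXEDMAP != 0 {
            // SAFETY (in the real code): both types are transparent
            // wrappers over the same raw struct, so the cast is sound.
            Some(unsafe { &*(self as *const VmAreaRef as *const VmAreaMixedMapRef) })
        } else {
            None
        }
    }
}

impl VmAreaMixedMapRef {
    /// Stub standing in for the real vm_insert_page binding.
    fn vm_insert_page(&self, _address: usize) -> Result<(), i32> {
        Ok(())
    }
}
```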

As for VmAreaNew, I'm not sure we should have it there. If we're not
careful, it would be a way to set VM_MIXEDMAP on something that already
has the VM_PFNMAP flag. We can probably just tell you to set VM_MIXEDMAP
directly and then go through the method on VmAreaRef.

>> +impl VmAreaNew {
>> +    /// Access a virtual memory area given a raw pointer.
>> +    ///
>> +    /// # Safety
>> +    ///
>> +    /// Callers must ensure that `vma` is valid for the duration of 'a, and that the vma is
>> +    /// currently undergoing initial VMA setup.
>> +    #[inline]
>> +    pub unsafe fn from_raw<'a>(vma: *const bindings::vm_area_struct) -> &'a Self {
>> +        // SAFETY: The caller ensures that the invariants are satisfied for the duration of 'a.
>> +        unsafe { &*vma.cast() }
>> +    }
>> +
>> +    /// Make this vma anonymous.
>> +    #[inline]
>> +    pub fn set_anonymous(&self) {
>> +        // SAFETY: The vma is not yet shared.
>> +        unsafe { (*self.as_ptr()).vm_ops = core::ptr::null() };
>> +    }
>
> What would you use this for? I don't think drivers are supposed to
> create anonymous VMAs. (Also, if you called this after inserting some
> non-anonymous pages into the VMA, that might confuse some core MM
> code, though I don't know if it can actually cause anything really
> bad.)

Hmm, I don't think I need it anymore. I can just drop it. But I'm
probably missing a way to set a custom vm_ops with your own
open/close/fault methods.
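For reference, that missing piece could be shaped as a trait along these lines; everything below is invented for illustration and does not exist in the kernel crate:

```rust
/// Invented trait capturing the open/close/fault hooks a custom vm_ops
/// abstraction might expose to Rust drivers.
trait VmOperations {
    fn open(&self) {}
    fn close(&self) {}
    /// Resolve a fault at page `offset`: return the index of the backing
    /// page to install, or a negative errno.
    fn fault(&self, offset: usize) -> Result<usize, i32>;
}

/// Toy backing store: a fixed pool of pages faulted in lazily.
struct ToyDriver {
    num_pages: usize,
}

impl VmOperations for ToyDriver {
    fn fault(&self, offset: usize) -> Result<usize, i32> {
        if offset < self.num_pages {
            Ok(offset) // page `offset` backs VMA page `offset`
        } else {
            Err(-14) // out of range: -EFAULT
        }
    }
}
```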

Alice
Jann Horn Nov. 11, 2024, 4:50 p.m. UTC | #6
On Mon, Nov 11, 2024 at 12:41 PM Alice Ryhl <aliceryhl@google.com> wrote:
> Thanks a lot for your review!
>
> Note that there is a v7 already:
> https://lore.kernel.org/all/20241014-vma-v7-0-01e32f861195@google.com/

Yeah, oops, somehow I only realized I was looking at an older version
of the series after sending my mail...

> On Fri, Nov 8, 2024 at 8:02 PM Jann Horn <jannh@google.com> wrote:
> > On Thu, Oct 10, 2024 at 2:56 PM Alice Ryhl <aliceryhl@google.com> wrote:
> >> These abstractions allow you to manipulate vmas. Rust Binder will use
> >> these in a few different ways.
> >>
> >> In the mmap implementation, a VmAreaNew will be provided to the mmap
> >> call which allows it to modify the vma in ways that are only okay during
> >> initial setup. This is the case where the most methods are available.
> >>
> >> However, Rust Binder needs to insert and remove pages from the vma as
> >> time passes. When incoming messages arrive, pages may need to be
> >> inserted if space is missing, and in this case that is done by using a
> >> stashed ARef<Mm> and calling mmget_not_zero followed by mmap_write_lock
> >> followed by vma_lookup followed by vm_insert_page. In this case, since
> >> mmap_write_lock is used, the VmAreaMut type will be in use.
> >
> > FYI, the way the C binder implementation uses vma_lookup() and
> > vm_insert_page() is not very idiomatic. The standard way of
> > implementing binder_alloc_free_page() would be to use something like
> > unmap_mapping_range() instead of using
> > vma_lookup()+zap_page_range_single(); though to set that up, you'd
> > have to create one inode per binder file, maybe with something like
> > the DRM subsystem's drm_fs_inode_new(). And instead of having
> > binder_install_single_page(), the standard way would be to let
> > binder_vm_fault() install PTEs lazily on fault. That way you'd never
> > have to take mmap locks or grab MM references yourself.
>
> Would that actually work? All writes happen from kernel space, so it's
> not clear to me that we can trigger the fault handler from there. We
> can't use copy_to_user because the write happens from a different
> process than the one that owns the vma.
>
> That said, one alternative is to let Binder store an array of `struct
> page` and just write directly to those pages when writing from kernel
> space. Then binder_vm_fault() can look up the relevant page from the
> array when userspace attempts to read the data.

Right, that's what I was thinking of - keep allocating pages at the
same point in time when binder currently does it, only defer mapping
them into userspace.

> Though we could also
> just vm_insert_page() right before returning the address to userspace,
> since we know that userspace will read it right away after the syscall
> returns.

I think that is basically what binder does now?

> > I guess the one benefit of the current implementation of C binder over
> > the more idiomatic approach is that it avoids taking a fault after
> > binder_install_buffer_pages(), but that's probably not such a hot path
> > that it matters...
>
> Saying "probably not a hot path" is dangerous when it comes to binder's
> allocator. :)

Each page installed by binder_install_buffer_pages() has to be torn
down by either binder's shrinker callback or by userspace unmapping
binder VMAs, right? So for this to be a hot path, the system would
have to either constantly create and destroy new binder FDs, or
execute the shrinker at a fairly high frequency while lots of binder
clients perform infrequent binder transactions?

> > Let me know if you think it would be helpful to see a prototype of
> > that in C - I think I can cobble that together, but doing it nicely
> > will require some work to convert at least some of the binder_alloc
> > locking to mutexes, and some more work to switch the ->f_mapping of
> > the binder file or something like that. (I guess modeling that in Rust
> > might be a bit of a pain though...)
>
> It would be useful to hear about what the advantages of
> unmap_mapping_range() are. I don't need a full prototype, I should be
> able to understand given a rough description of what the required
> changes are.

The nice part is that basically, if you have a pointer to the file or
the inode, you can just do something like the following to zap a PTE:

unmap_mapping_range(file->f_mapping, index, 1, 1);

You don't have to take any locks yourself to make that work, you don't
even have to hold a reference to the mm_struct or anything like that,
and you don't need any of that logic you currently have in the C
binder driver to look up the VMA by address and revalidate it is still
the VMA you expect. The MM subsystem will automatically iterate
through all VMAs that overlap the specified range of the file's
address_space, and it will zap PTEs in the specified range (and it
even works fine if the VMAs have been moved or split or exist in
multiple processes or stuff like that). It's a similar story on the
allocation path. The only locks you need to explicitly take are
whatever locks the driver internally requires.

Going through the fault handler and/or the address_space for
installing/removing PTEs, instead of using vma_lookup(), is also safer
because it implicitly ensures you can only operate on your own
driver's VMAs. From a glance at the Rust binder RFC
(https://lore.kernel.org/all/20231101-rust-binder-v1-19-08ba9197f637@google.com/),
it looks like use_page_slow() just looks up the VMA at the expected
address and calls insert_page() on it. I don't immediately see
anything in the Rust Binder RFC that would prevent calling
insert_page() on a newly created VMA of a different type that has
appeared at the same address - which could cause the page to
inadvertently become writable by userspace (if the new VMA is
writable), could cause refcounted pages to be installed in VM_PFNMAP
regions that are supposed to only contain non-refcounted entries,
could probably cause type confusion when trying to install 4K pages in
hugetlb regions that can't contain 4K pages, and so on.

Though I just realized, you're only doing this in the shrinker
callback where you're not supposed to sleep, but unmap_mapping_range()
requires sleeping locks. So I guess directly using
unmap_mapping_range() wouldn't work so well. I guess one way to
address that could be to add a trylock version of
unmap_mapping_range().

Another more fancy solution to that would be to stop using shrinkers
entirely, and instead make binder pages show up as clean file pages on
the kernel's page allocation LRU lists, and let the kernel take care
of removing page mappings - basically similar to how reclaim works for
normal file pages. I'm a bit fuzzy on how this area of the kernel
works exactly; one option might be to do something similar to
aio_private_file(), creating a new anonymous inode - but unlike in
AIO, you'd then install that anonymous inode's address_space as the
i_mapping of the existing file in binder_open(), rather than creating
a new file. Then you could pin the inode's pages into a page pointer
array in the kernel (sort of like in aio_setup_ring(), except you'd
only do it on demand in binder_install_buffer_pages()), and then have
a "release_folio" operation in your address_space_operations that
drops your page reference if the page is currently unused. At that
point, the driver wouldn't really be participating in creating or
removing userspace PTEs at all, the kernel would mostly manage it for
you, except that you'd have to tell the kernel when it is okay to
reclaim pages, or something like that.

> >> +    /// Sets the flags associated with the virtual memory area.
> >> +    ///
> >> +    /// The possible flags are a combination of the constants in [`flags`].
> >> +    #[inline]
> >> +    pub fn set_flags(&self, flags: usize) {
> >> +        // SAFETY: By the type invariants, the caller holds the mmap write lock, so this access is
> >> +        // not a data race.
> >> +        unsafe { (*self.as_ptr()).__bindgen_anon_2.vm_flags = flags as _ };
> >
> > It can be a data race, for example you could race with pte_mkwrite()
> > called from remove_migration_pte(). (But that's an existing issue, the
> > same data race currently occurs with existing vm_flags updates in
> > other core kernel paths.)
> >
> > Separately from that: If you want to update the flags of existing
> > VMAs, you need to call vma_start_write() first. The vm_flags_set()
> > helper is designed to enforce this automatically.
> >
> > But really I think drivers normally don't change the flags of existing
> > VMAs anyway, only the flags of newly created VMAs; only core MM code
> > should be changing the flags of existing VMAs. You should maybe move
> > this method over into VmAreaNew.
>
> Yeah. Based on what you're saying, moving this method to VmAreaNew
> sounds like a good idea.
>
> Just to confirm, vma_start_write() is not necessary if I move it to
> VmAreaNew?

Right, the MM subsystem will have already internally called
vma_start_write() on the VMA by the time the ->mmap handler is
called.

> > There are ways to set VMA flags that are dangerous and invalid; I
> > think VM_PFNMAP|VM_MIXEDMAP would be an example (though that might not
> > actually lead to any actual bad consequences with the way the code is
> > currently written). Clearing VM_MIXEDMAP after calling
> > vm_insert_page() might be another one.
> >
> > VM_HUGETLB is also a flag you really shouldn't be touching from driver
> > code, that would cause type confusion through a bad cast in
> > unmap_single_vma -> __unmap_hugepage_range -> hstate_vma ->
> > hstate_file -> hstate_inode.
> >
> > I'm not sure how safe against programmer error this is supposed to be;
> > if you want the API to be safe even against silly blunders, I think
> > you'd have to at least allowlist which bits the driver may clear and
> > which bits it may set, and additionally deny at least the bad
> > combination VM_MIXEDMAP|VM_PFNMAP. Though you might want to introduce
> > higher-level helpers instead, so that a driver author doesn't blunder
> > by checking VM_WRITE without also clearing VM_MAYWRITE, which is a
> > somewhat common mistake that IIRC has led to several security bugs in
> > the past.
>
> In general, the goal is to write an abstraction such that you cannot
> trigger memory safety issues, even if triggering it involves a silly
> blunder. But only memory safety issues *must* be prevented; other issues
> such as BUG_ON or deadlocks or resource leaks need not be prevented,
> though those are still a nice-to-have when possible.

Ah, thanks for the context.

(I think in the Linux kernel, you might theoretically be able to cause
memory safety issues by deadlocking in particular ways - if you are
running on a CONFIG_PREEMPT kernel without panic_on_warn set, and
you're in a non-preemptible context because you're holding a spinlock
or such, and then you sleep because you try to wait on a mutex or
kmalloc() with GFP_KERNEL, the scheduler will print a "scheduling
while atomic" error message and then try to recover from that
situation by resetting the preempt_count and scheduling anyway. I
think that theoretically breaks the RCU read-side critical sections
normally implied by spinlocks once you return to the broken context,
though IIRC this situation will get fixed up at the next syscall
return or something along those lines. I haven't tried this though.)

I think the MM subsystem is a place where logic issues and memory
safety issues are closer to each other than in most other parts of the
kernel, since much of the logic in the MM subsystem is responsible for
managing memory...

> > I think the reasonable ways in which drivers might want to change
> > flags on a VmAreaNew are:
> >
> > 1. clear VM_MAYWRITE after checking VM_WRITE is unset
> > 2. (you could do the same with the VM_EXEC/VM_MAYEXEC bits)
> > 3. set VM_PFNMAP (only valid if not VM_MIXEDMAP and no PTEs have been
> > installed yet)
> > 4. set VM_MIXEDMAP (only valid if not VM_PFNMAP)
> > 5. set VM_IO/VM_DONTEXPAND
> > 6. set VM_DONTCOPY/VM_DONTDUMP (only permanently effective if VM_IO is also set)
>
> Okay, thanks for the overview. Comparing with the drivers I'm familiar
> with, they do the following:
>
> Ashmem: Clear VM_MAYREAD/VM_MAYWRITE/VM_MAYEXEC after checking
> VM_READ/VM_WRITE/VM_EXEC. This is allowed by what you say. (You did not
> mention VM_READ but I assume the same applies here.)

Yeah, I think VM_READ should work the same way. (With the caveat that
if you tried to create a mapping that can't be read but can be
written, or can't be read but can be executed, the restriction on read
access might be ineffective. In particular, arch-specific code for
mapping VMA access configuration to page table bits might silently
ignore such a restriction because the architecture has no way to
express write-only/execute-only access in the page table bits. I
imagine other kernel code might also have the assumption baked in that
at least write access implies read access.)

> Binder: Clears VM_MAYWRITE after checking VM_WRITE. Ok. Binder also sets
> the flags VM_DONTCOPY and VM_MIXEDMAP. Is a VM_PFNMAP check missing? But
> I guess that flag only gets set if you do so yourself, so the check is
> probably not necessary?

Yeah, VM_PFNMAP only gets set if you do that yourself (or if you call
remap_pfn_range(), which will do that for you).

> >> +    /// Maps a single page at the given address within the virtual memory area.
> >> +    ///
> >> +    /// This operation does not take ownership of the page.
> >> +    #[inline]
> >> +    pub fn vm_insert_page(&self, address: usize, page: &Page) -> Result {
> >> +        // SAFETY: By the type invariants, the caller holds the mmap write lock, so this access is
> >> +        // not a data race. The page is guaranteed to be valid and of order 0. The range of
> >> +        // `address` is already checked by `vm_insert_page`.
> >> +        to_result(unsafe { bindings::vm_insert_page(self.as_ptr(), address as _, page.as_ptr()) })
> >
> > vm_insert_page() has a kinda weird contract because there are two
> > contexts from which you can call it cleanly:
> >
> > 1. You can call it on a VmAreaRef (just like zap_page_range_single(),
> > you only need to hold an mmap read lock or VMA read lock, no write
> > lock is required); however, you must ensure that the VMA is already
> > marked VM_MIXEDMAP. This is the API contract under which you'd call
> > this from a fault handler.
> >
> > 2. You can call it on a VmAreaNew (and it will take care of setting
> > VM_MIXEDMAP for you).
> >
> > I think nothing would immediately crash if you called vm_insert_page()
> > on a VMA that does not yet have VM_MIXEDMAP while holding the mmap
> > lock in write mode; but that would permit weird scenarios where you
> > could, for example, have a userfaultfd context associated with a VMA
> > which becomes VM_MIXEDMAP, while normally you can't attach userfaultfd
> > contexts to VM_MIXEDMAP VMAs. I don't know if that actually leads to
> > anything bad, but it would be a weird state that probably shouldn't be
> > permitted.
> >
> > There are also safety requirements for the page being installed, but I
> > guess the checks that vm_insert_page() already does via
> > validate_page_before_insert() might be enough to make this safe...
>
> One way to handle this is to make a VmAreaRef::check_mixedmap that
> returns a VmAreaMixedMapRef after checking the flag. That type can then
> have a vm_insert_page method.

Sounds reasonable.

> As for VmAreaNew, I'm not sure we should have it there. If we're not
> careful, it would be a way to set VM_MIXEDMAP on something that already
> has the VM_PFNMAP flag. We can probably just tell you to set VM_MIXEDMAP
> directly and then go through the method on VmAreaRef.

Makes sense.

I guess one tricky part is that it might be bad if you could
Alice Ryhl Nov. 12, 2024, 12:12 p.m. UTC | #7
On Mon, Nov 11, 2024 at 5:51 PM Jann Horn <jannh@google.com> wrote:
>
> On Mon, Nov 11, 2024 at 12:41 PM Alice Ryhl <aliceryhl@google.com> wrote:
> > Thanks a lot for your review!
> >
> > Note that there is a v7 already:
> > https://lore.kernel.org/all/20241014-vma-v7-0-01e32f861195@google.com/
>
> Yeah, oops, somehow I only realized I was looking at an older version
> of the series after sending my mail...
>
> > On Fri, Nov 8, 2024 at 8:02 PM Jann Horn <jannh@google.com> wrote:
> > > On Thu, Oct 10, 2024 at 2:56 PM Alice Ryhl <aliceryhl@google.com> wrote:
> > >> These abstractions allow you to manipulate vmas. Rust Binder will use
> > >> these in a few different ways.
> > >>
> > >> In the mmap implementation, a VmAreaNew will be provided to the mmap
> > >> call which allows it to modify the vma in ways that are only okay during
> > >> initial setup. This is the case where the most methods are available.
> > >>
> > >> However, Rust Binder needs to insert and remove pages from the vma as
> > >> time passes. When incoming messages arrive, pages may need to be
> > >> inserted if space is missing, and in this case that is done by using a
> > >> stashed ARef<Mm> and calling mmget_not_zero followed by mmap_write_lock
> > >> followed by vma_lookup followed by vm_insert_page. In this case, since
> > >> mmap_write_lock is used, the VmAreaMut type will be in use.
> > >
> > > FYI, the way the C binder implementation uses vma_lookup() and
> > > vm_insert_page() is not very idiomatic. The standard way of
> > > implementing binder_alloc_free_page() would be to use something like
> > > unmap_mapping_range() instead of using
> > > vma_lookup()+zap_page_range_single(); though to set that up, you'd
> > > have to create one inode per binder file, maybe with something like
> > > the DRM subsystem's drm_fs_inode_new(). And instead of having
> > > binder_install_single_page(), the standard way would be to let
> > > binder_vm_fault() install PTEs lazily on fault. That way you'd never
> > > have to take mmap locks or grab MM references yourself.
> >
> > Would that actually work? All writes happen from kernel space, so it's
> > not clear to me that we can trigger the fault handler from there. We
> > can't use copy_to_user because the write happens from a different
> > process than the one that owns the vma.
> >
> > That said, one alternative is to let Binder store an array of `struct
> > page` and just write directly to those pages when writing from kernel
> > space. Then binder_vm_fault() can look up the relevant page from the
> > array when userspace attempts to read the data.
>
> Right, that's what I was thinking of - keep allocating pages at the
> same point in time when binder currently does it, only defer mapping
> them into userspace.
>
> > Though we could also
> > just vm_insert_page() right before returning the address to userspace,
> > since we know that userspace will read it right away after the syscall
> > returns.
>
> I think that is basically what binder does now?

Right now binder calls vm_insert_page() right after calling
alloc_page(), which means that vm_insert_page() happens in the process
sending the message. But we could delay the call so that it happens in
the process that receives the message instead (which is also the
process that owns the mapping).

> > > I guess the one benefit of the current implementation of C binder over
> > > the more idiomatic approach is that it avoids taking a fault after
> > > binder_install_buffer_pages(), but that's probably not such a hot path
> > > that it matters...
> >
> > Saying "probably not a hot path" is dangerous when it comes to binder's
> > allocator. :)
>
> Each page installed by binder_install_buffer_pages() has to be torn
> down by either binder's shrinker callback or by userspace unmapping
> binder VMAs, right? So for this to be a hot path, the system would
> have to either constantly create and destroy new binder FDs, or
> execute the shrinker at a fairly high frequency while lots of binder
> clients perform infrequent binder transactions?

I think the situation where the phone is under major memory pressure
and runs the shrinker all the time can be rather subtle.

> > > Let me know if you think it would be helpful to see a prototype of
> > > that in C - I think I can cobble that together, but doing it nicely
> > > will require some work to convert at least some of the binder_alloc
> > > locking to mutexes, and some more work to switch the ->f_mapping of
> > > the binder file or something like that. (I guess modeling that in Rust
> > > might be a bit of a pain though...)
> >
> > It would be useful to hear about what the advantages of
> > unmap_mapping_range() are. I don't need a full prototype, I should be
> > able to understand given a rough description of what the required
> > changes are.
>
> The nice part is that basically, if you have a pointer to the file or
> the inode, you can just do something like the following to zap a PTE:
>
> unmap_mapping_range(file->f_mapping, index, 1, 1);
>
> You don't have to take any locks yourself to make that work, you don't
> even have to hold a reference to the mm_struct or anything like that,
> and you don't need any of that logic you currently have in the C
> binder driver to look up the VMA by address and revalidate it is still
> the VMA you expect. The MM subsystem will automatically iterate
> through all VMAs that overlap the specified range of the file's
> address_space, and it will zap PTEs in the specified range (and it
> even works fine if the VMAs have been moved or split or exist in
> multiple processes or stuff like that). It's a similar story on the
> allocation path. The only locks you need to explicitly take are
> whatever locks the driver internally requires.
>
> Going through the fault handler and/or the address_space for
> installing/removing PTEs, instead of using vma_lookup(), is also safer
> because it implicitly ensures you can only operate on your own
> driver's VMAs. From a glance at the Rust binder RFC
> (https://lore.kernel.org/all/20231101-rust-binder-v1-19-08ba9197f637@google.com/),
> it looks like use_page_slow() just looks up the VMA at the expected
> address and calls insert_page() on it. I don't immediately see
> anything in the Rust Binder RFC that would prevent calling
> insert_page() on a newly created VMA of a different type that has
> appeared at the same address - which could cause the page to
> inadvertently become writable by userspace (if the new VMA is
> writable), could cause refcounted pages to be installed in VM_PFNMAP
> regions that are supposed to only contain non-refcounted entries,
> could probably cause type confusion when trying to install 4K pages in
> hugetlb regions that can't contain 4K pages, and so on.

Right ... I guess I'm missing an equivalent to binder_vma_close to
ensure that once the vma is closed, we don't try to insert pages.

In my previous email I suggested enforcing that vm_insert_page is only
called on MIXEDMAP vmas. I guess that would prevent the type confusion
you mention, but it still seems dangerous ... some other driver might
store special data in the private data of the pages in the new vma, and
if you insert a random page there, there could maybe be type confusion
on the private data in the `struct page`?

> Though I just realized, you're only doing this in the shrinker
> callback where you're not supposed to sleep, but unmap_mapping_range()
> requires sleeping locks. So I guess directly using
> unmap_mapping_range() wouldn't work so well. I guess one way to
> address that could be to add a trylock version of
> unmap_mapping_range().
>
> Another more fancy solution to that would be to stop using shrinkers
> entirely, and instead make binder pages show up as clean file pages on
> the kernel's page allocation LRU lists, and let the kernel take care
> of removing page mappings - basically similar to how reclaim works for
> normal file pages. I'm a bit fuzzy on how this area of the kernel
> works exactly; one option might be to do something similar to
> aio_private_file(), creating a new anonymous inode - but unlike in
> AIO, you'd then install that anonymous inode's address_space as the
> i_mapping of the existing file in binder_open(), rather than creating
> a new file. Then you could pin the inode's pages into a page pointer
> array in the kernel (sort of like in aio_setup_ring(), except you'd
> only do it on demand in binder_install_buffer_pages()), and then have
> a "release_folio" operation in your address_space_operations that
> drops your page reference if the page is currently unused. At that
> point, the driver wouldn't really be participating in creating or
> removing userspace PTEs at all, the kernel would mostly manage it for
> you, except that you'd have to tell the kernel when it is okay to
> reclaim pages, or something like that.

Whether it's okay to reclaim a given page can flip-flop very quickly
under some workflows. Changing that setting has to be a pretty fast
operation.

> > >> +    /// Sets the flags associated with the virtual memory area.
> > >> +    ///
> > >> +    /// The possible flags are a combination of the constants in [`flags`].
> > >> +    #[inline]
> > >> +    pub fn set_flags(&self, flags: usize) {
> > >> +        // SAFETY: By the type invariants, the caller holds the mmap write lock, so this access is
> > >> +        // not a data race.
> > >> +        unsafe { (*self.as_ptr()).__bindgen_anon_2.vm_flags = flags as _ };
> > >
> > > It can be a data race, for example you could race with pte_mkwrite()
> > > called from remove_migration_pte(). (But that's an existing issue, the
> > > same data race currently occurs with existing vm_flags updates in
> > > other core kernel paths.)
> > >
> > > Separately from that: If you want to update the flags of existing
> > > VMAs, you need to call vma_start_write() first. The vm_flags_set()
> > > helper is designed to enforce this automatically.
> > >
> > > But really I think drivers normally don't change the flags of existing
> > > VMAs anyway, only the flags of newly created VMAs; only core MM code
> > > should be changing the flags of existing VMAs. You should maybe move
> > > this method over into VmAreaNew.
> >
> > Yeah. Based on what you're saying, moving this method to VmAreaNew
> > sounds like a good idea.
> >
> > Just to confirm, vma_start_write() is not necessary if I move it to
> > VmAreaNew?
>
> Right, the MM subsystem will have already internally called
> vma_start_write() on the VMA by that time the ->mmap handler is
> called.
>
> > > There are ways to set VMA flags that are dangerous and invalid; I
> > > think VM_PFNMAP|VM_MIXEDMAP would be an example (though that might not
> > > actually lead to any actual bad consequences with the way the code is
> > > currently written). Clearing VM_MIXEDMAP after calling
> > > vm_insert_page() might be another one.
> > >
> > > VM_HUGETLB is also a flag you really shouldn't be touching from driver
> > > code, that would cause type confusion through a bad cast in
> > > unmap_single_vma -> __unmap_hugepage_range -> hstate_vma ->
> > > hstate_file -> hstate_inode.
> > >
> > > I'm not sure how safe against programmer error this is supposed to be;
> > > if you want the API to be safe even against silly blunders, I think
> > > you'd have to at least allowlist which bits the driver may clear and
> > > which bits it may set, and additionally deny at least the bad
> > > combination VM_MIXEDMAP|VM_PFNMAP. Though you might want to introduce
> > > higher-level helpers instead, so that a driver author doesn't blunder
> > > by checking VM_WRITE without also clearing VM_MAYWRITE, which is a
> > > somewhat common mistake that IIRC has led to several security bugs in
> > > the past.
> >
> > In general, the goal is to write an abstraction such that you cannot
> > trigger memory safety issues, even if triggering it involves a silly
> > blunder. But only memory safety issues *must* be prevented; other issues
> > such as BUG_ON or deadlocks or resource leaks need not be prevented,
> > though those are still a nice-to-have when possible.
>
> Ah, thanks for the context.
>
> (I think in the Linux kernel, you might theoretically be able to cause
> memory safety issues by deadlocking in particular ways - if you are
> running on a CONFIG_PREEMPT kernel without panic_on_warn set, and
> you're in a non-preemptible context because you're holding a spinlock
> or such, and then you sleep because you try to wait on a mutex or
> kmalloc() with GFP_KERNEL, the scheduler will print a "scheduling
> while atomic" error message and then try to recover from that
> situation by resetting the preempt_count and scheduling anyway. I
> think that theoretically breaks the RCU read-side critical sections
> normally implied by spinlocks once you return to the broken context,
> though IIRC this situation will get fixed up at the next syscall
> return or something along those lines. I haven't tried this though.)

Sleeping in atomic context is a known area of concern. We're working
on it. For now, just assume that it's one of the allowed bad things.
Eventually we would like to handle it properly with this tool:
https://www.memorysafety.org/blog/gary-guo-klint-rust-tools/

> I think the MM subsystem is a place where logic issues and memory
> safety issues are closer to each other than in most other parts of the
> kernel, since much of the logic in the MM subsystem is responsible for
> managing memory...

Agreed.

> > > I think the reasonable ways in which drivers might want to change
> > > flags on a VmAreaNew are:
> > >
> > > 1. clear VM_MAYWRITE after checking VM_WRITE is unset
> > > 2. (you could do the same with the VM_EXEC/VM_MAYEXEC bits)
> > > 3. set VM_PFNMAP (only valid if not VM_MIXEDMAP and no PTEs have been
> > > installed yet)
> > > 4. set VM_MIXEDMAP (only valid if not VM_PFNMAP)
> > > 5. set VM_IO/VM_DONTEXPAND
> > > 6. set VM_DONTCOPY/VM_DONTDUMP (only permanently effective if VM_IO is also set)
> >
> > Okay, thanks for the overview. Comparing with the drivers I'm familiar
> > with, they do the following:
> >
> > Ashmem: Clear VM_MAYREAD/VM_MAYWRITE/VM_MAYEXEC after checking
> > VM_READ/VM_WRITE/VM_EXEC. This is allowed by what you say. (You did not
> > mention VM_READ but I assume the same applies here.)
>
> Yeah, I think VM_READ should work the same way. (With the caveat that
> if you tried to create a mapping that can't be read but can be
> written, or can't be read but can be executed, the restriction on read
> access might be ineffective. In particular, arch-specific code for
> mapping VMA access configuration to page table bits might silently
> ignore such a restriction because the architecture has no way to
> express write-only/execute-only access in the page table bits. I
> imagine other kernel code might also have the assumption baked in that
> at least write access implies read access.)

Got it, makes sense.

> > Binder: Clears VM_MAYWRITE after checking VM_WRITE. Ok. Binder also sets
> > the flags VM_DONTCOPY and VM_MIXEDMAP. Is a VM_PFNMAP check missing? But
> > I guess that flag only gets set if you do so yourself, so the check is
> > probably not necessary?
>
> Yeah, VM_PFNMAP only gets set if you do that yourself (or if you call
> remap_pfn_range(), which will do that for you).

Got it.

> > >> +    /// Maps a single page at the given address within the virtual memory area.
> > >> +    ///
> > >> +    /// This operation does not take ownership of the page.
> > >> +    #[inline]
> > >> +    pub fn vm_insert_page(&self, address: usize, page: &Page) -> Result {
> > >> +        // SAFETY: By the type invariants, the caller holds the mmap write lock, so this access is
> > >> +        // not a data race. The page is guaranteed to be valid and of order 0. The range of
> > >> +        // `address` is already checked by `vm_insert_page`.
> > >> +        to_result(unsafe { bindings::vm_insert_page(self.as_ptr(), address as _, page.as_ptr()) })
> > >
> > > vm_insert_page() has a kinda weird contract because there are two
> > > contexts from which you can call it cleanly:
> > >
> > > 1. You can call it on a VmAreaRef (just like zap_page_range_single(),
> > > you only need to hold an mmap read lock or VMA read lock, no write
> > > lock is required); however, you must ensure that the VMA is already
> > > marked VM_MIXEDMAP. This is the API contract under which you'd call
> > > this from a fault handler.
> > >
> > > 2. You can call it on a VmAreaNew (and it will take care of setting
> > > VM_MIXEDMAP for you).
> > >
> > > I think nothing would immediately crash if you called vm_insert_page()
> > > on a VMA that does not yet have VM_MIXEDMAP while holding the mmap
> > > lock in write mode; but that would permit weird scenarios where you
> > > could, for example, have a userfaultfd context associated with a VMA
> > > which becomes VM_MIXEDMAP, while normally you can't attach userfaultfd
> > > contexts to VM_MIXEDMAP VMAs. I don't think that actually leads to
> > > anything bad, but it would be a weird state that probably shouldn't be
> > > permitted.
> > >
> > > There are also safety requirements for the page being installed, but I
> > > guess the checks that vm_insert_page() already does via
> > > validate_page_before_insert() might be enough to make this safe...
> >
> > One way to handle this is to make a VmAreaRef::check_mixedmap that
> > returns a VmAreaMixedMapRef after checking the flag. That type can then
> > have a vm_insert_page method.
>
> Sounds reasonable.
>
> > As for VmAreaNew, I'm not sure we should have it there. If we're not
> > careful, it would be a way to set VM_MIXEDMAP on something that already
> > has the VM_PFNMAP flag. We can probably just tell you to set VM_MIXEDMAP
> > directly and then go through the method on VmAreaRef.
>
> Makes sense.
>
> I guess one tricky part is that it might be bad if you could

Seems like this sentence is incomplete?

Alice
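[Editor's note: the flag-update allowlist discussed above (clear VM_MAYWRITE only after checking VM_WRITE is unset; never combine VM_MIXEDMAP with VM_PFNMAP) can be modeled as a userspace Rust sketch. All names, bit values, and helper methods here are invented for illustration and are not the kernel's real definitions or the patch's actual API.]

```rust
// Illustrative userspace model of flag-update rules for a fresh VMA.
// Bit values are stand-ins, not the kernel's real VM_* constants.
const VM_WRITE: usize = 1 << 0;
const VM_MAYWRITE: usize = 1 << 1;
const VM_PFNMAP: usize = 1 << 2;
const VM_MIXEDMAP: usize = 1 << 3;

struct VmAreaNew {
    flags: usize,
}

impl VmAreaNew {
    /// Clears VM_MAYWRITE, but only when VM_WRITE is already unset,
    /// mirroring rule 1 from the thread's allowlist.
    fn try_clear_maywrite(&mut self) -> Result<(), ()> {
        if self.flags & VM_WRITE != 0 {
            return Err(());
        }
        self.flags &= !VM_MAYWRITE;
        Ok(())
    }

    /// Sets VM_MIXEDMAP, rejecting the invalid VM_MIXEDMAP|VM_PFNMAP combo
    /// the thread calls out as dangerous.
    fn try_set_mixedmap(&mut self) -> Result<(), ()> {
        if self.flags & VM_PFNMAP != 0 {
            return Err(());
        }
        self.flags |= VM_MIXEDMAP;
        Ok(())
    }
}

fn main() {
    // Read-only mapping: clearing MAYWRITE is allowed.
    let mut ro = VmAreaNew { flags: VM_MAYWRITE };
    assert!(ro.try_clear_maywrite().is_ok());
    assert_eq!(ro.flags & VM_MAYWRITE, 0);

    // Writable mapping: the helper refuses to clear MAYWRITE.
    let mut rw = VmAreaNew { flags: VM_WRITE | VM_MAYWRITE };
    assert!(rw.try_clear_maywrite().is_err());

    // PFNMAP already set: MIXEDMAP is rejected.
    let mut pfn = VmAreaNew { flags: VM_PFNMAP };
    assert!(pfn.try_set_mixedmap().is_err());
}
```

The point of the sketch is that returning `Result` from each setter makes the bad combinations unrepresentable without an explicit error path, rather than silently accepting them.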
Jann Horn Nov. 12, 2024, 12:58 p.m. UTC | #8
On Tue, Nov 12, 2024 at 1:12 PM Alice Ryhl <aliceryhl@google.com> wrote:
> On Mon, Nov 11, 2024 at 5:51 PM Jann Horn <jannh@google.com> wrote:
> >
> > On Mon, Nov 11, 2024 at 12:41 PM Alice Ryhl <aliceryhl@google.com> wrote:
> > > Thanks a lot for your review!
> > >
> > > Note that there is a v7 already:
> > > https://lore.kernel.org/all/20241014-vma-v7-0-01e32f861195@google.com/
> >
> > Yeah, oops, somehow I only realized I was looking at an older version
> > of the series after sending my mail...
> >
> > > On Fri, Nov 8, 2024 at 8:02 PM Jann Horn <jannh@google.com> wrote:
> > > > On Thu, Oct 10, 2024 at 2:56 PM Alice Ryhl <aliceryhl@google.com> wrote:
> > > >> These abstractions allow you to manipulate vmas. Rust Binder will use
> > > >> these in a few different ways.
> > > >>
> > > >> In the mmap implementation, a VmAreaNew will be provided to the mmap
> > > >> call which allows it to modify the vma in ways that are only okay during
> > > >> initial setup. This is the case where the most methods are available.
> > > >>
> > > >> However, Rust Binder needs to insert and remove pages from the vma as
> > > >> time passes. When incoming messages arrive, pages may need to be
> > > >> inserted if space is missing, and in this case that is done by using a
> > > >> stashed ARef<Mm> and calling mmget_not_zero followed by mmap_write_lock
> > > >> followed by vma_lookup followed by vm_insert_page. In this case, since
> > > >> mmap_write_lock is used, the VmAreaMut type will be in use.
> > > >
> > > > FYI, the way the C binder implementation uses vma_lookup() and
> > > > vm_insert_page() is not very idiomatic. The standard way of
> > > > implementing binder_alloc_free_page() would be to use something like
> > > > unmap_mapping_range() instead of using
> > > > vma_lookup()+zap_page_range_single(); though to set that up, you'd
> > > > have to create one inode per binder file, maybe with something like
> > > > the DRM subsystem's drm_fs_inode_new(). And instead of having
> > > > binder_install_single_page(), the standard way would be to let
> > > > binder_vm_fault() install PTEs lazily on fault. That way you'd never
> > > > have to take mmap locks or grab MM references yourself.
> > >
> > > Would that actually work? All writes happen from kernel space, so it's
> > > not clear to me that we can trigger the fault handler from there. We
> > > can't use copy_to_user because the write happens from a different
> > > process than the one that owns the vma.
> > >
> > > That said, one alternative is to let Binder store an array of `struct
> > > page` and just write directly to those pages when writing from kernel
> > > space. Then binder_vm_fault() can look up the relevant page from the
> > > array when userspace attempts to read the data.
> >
> > Right, that's what I was thinking of - keep allocating pages at the
> > same point in time when binder currently does it, only defer mapping
> > them into userspace.
> >
> > > Though we could also
> > > just vm_insert_page() right before returning the address to userspace,
> > > since we know that userspace will read it right away after the syscall
> > > returns.
> >
> > I think that is basically what binder does now?
>
> Right now binder calls vm_insert_page() right after calling
> alloc_page(), which means that vm_insert_page() happens in the process
> sending the message. But we could delay the call so that it happens in
> the process that receives the message instead (which is also the
> process that owns the mapping).

Ah, I see. I don't think that would make much of a difference in terms
of the complexity of MM locking, except that you'd save an
mmget_not_zero()...

> > > > Let me know if you think it would be helpful to see a prototype of
> > > > that in C - I think I can cobble that together, but doing it nicely
> > > > will require some work to convert at least some of the binder_alloc
> > > > locking to mutexes, and some more work to switch the ->f_mapping of
> > > > the binder file or something like that. (I guess modeling that in Rust
> > > > might be a bit of a pain though...)
> > >
> > > It would be useful to hear about what the advantages of
> > > unmap_mapping_range() are. I don't need a full prototype, I should be
> > > able to understand given a rough description of what the required
> > > changes are.
> >
> > The nice part is that basically, if you have a pointer to the file or
> > the inode, you can just do something like the following to zap a PTE:
> >
> > unmap_mapping_range(file->f_mapping, index, 1, 1);
> >
> > You don't have to take any locks yourself to make that work, you don't
> > even have to hold a reference to the mm_struct or anything like that,
> > and you don't need any of that logic you currently have in the C
> > binder driver to look up the VMA by address and revalidate it is still
> > the VMA you expect. The MM subsystem will automatically iterate
> > through all VMAs that overlap the specified range of the file's
> > address_space, and it will zap PTEs in the specified range (and it
> > even works fine if the VMAs have been moved or split or exist in
> > multiple processes or stuff like that). It's a similar story on the
> > allocation path. The only locks you need to explicitly take are
> > whatever locks the driver internally requires.
> >
> > Going through the fault handler and/or the address_space for
> > installing/removing PTEs, instead of using vma_lookup(), is also safer
> > because it implicitly ensures you can only operate on your own
> > driver's VMAs. From a glance at the Rust binder RFC
> > (https://lore.kernel.org/all/20231101-rust-binder-v1-19-08ba9197f637@google.com/),
> > it looks like use_page_slow() just looks up the VMA at the expected
> > address and calls insert_page() on it. I don't immediately see
> > anything in the Rust Binder RFC that would prevent calling
> > insert_page() on a newly created VMA of a different type that has
> > appeared at the same address - which could cause the page to
> > inadvertently become writable by userspace (if the new VMA is
> > writable), could cause refcounted pages to be installed in VM_PFNMAP
> > regions that are supposed to only contain non-refcounted entries,
> > could probably cause type confusion when trying to install 4K pages in
> > hugetlb regions that can't contain 4K pages, and so on.
>
> Right ... I guess I'm missing an equivalent to binder_vma_close to
> ensure that once the vma is closed, we don't try to insert pages.

Yeah, I think that would work. (I think there currently is no way to
shrink a VMA without first splitting it, so you should see ->open()
and ->close() invocations when that happens.)

> I gave a suggestion to enforce that vm_insert_page is only called on
> MIXEDMAP vmas in my previous email. I guess that would prevent the
> type confusion you mention, but it still seems dangerous ... you risk
> that some other driver is storing special data in the private data of
> pages in the new vma, and if you insert a random page there, there
> could maybe be type confusion on the private data in the `struct
> page`?

Hmm, yeah, maybe...

> > Though I just realized, you're only doing this in the shrinker
> > callback where you're not supposed to sleep, but unmap_mapping_range()
> > requires sleeping locks. So I guess directly using
> > unmap_mapping_range() wouldn't work so well. I guess one way to
> > address that could be to add a trylock version of
> > unmap_mapping_range().
> >
> > Another more fancy solution to that would be to stop using shrinkers
> > entirely, and instead make binder pages show up as clean file pages on
> > the kernel's page allocation LRU lists, and let the kernel take care
> > of removing page mappings - basically similar to how reclaim works for
> > normal file pages. I'm a bit fuzzy on how this area of the kernel
> > works exactly; one option might be to do something similar to
> > aio_private_file(), creating a new anonymous inode - but unlike in
> > AIO, you'd then install that anonymous inode's address_space as the
> > i_mapping of the existing file in binder_open(), rather than creating
> > a new file. Then you could pin the inode's pages into a page pointer
> > array in the kernel (sort of like in aio_setup_ring(), except you'd
> > only do it on demand in binder_install_buffer_pages()), and then have
> > a "release_folio" operation in your address_space_operations that
> > drops your page reference if the page is currently unused. At that
> > point, the driver wouldn't really be participating in creating or
> > removing userspace PTEs at all, the kernel would mostly manage it for
> > you, except that you'd have to tell the kernel when it is okay to
> > reclaim pages, or something like that.
>
> Whether it's okay to reclaim a given page can flip-flop very quickly
> under some workflows. Changing that setting has to be a pretty fast
> operation.

I think one fast way to do that would be to track this internally in
binder (as is the case now), and then have address_space_operations
callbacks that the kernel invokes when it wants to know whether a page
can be discarded or not.

> > (I think in the Linux kernel, you might theoretically be able to cause
> > memory safety issues by deadlocking in particular ways - if you are
> > running on a CONFIG_PREEMPT kernel without panic_on_warn set, and
> > you're in a non-preemptible context because you're holding a spinlock
> > or such, and then you sleep because you try to wait on a mutex or
> > kmalloc() with GFP_KERNEL, the scheduler will print a "scheduling
> > while atomic" error message and then try to recover from that
> > situation by resetting the preempt_count and scheduling anyway. I
> > think that theoretically breaks the RCU read-side critical sections
> > normally implied by spinlocks once you return to the broken context,
> > though IIRC this situation will get fixed up at the next syscall
> > return or something along those lines. I haven't tried this though.)
>
> Sleeping in atomic context is a known area of concern. We're working
> on it. For now, just assume that it's one of the allowed bad things.
> Eventually we would like to handle it properly with this tool:
> https://www.memorysafety.org/blog/gary-guo-klint-rust-tools/

Ah, thanks for the pointer.

> > > >> +    /// Maps a single page at the given address within the virtual memory area.
> > > >> +    ///
> > > >> +    /// This operation does not take ownership of the page.
> > > >> +    #[inline]
> > > >> +    pub fn vm_insert_page(&self, address: usize, page: &Page) -> Result {
> > > >> +        // SAFETY: By the type invariants, the caller holds the mmap write lock, so this access is
> > > >> +        // not a data race. The page is guaranteed to be valid and of order 0. The range of
> > > >> +        // `address` is already checked by `vm_insert_page`.
> > > >> +        to_result(unsafe { bindings::vm_insert_page(self.as_ptr(), address as _, page.as_ptr()) })
> > > >
> > > > vm_insert_page() has a kinda weird contract because there are two
> > > > contexts from which you can call it cleanly:
> > > >
> > > > 1. You can call it on a VmAreaRef (just like zap_page_range_single(),
> > > > you only need to hold an mmap read lock or VMA read lock, no write
> > > > lock is required); however, you must ensure that the VMA is already
> > > > marked VM_MIXEDMAP. This is the API contract under which you'd call
> > > > this from a fault handler.
> > > >
> > > > 2. You can call it on a VmAreaNew (and it will take care of setting
> > > > VM_MIXEDMAP for you).
> > > >
> > > > I think nothing would immediately crash if you called vm_insert_page()
> > > > on a VMA that does not yet have VM_MIXEDMAP while holding the mmap
> > > > lock in write mode; but that would permit weird scenarios where you
> > > > could, for example, have a userfaultfd context associated with a VMA
> > > > which becomes VM_MIXEDMAP, while normally you can't attach userfaultfd
> > > > contexts to VM_MIXEDMAP VMAs. I don't think that actually leads to
> > > > anything bad, but it would be a weird state that probably shouldn't be
> > > > permitted.
> > > >
> > > > There are also safety requirements for the page being installed, but I
> > > > guess the checks that vm_insert_page() already does via
> > > > validate_page_before_insert() might be enough to make this safe...
> > >
> > > One way to handle this is to make a VmAreaRef::check_mixedmap that
> > > returns a VmAreaMixedMapRef after checking the flag. That type can then
> > > have a vm_insert_page method.
> >
> > Sounds reasonable.
> >
> > > As for VmAreaNew, I'm not sure we should have it there. If we're not
> > > careful, it would be a way to set VM_MIXEDMAP on something that already
> > > has the VM_PFNMAP flag. We can probably just tell you to set VM_MIXEDMAP
> > > directly and then go through the method on VmAreaRef.
> >
> > Makes sense.
> >
> > I guess one tricky part is that it might be bad if you could
>
> Seems like this sentence is incomplete?

Whoops, guess I got distracted while writing this...

I guess it could be bad if you could install page mappings before
changing the VMA flags in a way that makes the already-installed page
mappings invalid. But as long as you don't change the
VM_READ/VM_WRITE/VM_EXEC flags, and you never clear
VM_MIXEDMAP/VM_PFNMAP, I think that probably can't happen, so that
should be fine...
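[Editor's note: the `VmAreaRef::check_mixedmap` type-state idea from earlier in the thread can be sketched in plain userspace Rust. Everything below is illustrative: the bit values, the `PAGE_SIZE` check, and the method bodies are stand-ins, not the kernel's or the patch's real code. The sketch only demonstrates the type-level gating: `vm_insert_page` is unreachable until `check_mixedmap` has verified the flag.]

```rust
// Stand-in constants; not the kernel's real values.
const VM_PFNMAP: usize = 1 << 2;
const VM_MIXEDMAP: usize = 1 << 3;
const PAGE_SIZE: usize = 4096;

struct VmAreaRef {
    flags: usize,
}

/// Proof-carrying handle: obtainable only when VM_MIXEDMAP is set.
struct VmAreaMixedMapRef<'a> {
    vma: &'a VmAreaRef,
}

impl VmAreaRef {
    /// Returns a mixed-map handle only if VM_MIXEDMAP is set; also
    /// rejects VM_PFNMAP, since the thread notes the combination of the
    /// two flags is invalid.
    fn check_mixedmap(&self) -> Option<VmAreaMixedMapRef<'_>> {
        if self.flags & VM_MIXEDMAP != 0 && self.flags & VM_PFNMAP == 0 {
            Some(VmAreaMixedMapRef { vma: self })
        } else {
            None
        }
    }
}

impl VmAreaMixedMapRef<'_> {
    /// Stand-in for the real insertion: the type system guarantees the
    /// flag was checked; here we only validate alignment.
    fn vm_insert_page(&self, address: usize) -> Result<(), ()> {
        debug_assert!(self.vma.flags & VM_MIXEDMAP != 0);
        if address % PAGE_SIZE != 0 {
            return Err(());
        }
        Ok(())
    }
}

fn main() {
    // No MIXEDMAP flag: insertion is not even reachable.
    assert!(VmAreaRef { flags: 0 }.check_mixedmap().is_none());

    // Invalid MIXEDMAP|PFNMAP combination: rejected up front.
    let bad = VmAreaRef { flags: VM_MIXEDMAP | VM_PFNMAP };
    assert!(bad.check_mixedmap().is_none());

    // Valid mixed-map VMA: the handle exposes vm_insert_page.
    let vma = VmAreaRef { flags: VM_MIXEDMAP };
    let mixed = vma.check_mixedmap().expect("VM_MIXEDMAP is set");
    assert!(mixed.vm_insert_page(PAGE_SIZE).is_ok());
}
```

The design choice being modeled: rather than documenting "only call this with VM_MIXEDMAP set" as a safety comment, the check becomes a runtime-verified capability encoded in a distinct type, so misuse fails at compile time.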
Alice Ryhl Nov. 12, 2024, 1:48 p.m. UTC | #9
On Tue, Nov 12, 2024 at 1:58 PM Jann Horn <jannh@google.com> wrote:
>
> On Tue, Nov 12, 2024 at 1:12 PM Alice Ryhl <aliceryhl@google.com> wrote:
> > On Mon, Nov 11, 2024 at 5:51 PM Jann Horn <jannh@google.com> wrote:
> > >
> > > On Mon, Nov 11, 2024 at 12:41 PM Alice Ryhl <aliceryhl@google.com> wrote:
> > > > Thanks a lot for your review!
> > > >
> > > > Note that there is a v7 already:
> > > > https://lore.kernel.org/all/20241014-vma-v7-0-01e32f861195@google.com/
> > >
> > > Yeah, oops, somehow I only realized I was looking at an older version
> > > of the series after sending my mail...
> > >
> > > > On Fri, Nov 8, 2024 at 8:02 PM Jann Horn <jannh@google.com> wrote:
> > > > > On Thu, Oct 10, 2024 at 2:56 PM Alice Ryhl <aliceryhl@google.com> wrote:
> > > > >> These abstractions allow you to manipulate vmas. Rust Binder will use
> > > > >> these in a few different ways.
> > > > >>
> > > > >> In the mmap implementation, a VmAreaNew will be provided to the mmap
> > > > >> call which allows it to modify the vma in ways that are only okay during
> > > > >> initial setup. This is the case where the most methods are available.
> > > > >>
> > > > >> However, Rust Binder needs to insert and remove pages from the vma as
> > > > >> time passes. When incoming messages arrive, pages may need to be
> > > > >> inserted if space is missing, and in this case that is done by using a
> > > > >> stashed ARef<Mm> and calling mmget_not_zero followed by mmap_write_lock
> > > > >> followed by vma_lookup followed by vm_insert_page. In this case, since
> > > > >> mmap_write_lock is used, the VmAreaMut type will be in use.
> > > > >
> > > > > FYI, the way the C binder implementation uses vma_lookup() and
> > > > > vm_insert_page() is not very idiomatic. The standard way of
> > > > > implementing binder_alloc_free_page() would be to use something like
> > > > > unmap_mapping_range() instead of using
> > > > > vma_lookup()+zap_page_range_single(); though to set that up, you'd
> > > > > have to create one inode per binder file, maybe with something like
> > > > > the DRM subsystem's drm_fs_inode_new(). And instead of having
> > > > > binder_install_single_page(), the standard way would be to let
> > > > > binder_vm_fault() install PTEs lazily on fault. That way you'd never
> > > > > have to take mmap locks or grab MM references yourself.
> > > >
> > > > Would that actually work? All writes happen from kernel space, so it's
> > > > not clear to me that we can trigger the fault handler from there. We
> > > > can't use copy_to_user because the write happens from a different
> > > > process than the one that owns the vma.
> > > >
> > > > That said, one alternative is to let Binder store an array of `struct
> > > > page` and just write directly to those pages when writing from kernel
> > > > space. Then binder_vm_fault() can look up the relevant page from the
> > > > array when userspace attempts to read the data.
> > >
> > > Right, that's what I was thinking of - keep allocating pages at the
> > > same point in time when binder currently does it, only defer mapping
> > > them into userspace.
> > >
> > > > Though we could also
> > > > just vm_insert_page() right before returning the address to userspace,
> > > > since we know that userspace will read it right away after the syscall
> > > > returns.
> > >
> > > I think that is basically what binder does now?
> >
> > Right now binder calls vm_insert_page() right after calling
> > alloc_page(), which means that vm_insert_page() happens in the process
> > sending the message. But we could delay the call so that it happens in
> > the process that receives the message instead (which is also the
> > process that owns the mapping).
>
> Ah, I see. I don't think that would make much of a difference in terms
> of the complexity of MM locking, except that you'd save an
> mmget_not_zero()...

It might reduce contention on the mm locks since it drastically
reduces the number of threads we might take the locks from.

Another question ... right now we access the vma by doing vma_lookup
under the mmap lock, and then we call vm_insert_page. The call to
vm_insert_page only requires the vma read lock, so it would be nice to
perform vm_insert_page without having to take the mmap lock to do
vma_lookup. What is the best way to do that? I could stash a pointer
to the vma, but I need to make sure it doesn't get freed in the
meantime. I could keep track of whether it's still valid by listening
for close in vma_ops, but that's a pretty unpleasant solution. I don't
see any refcount in `struct vm_area_struct` that I can increment.

> > > > > Let me know if you think it would be helpful to see a prototype of
> > > > > that in C - I think I can cobble that together, but doing it nicely
> > > > > will require some work to convert at least some of the binder_alloc
> > > > > locking to mutexes, and some more work to switch the ->f_mapping of
> > > > > the binder file or something like that. (I guess modeling that in Rust
> > > > > might be a bit of a pain though...)
> > > >
> > > > It would be useful to hear about what the advantages of
> > > > unmap_mapping_range() are. I don't need a full prototype, I should be
> > > > able to understand given a rough description of what the required
> > > > changes are.
> > >
> > > The nice part is that basically, if you have a pointer to the file or
> > > the inode, you can just do something like the following to zap a PTE:
> > >
> > > unmap_mapping_range(file->f_mapping, index, 1, 1);
> > >
> > > You don't have to take any locks yourself to make that work, you don't
> > > even have to hold a reference to the mm_struct or anything like that,
> > > and you don't need any of that logic you currently have in the C
> > > binder driver to look up the VMA by address and revalidate it is still
> > > the VMA you expect. The MM subsystem will automatically iterate
> > > through all VMAs that overlap the specified range of the file's
> > > address_space, and it will zap PTEs in the specified range (and it
> > > even works fine if the VMAs have been moved or split or exist in
> > > multiple processes or stuff like that). It's a similar story on the
> > > allocation path. The only locks you need to explicitly take are
> > > whatever locks the driver internally requires.
> > >
> > > Going through the fault handler and/or the address_space for
> > > installing/removing PTEs, instead of using vma_lookup(), is also safer
> > > because it implicitly ensures you can only operate on your own
> > > driver's VMAs. From a glance at the Rust binder RFC
> > > (https://lore.kernel.org/all/20231101-rust-binder-v1-19-08ba9197f637@google.com/),
> > > it looks like use_page_slow() just looks up the VMA at the expected
> > > address and calls insert_page() on it. I don't immediately see
> > > anything in the Rust Binder RFC that would prevent calling
> > > insert_page() on a newly created VMA of a different type that has
> > > appeared at the same address - which could cause the page to
> > > inadvertently become writable by userspace (if the new VMA is
> > > writable), could cause refcounted pages to be installed in VM_PFNMAP
> > > regions that are supposed to only contain non-refcounted entries,
> > > could probably cause type confusion when trying to install 4K pages in
> > > hugetlb regions that can't contain 4K pages, and so on.
> >
> > Right ... I guess I'm missing an equivalent to binder_vma_close to
> > ensure that once the vma is closed, we don't try to insert pages.
>
> Yeah, I think that would work. (I think there currently is no way to
> shrink a VMA without first splitting it, so you should see ->open()
> and ->close() invocations when that happens.)
>
> > I gave a suggestion to enforce that vm_insert_page is only called on
> > MIXEDMAP vmas in my previous email. I guess that would prevent the
> > type confusion you mention, but it still seems dangerous ... you risk
> > that some other driver is storing special data in the private data of
> > pages in the new vma, and if you insert a random page there, there
> > could maybe be type confusion on the private data in the `struct
> > page`?
>
> Hmm, yeah, maybe...

Is there a standard / idiomatic way for a driver to verify that a VMA
is the one that it created? Checking vm_ops and private data?

> > > > >> +    /// Maps a single page at the given address within the virtual memory area.
> > > > >> +    ///
> > > > >> +    /// This operation does not take ownership of the page.
> > > > >> +    #[inline]
> > > > >> +    pub fn vm_insert_page(&self, address: usize, page: &Page) -> Result {
> > > > >> +        // SAFETY: By the type invariants, the caller holds the mmap write lock, so this access is
> > > > >> +        // not a data race. The page is guaranteed to be valid and of order 0. The range of
> > > > >> +        // `address` is already checked by `vm_insert_page`.
> > > > >> +        to_result(unsafe { bindings::vm_insert_page(self.as_ptr(), address as _, page.as_ptr()) })
> > > > >
> > > > > vm_insert_page() has a kinda weird contract because there are two
> > > > > contexts from which you can call it cleanly:
> > > > >
> > > > > 1. You can call it on a VmAreaRef (just like zap_page_range_single(),
> > > > > you only need to hold an mmap read lock or VMA read lock, no write
> > > > > lock is required); however, you must ensure that the VMA is already
> > > > > marked VM_MIXEDMAP. This is the API contract under which you'd call
> > > > > this from a fault handler.
> > > > >
> > > > > 2. You can call it on a VmAreaNew (and it will take care of setting
> > > > > VM_MIXEDMAP for you).
> > > > >
> > > > > I think nothing would immediately crash if you called vm_insert_page()
> > > > > on a VMA that does not yet have VM_MIXEDMAP while holding the mmap
> > > > > lock in write mode; but that would permit weird scenarios where you
> > > > > could, for example, have a userfaultfd context associated with a VMA
> > > > > which becomes VM_MIXEDMAP, while normally you can't attach userfaultfd
> > > > > contexts to VM_MIXEDMAP VMAs. I don't know whether that actually leads to
> > > > > anything bad, but it would be a weird state that probably shouldn't be
> > > > > permitted.
> > > > >
> > > > > There are also safety requirements for the page being installed, but I
> > > > > guess the checks that vm_insert_page() already does via
> > > > > validate_page_before_insert() might be enough to make this safe...
> > > >
> > > > One way to handle this is to make a VmAreaRef::check_mixedmap that
> > > > returns a VmAreaMixedMapRef after checking the flag. That type can then
> > > > have a vm_insert_page method.
> > >
> > > Sounds reasonable.
> > >
> > > > As for VmAreaNew, I'm not sure we should have it there. If we're not
> > > > careful, it would be a way to set VM_MIXEDMAP on something that already
> > > > has the VM_PFNMAP flag. We can probably just tell you to set VM_MIXEDMAP
> > > > directly and then go through the method on VmAreaRef.
> > >
> > > Makes sense.
> > >
> > > I guess one tricky part is that it might be bad if you could
> >
> > Seems like this sentence is incomplete?
>
> Whoops, guess I got distracted while writing this...
>
> I guess it could be bad if you could install page mappings before
> changing the VMA flags in a way that makes the already-installed page
> mappings invalid. But as long as you don't change the
> VM_READ/VM_WRITE/VM_EXEC flags, and you never clear
> VM_MIXEDMAP/VM_PFNMAP, I think that probably can't happen, so that
> should be fine...

Could we come up with a list of what the mm subsystem assumes about a
vma? Similar to Lorenzo's VMA locks documentation work. So far I have:

* If a vma is VM_MIXEDMAP then it doesn't have a userfaultfd context
and it is not VM_PFNMAP.
* If a vma is VM_PFNMAP then there are a bunch of restrictions on what
kind of pages you can insert.
* If VM_WRITE is set, then VM_MAYWRITE is also set. (And similarly for
the other perms.)

I'm sure there are many more.

Alice
Jann Horn Nov. 12, 2024, 6:06 p.m. UTC | #10
On Tue, Nov 12, 2024 at 2:48 PM Alice Ryhl <aliceryhl@google.com> wrote:
> On Tue, Nov 12, 2024 at 1:58 PM Jann Horn <jannh@google.com> wrote:
> >
> > On Tue, Nov 12, 2024 at 1:12 PM Alice Ryhl <aliceryhl@google.com> wrote:
> > > On Mon, Nov 11, 2024 at 5:51 PM Jann Horn <jannh@google.com> wrote:
> > > >
> > > > On Mon, Nov 11, 2024 at 12:41 PM Alice Ryhl <aliceryhl@google.com> wrote:
> > > > > Thanks a lot for your review!
> > > > >
> > > > > Note that there is a v7 already:
> > > > > https://lore.kernel.org/all/20241014-vma-v7-0-01e32f861195@google.com/
> > > >
> > > > Yeah, oops, somehow I only realized I was looking at an older version
> > > > of the series after sending my mail...
> > > >
> > > > > On Fri, Nov 8, 2024 at 8:02 PM Jann Horn <jannh@google.com> wrote:
> > > > > > On Thu, Oct 10, 2024 at 2:56 PM Alice Ryhl <aliceryhl@google.com> wrote:
> > > > > >> These abstractions allow you to manipulate vmas. Rust Binder will use
> > > > > >> these in a few different ways.
> > > > > >>
> > > > > >> In the mmap implementation, a VmAreaNew will be provided to the mmap
> > > > > >> call which allows it to modify the vma in ways that are only okay during
> > > > > >> initial setup. This is the case where the most methods are available.
> > > > > >>
> > > > > >> However, Rust Binder needs to insert and remove pages from the vma as
> > > > > >> time passes. When incoming messages arrive, pages may need to be
> > > > > >> inserted if space is missing, and in this case that is done by using a
> > > > > >> stashed ARef<Mm> and calling mmget_not_zero followed by mmap_write_lock
> > > > > >> followed by vma_lookup followed by vm_insert_page. In this case, since
> > > > > >> mmap_write_lock is used, the VmAreaMut type will be in use.
> > > > > >
> > > > > > FYI, the way the C binder implementation uses vma_lookup() and
> > > > > > vm_insert_page() is not very idiomatic. The standard way of
> > > > > > implementing binder_alloc_free_page() would be to use something like
> > > > > > unmap_mapping_range() instead of using
> > > > > > vma_lookup()+zap_page_range_single(); though to set that up, you'd
> > > > > > have to create one inode per binder file, maybe with something like
> > > > > > the DRM subsystem's drm_fs_inode_new(). And instead of having
> > > > > > binder_install_single_page(), the standard way would be to let
> > > > > > binder_vm_fault() install PTEs lazily on fault. That way you'd never
> > > > > > have to take mmap locks or grab MM references yourself.
> > > > >
> > > > > Would that actually work? All writes happen from kernel space, so it's
> > > > > not clear to me that we can trigger the fault handler from there. We
> > > > > can't use copy_to_user because the write happens from a different
> > > > > process than the one that owns the vma.
> > > > >
> > > > > That said, one alternative is to let Binder store an array of `struct
> > > > > page` and just write directly to those pages when writing from kernel
> > > > > space. Then binder_vm_fault() can look up the relevant page from the
> > > > > array when userspace attempts to read the data.
> > > >
> > > > Right, that's what I was thinking of - keep allocating pages at the
> > > > same point in time when binder currently does it, only defer mapping
> > > > them into userspace.
> > > >
> > > > > Though we could also
> > > > > just vm_insert_page() right before returning the address to userspace,
> > > > > since we know that userspace will read it right away after the syscall
> > > > > returns.
> > > >
> > > > I think that is basically what binder does now?
> > >
> > > Right now binder calls vm_insert_page() right after calling
> > > alloc_page(), which means that vm_insert_page() happens in the process
> > > sending the message. But we could delay the call so that it happens in
> > > the process that receives the message instead (which is also the
> > > process that owns the mapping).
> >
> > Ah, I see. I don't think that would make much of a difference in terms
> > of the complexity of MM locking, except that you'd save an
> > mmget_not_zero()...
>
> It might reduce contention on the mm locks since it drastically
> reduces the number of threads we might take the locks from.
>
> Another question ... right now we access the vma by doing vma_lookup
> under the mmap lock, and then we call vm_insert_page. The call to
> vm_insert_page only requires the vma read lock, so it would be nice to
> perform vm_insert_page without having to take the mmap lock to do
> vma_lookup. What is the best way to do that? I could stash a pointer
> to the vma, but I need to make sure it doesn't get freed in the
> meantime. I could keep track of whether it's still valid by listening
> for close in vma_ops, but that's a pretty unpleasant solution. I don't
> see any refcount in `struct vm_area_struct` that I can increment.

You're not supposed to hold a long-term reference to a vm_area_struct.
Listening to ->close() would be error-prone because ->close() is only
called after the point at which calling vm_insert_page() is already no
longer allowed.
The most lightweight locking you can use for looking up a VMA when you
go through the mm_struct is to call lock_vma_under_rcu(mm, address);
but that is a "trylock"-style operation, so if there is contention, it
will fail and in that case you have to instead take the mmap lock
normally. In userfaultfd code they have uffd_lock_vma() as a wrapper
that basically does a lock_vma_under_rcu() without trylock semantics
by temporarily taking the mmap lock (and it also ensures that the VMA
has an anon_vma if necessary), but there is currently no helper for
this that is available to the rest of the kernel.
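A rough sketch of that pattern (not from the patch; assumes
CONFIG_PER_VMA_LOCK, and drv_use_vma() is a hypothetical driver
callback standing in for whatever work needs the VMA):

```c
/* Trylock-style VMA lookup with an mmap-read-lock fallback, loosely
 * modeled on what uffd_lock_vma() does internally. */
static void drv_with_vma(struct mm_struct *mm, unsigned long addr)
{
	struct vm_area_struct *vma;

	/* Fast path: per-VMA read lock, taken without the mmap lock. */
	vma = lock_vma_under_rcu(mm, addr);
	if (vma) {
		drv_use_vma(vma);
		vma_end_read(vma);	/* drop the per-VMA read lock */
		return;
	}

	/* Contended (or unsupported): fall back to the mmap read lock. */
	mmap_read_lock(mm);
	vma = vma_lookup(mm, addr);
	if (vma)
		drv_use_vma(vma);
	mmap_read_unlock(mm);
}
```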

(The exception to not being able to hold long-term references to a
vm_area_struct is the MM-internal rmap / address_space tracking: VMAs
that map a given file are tracked on an rbtree owned by that file's
address_space. That's how the MM code internally finds the VMAs when
you do something like unmap_mapping_range(). But you can't use that to
install new PTEs - well, it would probably be possible, but it would
require extra checks to protect against races.)

> > > > > > Let me know if you think it would be helpful to see a prototype of
> > > > > > that in C - I think I can cobble that together, but doing it nicely
> > > > > > will require some work to convert at least some of the binder_alloc
> > > > > > locking to mutexes, and some more work to switch the ->f_mapping of
> > > > > > the binder file or something like that. (I guess modeling that in Rust
> > > > > > might be a bit of a pain though...)
> > > > >
> > > > > It would be useful to hear about what the advantages of
> > > > > unmap_mapping_range() are. I don't need a full prototype, I should be
> > > > > able to understand given a rough description of what the required
> > > > > changes are.
> > > >
> > > > The nice part is that basically, if you have a pointer to the file or
> > > > the inode, you can just do something like the following to zap a PTE:
> > > >
> > > > unmap_mapping_range(file->f_mapping, index, 1, 1);
> > > >
> > > > You don't have to take any locks yourself to make that work, you don't
> > > > even have to hold a reference to the mm_struct or anything like that,
> > > > and you don't need any of that logic you currently have in the C
> > > > binder driver to look up the VMA by address and revalidate it is still
> > > > the VMA you expect. The MM subsystem will automatically iterate
> > > > through all VMAs that overlap the specified range of the file's
> > > > address_space, and it will zap PTEs in the specified range (and it
> > > > even works fine if the VMAs have been moved or split or exist in
> > > > multiple processes or stuff like that). It's a similar story on the
> > > > allocation path. The only locks you need to explicitly take are
> > > > whatever locks the driver internally requires.
> > > >
> > > > Going through the fault handler and/or the address_space for
> > > > installing/removing PTEs, instead of using vma_lookup(), is also safer
> > > > because it implicitly ensures you can only operate on your own
> > > > driver's VMAs. From a glance at the Rust binder RFC
> > > > (https://lore.kernel.org/all/20231101-rust-binder-v1-19-08ba9197f637@google.com/),
> > > > it looks like use_page_slow() just looks up the VMA at the expected
> > > > address and calls insert_page() on it. I don't immediately see
> > > > anything in the Rust Binder RFC that would prevent calling
> > > > insert_page() on a newly created VMA of a different type that has
> > > > appeared at the same address - which could cause the page to
> > > > inadvertently become writable by userspace (if the new VMA is
> > > > writable), could cause refcounted pages to be installed in VM_PFNMAP
> > > > regions that are supposed to only contain non-refcounted entries,
> > > > could probably cause type confusion when trying to install 4K pages in
> > > > hugetlb regions that can't contain 4K pages, and so on.
> > >
> > > Right ... I guess I'm missing an equivalent to binder_vma_close to
> > > ensure that once the vma is closed, we don't try to insert pages.
> >
> > Yeah, I think that would work. (I think there currently is no way to
> > shrink a VMA without first splitting it, so you should see ->open()
> > and ->close() invocations when that happens.)
> >
> > > I gave a suggestion to enforce that vm_insert_page is only called on
> > > MIXEDMAP vmas in my previous email. I guess that would prevent the
> > > type confusion you mention, but it still seems dangerous ... you risk
> > > that some other driver is storing special data in the private data of
> > > pages in the new vma, and if you insert a random page there, there
> > > could maybe be type confusion on the private data in the `struct
> > > page`?
> >
> > Hmm, yeah, maybe...
>
> Is there a standard / idiomatic way for a driver to verify that a VMA
> is the one that it created? Checking vm_ops and private data?

I think drivers normally don't look up their own VMAs by address, but
yeah, checking ->vm_ops would work for making sure the VMA comes from
your driver. It looks like the TCP zerocopy receive path does that in
find_tcp_vma().
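A minimal sketch of that kind of ownership check (drv_vm_ops is a
hypothetical ops table assumed to be installed by the driver's own mmap
handler; this is not from the patch):

```c
/* Verify that a looked-up VMA was created by this driver by comparing
 * vm_ops, similar to find_tcp_vma() in net/ipv4/tcp.c. Caller must
 * hold the mmap lock so the returned pointer stays valid. */
static const struct vm_operations_struct drv_vm_ops;

static struct vm_area_struct *drv_lookup_own_vma(struct mm_struct *mm,
						 unsigned long addr)
{
	struct vm_area_struct *vma = vma_lookup(mm, addr);

	if (vma && vma->vm_ops == &drv_vm_ops)
		return vma;	/* created by our mmap handler */
	return NULL;		/* absent, or some other mapping's VMA */
}
```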

> > > > > >> +    /// Maps a single page at the given address within the virtual memory area.
> > > > > >> +    ///
> > > > > >> +    /// This operation does not take ownership of the page.
> > > > > >> +    #[inline]
> > > > > >> +    pub fn vm_insert_page(&self, address: usize, page: &Page) -> Result {
> > > > > >> +        // SAFETY: By the type invariants, the caller holds the mmap write lock, so this access is
> > > > > >> +        // not a data race. The page is guaranteed to be valid and of order 0. The range of
> > > > > >> +        // `address` is already checked by `vm_insert_page`.
> > > > > >> +        to_result(unsafe { bindings::vm_insert_page(self.as_ptr(), address as _, page.as_ptr()) })
> > > > > >
> > > > > > vm_insert_page() has a kinda weird contract because there are two
> > > > > > contexts from which you can call it cleanly:
> > > > > >
> > > > > > 1. You can call it on a VmAreaRef (just like zap_page_range_single(),
> > > > > > you only need to hold an mmap read lock or VMA read lock, no write
> > > > > > lock is required); however, you must ensure that the VMA is already
> > > > > > marked VM_MIXEDMAP. This is the API contract under which you'd call
> > > > > > this from a fault handler.
> > > > > >
> > > > > > 2. You can call it on a VmAreaNew (and it will take care of setting
> > > > > > VM_MIXEDMAP for you).
> > > > > >
> > > > > > I think nothing would immediately crash if you called vm_insert_page()
> > > > > > on a VMA that does not yet have VM_MIXEDMAP while holding the mmap
> > > > > > lock in write mode; but that would permit weird scenarios where you
> > > > > > could, for example, have a userfaultfd context associated with a VMA
> > > > > > which becomes VM_MIXEDMAP, while normally you can't attach userfaultfd
> > > > > > contexts to VM_MIXEDMAP VMAs. I don't know whether that actually leads to
> > > > > > anything bad, but it would be a weird state that probably shouldn't be
> > > > > > permitted.
> > > > > >
> > > > > > There are also safety requirements for the page being installed, but I
> > > > > > guess the checks that vm_insert_page() already does via
> > > > > > validate_page_before_insert() might be enough to make this safe...
> > > > >
> > > > > One way to handle this is to make a VmAreaRef::check_mixedmap that
> > > > > returns a VmAreaMixedMapRef after checking the flag. That type can then
> > > > > have a vm_insert_page method.
> > > >
> > > > Sounds reasonable.
> > > >
> > > > > As for VmAreaNew, I'm not sure we should have it there. If we're not
> > > > > careful, it would be a way to set VM_MIXEDMAP on something that already
> > > > > has the VM_PFNMAP flag. We can probably just tell you to set VM_MIXEDMAP
> > > > > directly and then go through the method on VmAreaRef.
> > > >
> > > > Makes sense.
> > > >
> > > > I guess one tricky part is that it might be bad if you could
> > >
> > > Seems like this sentence is incomplete?
> >
> > Whoops, guess I got distracted while writing this...
> >
> > I guess it could be bad if you could install page mappings before
> > changing the VMA flags in a way that makes the already-installed page
> > mappings invalid. But as long as you don't change the
> > VM_READ/VM_WRITE/VM_EXEC flags, and you never clear
> > VM_MIXEDMAP/VM_PFNMAP, I think that probably can't happen, so that
> > should be fine...
>
> Could we come up with a list of what the mm subsystem assumes about a
> vma? Similar to Lorenzo's VMA locks documentation work. So far I have:
>
> * If a vma is VM_MIXEDMAP then it doesn't have a userfaultfd context
> and it is not VM_PFNMAP.

The "then it doesn't have a userfaultfd context" part is probably more
easily described as something like "VM_MIXEDMAP is not changed on a
VMA that is already in use".

> * If a vma is VM_PFNMAP then there are a bunch of restrictions on what
> kind of pages you can insert.

It's kind of the other way around. Normally, VMAs just contain
anonymous pages and file pages, which are fairly closely managed by
the MM core - they can be found through the "reverse mapping" and can
therefore usually be migrated/swapped/reclaimed, and they are
refcounted by the MM core. I am a bit fuzzy on VM_MIXEDMAP, but I
think it grants the ability to install a wider variety of pages that a
driver allocated itself (which are still refcounted but not tracked
through the reverse mapping), and also the ability to create PFN
mappings of physical addresses that are in physical address ranges not
managed by the MM subsystem (and therefore can't be refcounted).
Finally, VM_PFNMAP means "this is a super special memory range in
which the MM subsystem will just see opaque physical address mappings,
and any lifetime management of that physical memory is up to the
driver".

In other words, with VM_PFNMAP you can map basically any memory you
want - you can even map things like slab memory, or freed pages, and
the MM subsystem will be blind to that and basically won't care.
That's how /dev/mem can AFAIU let you mmap almost-arbitrary physical
memory when you disable CONFIG_STRICT_DEVMEM. But that also means
VM_PFNMAP is a great way to introduce physical memory use-after-free
bugs, because there is nothing that stops a buggy driver from freeing
pages that are still mmap'ed to userspace.

One feature of VM_PFNMAP VMAs that is desirable in some contexts is
that they stop get_user_pages_fast() from working. When you install a
normal page table entry with something like vm_insert_page(), that
AFAIK doesn't mark the PTE with pte_mkspecial(), and so userspace can
cause the kernel to grab a refcounted reference to the page and then
use this reference to read/write into the page at a later point, even
after it is no longer mapped into the page tables, and even after the
VMA is gone. (For example, by using vmsplice() to insert a reference
to the page into a pipe.) My understanding from a conversation with
upstream GPU developers is that they use VM_PFNMAP for this reason -
they want to be able to reliably revoke userspace access to pages when
they're unmapped. (We could probably improve that by having some way
for a driver to say "I want VM_MIXEDMAP but please don't let userspace
hold extra references to my pages"...)

In the context of binder, AFAIK because it does not use VM_PFNMAP,
userspace can still read the contents of binder pages after the
zap_page_range_single() call in binder_alloc_free_page(); but that's
fine because binder doesn't reuse the page afterwards. The subsequent
__free_page() call does not actually free the page - when it is called
on an order-0 page, it just drops the refcount on the page by 1, and
the page will be freed later on when no references to it are left.
Using put_page() might be a bit clearer there than using
__free_pages(), I guess? But I'm not entirely sure what the
nicest-looking APIs are for allocating/freeing order-0 pages with
refcounted semantics.

> * If VM_WRITE is set, then VM_MAYWRITE is also set. (And similarly for
> the other perms.)
>
> I'm sure there are many more.

I guess we could try to come up with a document with subjects like
"how drivers are supposed to interact with VMAs" and "other things
that a driver can do with VMAs (weird but okay)"...
diff mbox series

Patch

diff --git a/rust/helpers/helpers.c b/rust/helpers/helpers.c
index 30f40149f3a9..907ee77b3bb9 100644
--- a/rust/helpers/helpers.c
+++ b/rust/helpers/helpers.c
@@ -13,6 +13,7 @@ 
 #include "build_bug.c"
 #include "err.c"
 #include "kunit.c"
+#include "mm.c"
 #include "mutex.c"
 #include "page.c"
 #include "rbtree.c"
diff --git a/rust/helpers/mm.c b/rust/helpers/mm.c
new file mode 100644
index 000000000000..0a2c2cc1903f
--- /dev/null
+++ b/rust/helpers/mm.c
@@ -0,0 +1,55 @@ 
+// SPDX-License-Identifier: GPL-2.0
+
+#include <linux/mm.h>
+#include <linux/sched/mm.h>
+
+void rust_helper_mmgrab(struct mm_struct *mm)
+{
+	mmgrab(mm);
+}
+
+void rust_helper_mmdrop(struct mm_struct *mm)
+{
+	mmdrop(mm);
+}
+
+void rust_helper_mmget(struct mm_struct *mm)
+{
+	mmget(mm);
+}
+
+bool rust_helper_mmget_not_zero(struct mm_struct *mm)
+{
+	return mmget_not_zero(mm);
+}
+
+void rust_helper_mmap_read_lock(struct mm_struct *mm)
+{
+	mmap_read_lock(mm);
+}
+
+bool rust_helper_mmap_read_trylock(struct mm_struct *mm)
+{
+	return mmap_read_trylock(mm);
+}
+
+void rust_helper_mmap_read_unlock(struct mm_struct *mm)
+{
+	mmap_read_unlock(mm);
+}
+
+void rust_helper_mmap_write_lock(struct mm_struct *mm)
+{
+	mmap_write_lock(mm);
+}
+
+void rust_helper_mmap_write_unlock(struct mm_struct *mm)
+{
+	mmap_write_unlock(mm);
+}
+
+struct vm_area_struct *rust_helper_vma_lookup(struct mm_struct *mm,
+					      unsigned long addr)
+{
+	return vma_lookup(mm, addr);
+}
diff --git a/rust/kernel/lib.rs b/rust/kernel/lib.rs
index 8a228bcbbe85..6fa97e8866a9 100644
--- a/rust/kernel/lib.rs
+++ b/rust/kernel/lib.rs
@@ -40,6 +40,7 @@ 
 pub mod kunit;
 pub mod list;
 pub mod miscdevice;
+pub mod mm;
 #[cfg(CONFIG_NET)]
 pub mod net;
 pub mod page;
diff --git a/rust/kernel/mm.rs b/rust/kernel/mm.rs
new file mode 100644
index 000000000000..b5d29ce0c6ef
--- /dev/null
+++ b/rust/kernel/mm.rs
@@ -0,0 +1,344 @@ 
+// SPDX-License-Identifier: GPL-2.0
+
+// Copyright (C) 2024 Google LLC.
+
+//! Memory management.
+//!
+//! C header: [`include/linux/mm.h`](srctree/include/linux/mm.h)
+
+use crate::{
+    bindings,
+    types::{ARef, AlwaysRefCounted, Opaque},
+};
+
+use core::{
+    ops::Deref,
+    ptr::{self, NonNull},
+};
+
+pub mod virt;
+
+/// A wrapper for the kernel's `struct mm_struct`.
+///
+/// Since `mm_users` may be zero, the associated address space may not exist anymore. You can use
+/// [`mmget_not_zero`] to be able to access the address space.
+///
+/// The `ARef<Mm>` smart pointer holds an `mmgrab` refcount. Its destructor may sleep.
+///
+/// # Invariants
+///
+/// Values of this type are always refcounted using `mmgrab`.
+///
+/// [`mmget_not_zero`]: Mm::mmget_not_zero
+pub struct Mm {
+    mm: Opaque<bindings::mm_struct>,
+}
+
+// SAFETY: It is safe to call `mmdrop` on another thread than where `mmgrab` was called.
+unsafe impl Send for Mm {}
+// SAFETY: All methods on `Mm` can be called in parallel from several threads.
+unsafe impl Sync for Mm {}
+
+// SAFETY: By the type invariants, this type is always refcounted.
+unsafe impl AlwaysRefCounted for Mm {
+    fn inc_ref(&self) {
+        // SAFETY: The pointer is valid since self is a reference.
+        unsafe { bindings::mmgrab(self.as_raw()) };
+    }
+
+    unsafe fn dec_ref(obj: NonNull<Self>) {
+        // SAFETY: The caller is giving up their refcount.
+        unsafe { bindings::mmdrop(obj.cast().as_ptr()) };
+    }
+}
+
+/// A wrapper for the kernel's `struct mm_struct`.
+///
+/// This type is like [`Mm`], but with non-zero `mm_users`. It can only be used when `mm_users` can
+/// be proven to be non-zero at compile-time, usually because the relevant code holds an `mmget`
+/// refcount. It can be used to access the associated address space.
+///
+/// The `ARef<MmWithUser>` smart pointer holds an `mmget` refcount. Its destructor may sleep.
+///
+/// # Invariants
+///
+/// Values of this type are always refcounted using `mmget`. The value of `mm_users` is non-zero.
+#[repr(transparent)]
+pub struct MmWithUser {
+    mm: Mm,
+}
+
+// SAFETY: It is safe to call `mmput` on another thread than where `mmget` was called.
+unsafe impl Send for MmWithUser {}
+// SAFETY: All methods on `MmWithUser` can be called in parallel from several threads.
+unsafe impl Sync for MmWithUser {}
+
+// SAFETY: By the type invariants, this type is always refcounted.
+unsafe impl AlwaysRefCounted for MmWithUser {
+    fn inc_ref(&self) {
+        // SAFETY: The pointer is valid since self is a reference.
+        unsafe { bindings::mmget(self.as_raw()) };
+    }
+
+    unsafe fn dec_ref(obj: NonNull<Self>) {
+        // SAFETY: The caller is giving up their refcount.
+        unsafe { bindings::mmput(obj.cast().as_ptr()) };
+    }
+}
+
+// Make all `Mm` methods available on `MmWithUser`.
+impl Deref for MmWithUser {
+    type Target = Mm;
+
+    #[inline]
+    fn deref(&self) -> &Mm {
+        &self.mm
+    }
+}
+
+/// A wrapper for the kernel's `struct mm_struct`.
+///
+/// This type is identical to `MmWithUser` except that it uses `mmput_async` when dropping a
+/// refcount. This means that the destructor of `ARef<MmWithUserAsync>` is safe to call in atomic
+/// context.
+///
+/// # Invariants
+///
+/// Values of this type are always refcounted using `mmget`. The value of
+/// `mm_users` is non-zero.
+#[repr(transparent)]
+pub struct MmWithUserAsync {
+    mm: MmWithUser,
+}
+
+// SAFETY: It is safe to call `mmput_async` on another thread than where `mmget` was called.
+unsafe impl Send for MmWithUserAsync {}
+// SAFETY: All methods on `MmWithUserAsync` can be called in parallel from several threads.
+unsafe impl Sync for MmWithUserAsync {}
+
+// SAFETY: By the type invariants, this type is always refcounted.
+unsafe impl AlwaysRefCounted for MmWithUserAsync {
+    fn inc_ref(&self) {
+        // SAFETY: The pointer is valid since self is a reference.
+        unsafe { bindings::mmget(self.as_raw()) };
+    }
+
+    unsafe fn dec_ref(obj: NonNull<Self>) {
+        // SAFETY: The caller is giving up their refcount.
+        unsafe { bindings::mmput_async(obj.cast().as_ptr()) };
+    }
+}
+
+// Make all `MmWithUser` methods available on `MmWithUserAsync`.
+impl Deref for MmWithUserAsync {
+    type Target = MmWithUser;
+
+    #[inline]
+    fn deref(&self) -> &MmWithUser {
+        &self.mm
+    }
+}
+
+// These methods are safe to call even if `mm_users` is zero.
+impl Mm {
+    /// Call `mmgrab` on `current.mm`.
+    #[inline]
+    pub fn mmgrab_current() -> Option<ARef<Mm>> {
+        // SAFETY: It's safe to get the `mm` field from current.
+        let mm = unsafe {
+            let current = bindings::get_current();
+            (*current).mm
+        };
+
+        if mm.is_null() {
+            return None;
+        }
+
+        // SAFETY: The value of `current->mm` is guaranteed to be null or a valid `mm_struct`. We
+        // just checked that it's not null. Furthermore, the returned `&Mm` is valid only for the
+        // duration of this function, and `current->mm` will stay valid for that long.
+        let mm = unsafe { Mm::from_raw(mm) };
+
+        // This increments the refcount using `mmgrab`.
+        Some(ARef::from(mm))
+    }
+
+    /// Returns a raw pointer to the inner `mm_struct`.
+    #[inline]
+    pub fn as_raw(&self) -> *mut bindings::mm_struct {
+        self.mm.get()
+    }
+
+    /// Obtain a reference from a raw pointer.
+    ///
+    /// # Safety
+    ///
+    /// The caller must ensure that `ptr` points at an `mm_struct`, and that it is not deallocated
+    /// during the lifetime 'a.
+    #[inline]
+    pub unsafe fn from_raw<'a>(ptr: *const bindings::mm_struct) -> &'a Mm {
+        // SAFETY: Caller promises that the pointer is valid for 'a. Layouts are compatible due to
+        // repr(transparent).
+        unsafe { &*ptr.cast() }
+    }
+
+    /// Check whether this vma is associated with this mm.
+    #[inline]
+    pub fn is_same_mm(&self, area: &virt::VmAreaRef) -> bool {
+        // SAFETY: The `vm_mm` field of the area is immutable, so we can read it without
+        // synchronization.
+        let vm_mm = unsafe { (*area.as_ptr()).vm_mm };
+
+        ptr::eq(vm_mm, self.as_raw())
+    }
+
+    /// Calls `mmget_not_zero` and returns a handle if it succeeds.
+    #[inline]
+    pub fn mmget_not_zero(&self) -> Option<ARef<MmWithUser>> {
+        // SAFETY: The pointer is valid since self is a reference.
+        let success = unsafe { bindings::mmget_not_zero(self.as_raw()) };
+
+        if success {
+            // SAFETY: We just created an `mmget` refcount.
+            Some(unsafe { ARef::from_raw(NonNull::new_unchecked(self.as_raw().cast())) })
+        } else {
+            None
+        }
+    }
+}
+
+// These methods require `mm_users` to be non-zero.
+impl MmWithUser {
+    /// Obtain a reference from a raw pointer.
+    ///
+    /// # Safety
+    ///
+    /// The caller must ensure that `ptr` points at an `mm_struct`, and that `mm_users` remains
+    /// non-zero for the duration of the lifetime 'a.
+    #[inline]
+    pub unsafe fn from_raw<'a>(ptr: *const bindings::mm_struct) -> &'a MmWithUser {
+        // SAFETY: Caller promises that the pointer is valid for 'a. The layout is compatible due
+        // to repr(transparent).
+        unsafe { &*ptr.cast() }
+    }
+
+    /// Use `mmput_async` when dropping this refcount.
+    pub fn use_mmput_async(me: ARef<MmWithUser>) -> ARef<MmWithUserAsync> {
+        // SAFETY: The layouts and invariants are compatible.
+        unsafe { ARef::from_raw(ARef::into_raw(me).cast()) }
+    }
+
+    /// Lock the mmap write lock.
+    #[inline]
+    pub fn mmap_write_lock(&self) -> MmapWriteLock<'_> {
+        // SAFETY: The pointer is valid since self is a reference.
+        unsafe { bindings::mmap_write_lock(self.as_raw()) };
+
+        // INVARIANT: We just acquired the write lock.
+        MmapWriteLock { mm: self }
+    }
+
+    /// Lock the mmap read lock.
+    #[inline]
+    pub fn mmap_read_lock(&self) -> MmapReadLock<'_> {
+        // SAFETY: The pointer is valid since self is a reference.
+        unsafe { bindings::mmap_read_lock(self.as_raw()) };
+
+        // INVARIANT: We just acquired the read lock.
+        MmapReadLock { mm: self }
+    }
+
+    /// Try to lock the mmap read lock.
+    #[inline]
+    pub fn mmap_read_trylock(&self) -> Option<MmapReadLock<'_>> {
+        // SAFETY: The pointer is valid since self is a reference.
+        let success = unsafe { bindings::mmap_read_trylock(self.as_raw()) };
+
+        if success {
+            // INVARIANT: We just acquired the read lock.
+            Some(MmapReadLock { mm: self })
+        } else {
+            None
+        }
+    }
+}
+
+impl MmWithUserAsync {
+    /// Use `mmput` when dropping this refcount.
+    pub fn use_mmput(me: ARef<MmWithUserAsync>) -> ARef<MmWithUser> {
+        // SAFETY: The layouts and invariants are compatible.
+        unsafe { ARef::from_raw(ARef::into_raw(me).cast()) }
+    }
+}
+
+/// A guard for the mmap read lock.
+///
+/// # Invariants
+///
+/// This `MmapReadLock` guard owns the mmap read lock.
+pub struct MmapReadLock<'a> {
+    mm: &'a MmWithUser,
+}
+
+impl<'a> MmapReadLock<'a> {
+    /// Look up a vma at the given address.
+    #[inline]
+    pub fn vma_lookup(&self, vma_addr: usize) -> Option<&virt::VmAreaRef> {
+        // SAFETY: We hold a reference to the mm, so the pointer must be valid. Any value is okay
+        // for `vma_addr`.
+        let vma = unsafe { bindings::vma_lookup(self.mm.as_raw(), vma_addr as _) };
+
+        if vma.is_null() {
+            None
+        } else {
+            // SAFETY: We just checked that a vma was found, so the pointer is valid. Furthermore,
+            // the returned area will borrow from this read lock guard, so it can only be used
+            // while the read lock is still held.
+            unsafe { Some(virt::VmAreaRef::from_raw(vma)) }
+        }
+    }
+}
+
+impl Drop for MmapReadLock<'_> {
+    #[inline]
+    fn drop(&mut self) {
+        // SAFETY: We hold the read lock by the type invariants.
+        unsafe { bindings::mmap_read_unlock(self.mm.as_raw()) };
+    }
+}
+
+/// A guard for the mmap write lock.
+///
+/// # Invariants
+///
+/// This `MmapWriteLock` guard owns the mmap write lock.
+pub struct MmapWriteLock<'a> {
+    mm: &'a MmWithUser,
+}
+
+impl<'a> MmapWriteLock<'a> {
+    /// Look up a vma at the given address.
+    #[inline]
+    pub fn vma_lookup(&mut self, vma_addr: usize) -> Option<&virt::VmAreaMut> {
+        // SAFETY: We hold a reference to the mm, so the pointer must be valid. Any value is okay
+        // for `vma_addr`.
+        let vma = unsafe { bindings::vma_lookup(self.mm.as_raw(), vma_addr as _) };
+
+        if vma.is_null() {
+            None
+        } else {
+            // SAFETY: We just checked that a vma was found, so the pointer is valid. Furthermore,
+            // the returned area will borrow from this write lock guard, so it can only be used
+            // while the write lock is still held.
+            unsafe { Some(virt::VmAreaMut::from_raw(vma)) }
+        }
+    }
+}
+
+impl Drop for MmapWriteLock<'_> {
+    #[inline]
+    fn drop(&mut self) {
+        // SAFETY: We hold the write lock by the type invariants.
+        unsafe { bindings::mmap_write_unlock(self.mm.as_raw()) };
+    }
+}
diff --git a/rust/kernel/mm/virt.rs b/rust/kernel/mm/virt.rs
new file mode 100644
index 000000000000..7c09813e22f9
--- /dev/null
+++ b/rust/kernel/mm/virt.rs
@@ -0,0 +1,264 @@
+// SPDX-License-Identifier: GPL-2.0
+
+// Copyright (C) 2024 Google LLC.
+
+//! Virtual memory.
+
+use crate::{
+    bindings,
+    error::{to_result, Result},
+    page::Page,
+    types::Opaque,
+};
+
+use core::ops::Deref;
+
+/// A wrapper for the kernel's `struct vm_area_struct` with read access.
+///
+/// It represents an area of virtual memory.
+///
+/// # Invariants
+///
+/// The caller must at least hold the mmap read lock.
+#[repr(transparent)]
+pub struct VmAreaRef {
+    vma: Opaque<bindings::vm_area_struct>,
+}
+
+/// A wrapper for the kernel's `struct vm_area_struct` with write access.
+///
+/// It represents an area of virtual memory.
+///
+/// # Invariants
+///
+/// The caller must at least hold the mmap write lock.
+#[repr(transparent)]
+pub struct VmAreaMut {
+    vma: Opaque<bindings::vm_area_struct>,
+}
+
+/// A wrapper for the kernel's `struct vm_area_struct` during initial VMA setup.
+///
+/// It represents an area of virtual memory.
+///
+/// # Invariants
+///
+/// The caller must have mutable access and the vma must not have been shared yet.
+#[repr(transparent)]
+pub struct VmAreaNew {
+    vma: Opaque<bindings::vm_area_struct>,
+}
+
+// Make all VmAreaRef methods available on VmAreaMut.
+impl Deref for VmAreaMut {
+    type Target = VmAreaRef;
+    #[inline]
+    fn deref(&self) -> &VmAreaRef {
+        // SAFETY: We hold at least the write lock, so we have read access.
+        unsafe { VmAreaRef::from_raw(self.vma.get()) }
+    }
+}
+
+// Make all VmAreaMut methods available on VmAreaNew.
+impl Deref for VmAreaNew {
+    type Target = VmAreaMut;
+    #[inline]
+    fn deref(&self) -> &VmAreaMut {
+        // SAFETY: The vma is not yet shared, so we have write access.
+        unsafe { VmAreaMut::from_raw(self.vma.get()) }
+    }
+}
+
+impl VmAreaRef {
+    /// Access a virtual memory area given a raw pointer.
+    ///
+    /// # Safety
+    ///
+    /// Callers must ensure that `vma` is valid for the duration of 'a, and that the mmap read lock
+    /// (or stronger) is held for at least the duration of 'a.
+    #[inline]
+    pub unsafe fn from_raw<'a>(vma: *const bindings::vm_area_struct) -> &'a Self {
+        // SAFETY: The caller ensures that the invariants are satisfied for the duration of 'a.
+        unsafe { &*vma.cast() }
+    }
+
+    /// Returns a raw pointer to this area.
+    #[inline]
+    pub fn as_ptr(&self) -> *mut bindings::vm_area_struct {
+        self.vma.get()
+    }
+
+    /// Returns the flags associated with the virtual memory area.
+    ///
+    /// The possible flags are a combination of the constants in [`flags`].
+    #[inline]
+    pub fn flags(&self) -> usize {
+        // SAFETY: By the type invariants, the caller holds at least the mmap read lock, so this
+        // access is not a data race.
+        unsafe { (*self.as_ptr()).__bindgen_anon_2.vm_flags as _ }
+    }
+
+    /// Returns the start address of the virtual memory area.
+    #[inline]
+    pub fn start(&self) -> usize {
+        // SAFETY: By the type invariants, the caller holds at least the mmap read lock, so this
+        // access is not a data race.
+        unsafe { (*self.as_ptr()).__bindgen_anon_1.__bindgen_anon_1.vm_start as _ }
+    }
+
+    /// Returns the end address of the virtual memory area.
+    #[inline]
+    pub fn end(&self) -> usize {
+        // SAFETY: By the type invariants, the caller holds at least the mmap read lock, so this
+        // access is not a data race.
+        unsafe { (*self.as_ptr()).__bindgen_anon_1.__bindgen_anon_1.vm_end as _ }
+    }
+
+    /// Unmap pages in the given page range.
+    #[inline]
+    pub fn zap_page_range_single(&self, address: usize, size: usize) {
+        // SAFETY: By the type invariants, the caller holds at least the mmap read lock, so this
+        // access is okay. Any value of `address` and `size` is allowed.
+        unsafe {
+            bindings::zap_page_range_single(
+                self.as_ptr(),
+                address as _,
+                size as _,
+                core::ptr::null_mut(),
+            )
+        };
+    }
+}
+
+impl VmAreaMut {
+    /// Access a virtual memory area given a raw pointer.
+    ///
+    /// # Safety
+    ///
+    /// Callers must ensure that `vma` is valid for the duration of 'a, and that the mmap write lock
+    /// (or stronger) is held for at least the duration of 'a.
+    #[inline]
+    pub unsafe fn from_raw<'a>(vma: *const bindings::vm_area_struct) -> &'a Self {
+        // SAFETY: The caller ensures that the invariants are satisfied for the duration of 'a.
+        unsafe { &*vma.cast() }
+    }
+
+    /// Sets the flags associated with the virtual memory area.
+    ///
+    /// The possible flags are a combination of the constants in [`flags`].
+    #[inline]
+    pub fn set_flags(&self, flags: usize) {
+        // SAFETY: By the type invariants, the caller holds the mmap write lock, so this access is
+        // not a data race.
+        unsafe { (*self.as_ptr()).__bindgen_anon_2.vm_flags = flags as _ };
+    }
+
+    /// Maps a single page at the given address within the virtual memory area.
+    ///
+    /// This operation does not take ownership of the page.
+    #[inline]
+    pub fn vm_insert_page(&self, address: usize, page: &Page) -> Result {
+        // SAFETY: By the type invariants, the caller holds the mmap write lock, so this access is
+        // not a data race. The page is guaranteed to be valid and of order 0. The range of
+        // `address` is already checked by `vm_insert_page`.
+        to_result(unsafe { bindings::vm_insert_page(self.as_ptr(), address as _, page.as_ptr()) })
+    }
+}
+
+impl VmAreaNew {
+    /// Access a virtual memory area given a raw pointer.
+    ///
+    /// # Safety
+    ///
+    /// Callers must ensure that `vma` is valid for the duration of 'a, and that the vma is
+    /// currently undergoing initial VMA setup.
+    #[inline]
+    pub unsafe fn from_raw<'a>(vma: *const bindings::vm_area_struct) -> &'a Self {
+        // SAFETY: The caller ensures that the invariants are satisfied for the duration of 'a.
+        unsafe { &*vma.cast() }
+    }
+
+    /// Make this vma anonymous.
+    #[inline]
+    pub fn set_anonymous(&self) {
+        // SAFETY: The vma is not yet shared.
+        unsafe { (*self.as_ptr()).vm_ops = core::ptr::null() };
+    }
+}
+
+/// Container for VMA flags.
+pub mod flags {
+    use crate::bindings;
+
+    /// No flags are set.
+    pub const NONE: usize = bindings::VM_NONE as _;
+
+    /// Mapping allows reads.
+    pub const READ: usize = bindings::VM_READ as _;
+
+    /// Mapping allows writes.
+    pub const WRITE: usize = bindings::VM_WRITE as _;
+
+    /// Mapping allows execution.
+    pub const EXEC: usize = bindings::VM_EXEC as _;
+
+    /// Mapping is shared.
+    pub const SHARED: usize = bindings::VM_SHARED as _;
+
+    /// Mapping may be updated to allow reads.
+    pub const MAYREAD: usize = bindings::VM_MAYREAD as _;
+
+    /// Mapping may be updated to allow writes.
+    pub const MAYWRITE: usize = bindings::VM_MAYWRITE as _;
+
+    /// Mapping may be updated to allow execution.
+    pub const MAYEXEC: usize = bindings::VM_MAYEXEC as _;
+
+    /// Mapping may be updated to be shared.
+    pub const MAYSHARE: usize = bindings::VM_MAYSHARE as _;
+
+    /// Do not copy this vma on fork.
+    pub const DONTCOPY: usize = bindings::VM_DONTCOPY as _;
+
+    /// Cannot expand with mremap().
+    pub const DONTEXPAND: usize = bindings::VM_DONTEXPAND as _;
+
+    /// Lock the pages covered when they are faulted in.
+    pub const LOCKONFAULT: usize = bindings::VM_LOCKONFAULT as _;
+
+    /// Is a VM accounted object.
+    pub const ACCOUNT: usize = bindings::VM_ACCOUNT as _;
+
+    /// Should the VM suppress accounting.
+    pub const NORESERVE: usize = bindings::VM_NORESERVE as _;
+
+    /// Huge TLB Page VM.
+    pub const HUGETLB: usize = bindings::VM_HUGETLB as _;
+
+    /// Synchronous page faults.
+    pub const SYNC: usize = bindings::VM_SYNC as _;
+
+    /// Architecture-specific flag.
+    pub const ARCH_1: usize = bindings::VM_ARCH_1 as _;
+
+    /// Wipe VMA contents in child on fork.
+    pub const WIPEONFORK: usize = bindings::VM_WIPEONFORK as _;
+
+    /// Do not include in the core dump.
+    pub const DONTDUMP: usize = bindings::VM_DONTDUMP as _;
+
+    /// Not soft dirty clean area.
+    pub const SOFTDIRTY: usize = bindings::VM_SOFTDIRTY as _;
+
+    /// Can contain "struct page" and pure PFN pages.
+    pub const MIXEDMAP: usize = bindings::VM_MIXEDMAP as _;
+
+    /// MADV_HUGEPAGE marked this vma.
+    pub const HUGEPAGE: usize = bindings::VM_HUGEPAGE as _;
+
+    /// MADV_NOHUGEPAGE marked this vma.
+    pub const NOHUGEPAGE: usize = bindings::VM_NOHUGEPAGE as _;
+
+    /// KSM may merge identical pages.
+    pub const MERGEABLE: usize = bindings::VM_MERGEABLE as _;
+}