diff mbox series

[v2,6/7] rust: file: add `DeferredFdCloser`

Message ID 20231206-alice-file-v2-6-af617c0d9d94@google.com (mailing list archive)
State New, archived
Headers show
Series File abstractions needed by Rust Binder | expand

Commit Message

Alice Ryhl Dec. 6, 2023, 11:59 a.m. UTC
To close an fd from kernel space, we could call `ksys_close`. However,
if we do this to an fd that is held using `fdget`, then we may trigger a
use-after-free. Introduce a helper that can be used to close an fd even
if the fd is currently held with `fdget`. This is done by grabbing an
extra refcount to the file and dropping it in a task work once we return
to userspace.

This is necessary for Rust Binder because otherwise the user might try
to have Binder close its fd for /dev/binder, which would cause problems
as this happens inside an ioctl on /dev/binder, and ioctls hold the fd
using `fdget`.

Additional motivation can be found in commit 80cd795630d6 ("binder: fix
use-after-free due to ksys_close() during fdget()") and in the comments
on `binder_do_fd_close`.

If there is some way to detect whether an fd is currently held with
`fdget`, then this could be optimized to skip the allocation and task
work when this is not the case. Another possible optimization would be
to combine several fds into a single task work, since this is used with
fd arrays that might hold several fds.

That said, it might not be necessary to optimize it, because Rust Binder
has two ways to send fds: BINDER_TYPE_FD and BINDER_TYPE_FDA. With
BINDER_TYPE_FD, it is userspace's responsibility to close the fd, so
this mechanism is used only by BINDER_TYPE_FDA, but fd arrays are used
rarely these days.

Signed-off-by: Alice Ryhl <aliceryhl@google.com>
---
 rust/bindings/bindings_helper.h |   2 +
 rust/helpers.c                  |   8 ++
 rust/kernel/file.rs             | 157 +++++++++++++++++++++++++++++++++++++++-
 3 files changed, 166 insertions(+), 1 deletion(-)

Comments

Benno Lossin Dec. 8, 2023, 5:39 p.m. UTC | #1
On 12/6/23 12:59, Alice Ryhl wrote:
> +    /// Schedule a task work that closes the file descriptor when this task returns to userspace.
> +    ///
> +    /// Fails if this is called from a context where we cannot run work when returning to
> +    /// userspace. (E.g., from a kthread.)
> +    pub fn close_fd(self, fd: u32) -> Result<(), DeferredFdCloseError> {
> +        use bindings::task_work_notify_mode_TWA_RESUME as TWA_RESUME;
> +
> +        // In this method, we schedule the task work before closing the file. This is because
> +        // scheduling a task work is fallible, and we need to know whether it will fail before we
> +        // attempt to close the file.
> +
> +        // SAFETY: Getting a pointer to current is always safe.
> +        let current = unsafe { bindings::get_current() };
> +
> +        // SAFETY: Accessing the `flags` field of `current` is always safe.
> +        let is_kthread = (unsafe { (*current).flags } & bindings::PF_KTHREAD) != 0;

Since Boqun brought to my attention that we already have a wrapper for
`get_current()`, how about you use it here as well?

> +        if is_kthread {
> +            return Err(DeferredFdCloseError::TaskWorkUnavailable);
> +        }
> +
> +        // This disables the destructor of the box, so the allocation is not cleaned up
> +        // automatically below.
> +        let inner = Box::into_raw(self.inner);

Importantly this also lifts the uniqueness requirement (maybe add this
to the comment?).

> +
> +        // The `callback_head` field is first in the struct, so this cast correctly gives us a
> +        // pointer to the field.
> +        let callback_head = inner.cast::<bindings::callback_head>();
> +        // SAFETY: This pointer offset operation does not go out-of-bounds.
> +        let file_field = unsafe { core::ptr::addr_of_mut!((*inner).file) };
> +
> +        // SAFETY: The `callback_head` pointer is compatible with the `do_close_fd` method.

Also, `callback_head` is valid, since it is derived from...

> +        unsafe { bindings::init_task_work(callback_head, Some(Self::do_close_fd)) };
> +        // SAFETY: The `callback_head` pointer points at a valid and fully initialized task work
> +        // that is ready to be scheduled.
> +        //
> +        // If the task work gets scheduled as-is, then it will be a no-op. However, we will update
> +        // the file pointer below to tell it which file to fput.
> +        let res = unsafe { bindings::task_work_add(current, callback_head, TWA_RESUME) };
> +
> +        if res != 0 {
> +            // SAFETY: Scheduling the task work failed, so we still have ownership of the box, so
> +            // we may destroy it.
> +            unsafe { drop(Box::from_raw(inner)) };
> +
> +            return Err(DeferredFdCloseError::TaskWorkUnavailable);

Just curious, what could make the `task_work_add` call fail? I imagine
an OOM situation, but is that all?

> +        }
> +
> +        // SAFETY: Just an FFI call. This is safe no matter what `fd` is.

I took a look at the C code and there I found this comment:

    /*
     * variant of close_fd that gets a ref on the file for later fput.
     * The caller must ensure that filp_close() called on the file.
     */

And while you do call `filp_close` later, this seems like a safety
requirement to me.
Also, you do not call it when `file` is null, which I imagine to be
fine, but I do not know that since the C comment does not cover that
case.

> +        let file = unsafe { bindings::close_fd_get_file(fd) };
> +        if file.is_null() {
> +            // We don't clean up the task work since that might be expensive if the task work queue
> +            // is long. Just let it execute and let it clean up for itself.
> +            return Err(DeferredFdCloseError::BadFd);
> +        }
> +
> +        // SAFETY: The `file` pointer points at a valid file.
> +        unsafe { bindings::get_file(file) };
> +
> +        // SAFETY: Due to the above `get_file`, even if the current task holds an `fdget` to
> +        // this file right now, the refcount will not drop to zero until after it is released
> +        // with `fdput`. This is because when using `fdget`, you must always use `fdput` before

Shouldn't this be "the refcount will not drop to zero until after it is
released with `fput`."?

Why is this the SAFETY comment for `filp_close`? I am not understanding
the requirement on that function that needs this. This seems more a
justification for accessing `file` inside `do_close_fd`. In which case I
think it would be better to make it a type invariant of
`DeferredFdCloserInner`.

> +        // returning to userspace, and our task work runs after any `fdget` users have returned
> +        // to userspace.
> +        //
> +        // Note: fl_owner_t is currently a void pointer.
> +        unsafe { bindings::filp_close(file, (*current).files as bindings::fl_owner_t) };
> +
> +        // We update the file pointer that the task work is supposed to fput.
> +        //
> +        // SAFETY: Task works are executed on the current thread once we return to userspace, so
> +        // this write is guaranteed to happen before `do_close_fd` is called, which means that a
> +        // race is not possible here.
> +        //
> +        // It's okay to pass this pointer to the task work, since we just acquired a refcount with
> +        // the previous call to `get_file`. Furthermore, the refcount will not drop to zero during
> +        // an `fdget` call, since we defer the `fput` until after returning to userspace.
> +        unsafe { *file_field = file };

A synchronization question: who guarantees that this write is actually
available to the cpu that executes `do_close_fd`? Is there some
synchronization run when returning to userspace?

> +
> +        Ok(())
> +    }
> +
> +    // SAFETY: This function is an implementation detail of `close_fd`, so its safety comments
> +    // should be read in extension of that method.

Why not use this?:
- `inner` is a valid pointer to the `callback_head` field of a valid
  `DeferredFdCloserInner`.
- `inner` has exclusive access to the pointee and owns the allocation.
- `inner` originates from a call to `Box::into_raw`.

> +    unsafe extern "C" fn do_close_fd(inner: *mut bindings::callback_head) {
> +        // SAFETY: In `close_fd` we use this method together with a pointer that originates from a
> +        // `Box<DeferredFdCloserInner>`, and we have just been given ownership of that allocation.
> +        let inner = unsafe { Box::from_raw(inner as *mut DeferredFdCloserInner) };

Use `.cast()`.

> +        if !inner.file.is_null() {
> +            // SAFETY: This drops a refcount we acquired in `close_fd`. Since this callback runs in
> +            // a task work after we return to userspace, it is guaranteed that the current thread
> +            // doesn't hold this file with `fdget`, as `fdget` must be released before returning to
> +            // userspace.
> +            unsafe { bindings::fput(inner.file) };
> +        }
> +        // Free the allocation.
> +        drop(inner);
> +    }
> +}
> +
> +/// Represents a failure to close an fd in a deferred manner.
> +#[derive(Copy, Clone, Eq, PartialEq)]
> +pub enum DeferredFdCloseError {
> +    /// Closing the fd failed because we were unable to schedule a task work.
> +    TaskWorkUnavailable,
> +    /// Closing the fd failed because the fd does not exist.
> +    BadFd,
> +}
> +
> +impl From<DeferredFdCloseError> for Error {
> +    fn from(err: DeferredFdCloseError) -> Error {
> +        match err {
> +            DeferredFdCloseError::TaskWorkUnavailable => ESRCH,

This error reads "No such process", I am not sure if that is the best
way to express the problem in that situation. I took a quick look at the
other error codes, but could not find a better fit. Do you have any
better ideas? Or is this the error that C binder uses?
Alice Ryhl Dec. 11, 2023, 3:34 p.m. UTC | #2
Benno Lossin <benno.lossin@proton.me> writes:
> On 12/6/23 12:59, Alice Ryhl wrote:
> > +    /// Schedule a task work that closes the file descriptor when this task returns to userspace.
> > +    ///
> > +    /// Fails if this is called from a context where we cannot run work when returning to
> > +    /// userspace. (E.g., from a kthread.)
> > +    pub fn close_fd(self, fd: u32) -> Result<(), DeferredFdCloseError> {
> > +        use bindings::task_work_notify_mode_TWA_RESUME as TWA_RESUME;
> > +
> > +        // In this method, we schedule the task work before closing the file. This is because
> > +        // scheduling a task work is fallible, and we need to know whether it will fail before we
> > +        // attempt to close the file.
> > +
> > +        // SAFETY: Getting a pointer to current is always safe.
> > +        let current = unsafe { bindings::get_current() };
> > +
> > +        // SAFETY: Accessing the `flags` field of `current` is always safe.
> > +        let is_kthread = (unsafe { (*current).flags } & bindings::PF_KTHREAD) != 0;
> 
> Since Boqun brought to my attention that we already have a wrapper for
> `get_current()`, how about you use it here as well?

I can use the wrapper, but it seems simpler to not go through a
reference when we just need a raw pointer.

Perhaps we should have a safe `Task::current_raw` function that just
returns a raw pointer? It can still be safe.

> > +        if is_kthread {
> > +            return Err(DeferredFdCloseError::TaskWorkUnavailable);
> > +        }
> > +
> > +        // This disables the destructor of the box, so the allocation is not cleaned up
> > +        // automatically below.
> > +        let inner = Box::into_raw(self.inner);
> 
> Importantly this also lifts the uniqueness requirement (maybe add this
> to the comment?).

I can update the comment.

> > +
> > +        // The `callback_head` field is first in the struct, so this cast correctly gives us a
> > +        // pointer to the field.
> > +        let callback_head = inner.cast::<bindings::callback_head>();
> > +        // SAFETY: This pointer offset operation does not go out-of-bounds.
> > +        let file_field = unsafe { core::ptr::addr_of_mut!((*inner).file) };
> > +
> > +        // SAFETY: The `callback_head` pointer is compatible with the `do_close_fd` method.
> 
> Also, `callback_head` is valid, since it is derived from...

I can update the comment.

> > +        unsafe { bindings::init_task_work(callback_head, Some(Self::do_close_fd)) };
> > +        // SAFETY: The `callback_head` pointer points at a valid and fully initialized task work
> > +        // that is ready to be scheduled.
> > +        //
> > +        // If the task work gets scheduled as-is, then it will be a no-op. However, we will update
> > +        // the file pointer below to tell it which file to fput.
> > +        let res = unsafe { bindings::task_work_add(current, callback_head, TWA_RESUME) };
> > +
> > +        if res != 0 {
> > +            // SAFETY: Scheduling the task work failed, so we still have ownership of the box, so
> > +            // we may destroy it.
> > +            unsafe { drop(Box::from_raw(inner)) };
> > +
> > +            return Err(DeferredFdCloseError::TaskWorkUnavailable);
> 
> Just curious, what could make the `task_work_add` call fail? I imagine
> an OOM situation, but is that all?

Actually, no, this doesn't fail in OOM situations since we pass it an
allocation for its linked list. It fails only when the current task is
exiting and wont run task work again.

> > +        }
> > +
> > +        // SAFETY: Just an FFI call. This is safe no matter what `fd` is.
> 
> I took a look at the C code and there I found this comment:
> 
>     /*
>      * variant of close_fd that gets a ref on the file for later fput.
>      * The caller must ensure that filp_close() called on the file.
>      */
> 
> And while you do call `filp_close` later, this seems like a safety
> requirement to me.

I'll mention this.

> Also, you do not call it when `file` is null, which I imagine to be
> fine, but I do not know that since the C comment does not cover that
> case.

Null pointer means that the fd doesn't exist, and it's correct to do
nothing in that case.

> > +        let file = unsafe { bindings::close_fd_get_file(fd) };
> > +        if file.is_null() {
> > +            // We don't clean up the task work since that might be expensive if the task work queue
> > +            // is long. Just let it execute and let it clean up for itself.
> > +            return Err(DeferredFdCloseError::BadFd);
> > +        }
> > +
> > +        // SAFETY: The `file` pointer points at a valid file.
> > +        unsafe { bindings::get_file(file) };
> > +
> > +        // SAFETY: Due to the above `get_file`, even if the current task holds an `fdget` to
> > +        // this file right now, the refcount will not drop to zero until after it is released
> > +        // with `fdput`. This is because when using `fdget`, you must always use `fdput` before
> 
> Shouldn't this be "the refcount will not drop to zero until after it is
> released with `fput`."?
> 
> Why is this the SAFETY comment for `filp_close`? I am not understanding
> the requirement on that function that needs this. This seems more a
> justification for accessing `file` inside `do_close_fd`. In which case I
> think it would be better to make it a type invariant of
> `DeferredFdCloserInner`.

It's because `filp_close` decreases the refcount for the file, and doing
that is not always safe even if you have a refcount to the file. To drop
the refcount, at least one of the two following must be the case:

* If the refcount decreases to a non-zero value, then it is okay.
* If there are no users of `fdget` on the file, then it is okay.

In this case, we are in the first case, as we own two refcounts.

I'll clarify the safety comment in v3.

> > +        // returning to userspace, and our task work runs after any `fdget` users have returned
> > +        // to userspace.
> > +        //
> > +        // Note: fl_owner_t is currently a void pointer.
> > +        unsafe { bindings::filp_close(file, (*current).files as bindings::fl_owner_t) };
> > +
> > +        // We update the file pointer that the task work is supposed to fput.
> > +        //
> > +        // SAFETY: Task works are executed on the current thread once we return to userspace, so
> > +        // this write is guaranteed to happen before `do_close_fd` is called, which means that a
> > +        // race is not possible here.
> > +        //
> > +        // It's okay to pass this pointer to the task work, since we just acquired a refcount with
> > +        // the previous call to `get_file`. Furthermore, the refcount will not drop to zero during
> > +        // an `fdget` call, since we defer the `fput` until after returning to userspace.
> > +        unsafe { *file_field = file };
> 
> A synchronization question: who guarantees that this write is actually
> available to the cpu that executes `do_close_fd`? Is there some
> synchronization run when returning to userspace?

It's on the same thread, so it's just a sequenced-before relation.

It's not like an interrupt. It runs after the syscall invocation has
exited, but before it does the actual return-to-userspace stuff.

> > +
> > +        Ok(())
> > +    }
> > +
> > +    // SAFETY: This function is an implementation detail of `close_fd`, so its safety comments
> > +    // should be read in extension of that method.
> 
> Why not use this?:
> - `inner` is a valid pointer to the `callback_head` field of a valid
>   `DeferredFdCloserInner`.
> - `inner` has exclusive access to the pointee and owns the allocation.
> - `inner` originates from a call to `Box::into_raw`.

I'll update this together with the clarification of `filp_close`.

> > +    unsafe extern "C" fn do_close_fd(inner: *mut bindings::callback_head) {
> > +        // SAFETY: In `close_fd` we use this method together with a pointer that originates from a
> > +        // `Box<DeferredFdCloserInner>`, and we have just been given ownership of that allocation.
> > +        let inner = unsafe { Box::from_raw(inner as *mut DeferredFdCloserInner) };
> 
> Use `.cast()`.

Ok.

> > +        if !inner.file.is_null() {
> > +            // SAFETY: This drops a refcount we acquired in `close_fd`. Since this callback runs in
> > +            // a task work after we return to userspace, it is guaranteed that the current thread
> > +            // doesn't hold this file with `fdget`, as `fdget` must be released before returning to
> > +            // userspace.
> > +            unsafe { bindings::fput(inner.file) };
> > +        }
> > +        // Free the allocation.
> > +        drop(inner);
> > +    }
> > +}
> > +
> > +/// Represents a failure to close an fd in a deferred manner.
> > +#[derive(Copy, Clone, Eq, PartialEq)]
> > +pub enum DeferredFdCloseError {
> > +    /// Closing the fd failed because we were unable to schedule a task work.
> > +    TaskWorkUnavailable,
> > +    /// Closing the fd failed because the fd does not exist.
> > +    BadFd,
> > +}
> > +
> > +impl From<DeferredFdCloseError> for Error {
> > +    fn from(err: DeferredFdCloseError) -> Error {
> > +        match err {
> > +            DeferredFdCloseError::TaskWorkUnavailable => ESRCH,
> 
> This error reads "No such process", I am not sure if that is the best
> way to express the problem in that situation. I took a quick look at the
> other error codes, but could not find a better fit. Do you have any
> better ideas? Or is this the error that C binder uses?

This is the error code that task_work_add returns. (It can't happen in
Binder.)

And I do think that it is a reasonable choice, because the error only
happens if you're calling the method from a context that has no
userspace process associated with it.

Alice
Benno Lossin Dec. 11, 2023, 5:23 p.m. UTC | #3
On 12/11/23 16:34, Alice Ryhl wrote:
> Benno Lossin <benno.lossin@proton.me> writes:
>> On 12/6/23 12:59, Alice Ryhl wrote:
>>> +    /// Schedule a task work that closes the file descriptor when this task returns to userspace.
>>> +    ///
>>> +    /// Fails if this is called from a context where we cannot run work when returning to
>>> +    /// userspace. (E.g., from a kthread.)
>>> +    pub fn close_fd(self, fd: u32) -> Result<(), DeferredFdCloseError> {
>>> +        use bindings::task_work_notify_mode_TWA_RESUME as TWA_RESUME;
>>> +
>>> +        // In this method, we schedule the task work before closing the file. This is because
>>> +        // scheduling a task work is fallible, and we need to know whether it will fail before we
>>> +        // attempt to close the file.
>>> +
>>> +        // SAFETY: Getting a pointer to current is always safe.
>>> +        let current = unsafe { bindings::get_current() };
>>> +
>>> +        // SAFETY: Accessing the `flags` field of `current` is always safe.
>>> +        let is_kthread = (unsafe { (*current).flags } & bindings::PF_KTHREAD) != 0;
>>
>> Since Boqun brought to my attention that we already have a wrapper for
>> `get_current()`, how about you use it here as well?
> 
> I can use the wrapper, but it seems simpler to not go through a
> reference when we just need a raw pointer.
> 
> Perhaps we should have a safe `Task::current_raw` function that just
> returns a raw pointer? It can still be safe.

Sure, that would work.

Just thought that it is annoying having to write the safety
requirements of that again and again.

[...]

>>> +        unsafe { bindings::init_task_work(callback_head, Some(Self::do_close_fd)) };
>>> +        // SAFETY: The `callback_head` pointer points at a valid and fully initialized task work
>>> +        // that is ready to be scheduled.
>>> +        //
>>> +        // If the task work gets scheduled as-is, then it will be a no-op. However, we will update
>>> +        // the file pointer below to tell it which file to fput.
>>> +        let res = unsafe { bindings::task_work_add(current, callback_head, TWA_RESUME) };
>>> +
>>> +        if res != 0 {
>>> +            // SAFETY: Scheduling the task work failed, so we still have ownership of the box, so
>>> +            // we may destroy it.
>>> +            unsafe { drop(Box::from_raw(inner)) };
>>> +
>>> +            return Err(DeferredFdCloseError::TaskWorkUnavailable);
>>
>> Just curious, what could make the `task_work_add` call fail? I imagine
>> an OOM situation, but is that all?
> 
> Actually, no, this doesn't fail in OOM situations since we pass it an
> allocation for its linked list. It fails only when the current task is
> exiting and wont run task work again.

Interesting, is there some way to check for this aside from calling
`task_work_add`?

>>> +        }
>>> +
>>> +        // SAFETY: Just an FFI call. This is safe no matter what `fd` is.
>>
>> I took a look at the C code and there I found this comment:
>>
>>     /*
>>      * variant of close_fd that gets a ref on the file for later fput.
>>      * The caller must ensure that filp_close() called on the file.
>>      */
>>
>> And while you do call `filp_close` later, this seems like a safety
>> requirement to me.
> 
> I'll mention this.
> 
>> Also, you do not call it when `file` is null, which I imagine to be
>> fine, but I do not know that since the C comment does not cover that
>> case.
> 
> Null pointer means that the fd doesn't exist, and it's correct to do
> nothing in that case.

I would also mention that in a comment (or the SAFETY comment).

>>> +        let file = unsafe { bindings::close_fd_get_file(fd) };
>>> +        if file.is_null() {
>>> +            // We don't clean up the task work since that might be expensive if the task work queue
>>> +            // is long. Just let it execute and let it clean up for itself.
>>> +            return Err(DeferredFdCloseError::BadFd);
>>> +        }
>>> +
>>> +        // SAFETY: The `file` pointer points at a valid file.
>>> +        unsafe { bindings::get_file(file) };
>>> +
>>> +        // SAFETY: Due to the above `get_file`, even if the current task holds an `fdget` to
>>> +        // this file right now, the refcount will not drop to zero until after it is released
>>> +        // with `fdput`. This is because when using `fdget`, you must always use `fdput` before
>>
>> Shouldn't this be "the refcount will not drop to zero until after it is
>> released with `fput`."?
>>
>> Why is this the SAFETY comment for `filp_close`? I am not understanding
>> the requirement on that function that needs this. This seems more a
>> justification for accessing `file` inside `do_close_fd`. In which case I
>> think it would be better to make it a type invariant of
>> `DeferredFdCloserInner`.
> 
> It's because `filp_close` decreases the refcount for the file, and doing
> that is not always safe even if you have a refcount to the file. To drop
> the refcount, at least one of the two following must be the case:
> 
> * If the refcount decreases to a non-zero value, then it is okay.
> * If there are no users of `fdget` on the file, then it is okay.

I see, that makes sense. Is this written down somewhere? Or how does one
know about this?

> In this case, we are in the first case, as we own two refcounts.
> 
> I'll clarify the safety comment in v3.
> 
>>> +        // returning to userspace, and our task work runs after any `fdget` users have returned
>>> +        // to userspace.
>>> +        //
>>> +        // Note: fl_owner_t is currently a void pointer.
>>> +        unsafe { bindings::filp_close(file, (*current).files as bindings::fl_owner_t) };
>>> +
>>> +        // We update the file pointer that the task work is supposed to fput.
>>> +        //
>>> +        // SAFETY: Task works are executed on the current thread once we return to userspace, so
>>> +        // this write is guaranteed to happen before `do_close_fd` is called, which means that a
>>> +        // race is not possible here.
>>> +        //
>>> +        // It's okay to pass this pointer to the task work, since we just acquired a refcount with
>>> +        // the previous call to `get_file`. Furthermore, the refcount will not drop to zero during
>>> +        // an `fdget` call, since we defer the `fput` until after returning to userspace.
>>> +        unsafe { *file_field = file };
>>
>> A synchronization question: who guarantees that this write is actually
>> available to the cpu that executes `do_close_fd`? Is there some
>> synchronization run when returning to userspace?
> 
> It's on the same thread, so it's just a sequenced-before relation.
> 
> It's not like an interrupt. It runs after the syscall invocation has
> exited, but before it does the actual return-to-userspace stuff.

Reasonable, can you also put this in a comment?

[...]

>>> +        if !inner.file.is_null() {
>>> +            // SAFETY: This drops a refcount we acquired in `close_fd`. Since this callback runs in
>>> +            // a task work after we return to userspace, it is guaranteed that the current thread
>>> +            // doesn't hold this file with `fdget`, as `fdget` must be released before returning to
>>> +            // userspace.
>>> +            unsafe { bindings::fput(inner.file) };
>>> +        }
>>> +        // Free the allocation.
>>> +        drop(inner);
>>> +    }
>>> +}
>>> +
>>> +/// Represents a failure to close an fd in a deferred manner.
>>> +#[derive(Copy, Clone, Eq, PartialEq)]
>>> +pub enum DeferredFdCloseError {
>>> +    /// Closing the fd failed because we were unable to schedule a task work.
>>> +    TaskWorkUnavailable,
>>> +    /// Closing the fd failed because the fd does not exist.
>>> +    BadFd,
>>> +}
>>> +
>>> +impl From<DeferredFdCloseError> for Error {
>>> +    fn from(err: DeferredFdCloseError) -> Error {
>>> +        match err {
>>> +            DeferredFdCloseError::TaskWorkUnavailable => ESRCH,
>>
>> This error reads "No such process", I am not sure if that is the best
>> way to express the problem in that situation. I took a quick look at the
>> other error codes, but could not find a better fit. Do you have any
>> better ideas? Or is this the error that C binder uses?
> 
> This is the error code that task_work_add returns. (It can't happen in
> Binder.)
> 
> And I do think that it is a reasonable choice, because the error only
> happens if you're calling the method from a context that has no
> userspace process associated with it.

I see.

What do you think of making the Rust error more descriptive? So instead
of implementing `Debug` like you currently do, you print
    
    $error ($variant)

where $error = Error::from(*self) and $variant is the name of the
variant?

This is more of a general suggestion, I don't think that this error type
in particular warrants this. But in general with Rust we do have the
option to have good error messages for every error while maintaining
efficient error values.
Boqun Feng Dec. 11, 2023, 5:41 p.m. UTC | #4
On Mon, Dec 11, 2023 at 03:34:40PM +0000, Alice Ryhl wrote:
> Benno Lossin <benno.lossin@proton.me> writes:
> > On 12/6/23 12:59, Alice Ryhl wrote:
> > > +    /// Schedule a task work that closes the file descriptor when this task returns to userspace.
> > > +    ///
> > > +    /// Fails if this is called from a context where we cannot run work when returning to
> > > +    /// userspace. (E.g., from a kthread.)
> > > +    pub fn close_fd(self, fd: u32) -> Result<(), DeferredFdCloseError> {
> > > +        use bindings::task_work_notify_mode_TWA_RESUME as TWA_RESUME;
> > > +
> > > +        // In this method, we schedule the task work before closing the file. This is because
> > > +        // scheduling a task work is fallible, and we need to know whether it will fail before we
> > > +        // attempt to close the file.
> > > +
> > > +        // SAFETY: Getting a pointer to current is always safe.
> > > +        let current = unsafe { bindings::get_current() };
> > > +
> > > +        // SAFETY: Accessing the `flags` field of `current` is always safe.
> > > +        let is_kthread = (unsafe { (*current).flags } & bindings::PF_KTHREAD) != 0;
> > 
> > Since Boqun brought to my attention that we already have a wrapper for
> > `get_current()`, how about you use it here as well?
> 
> I can use the wrapper, but it seems simpler to not go through a
> reference when we just need a raw pointer.
> 
> Perhaps we should have a safe `Task::current_raw` function that just
> returns a raw pointer? It can still be safe.
> 

I think we can have a `as_ptr` function for `Task`?

	impl Task {
	    pub fn as_ptr(&self) -> *mut bindings::task_struct {
	        self.0.get()
	    }
	}

Regards,
Boqun

[...]
Boqun Feng Dec. 12, 2023, 1:25 a.m. UTC | #5
On Mon, Dec 11, 2023 at 09:41:28AM -0800, Boqun Feng wrote:
> On Mon, Dec 11, 2023 at 03:34:40PM +0000, Alice Ryhl wrote:
> > Benno Lossin <benno.lossin@proton.me> writes:
> > > On 12/6/23 12:59, Alice Ryhl wrote:
> > > > +    /// Schedule a task work that closes the file descriptor when this task returns to userspace.
> > > > +    ///
> > > > +    /// Fails if this is called from a context where we cannot run work when returning to
> > > > +    /// userspace. (E.g., from a kthread.)
> > > > +    pub fn close_fd(self, fd: u32) -> Result<(), DeferredFdCloseError> {
> > > > +        use bindings::task_work_notify_mode_TWA_RESUME as TWA_RESUME;
> > > > +
> > > > +        // In this method, we schedule the task work before closing the file. This is because
> > > > +        // scheduling a task work is fallible, and we need to know whether it will fail before we
> > > > +        // attempt to close the file.
> > > > +
> > > > +        // SAFETY: Getting a pointer to current is always safe.
> > > > +        let current = unsafe { bindings::get_current() };
> > > > +
> > > > +        // SAFETY: Accessing the `flags` field of `current` is always safe.
> > > > +        let is_kthread = (unsafe { (*current).flags } & bindings::PF_KTHREAD) != 0;
> > > 
> > > Since Boqun brought to my attention that we already have a wrapper for
> > > `get_current()`, how about you use it here as well?
> > 
> > I can use the wrapper, but it seems simpler to not go through a
> > reference when we just need a raw pointer.
> > 
> > Perhaps we should have a safe `Task::current_raw` function that just
> > returns a raw pointer? It can still be safe.
> > 
> 
> I think we can have a `as_ptr` function for `Task`?
> 
> 	impl Task {
> 	    pub fn as_ptr(&self) -> *mut bindings::task_struct {
> 	        self.0.get()
> 	    }
> 	}

Forgot mention, yes a ptr->ref->ptr trip may not be ideal, but I think
eventually we will have a task work wrapper, in that case maybe
Task::as_ptr() is still needed somehow.

Regards,
Boqun
Alice Ryhl Dec. 12, 2023, 9:35 a.m. UTC | #6
On Mon, Dec 11, 2023 at 6:23 PM Benno Lossin <benno.lossin@proton.me> wrote:
>
> >>> +        unsafe { bindings::init_task_work(callback_head, Some(Self::do_close_fd)) };
> >>> +        // SAFETY: The `callback_head` pointer points at a valid and fully initialized task work
> >>> +        // that is ready to be scheduled.
> >>> +        //
> >>> +        // If the task work gets scheduled as-is, then it will be a no-op. However, we will update
> >>> +        // the file pointer below to tell it which file to fput.
> >>> +        let res = unsafe { bindings::task_work_add(current, callback_head, TWA_RESUME) };
> >>> +
> >>> +        if res != 0 {
> >>> +            // SAFETY: Scheduling the task work failed, so we still have ownership of the box, so
> >>> +            // we may destroy it.
> >>> +            unsafe { drop(Box::from_raw(inner)) };
> >>> +
> >>> +            return Err(DeferredFdCloseError::TaskWorkUnavailable);
> >>
> >> Just curious, what could make the `task_work_add` call fail? I imagine
> >> an OOM situation, but is that all?
> >
> > Actually, no, this doesn't fail in OOM situations since we pass it an
> > allocation for its linked list. It fails only when the current task is
> > exiting and wont run task work again.
>
> Interesting, is there some way to check for this aside from calling
> `task_work_add`?

I don't think so. We would need to access the `work_exited` constant
in `kernel/task_work.c` to do that, but it is not exported.

> >> Also, you do not call it when `file` is null, which I imagine to be
> >> fine, but I do not know that since the C comment does not cover that
> >> case.
> >
> > Null pointer means that the fd doesn't exist, and it's correct to do
> > nothing in that case.
>
> I would also mention that in a comment (or the SAFETY comment).

Okay.

> >>> +        let file = unsafe { bindings::close_fd_get_file(fd) };
> >>> +        if file.is_null() {
> >>> +            // We don't clean up the task work since that might be expensive if the task work queue
> >>> +            // is long. Just let it execute and let it clean up for itself.
> >>> +            return Err(DeferredFdCloseError::BadFd);
> >>> +        }
> >>> +
> >>> +        // SAFETY: The `file` pointer points at a valid file.
> >>> +        unsafe { bindings::get_file(file) };
> >>> +
> >>> +        // SAFETY: Due to the above `get_file`, even if the current task holds an `fdget` to
> >>> +        // this file right now, the refcount will not drop to zero until after it is released
> >>> +        // with `fdput`. This is because when using `fdget`, you must always use `fdput` before
> >>
> >> Shouldn't this be "the refcount will not drop to zero until after it is
> >> released with `fput`."?
> >>
> >> Why is this the SAFETY comment for `filp_close`? I am not understanding
> >> the requirement on that function that needs this. This seems more a
> >> justification for accessing `file` inside `do_close_fd`. In which case I
> >> think it would be better to make it a type invariant of
> >> `DeferredFdCloserInner`.
> >
> > It's because `filp_close` decreases the refcount for the file, and doing
> > that is not always safe even if you have a refcount to the file. To drop
> > the refcount, at least one of the two following must be the case:
> >
> > * If the refcount decreases to a non-zero value, then it is okay.
> > * If there are no users of `fdget` on the file, then it is okay.
>
> I see, that makes sense. Is this written down somewhere? Or how does one
> know about this?

I don't think there's a single place to read about this. The comments
on __fget_light allude to something similar, but it makes the blanket
statement that you can't call filp_close while an fdget reference
exists, even though the reality is a bit more nuanced.

> >>> +        // We update the file pointer that the task work is supposed to fput.
> >>> +        //
> >>> +        // SAFETY: Task works are executed on the current thread once we return to userspace, so
> >>> +        // this write is guaranteed to happen before `do_close_fd` is called, which means that a
> >>> +        // race is not possible here.
> >>> +        //
> >>> +        // It's okay to pass this pointer to the task work, since we just acquired a refcount with
> >>> +        // the previous call to `get_file`. Furthermore, the refcount will not drop to zero during
> >>> +        // an `fdget` call, since we defer the `fput` until after returning to userspace.
> >>> +        unsafe { *file_field = file };
> >>
> >> A synchronization question: who guarantees that this write is actually
> >> available to the cpu that executes `do_close_fd`? Is there some
> >> synchronization run when returning to userspace?
> >
> > It's on the same thread, so it's just a sequenced-before relation.
> >
> > It's not like an interrupt. It runs after the syscall invocation has
> > exited, but before it does the actual return-to-userspace stuff.
>
> Reasonable, can you also put this in a comment?

What do you want me to add? I already say that it will be executed on
the same thread.

> >>> +/// Represents a failure to close an fd in a deferred manner.
> >>> +#[derive(Copy, Clone, Eq, PartialEq)]
> >>> +pub enum DeferredFdCloseError {
> >>> +    /// Closing the fd failed because we were unable to schedule a task work.
> >>> +    TaskWorkUnavailable,
> >>> +    /// Closing the fd failed because the fd does not exist.
> >>> +    BadFd,
> >>> +}
> >>> +
> >>> +impl From<DeferredFdCloseError> for Error {
> >>> +    fn from(err: DeferredFdCloseError) -> Error {
> >>> +        match err {
> >>> +            DeferredFdCloseError::TaskWorkUnavailable => ESRCH,
> >>
> >> This error reads "No such process", I am not sure if that is the best
> >> way to express the problem in that situation. I took a quick look at the
> >> other error codes, but could not find a better fit. Do you have any
> >> better ideas? Or is this the error that C binder uses?
> >
> > This is the error code that task_work_add returns. (It can't happen in
> > Binder.)
> >
> > And I do think that it is a reasonable choice, because the error only
> > happens if you're calling the method from a context that has no
> > userspace process associated with it.
>
> I see.
>
> What do you think of making the Rust error more descriptive? So instead
> of implementing `Debug` like you currently do, you print
>
>     $error ($variant)
>
> where $error = Error::from(*self) and $variant is the name of the
> variant?
>
> This is more of a general suggestion, I don't think that this error type
> in particular warrants this. But in general with Rust we do have the
> option to have good error messages for every error while maintaining
> efficient error values.

I can #[derive(Debug)] instead, I guess?

Alice
Benno Lossin Dec. 12, 2023, 4:50 p.m. UTC | #7
On 12/12/23 10:35, Alice Ryhl wrote:
> On Mon, Dec 11, 2023 at 6:23 PM Benno Lossin <benno.lossin@proton.me> wrote:
>>
>>>>> +        // We update the file pointer that the task work is supposed to fput.
>>>>> +        //
>>>>> +        // SAFETY: Task works are executed on the current thread once we return to userspace, so
>>>>> +        // this write is guaranteed to happen before `do_close_fd` is called, which means that a
>>>>> +        // race is not possible here.
>>>>> +        //
>>>>> +        // It's okay to pass this pointer to the task work, since we just acquired a refcount with
>>>>> +        // the previous call to `get_file`. Furthermore, the refcount will not drop to zero during
>>>>> +        // an `fdget` call, since we defer the `fput` until after returning to userspace.
>>>>> +        unsafe { *file_field = file };
>>>>
>>>> A synchronization question: who guarantees that this write is actually
>>>> available to the cpu that executes `do_close_fd`? Is there some
>>>> synchronization run when returning to userspace?
>>>
>>> It's on the same thread, so it's just a sequenced-before relation.
>>>
>>> It's not like an interrupt. It runs after the syscall invocation has
>>> exited, but before it does the actual return-to-userspace stuff.
>>
>> Reasonable, can you also put this in a comment?
> 
> What do you want me to add? I already say that it will be executed on
> the same thread.

Seems I missed that, then no need to add anything.

>>>>> +/// Represents a failure to close an fd in a deferred manner.
>>>>> +#[derive(Copy, Clone, Eq, PartialEq)]
>>>>> +pub enum DeferredFdCloseError {
>>>>> +    /// Closing the fd failed because we were unable to schedule a task work.
>>>>> +    TaskWorkUnavailable,
>>>>> +    /// Closing the fd failed because the fd does not exist.
>>>>> +    BadFd,
>>>>> +}
>>>>> +
>>>>> +impl From<DeferredFdCloseError> for Error {
>>>>> +    fn from(err: DeferredFdCloseError) -> Error {
>>>>> +        match err {
>>>>> +            DeferredFdCloseError::TaskWorkUnavailable => ESRCH,
>>>>
>>>> This error reads "No such process", I am not sure if that is the best
>>>> way to express the problem in that situation. I took a quick look at the
>>>> other error codes, but could not find a better fit. Do you have any
>>>> better ideas? Or is this the error that C binder uses?
>>>
>>> This is the error code that task_work_add returns. (It can't happen in
>>> Binder.)
>>>
>>> And I do think that it is a reasonable choice, because the error only
>>> happens if you're calling the method from a context that has no
>>> userspace process associated with it.
>>
>> I see.
>>
>> What do you think of making the Rust error more descriptive? So instead
>> of implementing `Debug` like you currently do, you print
>>
>>     $error ($variant)
>>
>> where $error = Error::from(*self) and $variant is the name of the
>> variant?
>>
>> This is more of a general suggestion, I don't think that this error type
>> in particular warrants this. But in general with Rust we do have the
>> option to have good error messages for every error while maintaining
>> efficient error values.
> 
> I can #[derive(Debug)] instead, I guess?

Hmm I thought that might not be ideal, since then you would not have the
error code, only `TaskWorkUnavailable` or `BadFd`.
But if that is also acceptable, then I would go with the derived debug.
Boqun Feng Dec. 12, 2023, 8:57 p.m. UTC | #8
On Mon, Dec 11, 2023 at 05:25:18PM -0800, Boqun Feng wrote:
> On Mon, Dec 11, 2023 at 09:41:28AM -0800, Boqun Feng wrote:
> > On Mon, Dec 11, 2023 at 03:34:40PM +0000, Alice Ryhl wrote:
> > > Benno Lossin <benno.lossin@proton.me> writes:
> > > > On 12/6/23 12:59, Alice Ryhl wrote:
> > > > > +    /// Schedule a task work that closes the file descriptor when this task returns to userspace.
> > > > > +    ///
> > > > > +    /// Fails if this is called from a context where we cannot run work when returning to
> > > > > +    /// userspace. (E.g., from a kthread.)
> > > > > +    pub fn close_fd(self, fd: u32) -> Result<(), DeferredFdCloseError> {
> > > > > +        use bindings::task_work_notify_mode_TWA_RESUME as TWA_RESUME;
> > > > > +
> > > > > +        // In this method, we schedule the task work before closing the file. This is because
> > > > > +        // scheduling a task work is fallible, and we need to know whether it will fail before we
> > > > > +        // attempt to close the file.
> > > > > +
> > > > > +        // SAFETY: Getting a pointer to current is always safe.
> > > > > +        let current = unsafe { bindings::get_current() };
> > > > > +
> > > > > +        // SAFETY: Accessing the `flags` field of `current` is always safe.
> > > > > +        let is_kthread = (unsafe { (*current).flags } & bindings::PF_KTHREAD) != 0;
> > > > 
> > > > Since Boqun brought to my attention that we already have a wrapper for
> > > > `get_current()`, how about you use it here as well?
> > > 
> > > I can use the wrapper, but it seems simpler to not go through a
> > > reference when we just need a raw pointer.
> > > 
> > > Perhaps we should have a safe `Task::current_raw` function that just
> > > returns a raw pointer? It can still be safe.
> > > 
> > 
> > I think we can have a `as_ptr` function for `Task`?
> > 
> > 	impl Task {
> > 	    pub fn as_ptr(&self) -> *mut bindings::task_struct {
> > 	        self.0.get()
> > 	    }
> > 	}
> 
> Forgot mention, yes a ptr->ref->ptr trip may not be ideal, but I think
> eventually we will have a task work wrapper, in that case maybe
> Task::as_ptr() is still needed somehow.
> 

After some more thoughts, I agree `Task::current_raw` may be better for
the current usage, since we can also use it to wrap a
`current_is_kthread` method like:

    impl Task {
        pub fn current_is_kthread() -> bool {
	    let current = Self::current_raw();

	    unsafe { (*current).flags & bindings::PF_KTHREAD != 0 }
	}
    }

Regards,
Boqun
Alice Ryhl Dec. 13, 2023, 11:04 a.m. UTC | #9
On Tue, Dec 12, 2023 at 9:57 PM Boqun Feng <boqun.feng@gmail.com> wrote:
>
> On Mon, Dec 11, 2023 at 05:25:18PM -0800, Boqun Feng wrote:
> > On Mon, Dec 11, 2023 at 09:41:28AM -0800, Boqun Feng wrote:
> > > On Mon, Dec 11, 2023 at 03:34:40PM +0000, Alice Ryhl wrote:
> > > > Benno Lossin <benno.lossin@proton.me> writes:
> > > > > On 12/6/23 12:59, Alice Ryhl wrote:
> > > > > > +    /// Schedule a task work that closes the file descriptor when this task returns to userspace.
> > > > > > +    ///
> > > > > > +    /// Fails if this is called from a context where we cannot run work when returning to
> > > > > > +    /// userspace. (E.g., from a kthread.)
> > > > > > +    pub fn close_fd(self, fd: u32) -> Result<(), DeferredFdCloseError> {
> > > > > > +        use bindings::task_work_notify_mode_TWA_RESUME as TWA_RESUME;
> > > > > > +
> > > > > > +        // In this method, we schedule the task work before closing the file. This is because
> > > > > > +        // scheduling a task work is fallible, and we need to know whether it will fail before we
> > > > > > +        // attempt to close the file.
> > > > > > +
> > > > > > +        // SAFETY: Getting a pointer to current is always safe.
> > > > > > +        let current = unsafe { bindings::get_current() };
> > > > > > +
> > > > > > +        // SAFETY: Accessing the `flags` field of `current` is always safe.
> > > > > > +        let is_kthread = (unsafe { (*current).flags } & bindings::PF_KTHREAD) != 0;
> > > > >
> > > > > Since Boqun brought to my attention that we already have a wrapper for
> > > > > `get_current()`, how about you use it here as well?
> > > >
> > > > I can use the wrapper, but it seems simpler to not go through a
> > > > reference when we just need a raw pointer.
> > > >
> > > > Perhaps we should have a safe `Task::current_raw` function that just
> > > > returns a raw pointer? It can still be safe.
> > > >
> > >
> > > I think we can have a `as_ptr` function for `Task`?
> > >
> > >     impl Task {
> > >         pub fn as_ptr(&self) -> *mut bindings::task_struct {
> > >             self.0.get()
> > >         }
> > >     }
> >
> > Forgot mention, yes a ptr->ref->ptr trip may not be ideal, but I think
> > eventually we will have a task work wrapper, in that case maybe
> > Task::as_ptr() is still needed somehow.
> >
>
> After some more thoughts, I agree `Task::current_raw` may be better for
> the current usage, since we can also use it to wrap a
> `current_is_kthread` method like:
>
>     impl Task {
>         pub fn current_is_kthread() -> bool {
>             let current = Self::current_raw();
>
>             unsafe { (*current).flags & bindings::PF_KTHREAD != 0 }
>         }
>     }

I'll introduce a current_raw, then.

Alice
diff mbox series

Patch

diff --git a/rust/bindings/bindings_helper.h b/rust/bindings/bindings_helper.h
index 700f01840188..c8daee341df6 100644
--- a/rust/bindings/bindings_helper.h
+++ b/rust/bindings/bindings_helper.h
@@ -9,6 +9,7 @@ 
 #include <kunit/test.h>
 #include <linux/cred.h>
 #include <linux/errname.h>
+#include <linux/fdtable.h>
 #include <linux/file.h>
 #include <linux/fs.h>
 #include <linux/pid_namespace.h>
@@ -17,6 +18,7 @@ 
 #include <linux/refcount.h>
 #include <linux/wait.h>
 #include <linux/sched.h>
+#include <linux/task_work.h>
 #include <linux/workqueue.h>
 
 /* `bindgen` gets confused at certain things. */
diff --git a/rust/helpers.c b/rust/helpers.c
index 58e3a9dff349..d146bbf25aec 100644
--- a/rust/helpers.c
+++ b/rust/helpers.c
@@ -32,6 +32,7 @@ 
 #include <linux/sched/signal.h>
 #include <linux/security.h>
 #include <linux/spinlock.h>
+#include <linux/task_work.h>
 #include <linux/wait.h>
 #include <linux/workqueue.h>
 
@@ -243,6 +244,13 @@  void rust_helper_security_release_secctx(char *secdata, u32 seclen)
 EXPORT_SYMBOL_GPL(rust_helper_security_release_secctx);
 #endif
 
+void rust_helper_init_task_work(struct callback_head *twork,
+				task_work_func_t func)
+{
+	init_task_work(twork, func);
+}
+EXPORT_SYMBOL_GPL(rust_helper_init_task_work);
+
 /*
  * `bindgen` binds the C `size_t` type as the Rust `usize` type, so we can
  * use it in contexts where Rust expects a `usize` like slice (array) indices.
diff --git a/rust/kernel/file.rs b/rust/kernel/file.rs
index 2d036d4636a0..eba96af4bdf7 100644
--- a/rust/kernel/file.rs
+++ b/rust/kernel/file.rs
@@ -11,7 +11,8 @@ 
     error::{code::*, Error, Result},
     types::{ARef, AlwaysRefCounted, NotThreadSafe, Opaque},
 };
-use core::{marker::PhantomData, ptr};
+use alloc::boxed::Box;
+use core::{alloc::AllocError, marker::PhantomData, mem, ptr};
 
 /// Flags associated with a [`File`].
 pub mod flags {
@@ -257,6 +258,160 @@  fn drop(&mut self) {
     }
 }
 
+/// Helper used for closing file descriptors in a way that is safe even if the file is currently
+/// held using `fdget`.
+///
+/// Additional motivation can be found in commit 80cd795630d6 ("binder: fix use-after-free due to
+/// ksys_close() during fdget()") and in the comments on `binder_do_fd_close`.
+pub struct DeferredFdCloser {
+    inner: Box<DeferredFdCloserInner>,
+}
+
+/// SAFETY: This just holds an allocation with no real content, so there's no safety issue with
+/// moving it across threads.
+unsafe impl Send for DeferredFdCloser {}
+unsafe impl Sync for DeferredFdCloser {}
+
+#[repr(C)]
+struct DeferredFdCloserInner {
+    twork: mem::MaybeUninit<bindings::callback_head>,
+    file: *mut bindings::file,
+}
+
+impl DeferredFdCloser {
+    /// Create a new [`DeferredFdCloser`].
+    pub fn new() -> Result<Self, AllocError> {
+        Ok(Self {
+            inner: Box::try_new(DeferredFdCloserInner {
+                twork: mem::MaybeUninit::uninit(),
+                file: core::ptr::null_mut(),
+            })?,
+        })
+    }
+
+    /// Schedule a task work that closes the file descriptor when this task returns to userspace.
+    ///
+    /// Fails if this is called from a context where we cannot run work when returning to
+    /// userspace. (E.g., from a kthread.)
+    pub fn close_fd(self, fd: u32) -> Result<(), DeferredFdCloseError> {
+        use bindings::task_work_notify_mode_TWA_RESUME as TWA_RESUME;
+
+        // In this method, we schedule the task work before closing the file. This is because
+        // scheduling a task work is fallible, and we need to know whether it will fail before we
+        // attempt to close the file.
+
+        // SAFETY: Getting a pointer to current is always safe.
+        let current = unsafe { bindings::get_current() };
+
+        // SAFETY: Accessing the `flags` field of `current` is always safe.
+        let is_kthread = (unsafe { (*current).flags } & bindings::PF_KTHREAD) != 0;
+        if is_kthread {
+            return Err(DeferredFdCloseError::TaskWorkUnavailable);
+        }
+
+        // This disables the destructor of the box, so the allocation is not cleaned up
+        // automatically below.
+        let inner = Box::into_raw(self.inner);
+
+        // The `callback_head` field is first in the struct, so this cast correctly gives us a
+        // pointer to the field.
+        let callback_head = inner.cast::<bindings::callback_head>();
+        // SAFETY: This pointer offset operation does not go out-of-bounds.
+        let file_field = unsafe { core::ptr::addr_of_mut!((*inner).file) };
+
+        // SAFETY: The `callback_head` pointer is compatible with the `do_close_fd` method.
+        unsafe { bindings::init_task_work(callback_head, Some(Self::do_close_fd)) };
+        // SAFETY: The `callback_head` pointer points at a valid and fully initialized task work
+        // that is ready to be scheduled.
+        //
+        // If the task work gets scheduled as-is, then it will be a no-op. However, we will update
+        // the file pointer below to tell it which file to fput.
+        let res = unsafe { bindings::task_work_add(current, callback_head, TWA_RESUME) };
+
+        if res != 0 {
+            // SAFETY: Scheduling the task work failed, so we still have ownership of the box, so
+            // we may destroy it.
+            unsafe { drop(Box::from_raw(inner)) };
+
+            return Err(DeferredFdCloseError::TaskWorkUnavailable);
+        }
+
+        // SAFETY: Just an FFI call. This is safe no matter what `fd` is.
+        let file = unsafe { bindings::close_fd_get_file(fd) };
+        if file.is_null() {
+            // We don't clean up the task work since that might be expensive if the task work queue
+            // is long. Just let it execute and let it clean up for itself.
+            return Err(DeferredFdCloseError::BadFd);
+        }
+
+        // SAFETY: The `file` pointer points at a valid file.
+        unsafe { bindings::get_file(file) };
+
+        // SAFETY: Due to the above `get_file`, even if the current task holds an `fdget` to
+        // this file right now, the refcount will not drop to zero until after it is released
+        // with `fdput`. This is because when using `fdget`, you must always use `fdput` before
+        // returning to userspace, and our task work runs after any `fdget` users have returned
+        // to userspace.
+        //
+        // Note: fl_owner_t is currently a void pointer.
+        unsafe { bindings::filp_close(file, (*current).files as bindings::fl_owner_t) };
+
+        // We update the file pointer that the task work is supposed to fput.
+        //
+        // SAFETY: Task works are executed on the current thread once we return to userspace, so
+        // this write is guaranteed to happen before `do_close_fd` is called, which means that a
+        // race is not possible here.
+        //
+        // It's okay to pass this pointer to the task work, since we just acquired a refcount with
+        // the previous call to `get_file`. Furthermore, the refcount will not drop to zero during
+        // an `fdget` call, since we defer the `fput` until after returning to userspace.
+        unsafe { *file_field = file };
+
+        Ok(())
+    }
+
+    // SAFETY: This function is an implementation detail of `close_fd`, so its safety comments
+    // should be read in extension of that method.
+    unsafe extern "C" fn do_close_fd(inner: *mut bindings::callback_head) {
+        // SAFETY: In `close_fd` we use this method together with a pointer that originates from a
+        // `Box<DeferredFdCloserInner>`, and we have just been given ownership of that allocation.
+        let inner = unsafe { Box::from_raw(inner as *mut DeferredFdCloserInner) };
+        if !inner.file.is_null() {
+            // SAFETY: This drops a refcount we acquired in `close_fd`. Since this callback runs in
+            // a task work after we return to userspace, it is guaranteed that the current thread
+            // doesn't hold this file with `fdget`, as `fdget` must be released before returning to
+            // userspace.
+            unsafe { bindings::fput(inner.file) };
+        }
+        // Free the allocation.
+        drop(inner);
+    }
+}
+
+/// Represents a failure to close an fd in a deferred manner.
+#[derive(Copy, Clone, Eq, PartialEq)]
+pub enum DeferredFdCloseError {
+    /// Closing the fd failed because we were unable to schedule a task work.
+    TaskWorkUnavailable,
+    /// Closing the fd failed because the fd does not exist.
+    BadFd,
+}
+
+impl From<DeferredFdCloseError> for Error {
+    fn from(err: DeferredFdCloseError) -> Error {
+        match err {
+            DeferredFdCloseError::TaskWorkUnavailable => ESRCH,
+            DeferredFdCloseError::BadFd => EBADF,
+        }
+    }
+}
+
+impl core::fmt::Debug for DeferredFdCloseError {
+    fn fmt(&self, f: &mut core::fmt::Formatter<'_>) -> core::fmt::Result {
+        Error::from(*self).fmt(f)
+    }
+}
+
 /// Represents the `EBADF` error code.
 ///
 /// Used for methods that can only fail with `EBADF`.