diff mbox series

[RFC,10/19] rust: fs: introduce `FileSystem::read_folio`

Message ID 20231018122518.128049-11-wedsonaf@gmail.com (mailing list archive)
State New, archived
Headers show
Series Rust abstractions for VFS | expand

Commit Message

Wedson Almeida Filho Oct. 18, 2023, 12:25 p.m. UTC
From: Wedson Almeida Filho <walmeida@microsoft.com>

Allow Rust file systems to create regular file inodes backed by the page
cache. The contents of such files are read into folios via `read_folio`.

Signed-off-by: Wedson Almeida Filho <walmeida@microsoft.com>
---
 rust/kernel/folio.rs      |  1 -
 rust/kernel/fs.rs         | 75 +++++++++++++++++++++++++++++++++++++--
 samples/rust/rust_rofs.rs | 69 ++++++++++++++++++++++++-----------
 3 files changed, 122 insertions(+), 23 deletions(-)

Comments

Matthew Wilcox Nov. 7, 2023, 10:18 p.m. UTC | #1
On Wed, Oct 18, 2023 at 09:25:09AM -0300, Wedson Almeida Filho wrote:
> @@ -36,6 +39,9 @@ pub trait FileSystem {
>  
>      /// Returns the inode corresponding to the directory entry with the given name.
>      fn lookup(parent: &INode<Self>, name: &[u8]) -> Result<ARef<INode<Self>>>;
> +
> +    /// Reads the contents of the inode into the given folio.
> +    fn read_folio(inode: &INode<Self>, folio: LockedFolio<'_>) -> Result;
>  }
>  

This really shouldn't be a per-filesystem operation.  We have operations
split up into mapping_ops, inode_ops and file_ops for a reason.  In this
case, read_folio() can have a very different implementation for, eg,
symlinks, directories and files.  So we want to have different aops
for each of symlinks, directories and files.  We should maintain that
separation for filesystems written in Rust too.  Unless there's a good
reason to change it, and then we should change it in C too.
Al Viro Nov. 7, 2023, 10:22 p.m. UTC | #2
On Tue, Nov 07, 2023 at 10:18:05PM +0000, Matthew Wilcox wrote:
> On Wed, Oct 18, 2023 at 09:25:09AM -0300, Wedson Almeida Filho wrote:
> > @@ -36,6 +39,9 @@ pub trait FileSystem {
> >  
> >      /// Returns the inode corresponding to the directory entry with the given name.
> >      fn lookup(parent: &INode<Self>, name: &[u8]) -> Result<ARef<INode<Self>>>;
> > +
> > +    /// Reads the contents of the inode into the given folio.
> > +    fn read_folio(inode: &INode<Self>, folio: LockedFolio<'_>) -> Result;
> >  }
> >  
> 
> This really shouldn't be a per-filesystem operation.  We have operations
> split up into mapping_ops, inode_ops and file_ops for a reason.  In this
> case, read_folio() can have a very different implementation for, eg,
> symlinks, directories and files.  So we want to have different aops
> for each of symlinks, directories and files.  We should maintain that
> separation for filesystems written in Rust too.  Unless there's a good
> reason to change it, and then we should change it in C too.

While we are at it, lookup is also very much not a per-filesystem operation.
Take a look at e.g. procfs, for an obvious example...

Wait a minute... what in name of everything unholy is that thing doing tied
to inodes in the first place?
Wedson Almeida Filho Nov. 8, 2023, 12:35 a.m. UTC | #3
On Tue, 7 Nov 2023 at 19:22, Al Viro <viro@zeniv.linux.org.uk> wrote:
>
> On Tue, Nov 07, 2023 at 10:18:05PM +0000, Matthew Wilcox wrote:
> > On Wed, Oct 18, 2023 at 09:25:09AM -0300, Wedson Almeida Filho wrote:
> > > @@ -36,6 +39,9 @@ pub trait FileSystem {
> > >
> > >      /// Returns the inode corresponding to the directory entry with the given name.
> > >      fn lookup(parent: &INode<Self>, name: &[u8]) -> Result<ARef<INode<Self>>>;
> > > +
> > > +    /// Reads the contents of the inode into the given folio.
> > > +    fn read_folio(inode: &INode<Self>, folio: LockedFolio<'_>) -> Result;
> > >  }
> > >
> >
> > This really shouldn't be a per-filesystem operation.  We have operations
> > split up into mapping_ops, inode_ops and file_ops for a reason.  In this
> > case, read_folio() can have a very different implementation for, eg,
> > symlinks, directories and files.  So we want to have different aops
> > for each of symlinks, directories and files.  We should maintain that
> > separation for filesystems written in Rust too.  Unless there's a good
> > reason to change it, and then we should change it in C too.

read_folio() is only called for regular files and symlinks. All other
modes (directories, pipes, sockets, char devices, block devices) have
their own read callbacks that don't involve read_folio().

For the filesystems that we have in Rust today, reading the contents
of a symlink is the same as reading a file (i.e., the name of the link
target is stored the same way as data in a file). For cases when this
is different, read_folio() can of course just check the mode of the
inode and take the appropriate path.

This is also what a bunch of C file systems do. But you folks are the
ones with most experience in file systems, if you think this isn't a
good idea, we could use read_folio() only for regular files and
introduce a function for reading symblinks, say read_symlink().

> While we are at it, lookup is also very much not a per-filesystem operation.
> Take a look at e.g. procfs, for an obvious example...

The C api offers the greatest freedom: one could write a file system
where each file has its own set of mapping_ops, inode_ops and
file_ops; and while we could choose to replicate this freedom in Rust
but we haven't.

Mostly because we don't need it, and we've been repeatedly told (by
Greg KH and others) not to introduce abstractions/bindings for
anything for which there isn't a user. Besides being a longstanding
rule in the kernel, they also say that they can't reasonably decide if
the interfaces are good if they can't see the users.

The existing Rust users (tarfs and puzzlefs) only need a single
lookup. And a quick grep (git grep \\\.lookup\\\> -- fs/) appears to
show that the vast majority of C filesystems only have a single lookup
as well. So we choose simplicity, knowing well that we may have to
revisit it in the future if the needs change.

> Wait a minute... what in name of everything unholy is that thing doing tied
> to inodes in the first place?

For the same reason as above, we don't need it in our current
filesystems. A bunch of C ones (e.g., xfs, ext2, romfs, erofs) only
use the dentry to get the name and later call d_splice_alias(), so we
hide the name extraction and call to d_splice_alias() in the
"trampoline" function.

BTW, thank you Matthew and Al, I very much appreciate that you take
the time to look into and raise concerns.

Cheers,
-Wedson
Al Viro Nov. 8, 2023, 12:56 a.m. UTC | #4
On Tue, Nov 07, 2023 at 09:35:44PM -0300, Wedson Almeida Filho wrote:

> > While we are at it, lookup is also very much not a per-filesystem operation.
> > Take a look at e.g. procfs, for an obvious example...
> 
> The C api offers the greatest freedom: one could write a file system
> where each file has its own set of mapping_ops, inode_ops and
> file_ops; and while we could choose to replicate this freedom in Rust
> but we haven't.

Too bad.
 
> Mostly because we don't need it, and we've been repeatedly told (by
> Greg KH and others) not to introduce abstractions/bindings for
> anything for which there isn't a user. Besides being a longstanding
> rule in the kernel, they also say that they can't reasonably decide if
> the interfaces are good if they can't see the users.

The interfaces are *already* there.  If it's going to be a separate
set of operations for Rust and for the rest of the filesystems, we
have a major problem right there.

> The existing Rust users (tarfs and puzzlefs) only need a single
> lookup. And a quick grep (git grep \\\.lookup\\\> -- fs/) appears to
> show that the vast majority of C filesystems only have a single lookup
> as well. So we choose simplicity, knowing well that we may have to
> revisit it in the future if the needs change.
> 
> > Wait a minute... what in name of everything unholy is that thing doing tied
> > to inodes in the first place?
> 
> For the same reason as above, we don't need it in our current
> filesystems. A bunch of C ones (e.g., xfs, ext2, romfs, erofs) only
> use the dentry to get the name and later call d_splice_alias(), so we
> hide the name extraction and call to d_splice_alias() in the
> "trampoline" function.

What controls the lifecycle of that stuff from the Rust point of view?
Wedson Almeida Filho Nov. 8, 2023, 2:39 a.m. UTC | #5
On Tue, 7 Nov 2023 at 21:56, Al Viro <viro@zeniv.linux.org.uk> wrote:
>
> On Tue, Nov 07, 2023 at 09:35:44PM -0300, Wedson Almeida Filho wrote:
>
> > > While we are at it, lookup is also very much not a per-filesystem operation.
> > > Take a look at e.g. procfs, for an obvious example...
> >
> > The C api offers the greatest freedom: one could write a file system
> > where each file has its own set of mapping_ops, inode_ops and
> > file_ops; and while we could choose to replicate this freedom in Rust
> > but we haven't.
>
> Too bad.
>
> > Mostly because we don't need it, and we've been repeatedly told (by
> > Greg KH and others) not to introduce abstractions/bindings for
> > anything for which there isn't a user. Besides being a longstanding
> > rule in the kernel, they also say that they can't reasonably decide if
> > the interfaces are good if they can't see the users.
>
> The interfaces are *already* there.  If it's going to be a separate
> set of operations for Rust and for the rest of the filesystems, we
> have a major problem right there.

The interfaces will be different but compatible -- they boil down to
calls to/from C and follow all rules and requirements imposed by C.

We just use Rust's type system to encode more of the rules into the
interfaces so that the compiler will catch more bugs for us at compile
time (and avoid memory safety issues if developers stay away from
unsafe blocks). For example, if you want to attach non-zero-sized data
to inodes of a given filesystem, in Rust we have a generic type:

struct INodeWithData<T> {
    data: MaybeUninit<T>,
    inode: bindings::inode,
}

And we automatically implement alloc_inode() and destroy_inode() in
super_operations. And all inodes in callbacks are typed so that
developers never need to call container_of directly themselves. The
compiler will catch, at compile time, any type mismatches without
runtime cost.

Another example: instead of implementing functions then declaring
structs containing pointers to these functions (and potentially other
fields), in Rust we expose "traits" that developers need to implement.
Then we can control which functions are required/optional and allow
developers to logically group them, as well as declare constants and
related types (e.g., the additional struct, if any, to be allocated
along with an inode in INodeWithData above). But in the end, these get
translated (at compile time for const ops) into
file_operations/address_space_operations/inode_operations.

> > The existing Rust users (tarfs and puzzlefs) only need a single
> > lookup. And a quick grep (git grep \\\.lookup\\\> -- fs/) appears to
> > show that the vast majority of C filesystems only have a single lookup
> > as well. So we choose simplicity, knowing well that we may have to
> > revisit it in the future if the needs change.
> >
> > > Wait a minute... what in name of everything unholy is that thing doing tied
> > > to inodes in the first place?
> >
> > For the same reason as above, we don't need it in our current
> > filesystems. A bunch of C ones (e.g., xfs, ext2, romfs, erofs) only
> > use the dentry to get the name and later call d_splice_alias(), so we
> > hide the name extraction and call to d_splice_alias() in the
> > "trampoline" function.
>
> What controls the lifecycle of that stuff from the Rust point of view?

The same rules as C. inodes, for example, are ref-counted so while a
callback that has an inode as argument is inflight, we know it (the
inode) is referenced and we can just use it. If/when Rust code wants
to hold on to a pointer to it beyond a callback, it needs to increment
the refcount and release it when it's done. Here the type system also
helps us: it guarantees that pointers to ref-counted objects are never
dangling, if we ever try to hold on to a pointer without incrementing
the refcount, we get a compile-time error (no additional runtime
cost).

We also have a common interface for _all_ C ref-counted objects, so
instead of having to memorise that I should call ihold/iget for
inodes, folio_get/folio_put for folios,
get_task_struct/put_task_struct for tasks, etc., in Rust they're
simply ARef<INode>, ARef<Folio>, ARef<Task> with automatic increment
via clone() and decrement on destruction.

There are a couple of issues that I alluded to in the cover letter but
never actually wrote down, so I will describe them here to get your
thoughts:

First issue:
VFS conflates filesystem unregistration with module unload. The
description of unregister_filesystem() states:

 * Once this function has returned the &struct file_system_type structure
 * may be freed or reused.

When a filesystem is mounted, VFS calls get_filesystem() to presumably
prevent the filesystem from unregistering, and it calls
put_filesystem() when deactivating a super-block.

But get/put_filesystem() are implemented as module_get/put(). So this
works well if unregister_filesystem is only ever called when modules
are unloaded. It doesn't seem to help if it's called anywhere else
(e.g., on failure paths of module load).

Here's an example: init_f2fs_fs() calls init_inodecache() to allocate
an inode cache, then eventually calls register_filesystem(). Let's
suppose at this point, another CPU actually mounts an instance of an
f2fs fs. After register_filesystem() in init_f2fs_fs(), there is a
bunch of extra failure paths; let's suppose
f2fs_init_compress_mempool() fails. The exit path will call
unregister_filesystem() (which prevents new superblocks from being
created, but the existing superblocks continue to exist), then
eventually destroy_inodecache(), which frees the cache from which all
inodes of the existing superblock have been allocated and we have a
bunch of potential user-after-frees.

Granted that the module will not be unloaded immediately (it will wait
for all references, including the ones by get_filesystem() to go
away), so we won't have an issue with the callbacks being called to
unloaded memory. But if we recycle f2fs_fs_type, which
unregister_filesystem claims to be safe, we'll also have
user-after-frees there.

(Note that the example above doesn't require unload at all.)

I think we could fix this by having a different implementation of
get/put_filesystem() that keeps track of a count for filesystem usage
(in addition to avoiding module unload), and only completing
unregister_filesystem when it goes to zero. Would you be interested in
a patch for this?

Second issue:

Leaked inodes: if a filesystem leaks inodes, then after unregistration
most implementations will just free the kmemcache from which they
came, so future attempts to use these leaked inodes (it's possible
they've been stored in some list somewhere) will lead to
user-after-frees. Is there anything we can do improve this? Should we
prevent unregister_filesystem() from completing in such cases?

Thanks,
-Wedson
diff mbox series

Patch

diff --git a/rust/kernel/folio.rs b/rust/kernel/folio.rs
index ef8a08b97962..b7f80291b0e1 100644
--- a/rust/kernel/folio.rs
+++ b/rust/kernel/folio.rs
@@ -123,7 +123,6 @@  impl LockedFolio<'_> {
     /// Callers must ensure that the folio is valid and locked. Additionally, that the
     /// responsibility of unlocking is transferred to the new instance of [`LockedFolio`]. Lastly,
     /// that the returned [`LockedFolio`] doesn't outlive the refcount that keeps it alive.
-    #[allow(dead_code)]
     pub(crate) unsafe fn from_raw(folio: *const bindings::folio) -> Self {
         let ptr = folio.cast();
         // SAFETY: The safety requirements ensure that `folio` (from which `ptr` is derived) is
diff --git a/rust/kernel/fs.rs b/rust/kernel/fs.rs
index 681fef8e3af1..ee3dce87032b 100644
--- a/rust/kernel/fs.rs
+++ b/rust/kernel/fs.rs
@@ -8,7 +8,10 @@ 
 
 use crate::error::{code::*, from_result, to_result, Error, Result};
 use crate::types::{ARef, AlwaysRefCounted, Either, Opaque};
-use crate::{bindings, init::PinInit, str::CStr, time::Timespec, try_pin_init, ThisModule};
+use crate::{
+    bindings, folio::LockedFolio, init::PinInit, str::CStr, time::Timespec, try_pin_init,
+    ThisModule,
+};
 use core::{marker::PhantomData, marker::PhantomPinned, mem::ManuallyDrop, pin::Pin, ptr};
 use macros::{pin_data, pinned_drop};
 
@@ -36,6 +39,9 @@  pub trait FileSystem {
 
     /// Returns the inode corresponding to the directory entry with the given name.
     fn lookup(parent: &INode<Self>, name: &[u8]) -> Result<ARef<INode<Self>>>;
+
+    /// Reads the contents of the inode into the given folio.
+    fn read_folio(inode: &INode<Self>, folio: LockedFolio<'_>) -> Result;
 }
 
 /// The types of directory entries reported by [`FileSystem::read_dir`].
@@ -74,6 +80,7 @@  impl From<INodeType> for DirEntryType {
     fn from(value: INodeType) -> Self {
         match value {
             INodeType::Dir => DirEntryType::Dir,
+            INodeType::Reg => DirEntryType::Reg,
         }
     }
 }
@@ -232,6 +239,15 @@  pub fn init(self, params: INodeParams) -> Result<ARef<INode<T>>> {
                 inode.i_op = &Tables::<T>::DIR_INODE_OPERATIONS;
                 bindings::S_IFDIR
             }
+            INodeType::Reg => {
+                // SAFETY: `generic_ro_fops` never changes, it's safe to reference it.
+                inode.__bindgen_anon_3.i_fop = unsafe { &bindings::generic_ro_fops };
+                inode.i_data.a_ops = &Tables::<T>::FILE_ADDRESS_SPACE_OPERATIONS;
+
+                // SAFETY: The `i_mapping` pointer doesn't change and is valid.
+                unsafe { bindings::mapping_set_large_folios(inode.i_mapping) };
+                bindings::S_IFREG
+            }
         };
 
         inode.i_mode = (params.mode & 0o777) | u16::try_from(mode)?;
@@ -268,6 +284,9 @@  fn drop(&mut self) {
 pub enum INodeType {
     /// Directory type.
     Dir,
+
+    /// Regular file type.
+    Reg,
 }
 
 /// Required inode parameters.
@@ -588,6 +607,55 @@  extern "C" fn lookup_callback(
             },
         }
     }
+
+    const FILE_ADDRESS_SPACE_OPERATIONS: bindings::address_space_operations =
+        bindings::address_space_operations {
+            writepage: None,
+            read_folio: Some(Self::read_folio_callback),
+            writepages: None,
+            dirty_folio: None,
+            readahead: None,
+            write_begin: None,
+            write_end: None,
+            bmap: None,
+            invalidate_folio: None,
+            release_folio: None,
+            free_folio: None,
+            direct_IO: None,
+            migrate_folio: None,
+            launder_folio: None,
+            is_partially_uptodate: None,
+            is_dirty_writeback: None,
+            error_remove_page: None,
+            swap_activate: None,
+            swap_deactivate: None,
+            swap_rw: None,
+        };
+
+    extern "C" fn read_folio_callback(
+        _file: *mut bindings::file,
+        folio: *mut bindings::folio,
+    ) -> i32 {
+        from_result(|| {
+            // SAFETY: All pointers are valid and stable.
+            let inode = unsafe {
+                &*(*(*folio)
+                    .__bindgen_anon_1
+                    .page
+                    .__bindgen_anon_1
+                    .__bindgen_anon_1
+                    .mapping)
+                    .host
+                    .cast::<INode<T>>()
+            };
+
+            // SAFETY: The C contract guarantees that the folio is valid and locked, with ownership
+            // of the lock transferred to the callee (this function). The folio is also guaranteed
+            // not to outlive this function.
+            T::read_folio(inode, unsafe { LockedFolio::from_raw(folio) })?;
+            Ok(0)
+        })
+    }
 }
 
 /// Directory entry emitter.
@@ -673,7 +741,7 @@  fn init(module: &'static ThisModule) -> impl PinInit<Self, Error> {
 /// # mod module_fs_sample {
 /// use kernel::fs::{DirEmitter, INode, NewSuperBlock, SuperBlock, SuperParams};
 /// use kernel::prelude::*;
-/// use kernel::{c_str, fs, types::ARef};
+/// use kernel::{c_str, folio::LockedFolio, fs, types::ARef};
 ///
 /// kernel::module_fs! {
 ///     type: MyFs,
@@ -698,6 +766,9 @@  fn init(module: &'static ThisModule) -> impl PinInit<Self, Error> {
 ///     fn lookup(_: &INode<Self>, _: &[u8]) -> Result<ARef<INode<Self>>> {
 ///         todo!()
 ///     }
+///     fn read_folio(_: &INode<Self>, _: LockedFolio<'_>) -> Result {
+///         todo!()
+///     }
 /// }
 /// # }
 /// ```
diff --git a/samples/rust/rust_rofs.rs b/samples/rust/rust_rofs.rs
index 4cc8525884a9..ef651ad38185 100644
--- a/samples/rust/rust_rofs.rs
+++ b/samples/rust/rust_rofs.rs
@@ -6,7 +6,7 @@ 
     DirEmitter, INode, INodeParams, INodeType, NewSuperBlock, SuperBlock, SuperParams,
 };
 use kernel::prelude::*;
-use kernel::{c_str, fs, time::UNIX_EPOCH, types::ARef, types::Either};
+use kernel::{c_str, folio::LockedFolio, fs, time::UNIX_EPOCH, types::ARef, types::Either};
 
 kernel::module_fs! {
     type: RoFs,
@@ -20,6 +20,7 @@  struct Entry {
     name: &'static [u8],
     ino: u64,
     etype: INodeType,
+    contents: &'static [u8],
 }
 
 const ENTRIES: [Entry; 3] = [
@@ -27,16 +28,19 @@  struct Entry {
         name: b".",
         ino: 1,
         etype: INodeType::Dir,
+        contents: b"",
     },
     Entry {
         name: b"..",
         ino: 1,
         etype: INodeType::Dir,
+        contents: b"",
     },
     Entry {
-        name: b"subdir",
+        name: b"test.txt",
         ino: 2,
-        etype: INodeType::Dir,
+        etype: INodeType::Reg,
+        contents: b"hello\n",
     },
 ];
 
@@ -95,23 +99,48 @@  fn lookup(parent: &INode<Self>, name: &[u8]) -> Result<ARef<INode<Self>>> {
             return Err(ENOENT);
         }
 
-        match name {
-            b"subdir" => match parent.super_block().get_or_create_inode(2)? {
-                Either::Left(existing) => Ok(existing),
-                Either::Right(new) => new.init(INodeParams {
-                    typ: INodeType::Dir,
-                    mode: 0o555,
-                    size: 0,
-                    blocks: 1,
-                    nlink: 2,
-                    uid: 0,
-                    gid: 0,
-                    atime: UNIX_EPOCH,
-                    ctime: UNIX_EPOCH,
-                    mtime: UNIX_EPOCH,
-                }),
-            },
-            _ => Err(ENOENT),
+        for e in &ENTRIES {
+            if name == e.name {
+                return match parent.super_block().get_or_create_inode(e.ino)? {
+                    Either::Left(existing) => Ok(existing),
+                    Either::Right(new) => new.init(INodeParams {
+                        typ: e.etype,
+                        mode: 0o444,
+                        size: e.contents.len().try_into()?,
+                        blocks: 1,
+                        nlink: 1,
+                        uid: 0,
+                        gid: 0,
+                        atime: UNIX_EPOCH,
+                        ctime: UNIX_EPOCH,
+                        mtime: UNIX_EPOCH,
+                    }),
+                };
+            }
         }
+
+        Err(ENOENT)
+    }
+
+    fn read_folio(inode: &INode<Self>, mut folio: LockedFolio<'_>) -> Result {
+        let data = match inode.ino() {
+            2 => ENTRIES[2].contents,
+            _ => return Err(EINVAL),
+        };
+
+        let pos = usize::try_from(folio.pos()).unwrap_or(usize::MAX);
+        let copied = if pos >= data.len() {
+            0
+        } else {
+            let to_copy = core::cmp::min(data.len() - pos, folio.size());
+            folio.write(0, &data[pos..][..to_copy])?;
+            to_copy
+        };
+
+        folio.zero_out(copied, folio.size() - copied)?;
+        folio.mark_uptodate();
+        folio.flush_dcache();
+
+        Ok(())
     }
 }