Message ID | 20231018122518.128049-11-wedsonaf@gmail.com (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
Series | Rust abstractions for VFS | expand |
On Wed, Oct 18, 2023 at 09:25:09AM -0300, Wedson Almeida Filho wrote: > @@ -36,6 +39,9 @@ pub trait FileSystem { > > /// Returns the inode corresponding to the directory entry with the given name. > fn lookup(parent: &INode<Self>, name: &[u8]) -> Result<ARef<INode<Self>>>; > + > + /// Reads the contents of the inode into the given folio. > + fn read_folio(inode: &INode<Self>, folio: LockedFolio<'_>) -> Result; > } > This really shouldn't be a per-filesystem operation. We have operations split up into mapping_ops, inode_ops and file_ops for a reason. In this case, read_folio() can have a very different implementation for, eg, symlinks, directories and files. So we want to have different aops for each of symlinks, directories and files. We should maintain that separation for filesystems written in Rust too. Unless there's a good reason to change it, and then we should change it in C too.
On Tue, Nov 07, 2023 at 10:18:05PM +0000, Matthew Wilcox wrote: > On Wed, Oct 18, 2023 at 09:25:09AM -0300, Wedson Almeida Filho wrote: > > @@ -36,6 +39,9 @@ pub trait FileSystem { > > > > /// Returns the inode corresponding to the directory entry with the given name. > > fn lookup(parent: &INode<Self>, name: &[u8]) -> Result<ARef<INode<Self>>>; > > + > > + /// Reads the contents of the inode into the given folio. > > + fn read_folio(inode: &INode<Self>, folio: LockedFolio<'_>) -> Result; > > } > > > > This really shouldn't be a per-filesystem operation. We have operations > split up into mapping_ops, inode_ops and file_ops for a reason. In this > case, read_folio() can have a very different implementation for, eg, > symlinks, directories and files. So we want to have different aops > for each of symlinks, directories and files. We should maintain that > separation for filesystems written in Rust too. Unless there's a good > reason to change it, and then we should change it in C too. While we are at it, lookup is also very much not a per-filesystem operation. Take a look at e.g. procfs, for an obvious example... Wait a minute... what in name of everything unholy is that thing doing tied to inodes in the first place?
On Tue, 7 Nov 2023 at 19:22, Al Viro <viro@zeniv.linux.org.uk> wrote: > > On Tue, Nov 07, 2023 at 10:18:05PM +0000, Matthew Wilcox wrote: > > On Wed, Oct 18, 2023 at 09:25:09AM -0300, Wedson Almeida Filho wrote: > > > @@ -36,6 +39,9 @@ pub trait FileSystem { > > > > > > /// Returns the inode corresponding to the directory entry with the given name. > > > fn lookup(parent: &INode<Self>, name: &[u8]) -> Result<ARef<INode<Self>>>; > > > + > > > + /// Reads the contents of the inode into the given folio. > > > + fn read_folio(inode: &INode<Self>, folio: LockedFolio<'_>) -> Result; > > > } > > > > > > > This really shouldn't be a per-filesystem operation. We have operations > > split up into mapping_ops, inode_ops and file_ops for a reason. In this > > case, read_folio() can have a very different implementation for, eg, > > symlinks, directories and files. So we want to have different aops > > for each of symlinks, directories and files. We should maintain that > > separation for filesystems written in Rust too. Unless there's a good > > reason to change it, and then we should change it in C too. read_folio() is only called for regular files and symlinks. All other modes (directories, pipes, sockets, char devices, block devices) have their own read callbacks that don't involve read_folio(). For the filesystems that we have in Rust today, reading the contents of a symlink is the same as reading a file (i.e., the name of the link target is stored the same way as data in a file). For cases when this is different, read_folio() can of course just check the mode of the inode and take the appropriate path. This is also what a bunch of C file systems do. But you folks are the ones with most experience in file systems, if you think this isn't a good idea, we could use read_folio() only for regular files and introduce a function for reading symblinks, say read_symlink(). > While we are at it, lookup is also very much not a per-filesystem operation. > Take a look at e.g. procfs, for an obvious example... The C api offers the greatest freedom: one could write a file system where each file has its own set of mapping_ops, inode_ops and file_ops; and while we could choose to replicate this freedom in Rust but we haven't. Mostly because we don't need it, and we've been repeatedly told (by Greg KH and others) not to introduce abstractions/bindings for anything for which there isn't a user. Besides being a longstanding rule in the kernel, they also say that they can't reasonably decide if the interfaces are good if they can't see the users. The existing Rust users (tarfs and puzzlefs) only need a single lookup. And a quick grep (git grep \\\.lookup\\\> -- fs/) appears to show that the vast majority of C filesystems only have a single lookup as well. So we choose simplicity, knowing well that we may have to revisit it in the future if the needs change. > Wait a minute... what in name of everything unholy is that thing doing tied > to inodes in the first place? For the same reason as above, we don't need it in our current filesystems. A bunch of C ones (e.g., xfs, ext2, romfs, erofs) only use the dentry to get the name and later call d_splice_alias(), so we hide the name extraction and call to d_splice_alias() in the "trampoline" function. BTW, thank you Matthew and Al, I very much appreciate that you take the time to look into and raise concerns. Cheers, -Wedson
On Tue, Nov 07, 2023 at 09:35:44PM -0300, Wedson Almeida Filho wrote: > > While we are at it, lookup is also very much not a per-filesystem operation. > > Take a look at e.g. procfs, for an obvious example... > > The C api offers the greatest freedom: one could write a file system > where each file has its own set of mapping_ops, inode_ops and > file_ops; and while we could choose to replicate this freedom in Rust > but we haven't. Too bad. > Mostly because we don't need it, and we've been repeatedly told (by > Greg KH and others) not to introduce abstractions/bindings for > anything for which there isn't a user. Besides being a longstanding > rule in the kernel, they also say that they can't reasonably decide if > the interfaces are good if they can't see the users. The interfaces are *already* there. If it's going to be a separate set of operations for Rust and for the rest of the filesystems, we have a major problem right there. > The existing Rust users (tarfs and puzzlefs) only need a single > lookup. And a quick grep (git grep \\\.lookup\\\> -- fs/) appears to > show that the vast majority of C filesystems only have a single lookup > as well. So we choose simplicity, knowing well that we may have to > revisit it in the future if the needs change. > > > Wait a minute... what in name of everything unholy is that thing doing tied > > to inodes in the first place? > > For the same reason as above, we don't need it in our current > filesystems. A bunch of C ones (e.g., xfs, ext2, romfs, erofs) only > use the dentry to get the name and later call d_splice_alias(), so we > hide the name extraction and call to d_splice_alias() in the > "trampoline" function. What controls the lifecycle of that stuff from the Rust point of view?
On Tue, 7 Nov 2023 at 21:56, Al Viro <viro@zeniv.linux.org.uk> wrote: > > On Tue, Nov 07, 2023 at 09:35:44PM -0300, Wedson Almeida Filho wrote: > > > > While we are at it, lookup is also very much not a per-filesystem operation. > > > Take a look at e.g. procfs, for an obvious example... > > > > The C api offers the greatest freedom: one could write a file system > > where each file has its own set of mapping_ops, inode_ops and > > file_ops; and while we could choose to replicate this freedom in Rust > > but we haven't. > > Too bad. > > > Mostly because we don't need it, and we've been repeatedly told (by > > Greg KH and others) not to introduce abstractions/bindings for > > anything for which there isn't a user. Besides being a longstanding > > rule in the kernel, they also say that they can't reasonably decide if > > the interfaces are good if they can't see the users. > > The interfaces are *already* there. If it's going to be a separate > set of operations for Rust and for the rest of the filesystems, we > have a major problem right there. The interfaces will be different but compatible -- they boil down to calls to/from C and follow all rules and requirements imposed by C. We just use Rust's type system to encode more of the rules into the interfaces so that the compiler will catch more bugs for us at compile time (and avoid memory safety issues if developers stay away from unsafe blocks). For example, if you want to attach non-zero-sized data to inodes of a given filesystem, in Rust we have a generic type: struct INodeWithData<T> { data: MaybeUninit<T>, inode: bindings::inode, } And we automatically implement alloc_inode() and destroy_inode() in super_operations. And all inodes in callbacks are typed so that developers never need to call container_of directly themselves. The compiler will catch, at compile time, any type mismatches without runtime cost. Another example: instead of implementing functions then declaring structs containing pointers to these functions (and potentially other fields), in Rust we expose "traits" that developers need to implement. Then we can control which functions are required/optional and allow developers to logically group them, as well as declare constants and related types (e.g., the additional struct, if any, to be allocated along with an inode in INodeWithData above). But in the end, these get translated (at compile time for const ops) into file_operations/address_space_operations/inode_operations. > > The existing Rust users (tarfs and puzzlefs) only need a single > > lookup. And a quick grep (git grep \\\.lookup\\\> -- fs/) appears to > > show that the vast majority of C filesystems only have a single lookup > > as well. So we choose simplicity, knowing well that we may have to > > revisit it in the future if the needs change. > > > > > Wait a minute... what in name of everything unholy is that thing doing tied > > > to inodes in the first place? > > > > For the same reason as above, we don't need it in our current > > filesystems. A bunch of C ones (e.g., xfs, ext2, romfs, erofs) only > > use the dentry to get the name and later call d_splice_alias(), so we > > hide the name extraction and call to d_splice_alias() in the > > "trampoline" function. > > What controls the lifecycle of that stuff from the Rust point of view? The same rules as C. inodes, for example, are ref-counted so while a callback that has an inode as argument is inflight, we know it (the inode) is referenced and we can just use it. If/when Rust code wants to hold on to a pointer to it beyond a callback, it needs to increment the refcount and release it when it's done. Here the type system also helps us: it guarantees that pointers to ref-counted objects are never dangling, if we ever try to hold on to a pointer without incrementing the refcount, we get a compile-time error (no additional runtime cost). We also have a common interface for _all_ C ref-counted objects, so instead of having to memorise that I should call ihold/iget for inodes, folio_get/folio_put for folios, get_task_struct/put_task_struct for tasks, etc., in Rust they're simply ARef<INode>, ARef<Folio>, ARef<Task> with automatic increment via clone() and decrement on destruction. There are a couple of issues that I alluded to in the cover letter but never actually wrote down, so I will describe them here to get your thoughts: First issue: VFS conflates filesystem unregistration with module unload. The description of unregister_filesystem() states: * Once this function has returned the &struct file_system_type structure * may be freed or reused. When a filesystem is mounted, VFS calls get_filesystem() to presumably prevent the filesystem from unregistering, and it calls put_filesystem() when deactivating a super-block. But get/put_filesystem() are implemented as module_get/put(). So this works well if unregister_filesystem is only ever called when modules are unloaded. It doesn't seem to help if it's called anywhere else (e.g., on failure paths of module load). Here's an example: init_f2fs_fs() calls init_inodecache() to allocate an inode cache, then eventually calls register_filesystem(). Let's suppose at this point, another CPU actually mounts an instance of an f2fs fs. After register_filesystem() in init_f2fs_fs(), there is a bunch of extra failure paths; let's suppose f2fs_init_compress_mempool() fails. The exit path will call unregister_filesystem() (which prevents new superblocks from being created, but the existing superblocks continue to exist), then eventually destroy_inodecache(), which frees the cache from which all inodes of the existing superblock have been allocated and we have a bunch of potential user-after-frees. Granted that the module will not be unloaded immediately (it will wait for all references, including the ones by get_filesystem() to go away), so we won't have an issue with the callbacks being called to unloaded memory. But if we recycle f2fs_fs_type, which unregister_filesystem claims to be safe, we'll also have user-after-frees there. (Note that the example above doesn't require unload at all.) I think we could fix this by having a different implementation of get/put_filesystem() that keeps track of a count for filesystem usage (in addition to avoiding module unload), and only completing unregister_filesystem when it goes to zero. Would you be interested in a patch for this? Second issue: Leaked inodes: if a filesystem leaks inodes, then after unregistration most implementations will just free the kmemcache from which they came, so future attempts to use these leaked inodes (it's possible they've been stored in some list somewhere) will lead to user-after-frees. Is there anything we can do improve this? Should we prevent unregister_filesystem() from completing in such cases? Thanks, -Wedson
diff --git a/rust/kernel/folio.rs b/rust/kernel/folio.rs index ef8a08b97962..b7f80291b0e1 100644 --- a/rust/kernel/folio.rs +++ b/rust/kernel/folio.rs @@ -123,7 +123,6 @@ impl LockedFolio<'_> { /// Callers must ensure that the folio is valid and locked. Additionally, that the /// responsibility of unlocking is transferred to the new instance of [`LockedFolio`]. Lastly, /// that the returned [`LockedFolio`] doesn't outlive the refcount that keeps it alive. - #[allow(dead_code)] pub(crate) unsafe fn from_raw(folio: *const bindings::folio) -> Self { let ptr = folio.cast(); // SAFETY: The safety requirements ensure that `folio` (from which `ptr` is derived) is diff --git a/rust/kernel/fs.rs b/rust/kernel/fs.rs index 681fef8e3af1..ee3dce87032b 100644 --- a/rust/kernel/fs.rs +++ b/rust/kernel/fs.rs @@ -8,7 +8,10 @@ use crate::error::{code::*, from_result, to_result, Error, Result}; use crate::types::{ARef, AlwaysRefCounted, Either, Opaque}; -use crate::{bindings, init::PinInit, str::CStr, time::Timespec, try_pin_init, ThisModule}; +use crate::{ + bindings, folio::LockedFolio, init::PinInit, str::CStr, time::Timespec, try_pin_init, + ThisModule, +}; use core::{marker::PhantomData, marker::PhantomPinned, mem::ManuallyDrop, pin::Pin, ptr}; use macros::{pin_data, pinned_drop}; @@ -36,6 +39,9 @@ pub trait FileSystem { /// Returns the inode corresponding to the directory entry with the given name. fn lookup(parent: &INode<Self>, name: &[u8]) -> Result<ARef<INode<Self>>>; + + /// Reads the contents of the inode into the given folio. + fn read_folio(inode: &INode<Self>, folio: LockedFolio<'_>) -> Result; } /// The types of directory entries reported by [`FileSystem::read_dir`]. @@ -74,6 +80,7 @@ impl From<INodeType> for DirEntryType { fn from(value: INodeType) -> Self { match value { INodeType::Dir => DirEntryType::Dir, + INodeType::Reg => DirEntryType::Reg, } } } @@ -232,6 +239,15 @@ pub fn init(self, params: INodeParams) -> Result<ARef<INode<T>>> { inode.i_op = &Tables::<T>::DIR_INODE_OPERATIONS; bindings::S_IFDIR } + INodeType::Reg => { + // SAFETY: `generic_ro_fops` never changes, it's safe to reference it. + inode.__bindgen_anon_3.i_fop = unsafe { &bindings::generic_ro_fops }; + inode.i_data.a_ops = &Tables::<T>::FILE_ADDRESS_SPACE_OPERATIONS; + + // SAFETY: The `i_mapping` pointer doesn't change and is valid. + unsafe { bindings::mapping_set_large_folios(inode.i_mapping) }; + bindings::S_IFREG + } }; inode.i_mode = (params.mode & 0o777) | u16::try_from(mode)?; @@ -268,6 +284,9 @@ fn drop(&mut self) { pub enum INodeType { /// Directory type. Dir, + + /// Regular file type. + Reg, } /// Required inode parameters. @@ -588,6 +607,55 @@ extern "C" fn lookup_callback( }, } } + + const FILE_ADDRESS_SPACE_OPERATIONS: bindings::address_space_operations = + bindings::address_space_operations { + writepage: None, + read_folio: Some(Self::read_folio_callback), + writepages: None, + dirty_folio: None, + readahead: None, + write_begin: None, + write_end: None, + bmap: None, + invalidate_folio: None, + release_folio: None, + free_folio: None, + direct_IO: None, + migrate_folio: None, + launder_folio: None, + is_partially_uptodate: None, + is_dirty_writeback: None, + error_remove_page: None, + swap_activate: None, + swap_deactivate: None, + swap_rw: None, + }; + + extern "C" fn read_folio_callback( + _file: *mut bindings::file, + folio: *mut bindings::folio, + ) -> i32 { + from_result(|| { + // SAFETY: All pointers are valid and stable. + let inode = unsafe { + &*(*(*folio) + .__bindgen_anon_1 + .page + .__bindgen_anon_1 + .__bindgen_anon_1 + .mapping) + .host + .cast::<INode<T>>() + }; + + // SAFETY: The C contract guarantees that the folio is valid and locked, with ownership + // of the lock transferred to the callee (this function). The folio is also guaranteed + // not to outlive this function. + T::read_folio(inode, unsafe { LockedFolio::from_raw(folio) })?; + Ok(0) + }) + } } /// Directory entry emitter. @@ -673,7 +741,7 @@ fn init(module: &'static ThisModule) -> impl PinInit<Self, Error> { /// # mod module_fs_sample { /// use kernel::fs::{DirEmitter, INode, NewSuperBlock, SuperBlock, SuperParams}; /// use kernel::prelude::*; -/// use kernel::{c_str, fs, types::ARef}; +/// use kernel::{c_str, folio::LockedFolio, fs, types::ARef}; /// /// kernel::module_fs! { /// type: MyFs, @@ -698,6 +766,9 @@ fn init(module: &'static ThisModule) -> impl PinInit<Self, Error> { /// fn lookup(_: &INode<Self>, _: &[u8]) -> Result<ARef<INode<Self>>> { /// todo!() /// } +/// fn read_folio(_: &INode<Self>, _: LockedFolio<'_>) -> Result { +/// todo!() +/// } /// } /// # } /// ``` diff --git a/samples/rust/rust_rofs.rs b/samples/rust/rust_rofs.rs index 4cc8525884a9..ef651ad38185 100644 --- a/samples/rust/rust_rofs.rs +++ b/samples/rust/rust_rofs.rs @@ -6,7 +6,7 @@ DirEmitter, INode, INodeParams, INodeType, NewSuperBlock, SuperBlock, SuperParams, }; use kernel::prelude::*; -use kernel::{c_str, fs, time::UNIX_EPOCH, types::ARef, types::Either}; +use kernel::{c_str, folio::LockedFolio, fs, time::UNIX_EPOCH, types::ARef, types::Either}; kernel::module_fs! { type: RoFs, @@ -20,6 +20,7 @@ struct Entry { name: &'static [u8], ino: u64, etype: INodeType, + contents: &'static [u8], } const ENTRIES: [Entry; 3] = [ @@ -27,16 +28,19 @@ struct Entry { name: b".", ino: 1, etype: INodeType::Dir, + contents: b"", }, Entry { name: b"..", ino: 1, etype: INodeType::Dir, + contents: b"", }, Entry { - name: b"subdir", + name: b"test.txt", ino: 2, - etype: INodeType::Dir, + etype: INodeType::Reg, + contents: b"hello\n", }, ]; @@ -95,23 +99,48 @@ fn lookup(parent: &INode<Self>, name: &[u8]) -> Result<ARef<INode<Self>>> { return Err(ENOENT); } - match name { - b"subdir" => match parent.super_block().get_or_create_inode(2)? { - Either::Left(existing) => Ok(existing), - Either::Right(new) => new.init(INodeParams { - typ: INodeType::Dir, - mode: 0o555, - size: 0, - blocks: 1, - nlink: 2, - uid: 0, - gid: 0, - atime: UNIX_EPOCH, - ctime: UNIX_EPOCH, - mtime: UNIX_EPOCH, - }), - }, - _ => Err(ENOENT), + for e in &ENTRIES { + if name == e.name { + return match parent.super_block().get_or_create_inode(e.ino)? { + Either::Left(existing) => Ok(existing), + Either::Right(new) => new.init(INodeParams { + typ: e.etype, + mode: 0o444, + size: e.contents.len().try_into()?, + blocks: 1, + nlink: 1, + uid: 0, + gid: 0, + atime: UNIX_EPOCH, + ctime: UNIX_EPOCH, + mtime: UNIX_EPOCH, + }), + }; + } } + + Err(ENOENT) + } + + fn read_folio(inode: &INode<Self>, mut folio: LockedFolio<'_>) -> Result { + let data = match inode.ino() { + 2 => ENTRIES[2].contents, + _ => return Err(EINVAL), + }; + + let pos = usize::try_from(folio.pos()).unwrap_or(usize::MAX); + let copied = if pos >= data.len() { + 0 + } else { + let to_copy = core::cmp::min(data.len() - pos, folio.size()); + folio.write(0, &data[pos..][..to_copy])?; + to_copy + }; + + folio.zero_out(copied, folio.size() - copied)?; + folio.mark_uptodate(); + folio.flush_dcache(); + + Ok(()) } }