diff mbox series

[RFC,09/19] rust: folio: introduce basic support for folios

Message ID 20231018122518.128049-10-wedsonaf@gmail.com (mailing list archive)
State New, archived
Headers show
Series Rust abstractions for VFS | expand

Commit Message

Wedson Almeida Filho Oct. 18, 2023, 12:25 p.m. UTC
From: Wedson Almeida Filho <walmeida@microsoft.com>

Allow Rust file systems to handle ref-counted folios.

Provide the minimum needed to implement `read_folio` (part of `struct
address_space_operations`) in read-only file systems and to read
uncached blocks.

Signed-off-by: Wedson Almeida Filho <walmeida@microsoft.com>
---
 rust/bindings/bindings_helper.h |   3 +
 rust/bindings/lib.rs            |   2 +
 rust/helpers.c                  |  81 ++++++++++++
 rust/kernel/folio.rs            | 215 ++++++++++++++++++++++++++++++++
 rust/kernel/lib.rs              |   1 +
 5 files changed, 302 insertions(+)
 create mode 100644 rust/kernel/folio.rs

Comments

Matthew Wilcox Oct. 18, 2023, 5:17 p.m. UTC | #1
On Wed, Oct 18, 2023 at 09:25:08AM -0300, Wedson Almeida Filho wrote:
> +void *rust_helper_kmap(struct page *page)
> +{
> +	return kmap(page);
> +}
> +EXPORT_SYMBOL_GPL(rust_helper_kmap);
> +
> +void rust_helper_kunmap(struct page *page)
> +{
> +	kunmap(page);
> +}
> +EXPORT_SYMBOL_GPL(rust_helper_kunmap);

I'm not thrilled by exposing kmap()/kunmap() to Rust code.  The vast
majority of code really only needs kmap_local_*() / kunmap_local().
Can you elaborate on why you need the old kmap() in new Rust code?

> +void rust_helper_folio_set_error(struct folio *folio)
> +{
> +	folio_set_error(folio);
> +}
> +EXPORT_SYMBOL_GPL(rust_helper_folio_set_error);

I'm trying to get rid of the error flag.  Can you share the situations
in which you've needed the error flag?  Or is it just copying existing
practices?

> +    /// Returns the byte position of this folio in its file.
> +    pub fn pos(&self) -> i64 {
> +        // SAFETY: The folio is valid because the shared reference implies a non-zero refcount.
> +        unsafe { bindings::folio_pos(self.0.get()) }
> +    }

I think it's a mistake to make file positions an i64.  I estimate 64
bits will not be enough by 2035-2040.  We should probably have a numeric
type which is i64 on 32-bit and isize on other CPUs (I also project
64-bit pointers will be insufficient by 2035-2040 and so we will have
128-bit pointers around the same time, so we're not going to need i128
file offsets with i64 pointers).

> +/// A [`Folio`] that has a single reference to it.
> +pub struct UniqueFolio(pub(crate) ARef<Folio>);

How do we know it only has a single reference?  Do you mean "has at
least one reference"?  Or am I confusing Rust's notion of a reference
with Linux's notion of a reference?

> +impl UniqueFolio {
> +    /// Maps the contents of a folio page into a slice.
> +    pub fn map_page(&self, page_index: usize) -> Result<MapGuard<'_>> {
> +        if page_index >= self.0.size() / bindings::PAGE_SIZE {
> +            return Err(EDOM);
> +        }
> +
> +        // SAFETY: We just checked that the index is within bounds of the folio.
> +        let page = unsafe { bindings::folio_page(self.0 .0.get(), page_index) };
> +
> +        // SAFETY: `page` is valid because it was returned by `folio_page` above.
> +        let ptr = unsafe { bindings::kmap(page) };

Surely this can be:

	   let ptr = unsafe { bindings::kmap_local_folio(folio, page_index * PAGE_SIZE) };

> +        // SAFETY: We just mapped `ptr`, so it's valid for read.
> +        let data = unsafe { core::slice::from_raw_parts(ptr.cast::<u8>(), bindings::PAGE_SIZE) };

Can we hide away the "if this isn't a HIGHMEM system, this maps to the
end of the folio, but if it is, it only maps to the end of the page"
problem here?
Wedson Almeida Filho Oct. 18, 2023, 6:32 p.m. UTC | #2
On Wed, 18 Oct 2023 at 14:17, Matthew Wilcox <willy@infradead.org> wrote:
>
> On Wed, Oct 18, 2023 at 09:25:08AM -0300, Wedson Almeida Filho wrote:
> > +void *rust_helper_kmap(struct page *page)
> > +{
> > +     return kmap(page);
> > +}
> > +EXPORT_SYMBOL_GPL(rust_helper_kmap);
> > +
> > +void rust_helper_kunmap(struct page *page)
> > +{
> > +     kunmap(page);
> > +}
> > +EXPORT_SYMBOL_GPL(rust_helper_kunmap);
>
> I'm not thrilled by exposing kmap()/kunmap() to Rust code.  The vast
> majority of code really only needs kmap_local_*() / kunmap_local().
> Can you elaborate on why you need the old kmap() in new Rust code?

The difficulty we have with kmap_local_*() has to do with the
requirement that maps and unmaps need to be nested neatly. For
example:

let a = folio1.map_local(...);
let b = folio2.map_local(...);
// Do something with `a` and `b`.
drop(a);
drop(b);

The code obviously violates the requirements.

One way to enforce the rule in Rust is to use closures, so the code
above would be:

folio1.map_local(..., |a| {
    folio2.map_local(..., |b| {
        // Do something with `a` and `b`.
    })
})

It isn't as ergonomic as the first option, but it allows us to satisfy
the nesting requirement.

Any chance we can relax that requirement?

(If not, and we really want to get rid of the non-local function, we
can fall back to the closure-based implementation. In fact, you'll
find that in this patch I already do this for a private function that
is used when writing into the folio; we could just make a version of
it public.)
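
A rough sketch of what such a public closure-based helper could look
like (the name, signature and body are illustrative, not what the
patch currently exposes):

impl Folio {
    /// Maps the page containing `offset` and passes the mapped slice to `f`.
    pub fn with_map<T>(&self, offset: usize, f: impl FnOnce(&[u8]) -> T) -> Result<T> {
        if offset >= self.size() {
            return Err(EDOM);
        }

        // SAFETY: `offset` is within bounds and the shared reference implies
        // a non-zero refcount, so the folio stays valid while mapped.
        let ptr = unsafe { bindings::kmap_local_folio(self.0.get(), offset) };

        // Cap the slice at the end of the page so this also works on HIGHMEM.
        let len = min(bindings::PAGE_SIZE - offset % bindings::PAGE_SIZE, self.size() - offset);

        // SAFETY: `ptr` was just mapped and is valid for `len` bytes of reads.
        let data = unsafe { core::slice::from_raw_parts(ptr.cast::<u8>(), len) };

        let ret = f(data);

        // SAFETY: `ptr` is the address returned by the matching map call
        // above, and no other local mapping was created in between, so the
        // map/unmap pair nests correctly by construction.
        unsafe { bindings::kunmap_local(ptr) };
        Ok(ret)
    }
}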

> > +void rust_helper_folio_set_error(struct folio *folio)
> > +{
> > +     folio_set_error(folio);
> > +}
> > +EXPORT_SYMBOL_GPL(rust_helper_folio_set_error);
>
> I'm trying to get rid of the error flag.  Can you share the situations
> in which you've needed the error flag?  Or is it just copying existing
> practices?

I'm just mimicking C code. Happy to remove it.

> > +    /// Returns the byte position of this folio in its file.
> > +    pub fn pos(&self) -> i64 {
> > +        // SAFETY: The folio is valid because the shared reference implies a non-zero refcount.
> > +        unsafe { bindings::folio_pos(self.0.get()) }
> > +    }
>
> I think it's a mistake to make file positions an i64.  I estimate 64
> bits will not be enough by 2035-2040.  We should probably have a numeric
> type which is i64 on 32-bit and isize on other CPUs (I also project
> 64-bit pointers will be insufficient by 2035-2040 and so we will have
> 128-bit pointers around the same time, so we're not going to need i128
> file offsets with i64 pointers).

I'm also just mimicking C here -- we just don't have a type that has
the properties you describe. I'm happy to switch once we have it, in
fact, Miguel has plans that I believe align well with what you want.
I'm not sure if he has already contacted you about it yet though.

> > +/// A [`Folio`] that has a single reference to it.
> > +pub struct UniqueFolio(pub(crate) ARef<Folio>);
>
> How do we know it only has a single reference?  Do you mean "has at
> least one reference"?  Or am I confusing Rust's notion of a reference
> with Linux's notion of a reference?

Instances of `UniqueFolio` are only produced by calls to
`folio_alloc`. They encode the fact that it's safe for us to map the
folio and know that there aren't any concurrent threads/CPUs doing the
same to the same folio.

Naturally, if you were to increment the refcount on this folio and share it
with other threads/CPUs, it's no longer unique. So we don't allow it.

This is only used when using a synchronous bio to read blocks from a
block device while setting up a new superblock, in particular, to read
the superblock itself.

>
> > +impl UniqueFolio {
> > +    /// Maps the contents of a folio page into a slice.
> > +    pub fn map_page(&self, page_index: usize) -> Result<MapGuard<'_>> {
> > +        if page_index >= self.0.size() / bindings::PAGE_SIZE {
> > +            return Err(EDOM);
> > +        }
> > +
> > +        // SAFETY: We just checked that the index is within bounds of the folio.
> > +        let page = unsafe { bindings::folio_page(self.0 .0.get(), page_index) };
> > +
> > +        // SAFETY: `page` is valid because it was returned by `folio_page` above.
> > +        let ptr = unsafe { bindings::kmap(page) };
>
> Surely this can be:
>
>            let ptr = unsafe { bindings::kmap_local_folio(folio, page_index * PAGE_SIZE) };

The problem is the unmap path that can happen at arbitrary order in
Rust, see my comment above.

>
> > +        // SAFETY: We just mapped `ptr`, so it's valid for read.
> > +        let data = unsafe { core::slice::from_raw_parts(ptr.cast::<u8>(), bindings::PAGE_SIZE) };
>
> Can we hide away the "if this isn't a HIGHMEM system, this maps to the
> end of the folio, but if it is, it only maps to the end of the page"
> problem here?

Do you have ideas on what this might look like? (Don't worry about
Rust, just express it in some pseudo-C and we'll see if you can
express it in Rust.)

My approach here was to be conservative: since the common denominator
was "maps to the end of the page", that's what I have.

One possible way to do it in Rust would be an "Iterator" -- on
HIGHMEM systems it would return one item per page, otherwise it would
return a single item. (We have something similar for reading arbitrary
ranges of a block device: the read is broken up into several chunks, so
we return an iterator.)
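
A rough sketch of that shape, assuming a byte-offset `map()` that
returns at most one page's worth of data per call (the names are
illustrative and untested):

pub struct FolioChunks<'a> {
    folio: &'a UniqueFolio,
    offset: usize,
}

impl<'a> Iterator for FolioChunks<'a> {
    type Item = Result<MapGuard<'a>>;

    fn next(&mut self) -> Option<Self::Item> {
        if self.offset >= self.folio.0.size() {
            return None;
        }

        // On HIGHMEM the guard covers a single page; otherwise it could
        // cover the rest of the folio, so the loop body runs only once.
        let guard = self.folio.map(self.offset);
        match &guard {
            Ok(g) => self.offset += g.len(),
            // Report the error once, then stop iterating.
            Err(_) => self.offset = self.folio.0.size(),
        }
        Some(guard)
    }
}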
Matthew Wilcox Oct. 18, 2023, 7:21 p.m. UTC | #3
On Wed, Oct 18, 2023 at 03:32:36PM -0300, Wedson Almeida Filho wrote:
> On Wed, 18 Oct 2023 at 14:17, Matthew Wilcox <willy@infradead.org> wrote:
> >
> > On Wed, Oct 18, 2023 at 09:25:08AM -0300, Wedson Almeida Filho wrote:
> > > +void *rust_helper_kmap(struct page *page)
> > > +{
> > > +     return kmap(page);
> > > +}
> > > +EXPORT_SYMBOL_GPL(rust_helper_kmap);
> > > +
> > > +void rust_helper_kunmap(struct page *page)
> > > +{
> > > +     kunmap(page);
> > > +}
> > > +EXPORT_SYMBOL_GPL(rust_helper_kunmap);
> >
> > I'm not thrilled by exposing kmap()/kunmap() to Rust code.  The vast
> > majority of code really only needs kmap_local_*() / kunmap_local().
> > Can you elaborate on why you need the old kmap() in new Rust code?
> 
> The difficulty we have with kmap_local_*() has to do with the
> requirement that maps and unmaps need to be nested neatly. For
> example:
> 
> let a = folio1.map_local(...);
> let b = folio2.map_local(...);
> // Do something with `a` and `b`.
> drop(a);
> drop(b);
> 
> The code obviously violates the requirements.

Is that the only problem, or are there situations where we might try
to do something like:

a = folio1.map.local()
b = folio2.map.local()
drop(a)
a = folio3.map.local()
drop(b)
b = folio4.map.local()
drop (a)
a = folio5.map.local()
...

> One way to enforce the rule in Rust is to use closures, so the code
> above would be:
> 
> folio1.map_local(..., |a| {
>     folio2.map_local(..., |b| {
>         // Do something with `a` and `b`.
>     })
> })
> 
> It isn't as ergonomic as the first option, but it allows us to satisfy
> the nesting requirement.
> 
> Any chance we can relax that requirement?

It's possible.  Here's an untested patch that _only_ supports
"map a, map b, unmap a, unmap b".  If we need more, well, I guess
we can scan the entire array, both at map & unmap in order to
unmap pages.

diff --git a/mm/highmem.c b/mm/highmem.c
index e19269093a93..778a22ca1796 100644
--- a/mm/highmem.c
+++ b/mm/highmem.c
@@ -586,7 +586,7 @@ void kunmap_local_indexed(const void *vaddr)
 {
 	unsigned long addr = (unsigned long) vaddr & PAGE_MASK;
 	pte_t *kmap_pte;
-	int idx;
+	int idx, local_idx;
 
 	if (addr < __fix_to_virt(FIX_KMAP_END) ||
 	    addr > __fix_to_virt(FIX_KMAP_BEGIN)) {
@@ -607,15 +607,25 @@ void kunmap_local_indexed(const void *vaddr)
 	}
 
 	preempt_disable();
-	idx = arch_kmap_local_unmap_idx(kmap_local_idx(), addr);
+	local_idx = kmap_local_idx();
+	idx = arch_kmap_local_unmap_idx(local_idx, addr);
+	if (addr != __fix_to_virt(FIX_KMAP_BEGIN + idx) && local_idx > 0) {
+		idx--;
+		local_idx--;
+	}
 	WARN_ON_ONCE(addr != __fix_to_virt(FIX_KMAP_BEGIN + idx));
 
 	kmap_pte = kmap_get_pte(addr, idx);
 	arch_kmap_local_pre_unmap(addr);
 	pte_clear(&init_mm, addr, kmap_pte);
 	arch_kmap_local_post_unmap(addr);
-	current->kmap_ctrl.pteval[kmap_local_idx()] = __pte(0);
-	kmap_local_idx_pop();
+	current->kmap_ctrl.pteval[local_idx] = __pte(0);
+	if (local_idx == kmap_local_idx()) {
+		kmap_local_idx_pop();
+		if (local_idx > 0 &&
+		    pte_none(current->kmap_ctrl.pteval[local_idx - 1]))
+			kmap_local_idx_pop();
+	}
 	preempt_enable();
 	migrate_enable();
 }
@@ -648,7 +658,7 @@ void __kmap_local_sched_out(void)
 			WARN_ON_ONCE(pte_val(pteval) != 0);
 			continue;
 		}
-		if (WARN_ON_ONCE(pte_none(pteval)))
+		if (pte_none(pteval))
 			continue;
 
 		/*
@@ -685,7 +695,7 @@ void __kmap_local_sched_in(void)
 			WARN_ON_ONCE(pte_val(pteval) != 0);
 			continue;
 		}
-		if (WARN_ON_ONCE(pte_none(pteval)))
+		if (pte_none(pteval))
 			continue;
 
 		/* See comment in __kmap_local_sched_out() */

> > > +void rust_helper_folio_set_error(struct folio *folio)
> > > +{
> > > +     folio_set_error(folio);
> > > +}
> > > +EXPORT_SYMBOL_GPL(rust_helper_folio_set_error);
> >
> > I'm trying to get rid of the error flag.  Can you share the situations
> > in which you've needed the error flag?  Or is it just copying existing
> > practices?
> 
> I'm just mimicking C code. Happy to remove it.

Great, thanks!

> > > +    /// Returns the byte position of this folio in its file.
> > > +    pub fn pos(&self) -> i64 {
> > > +        // SAFETY: The folio is valid because the shared reference implies a non-zero refcount.
> > > +        unsafe { bindings::folio_pos(self.0.get()) }
> > > +    }
> >
> > I think it's a mistake to make file positions an i64.  I estimate 64
> > bits will not be enough by 2035-2040.  We should probably have a numeric
> > type which is i64 on 32-bit and isize on other CPUs (I also project
> > 64-bit pointers will be insufficient by 2035-2040 and so we will have
> > 128-bit pointers around the same time, so we're not going to need i128
> > file offsets with i64 pointers).
> 
> I'm also just mimicking C here -- we just don't have a type that has
> the properties you describe. I'm happy to switch once we have it, in
> fact, Miguel has plans that I believe align well with what you want.
> I'm not sure if he has already contacted you about it yet though.

No, I haven't heard about plans for an off_t equivalent.  Perhaps you
could just do what the crates.io libc does?

https://docs.rs/libc/0.2.149/libc/type.off_t.html
pub type off_t = i64;

and then there's only one place to change to be i128 when the time comes.
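
Something along those lines, perhaps (the alias name below is
illustrative):

/// Byte offset in a file, mirroring the C `loff_t`.
pub type Offset = i64;

impl Folio {
    /// Returns the byte position of this folio in its file.
    pub fn pos(&self) -> Offset {
        // SAFETY: The folio is valid because the shared reference implies a
        // non-zero refcount.
        unsafe { bindings::folio_pos(self.0.get()) }
    }
}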

> > > +/// A [`Folio`] that has a single reference to it.
> > > +pub struct UniqueFolio(pub(crate) ARef<Folio>);
> >
> > How do we know it only has a single reference?  Do you mean "has at
> > least one reference"?  Or am I confusing Rust's notion of a reference
> > with Linux's notion of a reference?
> 
> Instances of `UniqueFolio` are only produced by calls to
> `folio_alloc`. They encode the fact that it's safe for us to map the
> folio and know that there aren't any concurrent threads/CPUs doing the
> same to the same folio.

Mmm ... it's always safe to map a folio, even if other people have a
reference to it.  And Linux can make temporary spurious references to
folios appear, although those should be noticed by the other party and
released again before they access the contents of the folio.  So from
the point of view of being memory-safe, you can ignore them, but you
might see the refcount of the folio as >1, even if you just got the
folio back from the allocator.

> > > +impl UniqueFolio {
> > > +    /// Maps the contents of a folio page into a slice.
> > > +    pub fn map_page(&self, page_index: usize) -> Result<MapGuard<'_>> {
> > > +        if page_index >= self.0.size() / bindings::PAGE_SIZE {
> > > +            return Err(EDOM);
> > > +        }
> > > +
> > > +        // SAFETY: We just checked that the index is within bounds of the folio.
> > > +        let page = unsafe { bindings::folio_page(self.0 .0.get(), page_index) };
> > > +
> > > +        // SAFETY: `page` is valid because it was returned by `folio_page` above.
> > > +        let ptr = unsafe { bindings::kmap(page) };
> >
> > Surely this can be:
> >
> >            let ptr = unsafe { bindings::kmap_local_folio(folio, page_index * PAGE_SIZE) };
> 
> The problem is the unmap path that can happen at arbitrary order in
> Rust, see my comment above.
> 
> >
> > > +        // SAFETY: We just mapped `ptr`, so it's valid for read.
> > > +        let data = unsafe { core::slice::from_raw_parts(ptr.cast::<u8>(), bindings::PAGE_SIZE) };
> >
> > Can we hide away the "if this isn't a HIGHMEM system, this maps to the
> > end of the folio, but if it is, it only maps to the end of the page"
> > problem here?
> 
> Do you have ideas on how this might look like? (Don't worry about
> Rust, just express it in some pseudo-C and we'll see if you can
> express it in Rust.)

On systems without HIGHMEM, kmap() is a no-op.  So we could do something
like this:

	let data = unsafe { core::slice::from_raw_parts(ptr.cast::<u8>(),
		if (folio_test_highmem(folio))
			bindings::PAGE_SIZE
		else
			folio_size(folio) - page_idx * PAGE_SIZE) }

... modulo whatever the correct syntax is in Rust.
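
A rough Rust rendering of the above, in the context of map_page()
(assuming folio_test_highmem() is reachable from Rust via a helper;
untested):

let len = if unsafe { bindings::folio_test_highmem(self.0 .0.get()) } {
    bindings::PAGE_SIZE
} else {
    self.0.size() - page_index * bindings::PAGE_SIZE
};
// SAFETY: We just mapped `ptr`, and the mapping is valid for `len` bytes.
let data = unsafe { core::slice::from_raw_parts(ptr.cast::<u8>(), len) };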

Something I forgot to mention was that I found it more useful to express
"map this chunk of a folio" in bytes rather than pages.  You might find
the same, in which case it's just folio.map(offset: usize) instead of
folio.map_page(page_index: usize)
Wedson Almeida Filho Oct. 19, 2023, 1:25 p.m. UTC | #4
On Wed, 18 Oct 2023 at 16:21, Matthew Wilcox <willy@infradead.org> wrote:
>
> On Wed, Oct 18, 2023 at 03:32:36PM -0300, Wedson Almeida Filho wrote:
> > On Wed, 18 Oct 2023 at 14:17, Matthew Wilcox <willy@infradead.org> wrote:
> > >
> > > On Wed, Oct 18, 2023 at 09:25:08AM -0300, Wedson Almeida Filho wrote:
> > > > +void *rust_helper_kmap(struct page *page)
> > > > +{
> > > > +     return kmap(page);
> > > > +}
> > > > +EXPORT_SYMBOL_GPL(rust_helper_kmap);
> > > > +
> > > > +void rust_helper_kunmap(struct page *page)
> > > > +{
> > > > +     kunmap(page);
> > > > +}
> > > > +EXPORT_SYMBOL_GPL(rust_helper_kunmap);
> > >
> > > I'm not thrilled by exposing kmap()/kunmap() to Rust code.  The vast
> > > majority of code really only needs kmap_local_*() / kunmap_local().
> > > Can you elaborate on why you need the old kmap() in new Rust code?
> >
> > The difficulty we have with kmap_local_*() has to do with the
> > requirement that maps and unmaps need to be nested neatly. For
> > example:
> >
> > let a = folio1.map_local(...);
> > let b = folio2.map_local(...);
> > // Do something with `a` and `b`.
> > drop(a);
> > drop(b);
> >
> > The code obviously violates the requirements.
>
> Is that the only problem, or are there situations where we might try
> to do something like:
>
> a = folio1.map.local()
> b = folio2.map.local()
> drop(a)
> a = folio3.map.local()
> drop(b)
> b = folio4.map.local()
> drop (a)
> a = folio5.map.local()
> ...

This is also a problem. We don't control the order in which users are
going to unmap.

> > One way to enforce the rule in Rust is to use closures, so the code
> > above would be:
> >
> > folio1.map_local(..., |a| {
> >     folio2.map_local(..., |b| {
> >         // Do something with `a` and `b`.
> >     })
> > })
> >
> > It isn't as ergonomic as the first option, but it allows us to satisfy
> > the nesting requirement.
> >
> > Any chance we can relax that requirement?
>
> It's possible.  Here's an untested patch that _only_ supports
> "map a, map b, unmap a, unmap b".  If we need more, well, I guess
> we can scan the entire array, both at map & unmap in order to
> unmap pages.

We need more.

If you don't want to scan the whole array, we could have a solution
where we add an indirection between the available indices and the
stack of allocations; this way C could continue to work as is and Rust
would have a slightly different API that returns both the mapped
address and an index (which would be used to unmap).

It's simple to remember the index in Rust and it wouldn't have to be
exposed to end users; they'd still just do:

let a = folio1.map_local(...);

And when `a` is dropped, it would call unmap and pass the index back.
(It's also safe in the sense that users would not be able to
accidentally pass the wrong index.)
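
A minimal sketch of that guard (the indexed C entry point below is
hypothetical; it doesn't exist today):

pub struct LocalMapGuard<'a> {
    data: &'a [u8],
    idx: core::ffi::c_int,
}

impl Drop for LocalMapGuard<'_> {
    fn drop(&mut self) {
        // SAFETY: `idx` was returned by the (hypothetical) indexed kmap call
        // that created this guard, so it identifies a live local mapping.
        unsafe {
            bindings::kunmap_local_indexed_at(
                self.data.as_ptr().cast::<core::ffi::c_void>(),
                self.idx,
            )
        };
    }
}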

But if scanning the whole array is acceptable performance-wise, it's
definitely a simpler solution.

> diff --git a/mm/highmem.c b/mm/highmem.c
> index e19269093a93..778a22ca1796 100644
> --- a/mm/highmem.c
> +++ b/mm/highmem.c
> @@ -586,7 +586,7 @@ void kunmap_local_indexed(const void *vaddr)
>  {
>         unsigned long addr = (unsigned long) vaddr & PAGE_MASK;
>         pte_t *kmap_pte;
> -       int idx;
> +       int idx, local_idx;
>
>         if (addr < __fix_to_virt(FIX_KMAP_END) ||
>             addr > __fix_to_virt(FIX_KMAP_BEGIN)) {
> @@ -607,15 +607,25 @@ void kunmap_local_indexed(const void *vaddr)
>         }
>
>         preempt_disable();
> -       idx = arch_kmap_local_unmap_idx(kmap_local_idx(), addr);
> +       local_idx = kmap_local_idx();
> +       idx = arch_kmap_local_unmap_idx(local_idx, addr);
> +       if (addr != __fix_to_virt(FIX_KMAP_BEGIN + idx) && local_idx > 0) {
> +               idx--;
> +               local_idx--;
> +       }
>         WARN_ON_ONCE(addr != __fix_to_virt(FIX_KMAP_BEGIN + idx));
>
>         kmap_pte = kmap_get_pte(addr, idx);
>         arch_kmap_local_pre_unmap(addr);
>         pte_clear(&init_mm, addr, kmap_pte);
>         arch_kmap_local_post_unmap(addr);
> -       current->kmap_ctrl.pteval[kmap_local_idx()] = __pte(0);
> -       kmap_local_idx_pop();
> +       current->kmap_ctrl.pteval[local_idx] = __pte(0);
> +       if (local_idx == kmap_local_idx()) {
> +               kmap_local_idx_pop();
> +               if (local_idx > 0 &&
> +                   pte_none(current->kmap_ctrl.pteval[local_idx - 1]))
> +                       kmap_local_idx_pop();
> +       }
>         preempt_enable();
>         migrate_enable();
>  }
> @@ -648,7 +658,7 @@ void __kmap_local_sched_out(void)
>                         WARN_ON_ONCE(pte_val(pteval) != 0);
>                         continue;
>                 }
> -               if (WARN_ON_ONCE(pte_none(pteval)))
> +               if (pte_none(pteval))
>                         continue;
>
>                 /*
> @@ -685,7 +695,7 @@ void __kmap_local_sched_in(void)
>                         WARN_ON_ONCE(pte_val(pteval) != 0);
>                         continue;
>                 }
> -               if (WARN_ON_ONCE(pte_none(pteval)))
> +               if (pte_none(pteval))
>                         continue;
>
>                 /* See comment in __kmap_local_sched_out() */
>
> > > > +void rust_helper_folio_set_error(struct folio *folio)
> > > > +{
> > > > +     folio_set_error(folio);
> > > > +}
> > > > +EXPORT_SYMBOL_GPL(rust_helper_folio_set_error);
> > >
> > > I'm trying to get rid of the error flag.  Can you share the situations
> > > in which you've needed the error flag?  Or is it just copying existing
> > > practices?
> >
> > I'm just mimicking C code. Happy to remove it.
>
> Great, thanks!
>
> > > > +    /// Returns the byte position of this folio in its file.
> > > > +    pub fn pos(&self) -> i64 {
> > > > +        // SAFETY: The folio is valid because the shared reference implies a non-zero refcount.
> > > > +        unsafe { bindings::folio_pos(self.0.get()) }
> > > > +    }
> > >
> > > I think it's a mistake to make file positions an i64.  I estimate 64
> > > bits will not be enough by 2035-2040.  We should probably have a numeric
> > > type which is i64 on 32-bit and isize on other CPUs (I also project
> > > 64-bit pointers will be insufficient by 2035-2040 and so we will have
> > > 128-bit pointers around the same time, so we're not going to need i128
> > > file offsets with i64 pointers).
> >
> > I'm also just mimicking C here -- we just don't have a type that has
> > the properties you describe. I'm happy to switch once we have it, in
> > fact, Miguel has plans that I believe align well with what you want.
> > I'm not sure if he has already contacted you about it yet though.
>
> No, I haven't heard about plans for an off_t equivalent.

He tells me he'll send a patch for that soon.

> Perhaps you
> could just do what the crates.io libc does?
>
> https://docs.rs/libc/0.2.149/libc/type.off_t.html
> pub type off_t = i64;
>
> and then there's only one place to change to be i128 when the time comes.

Yes, I'll do that for v2.

> > > > +/// A [`Folio`] that has a single reference to it.
> > > > +pub struct UniqueFolio(pub(crate) ARef<Folio>);
> > >
> > > How do we know it only has a single reference?  Do you mean "has at
> > > least one reference"?  Or am I confusing Rust's notion of a reference
> > > with Linux's notion of a reference?
> >
> > Instances of `UniqueFolio` are only produced by calls to
> > `folio_alloc`. They encode the fact that it's safe for us to map the
> > folio and know that there aren't any concurrent threads/CPUs doing the
> > same to the same folio.
>
> Mmm ... it's always safe to map a folio, even if other people have a
> reference to it.  And Linux can make temporary spurious references to
> folios appear, although those should be noticed by the other party and
> released again before they access the contents of the folio.  So from
> the point of view of being memory-safe, you can ignore them, but you
> might see the refcount of the folio as >1, even if you just got the
> folio back from the allocator.

Sure, it's safe to map a folio in general, but Rust has stricter rules
about aliasing and mutability that are part of how memory safety is
achieved. In particular, it requires that we never have mutable and
immutable pointers to the same memory at once (modulo interior
mutability).

So we need to avoid something like:

let a = folio.map(); // `a` is a shared pointer to the contents of the folio.

// While we have a shared (and therefore immutable) pointer, we're
// changing the contents of the folio.
sb.sread(sector_number, sector_count, folio);

This violates Rust rules. `UniqueFolio` helps us address this for our
use case; if we try the above with a UniqueFolio, the compiler will
error out saying that  `a` has a shared reference to the folio, so we
can't call `sread` on it (because sread requires a mutable, and
therefore not shareable, reference to the folio).

(It's ok for the reference count to go up and down; it's unfortunate
that we use "reference" with two slightly different meanings, we
invariably get confused.)
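
A self-contained illustration of the rule with stand-in types (not the
patch's actual API); the second call is deliberately a compile error:

struct UniqueFolio(Vec<u8>);

impl UniqueFolio {
    fn map(&self) -> &[u8] {          // shared borrow of the folio
        &self.0
    }

    fn sread(&mut self, src: &[u8]) { // exclusive borrow of the folio
        self.0[..src.len()].copy_from_slice(src);
    }
}

fn example(mut folio: UniqueFolio) {
    let a = folio.map();  // shared borrow starts here...
    folio.sread(&[0; 4]); // ...so the compiler rejects this: cannot borrow
                          // `folio` as mutable while `a` is still alive
    let _ = a;            // shared borrow ends here
}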

> > > > +impl UniqueFolio {
> > > > +    /// Maps the contents of a folio page into a slice.
> > > > +    pub fn map_page(&self, page_index: usize) -> Result<MapGuard<'_>> {
> > > > +        if page_index >= self.0.size() / bindings::PAGE_SIZE {
> > > > +            return Err(EDOM);
> > > > +        }
> > > > +
> > > > +        // SAFETY: We just checked that the index is within bounds of the folio.
> > > > +        let page = unsafe { bindings::folio_page(self.0 .0.get(), page_index) };
> > > > +
> > > > +        // SAFETY: `page` is valid because it was returned by `folio_page` above.
> > > > +        let ptr = unsafe { bindings::kmap(page) };
> > >
> > > Surely this can be:
> > >
> > >            let ptr = unsafe { bindings::kmap_local_folio(folio, page_index * PAGE_SIZE) };
> >
> > The problem is the unmap path that can happen at arbitrary order in
> > Rust, see my comment above.
> >
> > >
> > > > +        // SAFETY: We just mapped `ptr`, so it's valid for read.
> > > > +        let data = unsafe { core::slice::from_raw_parts(ptr.cast::<u8>(), bindings::PAGE_SIZE) };
> > >
> > > Can we hide away the "if this isn't a HIGHMEM system, this maps to the
> > > end of the folio, but if it is, it only maps to the end of the page"
> > > problem here?
> >
> > Do you have ideas on how this might look like? (Don't worry about
> > Rust, just express it in some pseudo-C and we'll see if you can
> > express it in Rust.)
>
> On systems without HIGHMEM, kmap() is a no-op.  So we could do something
> like this:
>
>         let data = unsafe { core::slice::from_raw_parts(ptr.cast::<u8>(),
>                 if (folio_test_highmem(folio))
>                         bindings::PAGE_SIZE
>                 else
>                         folio_size(folio) - page_idx * PAGE_SIZE) }
>
> ... modulo whatever the correct syntax is in Rust.

We can certainly do that. But since there's the possibility that the
array will be capped at PAGE_SIZE in the HIGHMEM case, callers would
still need a loop to traverse the whole folio, right?

let mut offset = 0;
while offset < folio.size() {
    let a = folio.map(offset);
    // Do something with a.
    offset += a.len();
}

I guess the advantage is that we'd have a single iteration in systems
without HIGHMEM.

> Something I forgot to mention was that I found it more useful to express
> "map this chunk of a folio" in bytes rather than pages.  You might find
> the same, in which case it's just folio.map(offset: usize) instead of
> folio.map_page(page_index: usize)

Oh, thanks for the feedback. I'll switch to bytes then for v2.
(Already did in the example above.)
Matthew Wilcox Oct. 20, 2023, 4:11 a.m. UTC | #5
On Thu, Oct 19, 2023 at 10:25:39AM -0300, Wedson Almeida Filho wrote:
> On Wed, 18 Oct 2023 at 16:21, Matthew Wilcox <willy@infradead.org> wrote:
> >
> > On Wed, Oct 18, 2023 at 03:32:36PM -0300, Wedson Almeida Filho wrote:
> > > On Wed, 18 Oct 2023 at 14:17, Matthew Wilcox <willy@infradead.org> wrote:
> > > >
> > > > On Wed, Oct 18, 2023 at 09:25:08AM -0300, Wedson Almeida Filho wrote:
> > > > > +void *rust_helper_kmap(struct page *page)
> > > > > +{
> > > > > +     return kmap(page);
> > > > > +}
> > > > > +EXPORT_SYMBOL_GPL(rust_helper_kmap);
> > > > > +
> > > > > +void rust_helper_kunmap(struct page *page)
> > > > > +{
> > > > > +     kunmap(page);
> > > > > +}
> > > > > +EXPORT_SYMBOL_GPL(rust_helper_kunmap);
> > > >
> > > > I'm not thrilled by exposing kmap()/kunmap() to Rust code.  The vast
> > > > majority of code really only needs kmap_local_*() / kunmap_local().
> > > > Can you elaborate on why you need the old kmap() in new Rust code?
> > >
> > > The difficulty we have with kmap_local_*() has to do with the
> > > requirement that maps and unmaps need to be nested neatly. For
> > > example:
> > >
> > > let a = folio1.map_local(...);
> > > let b = folio2.map_local(...);
> > > // Do something with `a` and `b`.
> > > drop(a);
> > > drop(b);
> > >
> > > The code obviously violates the requirements.
> >
> > Is that the only problem, or are there situations where we might try
> > to do something like:
> >
> > a = folio1.map.local()
> > b = folio2.map.local()
> > drop(a)
> > a = folio3.map.local()
> > drop(b)
> > b = folio4.map.local()
> > drop (a)
> > a = folio5.map.local()
> > ...
> 
> This is also a problem. We don't control the order in which users are
> going to unmap.

OK.  I have something in the works, but it's not quite ready yet.

> If you don't want to scan the whole array, we could have a solution
> where we add an indirection between the available indices and the
> stack of allocations; this way C could continue to work as is and Rust
> would have a slightly different API that returns both the mapped
> address and an index (which would be used to unmap).
> 
> It's simple to remember the index in Rust and it wouldn't have to be
> exposed to end users, they'd still just do:
> 
> let a = folio1.map_local(...);
> 
> And when `a` is dropped, it would call unmap and pass the index back.
> (It's also safe in the sense that users would not be able to
> accidentally pass the wrong index.)
> 
> But if scanning the whole array is acceptable performance-wise, it's
> definitely a simpler solution.

Interesting idea.  There are some other possibilities too ... let's see.

> > > > > +/// A [`Folio`] that has a single reference to it.
> > > > > +pub struct UniqueFolio(pub(crate) ARef<Folio>);
> > > >
> > > > How do we know it only has a single reference?  Do you mean "has at
> > > > least one reference"?  Or am I confusing Rust's notion of a reference
> > > > with Linux's notion of a reference?
> > >
> > > Instances of `UniqueFolio` are only produced by calls to
> > > `folio_alloc`. They encode the fact that it's safe for us to map the
> > > folio and know that there aren't any concurrent threads/CPUs doing the
> > > same to the same folio.
> >
> > Mmm ... it's always safe to map a folio, even if other people have a
> > reference to it.  And Linux can make temporary spurious references to
> > folios appear, although those should be noticed by the other party and
> > released again before they access the contents of the folio.  So from
> > the point of view of being memory-safe, you can ignore them, but you
> > might see the refcount of the folio as >1, even if you just got the
> > folio back from the allocator.
> 
> Sure, it's safe to map a folio in general, but Rust has stricter rules
> about aliasing and mutability that are part of how memory safety is
> achieved. In particular, it requires that we never have mutable and
> immutable pointers to the same memory at once (modulo interior
> mutability).
> 
> So we need to avoid something like:
> 
> let a = folio.map(); // `a` is a shared pointer to the contents of the folio.
> 
> // While we have a shared (and therefore immutable) pointer, we're
> changing the contents of the folio.
> sb.sread(sector_number, sector_count, folio);
> 
> This violates Rust rules. `UniqueFolio` helps us address this for our
> use case; if we try the above with a UniqueFolio, the compiler will
> error out saying that  `a` has a shared reference to the folio, so we
> can't call `sread` on it (because sread requires a mutable, and
> therefore not shareable, reference to the folio).

This is going to be quite the impedance mismatch.  Still, I imagine
you're used to dealing with those by now and have a toolbox of ideas.

We don't have that rule for the pagecache as it is.  We do have rules that
prevent data corruption!  For example, if the folio is !uptodate then you
must have the lock to write to the folio in order to bring it uptodate
(so we have a single writer rule in that regard).  But once the folio is
uptodate, all bets are off in terms of who can be writing to it / reading
it at the same time.  And that's going to have to continue to be true;
multiple processes can have the same page mmaped writable and write to
it at the same time.  There's no possible synchronisation between them.

But I think your concern is really more limited.  You're concerned
with filesystem metadata obeying Rust's rules.  And for a read-write
filesystem, you're going to have to have ... something ... which gets a
folio from the page cache, and establishes that this is the only thread
which can modify that folio (maybe it's an interior node of a Btree,
maybe it's a free space bitmap, ...).  We could probably use the folio
lock bit for that purpose.  For the read-only filesystems, you only need to
be concerned about freshly-allocated folios, but you need something more
when it comes to doing an ext2 clone.

There's general concern about the overuse of the folio lock bit, but
this is a reasonable use -- preventing two threads from modifying the
same folio at the same time.

(I have simplified all this; both iomap and buffer heads support folios
which are partially uptodate, but conceptually this is accurate)

> > On systems without HIGHMEM, kmap() is a no-op.  So we could do something
> > like this:
> >
> >         let data = unsafe { core::slice::from_raw_parts(ptr.cast::<u8>(),
> >                 if (folio_test_highmem(folio))
> >                         bindings::PAGE_SIZE
> >                 else
> >                         folio_size(folio) - page_idx * PAGE_SIZE) }
> >
> > ... modulo whatever the correct syntax is in Rust.
> 
> We can certainly do that. But since there's the possibility that the
> array will be capped at PAGE_SIZE in the HIGHMEM case, callers would
> still need a loop to traverse the whole folio, right?
> 
> let mut offset = 0;
> while offset < folio.size() {
>     let a = folio.map(offset);
>     // Do something with a.
>     offset += a.len();
> }
> 
> I guess the advantage is that we'd have a single iteration in systems
> without HIGHMEM.

Right.  You can see something similar to that in memcpy_from_folio() in
highmem.h.

> > Something I forgot to mention was that I found it more useful to express
> > "map this chunk of a folio" in bytes rather than pages.  You might find
> > the same, in which case it's just folio.map(offset: usize) instead of
> > folio.map_page(page_index: usize)
> 
> Oh, thanks for the feedback. I'll switch to bytes then for v2.
> (Already did in the example above.)

Great!  Something else I think would be a good idea is open-coding some
of the trivial accessors.  eg instead of doing:

+size_t rust_helper_folio_size(struct folio *folio)
+{
+	return folio_size(folio);
+}
+EXPORT_SYMBOL_GPL(rust_helper_folio_size);
[...]
+    pub fn size(&self) -> usize {
+        // SAFETY: The folio is valid because the shared reference implies a non-zero refcount.
+        unsafe { bindings::folio_size(self.0.get()) }
+    }

add:

impl Folio {
...
    pub fn order(&self) -> u8 {
	if (self.flags & (1 << PG_head))
	    self._flags_1 & 0xff
	else
	    0
    }

    pub fn size(&self) -> usize {
	bindings::PAGE_SIZE << self.order()
    }
}

... or have I misunderstood what is possible here?  My hope is that the
compiler gets to "see through" the abstraction, which surely can't be
done when there's a function call.
Matthew Wilcox Oct. 20, 2023, 3:17 p.m. UTC | #6
On Fri, Oct 20, 2023 at 05:11:38AM +0100, Matthew Wilcox wrote:
> > Sure, it's safe to map a folio in general, but Rust has stricter rules
> > about aliasing and mutability that are part of how memory safety is
> > achieved. In particular, it requires that we never have mutable and
> > immutable pointers to the same memory at once (modulo interior
> > mutability).
> > 
> > So we need to avoid something like:
> > 
> > let a = folio.map(); // `a` is a shared pointer to the contents of the folio.
> > 
> > // While we have a shared (and therefore immutable) pointer, we're
> > changing the contents of the folio.
> > sb.sread(sector_number, sector_count, folio);
> > 
> > This violates Rust rules. `UniqueFolio` helps us address this for our
> > use case; if we try the above with a UniqueFolio, the compiler will
> > error out saying that  `a` has a shared reference to the folio, so we
> > can't call `sread` on it (because sread requires a mutable, and
> > therefore not shareable, reference to the folio).
> 
> This is going to be quite the impedance mismatch.  Still, I imagine
> you're used to dealing with those by now and have a toolbox of ideas.
> 
> We don't have that rule for the pagecache as it is.  We do have rules that
> prevent data corruption!  For example, if the folio is !uptodate then you
> must have the lock to write to the folio in order to bring it uptodate
> (so we have a single writer rule in that regard).  But once the folio is
> uptodate, all bets are off in terms of who can be writing to it / reading
> it at the same time.  And that's going to have to continue to be true;
> multiple processes can have the same page mmaped writable and write to
> it at the same time.  There's no possible synchronisation between them.
> 
> But I think your concern is really more limited.  You're concerned
> with filesystem metadata obeying Rust's rules.  And for a read-write
> filesystem, you're going to have to have ... something ... which gets a
> folio from the page cache, and establishes that this is the only thread
> which can modify that folio (maybe it's an interior node of a Btree,
> maybe it's a free space bitmap, ...).  We could probably use the folio
> lock bit for that purpose,  For the read-only filesystems, you only need
> be concerned about freshly-allocated folios, but you need something more
> when it comes to doing an ext2 clone.
> 
> There's general concern about the overuse of the folio lock bit, but
> this is a reasonable use -- preventing two threads from modifying the
> same folio at the same time.

Sorry, I didn't quite finish this thought; that'll teach me to write
complicated emails last thing at night.

The page cache has no single-writer vs multiple-reader exclusion on folios
found in the page cache.  We expect filesystems to implement whatever
exclusion they need at a higher level.  For example, ext2 has no higher
level lock on its block allocator.  Once the buffer is uptodate (ie has
been read from storage), it uses atomic bit operations in order to track
which blocks are freed.  It does use a spinlock to control access to
"how many blocks are currently free".

I'm not suggesting ext2 is an optimal strategy.  I know XFS and btrfs
use rwsems, although I'm not familiar enough with either to describe
exactly how it works.
Benno Lossin Oct. 21, 2023, 9:21 a.m. UTC | #7
On 18.10.23 14:25, Wedson Almeida Filho wrote:
> From: Wedson Almeida Filho <walmeida@microsoft.com>
> 
> Allow Rust file systems to handle ref-counted folios.
> 
> Provide the minimum needed to implement `read_folio` (part of `struct
> address_space_operations`) in read-only file systems and to read
> uncached blocks.
> 
> Signed-off-by: Wedson Almeida Filho <walmeida@microsoft.com>
> ---
>   rust/bindings/bindings_helper.h |   3 +
>   rust/bindings/lib.rs            |   2 +
>   rust/helpers.c                  |  81 ++++++++++++
>   rust/kernel/folio.rs            | 215 ++++++++++++++++++++++++++++++++
>   rust/kernel/lib.rs              |   1 +
>   5 files changed, 302 insertions(+)
>   create mode 100644 rust/kernel/folio.rs
> 
> diff --git a/rust/bindings/bindings_helper.h b/rust/bindings/bindings_helper.h
> index ca1898ce9527..53a99ea512d1 100644
> --- a/rust/bindings/bindings_helper.h
> +++ b/rust/bindings/bindings_helper.h
> @@ -11,6 +11,7 @@
>   #include <linux/fs.h>
>   #include <linux/fs_context.h>
>   #include <linux/slab.h>
> +#include <linux/pagemap.h>
>   #include <linux/refcount.h>
>   #include <linux/wait.h>
>   #include <linux/sched.h>
> @@ -27,3 +28,5 @@ const slab_flags_t BINDINGS_SLAB_ACCOUNT = SLAB_ACCOUNT;
>   const unsigned long BINDINGS_SB_RDONLY = SB_RDONLY;
> 
>   const loff_t BINDINGS_MAX_LFS_FILESIZE = MAX_LFS_FILESIZE;
> +
> +const size_t BINDINGS_PAGE_SIZE = PAGE_SIZE;
> diff --git a/rust/bindings/lib.rs b/rust/bindings/lib.rs
> index 426915d3fb57..a96b7f08e57d 100644
> --- a/rust/bindings/lib.rs
> +++ b/rust/bindings/lib.rs
> @@ -59,3 +59,5 @@ mod bindings_helper {
>   pub const SB_RDONLY: core::ffi::c_ulong = BINDINGS_SB_RDONLY;
> 
>   pub const MAX_LFS_FILESIZE: loff_t = BINDINGS_MAX_LFS_FILESIZE;
> +
> +pub const PAGE_SIZE: usize = BINDINGS_PAGE_SIZE;
> diff --git a/rust/helpers.c b/rust/helpers.c
> index c5a2bec6467d..f2ce3e7b688c 100644
> --- a/rust/helpers.c
> +++ b/rust/helpers.c
> @@ -23,10 +23,14 @@
>   #include <kunit/test-bug.h>
>   #include <linux/bug.h>
>   #include <linux/build_bug.h>
> +#include <linux/cacheflush.h>
>   #include <linux/err.h>
>   #include <linux/errname.h>
>   #include <linux/fs.h>
> +#include <linux/highmem.h>
> +#include <linux/mm.h>
>   #include <linux/mutex.h>
> +#include <linux/pagemap.h>
>   #include <linux/refcount.h>
>   #include <linux/sched/signal.h>
>   #include <linux/spinlock.h>
> @@ -145,6 +149,77 @@ struct kunit *rust_helper_kunit_get_current_test(void)
>   }
>   EXPORT_SYMBOL_GPL(rust_helper_kunit_get_current_test);
> 
> +void *rust_helper_kmap(struct page *page)
> +{
> +	return kmap(page);
> +}
> +EXPORT_SYMBOL_GPL(rust_helper_kmap);
> +
> +void rust_helper_kunmap(struct page *page)
> +{
> +	kunmap(page);
> +}
> +EXPORT_SYMBOL_GPL(rust_helper_kunmap);
> +
> +void rust_helper_folio_get(struct folio *folio)
> +{
> +	folio_get(folio);
> +}
> +EXPORT_SYMBOL_GPL(rust_helper_folio_get);
> +
> +void rust_helper_folio_put(struct folio *folio)
> +{
> +	folio_put(folio);
> +}
> +EXPORT_SYMBOL_GPL(rust_helper_folio_put);
> +
> +struct page *rust_helper_folio_page(struct folio *folio, size_t n)
> +{
> +	return folio_page(folio, n);
> +}
> +
> +loff_t rust_helper_folio_pos(struct folio *folio)
> +{
> +	return folio_pos(folio);
> +}
> +EXPORT_SYMBOL_GPL(rust_helper_folio_pos);
> +
> +size_t rust_helper_folio_size(struct folio *folio)
> +{
> +	return folio_size(folio);
> +}
> +EXPORT_SYMBOL_GPL(rust_helper_folio_size);
> +
> +void rust_helper_folio_mark_uptodate(struct folio *folio)
> +{
> +	folio_mark_uptodate(folio);
> +}
> +EXPORT_SYMBOL_GPL(rust_helper_folio_mark_uptodate);
> +
> +void rust_helper_folio_set_error(struct folio *folio)
> +{
> +	folio_set_error(folio);
> +}
> +EXPORT_SYMBOL_GPL(rust_helper_folio_set_error);
> +
> +void rust_helper_flush_dcache_folio(struct folio *folio)
> +{
> +	flush_dcache_folio(folio);
> +}
> +EXPORT_SYMBOL_GPL(rust_helper_flush_dcache_folio);
> +
> +void *rust_helper_kmap_local_folio(struct folio *folio, size_t offset)
> +{
> +	return kmap_local_folio(folio, offset);
> +}
> +EXPORT_SYMBOL_GPL(rust_helper_kmap_local_folio);
> +
> +void rust_helper_kunmap_local(const void *vaddr)
> +{
> +	kunmap_local(vaddr);
> +}
> +EXPORT_SYMBOL_GPL(rust_helper_kunmap_local);
> +
>   void rust_helper_i_uid_write(struct inode *inode, uid_t uid)
>   {
>   	i_uid_write(inode, uid);
> @@ -163,6 +238,12 @@ off_t rust_helper_i_size_read(const struct inode *inode)
>   }
>   EXPORT_SYMBOL_GPL(rust_helper_i_size_read);
> 
> +void rust_helper_mapping_set_large_folios(struct address_space *mapping)
> +{
> +	mapping_set_large_folios(mapping);
> +}
> +EXPORT_SYMBOL_GPL(rust_helper_mapping_set_large_folios);
> +
>   /*
>    * `bindgen` binds the C `size_t` type as the Rust `usize` type, so we can
>    * use it in contexts where Rust expects a `usize` like slice (array) indices.
> diff --git a/rust/kernel/folio.rs b/rust/kernel/folio.rs
> new file mode 100644
> index 000000000000..ef8a08b97962
> --- /dev/null
> +++ b/rust/kernel/folio.rs
> @@ -0,0 +1,215 @@
> +// SPDX-License-Identifier: GPL-2.0
> +
> +//! Groups of contiguous pages, folios.
> +//!
> +//! C headers: [`include/linux/mm.h`](../../include/linux/mm.h)
> +
> +use crate::error::{code::*, Result};
> +use crate::types::{ARef, AlwaysRefCounted, Opaque, ScopeGuard};
> +use core::{cmp::min, ptr};
> +
> +/// Wraps the kernel's `struct folio`.
> +///
> +/// # Invariants
> +///
> +/// Instances of this type are always ref-counted, that is, a call to `folio_get` ensures that the
> +/// allocation remains valid at least until the matching call to `folio_put`.
> +#[repr(transparent)]
> +pub struct Folio(pub(crate) Opaque<bindings::folio>);
> +
> +// SAFETY: The type invariants guarantee that `Folio` is always ref-counted.
> +unsafe impl AlwaysRefCounted for Folio {
> +    fn inc_ref(&self) {
> +        // SAFETY: The existence of a shared reference means that the refcount is nonzero.
> +        unsafe { bindings::folio_get(self.0.get()) };
> +    }
> +
> +    unsafe fn dec_ref(obj: ptr::NonNull<Self>) {
> +        // SAFETY: The safety requirements guarantee that the refcount is nonzero.
> +        unsafe { bindings::folio_put(obj.cast().as_ptr()) }
> +    }
> +}
> +
> +impl Folio {
> +    /// Tries to allocate a new folio.
> +    ///
> +    /// On success, returns a folio made up of 2^order pages.
> +    pub fn try_new(order: u32) -> Result<UniqueFolio> {
> +        if order > bindings::MAX_ORDER {
> +            return Err(EDOM);
> +        }
> +
> +        // SAFETY: We checked that `order` is within the max allowed value.
> +        let f = ptr::NonNull::new(unsafe { bindings::folio_alloc(bindings::GFP_KERNEL, order) })
> +            .ok_or(ENOMEM)?;
> +
> +        // SAFETY: The folio returned by `folio_alloc` is referenced. The ownership of the
> +        // reference is transferred to the `ARef` instance.
> +        Ok(UniqueFolio(unsafe { ARef::from_raw(f.cast()) }))
> +    }
> +
> +    /// Returns the byte position of this folio in its file.
> +    pub fn pos(&self) -> i64 {
> +        // SAFETY: The folio is valid because the shared reference implies a non-zero refcount.
> +        unsafe { bindings::folio_pos(self.0.get()) }
> +    }
> +
> +    /// Returns the byte size of this folio.
> +    pub fn size(&self) -> usize {
> +        // SAFETY: The folio is valid because the shared reference implies a non-zero refcount.
> +        unsafe { bindings::folio_size(self.0.get()) }
> +    }
> +
> +    /// Flushes the data cache for the pages that make up the folio.
> +    pub fn flush_dcache(&self) {
> +        // SAFETY: The folio is valid because the shared reference implies a non-zero refcount.
> +        unsafe { bindings::flush_dcache_folio(self.0.get()) }
> +    }
> +}
> +
> +/// A [`Folio`] that has a single reference to it.

This should be an invariant.
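
Something like this, perhaps, mirroring the `# Invariants` section
already present on `Folio`:

/// A [`Folio`] that has a single reference to it.
///
/// # Invariants
///
/// The wrapped [`Folio`] is owned exclusively by this instance (no other
/// pointers to it have been handed out), so it can be mapped and written
/// to without racing with other users.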

> +pub struct UniqueFolio(pub(crate) ARef<Folio>);
> +
> +impl UniqueFolio {
> +    /// Maps the contents of a folio page into a slice.
> +    pub fn map_page(&self, page_index: usize) -> Result<MapGuard<'_>> {
> +        if page_index >= self.0.size() / bindings::PAGE_SIZE {
> +            return Err(EDOM);
> +        }
> +
> +        // SAFETY: We just checked that the index is within bounds of the folio.
> +        let page = unsafe { bindings::folio_page(self.0 .0.get(), page_index) };
> +
> +        // SAFETY: `page` is valid because it was returned by `folio_page` above.
> +        let ptr = unsafe { bindings::kmap(page) };
> +
> +        // SAFETY: We just mapped `ptr`, so it's valid for read.
> +        let data = unsafe { core::slice::from_raw_parts(ptr.cast::<u8>(), bindings::PAGE_SIZE) };
> +
> +        Ok(MapGuard { data, page })
> +    }
> +}
> +
> +/// A mapped [`UniqueFolio`].
> +pub struct MapGuard<'a> {
> +    data: &'a [u8],
> +    page: *mut bindings::page,
> +}
> +
> +impl core::ops::Deref for MapGuard<'_> {
> +    type Target = [u8];
> +
> +    fn deref(&self) -> &Self::Target {
> +        self.data
> +    }
> +}
> +
> +impl Drop for MapGuard<'_> {
> +    fn drop(&mut self) {
> +        // SAFETY: A `MapGuard` instance is only created when `kmap` succeeds, so it's ok to unmap
> +        // it when the guard is dropped.
> +        unsafe { bindings::kunmap(self.page) };
> +    }
> +}
> +
> +/// A locked [`Folio`].

This should be an invariant.
Andreas Hindborg Oct. 23, 2023, 10:48 a.m. UTC | #8
Matthew Wilcox <willy@infradead.org> writes:

(snip)

>> > Something I forgot to mention was that I found it more useful to express
>> > "map this chunk of a folio" in bytes rather than pages.  You might find
>> > the same, in which case it's just folio.map(offset: usize) instead of
>> > folio.map_page(page_index: usize)
>> 
>> Oh, thanks for the feedback. I'll switch to bytes then for v2.
>> (Already did in the example above.)
>
> Great!  Something else I think would be a good idea is open-coding some
> of the trivial accessors.  eg instead of doing:
>
> +size_t rust_helper_folio_size(struct folio *folio)
> +{
> +	return folio_size(folio);
> +}
> +EXPORT_SYMBOL_GPL(rust_helper_folio_size);
> [...]
> +    pub fn size(&self) -> usize {
> +        // SAFETY: The folio is valid because the shared reference implies a non-zero refcount.
> +        unsafe { bindings::folio_size(self.0.get()) }
> +    }
>
> add:
>
> impl Folio {
> ...
>     pub fn order(&self) -> u8 {
> 	if (self.flags & (1 << PG_head))
> 	    self._flags_1 & 0xff
> 	else
> 	    0
>     }
>
>     pub fn size(&self) -> usize {
> 	bindings::PAGE_SIZE << self.order()
>     }
> }
>
> ... or have I misunderstood what is possible here?  My hope is that the
> compiler gets to "see through" the abstraction, which surely can't be
> done when there's a function call.

The build system and Rust compiler can inline and optimize across
function calls and languages when LTO is enabled. Some patches are
needed to make it work though.

BR Andreas
Wedson Almeida Filho Oct. 23, 2023, 12:29 p.m. UTC | #9
On Fri, 20 Oct 2023 at 01:11, Matthew Wilcox <willy@infradead.org> wrote:
>
> On Thu, Oct 19, 2023 at 10:25:39AM -0300, Wedson Almeida Filho wrote:
> > On Wed, 18 Oct 2023 at 16:21, Matthew Wilcox <willy@infradead.org> wrote:
> > >
> > > On Wed, Oct 18, 2023 at 03:32:36PM -0300, Wedson Almeida Filho wrote:
> > > > On Wed, 18 Oct 2023 at 14:17, Matthew Wilcox <willy@infradead.org> wrote:
> > > > >
> > > > > On Wed, Oct 18, 2023 at 09:25:08AM -0300, Wedson Almeida Filho wrote:
> > > > > > +void *rust_helper_kmap(struct page *page)
> > > > > > +{
> > > > > > +     return kmap(page);
> > > > > > +}
> > > > > > +EXPORT_SYMBOL_GPL(rust_helper_kmap);
> > > > > > +
> > > > > > +void rust_helper_kunmap(struct page *page)
> > > > > > +{
> > > > > > +     kunmap(page);
> > > > > > +}
> > > > > > +EXPORT_SYMBOL_GPL(rust_helper_kunmap);
> > > > >
> > > > > I'm not thrilled by exposing kmap()/kunmap() to Rust code.  The vast
> > > > > majority of code really only needs kmap_local_*() / kunmap_local().
> > > > > Can you elaborate on why you need the old kmap() in new Rust code?
> > > >
> > > > The difficulty we have with kmap_local_*() has to do with the
> > > > requirement that maps and unmaps need to be nested neatly. For
> > > > example:
> > > >
> > > > let a = folio1.map_local(...);
> > > > let b = folio2.map_local(...);
> > > > // Do something with `a` and `b`.
> > > > drop(a);
> > > > drop(b);
> > > >
> > > > The code obviously violates the requirements.
> > >
> > > Is that the only problem, or are there situations where we might try
> > > to do something like:
> > >
> > > a = folio1.map.local()
> > > b = folio2.map.local()
> > > drop(a)
> > > a = folio3.map.local()
> > > drop(b)
> > > b = folio4.map.local()
> > > drop (a)
> > > a = folio5.map.local()
> > > ...
> >
> > This is also a problem. We don't control the order in which users are
> > going to unmap.
>
> OK.  I have something in the works, but it's not quite ready yet.

Please share once you're happy with it!

Or before. :)

> > This violates Rust rules. `UniqueFolio` helps us address this for our
> > use case; if we try the above with a UniqueFolio, the compiler will
> > error out saying that  `a` has a shared reference to the folio, so we
> > can't call `sread` on it (because sread requires a mutable, and
> > therefore not shareable, reference to the folio).
>
> This is going to be quite the impedance mismatch.  Still, I imagine
> you're used to dealing with those by now and have a toolbox of ideas.
>
> We don't have that rule for the pagecache as it is.  We do have rules that
> prevent data corruption!  For example, if the folio is !uptodate then you
> must have the lock to write to the folio in order to bring it uptodate
> (so we have a single writer rule in that regard).  But once the folio is
> uptodate, all bets are off in terms of who can be writing to it / reading
> it at the same time.  And that's going to have to continue to be true;
> multiple processes can have the same page mmaped writable and write to
> it at the same time.  There's no possible synchronisation between them.
>
> But I think your concern is really more limited.  You're concerned
> with filesystem metadata obeying Rust's rules.  And for a read-write
> filesystem, you're going to have to have ... something ... which gets a
> folio from the page cache, and establishes that this is the only thread
> which can modify that folio (maybe it's an interior node of a Btree,
> maybe it's a free space bitmap, ...).  We could probably use the folio
> lock bit for that purpose,  For the read-only filesystems, you only need
> be concerned about freshly-allocated folios, but you need something more
> when it comes to doing an ext2 clone.
>
> There's general concern about the overuse of the folio lock bit, but
> this is a reasonable use -- preventing two threads from modifying the
> same folio at the same time.
>
> (I have simplified all this; both iomap and buffer heads support folios
> which are partially uptodate, but conceptually this is accurate)

Yes, that's precisely the case. Rust doesn't mind if data is mapped
multiple times for a given folio as in most cases it won't inspect the
contents anyway. But for metadata, it will need some synchronisation.

In this read-only scenario we're supporting now, we conveniently
already get locked folios in `read_folio` calls, so we're using the
fact that it's locked to have a single writer to it -- note that the
`write` and `zero_out` functions are only available  in `LockedFolio`,
not in `Folio`.

`UniqueFolio` is for when a module needs to read sectors without the
cache, in particular the super block. (Before it has decided on the
block size and has initialised the superblock with the block size.) So
it's essentially a sequence of allocate a folio, read from a block
device, map it, use the contents, unmap, free. The folio isn't really
shared with anyone else. Eventually we may want to merge this concept
with that of a locked folio so that we only have one writable
folio.
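
In code, that sequence looks roughly like this (a sketch: `sb.sread`
is the uncached block read discussed earlier, and its exact signature,
as well as `SUPERBLOCK_SECTOR` and `MAGIC`, are made-up names for
illustration):

// Hypothetical magic number of the example file system.
const MAGIC: &[u8] = b"1234";

// Allocate an order-0 (single page) folio; it is freed when dropped.
let mut folio = Folio::try_new(0)?;

// Read the superblock sector into it. `sread` needs the folio mutably,
// so the compiler rejects calling it while a `MapGuard` returned by
// `map_page` below is still alive.
sb.sread(SUPERBLOCK_SECTOR, 1, &mut folio)?;

// Map page 0 and inspect the contents; the page is unmapped when
// `map` goes out of scope.
let map = folio.map_page(0)?;
if &map[..4] != MAGIC {
    return Err(EINVAL);
}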

> > > On systems without HIGHMEM, kmap() is a no-op.  So we could do something
> > > like this:
> > >
> > >         let data = unsafe { core::slice::from_raw_parts(ptr.cast::<u8>(),
> > >                 if (folio_test_highmem(folio))
> > >                         bindings::PAGE_SIZE
> > >                 else
> > >                         folio_size(folio) - page_idx * PAGE_SIZE) }
> > >
> > > ... modulo whatever the correct syntax is in Rust.
> >
> > We can certainly do that. But since there's the possibility that the
> > array will be capped at PAGE_SIZE in the HIGHMEM case, callers would
> > still need a loop to traverse the whole folio, right?
> >
> > let mut offset = 0;
> > while offset < folio.size() {
> >     let a = folio.map(offset);
> >     // Do something with a.
> >     offset += a.len();
> > }
> >
> > I guess the advantage is that we'd have a single iteration in systems
> > without HIGHMEM.
>
> Right.  You can see something similar to that in memcpy_from_folio() in
> highmem.h.

I will have this in v2.
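
A rough sketch of what that byte-offset `map` could look like
(assumptions, not part of this patch: a `folio_test_highmem` binding
is available, and the guard stores the mapped address and calls
`kunmap_local` in its `Drop` instead of `kunmap`):

impl UniqueFolio {
    /// Maps the contents of the folio, starting at `offset`, into a slice.
    pub fn map(&self, offset: usize) -> Result<MapGuard<'_>> {
        if offset >= self.0.size() {
            return Err(EDOM);
        }

        // SAFETY: `offset` is within the bounds of the folio, as checked above.
        let ptr = unsafe { bindings::kmap_local_folio(self.0 .0.get(), offset) };

        // On HIGHMEM only the remainder of the current page is guaranteed
        // to be mapped; otherwise the mapping extends to the folio's end.
        let len = if unsafe { bindings::folio_test_highmem(self.0 .0.get()) } {
            bindings::PAGE_SIZE - (offset & (bindings::PAGE_SIZE - 1))
        } else {
            self.0.size() - offset
        };

        // SAFETY: `ptr` points to at least `len` mapped, readable bytes.
        let data = unsafe { core::slice::from_raw_parts(ptr.cast::<u8>(), len) };
        Ok(MapGuard { data, vaddr: ptr })
    }
}

Callers would still loop as in the snippet above, but on !HIGHMEM
configurations the loop runs exactly once.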

> > > Something I forgot to mention was that I found it more useful to express
> > > "map this chunk of a folio" in bytes rather than pages.  You might find
> > > the same, in which case it's just folio.map(offset: usize) instead of
> > > folio.map_page(page_index: usize)
> >
> > Oh, thanks for the feedback. I'll switch to bytes then for v2.
> > (Already did in the example above.)
>
> Great!  Something else I think would be a good idea is open-coding some
> of the trivial accessors.  eg instead of doing:
>
> +size_t rust_helper_folio_size(struct folio *folio)
> +{
> +       return folio_size(folio);
> +}
> +EXPORT_SYMBOL_GPL(rust_helper_folio_size);
> [...]
> +    pub fn size(&self) -> usize {
> +        // SAFETY: The folio is valid because the shared reference implies a non-zero refcount.
> +        unsafe { bindings::folio_size(self.0.get()) }
> +    }
>
> add:
>
> impl Folio {
> ...
>     pub fn order(&self) -> u8 {
>         if (self.flags & (1 << PG_head))
>             self._flags_1 & 0xff
>         else
>             0
>     }
>
>     pub fn size(&self) -> usize {
>         bindings::PAGE_SIZE << self.order()
>     }
> }
>
> ... or have I misunderstood what is possible here?  My hope is that the
> compiler gets to "see through" the abstraction, which surely can't be
> done when there's a function call.

As Andreas pointed out, with LTO we can get the linker to see through
these functions, even across different languages, and optimise when it
makes sense.

Having said that, while it's possible to do what you suggested above,
we try to avoid it so that maintainers can continue to have a single
place they need to change if they ever decide to change things. A
simple example from above is order(), if you decide to implement it
differently (I don't know, if you change the flag, you decide to have
an explicit field, whatever), then you'd have to change the C _and_
the Rust versions. Worse yet, there's a chance that forgetting to
update the Rust version wouldn't break the build, which would make it
harder to catch mismatched versions.
Wedson Almeida Filho Oct. 23, 2023, 12:32 p.m. UTC | #10
On Fri, 20 Oct 2023 at 12:17, Matthew Wilcox <willy@infradead.org> wrote:
>
> On Fri, Oct 20, 2023 at 05:11:38AM +0100, Matthew Wilcox wrote:
> > > Sure, it's safe to map a folio in general, but Rust has stricter rules
> > > about aliasing and mutability that are part of how memory safety is
> > > achieved. In particular, it requires that we never have mutable and
> > > immutable pointers to the same memory at once (modulo interior
> > > mutability).
> > >
> > > So we need to avoid something like:
> > >
> > > let a = folio.map(); // `a` is a shared pointer to the contents of the folio.
> > >
> > > // While we have a shared (and therefore immutable) pointer, we're
> > > changing the contents of the folio.
> > > sb.sread(sector_number, sector_count, folio);
> > >
> > > This violates Rust rules. `UniqueFolio` helps us address this for our
> > > use case; if we try the above with a UniqueFolio, the compiler will
> > > error out saying that  `a` has a shared reference to the folio, so we
> > > can't call `sread` on it (because sread requires a mutable, and
> > > therefore not shareable, reference to the folio).
> >
> > This is going to be quite the impedance mismatch.  Still, I imagine
> > you're used to dealing with those by now and have a toolbox of ideas.
> >
> > We don't have that rule for the pagecache as it is.  We do have rules that
> > prevent data corruption!  For example, if the folio is !uptodate then you
> > must have the lock to write to the folio in order to bring it uptodate
> > (so we have a single writer rule in that regard).  But once the folio is
> > uptodate, all bets are off in terms of who can be writing to it / reading
> > it at the same time.  And that's going to have to continue to be true;
> > multiple processes can have the same page mmaped writable and write to
> > it at the same time.  There's no possible synchronisation between them.
> >
> > But I think your concern is really more limited.  You're concerned
> > with filesystem metadata obeying Rust's rules.  And for a read-write
> > filesystem, you're going to have to have ... something ... which gets a
> > folio from the page cache, and establishes that this is the only thread
> > which can modify that folio (maybe it's an interior node of a Btree,
> > maybe it's a free space bitmap, ...).  We could probably use the folio
> > lock bit for that purpose.  For the read-only filesystems, you only need
> > be concerned about freshly-allocated folios, but you need something more
> > when it comes to doing an ext2 clone.
> >
> > There's general concern about the overuse of the folio lock bit, but
> > this is a reasonable use -- preventing two threads from modifying the
> > same folio at the same time.
>
> Sorry, I didn't quite finish this thought; that'll teach me to write
> complicated emails last thing at night.
>
> The page cache has no single-writer vs multiple-reader exclusion on folios
> found in the page cache.  We expect filesystems to implement whatever
> exclusion they need at a higher level.  For example, ext2 has no higher
> level lock on its block allocator.  Once the buffer is uptodate (ie has
> been read from storage), it uses atomic bit operations in order to track
> which blocks are freed.  It does use a spinlock to control access to
> "how many blocks are currently free".
>
> I'm not suggesting ext2 is an optimal strategy.  I know XFS and btrfs
> use rwsems, although I'm not familiar enough with either to describe
> exactly how it works.

Thanks for this explanation; it's good to see this kind of high-level
decision/direction spelled out. When we need writable file systems,
we'll encode this in the type system.
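
One rough sketch of how that could look, building on the folio lock
bit suggestion above (assumptions: a `folio_lock` helper exposed to
Rust, and reusing `LockedFolio` as the write guard -- neither is
decided here):

impl Folio {
    /// Locks the folio, making the mutating methods (`write`,
    /// `zero_out`, ...) reachable through the returned guard. The lock
    /// bit provides the single-writer guarantee; it is released when
    /// the guard is dropped.
    pub fn lock(&self) -> LockedFolio<'_> {
        // SAFETY: The shared reference implies a non-zero refcount, so
        // the folio is valid; `folio_lock` sleeps until this thread
        // owns the lock.
        unsafe { bindings::folio_lock(self.0.get()) };

        // SAFETY: We hold the lock, and the shared borrow ensures the
        // folio outlives the returned guard.
        unsafe { LockedFolio::from_raw(self.0.get()) }
    }
}

Higher-level exclusion (rwsems, etc.) could be expressed the same way:
a type whose existence proves the right lock is held and that is the
only path to mutable access.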
Matthew Wilcox Oct. 23, 2023, 2:28 p.m. UTC | #11
On Mon, Oct 23, 2023 at 12:48:33PM +0200, Andreas Hindborg (Samsung) wrote:
> The build system and Rust compiler can inline and optimize across
> function calls and languages when LTO is enabled. Some patches are
> needed to make it work though.

That's fine, but something like folio_put() is performance-critical.

Relying on the linker to figure out that it _should_ inline through

+void rust_helper_folio_put(struct folio *folio)
+{
+	folio_put(folio);
+}
+EXPORT_SYMBOL_GPL(rust_helper_folio_put);

seems like a very bad idea to me.  For reference, folio_put() is
defined as:

static inline void folio_put(struct folio *folio)
{
        if (folio_put_testzero(folio))
                __folio_put(folio);
}

which turns into (once you work your way through all the gunk that hasn't
been cleaned up yet)

	if (atomic_dec_and_test(&folio->_refcount))
		__folio_put(folio)

ie it's a single dec-and-test insn followed by a conditional function
call.  Yes, there's some expensive debug you can turn on in there, but
it's an incredibly frequent call, and we shouldn't be relying on linker
magic to optimise it all away.

Of course, I don't want to lose the ability to turn on the debug code,
so folio.put() can't be as simple as the call to atomic_dec_and_test(),
but I hope you see my point.

Wedson wrote in a later email,
> Having said that, while it's possible to do what you suggested above,
> we try to avoid it so that maintainers can continue to have a single
> place they need to change if they ever decide to change things. A
> simple example from above is order(), if you decide to implement it
> differently (I don't know, if you change the flag, you decide to have
> an explicit field, whatever), then you'd have to change the C _and_
> the Rust versions. Worse yet, there's a chance that forgetting to
> update the Rust version wouldn't break the build, which would make it
> harder to catch mismatched versions.

I understand that concern!  Indeed, I did change the implementation
of folio_order() recently.  I'm happy to commit to keeping the Rust
implementation updated as I modify the C implementation of folios,
but I appreciate that other maintainers may not be willing to make such
a commitment.

I'm all the way up to Chapter 5: References in the Blandy book now!
I expect to understand the patches you're sending any week now ;-)
Ariel Miculas Oct. 24, 2023, 3:04 p.m. UTC | #12
On 23/10/23 03:28PM, Matthew Wilcox wrote:
(snip)
> I'm all the way up to Chapter 5: References in the Blandy book now!
> I expect to understand the patches you're sending any week now ;-)

That book (Programming Rust) has some great explanations, I went through
the first 7 chapters after reading "The Rust Programming Language" [1].
If you also want to dive into unsafe Rust, I recommend "Learn Rust With
Entirely Too Many Linked Lists" [2], it covers quite a few complex
concepts and it's very well written.

[1] https://doc.rust-lang.org/book/
[2] https://rust-unofficial.github.io/too-many-lists/

Cheers,
Ariel
diff mbox series

Patch

diff --git a/rust/bindings/bindings_helper.h b/rust/bindings/bindings_helper.h
index ca1898ce9527..53a99ea512d1 100644
--- a/rust/bindings/bindings_helper.h
+++ b/rust/bindings/bindings_helper.h
@@ -11,6 +11,7 @@ 
 #include <linux/fs.h>
 #include <linux/fs_context.h>
 #include <linux/slab.h>
+#include <linux/pagemap.h>
 #include <linux/refcount.h>
 #include <linux/wait.h>
 #include <linux/sched.h>
@@ -27,3 +28,5 @@  const slab_flags_t BINDINGS_SLAB_ACCOUNT = SLAB_ACCOUNT;
 const unsigned long BINDINGS_SB_RDONLY = SB_RDONLY;
 
 const loff_t BINDINGS_MAX_LFS_FILESIZE = MAX_LFS_FILESIZE;
+
+const size_t BINDINGS_PAGE_SIZE = PAGE_SIZE;
diff --git a/rust/bindings/lib.rs b/rust/bindings/lib.rs
index 426915d3fb57..a96b7f08e57d 100644
--- a/rust/bindings/lib.rs
+++ b/rust/bindings/lib.rs
@@ -59,3 +59,5 @@  mod bindings_helper {
 pub const SB_RDONLY: core::ffi::c_ulong = BINDINGS_SB_RDONLY;
 
 pub const MAX_LFS_FILESIZE: loff_t = BINDINGS_MAX_LFS_FILESIZE;
+
+pub const PAGE_SIZE: usize = BINDINGS_PAGE_SIZE;
diff --git a/rust/helpers.c b/rust/helpers.c
index c5a2bec6467d..f2ce3e7b688c 100644
--- a/rust/helpers.c
+++ b/rust/helpers.c
@@ -23,10 +23,14 @@ 
 #include <kunit/test-bug.h>
 #include <linux/bug.h>
 #include <linux/build_bug.h>
+#include <linux/cacheflush.h>
 #include <linux/err.h>
 #include <linux/errname.h>
 #include <linux/fs.h>
+#include <linux/highmem.h>
+#include <linux/mm.h>
 #include <linux/mutex.h>
+#include <linux/pagemap.h>
 #include <linux/refcount.h>
 #include <linux/sched/signal.h>
 #include <linux/spinlock.h>
@@ -145,6 +149,77 @@  struct kunit *rust_helper_kunit_get_current_test(void)
 }
 EXPORT_SYMBOL_GPL(rust_helper_kunit_get_current_test);
 
+void *rust_helper_kmap(struct page *page)
+{
+	return kmap(page);
+}
+EXPORT_SYMBOL_GPL(rust_helper_kmap);
+
+void rust_helper_kunmap(struct page *page)
+{
+	kunmap(page);
+}
+EXPORT_SYMBOL_GPL(rust_helper_kunmap);
+
+void rust_helper_folio_get(struct folio *folio)
+{
+	folio_get(folio);
+}
+EXPORT_SYMBOL_GPL(rust_helper_folio_get);
+
+void rust_helper_folio_put(struct folio *folio)
+{
+	folio_put(folio);
+}
+EXPORT_SYMBOL_GPL(rust_helper_folio_put);
+
+struct page *rust_helper_folio_page(struct folio *folio, size_t n)
+{
+	return folio_page(folio, n);
+}
+
+loff_t rust_helper_folio_pos(struct folio *folio)
+{
+	return folio_pos(folio);
+}
+EXPORT_SYMBOL_GPL(rust_helper_folio_pos);
+
+size_t rust_helper_folio_size(struct folio *folio)
+{
+	return folio_size(folio);
+}
+EXPORT_SYMBOL_GPL(rust_helper_folio_size);
+
+void rust_helper_folio_mark_uptodate(struct folio *folio)
+{
+	folio_mark_uptodate(folio);
+}
+EXPORT_SYMBOL_GPL(rust_helper_folio_mark_uptodate);
+
+void rust_helper_folio_set_error(struct folio *folio)
+{
+	folio_set_error(folio);
+}
+EXPORT_SYMBOL_GPL(rust_helper_folio_set_error);
+
+void rust_helper_flush_dcache_folio(struct folio *folio)
+{
+	flush_dcache_folio(folio);
+}
+EXPORT_SYMBOL_GPL(rust_helper_flush_dcache_folio);
+
+void *rust_helper_kmap_local_folio(struct folio *folio, size_t offset)
+{
+	return kmap_local_folio(folio, offset);
+}
+EXPORT_SYMBOL_GPL(rust_helper_kmap_local_folio);
+
+void rust_helper_kunmap_local(const void *vaddr)
+{
+	kunmap_local(vaddr);
+}
+EXPORT_SYMBOL_GPL(rust_helper_kunmap_local);
+
 void rust_helper_i_uid_write(struct inode *inode, uid_t uid)
 {
 	i_uid_write(inode, uid);
@@ -163,6 +238,12 @@  off_t rust_helper_i_size_read(const struct inode *inode)
 }
 EXPORT_SYMBOL_GPL(rust_helper_i_size_read);
 
+void rust_helper_mapping_set_large_folios(struct address_space *mapping)
+{
+	mapping_set_large_folios(mapping);
+}
+EXPORT_SYMBOL_GPL(rust_helper_mapping_set_large_folios);
+
 /*
  * `bindgen` binds the C `size_t` type as the Rust `usize` type, so we can
  * use it in contexts where Rust expects a `usize` like slice (array) indices.
diff --git a/rust/kernel/folio.rs b/rust/kernel/folio.rs
new file mode 100644
index 000000000000..ef8a08b97962
--- /dev/null
+++ b/rust/kernel/folio.rs
@@ -0,0 +1,215 @@ 
+// SPDX-License-Identifier: GPL-2.0
+
+//! Groups of contiguous pages, folios.
+//!
+//! C headers: [`include/linux/mm.h`](../../include/linux/mm.h)
+
+use crate::error::{code::*, Result};
+use crate::types::{ARef, AlwaysRefCounted, Opaque, ScopeGuard};
+use core::{cmp::min, ptr};
+
+/// Wraps the kernel's `struct folio`.
+///
+/// # Invariants
+///
+/// Instances of this type are always ref-counted, that is, a call to `folio_get` ensures that the
+/// allocation remains valid at least until the matching call to `folio_put`.
+#[repr(transparent)]
+pub struct Folio(pub(crate) Opaque<bindings::folio>);
+
+// SAFETY: The type invariants guarantee that `Folio` is always ref-counted.
+unsafe impl AlwaysRefCounted for Folio {
+    fn inc_ref(&self) {
+        // SAFETY: The existence of a shared reference means that the refcount is nonzero.
+        unsafe { bindings::folio_get(self.0.get()) };
+    }
+
+    unsafe fn dec_ref(obj: ptr::NonNull<Self>) {
+        // SAFETY: The safety requirements guarantee that the refcount is nonzero.
+        unsafe { bindings::folio_put(obj.cast().as_ptr()) }
+    }
+}
+
+impl Folio {
+    /// Tries to allocate a new folio.
+    ///
+    /// On success, returns a folio made up of 2^order pages.
+    pub fn try_new(order: u32) -> Result<UniqueFolio> {
+        if order > bindings::MAX_ORDER {
+            return Err(EDOM);
+        }
+
+        // SAFETY: We checked that `order` is within the max allowed value.
+        let f = ptr::NonNull::new(unsafe { bindings::folio_alloc(bindings::GFP_KERNEL, order) })
+            .ok_or(ENOMEM)?;
+
+        // SAFETY: The folio returned by `folio_alloc` is referenced. The ownership of the
+        // reference is transferred to the `ARef` instance.
+        Ok(UniqueFolio(unsafe { ARef::from_raw(f.cast()) }))
+    }
+
+    /// Returns the byte position of this folio in its file.
+    pub fn pos(&self) -> i64 {
+        // SAFETY: The folio is valid because the shared reference implies a non-zero refcount.
+        unsafe { bindings::folio_pos(self.0.get()) }
+    }
+
+    /// Returns the byte size of this folio.
+    pub fn size(&self) -> usize {
+        // SAFETY: The folio is valid because the shared reference implies a non-zero refcount.
+        unsafe { bindings::folio_size(self.0.get()) }
+    }
+
+    /// Flushes the data cache for the pages that make up the folio.
+    pub fn flush_dcache(&self) {
+        // SAFETY: The folio is valid because the shared reference implies a non-zero refcount.
+        unsafe { bindings::flush_dcache_folio(self.0.get()) }
+    }
+}
+
+/// A [`Folio`] that has a single reference to it.
+pub struct UniqueFolio(pub(crate) ARef<Folio>);
+
+impl UniqueFolio {
+    /// Maps the contents of a folio page into a slice.
+    pub fn map_page(&self, page_index: usize) -> Result<MapGuard<'_>> {
+        if page_index >= self.0.size() / bindings::PAGE_SIZE {
+            return Err(EDOM);
+        }
+
+        // SAFETY: We just checked that the index is within bounds of the folio.
+        let page = unsafe { bindings::folio_page(self.0 .0.get(), page_index) };
+
+        // SAFETY: `page` is valid because it was returned by `folio_page` above.
+        let ptr = unsafe { bindings::kmap(page) };
+
+        // SAFETY: We just mapped `ptr`, so it's valid for read.
+        let data = unsafe { core::slice::from_raw_parts(ptr.cast::<u8>(), bindings::PAGE_SIZE) };
+
+        Ok(MapGuard { data, page })
+    }
+}
+
+/// A mapped [`UniqueFolio`].
+pub struct MapGuard<'a> {
+    data: &'a [u8],
+    page: *mut bindings::page,
+}
+
+impl core::ops::Deref for MapGuard<'_> {
+    type Target = [u8];
+
+    fn deref(&self) -> &Self::Target {
+        self.data
+    }
+}
+
+impl Drop for MapGuard<'_> {
+    fn drop(&mut self) {
+        // SAFETY: A `MapGuard` instance is only created when `kmap` succeeds, so it's ok to unmap
+        // it when the guard is dropped.
+        unsafe { bindings::kunmap(self.page) };
+    }
+}
+
+/// A locked [`Folio`].
+pub struct LockedFolio<'a>(&'a Folio);
+
+impl LockedFolio<'_> {
+    /// Creates a new locked folio from a raw pointer.
+    ///
+    /// # Safety
+    ///
+    /// Callers must ensure that the folio is valid and locked. Additionally, that the
+    /// responsibility of unlocking is transferred to the new instance of [`LockedFolio`]. Lastly,
+    /// that the returned [`LockedFolio`] doesn't outlive the refcount that keeps it alive.
+    #[allow(dead_code)]
+    pub(crate) unsafe fn from_raw(folio: *const bindings::folio) -> Self {
+        let ptr = folio.cast();
+        // SAFETY: The safety requirements ensure that `folio` (from which `ptr` is derived) is
+        // valid and will remain valid while the `LockedFolio` instance lives.
+        Self(unsafe { &*ptr })
+    }
+
+    /// Marks the folio as being up to date.
+    pub fn mark_uptodate(&mut self) {
+        // SAFETY: The folio is valid because the shared reference implies a non-zero refcount.
+        unsafe { bindings::folio_mark_uptodate(self.0 .0.get()) }
+    }
+
+    /// Sets the error flag on the folio.
+    pub fn set_error(&mut self) {
+        // SAFETY: The folio is valid because the shared reference implies a non-zero refcount.
+        unsafe { bindings::folio_set_error(self.0 .0.get()) }
+    }
+
+    fn for_each_page(
+        &mut self,
+        offset: usize,
+        len: usize,
+        mut cb: impl FnMut(&mut [u8]) -> Result,
+    ) -> Result {
+        let mut remaining = len;
+        let mut next_offset = offset;
+
+        // Check that we don't overflow the folio.
+        let end = offset.checked_add(len).ok_or(EDOM)?;
+        if end > self.size() {
+            return Err(EINVAL);
+        }
+
+        while remaining > 0 {
+            let page_offset = next_offset & (bindings::PAGE_SIZE - 1);
+            let usable = min(remaining, bindings::PAGE_SIZE - page_offset);
+            // SAFETY: The folio is valid because the shared reference implies a non-zero refcount;
+            // `next_offset` is also guaranteed to be less than the folio size.
+            let ptr = unsafe { bindings::kmap_local_folio(self.0 .0.get(), next_offset) };
+
+            // SAFETY: `ptr` was just returned by the `kmap_local_folio` above.
+            let _guard = ScopeGuard::new(|| unsafe { bindings::kunmap_local(ptr) });
+
+            // SAFETY: `kmap_local_folio` maps the whole page, so we know it's mapped for at least
+            // `usable` bytes.
+            let s = unsafe { core::slice::from_raw_parts_mut(ptr.cast::<u8>(), usable) };
+            cb(s)?;
+
+            next_offset += usable;
+            remaining -= usable;
+        }
+
+        Ok(())
+    }
+
+    /// Writes the given slice into the folio.
+    pub fn write(&mut self, offset: usize, data: &[u8]) -> Result {
+        let mut remaining = data;
+
+        self.for_each_page(offset, data.len(), |s| {
+            s.copy_from_slice(&remaining[..s.len()]);
+            remaining = &remaining[s.len()..];
+            Ok(())
+        })
+    }
+
+    /// Writes zeroes into the folio.
+    pub fn zero_out(&mut self, offset: usize, len: usize) -> Result {
+        self.for_each_page(offset, len, |s| {
+            s.fill(0);
+            Ok(())
+        })
+    }
+}
+
+impl core::ops::Deref for LockedFolio<'_> {
+    type Target = Folio;
+    fn deref(&self) -> &Self::Target {
+        self.0
+    }
+}
+
+impl Drop for LockedFolio<'_> {
+    fn drop(&mut self) {
+        // SAFETY: The folio is valid because the shared reference implies a non-zero refcount.
+        unsafe { bindings::folio_unlock(self.0 .0.get()) }
+    }
+}
diff --git a/rust/kernel/lib.rs b/rust/kernel/lib.rs
index 00059b80c240..0e85b380da64 100644
--- a/rust/kernel/lib.rs
+++ b/rust/kernel/lib.rs
@@ -34,6 +34,7 @@ 
 mod allocator;
 mod build_assert;
 pub mod error;
+pub mod folio;
 pub mod fs;
 pub mod init;
 pub mod ioctl;