[RFC,v1,00/10] mm: Introduce and use folio_owner_ops

Message-ID: <20241108162040.159038-1-tabba@google.com>
Date: Fri, 8 Nov 2024 16:20:30 +0000
Subject: [RFC PATCH v1 00/10] mm: Introduce and use folio_owner_ops
From: Fuad Tabba <tabba@google.com>
To: linux-mm@kvack.org
Cc: kvm@vger.kernel.org, nouveau@lists.freedesktop.org,
    dri-devel@lists.freedesktop.org, david@redhat.com, rppt@kernel.org,
    jglisse@redhat.com, akpm@linux-foundation.org, muchun.song@linux.dev,
    simona@ffwll.ch, airlied@gmail.com, pbonzini@redhat.com,
    seanjc@google.com, willy@infradead.org, jgg@nvidia.com,
    jhubbard@nvidia.com, ackerleytng@google.com, vannapurve@google.com,
    mail@maciej.szmigiero.name, kirill.shutemov@linux.intel.com,
    quic_eberman@quicinc.com, maz@kernel.org, will@kernel.org,
    qperret@google.com, keirf@google.com, roypat@amazon.co.uk,
    tabba@google.com

Fuad Tabba Nov. 8, 2024, 4:20 p.m. UTC

Some folios, such as hugetlb folios and zone device folios,
require special handling when the folio's reference count reaches
0, before being freed. Moreover, guest_memfd folios will likely
require special handling to notify guest_memfd once a folio's reference
count reaches 0, to facilitate shared to private folio conversion
[*]. Currently, each use case has a dedicated callback when the
folio refcount reaches 0 to that effect. Adding yet more
callbacks is not ideal.

This patch series introduces struct folio_owner_ops and uses it
as a generic way to handle callbacks on freeing a folio. It also
applies the callbacks to hugetlb and zone device folios.

A pointer to struct folio_owner_ops is overlaid on struct page's
compound_head and on struct page/folio's lru. To indicate that the
folio uses the callback, this patch series sets bit 1 of the new
field, similar to how bit 0 of compound_head marks a tail page of
a compound page.
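
The overlay described above can be illustrated with a small userspace sketch. This is not the code from the series: the single `free` callback, the `FOLIO_OWNER_OPS_BIT` name, and the three helper functions are assumptions made for illustration. It only shows how a tagged pointer can share storage with a list head, relying on pointer alignment to keep bit 1 clear for genuine list pointers.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

struct folio;

/* Hypothetical ops table: one callback, run when the last reference is dropped. */
struct folio_owner_ops {
	void (*free)(struct folio *folio);
};

struct list_head {
	struct list_head *next, *prev;
};

/* Toy folio: the tagged ops pointer shares storage with lru.next. */
struct folio {
	union {
		struct list_head lru;
		uintptr_t owner_ops;	/* ops pointer | FOLIO_OWNER_OPS_BIT */
	};
};

/* Assumed name; bit 1 is used so that bit 0 keeps its existing meaning. */
#define FOLIO_OWNER_OPS_BIT ((uintptr_t)1 << 1)

static inline void folio_set_owner_ops(struct folio *folio,
				       const struct folio_owner_ops *ops)
{
	/* The ops table must be at least 4-byte aligned so bits 0-1 are zero. */
	folio->owner_ops = (uintptr_t)ops | FOLIO_OWNER_OPS_BIT;
}

static inline bool folio_has_owner_ops(const struct folio *folio)
{
	return folio->owner_ops & FOLIO_OWNER_OPS_BIT;
}

static inline const struct folio_owner_ops *
folio_get_owner_ops(const struct folio *folio)
{
	if (!folio_has_owner_ops(folio))
		return NULL;
	/* Strip the tag bit to recover the real pointer. */
	return (const struct folio_owner_ops *)
		(folio->owner_ops & ~FOLIO_OWNER_OPS_BIT);
}
```

The cost of this encoding is visible in the sketch: as soon as something writes a real list pointer into `lru`, the ops pointer is gone and has to be stashed elsewhere and restored later.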

Patches 1 to 6 rework the hugetlb code to allow us to reuse
folio->lru for the owner ops as long as the folios are not isolated.

Patches 7 to 10 introduce struct folio_owner_ops, and apply the
callbacks to zone device and hugetlb folios.

Cheers,
/fuad

[*] https://lore.kernel.org/all/CAGtprH_JP2w-4rq02h_Ugvq5KuHX7TUvegOS7xUs_iy5hriE7g@mail.gmail.com/

David Hildenbrand (6):
  mm/hugetlb: rename isolate_hugetlb() to folio_isolate_hugetlb()
  mm/migrate: don't call folio_putback_active_hugetlb() on dst hugetlb
    folio
  mm/hugetlb: rename "folio_putback_active_hugetlb()" to
    "folio_putback_hugetlb()"
  mm/hugetlb-cgroup: convert hugetlb_cgroup_css_offline() to work on
    folios
  mm/hugetlb: use folio->lru in demote_free_hugetlb_folios()
  mm/hugetlb: use separate folio->_hugetlb_list for hugetlb-internals

Fuad Tabba (4):
  mm: Introduce struct folio_owner_ops
  mm: Use getters and setters to access page pgmap
  mm: Use owner_ops on folio_put for zone device pages
  mm: hugetlb: Use owner_ops on folio_put for hugetlb

 drivers/gpu/drm/nouveau/nouveau_dmem.c |   4 +-
 drivers/pci/p2pdma.c                   |   8 +-
 include/linux/hugetlb.h                |  10 +-
 include/linux/memremap.h               |  14 +-
 include/linux/mm_types.h               | 107 ++++++++++++++-
 lib/test_hmm.c                         |   2 +-
 mm/gup.c                               |   2 +-
 mm/hmm.c                               |   2 +-
 mm/hugetlb.c                           | 179 ++++++++++++++++++-------
 mm/hugetlb_cgroup.c                    |  19 ++-
 mm/hugetlb_vmemmap.c                   |   8 +-
 mm/internal.h                          |   1 -
 mm/memory.c                            |   2 +-
 mm/mempolicy.c                         |   2 +-
 mm/memremap.c                          |  49 +------
 mm/migrate.c                           |  20 +--
 mm/migrate_device.c                    |   4 +-
 mm/mm_init.c                           |  48 ++++++-
 mm/swap.c                              |  25 ++--
 19 files changed, 342 insertions(+), 164 deletions(-)


base-commit: beb2622b970047000fa3cae64c23585669b01fca

Jason Gunthorpe Nov. 8, 2024, 5:05 p.m. UTC | #1

On Fri, Nov 08, 2024 at 04:20:30PM +0000, Fuad Tabba wrote:
> Some folios, such as hugetlb folios and zone device folios,
> require special handling when the folio's reference count reaches
> 0, before being freed. Moreover, guest_memfd folios will likely
> require special handling to notify guest_memfd once a folio's reference
> count reaches 0, to facilitate shared to private folio conversion
> [*]. Currently, each use case has a dedicated callback when the
> folio refcount reaches 0 to that effect. Adding yet more
> callbacks is not ideal.

Honestly, I question this thesis. How complex would it be to have 'yet
more callbacks'? Is the challenge really that the mm can't detect when
guestmemfd is the owner of the page because the page will be
ZONE_NORMAL?

So the point of this is really to allow ZONE_NORMAL pages to have a
per-allocator callback?

But this is also why I suggested to shift them to ZONE_DEVICE for
guestmemfd, because then you get these things for free from the pgmap.

(This is not a disagreement; this is a valid solution. It is a request
that you explain much more about what it is you actually need and
compare it with the other existing options.)

Jason

David Hildenbrand Nov. 8, 2024, 7:33 p.m. UTC | #2

On 08.11.24 18:05, Jason Gunthorpe wrote:
> On Fri, Nov 08, 2024 at 04:20:30PM +0000, Fuad Tabba wrote:
>> Some folios, such as hugetlb folios and zone device folios,
>> require special handling when the folio's reference count reaches
>> 0, before being freed. Moreover, guest_memfd folios will likely
>> require special handling to notify guest_memfd once a folio's reference
>> count reaches 0, to facilitate shared to private folio conversion
>> [*]. Currently, each use case has a dedicated callback when the
>> folio refcount reaches 0 to that effect. Adding yet more
>> callbacks is not ideal.
> 

Thanks for having a look!

Replying to clarify some things. Fuad, feel free to add additional 
information.

> Honestly, I question this thesis. How complex would it be to have 'yet
> more callbacks'? Is the challenge really that the mm can't detect when
> guestmemfd is the owner of the page because the page will be
> ZONE_NORMAL?

Fuad might have been a bit imprecise here: We don't want an ever growing 
list of checks+callbacks on the page freeing fast path.

This series replaces the two cases we have by a single generic one, 
which is nice independent of guest_memfd I think.

> 
> So the point of this is really to allow ZONE_NORMAL pages to have a
> per-allocator callback?

To intercept the refcount going to zero independent of any zones or
magic page types, with as little overhead as possible in the common page
freeing path.

It can be used to implement custom allocators, like factored out for 
hugetlb in this series. It's not necessarily limited to that, though. It 
can be used as a form of "asynchronous page ref freezing", where you get 
notified once all references are gone.
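
A minimal userspace sketch of that idea follows. It is a toy model, not kernel code: `toy_folio`, `toy_owner_ops`, and the `notified`/`freed` fields are hypothetical names introduced here. The last reference drop is detected with an atomic decrement, and an owner-supplied callback runs in place of the default free path, which is what lets an owner learn that all references are gone.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

struct toy_folio;

/* Hypothetical owner callback table. */
struct toy_owner_ops {
	void (*free)(struct toy_folio *folio);
};

struct toy_folio {
	atomic_int refcount;
	const struct toy_owner_ops *owner_ops;	/* NULL: ordinary folio */
	int notified;	/* set by the demo owner callback */
	int freed;	/* set by the default free path */
};

/* Stand-in for handing the memory back to the default allocator. */
static void toy_default_free(struct toy_folio *folio)
{
	folio->freed = 1;
}

static void toy_folio_put(struct toy_folio *folio)
{
	/* atomic_fetch_sub() returns the old value; 1 means this was the last ref. */
	if (atomic_fetch_sub(&folio->refcount, 1) != 1)
		return;

	/* The owner intercepts the free; the default path is skipped. */
	if (folio->owner_ops && folio->owner_ops->free) {
		folio->owner_ops->free(folio);
		return;
	}
	toy_default_free(folio);
}

/* Demo owner: records that the last reference went away. */
static void toy_owner_notify(struct toy_folio *folio)
{
	folio->notified = 1;
}

static const struct toy_owner_ops toy_ops = { .free = toy_owner_notify };
```

An owner that sets `owner_ops` to `&toy_ops` keeps control of the memory after the final put; a folio with a NULL `owner_ops` goes through `toy_default_free()` as before.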

(I might have another use case with PageOffline, where we want to
prevent the virtio-mem ones from getting accidentally leaked into
the buddy during memory offlining with speculative references --
virtio_mem_fake_offline_going_offline() contains the interesting bits.
But I did not look into the dirty details yet; it is just a thought about
where we'd want to intercept the refcount going to 0.)

> 
> But this is also why I suggested to shift them to ZONE_DEVICE for
> guestmemfd, because then you get these things for free from the pgmap.

With this series even hugetlb gets it for "free", and hugetlb is not 
quite the nail for the ZONE_DEVICE hammer IMHO :)

For things we can statically set aside early during boot and never
really want to return to the buddy/another allocator, I would agree that
static ZONE_DEVICE would have been possible.

Whenever the buddy or other allocators are involved, and we might have
a granularity as small as a handful of pages (e.g., taken from the buddy),
getting ZONE_DEVICE involved is not a good (or even feasible) approach.

After all, all we want is to intercept the refcount going to 0.

Fuad Tabba Nov. 11, 2024, 8:26 a.m. UTC | #3

Hi Jason and David,

On Fri, 8 Nov 2024 at 19:33, David Hildenbrand <david@redhat.com> wrote:
>
> On 08.11.24 18:05, Jason Gunthorpe wrote:
> > On Fri, Nov 08, 2024 at 04:20:30PM +0000, Fuad Tabba wrote:
> >> Some folios, such as hugetlb folios and zone device folios,
> >> require special handling when the folio's reference count reaches
> >> 0, before being freed. Moreover, guest_memfd folios will likely
> >> require special handling to notify guest_memfd once a folio's reference
> >> count reaches 0, to facilitate shared to private folio conversion
> >> [*]. Currently, each use case has a dedicated callback when the
> >> folio refcount reaches 0 to that effect. Adding yet more
> >> callbacks is not ideal.
> >
>
> Thanks for having a look!
>
> Replying to clarify some things. Fuad, feel free to add additional
> information.

Thanks for your comments Jason, and for clarifying my cover letter
David. I think David has covered everything, and I'll make sure to
clarify this in the cover letter when I respin.

Cheers,
/fuad

>
> > Honestly, I question this thesis. How complex would it be to have 'yet
> > more callbacks'? Is the challenge really that the mm can't detect when
> > guestmemfd is the owner of the page because the page will be
> > ZONE_NORMAL?
>
> Fuad might have been a bit imprecise here: We don't want an ever growing
> list of checks+callbacks on the page freeing fast path.
>
> This series replaces the two cases we have by a single generic one,
> which is nice independent of guest_memfd I think.
>
> >
> > So the point of this is really to allow ZONE_NORMAL pages to have a
> > per-allocator callback?
>
> To intercept the refcount going to zero independent of any zones or
> magic page types, with as little overhead as possible in the common page
> freeing path.
>
> It can be used to implement custom allocators, like factored out for
> hugetlb in this series. It's not necessarily limited to that, though. It
> can be used as a form of "asynchronous page ref freezing", where you get
> notified once all references are gone.
>
> (I might have another use case with PageOffline, where we want to
> prevent the virtio-mem ones from getting accidentally leaked into
> the buddy during memory offlining with speculative references --
> virtio_mem_fake_offline_going_offline() contains the interesting bits.
> But I did not look into the dirty details yet; it is just a thought about
> where we'd want to intercept the refcount going to 0.)
>
> >
> > But this is also why I suggested to shift them to ZONE_DEVICE for
> > guestmemfd, because then you get these things for free from the pgmap.
>
> With this series even hugetlb gets it for "free", and hugetlb is not
> quite the nail for the ZONE_DEVICE hammer IMHO :)
>
> For things we can statically set aside early during boot and never
> really want to return to the buddy/another allocator, I would agree that
> static ZONE_DEVICE would have been possible.
>
> Whenever the buddy or other allocators are involved, and we might have
> a granularity as small as a handful of pages (e.g., taken from the buddy),
> getting ZONE_DEVICE involved is not a good (or even feasible) approach.
>
> After all, all we want is to intercept the refcount going to 0.
>
> --
> Cheers,
>
> David / dhildenb
>

Matthew Wilcox Nov. 12, 2024, 5:26 a.m. UTC | #4

On Mon, Nov 11, 2024 at 08:26:54AM +0000, Fuad Tabba wrote:
> Thanks for your comments Jason, and for clarifying my cover letter
> David. I think David has covered everything, and I'll make sure to
> clarify this in the cover letter when I respin.

I don't want you to respin.  I think this is a bad idea.

David Hildenbrand Nov. 12, 2024, 9:10 a.m. UTC | #5

On 12.11.24 06:26, Matthew Wilcox wrote:
> On Mon, Nov 11, 2024 at 08:26:54AM +0000, Fuad Tabba wrote:
>> Thanks for your comments Jason, and for clarifying my cover letter
>> David. I think David has covered everything, and I'll make sure to
>> clarify this in the cover letter when I respin.
> 
> I don't want you to respin.  I think this is a bad idea.

I'm hoping you'll find some more time to explain what exactly you don't 
like, because this series only refactors what we already have.

I enjoy seeing the special casing (especially hugetlb) gone from mm/swap.c.

I don't particularly enjoy overlaying folio->lru, primarily because we 
have to temporarily "evacuate" it when someone wants to make use of 
folio->lru (e.g., hugetlb isolation). So it's not completely "sticky", 
at least for hugetlb.

Overlaying folio->mapping, similar to how "struct movable_operations"
overlays page->mapping, is not an option, because folio->mapping will be
used for other purposes.

We'd need some sticky and reliable way to tell folio freeing code that 
someone wants to intercept when the refcount of that folio goes to 0, 
and identify who to notify.

Maybe folio->private/page->private could be overlaid? hugetlb only uses
folio->private for flags, which we could move to some other tail page 
(e.g., simply putting them into flags1).

Jason Gunthorpe Nov. 12, 2024, 1:53 p.m. UTC | #6

On Tue, Nov 12, 2024 at 10:10:06AM +0100, David Hildenbrand wrote:
> On 12.11.24 06:26, Matthew Wilcox wrote:
> > On Mon, Nov 11, 2024 at 08:26:54AM +0000, Fuad Tabba wrote:
> > > Thanks for your comments Jason, and for clarifying my cover letter
> > > David. I think David has covered everything, and I'll make sure to
> > > clarify this in the cover letter when I respin.
> > 
> > I don't want you to respin.  I think this is a bad idea.
> 
> I'm hoping you'll find some more time to explain what exactly you don't
> like, because this series only refactors what we already have.
> 
> I enjoy seeing the special casing (especially hugetlb) gone from mm/swap.c.
> 
> I don't particularly enjoy overlaying folio->lru, primarily because we have
> to temporarily "evacuate" it when someone wants to make use of folio->lru
> (e.g., hugetlb isolation). So it's not completely "sticky", at least for
> hugetlb.

This is really the worst part of it though

And, IMHO, seems like overkill. We have only a handful of cases -
maybe we shouldn't be trying to get to full generality but just handle
a couple of cases directly? I don't really think it is such a bad
thing to have an if ladder on the free path if we have only a couple
things. It certainly looks better than doing overlaying tricks.

Also how does this translate to Matthew's memdesc world? 

Jason

David Hildenbrand Nov. 12, 2024, 2:22 p.m. UTC | #7

On 12.11.24 14:53, Jason Gunthorpe wrote:
> On Tue, Nov 12, 2024 at 10:10:06AM +0100, David Hildenbrand wrote:
>> On 12.11.24 06:26, Matthew Wilcox wrote:
>>> On Mon, Nov 11, 2024 at 08:26:54AM +0000, Fuad Tabba wrote:
>>>> Thanks for your comments Jason, and for clarifying my cover letter
>>>> David. I think David has covered everything, and I'll make sure to
>>>> clarify this in the cover letter when I respin.
>>>
>>> I don't want you to respin.  I think this is a bad idea.
>>
>> I'm hoping you'll find some more time to explain what exactly you don't
>> like, because this series only refactors what we already have.
>>
>> I enjoy seeing the special casing (especially hugetlb) gone from mm/swap.c.
>>
>> I don't particularly enjoy overlaying folio->lru, primarily because we have
>> to temporarily "evacuate" it when someone wants to make use of folio->lru
>> (e.g., hugetlb isolation). So it's not completely "sticky", at least for
>> hugetlb.
> 
> This is really the worst part of it though

Yes.

> 
> And, IMHO, seems like overkill. We have only a handful of cases -
> maybe we shouldn't be trying to get to full generality but just handle
> a couple of cases directly? I don't really think it is such a bad
> thing to have an if ladder on the free path if we have only a couple
> things. It certainly looks better than doing overlaying tricks.

I'd really like to abstract hugetlb handling if possible. The way it 
stands it's just very odd.

We'll need some reliable way to identify these folios that need care. 
guest_memfd will be using folio->mapcount for now, so for now we 
couldn't set a page type like hugetlb does.

> Also how does this translate to Matthew's memdesc world?

guest_memfd and hugetlb would be operating on folios (at least for now), 
which contain the refcount,lru,private, ... so nothing special there.

Once we have actually decoupled "struct folio" from "struct page", we
*might* have to play fewer tricks, because we could just have a callback
pointer there. But well, maybe we also want to save some space in there.

Do we want dedicated memdescs for hugetlb/guest_memfd that extend folios 
in the future? I don't know, maybe.

I'm currently wondering if we can use folio->private for the time being. 
Either

(a) If folio->private is still set once the refcount drops to 0, it 
indicates that there is a freeing callback/owner_ops. We'll have to make 
hugetlb not use folio->private and convert others to clear 
folio->private before freeing.

(b) Use bitX of folio->private to indicate that this has "owner_ops" 
meaning. We'll have to make hugetlb not use folio->private and make 
others not use bitX. Might be harder and overkill, because right now we 
only really need the callback when refcount==0.

(c) Use some other indication that folio->private contains folio_ops.

Matthew Wilcox Nov. 13, 2024, 4:57 a.m. UTC | #8

On Tue, Nov 12, 2024 at 03:22:46PM +0100, David Hildenbrand wrote:
> On 12.11.24 14:53, Jason Gunthorpe wrote:
> > On Tue, Nov 12, 2024 at 10:10:06AM +0100, David Hildenbrand wrote:
> > > On 12.11.24 06:26, Matthew Wilcox wrote:
> > > > I don't want you to respin.  I think this is a bad idea.
> > > 
> > > I'm hoping you'll find some more time to explain what exactly you don't
> > > like, because this series only refactors what we already have.
> > > 
> > > I enjoy seeing the special casing (especially hugetlb) gone from mm/swap.c.

I don't.  The list of 'if's is better than the indirect function call.
That's terribly expensive, and the way we reuse the lru.next field
is fragile.  Not to mention that it introduces a new thing for the
hardening people to fret over.

> > And, IMHO, seems like overkill. We have only a handful of cases -
> > maybe we shouldn't be trying to get to full generality but just handle
> > a couple of cases directly? I don't really think it is such a bad
> > thing to have an if ladder on the free path if we have only a couple
> > things. It certainly looks better than doing overlaying tricks.
> 
> I'd really like to abstract hugetlb handling if possible. The way it stands
> it's just very odd.

There might be ways to make that better.  I haven't really been looking
too hard at making that special handling go away.

> We'll need some reliable way to identify these folios that need care.
> guest_memfd will be using folio->mapcount for now, so for now we couldn't
> set a page type like hugetlb does.

If hugetlb can set lru.next at a certain point, then guestmemfd could
set a page type at a similar point, no?

> > Also how does this translate to Matthew's memdesc world?

In a memdesc world, pages no longer have a refcount.  We might still
have put_page() which will now be a very complicated (and out-of-line)
function that looks up what kind of memdesc it is and operates on the
memdesc's refcount ... if it has one.  I don't know if it'll be exported
to modules; I can see uses in the mm code, but I'm not sure if modules
will have a need.

Each memdesc type will have its own function to call to free the memdesc.
So we'll still have folio_put().  But slab does not have, need nor want
a refcount, so it'll just slab_free().  I expect us to keep around a
list of recently-freed memdescs of a particular type with their pages
still attached so that we can allocate them again quickly (or reclaim
them under memory pressure).  Once that freelist overflows, we'll free
a batch of them to the buddy allocator (for the pages) and the slab
allocator (for the memdesc itself).
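
The freelist behaviour described above can be modelled with a short sketch. No such code exists in this thread: the `memdesc_cache` structure, the capacity and batch constants, and `release_to_allocators()` are all invented here as stand-ins. It shows only the general shape, a bounded per-type cache that hands a batch back to the lower-level allocators once it overflows.

```c
#include <assert.h>
#include <stddef.h>

#define CACHE_MAX   4	/* assumed capacity of the per-type freelist */
#define CACHE_BATCH 2	/* assumed number of entries released on overflow */

/* Toy descriptor; a real one would still have its pages attached. */
struct memdesc {
	struct memdesc *next;
};

struct memdesc_cache {
	struct memdesc *head;
	size_t count;
	size_t released;	/* entries handed back to the lower allocators */
};

/* Stand-in for freeing the pages to the buddy and the descriptor to slab. */
static void release_to_allocators(struct memdesc_cache *c, struct memdesc *d)
{
	d->next = NULL;
	c->released++;
}

static void cache_free(struct memdesc_cache *c, struct memdesc *d)
{
	d->next = c->head;
	c->head = d;
	if (++c->count <= CACHE_MAX)
		return;

	/* Overflow: release a batch. This toy drops the newest entries first. */
	for (int i = 0; i < CACHE_BATCH && c->head; i++) {
		struct memdesc *victim = c->head;

		c->head = victim->next;
		c->count--;
		release_to_allocators(c, victim);
	}
}

static struct memdesc *cache_alloc(struct memdesc_cache *c)
{
	struct memdesc *d = c->head;

	if (!d)
		return NULL;	/* caller would fall back to a fresh allocation */
	c->head = d->next;
	c->count--;
	d->next = NULL;
	return d;
}
```

A real implementation would also need locking or per-CPU lists and a shrinker hook for reclaim under memory pressure; those are left out to keep the sketch short.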

> guest_memfd and hugetlb would be operating on folios (at least for now),
> which contain the refcount,lru,private, ... so nothing special there.
> 
> Once we have actually decoupled "struct folio" from "struct page", we
> *might* have to play fewer tricks, because we could just have a callback
> pointer there. But well, maybe we also want to save some space in there.
> 
> Do we want dedicated memdescs for hugetlb/guest_memfd that extend folios in
> the future? I don't know, maybe.

I've certainly considered going so far as a per-fs folio.  So we'd
have an ext4_folio, a btrfs_folio, an iomap_folio, etc.  That'd let us
get rid of folio->private, but I'm not sure that C's type system can
really handle this nicely.  Maybe in a Rust world ;-)

What I'm thinking about is that I'd really like to be able to declare
that all the functions in ext4_aops only accept pointers to ext4_folio,
so ext4_dirty_folio() can't be called with pointers to _any_ folio,
but specifically folios which were previously allocated for ext4.

I don't know if Rust lets you do something like that.
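
For comparison, the closest plain C gets to the per-filesystem folio type discussed above is a wrapper struct. The sketch below is a userspace illustration only: `ext4_dirty_folio_sketch()` and `to_ext4_folio()` are hypothetical names, the one-field `struct folio` is a toy, and `container_of()` is redefined locally. It shows both what C can check (a distinct pointer type at the call site) and what it cannot (the downcast from a generic folio).

```c
#include <assert.h>
#include <stddef.h>

/* Toy generic folio. */
struct folio {
	unsigned long flags;
};

/* Hypothetical per-filesystem wrapper: a distinct C type around a folio. */
struct ext4_folio {
	struct folio folio;
};

/* Local userspace copy of the usual container_of() idiom. */
#define container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

/*
 * Takes only an ext4_folio. Passing a plain struct folio * here is an
 * incompatible pointer type, which the compiler diagnoses.
 */
static int ext4_dirty_folio_sketch(struct ext4_folio *efolio)
{
	efolio->folio.flags |= 1UL;	/* pretend bit 0 means "dirty" */
	return 1;
}

/*
 * Generic code only holds a struct folio *, so getting back to the
 * wrapper is an unchecked downcast: nothing proves the folio really
 * was allocated for ext4.
 */
static struct ext4_folio *to_ext4_folio(struct folio *folio)
{
	return container_of(folio, struct ext4_folio, folio);
}
```

The weak point is `to_ext4_folio()`: any caller that starts from a generic folio pointer has to assert the type rather than have it verified, which is the gap a stronger type system would close.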

> I'm currently wondering if we can use folio->private for the time being.
> Either
> 
> (a) If folio->private is still set once the refcount drops to 0, it
> indicates that there is a freeing callback/owner_ops. We'll have to make
> hugetlb not use folio->private and convert others to clear folio->private
> before freeing.
> 
> (b) Use bitX of folio->private to indicate that this has "owner_ops"
> meaning. We'll have to make hugetlb not use folio->private and make others
> not use bitX. Might be harder and overkill, because right now we only really
> need the callback when refcount==0.
> 
> (c) Use some other indication that folio->private contains folio_ops.

I really don't want to use folio_ops / folio_owner_ops.  I read
https://lore.kernel.org/all/CAGtprH_JP2w-4rq02h_Ugvq5KuHX7TUvegOS7xUs_iy5hriE7g@mail.gmail.com/
and I still don't understand what you're trying to do.

Would it work to use aops->free_folio() to notify you when the folio is
being removed from the address space?

David Hildenbrand Nov. 13, 2024, 11:27 a.m. UTC | #9

On 13.11.24 05:57, Matthew Wilcox wrote:
> On Tue, Nov 12, 2024 at 03:22:46PM +0100, David Hildenbrand wrote:
>> On 12.11.24 14:53, Jason Gunthorpe wrote:
>>> On Tue, Nov 12, 2024 at 10:10:06AM +0100, David Hildenbrand wrote:
>>>> On 12.11.24 06:26, Matthew Wilcox wrote:
>>>>> I don't want you to respin.  I think this is a bad idea.
>>>>
>>>> I'm hoping you'll find some more time to explain what exactly you don't
>>>> like, because this series only refactors what we already have.
>>>>
>>>> I enjoy seeing the special casing (especially hugetlb) gone from mm/swap.c.
> 
> I don't.  The list of 'if's is better than the indirect function call.
> That's terribly expensive, and the way we reuse the lru.next field
> is fragile.  Not to mention that it introduces a new thing for the
> hardening people to fret over.

Right, indirect calls are nasty and probably more fragile/insecure, but this is
really what ZONE_DEVICE is already using internally ... but I agree that it is
less desirable to abstract that.

[...]

>> We'll need some reliable way to identify these folios that need care.
>> guest_memfd will be using folio->mapcount for now, so for now we couldn't
>> set a page type like hugetlb does.
> 
> If hugetlb can set lru.next at a certain point, then guestmemfd could
> set a page type at a similar point, no?

The main problem is that we will have to set it on small folios that can be
mapped to user space. As long as mapcount overlays page_type, that's
... not going to work.

We won't be truncating+freeing the folio as long as it is mapped to user space,
though. So one workaround would be to set the page type only as long as it
isn't mapped to user space.

Won't win a beauty prize, but could be one workaround until we have decoupled
the type from the mapcount.
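
To make the conflict concrete, here is a toy userspace model of a word shared between a mapcount and a page type. It is a sketch under assumptions: the `toy_` names are invented, and the threshold comparison in `toy_page_has_type()` is a simplification rather than the kernel's exact check. The top-byte encoding mirrors the `>> 24` used in the draft patch further down, and 0xf8 is the value that patch proposes for guestmem.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

enum toy_pagetype {
	TOY_PGTY_guestmem = 0xf8,		/* value proposed in the draft patch */
	TOY_PGTY_mapcount_underflow = 0xff,
};

struct toy_page {
	union {
		int32_t mapcount;	/* stored biased by -1: -1 means "not mapped" */
		uint32_t page_type;	/* type lives in the top byte */
	};
};

static bool toy_page_has_type(const struct toy_page *page)
{
	/*
	 * Read as a signed value, a typed word is far more negative than
	 * any plausible mapcount. The conversions to int32_t assume the
	 * usual two's complement wraparound.
	 */
	return (int32_t)page->page_type <
	       (int32_t)((uint32_t)TOY_PGTY_mapcount_underflow << 24);
}

static int toy_page_get_type(const struct toy_page *page)
{
	return page->page_type >> 24;
}

static void toy_page_set_type(struct toy_page *page, enum toy_pagetype type)
{
	page->page_type = (uint32_t)type << 24;
}
```

Once the page is mapped and the mapcount is written, the type is overwritten, which is why the workaround above only sets the type while the folio is not mapped to user space.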

[...]

> 
>> guest_memfd and hugetlb would be operating on folios (at least for now),
>> which contain the refcount,lru,private, ... so nothing special there.
>>
>> Once we have actually decoupled "struct folio" from "struct page", we
>> *might* have to play fewer tricks, because we could just have a callback
>> pointer there. But well, maybe we also want to save some space in there.
>>
>> Do we want dedicated memdescs for hugetlb/guest_memfd that extend folios in
>> the future? I don't know, maybe.
> 
> I've certainly considered going so far as a per-fs folio.  So we'd
> have an ext4_folio, a btrfs_folio, an iomap_folio, etc.  That'd let us
> get rid of folio->private, but I'm not sure that C's type system can
> really handle this nicely.  Maybe in a Rust world ;-)

:)

> 
> What I'm thinking about is that I'd really like to be able to declare
> that all the functions in ext4_aops only accept pointers to ext4_folio,
> so ext4_dirty_folio() can't be called with pointers to _any_ folio,
> but specifically folios which were previously allocated for ext4.

Yes, that sounds reasonable. hugetlb definitely might be another such candidate.

> 
> I don't know if Rust lets you do something like that.
> 
>> I'm currently wondering if we can use folio->private for the time being.
>> Either
>>
>> (a) If folio->private is still set once the refcount drops to 0, it
>> indicates that there is a freeing callback/owner_ops. We'll have to make
>> hugetlb not use folio->private and convert others to clear folio->private
>> before freeing.
>>
>> (b) Use bitX of folio->private to indicate that this has "owner_ops"
>> meaning. We'll have to make hugetlb not use folio->private and make others
>> not use bitX. Might be harder and overkill, because right now we only really
>> need the callback when refcount==0.
>>
>> (c) Use some other indication that folio->private contains folio_ops.
> 
> I really don't want to use folio_ops / folio_owner_ops.

Yes, and I understand your reasoning. It was one approach to work around
the page_type vs. mapcount issue and to avoid more checks on the freeing path.

If we manage to make the page type fly, the following could work and leave
the ordinary folio freeing path as fast as before (and allow me to add the
PGTY_offline handling :) ):

From 5a55e4bcf4d6cfa64d3383a7cf6649042cedcecb Mon Sep 17 00:00:00 2001
From: David Hildenbrand <david@redhat.com>
Date: Wed, 13 Nov 2024 12:20:58 +0100
Subject: [PATCH] tmp

Signed-off-by: David Hildenbrand <david@redhat.com>
---
  include/linux/page-flags.h | 11 +++++++++++
  mm/swap.c                  | 27 ++++++++++++++++++++++-----
  2 files changed, 33 insertions(+), 5 deletions(-)

diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
index e80665bc51fac..ebf89075eeb5f 100644
--- a/include/linux/page-flags.h
+++ b/include/linux/page-flags.h
@@ -950,6 +950,7 @@ enum pagetype {
  	PGTY_slab	= 0xf5,
  	PGTY_zsmalloc	= 0xf6,
  	PGTY_unaccepted	= 0xf7,
+	PGTY_guestmem	= 0xf8,
  
  	PGTY_mapcount_underflow = 0xff
  };
@@ -970,6 +971,16 @@ static inline bool page_has_type(const struct page *page)
  	return page_mapcount_is_type(data_race(page->page_type));
  }
  
+static inline bool folio_has_type(const struct folio *folio)
+{
+	return page_has_type(&folio->page);
+}
+
+static inline int folio_get_type(const struct folio *folio)
+{
+	return folio->page.page_type >> 24;
+}
+
  #define FOLIO_TYPE_OPS(lname, fname)					\
  static __always_inline bool folio_test_##fname(const struct folio *folio) \
  {									\
diff --git a/mm/swap.c b/mm/swap.c
index 10decd9dffa17..bf4efc7bba18a 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -94,6 +94,22 @@ static void page_cache_release(struct folio *folio)
  		unlock_page_lruvec_irqrestore(lruvec, flags);
  }
  
+static void free_typed_folio(struct folio *folio)
+{
+	switch (folio_get_type(folio)) {
+	case PGTY_hugetlb:
+		free_huge_folio(folio);
+		return;
+	case PGTY_offline:
+		/* Nothing to do, it's offline. */
+		return;
+	case PGTY_guestmem:
+		// free_guestmem_folio(folio);
+	default:
+		WARN_ON_ONCE(1);
+	}
+}
+
  void __folio_put(struct folio *folio)
  {
  	if (unlikely(folio_is_zone_device(folio))) {
@@ -101,8 +117,8 @@ void __folio_put(struct folio *folio)
  		return;
  	}
  
-	if (folio_test_hugetlb(folio)) {
-		free_huge_folio(folio);
+	if (unlikely(folio_has_type(folio))) {
+		free_typed_folio(folio);
  		return;
  	}
  
@@ -934,15 +950,16 @@ void folios_put_refs(struct folio_batch *folios, unsigned int *refs)
  		if (!folio_ref_sub_and_test(folio, nr_refs))
  			continue;
  
-		/* hugetlb has its own memcg */
-		if (folio_test_hugetlb(folio)) {
+		if (unlikely(folio_has_type(folio))) {
+			/* typed folios have their own memcg, if any */
  			if (lruvec) {
  				unlock_page_lruvec_irqrestore(lruvec, flags);
  				lruvec = NULL;
  			}
-			free_huge_folio(folio);
+			free_typed_folio(folio);
  			continue;
  		}
+
  		folio_unqueue_deferred_split(folio);
  		__page_cache_release(folio, &lruvec, &flags);

John Hubbard Nov. 14, 2024, 4:02 a.m. UTC | #10

On 11/12/24 8:57 PM, Matthew Wilcox wrote:
> On Tue, Nov 12, 2024 at 03:22:46PM +0100, David Hildenbrand wrote:
>> On 12.11.24 14:53, Jason Gunthorpe wrote:
>>> On Tue, Nov 12, 2024 at 10:10:06AM +0100, David Hildenbrand wrote:
>>>> On 12.11.24 06:26, Matthew Wilcox wrote:
...
> I've certainly considered going so far as a per-fs folio.  So we'd
> have an ext4_folio, a btrfs_folio, an iomap_folio, etc.  That'd let us
> get rid of folio->private, but I'm not sure that C's type system can
> really handle this nicely.  Maybe in a Rust world ;-)
> 
> What I'm thinking about is that I'd really like to be able to declare
> that all the functions in ext4_aops only accept pointers to ext4_folio,
> so ext4_dirty_folio() can't be called with pointers to _any_ folio,
> but specifically folios which were previously allocated for ext4.
> 
> I don't know if Rust lets you do something like that.
> 

As a Rust-for-Linux student, I can answer that one: "yes".

Some combination of "newtypes" and Traits will provide exactly what you
need here. newtypes provide a zero-overhead type safe way of specifying
a type, and Traits can be used to require that only types that support
specific operations are accepted in foo().

(Rust at the language level looks a lot more like a replacement for C++,
than a replacement for C, imho. By which I mean, it has lots of goodies
for expressing things, built right into the language.)

thanks,
