Message ID | 20241108162040.159038-1-tabba@google.com (mailing list archive) |
---|---|
Series | mm: Introduce and use folio_owner_ops |
On Fri, Nov 08, 2024 at 04:20:30PM +0000, Fuad Tabba wrote:
> Some folios, such as hugetlb folios and zone device folios,
> require special handling when the folio's reference count reaches
> 0, before being freed. Moreover, guest_memfd folios will likely
> require special handling to notify it once a folio's reference
> count reaches 0, to facilitate shared to private folio conversion
> [*]. Currently, each usecase has a dedicated callback when the
> folio refcount reaches 0 to that effect. Adding yet more
> callbacks is not ideal.

Honestly, I question this thesis. How complex would it be to have 'yet
more callbacks'? Is the challenge really that the mm can't detect when
guestmemfd is the owner of the page because the page will be
ZONE_NORMAL?

So the point of this is really to allow ZONE_NORMAL pages to have a
per-allocator callback?

But this is also why I suggested to shift them to ZONE_DEVICE for
guestmemfd, because then you get these things for free from the pgmap.

(This is not a disagreement that this is a valid solution, but a request
that you explain much more about what it is you actually need and
compare it with the other existing options.)

Jason
On 08.11.24 18:05, Jason Gunthorpe wrote:
> On Fri, Nov 08, 2024 at 04:20:30PM +0000, Fuad Tabba wrote:
>> Some folios, such as hugetlb folios and zone device folios,
>> require special handling when the folio's reference count reaches
>> 0, before being freed. Moreover, guest_memfd folios will likely
>> require special handling to notify it once a folio's reference
>> count reaches 0, to facilitate shared to private folio conversion
>> [*]. Currently, each usecase has a dedicated callback when the
>> folio refcount reaches 0 to that effect. Adding yet more
>> callbacks is not ideal.

Thanks for having a look!

Replying to clarify some things. Fuad, feel free to add additional
information.

> Honestly, I question this thesis. How complex would it be to have 'yet
> more callbacks'? Is the challenge really that the mm can't detect when
> guestmemfd is the owner of the page because the page will be
> ZONE_NORMAL?

Fuad might have been a bit imprecise here: we don't want an ever-growing
list of checks+callbacks on the page freeing fast path.

This series replaces the two cases we already have with a single generic
one, which is nice independent of guest_memfd, I think.

> So the point of this is really to allow ZONE_NORMAL pages to have a
> per-allocator callback?

To intercept the refcount going to zero independent of any zones or
magic page types, with as little overhead as possible in the common page
freeing path.

It can be used to implement custom allocators, like the one factored out
for hugetlb in this series. It's not necessarily limited to that,
though. It can be used as a form of "asynchronous page ref freezing",
where you get notified once all references are gone.

(I might have another use case with PageOffline, where we want to
prevent the virtio-mem-owned ones from getting accidentally leaked into
the buddy during memory offlining with speculative references --
virtio_mem_fake_offline_going_offline() contains the interesting bits.
But I did not look into the dirty details yet, just some thoughts on
where we'd want to intercept the refcount going to 0.)

> But this is also why I suggested to shift them to ZONE_DEVICE for
> guestmemfd, because then you get these things for free from the pgmap.

With this series even hugetlb gets it for "free", and hugetlb is not
quite the nail for the ZONE_DEVICE hammer IMHO :)

For things we can statically set aside early during boot and never
really want to return to the buddy/another allocator, I would agree that
static ZONE_DEVICE would have been possible.

Whenever the buddy or other allocators are involved, and we might have
a granularity of a handful of pages (e.g., taken from the buddy),
getting ZONE_DEVICE involved is not a good (or even feasible) approach.

After all, all we want is to intercept the refcount going to 0.
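To make the mechanism David describes concrete (a single owner-supplied
callback invoked when the refcount hits zero, instead of per-owner checks
on the freeing path), here is a minimal userspace C model. The names
folio_owner_ops, folio_put and the toy pool are illustrative only; this
is not the code from the series:

#include <stdio.h>
#include <stdlib.h>

struct folio;

/* Owner-provided hook, invoked exactly once when the last reference is dropped. */
struct folio_owner_ops {
	void (*free)(struct folio *folio);
};

struct folio {
	int refcount;                              /* the kernel uses an atomic here */
	const struct folio_owner_ops *owner_ops;
};

/* Generic put path: no per-owner if-ladder, just one indirect call at refcount 0. */
static void folio_put(struct folio *folio)
{
	if (--folio->refcount)
		return;
	if (folio->owner_ops && folio->owner_ops->free)
		folio->owner_ops->free(folio);     /* the owner reclaims the folio... */
	else
		free(folio);                       /* ...or it goes back to the base allocator */
}

/* Example owner: a toy "hugetlb-like" pool that wants its folios back. */
static void pool_return_folio(struct folio *folio)
{
	printf("returning folio %p to its pool\n", (void *)folio);
	free(folio);                               /* a real pool would enqueue it instead */
}

static const struct folio_owner_ops pool_ops = { .free = pool_return_folio };

int main(void)
{
	struct folio *f = calloc(1, sizeof(*f));

	if (!f)
		return 1;
	f->refcount = 2;
	f->owner_ops = &pool_ops;

	folio_put(f);                              /* 2 -> 1: nothing happens */
	folio_put(f);                              /* 1 -> 0: the pool callback fires */
	return 0;
}

The point the rest of the thread turns on is not this call itself, but
where the owner_ops pointer can live in struct folio/struct page without
stealing a field that other code needs.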
Hi Jason and David,

On Fri, 8 Nov 2024 at 19:33, David Hildenbrand <david@redhat.com> wrote:
>
> On 08.11.24 18:05, Jason Gunthorpe wrote:
> > On Fri, Nov 08, 2024 at 04:20:30PM +0000, Fuad Tabba wrote:
> >> Some folios, such as hugetlb folios and zone device folios,
> >> require special handling when the folio's reference count reaches
> >> 0, before being freed. Moreover, guest_memfd folios will likely
> >> require special handling to notify it once a folio's reference
> >> count reaches 0, to facilitate shared to private folio conversion
> >> [*]. Currently, each usecase has a dedicated callback when the
> >> folio refcount reaches 0 to that effect. Adding yet more
> >> callbacks is not ideal.
>
> Thanks for having a look!
>
> Replying to clarify some things. Fuad, feel free to add additional
> information.

Thanks for your comments Jason, and for clarifying my cover letter
David. I think David has covered everything, and I'll make sure to
clarify this in the cover letter when I respin.

Cheers,
/fuad

> > Honestly, I question this thesis. How complex would it be to have 'yet
> > more callbacks'? Is the challenge really that the mm can't detect when
> > guestmemfd is the owner of the page because the page will be
> > ZONE_NORMAL?
>
> Fuad might have been a bit imprecise here: we don't want an ever-growing
> list of checks+callbacks on the page freeing fast path.
>
> This series replaces the two cases we already have with a single generic
> one, which is nice independent of guest_memfd, I think.
>
> > So the point of this is really to allow ZONE_NORMAL pages to have a
> > per-allocator callback?
>
> To intercept the refcount going to zero independent of any zones or
> magic page types, with as little overhead as possible in the common
> page freeing path.
>
> It can be used to implement custom allocators, like the one factored out
> for hugetlb in this series. It's not necessarily limited to that,
> though. It can be used as a form of "asynchronous page ref freezing",
> where you get notified once all references are gone.
>
> (I might have another use case with PageOffline, where we want to
> prevent the virtio-mem-owned ones from getting accidentally leaked into
> the buddy during memory offlining with speculative references --
> virtio_mem_fake_offline_going_offline() contains the interesting bits.
> But I did not look into the dirty details yet, just some thoughts on
> where we'd want to intercept the refcount going to 0.)
>
> > But this is also why I suggested to shift them to ZONE_DEVICE for
> > guestmemfd, because then you get these things for free from the pgmap.
>
> With this series even hugetlb gets it for "free", and hugetlb is not
> quite the nail for the ZONE_DEVICE hammer IMHO :)
>
> For things we can statically set aside early during boot and never
> really want to return to the buddy/another allocator, I would agree that
> static ZONE_DEVICE would have been possible.
>
> Whenever the buddy or other allocators are involved, and we might have
> a granularity of a handful of pages (e.g., taken from the buddy),
> getting ZONE_DEVICE involved is not a good (or even feasible) approach.
>
> After all, all we want is to intercept the refcount going to 0.
>
> --
> Cheers,
>
> David / dhildenb
>
On Mon, Nov 11, 2024 at 08:26:54AM +0000, Fuad Tabba wrote:
> Thanks for your comments Jason, and for clarifying my cover letter
> David. I think David has covered everything, and I'll make sure to
> clarify this in the cover letter when I respin.

I don't want you to respin. I think this is a bad idea.
On 12.11.24 06:26, Matthew Wilcox wrote:
> On Mon, Nov 11, 2024 at 08:26:54AM +0000, Fuad Tabba wrote:
>> Thanks for your comments Jason, and for clarifying my cover letter
>> David. I think David has covered everything, and I'll make sure to
>> clarify this in the cover letter when I respin.
>
> I don't want you to respin. I think this is a bad idea.

I'm hoping you'll find some more time to explain what exactly you don't
like, because this series only refactors what we already have.

I enjoy seeing the special casing (especially hugetlb) gone from
mm/swap.c.

I don't particularly enjoy overlaying folio->lru, primarily because we
have to temporarily "evacuate" it when someone wants to make use of
folio->lru (e.g., hugetlb isolation). So it's not completely "sticky",
at least for hugetlb.

Overlaying folio->mapping, similar to how "struct movable_operations"
overlays page->mapping, is not an option, because folio->mapping will be
used for other purposes.

We'd need some sticky and reliable way to tell the folio freeing code
that someone wants to intercept when the refcount of that folio goes to
0, and to identify who to notify.

Maybe folio->private/page->private could be overlaid? hugetlb only uses
folio->private for flags, which we could move to some other tail page
(e.g., simply putting them into flags1).
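For readers unfamiliar with the overlaying concern David raises:
folio->lru is a list_head, so if the owner_ops pointer shares its
storage, any code that legitimately needs the list_head (such as hugetlb
isolation) must first park the pointer somewhere else and restore it
afterwards. A simplified userspace sketch of that union overlay and the
"evacuation" step follows; the parked_ops slot and all names are invented
for illustration, and where the kernel would actually park the pointer is
exactly the awkward part being discussed:

#include <stdio.h>
#include <stdbool.h>

struct list_head { struct list_head *prev, *next; };

struct folio_owner_ops { const char *name; };    /* callbacks omitted for brevity */

/* Simplified overlay: the owner_ops pointer shares storage with the lru list_head. */
struct folio {
	union {
		struct list_head lru;                    /* needed for isolation, lists, ... */
		const struct folio_owner_ops *owner_ops; /* needed when the refcount hits 0 */
	};
	const struct folio_owner_ops *parked_ops;    /* invented "evacuation" slot */
	bool isolated;
};

static const struct folio_owner_ops hugetlb_like_ops = { .name = "hugetlb-like pool" };

/* Before folio->lru can be used, the overlay has to be evacuated... */
static void folio_start_isolation(struct folio *folio)
{
	folio->parked_ops = folio->owner_ops;
	folio->lru.prev = folio->lru.next = &folio->lru;
	folio->isolated = true;
}

/* ...and restored afterwards, or the free path loses track of the owner. */
static void folio_end_isolation(struct folio *folio)
{
	folio->isolated = false;
	folio->owner_ops = folio->parked_ops;
	folio->parked_ops = NULL;
}

int main(void)
{
	struct folio f = { .owner_ops = &hugetlb_like_ops };

	folio_start_isolation(&f);
	printf("while isolated, ops parked: %s\n", f.parked_ops->name);
	folio_end_isolation(&f);
	printf("after isolation, ops restored: %s\n", f.owner_ops->name);
	return 0;
}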
On Tue, Nov 12, 2024 at 10:10:06AM +0100, David Hildenbrand wrote:
> On 12.11.24 06:26, Matthew Wilcox wrote:
> > On Mon, Nov 11, 2024 at 08:26:54AM +0000, Fuad Tabba wrote:
> > > Thanks for your comments Jason, and for clarifying my cover letter
> > > David. I think David has covered everything, and I'll make sure to
> > > clarify this in the cover letter when I respin.
> >
> > I don't want you to respin. I think this is a bad idea.
>
> I'm hoping you'll find some more time to explain what exactly you don't
> like, because this series only refactors what we already have.
>
> I enjoy seeing the special casing (especially hugetlb) gone from
> mm/swap.c.
>
> I don't particularly enjoy overlaying folio->lru, primarily because we
> have to temporarily "evacuate" it when someone wants to make use of
> folio->lru (e.g., hugetlb isolation). So it's not completely "sticky",
> at least for hugetlb.

This is really the worst part of it, though.

And, IMHO, it seems like overkill. We have only a handful of cases --
maybe we shouldn't be trying to get to full generality but just handle
a couple of cases directly? I don't really think it is such a bad thing
to have an if ladder on the free path if we have only a couple of
things. Certainly it looks better than doing overlaying tricks.

Also, how does this translate to Matthew's memdesc world?

Jason
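The alternative Jason suggests keeps explicit per-owner branches in the
common freeing path rather than a generic callback. As a rough sketch of
that style (the folio "kind" field and helper names below are
hypothetical stand-ins, not actual kernel code):

#include <stdio.h>
#include <stdlib.h>

/* Stand-in for page flags / page type; hypothetical, not the kernel's representation. */
enum folio_kind { FOLIO_NORMAL, FOLIO_HUGETLB, FOLIO_ZONE_DEVICE, FOLIO_GUEST_MEMFD };

struct folio { enum folio_kind kind; };

static void free_hugetlb_folio(struct folio *f)     { printf("hugetlb pool takes %p back\n", (void *)f); free(f); }
static void free_zone_device_folio(struct folio *f) { printf("pgmap owner notified for %p\n", (void *)f); free(f); }
static void guest_memfd_folio_freed(struct folio *f){ printf("guest_memfd notified for %p\n", (void *)f); free(f); }

/* The "if ladder" style: one explicit branch per special owner, checked on
 * every free. Adding another owner means touching this common path again. */
static void free_folio(struct folio *f)
{
	if (f->kind == FOLIO_HUGETLB)
		free_hugetlb_folio(f);
	else if (f->kind == FOLIO_ZONE_DEVICE)
		free_zone_device_folio(f);
	else if (f->kind == FOLIO_GUEST_MEMFD)
		guest_memfd_folio_freed(f);
	else
		free(f);                    /* ordinary page, straight back to the allocator */
}

int main(void)
{
	struct folio *f = calloc(1, sizeof(*f));

	if (!f)
		return 1;
	f->kind = FOLIO_HUGETLB;
	free_folio(f);
	return 0;
}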
On 12.11.24 14:53, Jason Gunthorpe wrote:
> On Tue, Nov 12, 2024 at 10:10:06AM +0100, David Hildenbrand wrote:
>> On 12.11.24 06:26, Matthew Wilcox wrote:
>>> On Mon, Nov 11, 2024 at 08:26:54AM +0000, Fuad Tabba wrote:
>>>> Thanks for your comments Jason, and for clarifying my cover letter
>>>> David. I think David has covered everything, and I'll make sure to
>>>> clarify this in the cover letter when I respin.
>>>
>>> I don't want you to respin. I think this is a bad idea.
>>
>> I'm hoping you'll find some more time to explain what exactly you don't
>> like, because this series only refactors what we already have.
>>
>> I enjoy seeing the special casing (especially hugetlb) gone from
>> mm/swap.c.
>>
>> I don't particularly enjoy overlaying folio->lru, primarily because we
>> have to temporarily "evacuate" it when someone wants to make use of
>> folio->lru (e.g., hugetlb isolation). So it's not completely "sticky",
>> at least for hugetlb.
>
> This is really the worst part of it, though.

Yes.

> And, IMHO, it seems like overkill. We have only a handful of cases --
> maybe we shouldn't be trying to get to full generality but just handle
> a couple of cases directly? I don't really think it is such a bad thing
> to have an if ladder on the free path if we have only a couple of
> things. Certainly it looks better than doing overlaying tricks.

I'd really like to abstract the hugetlb handling if possible. The way it
stands, it's just very odd.

We'll need some reliable way to identify these folios that need care.
guest_memfd will be using folio->mapcount for now, so we can't set a
page type like hugetlb does.

> Also, how does this translate to Matthew's memdesc world?

guest_memfd and hugetlb would be operating on folios (at least for now),
which contain the refcount, lru, private, ... so nothing special there.

Once we actually decouple "struct folio" from "struct page", we *might*
have to play fewer tricks, because we could just have a callback pointer
there. But well, maybe we also want to save some space in there.

Do we want dedicated memdescs for hugetlb/guest_memfd that extend folios
in the future? I don't know, maybe.

I'm currently wondering if we can use folio->private for the time being.
Either

(a) If folio->private is still set once the refcount drops to 0, it
    indicates that there is a freeing callback/owner_ops. We'll have to
    make hugetlb not use folio->private and convert others to clear
    folio->private before freeing.

(b) Use bitX of folio->private to indicate that this has "owner_ops"
    meaning. We'll have to make hugetlb not use folio->private and make
    others not use bitX. Might be harder and overkill, because right now
    we only really need the callback when refcount == 0.

(c) Use some other indication that folio->private contains folio_ops.
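Option (b) above amounts to pointer tagging: ordinary users of
folio->private never set the tag bit, so the freeing path can cheaply
tell whether the field currently holds an owner_ops pointer. A minimal
userspace sketch of that convention follows; the bit value, field layout,
and helper names are invented for illustration and assume the ops
structure is at least 2-byte aligned:

#include <stdint.h>
#include <stdio.h>

struct folio_owner_ops {
	void (*free)(void *folio);
};

/* Hypothetical tag bit ("bitX" above); relies on struct folio_owner_ops
 * instances being at least 2-byte aligned so bit 0 of their address is free. */
#define FOLIO_PRIVATE_OWNER_OPS ((uintptr_t)1)

struct folio {
	uintptr_t private;              /* either plain private data or a tagged ops pointer */
};

static void folio_set_owner_ops(struct folio *folio, const struct folio_owner_ops *ops)
{
	folio->private = (uintptr_t)ops | FOLIO_PRIVATE_OWNER_OPS;
}

static const struct folio_owner_ops *folio_get_owner_ops(const struct folio *folio)
{
	if (!(folio->private & FOLIO_PRIVATE_OWNER_OPS))
		return NULL;            /* an ordinary folio->private user, no callback */
	return (const struct folio_owner_ops *)(folio->private & ~FOLIO_PRIVATE_OWNER_OPS);
}

static void report_freed(void *folio)
{
	printf("owner notified that folio %p hit refcount 0\n", folio);
}

static const struct folio_owner_ops demo_ops = { .free = report_freed };

int main(void)
{
	struct folio f = { 0 };

	folio_set_owner_ops(&f, &demo_ops);

	const struct folio_owner_ops *ops = folio_get_owner_ops(&f);
	if (ops)
		ops->free(&f);          /* what the freeing path would do at refcount 0 */
	return 0;
}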