Message ID: 20210320054104.1300774-1-willy@infradead.org
Series: Memory Folios
On Sat, Mar 20, 2021 at 05:40:37AM +0000, Matthew Wilcox (Oracle) wrote: > Current tree at: > https://git.infradead.org/users/willy/pagecache.git/shortlog/refs/heads/folio > > (contains another ~100 patches on top of this batch, not all of which are > in good shape for submission) I've fixed the two buildbot bugs. I also resplit the docs work, and did a bunch of other things to the patches that I haven't posted yet. I'll send the first three patches as a separate series tomorrow, and then the next four as their own series, then I'll repost the rest (up to and including "Convert page wait queues to be folios") later in the week.
On Sat, Mar 20, 2021 at 05:40:37AM +0000, Matthew Wilcox (Oracle) wrote: > Managing memory in 4KiB pages is a serious overhead. Many benchmarks > exist which show the benefits of a larger "page size". As an example, > an earlier iteration of this idea which used compound pages got a 7% > performance boost when compiling the kernel using kernbench without any > particular tuning. > > Using compound pages or THPs exposes a serious weakness in our type > system. Functions are often unprepared for compound pages to be passed > to them, and may only act on PAGE_SIZE chunks. Even functions which are > aware of compound pages may expect a head page, and do the wrong thing > if passed a tail page. > > There have been efforts to label function parameters as 'head' instead > of 'page' to indicate that the function expects a head page, but this > leaves us with runtime assertions instead of using the compiler to prove > that nobody has mistakenly passed a tail page. Calling a struct page > 'head' is also inaccurate as they will work perfectly well on base pages. > The term 'nottail' has not proven popular. > > We also waste a lot of instructions ensuring that we're not looking at > a tail page. Almost every call to PageFoo() contains one or more hidden > calls to compound_head(). This also happens for get_page(), put_page() > and many more functions. There does not appear to be a way to tell gcc > that it can cache the result of compound_head(), nor is there a way to > tell it that compound_head() is idempotent. > > This series introduces the 'struct folio' as a replacement for > head-or-base pages. This initial set reduces the kernel size by > approximately 6kB, although its real purpose is adding infrastructure > to enable further use of the folio. > > The intent is to convert all filesystems and some device drivers to work > in terms of folios. 
This series contains a lot of explicit conversions, > but it's important to realise it's removing a lot of implicit conversions > in some relatively hot paths. There will be very few conversions from > folios when this work is completed; filesystems, the page cache, the > LRU and so on will generally only deal with folios. If that is the case, shouldn't there in the long term only be very few, easy to review instances of things like compound_head(), PAGE_SIZE etc. deep in the heart of MM? And everybody else should 1) never see tail pages and 2) never assume a compile-time page size? What are the higher-level places that in the long-term should be dealing with tail pages at all? Are there legit ones besides the page allocator, THP splitting internals & pte-mapped compound pages? I do agree that the current confusion around which layer sees which types of pages is a problem. But I also think a lot of it is the result of us being in a transitional period where we've added THP in more places but not all code and data structures are or were fully native yet, and so we had things leak out or into where maybe they shouldn't be to make things work in the short term. But this part is already getting better, and has gotten better, with the page cache (largely?) going native for example. Some compound_head() that are currently in the codebase are already unnecessary. Like the one in activate_page(). And looking at grep, I wouldn't be surprised if only the page table walkers need the page_compound() that mark_page_accessed() does. We would be better off if they did the translation once and explicitly in the outer scope, where it's clear they're dealing with a pte-mapped compound page, instead of having a series of rather low level helpers (page flags testing, refcount operations, LRU operations, stat accounting) all trying to be clever but really just obscuring things and imposing unnecessary costs on the vast majority of cases. 
So I fully agree with the motivation behind this patch. But I do wonder why it's special-casing the common case instead of the rare case. It comes at a huge cost. Short term, the churn of replacing 'page' with 'folio' in pretty much all instances is enormous.

And longer term, I'm not convinced folio is the abstraction we want throughout the kernel. If nobody should be dealing with tail pages in the first place, why are we making everybody think in 'folios'? Why does a filesystem care that huge pages are composed of multiple base pages internally? This feels like an implementation detail leaking out of the MM code. The vast majority of places should be thinking 'page' with a size of 'page_size()'. Including most parts of the MM itself.

The compile-time check is nice, but I'm not sure it would be that much more effective at catching things than a few centrally placed warns inside PageFoo(), get_page() etc. and other things that should not encounter tail pages in the first place (with __helpers for the few instances that do). And given the invasiveness of this change, they ought to be very drastically better at it, and obviously so, IMO.

> Documentation/core-api/mm-api.rst | 7 +
> fs/afs/write.c | 3 +-
> fs/cachefiles/rdwr.c | 19 ++-
> fs/io_uring.c | 2 +-
> include/linux/memcontrol.h | 21 +++
> include/linux/mm.h | 156 +++++++++++++++----
> include/linux/mm_types.h | 52 +++++++
> include/linux/mmdebug.h | 20 +++
> include/linux/netfs.h | 2 +-
> include/linux/page-flags.h | 120 +++++++++++---
> include/linux/pagemap.h | 249 ++++++++++++++++++++++--------
> include/linux/swap.h | 6 +
> include/linux/vmstat.h | 107 +++++++++++++
> mm/Makefile | 2 +-
> mm/filemap.c | 237 ++++++++++++++--------------
> mm/folio-compat.c | 37 +++++
> mm/memory.c | 8 +-
> mm/page-writeback.c | 62 ++++++--
> mm/swapfile.c | 8 +-
> mm/util.c | 30 ++--
> 20 files changed, 857 insertions(+), 291 deletions(-)
> create mode 100644 mm/folio-compat.c
On Mon, Mar 22, 2021 at 01:59:24PM -0400, Johannes Weiner wrote: > On Sat, Mar 20, 2021 at 05:40:37AM +0000, Matthew Wilcox (Oracle) wrote: > > This series introduces the 'struct folio' as a replacement for > > head-or-base pages. This initial set reduces the kernel size by > > approximately 6kB, although its real purpose is adding infrastructure > > to enable further use of the folio. > > > > The intent is to convert all filesystems and some device drivers to work > > in terms of folios. This series contains a lot of explicit conversions, > > but it's important to realise it's removing a lot of implicit conversions > > in some relatively hot paths. There will be very few conversions from > > folios when this work is completed; filesystems, the page cache, the > > LRU and so on will generally only deal with folios. > > If that is the case, shouldn't there in the long term only be very > few, easy to review instances of things like compound_head(), > PAGE_SIZE etc. deep in the heart of MM? And everybody else should 1) > never see tail pages and 2) never assume a compile-time page size? I don't know exactly where we get to eventually. There are definitely some aspects of the filesystem<->mm interface which are page-based (eg ->fault needs to look up the exact page, regardless of its head/tail/base nature), while ->readpage needs to talk in terms of folios. > What are the higher-level places that in the long-term should be > dealing with tail pages at all? Are there legit ones besides the page > allocator, THP splitting internals & pte-mapped compound pages? I can't tell. I think this patch maybe illustrates some of the problems, but maybe it's just an intermediate problem: https://git.infradead.org/users/willy/pagecache.git/commitdiff/047e9185dc146b18f56c6df0b49fe798f1805c7b It deals mostly in terms of folios, but when it needs to kmap() and memcmp(), then it needs to work in terms of pages. 
I don't think it's avoidable (maybe we bury the "dealing with pages" inside a kmap() wrapper somewhere, but I'm not sure that's better). > I do agree that the current confusion around which layer sees which > types of pages is a problem. But I also think a lot of it is the > result of us being in a transitional period where we've added THP in > more places but not all code and data structures are or were fully > native yet, and so we had things leak out or into where maybe they > shouldn't be to make things work in the short term. > > But this part is already getting better, and has gotten better, with > the page cache (largely?) going native for example. Thanks ;-) There's still more work to do on that (ie storing one entry to cover 512 indices instead of 512 identical entries), but it is getting better. What can't be made better is the CPU page tables; they really do need to point to tail pages. One of my longer-term goals is to support largeish pages on ARM (and other CPUs). Instead of these silly config options to have 16KiB or 64KiB pages, support "add PTEs for these 16 consecutive, aligned pages". And I'm not sure how we do that without folios. The notion that a page is PAGE_SIZE is really, really ingrained. I tried the page_size() macro to make things easier, but there's 17000 instances of PAGE_SIZE in the tree, and they just aren't going to go away. > Some compound_head() that are currently in the codebase are already > unnecessary. Like the one in activate_page(). Right! And it's hard to find & remove them without very careful analysis, or particularly deep knowledge. With folios, we can remove them without terribly deep thought. > And looking at grep, I wouldn't be surprised if only the page table > walkers need the page_compound() that mark_page_accessed() does. 
We > would be better off if they did the translation once and explicitly in > the outer scope, where it's clear they're dealing with a pte-mapped > compound page, instead of having a series of rather low level helpers > (page flags testing, refcount operations, LRU operations, stat > accounting) all trying to be clever but really just obscuring things > and imposing unnecessary costs on the vast majority of cases. > > So I fully agree with the motivation behind this patch. But I do > wonder why it's special-casing the commmon case instead of the rare > case. It comes at a huge cost. Short term, the churn of replacing > 'page' with 'folio' in pretty much all instances is enormous. Because people (think they) know what a page is. It's PAGE_SIZE bytes long, it occupies one PTE, etc, etc. A folio is new and instead of changing how something familiar (a page) behaves, we're asking them to think about something new instead that behaves a lot like a page, but has differences. > And longer term, I'm not convinced folio is the abstraction we want > throughout the kernel. If nobody should be dealing with tail pages in > the first place, why are we making everybody think in 'folios'? Why > does a filesystem care that huge pages are composed of multiple base > pages internally? This feels like an implementation detail leaking out > of the MM code. The vast majority of places should be thinking 'page' > with a size of 'page_size()'. Including most parts of the MM itself. I think pages already leaked out of the MM and into filesystems (and most of the filesystem writers seem pretty unknowledgeable about how pages and the page cache work, TBH). That's OK! Or it should be OK. Filesystem authors should be experts on how their filesystem works. Everywhere that they have to learn about the page cache is a distraction and annoyance for them. I mean, I already tried what you're suggesting.
It's hard to do, it's hard to explain, it's hard to know if you got it right. With folios, I've got the compiler working for me, telling me that I got some of the low-level bits right (or wrong), leaving me free to notice "Oh, wait, we got the accounting wrong because writeback assumes that a page is only PAGE_SIZE bytes". I would _never_ have noticed that with the THP tree. I only noticed it because transitioning things to folios made me read the writeback code and wonder about the 'inc_wb_stat' call, see that it's measuring something in 'number of pages' and realise that the wb_stat accounting needs to be fixed. > The compile-time check is nice, but I'm not sure it would be that much > more effective at catching things than a few centrally placed warns > inside PageFoo(), get_page() etc. and other things that should not > encounter tail pages in the first place (with __helpers for the few > instances that do). And given the invasiveness of this change, they > ought to be very drastically better at it, and obviously so, IMO. We should have come up with a new type 15 years ago instead of doing THP. But the second best time to invent a new type for "memory objects which are at least as big as a page" is right now. Because it only gets more painful over time.
On Mon, Mar 22, 2021 at 01:59:24PM -0400, Johannes Weiner wrote: > If that is the case, shouldn't there in the long term only be very > few, easy to review instances of things like compound_head(), > PAGE_SIZE etc. deep in the heart of MM? And everybody else should 1) > never see tail pages and 2) never assume a compile-time page size? Probably. > But this part is already getting better, and has gotten better, with > the page cache (largely?) going native for example. As long as there is no strong typing it is going to remain a mess. > So I fully agree with the motivation behind this patch. But I do > wonder why it's special-casing the commmon case instead of the rare > case. It comes at a huge cost. Short term, the churn of replacing > 'page' with 'folio' in pretty much all instances is enormous. The special case is in the eye of the beholder. I suspect we'll end up using the folio in most FS/VM interaction eventually, which makes it the common case. But I don't see how it is the special case? Yes, changing from page to folio just about everywhere causes more change, but it also allows us to: a) do this gradually, and b) thus actually audit that we do the right thing everywhere. And I think Willy's whole series (the git branch, not just the few patches sent out) very clearly shows the benefit of that. > And longer term, I'm not convinced folio is the abstraction we want > throughout the kernel. If nobody should be dealing with tail pages in > the first place, why are we making everybody think in 'folios'? Why > does a filesystem care that huge pages are composed of multiple base > pages internally? This feels like an implementation detail leaking out > of the MM code. The vast majority of places should be thinking 'page' > with a size of 'page_size()'. Including most parts of the MM itself. Why does the name matter? While there are arguments both ways, the clean break certainly helps to remind everyone that this is not your grandfather's fixed-size page.
> > The compile-time check is nice, but I'm not sure it would be that much > more effective at catching things than a few centrally placed warns > inside PageFoo(), get_page() etc. and other things that should not > encounter tail pages in the first place (with __helpers for the few > instances that do). Eeek, no. No amount of runtime checks is going to replace compile time type safety.
Johannes Weiner <hannes@cmpxchg.org> wrote: > So I fully agree with the motivation behind this patch. But I do > wonder why it's special-casing the commmon case instead of the rare > case. It comes at a huge cost. Short term, the churn of replacing > 'page' with 'folio' in pretty much all instances is enormous. > > And longer term, I'm not convinced folio is the abstraction we want > throughout the kernel. If nobody should be dealing with tail pages in > the first place, why are we making everybody think in 'folios'? Why > does a filesystem care that huge pages are composed of multiple base > pages internally? This feels like an implementation detail leaking out > of the MM code. The vast majority of places should be thinking 'page' > with a size of 'page_size()'. Including most parts of the MM itself. I like the idea of logically separating individual hardware pages from abstract bundles of pages by using a separate type for them - at least in filesystem code. I'm trying to abstract some of the handling out of the network filesystems and into a common library plus ITER_XARRAY to insulate those filesystems from the VM. David
On Mon, Mar 22, 2021 at 06:47:44PM +0000, Matthew Wilcox wrote: > On Mon, Mar 22, 2021 at 01:59:24PM -0400, Johannes Weiner wrote: > > On Sat, Mar 20, 2021 at 05:40:37AM +0000, Matthew Wilcox (Oracle) wrote: > > > This series introduces the 'struct folio' as a replacement for > > > head-or-base pages. This initial set reduces the kernel size by > > > approximately 6kB, although its real purpose is adding infrastructure > > > to enable further use of the folio. > > > > > > The intent is to convert all filesystems and some device drivers to work > > > in terms of folios. This series contains a lot of explicit conversions, > > > but it's important to realise it's removing a lot of implicit conversions > > > in some relatively hot paths. There will be very few conversions from > > > folios when this work is completed; filesystems, the page cache, the > > > LRU and so on will generally only deal with folios. > > > > If that is the case, shouldn't there in the long term only be very > > few, easy to review instances of things like compound_head(), > > PAGE_SIZE etc. deep in the heart of MM? And everybody else should 1) > > never see tail pages and 2) never assume a compile-time page size? > > I don't know exactly where we get to eventually. There are definitely > some aspects of the filesystem<->mm interface which are page-based > (eg ->fault needs to look up the exact page, regardless of its > head/tail/base nature), while ->readpage needs to talk in terms of > folios. I can imagine we'd eventually want fault handlers that can also fill in larger chunks of data if the file is of the right size and the MM is able to (and policy/heuristics determine to) go with a huge page. > > What are the higher-level places that in the long-term should be > > dealing with tail pages at all? Are there legit ones besides the page > > allocator, THP splitting internals & pte-mapped compound pages? > > I can't tell. 
I think this patch maybe illustrates some of the > problems, but maybe it's just an intermediate problem: > > https://git.infradead.org/users/willy/pagecache.git/commitdiff/047e9185dc146b18f56c6df0b49fe798f1805c7b > > It deals mostly in terms of folios, but when it needs to kmap() and > memcmp(), then it needs to work in terms of pages. I don't think it's > avoidable (maybe we bury the "dealing with pages" inside a kmap() > wrapper somewhere, but I'm not sure that's better). Yeah it'd be nice to get low-level, PAGE_SIZE pages out of there. We may be able to just kmap whole folios too, which are more likely to be small pages on highmem systems anyway. > > Some compound_head() that are currently in the codebase are already > > unnecessary. Like the one in activate_page(). > > Right! And it's hard to find & remove them without very careful analysis, > or particularly deep knowledge. With folios, we can remove them without > terribly deep thought. True. It definitely also helps mark the places that have been converted from the top down and which ones haven't. Without that you need to think harder about the context ("How would a tail page even get here?" vs. "No page can get here, only folios" ;-)) Again, I think that's something that would automatically be better in the long term when compound_page() and PAGE_SIZE themselves would stand out like sore thumbs. But you raise a good point: there is such an overwhelming amount of them right now that it's difficult to do this without a clearer marker and help from the type system. > > And looking at grep, I wouldn't be surprised if only the page table > > walkers need the page_compound() that mark_page_accessed() does. 
We > > would be better off if they did the translation once and explicitly in > > the outer scope, where it's clear they're dealing with a pte-mapped > > compound page, instead of having a series of rather low level helpers > > (page flags testing, refcount operations, LRU operations, stat > > accounting) all trying to be clever but really just obscuring things > > and imposing unnecessary costs on the vast majority of cases. > > > > So I fully agree with the motivation behind this patch. But I do > > wonder why it's special-casing the commmon case instead of the rare > > case. It comes at a huge cost. Short term, the churn of replacing > > 'page' with 'folio' in pretty much all instances is enormous. > > Because people (think they) know what a page is. It's PAGE_SIZE bytes > long, it occupies one PTE, etc, etc. A folio is new and instead of > changing how something familiar (a page) behaves, we're asking them > to think about something new instead that behaves a lot like a page, > but has differences. Yeah, that makes sense. > > And longer term, I'm not convinced folio is the abstraction we want > > throughout the kernel. If nobody should be dealing with tail pages in > > the first place, why are we making everybody think in 'folios'? Why > > does a filesystem care that huge pages are composed of multiple base > > pages internally? This feels like an implementation detail leaking out > > of the MM code. The vast majority of places should be thinking 'page' > > with a size of 'page_size()'. Including most parts of the MM itself. > > I think pages already leaked out of the MM and into filesystems (and > most of the filesystem writers seem pretty unknowledgable about how > pages and the page cache work, TBH). That's OK! Or it should be OK. > Filesystem authors should be experts on how their filesystem works. > Everywhere that they have to learn about the page cache is a distraction > and annoyance for them. > > I mean, I already tried what you're suggesting. 
It's really freaking > hard. It's hard to do, it's hard to explain, it's hard to know if you > got it right. With folios, I've got the compiler working for me, telling > me that I got some of the low-level bits right (or wrong), leaving me > free to notice "Oh, wait, we got the accounting wrong because writeback > assumes that a page is only PAGE_SIZE bytes". I would _never_ have > noticed that with the THP tree. I only noticed it because transitioning > things to folios made me read the writeback code and wonder about the > 'inc_wb_stat' call, see that it's measuring something in 'number of pages' > and realise that the wb_stat accounting needs to be fixed. I agree with all of this whole-heartedly. The reason I asked about who would deal with tail pages in the long term is because I think optimally most places would just think of these things as descriptors for variable lengths of memory. And only the allocator looks behind the curtain and deals with the (current!) reality that they're stitched together from fixed-size objects. To me, folios seem to further highlight this implementation detail, more so than saying a page is now page_size() - although I readily accept that the latter didn't turn out to be a viable mid-term strategy in practice at all, and that a clean break is necessary sooner rather than later (instead of cleaning up the page api now and replacing the backing pages with struct hwpage or something later). The name of the abstraction indicates how we think we're supposed to use it, what behavior stands out as undesirable. For example, you brought up kmap/memcpy/usercopy, which is a pretty common operation. Should they continue to deal with individual tail pages, and thereby perpetuate the exposure of these low-level MM building blocks to drivers and filesystems? It means folio -> page lookups will remain common - and certainly the concept of the folio suggests thinking of it as a couple of pages strung together.
And the more this is the case, the less it stands out when somebody is dealing with low-level pages when really they shouldn't be - the thing this is trying to fix. Granted it's narrowing the channel quite a bit. But it's also so pervasively used that I do wonder if it's possible to keep up with creative new abuses. But I also worry about the longevity of the concept in general. This is one of the most central and fundamental concepts in the kernel. Is this going to make sense in the future? In 5 years even? > > The compile-time check is nice, but I'm not sure it would be that much > > more effective at catching things than a few centrally placed warns > > inside PageFoo(), get_page() etc. and other things that should not > > encounter tail pages in the first place (with __helpers for the few > > instances that do). And given the invasiveness of this change, they > > ought to be very drastically better at it, and obviously so, IMO. > > We should have come up with a new type 15 years ago instead of doing THP. > But the second best time to invent a new type for "memory objects which > are at least as big as a page" is right now. Because it only gets more > painful over time. Yes and no. Yes because I fully agree that too much detail of the pages has leaked into all kinds of places where it shouldn't be, and a new abstraction for what most places interact with is a good idea IMO. But we're also headed in a direction with the VM that gives me pause about the folios-are-multiple-pages abstraction. How long are we going to have multiple pages behind a huge page? Common storage drives are getting fast enough that simple buffered IO workloads are becoming limited by CPU, just because it's too many individual pages to push through the cache. We have pending patches to rewrite the reclaim algorithm because rmap is falling apart with the rate of paging we're doing.
We'll need larger pages in the VM not just for optimizing TLB access, but to cut transaction overhead for paging in general (I know you're already onboard with this, especially on the page cache side, just stating it for completeness). But for that to work, we'll need the allocator to produce huge pages at the necessary rate, too. The current implementation likely won't scale. Compaction is expensive enough that we have to weigh when to allocate huge pages for long-lived anon regions, let alone allocate them for streaming IO cache entries. But if the overwhelming number of requests going to the page allocator are larger than 4k pages - anon regions? check. page cache? likely a sizable share. slub? check. network? check - does it even make sense to have that as the default block size for the page allocator anymore? Or even allocate struct page at this granularity? So I think transitioning away from ye olde page is a great idea. I wonder this: have we mapped out the near future of the VM enough to say that the folio is the right abstraction? What does 'folio' mean when it corresponds to either a single page or some slab-type object with no dedicated page? If we go through with all the churn now anyway, IMO it makes at least sense to ditch all association and conceptual proximity to the hardware page or collections thereof. Simply say it's some length of memory, and keep thing-to-page translations out of the public API from the start. I mean, is there a good reason to keep this baggage? mem_t or something.

	mem = find_get_mem(mapping, offset);
	p = kmap(mem, offset - mem_file_offset(mem), len);
	copy_from_user(p, buf, len);
	kunmap(mem);
	SetMemDirty(mem);
	put_mem(mem);

There are 10k instances of 'page' in mm/ outside the page allocator, a majority of which will be the new thing. 14k in fs. I don't think I have the strength to type shrink_folio_list(), or explain to new people what it means, years after it has stopped making sense.
On Tue, Mar 23, 2021 at 08:29:16PM -0400, Johannes Weiner wrote: > On Mon, Mar 22, 2021 at 06:47:44PM +0000, Matthew Wilcox wrote: > > On Mon, Mar 22, 2021 at 01:59:24PM -0400, Johannes Weiner wrote: > > > On Sat, Mar 20, 2021 at 05:40:37AM +0000, Matthew Wilcox (Oracle) wrote: > > > > This series introduces the 'struct folio' as a replacement for > > > > head-or-base pages. This initial set reduces the kernel size by > > > > approximately 6kB, although its real purpose is adding infrastructure > > > > to enable further use of the folio. > > > > > > > > The intent is to convert all filesystems and some device drivers to work > > > > in terms of folios. This series contains a lot of explicit conversions, > > > > but it's important to realise it's removing a lot of implicit conversions > > > > in some relatively hot paths. There will be very few conversions from > > > > folios when this work is completed; filesystems, the page cache, the > > > > LRU and so on will generally only deal with folios. > > > > > > If that is the case, shouldn't there in the long term only be very > > > few, easy to review instances of things like compound_head(), > > > PAGE_SIZE etc. deep in the heart of MM? And everybody else should 1) > > > never see tail pages and 2) never assume a compile-time page size? > > > > I don't know exactly where we get to eventually. There are definitely > > some aspects of the filesystem<->mm interface which are page-based > > (eg ->fault needs to look up the exact page, regardless of its > > head/tail/base nature), while ->readpage needs to talk in terms of > > folios. > > I can imagine we'd eventually want fault handlers that can also fill > in larger chunks of data if the file is of the right size and the MM > is able to (and policy/heuristics determine to) go with a huge page. Oh yes, me too! The way I think this works is that the VM asks for the specific page, just as it does today and the ->fault handler returns the page. 
Then the VM looks up the folio for that page, and asks the arch to map the entire folio. How the arch does that is up to the arch -- if it's PMD sized and aligned, it can do that; if the arch knows that it should use 8 consecutive PTE entries to map 32KiB all at once, it can do that. But I think we need the ->fault handler to return the specific page, because that's how we can figure out whether this folio is mapped at the appropriate alignment to make this work. If the fault handler returns the folio, I don't think we can figure out if the alignment is correct. Maybe we can for the page cache, but a device driver might have a compound page allocated for its own purposes, and it might not be amenable to the same rules as the page cache. > > https://git.infradead.org/users/willy/pagecache.git/commitdiff/047e9185dc146b18f56c6df0b49fe798f1805c7b > > > > It deals mostly in terms of folios, but when it needs to kmap() and > > memcmp(), then it needs to work in terms of pages. I don't think it's > > avoidable (maybe we bury the "dealing with pages" inside a kmap() > > wrapper somewhere, but I'm not sure that's better). > > Yeah it'd be nice to get low-level, PAGE_SIZE pages out of there. We > may be able to just kmap whole folios too, which are more likely to be > small pages on highmem systems anyway. I got told "no" when asking for kmap_local() of a compound page. Maybe that's changeable, but I'm assuming that kmap() space will continue to be tight for the foreseeable future (until we can kill highmem forever). > > > Some compound_head() that are currently in the codebase are already > > > unnecessary. Like the one in activate_page(). > > > > Right! And it's hard to find & remove them without very careful analysis, > > or particularly deep knowledge. With folios, we can remove them without > > terribly deep thought. > > True. It definitely also helps mark the places that have been > converted from the top down and which ones haven't. 
Without that you > need to think harder about the context ("How would a tail page even > get here?" vs. "No page can get here, only folios" ;-)) Exactly! Take a look at page_mkclean(). Its implementation strongly suggests that it expects a head page, but I think it'll unmap a single page if passed a tail page ... and it's not clear to me that isn't the behaviour that pagecache_isize_extended() would prefer. Tricky. > > I mean, I already tried what you're suggesting. It's really freaking > > hard. It's hard to do, it's hard to explain, it's hard to know if you > > got it right. With folios, I've got the compiler working for me, telling > > me that I got some of the low-level bits right (or wrong), leaving me > > free to notice "Oh, wait, we got the accounting wrong because writeback > > assumes that a page is only PAGE_SIZE bytes". I would _never_ have > > noticed that with the THP tree. I only noticed it because transitioning > > things to folios made me read the writeback code and wonder about the > > 'inc_wb_stat' call, see that it's measuring something in 'number of pages' > > and realise that the wb_stat accounting needs to be fixed. > > I agree with all of this whole-heartedly. > > The reason I asked about who would deal with tail pages in the long > term is because I think optimally most places would just think of > these things as descriptors for variable lengths of memory. And only > the allocator looks behind the curtain and deals with the (current!) > reality that they're stitched together from fixed-size objects. > > To me, folios seem to further highlight this implementation detail, > more so than saying a page is now page_size() - although I readily > accept that the latter didn't turn out to be a viable mid-term > strategy in practice at all, and that a clean break is necessary > sooner rather than later (instead of cleaning up the page api now and > replacing the backing pages with struct hwpage or something later). 
> > The name of the abstraction indicates how we think we're supposed to > use it, what behavior stands out as undesirable. > > For example, you brought up kmap/memcpy/usercopy, which is a pretty > common operation. Should they continue to deal with individual tail > pages, and thereby perpetuate the exposure of these low-level MM > building blocks to drivers and filesystems? > > It means folio -> page lookups will remain common - and certainly > the concept of the folio suggests thinking of it as a couple of pages > strung together. And the more this is the case, the less it stands out > when somebody is dealing with low-level pages when really they > shouldn't be - the thing this is trying to fix. Granted it's narrowing > the channel quite a bit. But it's also so pervasively used that I do > wonder if it's possible to keep up with creative new abuses. > > But I also worry about the longevity of the concept in general. This > is one of the most central and fundamental concepts in the kernel. Is > this going to make sense in the future? In 5 years even? One of the patches I haven't posted yet starts to try to deal with kmap()/mem*()/kunmap(): mm: Add kmap_local_folio This allows us to map a portion of a folio. Callers can only expect to access up to the next page boundary. 
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> diff --git a/include/linux/highmem-internal.h b/include/linux/highmem-internal.h index 7902c7d8b55f..55a29c9d562f 100644 --- a/include/linux/highmem-internal.h +++ b/include/linux/highmem-internal.h @@ -73,6 +73,12 @@ static inline void *kmap_local_page(struct page *page) return __kmap_local_page_prot(page, kmap_prot); } +static inline void *kmap_local_folio(struct folio *folio, size_t offset) +{ + struct page *page = &folio->page + offset / PAGE_SIZE; + return __kmap_local_page_prot(page, kmap_prot) + offset % PAGE_SIZE; +} Partly I haven't shared that one because I'm not 100% sure that 'byte offset relative to start of folio' is the correct interface. I'm looking at some users and thinking that maybe 'byte offset relative to start of file' might be better. Or perhaps that's just filesystem-centric thinking. > > > The compile-time check is nice, but I'm not sure it would be that much > > > more effective at catching things than a few centrally placed warns > > > inside PageFoo(), get_page() etc. and other things that should not > > > encounter tail pages in the first place (with __helpers for the few > > > instances that do). And given the invasiveness of this change, they > > > ought to be very drastically better at it, and obviously so, IMO. > > > > We should have come up with a new type 15 years ago instead of doing THP. > > But the second best time to invent a new type for "memory objects which > > are at least as big as a page" is right now. Because it only gets more > > painful over time. > > Yes and no. > > Yes because I fully agree that too much detail of the pages have > leaked into all kinds of places where they shouldn't be, and a new > abstraction for what most places interact with is a good idea IMO. > > But we're also headed in a direction with the VM that give me pause > about the folios-are-multiple-pages abstraction. > > How long are we going to have multiple pages behind a huge page? 
Yes, that's a really good question. I think Muchun Song's patches are an interesting and practical way of freeing up memory _now_, but long-term we'll need something different. Maybe we end up with dynamically allocated pages (perhaps when we break a 2MB page into 1MB pages in the buddy allocator). > Common storage drives are getting fast enough that simple buffered IO > workloads are becoming limited by CPU, just because it's too many > individual pages to push through the cache. We have pending patches to > rewrite the reclaim algorithm because rmap is falling apart with the > rate of paging we're doing. We'll need larger pages in the VM not just > for optimizing TLB access, but to cut transaction overhead for paging > in general (I know you're already onboard with this, especially on the > page cache side, just stating it for completeness). yes, yes, yes and yes. Dave Chinner produced a fantastic perf report for me illustrating how kswapd and the page cache completely fall apart under what must be a common streaming load. Just create a file 2x the size of memory, then cat it to /dev/null. cat tries to allocate memory in readahead and ends up contending on the i_pages lock with kswapd who's trying to free pages from the LRU list one at a time. Larger pages will help with that because more work gets done with each lock acquisition, but I can't help but feel that the real solution is for the page cache to notice that this is a streaming workload and have cat eagerly recycle pages from this file. That's a biggish project; we know how many pages there are in this mapping, but how to know when to switch from "allocate memory from the page allocator" to "just delete a page from early in the file and reuse it at the current position in the file"? > But for that to work, we'll need the allocator to produce huge pages > at the necessary rate, too. The current implementation likely won't > scale. 
Compaction is expensive enough that we have to weigh when to > allocate huge pages for long-lived anon regions, let alone allocate > them for streaming IO cache entries. Heh, I have that as a work item for later this year -- give the page allocator per-cpu lists of compound pages, not just order-0 pages. That'll save us turning compound pages back into buddy pages, only to turn them into compound pages again. I also have a feeling that the page allocator either needs to become a sub-allocator of an allocator that deals in, say, 1GB chunks of memory, or it needs to become reluctant to break up larger orders. eg if the dcache asks for just one more dentry, it should have to go through at least one round of reclaim before we choose to break up a high-order page to satisfy that request. > But if the overwhelming number of requests going to the page allocator > are larger than 4k pages - anon regions? check. page cache? likely a > sizable share. slub? check. network? check - does it even make sense > to have that as the default block size for the page allocator anymore? > Or even allocate struct page at this granularity? Yep, others have talked about that as well. I think I may even have said a few times at LSFMM, "What if we just make PAGE_SIZE 2MB?". After all, my first 386 Linux system was 4-8MB of RAM (it got upgraded). The 16GB laptop that I now have is 2048 times more RAM, so 4x the number of pages that system had. But people seem attached to being able to use smaller page sizes. There's that pesky "compatibility" argument. > So I think transitioning away from ye olde page is a great idea. I > wonder this: have we mapped out the near future of the VM enough to > say that the folio is the right abstraction? > > What does 'folio' mean when it corresponds to either a single page or > some slab-type object with no dedicated page? 
> > If we go through with all the churn now anyway, IMO it makes at least > sense to ditch all association and conceptual proximity to the > hardware page or collections thereof. Simply say it's some length of > memory, and keep thing-to-page translations out of the public API from > the start. I mean, is there a good reason to keep this baggage? > > mem_t or something. > > mem = find_get_mem(mapping, offset); > p = kmap(mem, offset - mem_file_offset(mem), len); > copy_from_user(p, buf, len); > kunmap(mem); > SetMemDirty(mem); > put_mem(mem); I think there's still value to the "new thing" being a power of two in size. I'm not sure you were suggesting otherwise, but it's worth putting on the table as something we explicitly agree on (or not!) I mean what you've written there looks a _lot_ like where I get to in the iomap code. status = iomap_write_begin(inode, pos, bytes, 0, &folio, iomap, srcmap); if (unlikely(status)) break; if (mapping_writably_mapped(inode->i_mapping)) flush_dcache_folio(folio); /* We may be part-way through a folio */ offset = offset_in_folio(folio, pos); copied = iov_iter_copy_from_user_atomic(folio, i, offset, bytes); copied = iomap_write_end(inode, pos, bytes, copied, folio, iomap, srcmap); (which eventually calls TestSetFolioDirty) It doesn't copy more than PAGE_SIZE bytes per iteration because iov_iter_copy_from_user_atomic() isn't safe to do that yet. But in *principle*, it should be able to. > There are 10k instances of 'page' in mm/ outside the page allocator, a > majority of which will be the new thing. 14k in fs. I don't think I > have the strength to type shrink_folio_list(), or explain to new > people what it means, years after it has stopped making sense. One of the things I don't like about the current iteration of folio is that getting to things is folio->page.mapping. 
I think it does want to be folio->mapping, and I'm playing around with this: struct folio { - struct page page; + union { + struct page page; + struct { + unsigned long flags; + struct list_head lru; + struct address_space *mapping; + pgoff_t index; + unsigned long private; + atomic_t _mapcount; + atomic_t _refcount; + }; + }; }; +static inline void folio_build_bug(void) +{ +#define FOLIO_MATCH(pg, fl) \ +BUILD_BUG_ON(offsetof(struct page, pg) != offsetof(struct folio, fl)); + + FOLIO_MATCH(flags, flags); + FOLIO_MATCH(lru, lru); + FOLIO_MATCH(mapping, mapping); + FOLIO_MATCH(index, index); + FOLIO_MATCH(private, private); + FOLIO_MATCH(_mapcount, _mapcount); + FOLIO_MATCH(_refcount, _refcount); +#undef FOLIO_MATCH + BUILD_BUG_ON(sizeof(struct page) != sizeof(struct folio)); +} with the intent of eventually renaming page->mapping to page->__mapping so people can't look at page->mapping on a tail page. If we even have tail pages eventually. I could see a future where we have pte_to_pfn(), pfn_to_folio() and are completely page-free (... the vm_fault would presumably return a pfn instead of a page at that point ...). But that's too ambitious a project to succeed any time soon. There's a lot of transitional stuff in these patches where I do &folio->page. I cringe a little every time I write that. So yes, let's ask the question of "Is this the right short term, medium term or long term approach?" I think it is, at least in broad strokes. Let's keep refining it. Thanks for your contribution here; it's really useful.
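The union-overlay pattern in the struct folio diff above can be tried out in userspace with cut-down mock structs. This is only a sketch: mock_page and mock_folio are illustrative stand-ins for the real (much larger) layouts, and C11's _Static_assert plays the role the kernel's BUILD_BUG_ON plays in the patch.

```c
#include <assert.h>
#include <stddef.h>

/* Cut-down stand-ins for struct page / struct folio; the real structs
 * carry many more fields and unions. */
struct mock_page {
	unsigned long flags;
	void *mapping;
	unsigned long index;
};

struct mock_folio {
	union {
		struct mock_page page;
		struct {	/* anonymous mirror of the page fields */
			unsigned long flags;
			void *mapping;
			unsigned long index;
		};
	};
};

/* The FOLIO_MATCH idea: prove at compile time that each aliased field
 * sits at the same offset in both views of the union. */
#define FOLIO_MATCH(pg, fl) \
	_Static_assert(offsetof(struct mock_page, pg) == \
		       offsetof(struct mock_folio, fl), #pg)
FOLIO_MATCH(flags, flags);
FOLIO_MATCH(mapping, mapping);
FOLIO_MATCH(index, index);
_Static_assert(sizeof(struct mock_page) == sizeof(struct mock_folio),
	       "page and folio must be the same size");

/* With the offsets proven equal, folio->index can replace
 * folio->page.index. */
static unsigned long folio_index(struct mock_folio *folio)
{
	return folio->index;
}
```

Because the anonymous struct mirrors the page layout field for field, either spelling reads the same storage, and the build bug fires as a compile error the moment the two drift apart.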
On Wed, Mar 24, 2021 at 06:24:21AM +0000, Matthew Wilcox wrote: > On Tue, Mar 23, 2021 at 08:29:16PM -0400, Johannes Weiner wrote: > > On Mon, Mar 22, 2021 at 06:47:44PM +0000, Matthew Wilcox wrote: > > > On Mon, Mar 22, 2021 at 01:59:24PM -0400, Johannes Weiner wrote: > One of the patches I haven't posted yet starts to try to deal with kmap()/mem*()/kunmap(): > > mm: Add kmap_local_folio > > This allows us to map a portion of a folio. Callers can only expect > to access up to the next page boundary. > > Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> > > diff --git a/include/linux/highmem-internal.h b/include/linux/highmem-internal.h > index 7902c7d8b55f..55a29c9d562f 100644 > --- a/include/linux/highmem-internal.h > +++ b/include/linux/highmem-internal.h > @@ -73,6 +73,12 @@ static inline void *kmap_local_page(struct page *page) > return __kmap_local_page_prot(page, kmap_prot); > } > > +static inline void *kmap_local_folio(struct folio *folio, size_t offset) > +{ > + struct page *page = &folio->page + offset / PAGE_SIZE; > + return __kmap_local_page_prot(page, kmap_prot) + offset % PAGE_SIZE; > +} > > Partly I haven't shared that one because I'm not 100% sure that 'byte > offset relative to start of folio' is the correct interface. I'm looking > at some users and thinking that maybe 'byte offset relative to start > of file' might be better. Or perhaps that's just filesystem-centric > thinking. Right, this doesn't seem specific to files just because they would be the primary users of it. > > But for that to work, we'll need the allocator to produce huge pages > > at the necessary rate, too. The current implementation likely won't > > scale. Compaction is expensive enough that we have to weigh when to > > allocate huge pages for long-lived anon regions, let alone allocate > > them for streaming IO cache entries. > > Heh, I have that as a work item for later this year -- give the page > allocator per-cpu lists of compound pages, not just order-0 pages. 
> That'll save us turning compound pages back into buddy pages, only to > turn them into compound pages again. > > I also have a feeling that the page allocator either needs to become a > sub-allocator of an allocator that deals in, say, 1GB chunks of memory, > or it needs to become reluctant to break up larger orders. eg if the > dcache asks for just one more dentry, it should have to go through at > least one round of reclaim before we choose to break up a high-order > page to satisfy that request. Slub already allocates higher-order pages for dentries: slabinfo - version: 2.1 # name <active_objs> <num_objs> <objsize> <objperslab> <pagesperslab> : tunables <limit> <batchcount> <sharedfactor> : slabdata <active_slabs> <num_slabs> <sharedavail> dentry 133350 133350 192 42 2 : tunables 0 0 0 : slabdata 3175 3175 0 ^ here and it could avoid even more internal fragmentation with bigger orders. It only doesn't because of the overhead of allocating them. If the default block size in the allocator were 2M, we'd also get slab packing at that granularity, and we wouldn't have to worry about small objects breaking huge pages any more than we worry about slab objects fragmenting 4k pages today. > > But if the overwhelming number of requests going to the page allocator > > are larger than 4k pages - anon regions? check. page cache? likely a > > sizable share. slub? check. network? check - does it even make sense > > to have that as the default block size for the page allocator anymore? > > Or even allocate struct page at this granularity? > > Yep, others have talked about that as well. I think I may even have said > a few times at LSFMM, "What if we just make PAGE_SIZE 2MB?". After all, > my first 386 Linux system was 4-8MB of RAM (it got upgraded). The 16GB > laptop that I now have is 2048 times more RAM, so 4x the number of pages > that system had. > > But people seem attached to being able to use smaller page sizes. > There's that pesky "compatibility" argument. 
Right, that's why I'm NOT saying we should eliminate the support for 4k chunks in the page cache and page tables. That's still useful if you have lots of small files. I'm just saying it doesn't have to be the default that everything is primarily optimized for. We can make the default allocation size of the allocator correspond to a hugepage and have a secondary allocator level for 4k chunks. Like slab, but fixed-size and highmem-aware. It makes sense to make struct page 2M as well. It would save a ton of memory on average and reduce the pressure we have on struct page's size today. And we really don't need struct page at 4k just to support this unit of paging when necessary: page tables don't care, they use pfns and can point to any 4k offset, struct page or no struct page. For the page cache, we can move mapping, index, lru, etc. from today's struct page into an entry descriptor that could either sit in a native 2M struct page (just like today), or be allocated on demand and point into a chunked struct page. Same for <2M anonymous mappings. Hey, didn't you just move EXACTLY those fields into the folio? ;) All this to reiterate, I really do agree with the concept of a new type of object for paging, page cache entries, etc. But I think there are good reasons to assume that this unit of paging needs to support sizes smaller than the standard page size used by the kernel at large, and so 'bundle of pages' is not a good way of defining it. It can easily cause problems down the line again if people continue to assume that there is at least one PAGE_SIZE struct page in a folio. And it's not obvious to me why it really NEEDS to be 'bundle of pages' instead of just 'chunk of memory'. > > So I think transitioning away from ye olde page is a great idea. I > wonder this: have we mapped out the near future of the VM enough to > say that the folio is the right abstraction? 
> > > > What does 'folio' mean when it corresponds to either a single page or > > some slab-type object with no dedicated page? > > > > If we go through with all the churn now anyway, IMO it makes at least > > sense to ditch all association and conceptual proximity to the > > hardware page or collections thereof. Simply say it's some length of > > memory, and keep thing-to-page translations out of the public API from > > the start. I mean, is there a good reason to keep this baggage? > > > > mem_t or something. > > > > mem = find_get_mem(mapping, offset); > > p = kmap(mem, offset - mem_file_offset(mem), len); > > copy_from_user(p, buf, len); > > kunmap(mem); > > SetMemDirty(mem); > > put_mem(mem); > > I think there's still value to the "new thing" being a power of two > in size. I'm not sure you were suggesting otherwise, but it's worth > putting on the table as something we explicitly agree on (or not!) Ha, I wasn't thinking about minimum alignment. I used the byte offsets because I figured that's what's natural to the fs and saw no reason to have it think in terms of page size in this example. From an implementation pov, since anything in the page cache can end up in a page table, it probably doesn't make a whole lot of sense to allow quantities smaller than the smallest unit of paging supported by the processor. But I wonder if that's mostly something the MM would care about when it allocates these objects, not necessarily something that needs to be reflected in the interface or the filesystem. The other point I was trying to make was just the alternate name. As I said above, I think 'bundle of pages' as a concept is a strategic error that will probably come back to haunt us. I also have to admit, I really hate the name. We may want to stop people thinking of PAGE_SIZE, but this term doesn't give people any clue WHAT to think of. 
Ten years down the line, when the possible confusion between folio and page and PAGE_SIZE has been eradicated, people still will have to google what a folio is, and then have a hard time retaining a mental image. I *know* what it is and I still have a hard time reading code that uses it. That's why I drafted around with the above code, to see if it would go down easier. I think it does. It's simple, self-explanatory, but abstract enough as to not make assumptions around its implementation. Filesystem look up cache memory, write data in it, mark memory dirty. Maybe folio makes more sense to native speakers, but I have never heard this term. Of course when you look it up, it's "something to do with pages" :D As a strategy to unseat the obsolete mental model around pages, IMO redirection would be preferable to confusion. > > There are 10k instances of 'page' in mm/ outside the page allocator, a > > majority of which will be the new thing. 14k in fs. I don't think I > > have the strength to type shrink_folio_list(), or explain to new > > people what it means, years after it has stopped making sense. > > One of the things I don't like about the current iteration of folio > is that getting to things is folio->page.mapping. 
I think it does want > to be folio->mapping, and I'm playing around with this: > > struct folio { > - struct page page; > + union { > + struct page page; > + struct { > + unsigned long flags; > + struct list_head lru; > + struct address_space *mapping; > + pgoff_t index; > + unsigned long private; > + atomic_t _mapcount; > + atomic_t _refcount; > + }; > + }; > }; > > +static inline void folio_build_bug(void) > +{ > +#define FOLIO_MATCH(pg, fl) \ > +BUILD_BUG_ON(offsetof(struct page, pg) != offsetof(struct folio, fl)); > + > + FOLIO_MATCH(flags, flags); > + FOLIO_MATCH(lru, lru); > + FOLIO_MATCH(mapping, mapping); > + FOLIO_MATCH(index, index); > + FOLIO_MATCH(private, private); > + FOLIO_MATCH(_mapcount, _mapcount); > + FOLIO_MATCH(_refcount, _refcount); > +#undef FOLIO_MATCH > + BUILD_BUG_ON(sizeof(struct page) != sizeof(struct folio)); > +} > > with the intent of eventually renaming page->mapping to page->__mapping > so people can't look at page->mapping on a tail page. If we even have > tail pages eventually. I could see a future where we have pte_to_pfn(), > pfn_to_folio() and are completely page-free (... the vm_fault would > presumably return a pfn instead of a page at that point ...). But that's > too ambitious a project to succeed any time soon. > > There's a lot of transitional stuff in these patches where I do > &folio->page. I cringe a little every time I write that. Instead of the union in there, could you do this? struct thing { struct address_space *mapping; pgoff_t index; ... }; struct page { union { struct thing thing; ... } } and use container_of() to get to the page in those places? > So yes, let's ask the question of "Is this the right short term, medium > term or long term approach?" I think it is, at least in broad strokes. > Let's keep refining it. Yes, yes, and yes. :)
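The container_of() suggestion above can likewise be sketched in userspace. Again, struct thing and mock_page here are hypothetical layouts invented just to show the pointer arithmetic, not the real kernel structures.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical embedded descriptor, as proposed in the mail above. */
struct thing {
	void *mapping;
	unsigned long index;
};

struct mock_page {
	unsigned long flags;
	union {
		struct thing thing;
		/* ... other overloaded uses of these words ... */
		unsigned long words[2];
	};
};

/* Standard container_of: subtract the member's offset to recover a
 * pointer to the enclosing structure. */
#define container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

/* The places that still need a page convert back explicitly. */
static struct mock_page *thing_to_page(struct thing *t)
{
	return container_of(t, struct mock_page, thing);
}
```

The trade-off versus the union-overlay version is that the conversion is spelled out at each call site rather than hidden behind aliased field names.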
I'm going to respond to some points in detail below, but there are a couple of overarching themes that I want to bring out up here. Grand Vision ~~~~~~~~~~~~ I haven't outlined my long-term plan. Partly because it is a _very_ long way off, and partly because I think what I'm doing stands on its own. But some of the points below bear on this, so I'll do it now. Eventually, I want to make struct page optional for allocations. It's too small for some things (allocating page tables, for example), and overly large for others (allocating a 2MB page, networking page_pool). I don't want to change its size in the meantime; having a struct page refer to PAGE_SIZE bytes is something that's quite deeply baked in. In broad strokes, I think that having a Power Of Two Allocator with Descriptor (POTAD) is a useful foundational allocator to have. The specific allocator that we call the buddy allocator is very clever for the 1990s, but touches too many cachelines to be good with today's CPUs. The generalisation of the buddy allocator to the POTAD lets us allocate smaller quantities (eg a 512 byte block) and allocate descriptors which differ in size from a struct page. For an extreme example, see xfs_buf which is 360 bytes and is the descriptor for an allocation between 512 and 65536 bytes. There are times when we need to get from the physical address to the descriptor, eg memory-failure.c or get_user_pages(). This is the equivalent of phys_to_page(), and it's going to have to be a lookup tree. I think this is a role for the Maple Tree, but it's not ready yet. I don't know if it'll be fast enough for this case. There's also the need (particularly for memory-failure) to determine exactly what kind of descriptor we're dealing with, and also its size. Even its owner, so we can notify them of memory failure. There's still a role for the slab allocator, eg allocating objects which aren't a power of two, or allocating things for which the user doesn't need a descriptor of its own. 
We can even keep the 'alloc_page' interface around; it's just a specialisation of the POTAD. Anyway, there's a lot of work here, and I'm sure there are many holes to be poked in it, but eventually I want the concept of tail pages to go away, and for pages to become not-the-unit of memory management in Linux any more. Naming ~~~~~~ The fun thing about the word folio is that it actually has several meanings. Quoting wikipedia, : it is firstly a term for a common method of arranging sheets of paper : into book form, folding the sheet only once, and a term for a book : made in this way; secondly it is a general term for a sheet, leaf or : page in (especially) manuscripts and old books; and thirdly it is an : approximate term for the size of a book, and for a book of this size. So while it is a collection of pages in the first sense, in the second sense it's also its own term for a "sheet, leaf or page". I (still) don't insist on the word folio, but I do insist that it be _a_ word. The word "slab" was a great coin by Bonwick -- it didn't really mean anything in the context of memory before he used it, and now we all know exactly what it means. I just don't want us to end up with struct uma { /* unit of memory allocation */ We could choose another (short, not-used-in-kernel) word almost at random. How about 'kerb'? What I haven't touched on anywhere in this, is whether a folio is the descriptor for all POTA or whether it's specifically the page cache descriptor. I like the idea of having separate descriptors for objects in the page cache from anonymous or other allocations. But I'm not very familiar with the rmap code, and that wants to do things like manipulate the refcount on a descriptor without knowing whether it's a file or anon page. Or neither (eg device driver memory mapped to userspace. Or vmalloc memory mapped to userspace. Or ...) We could get terribly carried away with this ... 
struct mappable { /* any mappable object must be LRU */ struct list_head lru; int refcount; int mapcount; }; struct folio { /* for page cache */ unsigned long flags; struct mappable map; struct address_space *mapping; pgoff_t index; void *private; }; struct quarto { /* for anon pages */ unsigned long flags; struct mappable map; swp_entry_t swp; struct anon_vma *vma; }; but I'm not sure we want to go there. On Fri, Mar 26, 2021 at 01:48:15PM -0400, Johannes Weiner wrote: > On Wed, Mar 24, 2021 at 06:24:21AM +0000, Matthew Wilcox wrote: > > On Tue, Mar 23, 2021 at 08:29:16PM -0400, Johannes Weiner wrote: > > > On Mon, Mar 22, 2021 at 06:47:44PM +0000, Matthew Wilcox wrote: > > > > On Mon, Mar 22, 2021 at 01:59:24PM -0400, Johannes Weiner wrote: > > One of the patches I haven't posted yet starts to try to deal with kmap()/mem*()/kunmap(): > > > > mm: Add kmap_local_folio > > > > This allows us to map a portion of a folio. Callers can only expect > > to access up to the next page boundary. > > > > Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> > > > > diff --git a/include/linux/highmem-internal.h b/include/linux/highmem-internal.h > > index 7902c7d8b55f..55a29c9d562f 100644 > > --- a/include/linux/highmem-internal.h > > +++ b/include/linux/highmem-internal.h > > @@ -73,6 +73,12 @@ static inline void *kmap_local_page(struct page *page) > > return __kmap_local_page_prot(page, kmap_prot); > > } > > > > +static inline void *kmap_local_folio(struct folio *folio, size_t offset) > > +{ > > + struct page *page = &folio->page + offset / PAGE_SIZE; > > + return __kmap_local_page_prot(page, kmap_prot) + offset % PAGE_SIZE; > > +} > > > > Partly I haven't shared that one because I'm not 100% sure that 'byte > > offset relative to start of folio' is the correct interface. I'm looking > > at some users and thinking that maybe 'byte offset relative to start > > of file' might be better. Or perhaps that's just filesystem-centric > > thinking. 
> > Right, this doesn't seem specific to files just because they would be > the primary users of it. Yeah. I think I forgot to cc you on this: https://lore.kernel.org/linux-fsdevel/20210325032202.GS1719932@casper.infradead.org/ and "byte offset relative to the start of the folio" works just fine: + offset = offset_in_folio(folio, diter->pos); + +map: + diter->entry = kmap_local_folio(folio, offset); > > > But for that to work, we'll need the allocator to produce huge pages > > > at the necessary rate, too. The current implementation likely won't > > > scale. Compaction is expensive enough that we have to weigh when to > > > allocate huge pages for long-lived anon regions, let alone allocate > > > them for streaming IO cache entries. > > > > Heh, I have that as a work item for later this year -- give the page > > allocator per-cpu lists of compound pages, not just order-0 pages. > > That'll save us turning compound pages back into buddy pages, only to > > turn them into compound pages again. > > > > I also have a feeling that the page allocator either needs to become a > > sub-allocator of an allocator that deals in, say, 1GB chunks of memory, > > or it needs to become reluctant to break up larger orders. eg if the > > dcache asks for just one more dentry, it should have to go through at > > least one round of reclaim before we choose to break up a high-order > > page to satisfy that request. > > Slub already allocates higher-order pages for dentries: > > slabinfo - version: 2.1 > # name <active_objs> <num_objs> <objsize> <objperslab> <pagesperslab> : tunables <limit> <batchcount> <sharedfactor> : slabdata <active_slabs> <num_slabs> <sharedavail> > dentry 133350 133350 192 42 2 : tunables 0 0 0 : slabdata 3175 3175 0 > > ^ here > > and it could avoid even more internal fragmentation with bigger > orders. It only doesn't because of the overhead of allocating them. Oh, yes. Sorry, I didn't explain myself properly. 
If we have a lightly-loaded system with terabytes of memory (perhaps all the jobs it is running are CPU intensive and don't need much memory), the system has a tendency to clog up with negative dentries. Hundreds of millions of them. We rely on memory pressure to get rid of them, and when there finally is memory pressure, it takes literally hours. If there were a slight amount of pressure to trim the dcache at the point when we'd otherwise break up an order-4 page to get an order-2 page, the system would work much better. Obviously, we do want the dcache to be able to expand to the point where it's useful, but at the point that it's no longer useful, we need to trim it. It'd probably be better to have the dcache realise that its old entries aren't useful any more and age them out instead of relying on memory pressure to remove old entries, so this is probably an unnecessary digression. > If the default block size in the allocator were 2M, we'd also get slab > packing at that granularity, and we wouldn't have to worry about small > objects breaking huge pages any more than we worry about slab objects > fragmenting 4k pages today. Yup. I definitely see the attraction of letting the slab allocator allocate in larger units. On the other hand, you have to start worrying about underutilisation of the memory at _some_ size, and I'd argue the sweet spot is somewhere between 4kB and 2MB today. For example: fat_inode_cache 110 110 744 22 4 : tunables 0 0 0 : slabdata 5 5 0 That's currently using 20 pages. If slab were only allocating 2MB slabs from the page allocator, I'd have 1.9MB of ram unused in that cache. > > But people seem attached to being able to use smaller page sizes. > > There's that pesky "compatibility" argument. > > Right, that's why I'm NOT saying we should eliminate the support for > 4k chunks in the page cache and page tables. That's still useful if > you have lots of small files. 
> > I'm just saying it doesn't have to be the default that everything is
> > primarily optimized for. We can make the default allocation size of
> > the allocator correspond to a hugepage and have a secondary allocator
> > level for 4k chunks. Like slab, but fixed-size and highmem-aware.
> >
> > It makes sense to make struct page 2M as well. It would save a ton of
> > memory on average and reduce the pressure we have on struct page's
> > size today.
> >
> > And we really don't need struct page at 4k just to support this unit
> > of paging when necessary: page tables don't care, they use pfns and can
> > point to any 4k offset, struct page or no struct page. For the page
> > cache, we can move mapping, index, lru, etc. from today's struct page
> > into an entry descriptor that could either sit in a native 2M struct
> > page (just like today), or be allocated on demand and point into a
> > chunked struct page. Same for <2M anonymous mappings.
> >
> > Hey, didn't you just move EXACTLY those fields into the folio? ;)

You say page tables don't actually need a struct page, but we do use it.

        struct {        /* Page table pages */
                unsigned long _pt_pad_1;        /* compound_head */
                pgtable_t pmd_huge_pte; /* protected by page->ptl */
                unsigned long _pt_pad_2;        /* mapping */
                union {
                        struct mm_struct *pt_mm; /* x86 pgds only */
                        atomic_t pt_frag_refcount; /* powerpc */
                };
#if ALLOC_SPLIT_PTLOCKS
                spinlock_t *ptl;
#else
                spinlock_t ptl;
#endif
        };

It's a problem because some architectures would really rather allocate
2KiB page tables (s390) or would like to support 4KiB page tables on a
64KiB base page size kernel (ppc).

[actually i misread your comment initially; you meant that page tables
point to PFNs and don't care what struct backs them ... i'm leaving
this in here because it illustrates a problem with change
struct-page-size-to-2MB]
On Mon, Mar 29, 2021 at 05:58:32PM +0100, Matthew Wilcox wrote: > In broad strokes, I think that having a Power Of Two Allocator > with Descriptor (POTAD) is a useful foundational allocator to have. > The specific allocator that we call the buddy allocator is very clever for > the 1990s, but touches too many cachelines to be good with today's CPUs. > The generalisation of the buddy allocator to the POTAD lets us allocate > smaller quantities (eg a 512 byte block) and allocate descriptors which > differ in size from a struct page. For an extreme example, see xfs_buf > which is 360 bytes and is the descriptor for an allocation between 512 > and 65536 bytes. > > There are times when we need to get from the physical address to > the descriptor, eg memory-failure.c or get_user_pages(). This is the > equivalent of phys_to_page(), and it's going to have to be a lookup tree. > I think this is a role for the Maple Tree, but it's not ready yet. > I don't know if it'll be fast enough for this case. There's also the > need (particularly for memory-failure) to determine exactly what kind > of descriptor we're dealing with, and also its size. Even its owner, > so we can notify them of memory failure. A couple of things I forgot to mention ... I'd like the POTAD to be not necessarily tied to allocating memory. For example, I think it could be used to allocate swap space. eg the swap code could register the space in a swap file as allocatable through the POTAD, and then later ask the POTAD to allocate a POT from the swap space. The POTAD wouldn't need to be limited to MAX_ORDER. It should be perfectly capable of allocating 1TB if your machine has 1.5TB of RAM in it (... and things haven't got too fragmented) I think the POTAD can be used to replace the CMA. The CMA supports weirdo things like "Allocate 8MB of memory at a 1MB alignment", and I think that's doable within the data structures that I'm thinking about for the POTAD. 
It'd first try to allocate an 8MB chunk at 8MB alignment, and then if that's not possible, try to allocate two adjacent 4MB chunks; continuing down until it finds that there aren't 8x1MB chunks, at which point it can give up.
Hi Willy, On Mon, Mar 29, 2021 at 05:58:32PM +0100, Matthew Wilcox wrote: > I'm going to respond to some points in detail below, but there are a > couple of overarching themes that I want to bring out up here. > > Grand Vision > ~~~~~~~~~~~~ > > I haven't outlined my long-term plan. Partly because it is a _very_ > long way off, and partly because I think what I'm doing stands on its > own. But some of the points below bear on this, so I'll do it now. > > Eventually, I want to make struct page optional for allocations. It's too > small for some things (allocating page tables, for example), and overly > large for others (allocating a 2MB page, networking page_pool). I don't > want to change its size in the meantime; having a struct page refer to > PAGE_SIZE bytes is something that's quite deeply baked in. Right, I think it's overloaded and it needs to go away from many contexts it's used in today. I think it describes a real physical thing, though, and won't go away as a concept. More on that below. > In broad strokes, I think that having a Power Of Two Allocator > with Descriptor (POTAD) is a useful foundational allocator to have. > The specific allocator that we call the buddy allocator is very clever for > the 1990s, but touches too many cachelines to be good with today's CPUs. > The generalisation of the buddy allocator to the POTAD lets us allocate > smaller quantities (eg a 512 byte block) and allocate descriptors which > differ in size from a struct page. For an extreme example, see xfs_buf > which is 360 bytes and is the descriptor for an allocation between 512 > and 65536 bytes. I actually disagree with this rather strongly. If anything, the buddy allocator has turned out to be a pretty poor fit for the foundational allocator. On paper, it is elegant and versatile in serving essentially arbitrary memory blocks. In practice, we mostly just need 4k and 2M chunks from it. 
And it sucks at the 2M ones because of the fragmentation caused by the ungrouped 4k blocks. The great thing about the slab allocator isn't just that it manages internal fragmentation of the larger underlying blocks. It also groups related objects by lifetime/age and reclaimability, which dramatically mitigates the external fragmentation of the memory space. The buddy allocator on the other hand has no idea what you want that 4k block for, and whether it pairs up well with the 4k block it just handed to somebody else. But the decision it makes in that moment is crucial for its ability to serve larger blocks later on. We do some mobility grouping based on how reclaimable or migratable the memory is, but it's not the full answer. A variable size allocator without object type grouping will always have difficulties producing anything but the smallest block size after some uptime. It's inherently flawed that way. What HAS proven itself is having the base block size correspond to a reasonable transaction unit for paging and page reclaim, then fill in smaller ranges with lifetime-aware slabbing, larger ranges with vmalloc and SG schemes, and absurdly large requests with CMA. We might be stuck with serving order-1, order-2 etc. for a little while longer for the few users who can't go to kvmalloc(), but IMO it's the wrong direction to expand into. Optimally the foundational allocator would just do one block size. > There are times when we need to get from the physical address to > the descriptor, eg memory-failure.c or get_user_pages(). This is the > equivalent of phys_to_page(), and it's going to have to be a lookup tree. > I think this is a role for the Maple Tree, but it's not ready yet. > I don't know if it'll be fast enough for this case. There's also the > need (particularly for memory-failure) to determine exactly what kind > of descriptor we're dealing with, and also its size. Even its owner, > so we can notify them of memory failure. 
A tree could be more memory efficient in the long term, but for starters a 2M page could have a struct smallpage *smallpages[512]; member that points to any allocated/mapped 4k descriptors. The page table level would tell you what you're looking at: a pmd is simple, a pte would map to a 4k pfn, whose upper bits identify a struct page then a page flag would tell you whether we have a pte-mapped 2M page or whether the lower pfn bits identify an offset in smallpages[]. It's one pointer for every 4k of RAM, which is a bit dumb, but not as dumb as having an entire struct page for each of those ;) > What I haven't touched on anywhere in this, is whether a folio is the > descriptor for all POTA or whether it's specifically the page cache > descriptor. I like the idea of having separate descriptors for objects > in the page cache from anonymous or other allocations. But I'm not very > familiar with the rmap code, and that wants to do things like manipulate > the refcount on a descriptor without knowing whether it's a file or > anon page. Or neither (eg device driver memory mapped to userspace. > Or vmalloc memory mapped to userspace. Or ...) The rmap code is all about the page type specifics, but once you get into mmap, page reclaim, page migration, we're dealing with fully fungible blocks of memory. I do like the idea of using actual language typing for the different things struct page can be today (fs page), but with a common type to manage the fungible block of memory backing it (allocation state, LRU & aging state, mmap state etc.) New types for the former are an easier sell. We all agree that there are too many details of the page - including the compound page implementation detail - inside the cache library, fs code and drivers. It's a slightly tougher sell to say that the core VM code itself (outside the cache library) needs a tighter abstraction for the struct page building block and the compound page structure. 
At least at this time while we're still sorting out how it all may work down the line. Certainly, we need something to describe fungible memory blocks: either a struct page that can be 4k and 2M compound, or a new thing that can be backed by a 2M struct page or a 4k struct smallpage. We don't know yet, so I would table the new abstraction type for this. I generally don't think we want a new type that does everything that the overloaded struct page already does PLUS the compound abstraction. Whatever name we pick for it, it'll always be difficult to wrap your head around such a beast. IMO starting with an explicit page cache descriptor that resolves to struct page inside core VM code (and maybe ->fault) for now makes the most sense: it greatly mitigates the PAGE_SIZE and tail page issue right away, and it's not in conflict with, but rather helps work toward, replacing the fungible memory unit behind it. There isn't too much overlap or generic code between cache and anon pages such that sharing a common descriptor would be a huge win (most overlap is at the fungible memory block level, and the physical struct page layout of course), so I don't think we should aim for a generic abstraction for both. As drivers go, I think there are slightly different requirements to filesystems, too. For filesystems, when the VM can finally do it (and the file range permits it), I assume we want to rather transparently increase the unit of data transfer from 4k to 2M. Most drivers that currently hardcode alloc_page() or PAGE_SIZE OTOH probably don't want us to bump their allocation sizes. There ARE instances where drivers allocate pages based on buffer_size / PAGE_SIZE and then interact with virtual memory. Those are true VM objects that could grow transparently if PAGE_SIZE grows, and IMO they should share the "fungible memory block" abstraction the VM uses. 
But there are also many instances where PAGE_SIZE just means 4096 is a
good size for me, and struct page is useful for refcounting. Those just
shouldn't use whatever the VM or the cache layer are using and stop
putting additional burden on an already tricky abstraction.

> On Fri, Mar 26, 2021 at 01:48:15PM -0400, Johannes Weiner wrote:
> > On Wed, Mar 24, 2021 at 06:24:21AM +0000, Matthew Wilcox wrote:
> > > On Tue, Mar 23, 2021 at 08:29:16PM -0400, Johannes Weiner wrote:
> > > > On Mon, Mar 22, 2021 at 06:47:44PM +0000, Matthew Wilcox wrote:
> > > > > On Mon, Mar 22, 2021 at 01:59:24PM -0400, Johannes Weiner wrote:
> > > One of the patches I haven't posted yet starts to try to deal with kmap()/mem*()/kunmap():
> > >
> > >     mm: Add kmap_local_folio
> > >
> > >     This allows us to map a portion of a folio.  Callers can only expect
> > >     to access up to the next page boundary.
> > >
> > >     Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
> > >
> > > diff --git a/include/linux/highmem-internal.h b/include/linux/highmem-internal.h
> > > index 7902c7d8b55f..55a29c9d562f 100644
> > > --- a/include/linux/highmem-internal.h
> > > +++ b/include/linux/highmem-internal.h
> > > @@ -73,6 +73,12 @@ static inline void *kmap_local_page(struct page *page)
> > >  	return __kmap_local_page_prot(page, kmap_prot);
> > >  }
> > >
> > > +static inline void *kmap_local_folio(struct folio *folio, size_t offset)
> > > +{
> > > +	struct page *page = &folio->page + offset / PAGE_SIZE;
> > > +	return __kmap_local_page_prot(page, kmap_prot) + offset % PAGE_SIZE;
> > > +}
> > >
> > > Partly I haven't shared that one because I'm not 100% sure that 'byte
> > > offset relative to start of folio' is the correct interface.  I'm looking
> > > at some users and thinking that maybe 'byte offset relative to start
> > > of file' might be better.  Or perhaps that's just filesystem-centric
> > > thinking.
> > > > Right, this doesn't seem specific to files just because they would be > > the primary users of it. > > Yeah. I think I forgot to cc you on this: > > https://lore.kernel.org/linux-fsdevel/20210325032202.GS1719932@casper.infradead.org/ > > and "byte offset relative to the start of the folio" works just fine: > > + offset = offset_in_folio(folio, diter->pos); > + > +map: > + diter->entry = kmap_local_folio(folio, offset); Yeah, that looks great to me! > > > > But for that to work, we'll need the allocator to produce huge pages > > > > at the necessary rate, too. The current implementation likely won't > > > > scale. Compaction is expensive enough that we have to weigh when to > > > > allocate huge pages for long-lived anon regions, let alone allocate > > > > them for streaming IO cache entries. > > > > > > Heh, I have that as a work item for later this year -- give the page > > > allocator per-cpu lists of compound pages, not just order-0 pages. > > > That'll save us turning compound pages back into buddy pages, only to > > > turn them into compound pages again. > > > > > > I also have a feeling that the page allocator either needs to become a > > > sub-allocator of an allocator that deals in, say, 1GB chunks of memory, > > > or it needs to become reluctant to break up larger orders. eg if the > > > dcache asks for just one more dentry, it should have to go through at > > > least one round of reclaim before we choose to break up a high-order > > > page to satisfy that request. > > > > Slub already allocates higher-order pages for dentries: > > > > slabinfo - version: 2.1 > > # name <active_objs> <num_objs> <objsize> <objperslab> <pagesperslab> : tunables <limit> <batchcount> <sharedfactor> : slabdata <active_slabs> <num_slabs> <sharedavail> > > dentry 133350 133350 192 42 2 : tunables 0 0 0 : slabdata 3175 3175 0 > > > > ^ here > > > > and it could avoid even more internal fragmentation with bigger > > orders. 
It only doesn't because of the overhead of allocating them. > > Oh, yes. Sorry, I didn't explain myself properly. If we have a > lightly-loaded system with terabytes of memory (perhaps all the jobs > it is running are CPU intensive and don't need much memory), the system > has a tendency to clog up with negative dentries. Hundreds of millions > of them. We rely on memory pressure to get rid of them, and when there > finally is memory pressure, it takes literally hours. > > If there were a slight amount of pressure to trim the dcache at the point > when we'd otherwise break up an order-4 page to get an order-2 page, > the system would work much better. Obviously, we do want the dcache to > be able to expand to the point where it's useful, but at the point that > it's no longer useful, we need to trim it. > > It'd probably be better to have the dcache realise that its old entries > aren't useful any more and age them out instead of relying on memory > pressure to remove old entries, so this is probably an unnecessary > digression. It's difficult to identify a universally acceptable line for usefulness of caches other than physical memory pressure. The good thing about the memory pressure threshold is that you KNOW somebody else has immediate use for the memory, and you're justified in recycling and reallocating caches from the cold end. Without that, you'd either have to set an arbitrary size cutoff or an arbitrary aging cutoff (not used in the last minute e.g.). But optimal settings for either of those depend on the workload, and aren't very intuitive to configure. Such a large gap between the smallest object and the overall size of memory is just inherently difficult to manage. More below. > > If the default block size in the allocator were 2M, we'd also get slab > > packing at that granularity, and we wouldn't have to worry about small > > objects breaking huge pages any more than we worry about slab objects > > fragmenting 4k pages today. > > Yup. 
I definitely see the attraction of letting the slab allocator > allocate in larger units. On the other hand, you have to start worrying > about underutilisation of the memory at _some_ size, and I'd argue the > sweet spot is somewhere between 4kB and 2MB today. For example: > > fat_inode_cache 110 110 744 22 4 : tunables 0 0 0 : slabdata 5 5 0 > > That's currently using 20 pages. If slab were only allocating 2MB slabs > from the page allocator, I'd have 1.9MB of ram unused in that cache. Right, we'd raise internal fragmentation to a worst case of 2M (minus minimum object size) per slab cache. As a ratio of overall memory, this isn't unprecedented, though: my desktop machine has 32G and my phone has 8G. Divide those by 512 for a 4k base page comparison and you get memory sizes common in the mid to late 90s. Our levels of internal fragmentation are historically low, which of course is nice by itself. But that's also what's causing problems in the form of external fragmentation, and why we struggle to produce 2M blocks. It's multitudes easier to free one 2M slab page of consecutively allocated inodes than it is to free 512 batches of different objects with conflicting lifetimes, ages, or potentially even reclaimability. I don't think we'll have much of a choice when it comes to trading some internal fragmentation to deal with our mounting external fragmentation problem. [ Because of the way fragmentation works I also don't think that 1G would be a good foundational block size. It either wastes a crazy amount of memory on internal fragmentation, or you allow external fragmentation and the big blocks deteriorate with uptime anyway. There really is such a thing as a page: a goldilocks quantity of memory, given the overall amount of RAM in a system, that is optimal as a paging unit and intersection point for the fragmentation axes. This never went away. It just isn't 4k anymore on modern systems. 
And we're creating a bit of a mess by adapting various places (page
allocator, slab, page cache, swap code) to today's goldilocks size
while struct page lags behind and doesn't track reality anymore.

I think there is a lot of value in disconnecting places from struct
page that don't need it, but IMO all in the context of the broader goal
of being able to catch up struct page to what the real page is.

We may be able to get rid of the 4k backward-compatible paging units
eventually when we all have 1TB of RAM. But the concept of a page in a
virtual memory system isn't really going anywhere. ]

> > > But people seem attached to being able to use smaller page sizes.
> > > There's that pesky "compatibility" argument.
> >
> > Right, that's why I'm NOT saying we should eliminate the support for
> > 4k chunks in the page cache and page tables. That's still useful if
> > you have lots of small files.
> >
> > I'm just saying it doesn't have to be the default that everything is
> > primarily optimized for. We can make the default allocation size of
> > the allocator correspond to a hugepage and have a secondary allocator
> > level for 4k chunks. Like slab, but fixed-size and highmem-aware.
> >
> > It makes sense to make struct page 2M as well. It would save a ton of
> > memory on average and reduce the pressure we have on struct page's
> > size today.
> >
> > And we really don't need struct page at 4k just to support this unit
> > of paging when necessary: page tables don't care, they use pfns and can
> > point to any 4k offset, struct page or no struct page. For the page
> > cache, we can move mapping, index, lru, etc. from today's struct page
> > into an entry descriptor that could either sit in a native 2M struct
> > page (just like today), or be allocated on demand and point into a
> > chunked struct page. Same for <2M anonymous mappings.
> >
> > Hey, didn't you just move EXACTLY those fields into the folio?
;) > > You say page tables don't actually need a struct page, but we do use it. > > struct { /* Page table pages */ > unsigned long _pt_pad_1; /* compound_head */ > pgtable_t pmd_huge_pte; /* protected by page->ptl */ > unsigned long _pt_pad_2; /* mapping */ > union { > struct mm_struct *pt_mm; /* x86 pgds only */ > atomic_t pt_frag_refcount; /* powerpc */ > }; > #if ALLOC_SPLIT_PTLOCKS > spinlock_t *ptl; > #else > spinlock_t ptl; > #endif > }; > > It's a problem because some architectures would really rather > allocate 2KiB page tables (s390) or would like to support 4KiB page > tables on a 64KiB base page size kernel (ppc). > > [actually i misread your comment initially; you meant that page > tables point to PFNs and don't care what struct backs them ... i'm > leaving this in here because it illustrates a problem with change > struct-page-size-to-2MB] Yes, I meant what page table entries point to. The page table (directories) themselves are still 4k as per the architecture, and they'd also have to use smallpage descriptors. I don't immediately see why they couldn't, though. It's not that many, especially if pmd mappings are common (a 4k pmd can map 1G worth of address space).
On Tue, Mar 30, 2021 at 03:30:54PM -0400, Johannes Weiner wrote: > Hi Willy, > > On Mon, Mar 29, 2021 at 05:58:32PM +0100, Matthew Wilcox wrote: > > I'm going to respond to some points in detail below, but there are a > > couple of overarching themes that I want to bring out up here. > > > > Grand Vision > > ~~~~~~~~~~~~ > > > > I haven't outlined my long-term plan. Partly because it is a _very_ > > long way off, and partly because I think what I'm doing stands on its > > own. But some of the points below bear on this, so I'll do it now. > > > > Eventually, I want to make struct page optional for allocations. It's too > > small for some things (allocating page tables, for example), and overly > > large for others (allocating a 2MB page, networking page_pool). I don't > > want to change its size in the meantime; having a struct page refer to > > PAGE_SIZE bytes is something that's quite deeply baked in. > > Right, I think it's overloaded and it needs to go away from many > contexts it's used in today. > > I think it describes a real physical thing, though, and won't go away > as a concept. More on that below. I'm at least 90% with you on this, and we're just quibbling over details at this point, I think. > > In broad strokes, I think that having a Power Of Two Allocator > > with Descriptor (POTAD) is a useful foundational allocator to have. > > The specific allocator that we call the buddy allocator is very clever for > > the 1990s, but touches too many cachelines to be good with today's CPUs. > > The generalisation of the buddy allocator to the POTAD lets us allocate > > smaller quantities (eg a 512 byte block) and allocate descriptors which > > differ in size from a struct page. For an extreme example, see xfs_buf > > which is 360 bytes and is the descriptor for an allocation between 512 > > and 65536 bytes. > > I actually disagree with this rather strongly. If anything, the buddy > allocator has turned out to be a pretty poor fit for the foundational > allocator. 
> > On paper, it is elegant and versatile in serving essentially arbitrary > memory blocks. In practice, we mostly just need 4k and 2M chunks from > it. And it sucks at the 2M ones because of the fragmentation caused by > the ungrouped 4k blocks. That's a very Intel-centric way of looking at it. Other architectures support a multitude of page sizes, from the insane ia64 (4k, 8k, 16k, then every power of four up to 4GB) to more reasonable options like (4k, 32k, 256k, 2M, 16M, 128M). But we (in software) shouldn't constrain ourselves to thinking in terms of what the hardware currently supports. Google have data showing that for their workloads, 32kB is the goldilocks size. I'm sure for some workloads, it's much higher and for others it's lower. But for almost no workload is 4kB the right choice any more, and probably hasn't been since the late 90s. > The great thing about the slab allocator isn't just that it manages > internal fragmentation of the larger underlying blocks. It also groups > related objects by lifetime/age and reclaimability, which dramatically > mitigates the external fragmentation of the memory space. > > The buddy allocator on the other hand has no idea what you want that > 4k block for, and whether it pairs up well with the 4k block it just > handed to somebody else. But the decision it makes in that moment is > crucial for its ability to serve larger blocks later on. > > We do some mobility grouping based on how reclaimable or migratable > the memory is, but it's not the full answer. I don't think that's entirely true. The vast majority of memory in any machine is either anonymous or page cache. The problem is that right now, all anonymous and page cache allocations are order-0 (... or order-9). 
So the buddy allocator can't know anything useful about the pages and will often allocate one order-0 page to the page cache, then allocate its buddy to the slab cache in order to allocate the radix_tree_node to store the pointer to the page in (ok, radix tree nodes come from an order-2 cache, but it still prevents this order-9 page from being assembled). If the movable allocations suddenly start being order-3 and order-4, the unmovable, unreclaimable allocations are naturally going to group down in the lower orders, and we won't have the problem that a single dentry blocks the allocation of an entire 2MB page. The problem, for me, with the ZONE_MOVABLE stuff is that it requires sysadmin intervention to set up. I don't have a ZONE_MOVABLE on my laptop. The allocator should be automatically handling movability hints without my intervention. > A variable size allocator without object type grouping will always > have difficulties producing anything but the smallest block size after > some uptime. It's inherently flawed that way. I think our buddy allocator is flawed, to be sure, but only because it doesn't handle movable hints more aggressively. For example, at the point that a largeish block gets a single non-movable allocation, all the movable allocations within that block should be migrated out. If the offending allocation is freed quickly, it all collapses into a large, useful chunk, or if not, then it provides a sponge to soak up other non-movable allocations. > > What I haven't touched on anywhere in this, is whether a folio is the > > descriptor for all POTA or whether it's specifically the page cache > > descriptor. I like the idea of having separate descriptors for objects > > in the page cache from anonymous or other allocations. But I'm not very > > familiar with the rmap code, and that wants to do things like manipulate > > the refcount on a descriptor without knowing whether it's a file or > > anon page. 
Or neither (eg device driver memory mapped to userspace. > > Or vmalloc memory mapped to userspace. Or ...) > > The rmap code is all about the page type specifics, but once you get > into mmap, page reclaim, page migration, we're dealing with fully > fungible blocks of memory. > > I do like the idea of using actual language typing for the different > things struct page can be today (fs page), but with a common type to > manage the fungible block of memory backing it (allocation state, LRU > & aging state, mmap state etc.) > > New types for the former are an easier sell. We all agree that there > are too many details of the page - including the compound page > implementation detail - inside the cache library, fs code and drivers. > > It's a slightly tougher sell to say that the core VM code itself > (outside the cache library) needs a tighter abstraction for the struct > page building block and the compound page structure. At least at this > time while we're still sorting out how it all may work down the line. > Certainly, we need something to describe fungible memory blocks: > either a struct page that can be 4k and 2M compound, or a new thing > that can be backed by a 2M struct page or a 4k struct smallpage. We > don't know yet, so I would table the new abstraction type for this. > > I generally don't think we want a new type that does everything that > the overloaded struct page already does PLUS the compound > abstraction. Whatever name we pick for it, it'll always be difficult > to wrap your head around such a beast. > > IMO starting with an explicit page cache descriptor that resolves to > struct page inside core VM code (and maybe ->fault) for now makes the > most sense: it greatly mitigates the PAGE_SIZE and tail page issue > right away, and it's not in conflict with, but rather helps work > toward, replacing the fungible memory unit behind it. Right, and that's what struct folio is today. It eliminates tail pages from consideration in a lot of paths. 
I think it also makes sense for struct folio to be used for anonymous memory. But I think that's where it stops; it isn't for Slab, it isn't for page table pages, and it's not for ZONE_DEVICE pages. > There isn't too much overlap or generic code between cache and anon > pages such that sharing a common descriptor would be a huge win (most > overlap is at the fungible memory block level, and the physical struct > page layout of course), so I don't think we should aim for a generic > abstraction for both. They're both on the LRU list, they use a lot of the same PageFlags, they both have a mapcount and refcount, and they both have memcg_data. The only things they really use differently are mapping, index and private. And then we have to consider shmem which uses both in a pretty eldritch way. > As drivers go, I think there are slightly different requirements to > filesystems, too. For filesystems, when the VM can finally do it (and > the file range permits it), I assume we want to rather transparently > increase the unit of data transfer from 4k to 2M. Most drivers that > currently hardcode alloc_page() or PAGE_SIZE OTOH probably don't want > us to bump their allocation sizes. If you take a look at my earlier work, you'll see me using a range of sizes in the page cache, starting at 16kB and gradually increasing to (theoretically) 2MB, although the algorithm tended to top out around 256kB. Doing particularly large reads could see 512kB/1MB reads, but it was very hard to hit 2MB in practice. I wasn't too concerned at the time, but my point is that we do want to automatically tune the size of the allocation unit to the workload. An application which reads in 64kB chunks is giving us a pretty clear signal that they want to manage memory in 64kB chunks. 
> > It'd probably be better to have the dcache realise that its old
> > entries aren't useful any more and age them out instead of relying
> > on memory pressure to remove old entries, so this is probably an
> > unnecessary digression.
>
> It's difficult to identify a universally acceptable line for
> usefulness of caches other than physical memory pressure.
>
> The good thing about the memory pressure threshold is that you KNOW
> somebody else has immediate use for the memory, and you're justified
> in recycling and reallocating caches from the cold end.
>
> Without that, you'd either have to set an arbitrary size cutoff or an
> arbitrary aging cutoff (not used in the last minute e.g.). But optimal
> settings for either of those depend on the workload, and aren't very
> intuitive to configure.

For the dentry cache, I think there is a more useful metric, and
that's length of the hash chain. If it gets too long, we're spending
more time walking it than we're saving by having entries cached.
Starting reclaim based on "this bucket of the dcache has twenty
entries in it" would probably work quite well.

> Our levels of internal fragmentation are historically low, which of
> course is nice by itself. But that's also what's causing problems in
> the form of external fragmentation, and why we struggle to produce 2M
> blocks. It's multitudes easier to free one 2M slab page of
> consecutively allocated inodes than it is to free 512 batches of
> different objects with conflicting lifetimes, ages, or potentially
> even reclaimability.

Unf. I don't think freeing 2MB worth of _anything_ is ever going to be
easy enough to rely on. My actual root filesystem:

xfs_inode 143134 144460 1024 32 8 : tunables 0 0 0 : slabdata 4517 4517 0

So we'd have to be able to free 2048 of those 143k inodes, and they
all have to be consecutive (and aligned).
I suppose we could model that and try to work out how many we'd have
to be able to free in order to get all 2048 in any page free, but I
bet it's a variant of the Birthday Paradox, and we'd find it's
something crazy like half of them.

Without slab gaining the ability to ask users to relocate allocations,
I think any memory sent to slab is never coming back.

So ... even if I accept every part of your vision as the way things
are going to be, I think the folio patchset I have now is a step in
the right direction. I'm going to send a v6 now and hope it's not too
late for this merge window.
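The guess above is easy to test with a toy Monte Carlo model (my simplifying assumptions: ~143k inodes packed into 70 aligned 2MB regions of 2048 objects each, and objects freed in uniformly random order, which real workloads certainly are not). Under those assumptions it comes out even worse than "half of them": the vast majority must be freed before any single region is entirely empty.

```c
#include <assert.h>

#define REGIONS    70		/* ~143k inodes / 2048 per 2MB region */
#define PER_REGION 2048
#define TOTAL      (REGIONS * PER_REGION)

/* Small deterministic PRNG so the run is reproducible. */
static unsigned long rng_state = 42;
static unsigned long rng(void)
{
	rng_state = rng_state * 6364136223846793005UL
			      + 1442695040888963407UL;
	return rng_state >> 33;
}

/* Free objects in uniformly random order; return the fraction freed
 * before any single 2MB region becomes entirely free. */
double fraction_freed_before_any_region_empty(void)
{
	static int order[TOTAL];
	int freed_in_region[REGIONS] = { 0 };
	int i;

	for (i = 0; i < TOTAL; i++)
		order[i] = i;
	/* Fisher-Yates shuffle: a uniformly random free order. */
	for (i = TOTAL - 1; i > 0; i--) {
		int j = (int)(rng() % (unsigned long)(i + 1));
		int tmp = order[i];

		order[i] = order[j];
		order[j] = tmp;
	}
	for (i = 0; i < TOTAL; i++)
		if (++freed_in_region[order[i] / PER_REGION] == PER_REGION)
			return (double)(i + 1) / TOTAL;
	return 1.0;
}
```

The analytic version agrees: a region is fully free after a fraction t of random frees with probability t^2048, so even with 70 regions to choose from, t has to be extremely close to 1 before any of them empties.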
On Tue, Mar 30, 2021 at 03:30:54PM -0400, Johannes Weiner wrote:
> > Eventually, I want to make struct page optional for allocations. It's too
> > small for some things (allocating page tables, for example), and overly
> > large for others (allocating a 2MB page, networking page_pool). I don't
> > want to change its size in the meantime; having a struct page refer to
> > PAGE_SIZE bytes is something that's quite deeply baked in.
>
> Right, I think it's overloaded and it needs to go away from many
> contexts it's used in today.

FYI, one unrelated usage is that in many contexts we use a struct page
and an offset to describe locations for I/O (block layer, networking,
DMA API). With huge pages and merged I/O buffers this representation
actually becomes increasingly painful.

And a little bit back to the topic: I think the folio as in the
current patchset is incredibly useful and something we need like
yesterday to help file systems and the block layer to cope with huge
and compound pages of all sorts. Once willy sends out a new version
with the accumulated fixes I'm ready to ACK the whole thing.
On Tue, Mar 30, 2021 at 10:09:29PM +0100, Matthew Wilcox wrote:
> On Tue, Mar 30, 2021 at 03:30:54PM -0400, Johannes Weiner wrote:
> > Hi Willy,
> >
> > On Mon, Mar 29, 2021 at 05:58:32PM +0100, Matthew Wilcox wrote:
> > > I'm going to respond to some points in detail below, but there are a
> > > couple of overarching themes that I want to bring out up here.
> > >
> > > Grand Vision
> > > ~~~~~~~~~~~~
> > >
> > > I haven't outlined my long-term plan. Partly because it is a _very_
> > > long way off, and partly because I think what I'm doing stands on its
> > > own. But some of the points below bear on this, so I'll do it now.
> > >
> > > Eventually, I want to make struct page optional for allocations. It's too
> > > small for some things (allocating page tables, for example), and overly
> > > large for others (allocating a 2MB page, networking page_pool). I don't
> > > want to change its size in the meantime; having a struct page refer to
> > > PAGE_SIZE bytes is something that's quite deeply baked in.
> >
> > Right, I think it's overloaded and it needs to go away from many
> > contexts it's used in today.
> >
> > I think it describes a real physical thing, though, and won't go away
> > as a concept. More on that below.
>
> I'm at least 90% with you on this, and we're just quibbling over details
> at this point, I think.
>
> > > In broad strokes, I think that having a Power Of Two Allocator
> > > with Descriptor (POTAD) is a useful foundational allocator to have.
> > > The specific allocator that we call the buddy allocator is very clever for
> > > the 1990s, but touches too many cachelines to be good with today's CPUs.
> > > The generalisation of the buddy allocator to the POTAD lets us allocate
> > > smaller quantities (eg a 512 byte block) and allocate descriptors which
> > > differ in size from a struct page. For an extreme example, see xfs_buf
> > > which is 360 bytes and is the descriptor for an allocation between 512
> > > and 65536 bytes.
> > I actually disagree with this rather strongly. If anything, the buddy
> > allocator has turned out to be a pretty poor fit for the foundational
> > allocator.
> >
> > On paper, it is elegant and versatile in serving essentially arbitrary
> > memory blocks. In practice, we mostly just need 4k and 2M chunks from
> > it. And it sucks at the 2M ones because of the fragmentation caused by
> > the ungrouped 4k blocks.
>
> That's a very Intel-centric way of looking at it. Other architectures
> support a multitude of page sizes, from the insane ia64 (4k, 8k, 16k,
> then every power of four up to 4GB) to more reasonable options like
> (4k, 32k, 256k, 2M, 16M, 128M). But we (in software) shouldn't
> constrain ourselves to thinking in terms of what the hardware
> currently supports. Google have data showing that for their
> workloads, 32kB is the goldilocks size. I'm sure for some workloads,
> it's much higher and for others it's lower. But for almost no
> workload is 4kB the right choice any more, and probably hasn't been
> since the late 90s.

You missed my point entirely. It's not about the exact page sizes,
it's about the fragmentation issue when you mix variable-sized blocks
without lifetime grouping.

Anyway, we digressed quite far here. My argument was simply that it's
conceivable we'll switch to a default allocation block and page size
that is larger than the smallest paging size supported by the CPU and
the kernel. (Various architectures might support multiple page sizes,
but once you pick one, that's the smallest quantity the kernel pages.)
That makes "bundle of pages" a short-sighted abstraction, and folio a
poor name for pageable units.
I might be wrong about what happens to PAGE_SIZE eventually (even
though your broader arguments around allocator behavior and
fragmentation don't seem to line up with my observations from
production systems, or the evolution of how we manage allocations of
different sizes) - but you also haven't made a good argument why the
API *should* continue to imply we're dealing with one or more pages.

Yes, it's a bit bikesheddy. But you're proposing an abstraction for
one of the most fundamental data structures in the operating system,
with tens of thousands of instances in almost all core subsystems.
"Bundle of pages (for now) with filesystem data (and maybe anon data
since it's sort of convenient in terms of data structure, for now)"
just doesn't make me go "Yeah, that's it."

I would understand cache_entry for the cache; mem for cache and file
(that discussion trailed off); pageable if we want to imply sizing
and alignment constraints based on the underlying MMU. I would even
prefer kerb, because at least it wouldn't be misleading if we do have
non-struct page backing in the future.

> > The great thing about the slab allocator isn't just that it manages
> > internal fragmentation of the larger underlying blocks. It also groups
> > related objects by lifetime/age and reclaimability, which dramatically
> > mitigates the external fragmentation of the memory space.
> >
> > The buddy allocator on the other hand has no idea what you want that
> > 4k block for, and whether it pairs up well with the 4k block it just
> > handed to somebody else. But the decision it makes in that moment is
> > crucial for its ability to serve larger blocks later on.
> >
> > We do some mobility grouping based on how reclaimable or migratable
> > the memory is, but it's not the full answer.
>
> I don't think that's entirely true. The vast majority of memory in
> any machine is either anonymous or page cache. The problem is that
> right now, all anonymous and page cache allocations are order-0 (...
> or order-9). So the buddy allocator can't know anything useful about
> the pages and will often allocate one order-0 page to the page cache,
> then allocate its buddy to the slab cache in order to allocate the
> radix_tree_node to store the pointer to the page in (ok, radix tree
> nodes come from an order-2 cache, but it still prevents this order-9
> page from being assembled).
>
> If the movable allocations suddenly start being order-3 and order-4,
> the unmovable, unreclaimable allocations are naturally going to group
> down in the lower orders, and we won't have the problem that a single
> dentry blocks the allocation of an entire 2MB page.

I don't follow what you're saying here.

> > A variable size allocator without object type grouping will always
> > have difficulties producing anything but the smallest block size after
> > some uptime. It's inherently flawed that way.
>
> I think our buddy allocator is flawed, to be sure, but only because
> it doesn't handle movable hints more aggressively. For example, at
> the point that a largeish block gets a single non-movable allocation,
> all the movable allocations within that block should be migrated out.
> If the offending allocation is freed quickly, it all collapses into a
> large, useful chunk, or if not, then it provides a sponge to soak up
> other non-movable allocations.

The object type implies aging rules and typical access patterns that
are not going to be captured purely by migratability. As such, the
migratetype alone will always perform worse than full type grouping.

E.g. a burst of inodes and dentries allocations can claim a large
number of blocks from movable to reclaimable, which will then also be
used to serve concurrent allocations of a different type that may have
much longer lifetimes. After the inodes and dentries disappear again,
you're stuck with very sparsely populated reclaimable blocks.
They can still be reclaimed, but they won't free up as easily as a
contiguous run of bulk-aged inodes and dentries.

You also cannot easily move reclaimable objects out of the block when
an unmovable allocation claims it the same way, so this is sort of a
moot proposal anyway.

The slab allocator isn't a guarantee, but I don't see why you're
arguing we should leave additional lifetime/usage hints on the table.

> > As drivers go, I think there are slightly different requirements to
> > filesystems, too. For filesystems, when the VM can finally do it (and
> > the file range permits it), I assume we want to rather transparently
> > increase the unit of data transfer from 4k to 2M. Most drivers that
> > currently hardcode alloc_page() or PAGE_SIZE OTOH probably don't want
> > us to bump their allocation sizes.
>
> If you take a look at my earlier work, you'll see me using a range of
> sizes in the page cache, starting at 16kB and gradually increasing to
> (theoretically) 2MB, although the algorithm tended to top out around
> 256kB. Doing particularly large reads could see 512kB/1MB reads, but
> it was very hard to hit 2MB in practice. I wasn't too concerned at the
> time, but my point is that we do want to automatically tune the size
> of the allocation unit to the workload. An application which reads in
> 64kB chunks is giving us a pretty clear signal that they want to
> manage memory in 64kB chunks.

You missed my point here, but it sounds like we agree that drivers who
just want a fixed buffer should not use the same type that filesystems
use for dynamic paging units.

> > > It'd probably be better to have the dcache realise that its old
> > > entries aren't useful any more and age them out instead of relying
> > > on memory pressure to remove old entries, so this is probably an
> > > unnecessary digression.
> >
> > It's difficult to identify a universally acceptable line for
> > usefulness of caches other than physical memory pressure.
> > The good thing about the memory pressure threshold is that you KNOW
> > somebody else has immediate use for the memory, and you're justified
> > in recycling and reallocating caches from the cold end.
> >
> > Without that, you'd either have to set an arbitrary size cutoff or an
> > arbitrary aging cutoff (not used in the last minute e.g.). But optimal
> > settings for either of those depend on the workload, and aren't very
> > intuitive to configure.
>
> For the dentry cache, I think there is a more useful metric, and
> that's length of the hash chain. If it gets too long, we're spending
> more time walking it than we're saving by having entries cached.
> Starting reclaim based on "this bucket of the dcache has twenty
> entries in it" would probably work quite well.

That might work for this cache, but it's not a generic solution to
fragmentation caused by cache positions building in the absence of
memory pressure.

> > Our levels of internal fragmentation are historically low, which of
> > course is nice by itself. But that's also what's causing problems in
> > the form of external fragmentation, and why we struggle to produce 2M
> > blocks. It's multitudes easier to free one 2M slab page of
> > consecutively allocated inodes than it is to free 512 batches of
> > different objects with conflicting lifetimes, ages, or potentially
> > even reclaimability.
>
> Unf. I don't think freeing 2MB worth of _anything_ is ever going to be
> easy enough to rely on. My actual root filesystem:
>
> xfs_inode 143134 144460 1024 32 8 : tunables 0 0 0 : slabdata 4517 4517 0
>
> So we'd have to be able to free 2048 of those 143k inodes, and they
> all have to be consecutive (and aligned). I suppose we could model
> that and try to work out how many we'd have to be able to free in
> order to get all 2048 in any page free, but I bet it's a variant of
> the Birthday Paradox, and we'd find it's something crazy like half of
> them.

How is it different than freeing a 4k page in 1995?
The descriptor size itself may not have scaled at the same rate as
overall memory size. But that also means the cache position itself is
much less a concern in terms of memory consumed and fragmented. Case
in point, this is 141M. Yes, probably with a mixture of some hot and a
long tail of cold entries. It's not really an interesting reclaim
target.

When slab cache positions become a reclaim concern, it's usually when
they spike due to a change in the workload. And then you tend to get
contiguous runs of objects with a similar age.

> Without slab gaining the ability to ask users to relocate allocations,
> I think any memory sent to slab is never coming back.

Not sure what data you're basing this on.

> So ... even if I accept every part of your vision as the way things
> are going to be, I think the folio patchset I have now is a step in the
> right direction. I'm going to send a v6 now and hope it's not too late
> for this merge window.

I don't think folio as an abstraction is cooked enough to replace such
a major part of the kernel with it. So I'm against merging it now.

I would really like to see a better definition of what it actually
represents, instead of a fluid combination of implementation details
and conveniences.
On Wed, Mar 31, 2021 at 02:14:00PM -0400, Johannes Weiner wrote:
> Anyway, we digressed quite far here. My argument was simply that it's
> conceivable we'll switch to a default allocation block and page size
> that is larger than the smallest paging size supported by the CPU and
> the kernel. (Various architectures might support multiple page sizes,
> but once you pick one, that's the smallest quantity the kernel pages.)

We've had several attempts in the past to make 'struct page' refer to
a different number of bytes than the-size-of-a-single-pte, and they've
all failed in one way or another. I don't think changing PAGE_SIZE to
any other size is reasonable. Maybe we have a larger allocation unit
in the future, maybe we do something else, but that should have its
own name, not 'struct page'.

I think the shortest path to getting what you want is having a
superpage allocator that the current page allocator can allocate from.
When a superpage is allocated from the superpage allocator, we
allocate an array of struct pages for it.

> I don't think folio as an abstraction is cooked enough to replace such
> a major part of the kernel with it. So I'm against merging it now.
>
> I would really like to see a better definition of what it actually
> represents, instead of a fluid combination of implementation details
> and conveniences.

Here's the current kernel-doc for it:

/**
 * struct folio - Represents a contiguous set of bytes.
 * @flags: Identical to the page flags.
 * @lru: Least Recently Used list; tracks how recently this folio was used.
 * @mapping: The file this page belongs to, or refers to the anon_vma for
 *    anonymous pages.
 * @index: Offset within the file, in units of pages. For anonymous pages,
 *    this is the index from the beginning of the mmap.
 * @private: Filesystem per-folio data (see attach_folio_private()).
 *    Used for swp_entry_t if FolioSwapCache().
 * @_mapcount: How many times this folio is mapped to userspace. Use
 *    folio_mapcount() to access it.
 * @_refcount: Number of references to this folio. Use folio_ref_count()
 *    to read it.
 * @memcg_data: Memory Control Group data.
 *
 * A folio is a physically, virtually and logically contiguous set
 * of bytes. It is a power-of-two in size, and it is aligned to that
 * same power-of-two. It is at least as large as %PAGE_SIZE. If it is
 * in the page cache, it is at a file offset which is a multiple of that
 * power-of-two.
 */
struct folio {
	/* private: don't document the anon union */
	union {
		struct {
			/* public: */
			unsigned long flags;
			struct list_head lru;
			struct address_space *mapping;
			pgoff_t index;
			unsigned long private;
			atomic_t _mapcount;
			atomic_t _refcount;
#ifdef CONFIG_MEMCG
			unsigned long memcg_data;
#endif
			/* private: the union with struct page is transitional */
		};
		struct page page;
	};
};
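The transitional union above is only sound if every folio field lands at exactly the same offset as its struct page counterpart, and that property can be checked at compile time. Below is a hedged userspace sketch of that idea with simplified stand-in types (the kernel's real struct page is a much larger union, and the `FOLIO_MATCH` macro name here is my own shorthand, not necessarily what the patchset uses):

```c
#include <assert.h>
#include <stddef.h>

/* Userspace stand-ins for the kernel types -- illustrative only. */
struct list_head { struct list_head *next, *prev; };
typedef struct { int counter; } atomic_t;
typedef unsigned long pgoff_t;

struct page {
	unsigned long flags;
	struct list_head lru;
	void *mapping;
	pgoff_t index;
	unsigned long private;
	atomic_t _mapcount;
	atomic_t _refcount;
};

struct folio {
	union {
		struct {
			unsigned long flags;
			struct list_head lru;
			void *mapping;
			pgoff_t index;
			unsigned long private;
			atomic_t _mapcount;
			atomic_t _refcount;
		};
		struct page page;
	};
};

/* The union is only safe if each folio field sits exactly on its
 * struct page counterpart; let the compiler prove it. */
#define FOLIO_MATCH(f) \
	_Static_assert(offsetof(struct folio, f) == offsetof(struct page, f), \
		       "folio field offset mismatch")

FOLIO_MATCH(flags);
FOLIO_MATCH(lru);
FOLIO_MATCH(mapping);
FOLIO_MATCH(index);
FOLIO_MATCH(private);
FOLIO_MATCH(_mapcount);
FOLIO_MATCH(_refcount);
```

If anyone reorders struct page without updating struct folio, the build fails rather than silently corrupting memory through the union.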
On Tue, Mar 30, 2021 at 10:09:29PM +0100, Matthew Wilcox wrote:
> That's a very Intel-centric way of looking at it. Other architectures
> support a multitude of page sizes, from the insane ia64 (4k, 8k, 16k,
> then every power of four up to 4GB) to more reasonable options like
> (4k, 32k, 256k, 2M, 16M, 128M). But we (in software) shouldn't
> constrain ourselves to thinking in terms of what the hardware
> currently supports. Google have data showing that for their
> workloads, 32kB is the goldilocks size. I'm sure for some workloads,
> it's much higher and for others it's lower. But for almost no
> workload is 4kB the right choice any more, and probably hasn't been
> since the late 90s.

Out of curiosity I looked at the distribution of file sizes in the
kernel tree:

71455 files total

0--4Kb       36702
4--8Kb       11820
8--16Kb      10066
16--32Kb      6984
32--64Kb      3804
64--128Kb     1498
128--256Kb     393
256--512Kb     108
512Kb--1Mb      35
1--2Mb          25
2--4Mb           5
4--6Mb           7
6--8Mb           4
12Mb             2
14Mb             1
16Mb             1

... incidentally, everything bigger than 1.2Mb lives^Wshambles under
drivers/gpu/drm/amd/include/asic_reg/

Page size    Footprint
4Kb          1128Mb
8Kb          1324Mb
16Kb         1764Mb
32Kb         2739Mb
64Kb         4832Mb
128Kb        9191Mb
256Kb        18062Mb
512Kb        35883Mb
1Mb          71570Mb
2Mb          142958Mb

So for kernel builds (as well as grep over the tree, etc.) uniform 2Mb
pages would be... interesting.
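The footprint column is simple internal-fragmentation arithmetic: each file's size rounded up to the page size, summed over all files. A small sketch of that computation (the function name and the sizes in the test are made up for illustration; zero-length files cost nothing in this model):

```c
#include <assert.h>
#include <stddef.h>

/* Total cache footprint of a set of files when each file's tail is
 * padded out to a full page -- the computation behind the table
 * above. */
static unsigned long long cache_footprint(const unsigned long long *file_sizes,
					  size_t nfiles,
					  unsigned long long page_size)
{
	unsigned long long total = 0;
	size_t i;

	for (i = 0; i < nfiles; i++)
		total += (file_sizes[i] + page_size - 1) / page_size * page_size;
	return total;
}
```

With half the tree under 4kB, every doubling of the page size nearly doubles the total, which is exactly the ratio progression visible in the table.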
On Thu, Apr 01, 2021 at 05:05:37AM +0000, Al Viro wrote:
> On Tue, Mar 30, 2021 at 10:09:29PM +0100, Matthew Wilcox wrote:
>
> > That's a very Intel-centric way of looking at it. Other architectures
> > support a multitude of page sizes, from the insane ia64 (4k, 8k, 16k,
> > then every power of four up to 4GB) to more reasonable options like
> > (4k, 32k, 256k, 2M, 16M, 128M). But we (in software) shouldn't
> > constrain ourselves to thinking in terms of what the hardware
> > currently supports. Google have data showing that for their
> > workloads, 32kB is the goldilocks size. I'm sure for some workloads,
> > it's much higher and for others it's lower. But for almost no
> > workload is 4kB the right choice any more, and probably hasn't been
> > since the late 90s.
>
> Out of curiosity I looked at the distribution of file sizes in the
> kernel tree:
>
> 71455 files total
>
> 0--4Kb       36702
> 4--8Kb       11820
> 8--16Kb      10066
> 16--32Kb      6984
> 32--64Kb      3804
> 64--128Kb     1498
> 128--256Kb     393
> 256--512Kb     108
> 512Kb--1Mb      35
> 1--2Mb          25
> 2--4Mb           5
> 4--6Mb           7
> 6--8Mb           4
> 12Mb             2
> 14Mb             1
> 16Mb             1
>
> ... incidentally, everything bigger than 1.2Mb lives^Wshambles under
> drivers/gpu/drm/amd/include/asic_reg/

I'm just going to edit this table to add a column indicating ratio to
previous size:

> Page size    Footprint
> 4Kb          1128Mb
> 8Kb          1324Mb     1.17
> 16Kb         1764Mb     1.33
> 32Kb         2739Mb     1.55
> 64Kb         4832Mb     1.76
> 128Kb        9191Mb     1.90
> 256Kb        18062Mb    1.96
> 512Kb        35883Mb    1.98
> 1Mb          71570Mb    1.994
> 2Mb          142958Mb   1.997
>
> So for kernel builds (as well as grep over the tree, etc.) uniform 2Mb
> pages would be... interesting.

Yep, that's why I opted for a "start out slowly and let readahead tell
me when to increase the page size" approach. I think Johannes' real
problem is that slab and page cache / anon pages are getting
intermingled.
We could solve this by having slab allocate 2MB pages from the page allocator and then split them up internally (so not all of that 2MB necessarily goes to a single slab cache, but all of that 2MB goes to some slab cache).
On Thu, Apr 01, 2021 at 05:05:37AM +0000, Al Viro wrote:
> On Tue, Mar 30, 2021 at 10:09:29PM +0100, Matthew Wilcox wrote:
>
> > That's a very Intel-centric way of looking at it. Other architectures
> > support a multitude of page sizes, from the insane ia64 (4k, 8k, 16k,
> > then every power of four up to 4GB) to more reasonable options like
> > (4k, 32k, 256k, 2M, 16M, 128M). But we (in software) shouldn't
> > constrain ourselves to thinking in terms of what the hardware
> > currently supports. Google have data showing that for their
> > workloads, 32kB is the goldilocks size. I'm sure for some workloads,
> > it's much higher and for others it's lower. But for almost no
> > workload is 4kB the right choice any more, and probably hasn't been
> > since the late 90s.
>
> Out of curiosity I looked at the distribution of file sizes in the
> kernel tree:
>
> 71455 files total
>
> 0--4Kb       36702
> 4--8Kb       11820
> 8--16Kb      10066
> 16--32Kb      6984
> 32--64Kb      3804
> 64--128Kb     1498
> 128--256Kb     393
> 256--512Kb     108
> 512Kb--1Mb      35
> 1--2Mb          25
> 2--4Mb           5
> 4--6Mb           7
> 6--8Mb           4
> 12Mb             2
> 14Mb             1
> 16Mb             1
>
> ... incidentally, everything bigger than 1.2Mb lives^Wshambles under
> drivers/gpu/drm/amd/include/asic_reg/
>
> Page size    Footprint
> 4Kb          1128Mb
> 8Kb          1324Mb
> 16Kb         1764Mb
> 32Kb         2739Mb
> 64Kb         4832Mb
> 128Kb        9191Mb
> 256Kb        18062Mb
> 512Kb        35883Mb
> 1Mb          71570Mb
> 2Mb          142958Mb
>
> So for kernel builds (as well as grep over the tree, etc.) uniform 2Mb
> pages would be... interesting.

Right, I don't see us getting rid of 4k cache entries anytime soon.
Even 32k pages would double the footprint here.

The issue is just that at the other end of the spectrum we have IO
devices that do 10GB/s, which corresponds to 2.6 million pages per
second. At such data rates we are currently CPU-limited because of the
pure transaction overhead in page reclaim. Workloads like this tend to
use much larger files, and would benefit from a larger paging unit.
Likewise, most production workloads in cloud servers have enormous
anonymous regions and large executables that greatly benefit from
fewer page table levels and bigger TLB entries.

Today, fragmentation prevents the page allocator from producing 2MB
blocks at a satisfactory rate and allocation latency. It's not
feasible to allocate 2M inside page faults for example; getting huge
page coverage for the page cache will be even more difficult.

I'm not saying we should get rid of 4k cache entries. Rather, I'm
wondering out loud whether longer-term we'd want to change the default
page size to 2M, and implement the 4k cache entries, which we clearly
continue to need, with a slab style allocator on top. The idea being
that it'll do a better job at grouping cache entries with other cache
entries of a similar lifetime than the untyped page allocator does
naturally, and so make fragmentation a whole lot more manageable.

(I'm using x86 page sizes as examples because they matter to me. But
there is an architecture independent discrepancy between the smallest
cache entries we must continue to support, and larger blocks / huge
pages that we increasingly rely on as first class pages.)
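The "slab style allocator for 4k cache entries on top of 2M blocks" idea can be sketched as a toy bitmap suballocator. Everything here is a hypothetical design to illustrate the lifetime-grouping argument, not an existing kernel interface; the payoff is the last function: an emptied block can be handed back to the buddy allocator whole, as one contiguous, aligned 2MB unit.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Toy model: 4k cache entries carved out of one 2M block, so entries
 * allocated together -- and thus likely to die together -- share one
 * physically contiguous block.  Hypothetical design for illustration. */
#define BLOCK_SHIFT 21		/* 2MB block */
#define ENTRY_SHIFT 12		/* 4kB entry */
#define ENTRIES_PER_BLOCK (1u << (BLOCK_SHIFT - ENTRY_SHIFT))	/* 512 */
#define BITS_PER_WORD (8 * sizeof(unsigned long))

struct entry_block {
	unsigned char *mem;	/* 2MB of backing memory */
	unsigned long used[ENTRIES_PER_BLOCK / (8 * sizeof(unsigned long))];
	unsigned int nr_used;
};

static void *entry_alloc(struct entry_block *b)
{
	unsigned int i;

	for (i = 0; i < ENTRIES_PER_BLOCK; i++) {
		unsigned long *w = &b->used[i / BITS_PER_WORD];
		unsigned long bit = 1UL << (i % BITS_PER_WORD);

		if (!(*w & bit)) {
			*w |= bit;
			b->nr_used++;
			return b->mem + ((size_t)i << ENTRY_SHIFT);
		}
	}
	return NULL;		/* block full */
}

static void entry_free(struct entry_block *b, void *p)
{
	size_t i = (size_t)((unsigned char *)p - b->mem) >> ENTRY_SHIFT;

	b->used[i / BITS_PER_WORD] &= ~(1UL << (i % BITS_PER_WORD));
	b->nr_used--;
}

/* An empty block can go back to the buddy allocator as a whole 2MB
 * unit -- no hunting for 512 scattered, individually-owned pages. */
static int block_reclaimable(const struct entry_block *b)
{
	return b->nr_used == 0;
}
```

A real design would of course need per-block locking, a partial-block freelist, and a policy for which callers share a block; the sketch only shows why whole-block reclaim becomes trivial once entries are grouped.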