Message ID: 20220729014041.21292-1-peterx@redhat.com (mailing list archive)
Series: mm: Remember young bit for migration entries
On Jul 28, 2022, at 6:40 PM, Peter Xu <peterx@redhat.com> wrote:

> [Marking as RFC; only x86 is supported for now, plan to add a few more
> archs when there's a formal version]
>
> Problem
> =======
>
> When migrating a page, right now we always mark the migrated page as old.
> The reason could be that we don't really know whether the page is hot or
> cold, so we take the conservative default, assuming that is safer.

Looks good to me.

I just wonder whether the order of the patches should be different. I always
understood that separating the “enabling” patch from the others is not good
practice, since it complicates bisection. I guess it is more of a minor issue
for such a small patch set…
On Fri, Jul 29, 2022 at 10:07:02AM -0700, Nadav Amit wrote:
> On Jul 28, 2022, at 6:40 PM, Peter Xu <peterx@redhat.com> wrote:
>
> > [Marking as RFC; only x86 is supported for now, plan to add a few more
> > archs when there's a formal version]
> >
> > [...]
>
> Looks good to me.

Thanks for the quick review comment, Nadav.

> I just wonder whether the order of the patches should be different. I always
> understood that separating the “enabling” patch from the others is not good
> practice, since it complicates bisection. I guess it is more of a minor issue
> for such a small patch set…

Yeah, I guess you mean that when a bunch of patches form one feature, we may
want to be able to tell which part of the feature breaks something.  As you
mentioned, though, this feature is mostly implemented in patch 2 only.

I can squash the enablement patch into the same patch, but when it comes to
more archs that also means squashing all the archs into that one patch.  I'm
afraid it would complicate that patch too much - I'd expect the swp offset
calculation for each arch to be not entirely straightforward, so it would be
good if they can be reviewed separately and carefully.
Peter Xu <peterx@redhat.com> writes:

> [Marking as RFC; only x86 is supported for now, plan to add a few more
> archs when there's a formal version]
>
> Problem
> =======
>
> When migrating a page, right now we always mark the migrated page as old.
> The reason could be that we don't really know whether the page is hot or
> cold, so we take the conservative default, assuming that is safer.
>
> However, that can lead to at least two problems:
>
> (1) We lose the real hot/cold information even though we could have
>     preserved it.  That information shouldn't change just because the
>     backing page is changed after the migration.
>
> (2) There is always extra overhead on the first access to any migrated
>     page, because the hardware MMU needs extra cycles to set the young bit
>     again (on MMUs that support it).
>
> Much of the recent upstream work has shown that (2) is not trivial and is
> actually very measurable.  In my test case, reading a 1G chunk of memory -
> jumping in page-size intervals - could take 99ms on a generic x86_64
> system just because of the extra work of setting the young bit, compared
> to 4ms with the young bit already set.

LKP has observed that before too, as in the following reports and discussion.

https://lore.kernel.org/all/87bn35zcko.fsf@yhuang-dev.intel.com/t/

Best Regards,
Huang, Ying

> This issue was originally reported by Andrea Arcangeli.
>
> Solution
> ========
>
> To solve this problem, this patchset tries to remember the young bit in
> the migration entries and carry it over when restoring the ptes.
>
> We have the chance to do so because on many systems the swap offset is not
> fully used.  Migration entries use the swp offset to store the PFN only,
> and the PFN normally needs fewer bits than the swp offset provides.  That
> means there are free bits in the swp offset that we can use to store
> things like the young bit, and that is how this series approaches the
> problem.
>
> One tricky thing here is that even though we are embedding the information
> into the swap entry, which seems to be a very generic data structure, the
> number of free bits is still arch dependent.  That is not only because the
> size of swp_entry_t differs, but also because of the different layouts of
> swap ptes on different archs.
>
> Here, this series requires each arch to define an extra macro called
> __ARCH_SWP_OFFSET_BITS, representing the size of the swp offset.  With
> this information, the swap logic knows whether there are extra bits to
> use, and if so it remembers the young bit when possible.  By default, it
> keeps the old behavior of leaving all migrated pages cold.
>
> Tests
> =====
>
> With the patchset applied, the immediate read-access test [1] of the 1G
> chunk above shrinks from 99ms to 4ms after migration.  The test moves 1G
> of pages from node 0->1->0 and then reads it in page-size jumps.
>
> Currently __ARCH_SWP_OFFSET_BITS is only defined on x86 for this series,
> and it is only tested on x86_64 with an Intel(R) Xeon(R) CPU E5-2630 v4 @
> 2.20GHz.
>
> Patch Layout
> ============
>
> Patch 1: Add swp_offset_pfn() and apply it to all pfn swap entries; we
>          should also stop treating swp_offset() as a PFN because it can
>          carry more information starting from the next patch.
> Patch 2: The core patch to remember the young bit in swap offsets.
> Patch 3: A cleanup for the x86 32-bit pgtable.h.
> Patch 4: Define __ARCH_SWP_OFFSET_BITS on x86, enabling the young bit for
>          migration.
>
> Please review, thanks.
>
> [1] https://github.com/xzpeter/clibs/blob/master/misc/swap-young.c
>
> Peter Xu (4):
>   mm/swap: Add swp_offset_pfn() to fetch PFN from swap entry
>   mm: Remember young bit for page migrations
>   mm/x86: Use SWP_TYPE_BITS in 3-level swap macros
>   mm/x86: Define __ARCH_SWP_OFFSET_BITS
>
>  arch/arm64/mm/hugetlbpage.c           |  2 +-
>  arch/x86/include/asm/pgtable-2level.h |  6 ++
>  arch/x86/include/asm/pgtable-3level.h | 15 +++--
>  arch/x86/include/asm/pgtable_64.h     |  5 ++
>  include/linux/swapops.h               | 85 +++++++++++++++++++++++++--
>  mm/hmm.c                              |  2 +-
>  mm/huge_memory.c                      | 10 +++-
>  mm/memory-failure.c                   |  2 +-
>  mm/migrate.c                          |  4 +-
>  mm/migrate_device.c                   |  2 +
>  mm/page_vma_mapped.c                  |  6 +-
>  mm/rmap.c                             |  3 +-
>  12 files changed, 122 insertions(+), 20 deletions(-)
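For readers who want a feel for the measurement described above, a rough
userspace approximation of the linked swap-young.c test is sketched below.
This is not Peter's actual test program: the buffer size, the omitted
node-to-node migration step (e.g. via move_pages(2)), and the timing details
are assumptions made purely for illustration.

/*
 * Rough sketch of the measurement: touch one byte per page across a 1G
 * buffer and time it.  In the real test the buffer is migrated between
 * NUMA nodes right before the timed loop; that step is omitted here.
 */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>
#include <unistd.h>

int main(void)
{
	size_t size = 1UL << 30;              /* 1G buffer, as in the cover letter */
	long psize = sysconf(_SC_PAGESIZE);
	volatile char *buf = malloc(size);
	struct timespec t0, t1;
	unsigned long sum = 0;

	if (!buf)
		return 1;
	memset((void *)buf, 1, size);         /* fault everything in first */

	/* <-- in the real test, migrate the buffer to another node here --> */

	clock_gettime(CLOCK_MONOTONIC, &t0);
	for (size_t off = 0; off < size; off += psize)
		sum += buf[off];              /* one read per page */
	clock_gettime(CLOCK_MONOTONIC, &t1);

	printf("read took %ld us (sum=%lu)\n",
	       (t1.tv_sec - t0.tv_sec) * 1000000L +
	       (t1.tv_nsec - t0.tv_nsec) / 1000L, sum);
	return 0;
}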
Peter Xu <peterx@redhat.com> writes:

> [Marking as RFC; only x86 is supported for now, plan to add a few more
> archs when there's a formal version]
>
> [...]
>
> Solution
> ========
>
> To solve this problem, this patchset tries to remember the young bit in
> the migration entries and carry it over when restoring the ptes.
>
> We have the chance to do so because on many systems the swap offset is not
> fully used.  Migration entries use the swp offset to store the PFN only,
> and the PFN normally needs fewer bits than the swp offset provides.  That
> means there are free bits in the swp offset that we can use to store
> things like the young bit, and that is how this series approaches the
> problem.
>
> One tricky thing here is that even though we are embedding the information
> into the swap entry, which seems to be a very generic data structure, the
> number of free bits is still arch dependent.  That is not only because the
> size of swp_entry_t differs, but also because of the different layouts of
> swap ptes on different archs.

If my understanding is correct, max_swapfile_size() provides a mechanism to
identify the available bits with both swp_entry_t and the swap PTE
considered.  We may take advantage of that?

And according to commit 377eeaa8e11f ("x86/speculation/l1tf: Limit swap file
size to MAX_PA/2"), the highest bit of the swap offset needs to be 0 if the
L1TF mitigation is enabled.

Cc'ed Andi for confirmation.

Best Regards,
Huang, Ying

> [...]
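As a back-of-the-envelope illustration of why spare offset bits tend to exist
at all, and how an L1TF-style cap of MAX_PA/2 costs one of them, consider the
small sketch below. The 52/46/12 bit widths are assumed example values for an
x86_64-like machine; they are not numbers taken from the patches or from
max_swapfile_size().

#include <stdio.h>

int main(void)
{
	int swp_offset_bits = 52;   /* assumed width of the swap-pte offset field */
	int max_pa_bits     = 46;   /* assumed MAX_PHYSMEM_BITS on this machine */
	int page_shift      = 12;   /* 4K pages */
	int pfn_bits        = max_pa_bits - page_shift;

	/* Migration entries only need a PFN in the offset field ... */
	printf("spare offset bits: %d\n", swp_offset_bits - pfn_bits);

	/*
	 * ... and an L1TF-style limit that keeps the top offset bit zero
	 * (swap capped at MAX_PA/2) removes one usable bit.
	 */
	printf("spare bits with top bit reserved: %d\n",
	       swp_offset_bits - 1 - pfn_bits);
	return 0;
}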
On 29.07.22 03:40, Peter Xu wrote:
> [Marking as RFC; only x86 is supported for now, plan to add a few more
> archs when there's a formal version]
>
> [...]
>
> Solution
> ========
>
> To solve this problem, this patchset tries to remember the young bit in
> the migration entries and carry it over when restoring the ptes.
>
> We have the chance to do so because on many systems the swap offset is not
> fully used.  Migration entries use the swp offset to store the PFN only,
> and the PFN normally needs fewer bits than the swp offset provides.  That
> means there are free bits in the swp offset that we can use to store
> things like the young bit, and that is how this series approaches the
> problem.
>
> One tricky thing here is that even though we are embedding the information
> into the swap entry, which seems to be a very generic data structure, the
> number of free bits is still arch dependent.  That is not only because the
> size of swp_entry_t differs, but also because of the different layouts of
> swap ptes on different archs.
>
> Here, this series requires each arch to define an extra macro called
> __ARCH_SWP_OFFSET_BITS, representing the size of the swp offset.  With
> this information, the swap logic knows whether there are extra bits to
> use, and if so it remembers the young bit when possible.  By default, it
> keeps the old behavior of leaving all migrated pages cold.

I played with a similar idea when working on pte_swp_exclusive() but gave
up, because it ended up looking too hacky.  Looking at patch #2, I get the
same feeling again.  Kind of hacky.

If we mostly only care about x86_64, and it's a performance improvement
after all, why not simply do it like
pte_swp_mkexclusive()/pte_swp_exclusive() ... and reuse a spare PTE bit?
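For reference, the pattern being suggested here - dedicating a spare software
bit in the non-present PTE, in the style of
pte_swp_mkexclusive()/pte_swp_exclusive(), instead of packing state into the
offset - would look roughly like the sketch below. The bit position and the
pte_swp_mkyoung()/pte_swp_young() names are hypothetical and are not existing
kernel interfaces; pteval_t is a plain integer stand-in so the example stays
self-contained.

#include <stdint.h>
#include <stdio.h>

typedef uint64_t pteval_t;                 /* stand-in for the arch pte type */

#define _PAGE_BIT_SWP_YOUNG  4             /* hypothetical spare software bit */
#define _PAGE_SWP_YOUNG      ((pteval_t)1 << _PAGE_BIT_SWP_YOUNG)

/* Mirrors the pte_swp_mkexclusive()/pte_swp_exclusive() style of helpers. */
static pteval_t pte_swp_mkyoung(pteval_t pte)
{
	return pte | _PAGE_SWP_YOUNG;
}

static int pte_swp_young(pteval_t pte)
{
	return !!(pte & _PAGE_SWP_YOUNG);
}

int main(void)
{
	pteval_t pte = 0;

	pte = pte_swp_mkyoung(pte);
	printf("young: %d\n", pte_swp_young(pte));
	return 0;
}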
On Mon, Aug 01, 2022 at 01:33:28PM +0800, Huang, Ying wrote:
> If my understanding is correct, max_swapfile_size() provides a mechanism
> to identify the available bits with both swp_entry_t and the swap PTE
> considered.  We may take advantage of that?

That's an interesting trick, I'll have a closer look, thanks for the
pointer!

> And according to commit 377eeaa8e11f ("x86/speculation/l1tf: Limit swap
> file size to MAX_PA/2"), the highest bit of the swap offset needs to be 0
> if the L1TF mitigation is enabled.
>
> Cc'ed Andi for confirmation.

Yeah, it'll be great to have a confirmation; hopefully max_swapfile_size()
already covers that case.
On Mon, Aug 01, 2022 at 10:21:32AM +0200, David Hildenbrand wrote:
> On 29.07.22 03:40, Peter Xu wrote:
> > [Marking as RFC; only x86 is supported for now, plan to add a few more
> > archs when there's a formal version]
> >
> > [...]
>
> I played with a similar idea when working on pte_swp_exclusive() but gave
> up, because it ended up looking too hacky.  Looking at patch #2, I get the
> same feeling again.  Kind of hacky.

Could you explain what the "hacky" part is?

I used the swap entry to avoid per-arch operations.  I failed to figure out
a common way to know the swp offset length myself, so unluckily in this RFC
I still needed one macro per arch.  Ying's suggestion seems like a good fit
to remove the last arch-specific dependency.

> If we mostly only care about x86_64, and it's a performance improvement
> after all, why not simply do it like
> pte_swp_mkexclusive()/pte_swp_exclusive() ... and reuse a spare PTE bit?

Page migration works for most archs; I want to have it work for all archs
that can easily benefit from it.

Besides, I actually have a question on the anon exclusive bit in the swap
pte: since we have that anyway, why do we need a specific migration type for
anon exclusive pages?  Can they simply be read migration entries with the
anon exclusive bit set?

Thanks,
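To make the question concrete: the difference being discussed is roughly
between encoding exclusivity as its own migration swap type versus carrying
it as a flag next to a single read-migration type. The sketch below is purely
illustrative; it does not use the kernel's real swp_entry_t encoding or type
names.

#include <stdio.h>

/* Illustrative only: today, exclusivity is a separate migration type ... */
enum mig_type { MIG_READ, MIG_READ_EXCLUSIVE, MIG_WRITE };

/* ... the question is whether it could instead be MIG_READ plus a flag. */
struct mig_entry {
	enum mig_type type;
	unsigned long pfn;
	unsigned int  exclusive:1;   /* would live in a spare pte/offset bit */
};

int main(void)
{
	struct mig_entry a = { MIG_READ_EXCLUSIVE, 0x1234, 0 };  /* type-based */
	struct mig_entry b = { MIG_READ,           0x1234, 1 };  /* flag-based */

	printf("type-based: type=%d  flag-based: type=%d exclusive=%d\n",
	       a.type, b.type, b.exclusive);
	return 0;
}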
On 02.08.22 00:35, Peter Xu wrote:
> On Mon, Aug 01, 2022 at 10:21:32AM +0200, David Hildenbrand wrote:
>> On 29.07.22 03:40, Peter Xu wrote:
>>> [...]
>>
>> I played with a similar idea when working on pte_swp_exclusive() but
>> gave up, because it ended up looking too hacky.  Looking at patch #2, I
>> get the same feeling again.  Kind of hacky.
>
> Could you explain what the "hacky" part is?

SWP_PFN_OFFSET_FREE_BITS :)

It's a PFN offset and we're mangling in random other bits.  That's hacky
IMHO.

I played with the idea of converting all code to store bits in addition to
the type + offset.  But that requires digging through a lot of arch code to
teach it about additional flags, so I discarded that idea when working on
the COW fixes.

> I used the swap entry to avoid per-arch operations.  I failed to figure
> out a common way to know the swp offset length myself, so unluckily in
> this RFC I still needed one macro per arch.  Ying's suggestion seems like
> a good fit to remove the last arch-specific dependency.

Instead of mangling this into the PFN offset and letting the arch tell you
which bits of the PFN offset are unused ... rather remove the bits from the
offset and define them manually to have a certain meaning.  That's exactly
how pte_swp_mkexclusive()/pte_swp_exclusive() is supposed to be handled on
architectures that want to support it.

I hope that makes it clearer what the hacky part is IMHO :)

>> If we mostly only care about x86_64, and it's a performance improvement
>> after all, why not simply do it like
>> pte_swp_mkexclusive()/pte_swp_exclusive() ... and reuse a spare PTE bit?
>
> Page migration works for most archs; I want to have it work for all archs
> that can easily benefit from it.

Yet we only care about x86-64 IIUC regarding performance, just the way the
dirty bit is handled?

> Besides, I actually have a question on the anon exclusive bit in the swap
> pte: since we have that anyway, why do we need a specific migration type
> for anon exclusive pages?  Can they simply be read migration entries with
> the anon exclusive bit set?

Not before all archs support pte_swp_mkexclusive()/pte_swp_exclusive().

As pte_swp_mkexclusive()/pte_swp_exclusive() only applies to actual swap
PTEs, you could even reuse that bit for migration entries and get at least
the most relevant 64-bit architectures supported easily.
On Tue, Aug 02, 2022 at 02:06:37PM +0200, David Hildenbrand wrote:
> On 02.08.22 00:35, Peter Xu wrote:
> > Could you explain what the "hacky" part is?
>
> SWP_PFN_OFFSET_FREE_BITS :)
>
> It's a PFN offset and we're mangling in random other bits.  That's hacky
> IMHO.
>
> I played with the idea of converting all code to store bits in addition
> to the type + offset.  But that requires digging through a lot of arch
> code to teach it about additional flags, so I discarded that idea when
> working on the COW fixes.

Having SWP_PFN_OFFSET_FREE_BITS was the cleanest approach I could think of
before I knew about the max_swapfile_size() trick.  It only needs the arch
to define that one macro, and swapops.h takes action accordingly.

OTOH, I don't want to use swap pte bits, not only because there aren't many
left on some archs (x86_64 only has bit 4 left, afaict), but also because
this is a page migration issue, so it would be nicer to me if it can be
solved at the swp entry level, not the pte level.

> > I used the swap entry to avoid per-arch operations.  I failed to figure
> > out a common way to know the swp offset length myself, so unluckily in
> > this RFC I still needed one macro per arch.  Ying's suggestion seems
> > like a good fit to remove the last arch-specific dependency.
>
> Instead of mangling this into the PFN offset and letting the arch tell
> you which bits of the PFN offset are unused ... rather remove the bits
> from the offset and define them manually to have a certain meaning.
> That's exactly how pte_swp_mkexclusive()/pte_swp_exclusive() is supposed
> to be handled on architectures that want to support it.
>
> I hope that makes it clearer what the hacky part is IMHO :)
>
> > Page migration works for most archs; I want to have it work for all
> > archs that can easily benefit from it.
>
> Yet we only care about x86-64 IIUC regarding performance, just the way
> the dirty bit is handled?

I don't think we only care about x86_64?  Other archs should have the same
issue as long as there's a hardware young bit.

Even without it, it'll affect the page reclaim logic too, and that's also
not x86 only.

> > Besides, I actually have a question on the anon exclusive bit in the
> > swap pte: since we have that anyway, why do we need a specific
> > migration type for anon exclusive pages?  Can they simply be read
> > migration entries with the anon exclusive bit set?
>
> Not before all archs support pte_swp_mkexclusive()/pte_swp_exclusive().
>
> As pte_swp_mkexclusive()/pte_swp_exclusive() only applies to actual swap
> PTEs, you could even reuse that bit for migration entries and get at
> least the most relevant 64-bit architectures supported easily.

Yes, but I think having two mechanisms for a single problem can confuse
people.

IIUC the swap pte bit is already defined on the major archs anyway, and
since the anon exclusive bit is best-effort (or am I wrong?..), I won't
worry too much about archs outside x86/arm/ppc/s390 losing the anon
exclusive bit during migrations.  Afaict the whole swap type of
ANON_EXCLUSIVE_READ only services that very minority, which seems a pity to
waste a swp type on all archs, even when an arch defines swp pte bits just
for anon exclusive.

Thanks,
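As a minimal, self-contained illustration of the approach Peter describes -
keeping the PFN in the low bits of a migration entry's offset, carrying the
young bit in an otherwise-unused higher bit, and masking it off again with a
swp_offset_pfn()-style accessor - here is a sketch. The field width and macro
names are invented for the example and are not the series' actual
definitions.

#include <stdint.h>
#include <stdio.h>

/* Assumed widths, for illustration only. */
#define SWP_PFN_BITS      40
#define SWP_PFN_MASK      ((1ULL << SWP_PFN_BITS) - 1)
#define SWP_MIG_YOUNG_BIT (1ULL << SWP_PFN_BITS)   /* first spare offset bit */

typedef uint64_t swp_offset_t;

static swp_offset_t make_migration_offset(uint64_t pfn, int young)
{
	swp_offset_t off = pfn & SWP_PFN_MASK;

	if (young)
		off |= SWP_MIG_YOUNG_BIT;
	return off;
}

/* Counterpart of a swp_offset_pfn()-style helper: strip any flag bits. */
static uint64_t offset_to_pfn(swp_offset_t off)
{
	return off & SWP_PFN_MASK;
}

static int offset_is_young(swp_offset_t off)
{
	return !!(off & SWP_MIG_YOUNG_BIT);
}

int main(void)
{
	swp_offset_t off = make_migration_offset(0xabcdef, 1);

	printf("pfn=%#llx young=%d\n",
	       (unsigned long long)offset_to_pfn(off), offset_is_young(off));
	return 0;
}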
> I don't think we only care about x86_64?  Other archs should have the
> same issue as long as there's a hardware young bit.
>
> Even without it, it'll affect the page reclaim logic too, and that's also
> not x86 only.

Okay, reading the cover letter and looking at the code, my understanding
was that x86-64 is the real focus.

>>> Besides, I actually have a question on the anon exclusive bit in the
>>> swap pte: since we have that anyway, why do we need a specific
>>> migration type for anon exclusive pages?  Can they simply be read
>>> migration entries with the anon exclusive bit set?
>>
>> Not before all archs support pte_swp_mkexclusive()/pte_swp_exclusive().
>>
>> As pte_swp_mkexclusive()/pte_swp_exclusive() only applies to actual swap
>> PTEs, you could even reuse that bit for migration entries and get at
>> least the most relevant 64-bit architectures supported easily.
>
> Yes, but I think having two mechanisms for a single problem can confuse
> people.

It would be one bit with two different meanings depending on the swp type.

> IIUC the swap pte bit is already defined on the major archs anyway, and
> since the anon exclusive bit is best-effort (or am I wrong?..), I won't
> worry too much about archs outside x86/arm/ppc/s390 losing the anon
> exclusive bit during migrations.

It kind-of is best effort, but the goal is to have all archs support it.

... just like the young bit here?

> Afaict the whole swap type of ANON_EXCLUSIVE_READ only services that very
> minority ...

I have a big item on my todo list to support all, but I have different
priorities right now.

If there is no free bit, simply steal one from the offset ... which is the
same thing your approach would do, just in a different way, no?

> ... which seems a pity to waste a swp type on all archs, even when an
> arch defines swp pte bits just for anon exclusive.

Why do we care?  We are talking about one type, not one bit.
On Tue, Aug 02, 2022 at 10:23:49PM +0200, David Hildenbrand wrote:
> > I don't think we only care about x86_64?  Other archs should have the
> > same issue as long as there's a hardware young bit.
> >
> > Even without it, it'll affect the page reclaim logic too, and that's
> > also not x86 only.
>
> Okay, reading the cover letter and looking at the code, my understanding
> was that x86-64 is the real focus.
>
> > IIUC the swap pte bit is already defined on the major archs anyway, and
> > since the anon exclusive bit is best-effort (or am I wrong?..), I won't
> > worry too much about archs outside x86/arm/ppc/s390 losing the anon
> > exclusive bit during migrations.
>
> It kind-of is best effort, but the goal is to have all archs support it.
>
> ... just like the young bit here?

Exactly, so I'm also wondering whether we can move the swp pte anon
exclusive bit into the swp entry.  It just sounds weird to have them
defined in two ways.

> > Afaict the whole swap type of ANON_EXCLUSIVE_READ only services that
> > very minority ...
>
> I have a big item on my todo list to support all, but I have different
> priorities right now.
>
> If there is no free bit, simply steal one from the offset ... which is
> the same thing your approach would do, just in a different way, no?
>
> > ... which seems a pity to waste a swp type on all archs, even when an
> > arch defines swp pte bits just for anon exclusive.
>
> Why do we care?  We are talking about one type, not one bit.

The swap type address space is still limited, so I'd say we should save it
when possible.  I believe people who care about swapping care about the
limit on the number of swap devices too.  If the offset can keep it, I
think that's better than the swap type.  De-duplicating either the type or
the swap pte bit would be nicer, imho.
On 02.08.22 22:35, Peter Xu wrote:
> On Tue, Aug 02, 2022 at 10:23:49PM +0200, David Hildenbrand wrote:
>> It kind-of is best effort, but the goal is to have all archs support it.
>>
>> ... just like the young bit here?
>
> Exactly, so I'm also wondering whether we can move the swp pte anon
> exclusive bit into the swp entry.  It just sounds weird to have them
> defined in two ways.

I'd argue it's just the swp vs. non-swp difference: they are in fact two
different concepts (device+offset vs. type+pte), plus some dirty details of
how swp entries are actually used.

With swp entries you have to be very careful; for example, take a look at
radix_to_swp_entry() and swp_to_radix_entry().  That made me refrain from
touching anything inside actual swp entries and instead store it in the
pte.

>> I have a big item on my todo list to support all, but I have different
>> priorities right now.
>>
>> If there is no free bit, simply steal one from the offset ... which is
>> the same thing your approach would do, just in a different way, no?
>>
>> Why do we care?  We are talking about one type, not one bit.
>
> The swap type address space is still limited, so I'd say we should save
> it when possible.  I believe people who care about swapping care about
> the limit on the number of swap devices too.  If the offset can keep it,
> I think that's better than the swap type.

Ehm, last time I did the math I came to the conclusion that nobody cares.
Let me redo the math:

MAX_SWAPFILES = (1 << 5) - 1 - 1 - 4 - 3 - 1 = 22

which is the worst case right now with all kinds of oddity compiled in
(sorry CONFIG_DEVICE_PRIVATE).  So far nobody complained.

> De-duplicating either the type or the swap pte bit would be nicer, imho.

If you manage bits in the pte manually, you might be able to get a better
packing density if bits are scattered around.  Just take a look at the
x86_64 location of _PAGE_SWP_EXCLUSIVE.

What I'm rooting for is something like

  #define pte_nonswp_mkyoung pte_swp_mkexclusive

eventually with some VM_BUG_ONs to make sure people call it on the right
swp ptes, and, if we ever want to get rid of SWP_MIGRATION_READ_EXCLUSIVE
(so people can have 23 swap devices), eventually separate bits for both.
For now it's not necessary.
On Tue, Aug 02, 2022 at 10:59:42PM +0200, David Hildenbrand wrote:
> On 02.08.22 22:35, Peter Xu wrote:
> > Exactly, so I'm also wondering whether we can move the swp pte anon
> > exclusive bit into the swp entry.  It just sounds weird to have them
> > defined in two ways.
>
> I'd argue it's just the swp vs. non-swp difference: they are in fact two
> different concepts (device+offset vs. type+pte), plus some dirty details
> of how swp entries are actually used.
>
> With swp entries you have to be very careful; for example, take a look at
> radix_to_swp_entry() and swp_to_radix_entry().  That made me refrain from
> touching anything inside actual swp entries and instead store it in the
> pte.

I don't really see much risk there - that code neither touches the inside
of the swp entry nor does anything complicated with it (just a shift by 1
and setting bit 0).  Please feel free to share your concern in case I
overlooked something, though.

> > The swap type address space is still limited, so I'd say we should save
> > it when possible.  I believe people who care about swapping care about
> > the limit on the number of swap devices too.  If the offset can keep
> > it, I think that's better than the swap type.
>
> Ehm, last time I did the math I came to the conclusion that nobody cares.
> Let me redo the math:
>
> MAX_SWAPFILES = (1 << 5) - 1 - 1 - 4 - 3 - 1 = 22
>
> which is the worst case right now with all kinds of oddity compiled in
> (sorry CONFIG_DEVICE_PRIVATE).  So far nobody complained.

Yeah.  To me, using one bit of it is fine, especially if that's the best we
can do.  What confuses me here is that we have two ways to represent
"whether this page is anon exclusive" in a swap pte: either we encode it in
the type, or we use the bit.  The tricky part is that whether the swap pte
bit makes sense depends on the swp type in the first place, while the swap
type can itself be "anon exclusive read migration".

> > De-duplicating either the type or the swap pte bit would be nicer,
> > imho.
>
> If you manage bits in the pte manually, you might be able to get a better
> packing density if bits are scattered around.  Just take a look at the
> x86_64 location of _PAGE_SWP_EXCLUSIVE.
>
> What I'm rooting for is something like
>
>   #define pte_nonswp_mkyoung pte_swp_mkexclusive
>
> eventually with some VM_BUG_ONs to make sure people call it on the right
> swp ptes, and, if we ever want to get rid of SWP_MIGRATION_READ_EXCLUSIVE
> (so people can have 23 swap devices), eventually separate bits for both.
> For now it's not necessary.

Okay, but that's probably the last thing I'd like to try - I don't want to
further complicate the anon exclusive bit in the swap pte.  I'd think
cleaning that up somehow would be nice, but as you mentioned, if no one
complains I have no strong opinion either.

Thanks,