Message ID | 20210807032521.7591-1-peterx@redhat.com (mailing list archive) |
---|---|
Headers | show |
Series | mm: Enable PM_SWAP for shmem with PTE_MARKER | expand |
Hi Peter, a couple of comments, sorry for the late reply. > Summary > ======= > > [Based on v5.14-rc4] > > This patchset enables PM_SWAP of pagemap on shmem. IOW userspace will be able > to detect whether a shmem page is swapped out, just like anonymous pages. > > This feature can be enabled with CONFIG_PTE_MARKER_PAGEOUT. When enabled, it > brings 0.8% overhead on swap-in performance on a shmem page, so I didn't make > it the default yet. However IMHO 0.8% is still in an acceptable range that we > can even make it the default at last. Comments are welcomed here. Special config option and added complexity for handling a corner case feature partially more correct. Hm. > > There's one previous series that wanted to address the same issue but in > another way by Tiberiu A Georgescu <tiberiu.georgescu@nutanix.com>, here: > > https://lore.kernel.org/lkml/20210730160826.63785-1-tiberiu.georgescu@nutanix.com/ > > In that series it's done by looking up page cache for all none ptes. However I > raised concern on 4x performance degradation for all shmem pagemap users. Who cares? I am asking because for me, it's hard to imagine a workload that actually cares about a 4x performance degradation when querying the pagemap in very special cases, especially if it involves gigantic shmem ranges. VM live migration -- sync will be a bit slower? CRIU sync will be a bit slower? I mean, actual page dumping is a lot more expensive. Really a problem? I read that CRIU cares about shmem via pagemap [1], at least for anonymous shared memory; not sure how memfd is treated, I assume in a similar way. But I do wonder how it even works reliably, because it relies on present/swapped out and sofrtdirty tracking, which are both essentially broken e.g., when swapping out AFAIKT. Looks like this really needs a proper fix. [1] https://criu.org/Shared_memory > > Unlike the other approach, this series has zero overhead on pagemap read > because the PM_SWAP info is consolidated into the zapped PTEs directly. > > Goals > ===== > > One major goal of this series is to add the PM_SWAP support, the reason is as > stated by Tiberiu and Ivan in the other patchset: > > https://lore.kernel.org/lkml/CY4PR0201MB3460E372956C0E1B8D33F904E9E39@CY4PR0201MB3460.namprd02.prod.outlook.com/ > > As a summary: for some reason the userspace needs to scan the pages in the > background, however that scanning could misguide page reclaim on which page is > hot and which is cold. With correct PM_SWAP information, the userspace can > correct the behavior of page reclaim by firstly fetching that info from > pagemap, and explicit madvise(MADV_PAGEOUT). In this case, the pages are for > the guest, but it can be any shmem page. > > Another major goal of this series is to do a proof-of-concept of the PTE marker > idea, and that's also the major reason why it's RFC. So far PTE marker can > potentially be the solution for below three problems that I'm aware of: > > (a) PM_SWAP on shmem > > (b) Userfaultfd-wp on shmem/hugetlbfs > > (c) PM_SOFT_DIRTY lost for shmem over swapping > > This series tries to resolve problem (a) which should be the simplest, ideally > it should solve immediate problem for the live migration issue raised by > Tiberiu and Ivan on proactive paging out unused guest pages. > > Both (a) and (c) will be for performance-wise or statistic-wise. > > Scenario (b) will require pte markers as part of the function to trap writes to > uffd-wp protected regions when the pages were e.g. swapped out or zapped for > any reason. > > Currently, uffd-wp shmem work (still during review on the list, latest v5, [1]) > used another solution called "special swap pte". It works similarly like PTE > markers as both of the approachs are to persist information into zapped pte, > but people showed concern about that idea and it's suggested to use a safer > (swp-entry level operation, not pte level), and arch-independent approach. > > Hopefully PTE markers satifsfy these demands. > > Before I rework the uffd-wp series, I wanted to know whether this approach can > be accepted upstream. So besides the swap part, comments on PTE markers will > be extremely welcomed. For uffd-wp in its current form, it would certainly be the way to go I think. AFAIKT, the idea of special swap entries isn't new, just that it's limited to anonymous memory for now, which makes things like fork and new mappings a lot cheaper. > > What is PTE Markers? > ==================== > > PTE markers are defined as some special PTEs that works like a "marker" just > like in normal life. Firstly it uses a swap type, IOW it's not a valid/present > pte, so processor will trigger a page fault when it's accessed. Meanwhile, the > format of the PTE is well-defined, so as to contain some information that we > would like to know before/during the page access happening. > > In this specific case, when the shmem page is paged out, we set a marker > showing that this page was paged out, then when pagemap is read about this pte, > we know this is a swapped-out/very-cold page. > > This use case is not an obvious one but the most simplest. The uffd-wp use > case is more obvious (wr-protect is per-pte, so we can't save into page cache; > meanwhile we need that info to persist across zappings e.g. thp split or page > out of shmem pages). > > So in the future, it can contain more information, e.g., whether this pte is > wr-protected by userfaultfd; whether this pte was written in this mm context > for soft-dirtying. On 64 bit systems, we have a total of 58 bits (swp_offset). > > I'm also curious whether it can be further expanded to other mm areas. E.g., > logically it can work too for non-RAM based memories outside shmem/hugetlbfs, > e.g. a common file system like ext4 or btrfs? As long as there will be a need > to store some per-pte information across zapping of the ptes, then maybe it can > be considered. As already expressed, we should try storing as little information in page tables as possible if we're dealing with shared memory. The features we design around this all seem to over-complicate the actual users, over-complicate fork, over-complicate handling on new mappings. For uffd-wp in its current form, there seems to be no way around it, and PTE markers seem to be what you want -- but as I already raised, the feature itself on shmem is somewhat suboptimal, just like SOFTDIRTY tracking on shmem. But I guess I'm biased at this point because the main users of these features actually want to query/set such properties for all sharers, not individual processes; so the opinion of others would be appreciated. > > Known Issues/Concerns > ===================== > > About THP > --------- > > Currently we don't need to worry about THP because paged out shmem pages will > be split when shrinking, IOW we only need to consider PTE, and the markers will > only be applied to a shmem pte not pmd or bigger. > > About PM_SWAP Accuracy > ---------------------- > > This is not an "accurate" solution to provide PM_SWAP bit. Two exmaples: > > - When process A & B both map shmem page P somewhere, it can happen that only > one of these ptes got marked with the pte marker. Imagine below sequence: > > 0. Process A & B both map shmem page P somewhere > 1. Process A zap pte of page P for some reason (e.g. thp split) > 2. System decides to recycle page P > 3. System replace process B's pte (pointed to P) by PTE marker > 4. System _didn't_ replace process A's pte because it was none pte, and > it'll continue to be none pte > 5. Only process B's relevant pte has the PTE marker after P swapped out > > - When fork, we don't copy shmem vma ptes, including the pte markers. So > even if page P was swapped out, only the parent process has the pte marker > installed, in child it'll be none pte if fork() happened after pageout. > > Conclusion: just like it used to be, the PM_SWAP is best-effort. But it should > work in 99.99% cases and it should already start to solve problems. At least I don't like these semantics at all. PM_SWAP is a cached value which might be under-represented and consequently wrong. Take CRIU as an example, it has to be correct even if a process would remap a memory region, fork() and unmap in the parent as far as I understand, ... If we really care about performance for users with the old semantics, introduce some runtime toggle that enables the new behavior (even for a single process?) and consequently is a bit slower in corner cases. But I really do wonder if we care at all about the performance degradation in corner cases. If we really care about performance for users with new semantics, then let's do it properly and see how we can actually speed it up without per-process page table hacks. Anyhow, just my 2 cents.
On Tue, Aug 17, 2021 at 11:04:18AM +0200, David Hildenbrand wrote: > Hi Peter, a couple of comments, sorry for the late reply. > > > Summary > > ======= > > > > [Based on v5.14-rc4] > > > > This patchset enables PM_SWAP of pagemap on shmem. IOW userspace will be able > > to detect whether a shmem page is swapped out, just like anonymous pages. > > > > This feature can be enabled with CONFIG_PTE_MARKER_PAGEOUT. When enabled, it > > brings 0.8% overhead on swap-in performance on a shmem page, so I didn't make > > it the default yet. However IMHO 0.8% is still in an acceptable range that we > > can even make it the default at last. Comments are welcomed here. > > Special config option and added complexity for handling a corner case > feature partially more correct. Hm. > > > > > There's one previous series that wanted to address the same issue but in > > another way by Tiberiu A Georgescu <tiberiu.georgescu@nutanix.com>, here: > > > > https://lore.kernel.org/lkml/20210730160826.63785-1-tiberiu.georgescu@nutanix.com/ > > > > In that series it's done by looking up page cache for all none ptes. However I > > raised concern on 4x performance degradation for all shmem pagemap users. > > Who cares? I am asking because for me, it's hard to imagine a workload that > actually cares about a 4x performance degradation when querying the pagemap > in very special cases, I won't say it's a "special case". it affects any pagemap reading if there's shmem, even if they're not looking for PM_SWAP it'll suffer from trying to look up page cache if the pte is none. Yes it needs to be a relatively large memory to show an effect, but IMHO it's not a reason good enough that we can make it 4x slower. > especially if it involves gigantic shmem ranges. VM > live migration -- sync will be a bit slower? CRIU sync will be a bit slower? > I mean, actual page dumping is a lot more expensive. Really a problem? > > I read that CRIU cares about shmem via pagemap [1], at least for anonymous > shared memory; not sure how memfd is treated, I assume in a similar way. But > I do wonder how it even works reliably, because it relies on present/swapped > out and sofrtdirty tracking, which are both essentially broken e.g., when > swapping out AFAIKT. Looks like this really needs a proper fix. > > [1] https://criu.org/Shared_memory I have not enough knowledge to judge CRIU here, but I can kind of understand Tiberiu's problem and hopefully I digested it right. What I'm targetting with this series is just to see whether it can fix the problem there, assuming that could be a good summary of what it can also be used elsewhere. We have the other solution of the page cache lookup, but I think that's slower than this solution, so when there's the 0.8% overhead possibility I think this one is much better. The bad side is it's going to be added into swap-in path, it's not ideal for fast swap devices, but I have no knowledge on that part too, as to my understanding most swap devices should be still slow like hard disks which can take worst case milli-seconds to read in a sector. I think that overhead should be mostly not measureable and lost in the white noise. > > > > > Unlike the other approach, this series has zero overhead on pagemap read > > because the PM_SWAP info is consolidated into the zapped PTEs directly. > > > > Goals > > ===== > > > > One major goal of this series is to add the PM_SWAP support, the reason is as > > stated by Tiberiu and Ivan in the other patchset: > > > > https://lore.kernel.org/lkml/CY4PR0201MB3460E372956C0E1B8D33F904E9E39@CY4PR0201MB3460.namprd02.prod.outlook.com/ > > > > As a summary: for some reason the userspace needs to scan the pages in the > > background, however that scanning could misguide page reclaim on which page is > > hot and which is cold. With correct PM_SWAP information, the userspace can > > correct the behavior of page reclaim by firstly fetching that info from > > pagemap, and explicit madvise(MADV_PAGEOUT). In this case, the pages are for > > the guest, but it can be any shmem page. > > > > Another major goal of this series is to do a proof-of-concept of the PTE marker > > idea, and that's also the major reason why it's RFC. So far PTE marker can > > potentially be the solution for below three problems that I'm aware of: > > > > (a) PM_SWAP on shmem > > > > (b) Userfaultfd-wp on shmem/hugetlbfs > > > > (c) PM_SOFT_DIRTY lost for shmem over swapping > > > > This series tries to resolve problem (a) which should be the simplest, ideally > > it should solve immediate problem for the live migration issue raised by > > Tiberiu and Ivan on proactive paging out unused guest pages. > > > > Both (a) and (c) will be for performance-wise or statistic-wise. > > > > Scenario (b) will require pte markers as part of the function to trap writes to > > uffd-wp protected regions when the pages were e.g. swapped out or zapped for > > any reason. > > > > Currently, uffd-wp shmem work (still during review on the list, latest v5, [1]) > > used another solution called "special swap pte". It works similarly like PTE > > markers as both of the approachs are to persist information into zapped pte, > > but people showed concern about that idea and it's suggested to use a safer > > (swp-entry level operation, not pte level), and arch-independent approach. > > > > Hopefully PTE markers satifsfy these demands. > > > > Before I rework the uffd-wp series, I wanted to know whether this approach can > > be accepted upstream. So besides the swap part, comments on PTE markers will > > be extremely welcomed. > > For uffd-wp in its current form, it would certainly be the way to go I > think. AFAIKT, the idea of special swap entries isn't new, just that it's > limited to anonymous memory for now, which makes things like fork and new > mappings a lot cheaper. Thanks for reviewing this series separately; yes I definitely wanted to get comments on both sides: one on the pte marker idea, the other is whether it's applicable to this swap+shmem use case. I don't plan to handle both fork and new mappings for swapping. For fork(), as we've discussed, and as I've also mentioned below in Known Issues, that we don't copy these markers in fork(). So children may lose these hints from pte, but I think that's fine if that's hardly used. Here I really want to make the pte marker be flexible - it can be strict when necessary (it will be 100% strict with uffd-wp), then it can also be a hint just like what we have with available ptes on soft-dirty, idle, accessed bits. Here the swap bit I wanted to make it that kind, so we add zero overhead to fork() and we still solve problems. Same thing to "newly mapped shmem". Do we have a use case for that? If that's a hint bit, can we ignore it? > > > > > What is PTE Markers? > > ==================== > > > > PTE markers are defined as some special PTEs that works like a "marker" just > > like in normal life. Firstly it uses a swap type, IOW it's not a valid/present > > pte, so processor will trigger a page fault when it's accessed. Meanwhile, the > > format of the PTE is well-defined, so as to contain some information that we > > would like to know before/during the page access happening. > > > > In this specific case, when the shmem page is paged out, we set a marker > > showing that this page was paged out, then when pagemap is read about this pte, > > we know this is a swapped-out/very-cold page. > > > > This use case is not an obvious one but the most simplest. The uffd-wp use > > case is more obvious (wr-protect is per-pte, so we can't save into page cache; > > meanwhile we need that info to persist across zappings e.g. thp split or page > > out of shmem pages). > > > > So in the future, it can contain more information, e.g., whether this pte is > > wr-protected by userfaultfd; whether this pte was written in this mm context > > for soft-dirtying. On 64 bit systems, we have a total of 58 bits (swp_offset). > > > > I'm also curious whether it can be further expanded to other mm areas. E.g., > > logically it can work too for non-RAM based memories outside shmem/hugetlbfs, > > e.g. a common file system like ext4 or btrfs? As long as there will be a need > > to store some per-pte information across zapping of the ptes, then maybe it can > > be considered. > > As already expressed, we should try storing as little information in page > tables as possible if we're dealing with shared memory. The features we > design around this all seem to over-complicate the actual users, > over-complicate fork, over-complicate handling on new mappings. I'll skip the last two "over-complicated" items, because as we've discussed I don't think we need to take care of them so far. We can revisit when they become some kind of requirement. To me having PM_SWAP 99% right on shmem is still a progress comparing to completely missing it, even if it's not 100% right. It's used for performance reasons on PAGEOUT and doing finer-grained memory control from userspace, it's not a strict requirement. So I still cannot strictly follow why storing information in pte is so bad for file-backed, which I can see you really don't like it. Could you share some detailed example? > > For uffd-wp in its current form, there seems to be no way around it, and PTE > markers seem to be what you want -- but as I already raised, the feature > itself on shmem is somewhat suboptimal, just like SOFTDIRTY tracking on > shmem. Emm I don't know whether we should call it sub-optimal. To me it's still a clean and self-contained way to do things. I don't know how to do that from page cache layer, but I bet there'll be issues coming when we go that way, then we may figure out "oh this way is 90% beautiful, but that way is 85%". To me it's sometimes not easy to figure out the ideal solution for all the problems. > > But I guess I'm biased at this point because the main users of these > features actually want to query/set such properties for all sharers, not > individual processes; so the opinion of others would be appreciated. > > > > > Known Issues/Concerns > > ===================== > > > > About THP > > --------- > > > > Currently we don't need to worry about THP because paged out shmem pages will > > be split when shrinking, IOW we only need to consider PTE, and the markers will > > only be applied to a shmem pte not pmd or bigger. > > > > About PM_SWAP Accuracy > > ---------------------- > > > > This is not an "accurate" solution to provide PM_SWAP bit. Two exmaples: > > > > - When process A & B both map shmem page P somewhere, it can happen that only > > one of these ptes got marked with the pte marker. Imagine below sequence: > > > > 0. Process A & B both map shmem page P somewhere > > 1. Process A zap pte of page P for some reason (e.g. thp split) > > 2. System decides to recycle page P > > 3. System replace process B's pte (pointed to P) by PTE marker > > 4. System _didn't_ replace process A's pte because it was none pte, and > > it'll continue to be none pte > > 5. Only process B's relevant pte has the PTE marker after P swapped out > > > > - When fork, we don't copy shmem vma ptes, including the pte markers. So > > even if page P was swapped out, only the parent process has the pte marker > > installed, in child it'll be none pte if fork() happened after pageout. > > > > Conclusion: just like it used to be, the PM_SWAP is best-effort. But it should > > work in 99.99% cases and it should already start to solve problems. > > At least I don't like these semantics at all. PM_SWAP is a cached value > which might be under-represented and consequently wrong. Please have a look at current pagemap impl in pte_to_pagemap_entry(). It's not accurate from the 1st day, imho. E.g., when a page is being migrated from numa node 1 to node 2, we'll mark it PM_SWAP but I think it's not the case. We can make it more accurate, but I think it's fine, because it's a hint. > Take CRIU as an example, it has to be correct even if a process would remap a > memory region, fork() and unmap in the parent as far as I understand, ... Are you talking about dirty bit or swap bit? I'm a bit confused on why swap bit needs to be accurate. Maybe you mean the dirty bit? > > If we really care about performance for users with the old semantics, > introduce some runtime toggle that enables the new behavior (even for a > single process?) and consequently is a bit slower in corner cases. But I > really do wonder if we care at all about the performance degradation in > corner cases. Yes that's okay. If we want to go with Tiberiu's approach, we can use an option to avoid regress users. However I still prefer the current approach, and I even want to make it a default option if it can prove itself with some more real swap workloads somewhere (maybe kernel bot has some? I didn't check yet). > > If we really care about performance for users with new semantics, then let's > do it properly and see how we can actually speed it up without per-process > page table hacks. > > Anyhow, just my 2 cents. Thanks for the discussion!
>> >> For uffd-wp in its current form, it would certainly be the way to go I >> think. AFAIKT, the idea of special swap entries isn't new, just that it's >> limited to anonymous memory for now, which makes things like fork and new >> mappings a lot cheaper. > > Thanks for reviewing this series separately; yes I definitely wanted to get > comments on both sides: one on the pte marker idea, the other is whether it's > applicable to this swap+shmem use case. Exactly. > > Here I really want to make the pte marker be flexible - it can be strict when > necessary (it will be 100% strict with uffd-wp), then it can also be a hint > just like what we have with available ptes on soft-dirty, idle, accessed bits. > Here the swap bit I wanted to make it that kind, so we add zero overhead to > fork() and we still solve problems. > > Same thing to "newly mapped shmem". Do we have a use case for that? If that's > a hint bit, can we ignore it? I am really not a fan of taking an already broken feature (broken presence information for shmem in pagemap) and instead of fixing it properly, turning it less broken, crossing fingers that nobody will notice the remaining (undocumented) cases. [...] >> As already expressed, we should try storing as little information in page >> tables as possible if we're dealing with shared memory. The features we >> design around this all seem to over-complicate the actual users, >> over-complicate fork, over-complicate handling on new mappings. > > I'll skip the last two "over-complicated" items, because as we've discussed I > don't think we need to take care of them so far. We can revisit when they > become some kind of requirement. > > To me having PM_SWAP 99% right on shmem is still a progress comparing to > completely missing it, even if it's not 100% right. It's used for performance > reasons on PAGEOUT and doing finer-grained memory control from userspace, it's > not a strict requirement. > > So I still cannot strictly follow why storing information in pte is so bad for > file-backed, which I can see you really don't like it. Could you share some > detailed example? I am not a fan of your approach of "hints", because while it might work for some use cases, it might not work for others (see below for CRIU); I would rather like to avoid such "inconsistent caches" where it's not really required. But again, this is just my opinion and there might be plenty other people that will most probably disagree. Storing hints in page tables isn't actually that bad, because we can drop hints whenever we like (well, there are side effects, and once we drop hints too often people might complain again) -- for example, when reclaiming "empty" but actually "filled with hints" page tables. When we rely on consistent values, fork() and mmap() are a problem. Well, and page tables will have to stick around. At least for uffd-wp, mmap() is not an issue, and we don't expect too many uffd-wp users such that page table consumption would matter -- my guess. So I repeat, for uffd-wp in its current form, it sounds like the right thing to do. >> But I guess I'm biased at this point because the main users of these >> features actually want to query/set such properties for all sharers, not >> individual processes; so the opinion of others would be appreciated. >> >>> >>> Known Issues/Concerns >>> ===================== >>> >>> About THP >>> --------- >>> >>> Currently we don't need to worry about THP because paged out shmem pages will >>> be split when shrinking, IOW we only need to consider PTE, and the markers will >>> only be applied to a shmem pte not pmd or bigger. >>> >>> About PM_SWAP Accuracy >>> ---------------------- >>> >>> This is not an "accurate" solution to provide PM_SWAP bit. Two exmaples: >>> >>> - When process A & B both map shmem page P somewhere, it can happen that only >>> one of these ptes got marked with the pte marker. Imagine below sequence: >>> >>> 0. Process A & B both map shmem page P somewhere >>> 1. Process A zap pte of page P for some reason (e.g. thp split) >>> 2. System decides to recycle page P >>> 3. System replace process B's pte (pointed to P) by PTE marker >>> 4. System _didn't_ replace process A's pte because it was none pte, and >>> it'll continue to be none pte >>> 5. Only process B's relevant pte has the PTE marker after P swapped out >>> >>> - When fork, we don't copy shmem vma ptes, including the pte markers. So >>> even if page P was swapped out, only the parent process has the pte marker >>> installed, in child it'll be none pte if fork() happened after pageout. >>> >>> Conclusion: just like it used to be, the PM_SWAP is best-effort. But it should >>> work in 99.99% cases and it should already start to solve problems. >> >> At least I don't like these semantics at all. PM_SWAP is a cached value >> which might be under-represented and consequently wrong. > > Please have a look at current pagemap impl in pte_to_pagemap_entry(). It's not > accurate from the 1st day, imho. E.g., when a page is being migrated from numa > node 1 to node 2, we'll mark it PM_SWAP but I think it's not the case. We can > make it more accurate, but I think it's fine, because it's a hint. That inconsistency doesn't really matter as you can determine if something is present and worth dumping if it's either swapped or present. As long as it's one of both but not simply nothing. I will shamelessly reference tools/testing/selftests/vm/madv_populate.c:pagemap_is_populated() that checks exactly for that (the test case uses only private anonymous memory). > >> Take CRIU as an example, it has to be correct even if a process would remap a >> memory region, fork() and unmap in the parent as far as I understand, ... > > Are you talking about dirty bit or swap bit? I'm a bit confused on why swap > bit needs to be accurate. Maybe you mean the dirty bit? https://criu.org/Shared_memory "Dumping present pages" "... CRIU does not dump all of the data. Instead, it determines which pages contain it, and only dumps those pages. This is done similarly to how regular memory dumping and restoring works, i.e. by looking for PRESENT or SWAPPED bits in owners' pagemap entries." -> Neither PRESENT nor SWAPPED results in memory not getting dumped, which makes perfect sense. 1) Process A sets up shared memory and writes data to it. 2) System swaps out memory, hints are setup. 3) Process A forks Process B, hints are not copied. 4) Process A unmaps shared memory, hints are dropped. 5) CRIU migrates process A and B and migrates only PRESENT or SWAPPED in pagemap. 6) Process B uses memory in shared memory region. Pages were not migrated. Just one example; feel free to correct me. There is notion of the mincore() systemcall: "There is one particular feature of shared memory dumps worth mentioning. Sometimes, a shared memory page can exist in the kernel, but it is not mapped to any process. CRIU detects such pages by calling mincore() on the shmem segment, which reports back the page in-memory status. The mincore bitmap is when ANDed with the per-process ones. " Not sure if they actually mean ORed, because otherwise they'd be losing pages that have been swapped out. "mincore() returns a vector that indicates whether pages of the calling process's virtual memory are resident in core (RAM)"
On Tue, Aug 17, 2021 at 08:46:45PM +0200, David Hildenbrand wrote: > > Please have a look at current pagemap impl in pte_to_pagemap_entry(). It's not > > accurate from the 1st day, imho. E.g., when a page is being migrated from numa > > node 1 to node 2, we'll mark it PM_SWAP but I think it's not the case. We can > > make it more accurate, but I think it's fine, because it's a hint. > > That inconsistency doesn't really matter as you can determine if something > is present and worth dumping if it's either swapped or present. As long as > it's one of both but not simply nothing. > > I will shamelessly reference > tools/testing/selftests/vm/madv_populate.c:pagemap_is_populated() that > checks exactly for that (the test case uses only private anonymous memory). Then I think the MADV_POPULATE_READ|WRITE test cases shouldn't depend on PM_SWAP for that when it goes beyond anonymous private memories - when shmem swapped out the pte can be none, then the test case can fail even if it shouldn't, imho. The mincore() syscall seems to be ideally the thing you may want to make it accurate, but again it's not a problem for current anonymous private memories. > > > > > > Take CRIU as an example, it has to be correct even if a process would remap a > > > memory region, fork() and unmap in the parent as far as I understand, ... > > > > Are you talking about dirty bit or swap bit? I'm a bit confused on why swap > > bit needs to be accurate. Maybe you mean the dirty bit? > > https://criu.org/Shared_memory > > "Dumping present pages" > > "... CRIU does not dump all of the data. Instead, it determines which pages > contain it, and only dumps those pages. This is done similarly to how > regular memory dumping and restoring works, i.e. by looking for PRESENT or > SWAPPED bits in owners' pagemap entries." > > -> Neither PRESENT nor SWAPPED results in memory not getting dumped, which > makes perfect sense. > > 1) Process A sets up shared memory and writes data to it. > 2) System swaps out memory, hints are setup. > 3) Process A forks Process B, hints are not copied. > 4) Process A unmaps shared memory, hints are dropped. > 5) CRIU migrates process A and B and migrates only PRESENT or SWAPPED in > pagemap. > 6) Process B uses memory in shared memory region. Pages were not migrated. > > Just one example; feel free to correct me. I think pte marker won't crash criu, what will happen is that it'll see more ptes that used to be none that become the pte markers. This reminded me that maybe I should teach up mincore() syscall to also be aware of the pte marker at least, and all non_swap_entry() callers. > > > There is notion of the mincore() systemcall: > > "There is one particular feature of shared memory dumps worth mentioning. > Sometimes, a shared memory page can exist in the kernel, but it is not > mapped to any process. CRIU detects such pages by calling mincore() on the > shmem segment, which reports back the page in-memory status. The mincore > bitmap is when ANDed with the per-process ones. " > > Not sure if they actually mean ORed, because otherwise they'd be losing > pages that have been swapped out. "mincore() returns a vector that indicates > whether pages of the calling process's virtual memory are resident in core > (RAM)" I am wildly guessing they ORed the two just because PM_SWAP is not working properly for shmem, so the OR happens only for shmem. Criu may not only rely on mincore() because they also want the dirty bits. Btw, I noticed in 2016 criu switched from mincore() to lseek(): https://github.com/checkpoint-restore/criu/commit/1821acedd04b602b37b587eac5a481094b6274ae Criu should want to know "whether this page has valid data" not "whether this page has swapped out", so lseek() seems to be more suitable, which I'm not aware of before. I'm now wondering whether for Tiberiu's case mincore() can also be used. It should just still be a bit slow because it'll look up the cache too, but it should work similarly like the original proposal. Thanks,
On 17.08.21 22:24, Peter Xu wrote: > On Tue, Aug 17, 2021 at 08:46:45PM +0200, David Hildenbrand wrote: >>> Please have a look at current pagemap impl in pte_to_pagemap_entry(). It's not >>> accurate from the 1st day, imho. E.g., when a page is being migrated from numa >>> node 1 to node 2, we'll mark it PM_SWAP but I think it's not the case. We can >>> make it more accurate, but I think it's fine, because it's a hint. >> >> That inconsistency doesn't really matter as you can determine if something >> is present and worth dumping if it's either swapped or present. As long as >> it's one of both but not simply nothing. >> >> I will shamelessly reference >> tools/testing/selftests/vm/madv_populate.c:pagemap_is_populated() that >> checks exactly for that (the test case uses only private anonymous memory). > > Then I think the MADV_POPULATE_READ|WRITE test cases shouldn't depend on > PM_SWAP for that when it goes beyond anonymous private memories - when shmem > swapped out the pte can be none, then the test case can fail even if it > shouldn't, imho. Exactly, because the pagemap is fairly completely broken for shmem. > > The mincore() syscall seems to be ideally the thing you may want to make it > accurate, but again it's not a problem for current anonymous private memories. I haven't checked the details, but I believe the mincore() syscall won't report swapped out pages. At least according to its documentation: "mincore() returns a vector that indicates whether pages of the calling process's virtual memory are resident in core (RAM), and so will not cause a disk access (page fault) if referenced." (to protect it from swapping and relying on mincore() we would have to mlock that memory; we'd want MCL_ONFAULT to be able to test MADV_POPULATE_READ|WRITE; or we'd just want to rely on lseek) > >> >>> >>>> Take CRIU as an example, it has to be correct even if a process would remap a >>>> memory region, fork() and unmap in the parent as far as I understand, ... >>> >>> Are you talking about dirty bit or swap bit? I'm a bit confused on why swap >>> bit needs to be accurate. Maybe you mean the dirty bit? >> >> https://criu.org/Shared_memory >> >> "Dumping present pages" >> >> "... CRIU does not dump all of the data. Instead, it determines which pages >> contain it, and only dumps those pages. This is done similarly to how >> regular memory dumping and restoring works, i.e. by looking for PRESENT or >> SWAPPED bits in owners' pagemap entries." >> >> -> Neither PRESENT nor SWAPPED results in memory not getting dumped, which >> makes perfect sense. >> >> 1) Process A sets up shared memory and writes data to it. >> 2) System swaps out memory, hints are setup. >> 3) Process A forks Process B, hints are not copied. >> 4) Process A unmaps shared memory, hints are dropped. >> 5) CRIU migrates process A and B and migrates only PRESENT or SWAPPED in >> pagemap. >> 6) Process B uses memory in shared memory region. Pages were not migrated. >> >> Just one example; feel free to correct me. > > I think pte marker won't crash criu, what will happen is that it'll see more > ptes that used to be none that become the pte markers. This reminded me that > maybe I should teach up mincore() syscall to also be aware of the pte marker at > least, and all non_swap_entry() callers. > I haven't checked what mincore() is doing, but from what I understand when reading the CRIU doc and the mincore() doc, it does the right thing without requiring any fiddling with pte marker hints. I assume you merely have a performance improvement in mind. >> >> >> There is notion of the mincore() systemcall: >> >> "There is one particular feature of shared memory dumps worth mentioning. >> Sometimes, a shared memory page can exist in the kernel, but it is not >> mapped to any process. CRIU detects such pages by calling mincore() on the >> shmem segment, which reports back the page in-memory status. The mincore >> bitmap is when ANDed with the per-process ones. " >> >> Not sure if they actually mean ORed, because otherwise they'd be losing >> pages that have been swapped out. "mincore() returns a vector that indicates >> whether pages of the calling process's virtual memory are resident in core >> (RAM)" > > I am wildly guessing they ORed the two just because PM_SWAP is not working > properly for shmem, so the OR happens only for shmem. Criu may not only rely > on mincore() because they also want the dirty bits. > > Btw, I noticed in 2016 criu switched from mincore() to lseek(): > > https://github.com/checkpoint-restore/criu/commit/1821acedd04b602b37b587eac5a481094b6274ae Interesting. That's certainly what we want when it comes to skipping holes in files. (before reading that, I wasn't even aware that mincore() existed) > > Criu should want to know "whether this page has valid data" not "whether this > page has swapped out", so lseek() seems to be more suitable, which I'm not > aware of before. Again, just as you, I learned a lot :) > > I'm now wondering whether for Tiberiu's case mincore() can also be used. It > should just still be a bit slow because it'll look up the cache too, but it > should work similarly like the original proposal. > Very right, maybe we can just avoid tampering with pagemap on shmem completely (which sounds like an excellent idea to me) and document it as "On shared memory, we will never indicate SWAPPED if the pages have been swapped out. Further, PRESENT might be under-indicated: if a shared page is currently not mapped into the page table of a process.". I saw there was a related, proposed doc update, maybe we can finetune that.
> On 18 Aug 2021, at 09:24, David Hildenbrand <david@redhat.com> wrote: > > On 17.08.21 22:24, Peter Xu wrote: >> On Tue, Aug 17, 2021 at 08:46:45PM +0200, David Hildenbrand wrote: >>>> Please have a look at current pagemap impl in pte_to_pagemap_entry(). It's not >>>> accurate from the 1st day, imho. E.g., when a page is being migrated from numa >>>> node 1 to node 2, we'll mark it PM_SWAP but I think it's not the case. We can >>>> make it more accurate, but I think it's fine, because it's a hint. >>> >>> That inconsistency doesn't really matter as you can determine if something >>> is present and worth dumping if it's either swapped or present. As long as >>> it's one of both but not simply nothing. >>> >>> I will shamelessly reference >>> tools/testing/selftests/vm/madv_populate.c:pagemap_is_populated() that >>> checks exactly for that (the test case uses only private anonymous memory). >> Then I think the MADV_POPULATE_READ|WRITE test cases shouldn't depend on >> PM_SWAP for that when it goes beyond anonymous private memories - when shmem >> swapped out the pte can be none, then the test case can fail even if it >> shouldn't, imho. > > Exactly, because the pagemap is fairly completely broken for shmem. > >> The mincore() syscall seems to be ideally the thing you may want to make it >> accurate, but again it's not a problem for current anonymous private memories. > > I haven't checked the details, but I believe the mincore() syscall won't report swapped out pages. At least according to its documentation: > > "mincore() returns a vector that indicates whether pages of the calling process's virtual memory are resident in core (RAM), and so will not cause a disk access (page fault) if referenced." > > (to protect it from swapping and relying on mincore() we would have to mlock that memory; we'd want MCL_ONFAULT to be able to test MADV_POPULATE_READ|WRITE; or we'd just want to rely on lseek) After some digging and testing, I found out that the docs for mincore() are a little outdated, and that the RFC has an unexpected side effect on the sys call. The output vector is supposed to set the flag to 1 if the page it indicates was present in either the page cache or the swap cache. I would like to highlight that if a page got swapped out and reached swap (i.e. page content no longer stored in the swap cache), the flag gets set to 0. So indeed, mincore does not report swapped out pages, but AFAIK, it does reports pages which are still in the swap cache. There was an attempt to rework mincore altogether and make it retrieve mappings instead [1], but was quickly reverted [2] because the removed functionality was necessary for some long functioning systems. On Peter's RFC, it now looks like mincore()'s flags are set to 1 for any shared page that has been dirtied, whether it is still present, in swap cache or it arrived in swap. AFAIK, only none pages have their flags set to zero. For private pages, mincore still seems to behave normally. [1] https://github.com/torvalds/linux/commit/574823bfab82d9d8fa47f422778043fbb4b4f50e [2] https://github.com/torvalds/linux/commit/30bac164aca750892b93eef350439a0562a68647 > >>> >>>> >>>>> Take CRIU as an example, it has to be correct even if a process would remap a >>>>> memory region, fork() and unmap in the parent as far as I understand, ... >>>> >>>> Are you talking about dirty bit or swap bit? I'm a bit confused on why swap >>>> bit needs to be accurate. Maybe you mean the dirty bit? >>> >>> https://urldefense.proofpoint.com/v2/url?u=https-3A__criu.org_Shared-5Fmemory&d=DwIDaQ&c=s883GpUCOChKOHiocYtGcg&r=rRM5dtWOv0DNo5dDxZ2U16jl4WAw6ql5szOKa9cu_RA&m=A5H_4nfdW_jAPHckF-cuCBfRHsm2aij-cr-mELX0uww&s=DZgiYWJgLokyZkBYd5VKOnr5Fbj63aAc01Fu2BPE8Lc&e= >>> "Dumping present pages" >>> >>> "... CRIU does not dump all of the data. Instead, it determines which pages >>> contain it, and only dumps those pages. This is done similarly to how >>> regular memory dumping and restoring works, i.e. by looking for PRESENT or >>> SWAPPED bits in owners' pagemap entries." >>> >>> -> Neither PRESENT nor SWAPPED results in memory not getting dumped, which >>> makes perfect sense. >>> >>> 1) Process A sets up shared memory and writes data to it. >>> 2) System swaps out memory, hints are setup. >>> 3) Process A forks Process B, hints are not copied. >>> 4) Process A unmaps shared memory, hints are dropped. >>> 5) CRIU migrates process A and B and migrates only PRESENT or SWAPPED in >>> pagemap. >>> 6) Process B uses memory in shared memory region. Pages were not migrated. >>> >>> Just one example; feel free to correct me. >> I think pte marker won't crash criu, what will happen is that it'll see more >> ptes that used to be none that become the pte markers. This reminded me that >> maybe I should teach up mincore() syscall to also be aware of the pte marker at >> least, and all non_swap_entry() callers. I think in mincore_pte_range, the call to non_swap_entry(entry) could return true, setting the flag on the vector to 1 prematurely. Please read my comment above. > > I haven't checked what mincore() is doing, but from what I understand when reading the CRIU doc and the mincore() doc, it does the right thing without requiring any fiddling with pte marker hints. I assume you merely have a performance improvement in mind. > >>> >>> >>> There is notion of the mincore() systemcall: >>> >>> "There is one particular feature of shared memory dumps worth mentioning. >>> Sometimes, a shared memory page can exist in the kernel, but it is not >>> mapped to any process. CRIU detects such pages by calling mincore() on the >>> shmem segment, which reports back the page in-memory status. The mincore >>> bitmap is when ANDed with the per-process ones. " >>> >>> Not sure if they actually mean ORed, because otherwise they'd be losing >>> pages that have been swapped out. "mincore() returns a vector that indicates >>> whether pages of the calling process's virtual memory are resident in core >>> (RAM)" >> I am wildly guessing they ORed the two just because PM_SWAP is not working >> properly for shmem, so the OR happens only for shmem. Criu may not only rely >> on mincore() because they also want the dirty bits. >> Btw, I noticed in 2016 criu switched from mincore() to lseek(): >> https://urldefense.proofpoint.com/v2/url?u=https-3A__github.com_checkpoint-2Drestore_criu_commit_1821acedd04b602b37b587eac5a481094b6274ae&d=DwIDaQ&c=s883GpUCOChKOHiocYtGcg&r=rRM5dtWOv0DNo5dDxZ2U16jl4WAw6ql5szOKa9cu_RA&m=A5H_4nfdW_jAPHckF-cuCBfRHsm2aij-cr-mELX0uww&s=kel85AR7AGZnvBymbM7QEwc770HGO2koka-kTiF-r5U&e= > > Interesting. That's certainly what we want when it comes to skipping holes in files. (before reading that, I wasn't even aware that mincore() existed) > >> Criu should want to know "whether this page has valid data" not "whether this >> page has swapped out", so lseek() seems to be more suitable, which I'm not >> aware of before. > > Again, just as you, I learned a lot :) Same :) > >> I'm now wondering whether for Tiberiu's case mincore() can also be used. It >> should just still be a bit slow because it'll look up the cache too, but it >> should work similarly like the original proposal. I am afraid that the information returned by mincore is a little too vague to be of better help, compared to what the pagemap should provide in theory. I will have a look to see whether lseek on proc/map_files works as a "PM_SWAP" equivalent. However, the swap offset would still be missing. > > Very right, maybe we can just avoid tampering with pagemap on shmem completely (which sounds like an excellent idea to me) and document it as "On shared memory, we will never indicate SWAPPED if the pages have been swapped out. Further, PRESENT might be under-indicated: if a shared page is currently not mapped into the page table of a process.". I saw there was a related, proposed doc update, maybe we can finetune that. > We could take into consideration an alternative approach to retrieving the shared page info in user space, like storing it in sys/fs instead of per process. However, just leaving the pagemap functionality incomplete, and not providing an alternative to retrieve the missing information, does not seem right. Updating the docs with a "can't do" should be temporary, until an alternative or fix. Also, I think you are talking about my own doc update patch[3]. If not, please share a link with your next reply. [3] https://marc.info/?m=162878395426774 > > -- > Thanks, > > David / dhildenb -- Kind regards, Tibi Georgescu
>> >>> I'm now wondering whether for Tiberiu's case mincore() can also be used. It >>> should just still be a bit slow because it'll look up the cache too, but it >>> should work similarly like the original proposal. > > I am afraid that the information returned by mincore is a little too vague to be of better help, compared to what the pagemap should provide in theory. I will have a look to see whether lseek on > proc/map_files works as a "PM_SWAP" equivalent. However, the swap offset would still be missing. Well, with mincore() you could at least decide "page is present" vs. "page is swapped or not existent". At least for making pageout decisions it shouldn't really matter, no? madvise(MADV_PAGEOUT) on a hole is a nop. But I'm not 100% sure what exactly your use case is here and what you would really need, so you know best :) >> >> Very right, maybe we can just avoid tampering with pagemap on shmem completely (which sounds like an excellent idea to me) and document it as "On shared memory, we will never indicate SWAPPED if the pages have been swapped out. Further, PRESENT might be under-indicated: if a shared page is currently not mapped into the page table of a process.". I saw there was a related, proposed doc update, maybe we can finetune that. >> > We could take into consideration an alternative approach to retrieving the shared page info in user > space, like storing it in sys/fs instead of per process. However, just leaving the pagemap functionality > incomplete, and not providing an alternative to retrieve the missing information, does not seem right. Updating the docs with a "can't do" should be temporary, until an alternative or fix. > As I stated before, making pagemap less broken is not a good idea IMHO. Either make it really correct or just leave it all broken -- and document that e.g., other interfaces (lseek) shall be used. It sounds like they exist and are good enough for CRUI. And TBH, if other interfaces already exist and get the job done, I'm more than happy that we can avoid mixing more shmem stuff into pagemap and trying to compensate performance problems by introducing inconsistency. If it has an fd and we can punch that into syscalls, we should much rather use that fd to lookup stuff then going via process page tables -- if possible of course (to be evaluated, because I haven't looked into the CRIU details and how they use lseek with anonymous shared memory). > Also, I think you are talking about my own doc update patch[3]. If not, please share a link with your > next reply. > > [3] https://marc.info/?m=162878395426774 No, that's it.
> On 18 Aug 2021, at 19:13, David Hildenbrand <david@redhat.com> wrote: > >>> >>>> I'm now wondering whether for Tiberiu's case mincore() can also be used. It >>>> should just still be a bit slow because it'll look up the cache too, but it >>>> should work similarly like the original proposal. >> I am afraid that the information returned by mincore is a little too vague to be of better help, compared to what the pagemap should provide in theory. I will have a look to see whether lseek on >> proc/map_files works as a "PM_SWAP" equivalent. However, the swap offset would still be missing. > > Well, with mincore() you could at least decide "page is present" vs. "page is swapped or not existent". At least for making pageout decisions it shouldn't really matter, no? madvise(MADV_PAGEOUT) on a hole is a nop. I think you are right. In the optimisation we first presented, we should be able to send the madvise(MADV_PAGEOUT) call even if the page is none quite safely and get the wanted behaviour. Also, the "is_present" or "is_swap_or_none" question can be answered by the current pagemap too. Nice catch. However, not all use cases are the same. AFAIK, there is still no way to figure out whether a shared page is swapped out or none unless it is directly read/accessed after a pagemap check. Bringing a page into memory to check if it previously was in swap does not seem ideal. Also, we still have no mechanism to retrieve the swap offsets of shmem pages AFAIK. There is one more QEMU optimisation we are working on that requires these mappings available outside of kernel space. > > But I'm not 100% sure what exactly your use case is here and what you would really need, so you know best :) Maybe, but I am always open to (m)advise :) > >>> >>> Very right, maybe we can just avoid tampering with pagemap on shmem completely (which sounds like an excellent idea to me) and document it as "On shared memory, we will never indicate SWAPPED if the pages have been swapped out. Further, PRESENT might be under-indicated: if a shared page is currently not mapped into the page table of a process.". I saw there was a related, proposed doc update, maybe we can finetune that. >>> >> We could take into consideration an alternative approach to retrieving the shared page info in user >> space, like storing it in sys/fs instead of per process. However, just leaving the pagemap functionality >> incomplete, and not providing an alternative to retrieve the missing information, does not seem right. Updating the docs with a "can't do" should be temporary, until an alternative or fix. > > As I stated before, making pagemap less broken is not a good idea IMHO. Either make it really correct or just leave it all broken -- and document that e.g., other interfaces (lseek) shall be used. It sounds like they exist and are good enough for CRUI. > > And TBH, if other interfaces already exist and get the job done, I'm more than happy that we can avoid mixing more shmem stuff into pagemap and trying to compensate performance problems by introducing inconsistency. > > If it has an fd and we can punch that into syscalls, we should much rather use that fd to lookup stuff then going via process page tables -- if possible of course (to be evaluated, because I haven't looked into the CRIU details and how they use lseek with anonymous shared memory). I found out that it is possible to retrieve the fds of shmem/tmpfs file allocations using proc/pid/map_files, which is neat. Still, CRIU does not seem to care whether a page is swapped out or just empty, only if it is present on page cache. The holes that lseek finds would not be able to infer this difference, AFAIK. Will test the behaviour to make sure. > >> Also, I think you are talking about my own doc update patch[3]. If not, please share a link with your >> next reply. >> [3] https://urldefense.proofpoint.com/v2/url?u=https-3A__marc.info_-3Fm-3D162878395426774&d=DwIDaQ&c=s883GpUCOChKOHiocYtGcg&r=rRM5dtWOv0DNo5dDxZ2U16jl4WAw6ql5szOKa9cu_RA&m=T9yjhM3vhL_Ip2wxg2x-BZclbbY3WAO5Oc-y7IqNs7Y&s=HujDmGVIi1iXQ22oWF_GE-sPxvQ2ORTcCWEfnpqq35o&e= > > No, that's it. > Great, I'll head right there. -- Kind regards, Tibi
On 19.08.21 16:54, Tiberiu Georgescu wrote: > >> On 18 Aug 2021, at 19:13, David Hildenbrand <david@redhat.com> wrote: >> >>>> >>>>> I'm now wondering whether for Tiberiu's case mincore() can also be used. It >>>>> should just still be a bit slow because it'll look up the cache too, but it >>>>> should work similarly like the original proposal. >>> I am afraid that the information returned by mincore is a little too vague to be of better help, compared to what the pagemap should provide in theory. I will have a look to see whether lseek on >>> proc/map_files works as a "PM_SWAP" equivalent. However, the swap offset would still be missing. >> >> Well, with mincore() you could at least decide "page is present" vs. "page is swapped or not existent". At least for making pageout decisions it shouldn't really matter, no? madvise(MADV_PAGEOUT) on a hole is a nop. > > I think you are right. In the optimisation we first presented, we should be able to > send the madvise(MADV_PAGEOUT) call even if the page is none quite safely > and get the wanted behaviour. Also, the "is_present" or "is_swap_or_none" > question can be answered by the current pagemap too. Nice catch. > > However, not all use cases are the same. AFAIK, there is still no way to figure > out whether a shared page is swapped out or none unless it is directly > read/accessed after a pagemap check. Bringing a page into memory to check > if it previously was in swap does not seem ideal. Well, you can lseek() to remove all the holes and use mincore() to remove all in-memory pages. You're left with the swapped ones. Not the most efficient interface maybe, but there is a way :) > > Also, we still have no mechanism to retrieve the swap offsets of shmem pages > AFAIK. There is one more QEMU optimisation we are working on that requires > these mappings available outside of kernel space. How exactly would the swap offset really help? IMHO that's a kernel internal that shouldn't be of any value to user space -- it's merely for debugging purposes. But I'd love to learn details. [...] >> If it has an fd and we can punch that into syscalls, we should much rather use that fd to lookup stuff then going via process page tables -- if possible of course (to be evaluated, because I haven't looked into the CRIU details and how they use lseek with anonymous shared memory). > > I found out that it is possible to retrieve the fds of shmem/tmpfs file allocations > using proc/pid/map_files, which is neat. Still, CRIU does not seem to care > whether a page is swapped out or just empty, only if it is present on page cache. > The holes that lseek finds would not be able to infer this difference, AFAIK. Will > test the behaviour to make sure. CRIU wants to migrate everything. lseek() gives you the definitive answer what needs migration -- if it's swapped out or resident. Just skip the holes.
> On 19 Aug 2021, at 18:26, David Hildenbrand <david@redhat.com> wrote: > > On 19.08.21 16:54, Tiberiu Georgescu wrote: >>> On 18 Aug 2021, at 19:13, David Hildenbrand <david@redhat.com> wrote: >>> >>>>> >>>>>> I'm now wondering whether for Tiberiu's case mincore() can also be used. It >>>>>> should just still be a bit slow because it'll look up the cache too, but it >>>>>> should work similarly like the original proposal. >>>> I am afraid that the information returned by mincore is a little too vague to be of better help, compared to what the pagemap should provide in theory. I will have a look to see whether lseek on >>>> proc/map_files works as a "PM_SWAP" equivalent. However, the swap offset would still be missing. >>> >>> Well, with mincore() you could at least decide "page is present" vs. "page is swapped or not existent". At least for making pageout decisions it shouldn't really matter, no? madvise(MADV_PAGEOUT) on a hole is a nop. >> I think you are right. In the optimisation we first presented, we should be able to >> send the madvise(MADV_PAGEOUT) call even if the page is none quite safely >> and get the wanted behaviour. Also, the "is_present" or "is_swap_or_none" >> question can be answered by the current pagemap too. Nice catch. >> However, not all use cases are the same. AFAIK, there is still no way to figure >> out whether a shared page is swapped out or none unless it is directly >> read/accessed after a pagemap check. Bringing a page into memory to check >> if it previously was in swap does not seem ideal. > > Well, you can lseek() to remove all the holes and use mincore() to remove all in-memory pages. You're left with the swapped ones. Not the most efficient interface maybe, but there is a way :) Ok, that could work. Still, I have a couple of concerns. Firstly, I am worried lseek with the SEEK_HOLE flag would page in pages from swap, so using it would be a direct factor on its own output. If people are working on Live Migration, this would not be ideal. I am not 100% sure this is how lseek works, so please feel free to contradict me, but I think it would swap in some of the pages that it seeks through, if not all, to figure out when to stop. Unless it leverages the page cache somehow, or an internal bitmap. Secondly, mincore() could return some "false positives" for this particular use case. That is because it returns flag=1 for pages which are still in the swap cache, so the output becomes ambiguous. I am not saying this is not something that would ever be needed. Some people could actually be looking for exactly this scenario, and lseeking during the check could be an advantage. Just that it does not look very flexible. That is why the pagemap would have been ideal for us. Alternatively, to get all logically swapped out pages, the lseek with pagemap should do the trick. As you said, we remove holes with lseek, but we remove in-memory pages with is_present(pte) instead. This solution would still suffer from my first concern, but it should do the job. > >> Also, we still have no mechanism to retrieve the swap offsets of shmem pages >> AFAIK. There is one more QEMU optimisation we are working on that requires >> these mappings available outside of kernel space. > > How exactly would the swap offset really help? IMHO that's a kernel internal that shouldn't be of any value to user space -- it's merely for debugging purposes. But I'd love to learn details. It is possible for the swap device to be network attached and shared, so multiple hosts would need to understand its content. Then it is no longer internal to one kernel only. By being swap-aware, we can skip swapped-out pages during migration (to prevent IO and potential thrashing), and transfer those pages in another way that is zero-copy. > > [...] > >>> If it has an fd and we can punch that into syscalls, we should much rather use that fd to lookup stuff then going via process page tables -- if possible of course (to be evaluated, because I haven't looked into the CRIU details and how they use lseek with anonymous shared memory). >> I found out that it is possible to retrieve the fds of shmem/tmpfs file allocations >> using proc/pid/map_files, which is neat. Still, CRIU does not seem to care >> whether a page is swapped out or just empty, only if it is present on page cache. >> The holes that lseek finds would not be able to infer this difference, AFAIK. Will >> test the behaviour to make sure. > > CRIU wants to migrate everything. lseek() gives you the definitive answer what needs migration -- if it's swapped out or resident. Just skip the holes. Thank you for the summary. I see why the use case is sufficient for CRIU then. In our case, the optimisations aim to make the migration on QEMU swap aware. -- Kind regards, Tibi Georgescu
Hello, Tiberiu, On Fri, Aug 20, 2021 at 04:49:58PM +0000, Tiberiu Georgescu wrote: > Firstly, I am worried lseek with the SEEK_HOLE flag would page in pages from > swap, so using it would be a direct factor on its own output. If people are working > on Live Migration, this would not be ideal. I am not 100% sure this is how lseek > works, so please feel free to contradict me, but I think it would swap in some > of the pages that it seeks through, if not all, to figure out when to stop. Unless it > leverages the page cache somehow, or an internal bitmap. It shouldn't. Man page is clear on that: SEEK_DATA Adjust the file offset to the next location in the file greater than or equal to offset containing data. If offset points to data, then the file offset is set to offset. Again, I think your requirement is different from CRIU, so I think mincore() is the right thing for you. > > Secondly, mincore() could return some "false positives" for this particular use > case. That is because it returns flag=1 for pages which are still in the swap > cache, so the output becomes ambiguous. I don't think so; mincore() should return flag=0 if it's either in swap cache or even got dropped from it. I think its name/doc also shows that in the fact that "as long as it's not in RAM, the flag is cleared". That's why I think that should indeed be what you're looking for, if swp entry can be ignored. More below on that. Note that my series is as you mentioned missing the changes to support mincore() (otherwise I'll know the existance of it!). It'll be trivial to add that, but let's see whether mincore() will satisfy your need. [...] > It is possible for the swap device to be network attached and shared, so multiple > hosts would need to understand its content. Then it is no longer internal to one > kernel only. > > By being swap-aware, we can skip swapped-out pages during migration (to prevent IO and potential thrashing), and transfer those pages in another way that > is zero-copy. That sounds reasonable, but I'm not aware of any user-API that exposes swap entries to userspace, or is there one? I.e., how do you know which swap device is which? How do you guarantee the kernel swp entry information won't change along with time? Thanks,
Hello Peter, sorry for the late reply, > On Fri, Aug 20, 2021 at 04:49:58PM +0000, Tiberiu Georgescu wrote: >> Firstly, I am worried lseek with the SEEK_HOLE flag would page in pages from >> swap, so using it would be a direct factor on its own output. If people are working >> on Live Migration, this would not be ideal. I am not 100% sure this is how lseek >> works, so please feel free to contradict me, but I think it would swap in some >> of the pages that it seeks through, if not all, to figure out when to stop. Unless it >> leverages the page cache somehow, or an internal bitmap. > > It shouldn't. Man page is clear on that: > > SEEK_DATA > Adjust the file offset to the next location in the file greater > than or equal to offset containing data. If offset points to > data, then the file offset is set to offset. Ok, I got to test it out and you are right. lseek does not swap in pages. That is great news. > > Again, I think your requirement is different from CRIU, so I think mincore() is > the right thing for you. > >> >> Secondly, mincore() could return some "false positives" for this particular use >> case. That is because it returns flag=1 for pages which are still in the swap >> cache, so the output becomes ambiguous. > > I don't think so; mincore() should return flag=0 if it's either in swap cache > or even got dropped from it. I think its name/doc also shows that in the fact > that "as long as it's not in RAM, the flag is cleared". That's why I think > that should indeed be what you're looking for, if swp entry can be ignored. > More below on that. By saying there are "false positives", I do not mean that the mincore() would not work as expected, only that its definition is a little more subtle than that. And that it does not suit our needs entirely by itself. I tested mincore() compared to the pagemap, and I discovered that there are more flags set to 1 (not necessarily contiguous) compared to the pages pagemap was reporting as present. By also looking through the code, I could only conclude that pages in the swap cache were considered "still in RAM", so were set to 1 as well. When looking into what the swap cache does, it makes sense. We could use mincore() and pagemap to find the pages in the swap cache. In short, mincore() is not enough because it does not differentiate between present pages and swap-cache entries, as both are in RAM, but the latter is also in swap. It can be used with other tools to get more specific information though, so it is useful. > > Note that my series is as you mentioned missing the changes to support > mincore() (otherwise I'll know the existance of it!). It'll be trivial to add > that, but let's see whether mincore() will satisfy your need. We are currently trying to make use of all tools that we have learned of so far during our discussions (lseek, map_files, even mincore) to get the information that we need about swap pages. In theory, for many of our use cases, a combination of 2 or 3 should be enough. It is a little more convoluted than a simple pagemap call, but it can be more versatile (using lseek to skip multiple unallocated pages). As to whether the swap bit (and more) should be eventually added on the pagemap, maybe this topic makes more sense to continue on the Documentation thread. > > [...] > >> It is possible for the swap device to be network attached and shared, so multiple >> hosts would need to understand its content. Then it is no longer internal to one >> kernel only. >> >> By being swap-aware, we can skip swapped-out pages during migration (to prevent IO and potential thrashing), and transfer those pages in another way that >> is zero-copy. > > That sounds reasonable, but I'm not aware of any user-API that exposes swap > entries to userspace, or is there one? Good question. AFAIK, the swap device can be retrieved by using the swap type, which is part of the swap entry. During our discussions, I was always assuming that, if the pagemap entry kept track of the swap offset, it would keep track of the swap type and, conversely, the swap device as well. Sorry if I haven't made this assumption clear until now. So we were relying on the pagemap to expose swap entry information. Seeing it works for private pages, we thought it made sense to have worked on shared pages as well. > > I.e., how do you know which swap device is which? How do you guarantee the > kernel swp entry information won't change along with time? I don't think we can guarantee it unless we halt the guest. But QEMU does most migration work in pre-copy using a best-effort approach anyway. So, having a way to retrieve temporary, but accurate information about swap entries (i.e. post-patch pagemap) should be enough to guarantee a smoother migration process. It is intended to be repeated, unless there is no change between iterations. -- Kind regards, Tibi
On Wed, Aug 25, 2021 at 01:40:13PM +0000, Tiberiu Georgescu wrote: > Hello Peter, sorry for the late reply, Hi, Tiberiu, No worries on that. > > > On Fri, Aug 20, 2021 at 04:49:58PM +0000, Tiberiu Georgescu wrote: > >> Firstly, I am worried lseek with the SEEK_HOLE flag would page in pages from > >> swap, so using it would be a direct factor on its own output. If people are working > >> on Live Migration, this would not be ideal. I am not 100% sure this is how lseek > >> works, so please feel free to contradict me, but I think it would swap in some > >> of the pages that it seeks through, if not all, to figure out when to stop. Unless it > >> leverages the page cache somehow, or an internal bitmap. > > > > It shouldn't. Man page is clear on that: > > > > SEEK_DATA > > Adjust the file offset to the next location in the file greater > > than or equal to offset containing data. If offset points to > > data, then the file offset is set to offset. > > Ok, I got to test it out and you are right. lseek does not swap in pages. That is > great news. > > > > Again, I think your requirement is different from CRIU, so I think mincore() is > > the right thing for you. > > > >> > >> Secondly, mincore() could return some "false positives" for this particular use > >> case. That is because it returns flag=1 for pages which are still in the swap > >> cache, so the output becomes ambiguous. > > > > I don't think so; mincore() should return flag=0 if it's either in swap cache > > or even got dropped from it. I think its name/doc also shows that in the fact > > that "as long as it's not in RAM, the flag is cleared". That's why I think > > that should indeed be what you're looking for, if swp entry can be ignored. > > More below on that. > > By saying there are "false positives", I do not mean that the mincore() would > not work as expected, only that its definition is a little more subtle than that. And > that it does not suit our needs entirely by itself. > > I tested mincore() compared to the pagemap, and I discovered that there are > more flags set to 1 (not necessarily contiguous) compared to the pages pagemap > was reporting as present. By also looking through the code, I could only conclude > that pages in the swap cache were considered "still in RAM", so were set to 1 as > well. When looking into what the swap cache does, it makes sense. Please see mincore_page(): /* * When tmpfs swaps out a page from a file, any process mapping that * file will not get a swp_entry_t in its pte, but rather it is like * any other file mapping (ie. marked !present and faulted in with * tmpfs's .fault). So swapped out tmpfs mappings are tested here. */ page = find_get_incore_page(mapping, index); if (page) { present = PageUptodate(page); put_page(page); } I think the "testing" means when swapped out, the page will be NULL. If it's just the pte got zapped, the page will be returned. The call stack should look like: find_get_incore_page -> find_get_page -> pagecache_get_page(fgp_flags==0). I think the fgp_flags guaranteed it, with no FGP_ENTRY. Did you test mincore() without my patch (as my current patch will indeed cause more 1's returned than it should)? My guess is there's something else that made your test read more 1's with mincore() than pagemap, but I have no solid idea on that. > > We could use mincore() and pagemap to find the pages in the swap cache. > > In short, mincore() is not enough because it does not differentiate between > present pages and swap-cache entries, as both are in RAM, but the latter > is also in swap. It can be used with other tools to get more specific information > though, so it is useful. > > > > Note that my series is as you mentioned missing the changes to support > > mincore() (otherwise I'll know the existance of it!). It'll be trivial to add > > that, but let's see whether mincore() will satisfy your need. > > We are currently trying to make use of all tools that we have learned of so far > during our discussions (lseek, map_files, even mincore) to get the information > that we need about swap pages. In theory, for many of our use cases, a > combination of 2 or 3 should be enough. > > It is a little more convoluted than a simple pagemap call, but it can be more > versatile (using lseek to skip multiple unallocated pages). As to whether the swap > bit (and more) should be eventually added on the pagemap, maybe this topic > makes more sense to continue on the Documentation thread. > > > > [...] > > > >> It is possible for the swap device to be network attached and shared, so multiple > >> hosts would need to understand its content. Then it is no longer internal to one > >> kernel only. > >> > >> By being swap-aware, we can skip swapped-out pages during migration (to prevent IO and potential thrashing), and transfer those pages in another way that > >> is zero-copy. > > > > That sounds reasonable, but I'm not aware of any user-API that exposes swap > > entries to userspace, or is there one? > > Good question. AFAIK, the swap device can be retrieved by using the swap type, > which is part of the swap entry. During our discussions, I was always assuming > that, if the pagemap entry kept track of the swap offset, it would keep track of the > swap type and, conversely, the swap device as well. Sorry if I haven't made this > assumption clear until now. > > So we were relying on the pagemap to expose swap entry information. Seeing it > works for private pages, we thought it made sense to have worked on shared pages as well. > > > > I.e., how do you know which swap device is which? How do you guarantee the > > kernel swp entry information won't change along with time? > > I don't think we can guarantee it unless we halt the guest. Yes, halting the guest helps, though then performance start to matter because all time consumed in either pagemap or mincore() will be counted in as downtime of the VM live migration, and it's not "live" at all during this period. I'm not sure how it was done with private mappings before, because I thought that's a pre-requisite knowledge to decide whether we should migrate a page or not, but I might have missed something. We can stop vm, sample, start vm, but it could become hiccups in the guest too, or otherwise contribute to downtime when src/dst vm switches. > But QEMU does most > migration work in pre-copy using a best-effort approach anyway. > > So, having a way to retrieve temporary, but accurate information about swap > entries (i.e. post-patch pagemap) should be enough to guarantee a smoother > migration process. It is intended to be repeated, unless there is no change > between iterations. The kernel will allocate swap device index which will be assigned as swp_type, right? If there're multiple swap devices, how do you know which swp_entry is located on which device? I wanted to look for that info in "swapon -s" but I didn't. Or maybe that solution only works if there's only one swap device? Besides, I also still have a question on the accuracy. If there's no formal way for userspace to interact with the kernel, I'm wondering how to guarantee the page will be kept swapped out, and data located on the swap device will always be the latest? Because IMHO the kernel can swap in pages as wish, even if it's not accessed from the userspace. After all, all these operations should be transparent to userspace. One example in my mind is we do have page fault-around enabled for shmem, so that even if the VM is stopped, its pages can be faulted in if (unluckily, in this case, though) some page near the swapped out page got faulted in - it could be some qemu malloc()ed region that (again, unluckily) was allocated to have a virtual address very close to the mmap()ed guest memories. I am not sure whether it's a real problem, e.g., even if some page swapped in during guest halted for some reason, if no one is writting to that page, at least the data on the page will still be identical to the one located on the swap device. However I think that still sounds too tricky, and maybe fragile. Thanks,
On 25 Aug 2021, at 15:59, Peter Xu <peterx@redhat.com<mailto:peterx@redhat.com>> wrote: On Wed, Aug 25, 2021 at 01:40:13PM +0000, Tiberiu Georgescu wrote: Hello Peter, sorry for the late reply, Hi, Tiberiu, No worries on that. On Fri, Aug 20, 2021 at 04:49:58PM +0000, Tiberiu Georgescu wrote: Firstly, I am worried lseek with the SEEK_HOLE flag would page in pages from swap, so using it would be a direct factor on its own output. If people are working on Live Migration, this would not be ideal. I am not 100% sure this is how lseek works, so please feel free to contradict me, but I think it would swap in some of the pages that it seeks through, if not all, to figure out when to stop. Unless it leverages the page cache somehow, or an internal bitmap. It shouldn't. Man page is clear on that: SEEK_DATA Adjust the file offset to the next location in the file greater than or equal to offset containing data. If offset points to data, then the file offset is set to offset. Ok, I got to test it out and you are right. lseek does not swap in pages. That is great news. Again, I think your requirement is different from CRIU, so I think mincore() is the right thing for you. Secondly, mincore() could return some "false positives" for this particular use case. That is because it returns flag=1 for pages which are still in the swap cache, so the output becomes ambiguous. I don't think so; mincore() should return flag=0 if it's either in swap cache or even got dropped from it. I think its name/doc also shows that in the fact that "as long as it's not in RAM, the flag is cleared". That's why I think that should indeed be what you're looking for, if swp entry can be ignored. More below on that. By saying there are "false positives", I do not mean that the mincore() would not work as expected, only that its definition is a little more subtle than that. And that it does not suit our needs entirely by itself. I tested mincore() compared to the pagemap, and I discovered that there are more flags set to 1 (not necessarily contiguous) compared to the pages pagemap was reporting as present. By also looking through the code, I could only conclude that pages in the swap cache were considered "still in RAM", so were set to 1 as well. When looking into what the swap cache does, it makes sense. Please see mincore_page(): /* * When tmpfs swaps out a page from a file, any process mapping that * file will not get a swp_entry_t in its pte, but rather it is like * any other file mapping (ie. marked !present and faulted in with * tmpfs's .fault). So swapped out tmpfs mappings are tested here. */ page = find_get_incore_page(mapping, index); if (page) { present = PageUptodate(page); put_page(page); } I think the "testing" means when swapped out, the page will be NULL. If it's just the pte got zapped, the page will be returned. The call stack should look like: find_get_incore_page -> find_get_page -> pagecache_get_page(fgp_flags==0). I think the fgp_flags guaranteed it, with no FGP_ENTRY. Did you test mincore() without my patch (as my current patch will indeed cause more 1's returned than it should)? My guess is there's something else that made your test read more 1's with mincore() than pagemap, but I have no solid idea on that. I made sure to avoid any of our patches while testing mincore and counted the flags to make sure. There are more set than they are present. I will look into the code again and test on a non-rc kernel version, just to be safe. I still think it makes sense for mincore to consider pages in the swap cache to be "in RAM". I start seeing how useful it can be to differentiate between present pages and in-swap-cache pages. We could use mincore() and pagemap to find the pages in the swap cache. In short, mincore() is not enough because it does not differentiate between present pages and swap-cache entries, as both are in RAM, but the latter is also in swap. It can be used with other tools to get more specific information though, so it is useful. Note that my series is as you mentioned missing the changes to support mincore() (otherwise I'll know the existance of it!). It'll be trivial to add that, but let's see whether mincore() will satisfy your need. We are currently trying to make use of all tools that we have learned of so far during our discussions (lseek, map_files, even mincore) to get the information that we need about swap pages. In theory, for many of our use cases, a combination of 2 or 3 should be enough. It is a little more convoluted than a simple pagemap call, but it can be more versatile (using lseek to skip multiple unallocated pages). As to whether the swap bit (and more) should be eventually added on the pagemap, maybe this topic makes more sense to continue on the Documentation thread. [...] It is possible for the swap device to be network attached and shared, so multiple hosts would need to understand its content. Then it is no longer internal to one kernel only. By being swap-aware, we can skip swapped-out pages during migration (to prevent IO and potential thrashing), and transfer those pages in another way that is zero-copy. That sounds reasonable, but I'm not aware of any user-API that exposes swap entries to userspace, or is there one? Good question. AFAIK, the swap device can be retrieved by using the swap type, which is part of the swap entry. During our discussions, I was always assuming that, if the pagemap entry kept track of the swap offset, it would keep track of the swap type and, conversely, the swap device as well. Sorry if I haven't made this assumption clear until now. So we were relying on the pagemap to expose swap entry information. Seeing it works for private pages, we thought it made sense to have worked on shared pages as well. I.e., how do you know which swap device is which? How do you guarantee the kernel swp entry information won't change along with time? I don't think we can guarantee it unless we halt the guest. Yes, halting the guest helps, though then performance start to matter because all time consumed in either pagemap or mincore() will be counted in as downtime of the VM live migration, and it's not "live" at all during this period. That's true. That is why this "halting" is supposed to only happen once, and its duration needs to be minimised by pre-copy. Live migration AFAIK is intended to have a single imperceptible pause at some point in order to converge more quickly and cleanly. Without this halt, the whole migration procedure could run indefinitely, or for days instead of hours/minutes. I'm not sure how it was done with private mappings before, because I thought that's a pre-requisite knowledge to decide whether we should migrate a page or not, but I might have missed something. We can stop vm, sample, start vm, but it could become hiccups in the guest too, or otherwise contribute to downtime when src/dst vm switches. I see your point. Safe to say the vm should never be stopped during pre-copy. See my comment above. But QEMU does most migration work in pre-copy using a best-effort approach anyway. So, having a way to retrieve temporary, but accurate information about swap entries (i.e. post-patch pagemap) should be enough to guarantee a smoother migration process. It is intended to be repeated, unless there is no change between iterations. The kernel will allocate swap device index which will be assigned as swp_type, right? If there're multiple swap devices, how do you know which swp_entry is located on which device? I wanted to look for that info in "swapon -s" but I didn't. Or maybe that solution only works if there's only one swap device? Besides, I also still have a question on the accuracy. If there's no formal way for userspace to interact with the kernel, I'm wondering how to guarantee the page will be kept swapped out, and data located on the swap device will always be the latest? Because IMHO the kernel can swap in pages as wish, even if it's not accessed from the userspace. After all, all these operations should be transparent to userspace. One example in my mind is we do have page fault-around enabled for shmem, so that even if the VM is stopped, its pages can be faulted in if (unluckily, in this case, though) some page near the swapped out page got faulted in - it could be some qemu malloc()ed region that (again, unluckily) was allocated to have a virtual address very close to the mmap()ed guest memories. I am not sure whether it's a real problem, e.g., even if some page swapped in during guest halted for some reason, if no one is writting to that page, at least the data on the page will still be identical to the one located on the swap device. However I think that still sounds too tricky, and maybe fragile. Great example, and good point. Thank you for raising that up. This looks like a very real problem that we need to take into consideration in our prototypes. -- Kind regards, Tibi