Message ID: 20240903232241.43995-1-anthony.yznaga@oracle.com (mailing list archive)
Series: Add support for shared PTEs across processes
We were just chatting about this on David Rientjes's MM alignment call. I thought I'd try to give a little brain dump.

Let's start by thinking about KVM and secondary MMUs. KVM has a primary mm: the QEMU (or whatever) process mm. The virtualization (EPT/NPT) tables get entries that effectively mirror the primary mm page tables and constitute a secondary MMU. If the primary page tables change, mmu_notifiers ensure that the changes get reflected into the virtualization tables and also that the virtualization paging structure caches are flushed.

msharefs is doing something very similar. But, in the msharefs case, the secondary MMUs are actually normal CPU MMUs. The page tables are normal old page tables and the caches are the normal old TLB. That's what makes it so confusing: we have lots of infrastructure for dealing with that "stuff" (CPU page tables and TLB), but msharefs has short-circuited the infrastructure and it doesn't work any more.

Basically, I think it makes a lot of sense to check what KVM (or another mmu_notifier user) is doing and make sure that msharefs is following its lead. For instance, KVM _should_ have the exact same "page free" flushing issue where it gets the MMU notifier call but the page may still be in the secondary MMU. I _think_ KVM fixes it with an extra page refcount that it takes when it first walks the primary page tables.

But the short of it is that the msharefs host mm represents a "secondary MMU". I don't think it is really that special of an MMU other than the fact that it has an mm_struct.
On 10/2/24 10:35 AM, Dave Hansen wrote: > We were just chatting about this on David Rientjes's MM alignment call. > I thought I'd try to give a little brain > > Let's start by thinking about KVM and secondary MMUs. KVM has a primary > mm: the QEMU (or whatever) process mm. The virtualization (EPT/NPT) > tables get entries that effectively mirror the primary mm page tables > and constitute a secondary MMU. If the primary page tables change, > mmu_notifiers ensure that the changes get reflected into the > virtualization tables and also that the virtualization paging structure > caches are flushed. > > msharefs is doing something very similar. But, in the msharefs case, > the secondary MMUs are actually normal CPU MMUs. The page tables are > normal old page tables and the caches are the normal old TLB. That's > what makes it so confusing: we have lots of infrastructure for dealing > with that "stuff" (CPU page tables and TLB), but msharefs has > short-circuited the infrastructure and it doesn't work any more. > > Basically, I think it makes a lot of sense to check what KVM (or another > mmu_notifier user) is doing and make sure that msharefs is following its > lead. For instance, KVM _should_ have the exact same "page free" > flushing issue where it gets the MMU notifier call but the page may > still be in the secondary MMU. I _think_ KVM fixes it with an extra > page refcount that it takes when it first walks the primary page tables. > > But the short of it is that the msharefs host mm represents a "secondary > MMU". I don't think it is really that special of an MMU other than the > fact that it has an mm_struct. Thanks, Dave. This is helpful. I'll look at what other mmu notifier users are doing. This does align with the comments in mmu_notifier.h regarding invalidate_range_start/end. Anthony
About TLB flushing...

The quick and dirty thing to do is just flush_tlb_all() after you remove the PTE from the host mm. That will surely work everywhere and it's as dirt simple as you get. Honestly, it might even be cheaper than the alternative.

Also, I don't think PCIDs actually complicate the problem at all. We basically do remote mm TLB flushes using two mechanisms:

 1. If the mm is loaded, use INVLPG and friends to zap the TLB
 2. Bump mm->context.tlb_gen so that the next time it _gets_ loaded, the TLB is flushed.

flush_tlb_func() really only cares about #1 since if the mm isn't loaded, it'll get flushed anyway at the next context switch.

The alternatives I can think of:

Make flush_tlb_mm_range(host_mm) work somehow. You'd need to somehow keep mm_cpumask(host_mm) up to date and also do something to flush_tlb_func() to tell it that 'loaded_mm' isn't relevant and it should flush regardless.

The other way is to use the msharefs inode's ->i_mmap to find all the VMAs mapping the file, and find all *their* mm's:

	for each vma in inode->i_mmap
		mm = vma->vm_mm
		flush_tlb_mm_range(<vma range here>)

But that might be even worse than flush_tlb_all() because it might end up sending more than one IPI per CPU.

You can fix _that_ by keeping a single cpumask that you build up:

	mask = 0
	for each vma in inode->i_mmap
		mm = vma->vm_mm
		mask |= mm_cpumask(mm)

	flush_tlb_multi(mask, info);

Unfortunately, 'info->mm' needs to be more than one mm, so you probably still need a new flush_tlb_func() flush type to tell it to ignore 'info->mm' and flush anyway.

After all that, I kinda like flush_tlb_all(). ;)
On 10/2/24 4:11 PM, Dave Hansen wrote:
> About TLB flushing...
>
> The quick and dirty thing to do is just flush_tlb_all() after you remove
> the PTE from the host mm. That will surely work everywhere and it's as
> dirt simple as you get. Honestly, it might even be cheaper than the
> alternative.

I think this is a good place to start from. My concern is that unnecessary flushes will potentially impact unrelated loads. Performance testing as things progress can help determine if a more involved approach is needed.

> Also, I don't think PCIDs actually complicate the problem at all. We
> basically do remote mm TLB flushes using two mechanisms:
>
> 1. If the mm is loaded, use INVLPG and friends to zap the TLB
> 2. Bump mm->context.tlb_gen so that the next time it _gets_
>    loaded, the TLB is flushed.
>
> flush_tlb_func() really only cares about #1 since if the mm isn't
> loaded, it'll get flushed anyway at the next context switch.
>
> The alternatives I can think of:
>
> Make flush_tlb_mm_range(host_mm) work somehow. You'd need to somehow
> keep mm_cpumask(host_mm) up to date and also do something to
> flush_tlb_func() to tell it that 'loaded_mm' isn't relevant and it
> should flush regardless.
>
> The other way is to use the msharefs inode's ->i_mmap to find all the
> VMAs mapping the file, and find all *their* mm's:
>
>	for each vma in inode->i_mmap
>		mm = vma->vm_mm
>		flush_tlb_mm_range(<vma range here>)
>
> But that might be even worse than flush_tlb_all() because it might end
> up sending more than one IPI per CPU.
>
> You can fix _that_ by keeping a single cpumask that you build up:
>
>	mask = 0
>	for each vma in inode->i_mmap
>		mm = vma->vm_mm
>		mask |= mm_cpumask(mm)
>
>	flush_tlb_multi(mask, info);
>
> Unfortunately, 'info->mm' needs to be more than one mm, so you probably
> still need a new flush_tlb_func() flush type to tell it to ignore
> 'info->mm' and flush anyway.
What about something like arch_tlbbatch_flush(), which sets info->mm to NULL and uses a batch cpumask? That seems perfect, though there would be a bit more work needed to ensure things work on other architectures.

> After all that, I kinda like flush_tlb_all(). ;)

flush_tlb_all() it is then, for now. :-)
On 02.10.24 19:35, Dave Hansen wrote:
> We were just chatting about this on David Rientjes's MM alignment call.

Unfortunately I was not able to attend this time, my body decided it's a good idea to stay in bed for a couple of days.

> I thought I'd try to give a little brain
>
> Let's start by thinking about KVM and secondary MMUs. KVM has a primary
> mm: the QEMU (or whatever) process mm. The virtualization (EPT/NPT)
> tables get entries that effectively mirror the primary mm page tables
> and constitute a secondary MMU. If the primary page tables change,
> mmu_notifiers ensure that the changes get reflected into the
> virtualization tables and also that the virtualization paging structure
> caches are flushed.
>
> msharefs is doing something very similar. But, in the msharefs case,
> the secondary MMUs are actually normal CPU MMUs. The page tables are
> normal old page tables and the caches are the normal old TLB. That's
> what makes it so confusing: we have lots of infrastructure for dealing
> with that "stuff" (CPU page tables and TLB), but msharefs has
> short-circuited the infrastructure and it doesn't work any more.

It's quite different IMHO, to a degree that I believe they are different beasts:

Secondary MMUs:
* "Belongs" to same MM context and the primary MMU (process page tables)
* Maintains separate tables/PTEs, in completely separate page table hierarchy
* Notifiers make sure the secondary structure stays in sync (update PTEs, flush TLB)

mshare:
* Possibly mapped by many different MMs. Likely nothing stops us from having one MM map multiple different mshare fds.
* Updating the PTEs directly affects all other MM page table structures (and possibly any secondary MMUs! scary)

I better not think about the complexity of secondary MMUs + mshare (e.g., KVM with mshare in guest memory): MMU notifiers for all MMs must be called ...
> > Basically, I think it makes a lot of sense to check what KVM (or another > mmu_notifier user) is doing and make sure that msharefs is following its > lead. For instance, KVM _should_ have the exact same "page free" > flushing issue where it gets the MMU notifier call but the page may > still be in the secondary MMU. I _think_ KVM fixes it with an extra > page refcount that it takes when it first walks the primary page tables. > > But the short of it is that the msharefs host mm represents a "secondary > MMU". I don't think it is really that special of an MMU other than the > fact that it has an mm_struct. Not sure I agree ... IMHO these are two orthogonal things. Unless we want MMU notifiers to "update" MM primary MMUs (there is not really anything to update ...), but not sure if that is what we are looking for. What you note about TLB flushing in the other mail makes sense, not sure how this interacts with any secondary MMUs ....
On 02.10.24 19:35, Dave Hansen wrote:
> We were just chatting about this on David Rientjes's MM alignment call.
> I thought I'd try to give a little brain
>
> Let's start by thinking about KVM and secondary MMUs. KVM has a primary
> mm: the QEMU (or whatever) process mm. The virtualization (EPT/NPT)
> tables get entries that effectively mirror the primary mm page tables
> and constitute a secondary MMU. If the primary page tables change,
> mmu_notifiers ensure that the changes get reflected into the
> virtualization tables and also that the virtualization paging structure
> caches are flushed.
>
> msharefs is doing something very similar. But, in the msharefs case,
> the secondary MMUs are actually normal CPU MMUs. The page tables are
> normal old page tables and the caches are the normal old TLB. That's
> what makes it so confusing: we have lots of infrastructure for dealing
> with that "stuff" (CPU page tables and TLB), but msharefs has
> short-circuited the infrastructure and it doesn't work any more.
>
> Basically, I think it makes a lot of sense to check what KVM (or another
> mmu_notifier user) is doing and make sure that msharefs is following its
> lead. For instance, KVM _should_ have the exact same "page free"
> flushing issue where it gets the MMU notifier call but the page may
> still be in the secondary MMU. I _think_ KVM fixes it with an extra
> page refcount that it takes when it first walks the primary page tables.

Forgot to comment on this (brain still recovering ...).

KVM only grabs a temporary reference, and drops that reference once the secondary MMU PTE was updated (present PTE installed). The notifiers on primary MMU changes (e.g., unmap) take care of any TLB invalidation before the primary MMU lets go of the page and the refcount is dropped.
On Tue, Sep 03, 2024 at 04:22:31PM -0700, Anthony Yznaga wrote:
> This patch series implements a mechanism that allows userspace
> processes to opt into sharing PTEs. It adds a new in-memory
> filesystem - msharefs. A file created on msharefs represents a
> shared region where all processes mapping that region will map
> objects within it with shared PTEs. When the file is created,
> a new host mm struct is created to hold the shared page tables
> and vmas for objects later mapped into the shared region. This
> host mm struct is associated with the file and not with a task.

A taskless mm_struct can be problematic. Like, we don't have access to its counters because it is not represented in /proc. For instance, there's no way to check its smaps.

Also, I *think* it is immune to the oom-killer because the oom-killer looks for a victim task, not an mm. I hope it is not an intended feature :P

> When a process mmap's the shared region, a vm flag VM_SHARED_PT
> is added to the vma. On page fault the vma is checked for the
> presence of the VM_SHARED_PT flag.

I think it is the wrong approach.

Instead of spraying VM_SHARED_PT checks across core-mm, we need to add generic hooks that can be used by mshare and hugetlb. And remove the is_vm_hugetlb_page() check from core-mm along the way.

BTW, the is_vm_hugetlb_page() callsites seem to be the indicator to check if mshare has to do something differently there. I feel you miss a lot of such cases.
On 10/7/24 01:44, David Hildenbrand wrote: > On 02.10.24 19:35, Dave Hansen wrote: >> We were just chatting about this on David Rientjes's MM alignment call. > > Unfortunately I was not able to attend this time, my body decided it's a > good idea to stay in bed for a couple of days. > >> I thought I'd try to give a little brain >> >> Let's start by thinking about KVM and secondary MMUs. KVM has a primary >> mm: the QEMU (or whatever) process mm. The virtualization (EPT/NPT) >> tables get entries that effectively mirror the primary mm page tables >> and constitute a secondary MMU. If the primary page tables change, >> mmu_notifiers ensure that the changes get reflected into the >> virtualization tables and also that the virtualization paging structure >> caches are flushed. >> >> msharefs is doing something very similar. But, in the msharefs case, >> the secondary MMUs are actually normal CPU MMUs. The page tables are >> normal old page tables and the caches are the normal old TLB. That's >> what makes it so confusing: we have lots of infrastructure for dealing >> with that "stuff" (CPU page tables and TLB), but msharefs has >> short-circuited the infrastructure and it doesn't work any more. > > It's quite different IMHO, to a degree that I believe they are different > beasts: > > Secondary MMUs: > * "Belongs" to same MM context and the primary MMU (process page tables) I think you're speaking to the ratio here. For each secondary MMU, I think you're saying that there's one and only one mm_struct. Is that right? > * Maintains separate tables/PTEs, in completely separate page table > hierarchy This is the case for KVM and the VMX/SVM MMUs, but it's not generally true about hardware. IOMMUs can walk x86 page tables and populate the IOTLB from the _same_ page table hierarchy as the CPU.
On 07.10.24 17:58, Dave Hansen wrote: > On 10/7/24 01:44, David Hildenbrand wrote: >> On 02.10.24 19:35, Dave Hansen wrote: >>> We were just chatting about this on David Rientjes's MM alignment call. >> >> Unfortunately I was not able to attend this time, my body decided it's a >> good idea to stay in bed for a couple of days. >> >>> I thought I'd try to give a little brain >>> >>> Let's start by thinking about KVM and secondary MMUs. KVM has a primary >>> mm: the QEMU (or whatever) process mm. The virtualization (EPT/NPT) >>> tables get entries that effectively mirror the primary mm page tables >>> and constitute a secondary MMU. If the primary page tables change, >>> mmu_notifiers ensure that the changes get reflected into the >>> virtualization tables and also that the virtualization paging structure >>> caches are flushed. >>> >>> msharefs is doing something very similar. But, in the msharefs case, >>> the secondary MMUs are actually normal CPU MMUs. The page tables are >>> normal old page tables and the caches are the normal old TLB. That's >>> what makes it so confusing: we have lots of infrastructure for dealing >>> with that "stuff" (CPU page tables and TLB), but msharefs has >>> short-circuited the infrastructure and it doesn't work any more. >> >> It's quite different IMHO, to a degree that I believe they are different >> beasts: >> >> Secondary MMUs: >> * "Belongs" to same MM context and the primary MMU (process page tables) > > I think you're speaking to the ratio here. For each secondary MMU, I > think you're saying that there's one and only one mm_struct. Is that right? Yes, that is my understanding (at least with KVM). It's a secondary MMU derived from exactly one primary MMU (MM context -> page table hierarchy). 
The sophisticated ( :) ) notifier mechanism when updating the primary MMU will result in keeping the secondary MMU in sync (of course, what to sync, and how to, depends in KVM on the memory slots that define how the guest physical memory layout is derived from the process virtual address space).

> > * Maintains separate tables/PTEs, in completely separate page table
> > hierarchy
>
> This is the case for KVM and the VMX/SVM MMUs, but it's not generally
> true about hardware. IOMMUs can walk x86 page tables and populate the
> IOTLB from the _same_ page table hierarchy as the CPU.

Yes, of course.
On Mon, Oct 07, 2024, David Hildenbrand wrote: > On 07.10.24 17:58, Dave Hansen wrote: > > On 10/7/24 01:44, David Hildenbrand wrote: > > > On 02.10.24 19:35, Dave Hansen wrote: > > > > We were just chatting about this on David Rientjes's MM alignment call. > > > > > > Unfortunately I was not able to attend this time, my body decided it's a > > > good idea to stay in bed for a couple of days. > > > > > > > I thought I'd try to give a little brain > > > > > > > > Let's start by thinking about KVM and secondary MMUs. KVM has a primary > > > > mm: the QEMU (or whatever) process mm. The virtualization (EPT/NPT) > > > > tables get entries that effectively mirror the primary mm page tables > > > > and constitute a secondary MMU. If the primary page tables change, > > > > mmu_notifiers ensure that the changes get reflected into the > > > > virtualization tables and also that the virtualization paging structure > > > > caches are flushed. > > > > > > > > msharefs is doing something very similar. But, in the msharefs case, > > > > the secondary MMUs are actually normal CPU MMUs. The page tables are > > > > normal old page tables and the caches are the normal old TLB. That's > > > > what makes it so confusing: we have lots of infrastructure for dealing > > > > with that "stuff" (CPU page tables and TLB), but msharefs has > > > > short-circuited the infrastructure and it doesn't work any more. > > > > > > It's quite different IMHO, to a degree that I believe they are different > > > beasts: > > > > > > Secondary MMUs: > > > * "Belongs" to same MM context and the primary MMU (process page tables) > > > > I think you're speaking to the ratio here. For each secondary MMU, I > > think you're saying that there's one and only one mm_struct. Is that right? > > Yes, that is my understanding (at least with KVM). It's a secondary MMU > derived from exactly one primary MMU (MM context -> page table hierarchy). I don't think the ratio is what's important. 
I think the important takeaway is that the secondary MMU is explicitly tied to the primary MMU that it is tracking. This is enforced in code, as the list of mmu_notifiers is stored in mm_struct.

The 1:1 ratio probably holds true today, e.g. for KVM, each VM is associated with exactly one mm_struct. But fundamentally, nothing would prevent a secondary MMU that manages a so called software TLB from tracking multiple primary MMUs.

E.g. it wouldn't be all that hard to implement in KVM (a bit crazy, but not hard), because KVM's memslots disallow gfn aliases, i.e. each index into KVM's secondary MMU would be associated with at most one VMA and thus mm_struct.

Pulling Dave's earlier comment in:

: But the short of it is that the msharefs host mm represents a "secondary
: MMU". I don't think it is really that special of an MMU other than the
: fact that it has an mm_struct.

and David's (so. many. Davids):

: I better not think about the complexity of secondary MMUs + mshare (e.g.,
: KVM with mshare in guest memory): MMU notifiers for all MMs must be
: called ...

mshare() is unique because it creates the possibility of chained "secondary" MMUs. I.e. the fact that it has an mm_struct makes it *very* special, IMO.

> > > * Maintains separate tables/PTEs, in completely separate page table
> > > hierarchy
> >
> > This is the case for KVM and the VMX/SVM MMUs, but it's not generally
> > true about hardware. IOMMUs can walk x86 page tables and populate the
> > IOTLB from the _same_ page table hierarchy as the CPU.
>
> Yes, of course.

Yeah, the recent rework of invalidate_range() => arch_invalidate_secondary_tlbs() sums things up nicely:

commit 1af5a8109904b7f00828e7f9f63f5695b42f8215
Author:     Alistair Popple <apopple@nvidia.com>
AuthorDate: Tue Jul 25 23:42:07 2023 +1000
Commit:     Andrew Morton <akpm@linux-foundation.org>
CommitDate: Fri Aug 18 10:12:41 2023 -0700

    mmu_notifiers: rename invalidate_range notifier

    There are two main use cases for mmu notifiers.
    One is by KVM which uses mmu_notifier_invalidate_range_start()/end() to manage a software TLB.

    The other is to manage hardware TLBs which need to use the invalidate_range() callback because HW can establish new TLB entries at any time. Hence using start/end() can lead to memory corruption as these callbacks happen too soon/late during page unmap.

    mmu notifier users should therefore either use the start()/end() callbacks or the invalidate_range() callbacks. To make this usage clearer rename the invalidate_range() callback to arch_invalidate_secondary_tlbs() and update documention.
On 10/7/24 2:01 AM, Kirill A. Shutemov wrote: > On Tue, Sep 03, 2024 at 04:22:31PM -0700, Anthony Yznaga wrote: >> This patch series implements a mechanism that allows userspace >> processes to opt into sharing PTEs. It adds a new in-memory >> filesystem - msharefs. A file created on msharefs represents a >> shared region where all processes mapping that region will map >> objects within it with shared PTEs. When the file is created, >> a new host mm struct is created to hold the shared page tables >> and vmas for objects later mapped into the shared region. This >> host mm struct is associated with the file and not with a task. > Taskless mm_struct can be problematic. Like, we don't have access to it's > counters because it is not represented in /proc. For instance, there's no > way to check its smaps. Definitely needs exposure in /proc. One of the things I'm looking into is the feasibility of showing the mappings in maps/smaps/etc.. > > Also, I *think* it is immune to oom-killer because oom-killer looks for a > victim task, not mm. > I hope it is not an intended feature :P oom-killer would have to kill all sharers of an mshare region before the mshare region itself could be freed, but I'm not sure that oom-killer would be the one to free the region. An mshare region is essentially a shared memory object not unlike a tmpfs or hugetlb file. I think some higher level intelligence would have to be involved to release it if appropriate when under oom conditions. > >> When a process mmap's the shared region, a vm flag VM_SHARED_PT >> is added to the vma. On page fault the vma is checked for the >> presence of the VM_SHARED_PT flag. > I think it is wrong approach. > > Instead of spaying VM_SHARED_PT checks across core-mm, we need to add a > generic hooks that can be used by mshare and hugetlb. And remove > is_vm_hugetlb_page() check from core-mm along the way. 
> > BTW, is_vm_hugetlb_page() callsites seem to be the indicator to check if > mshare has to do something differently there. I feel you miss a lot of > such cases. Good point about is_vm_hugetlb_page(). I'll review the callsites (there are only ~60 of them :-). Thanks, Anthony
On 07.10.24 21:23, Anthony Yznaga wrote: > > On 10/7/24 2:01 AM, Kirill A. Shutemov wrote: >> On Tue, Sep 03, 2024 at 04:22:31PM -0700, Anthony Yznaga wrote: >>> This patch series implements a mechanism that allows userspace >>> processes to opt into sharing PTEs. It adds a new in-memory >>> filesystem - msharefs. A file created on msharefs represents a >>> shared region where all processes mapping that region will map >>> objects within it with shared PTEs. When the file is created, >>> a new host mm struct is created to hold the shared page tables >>> and vmas for objects later mapped into the shared region. This >>> host mm struct is associated with the file and not with a task. >> Taskless mm_struct can be problematic. Like, we don't have access to it's >> counters because it is not represented in /proc. For instance, there's no >> way to check its smaps. > > Definitely needs exposure in /proc. One of the things I'm looking into > is the feasibility of showing the mappings in maps/smaps/etc.. > > >> >> Also, I *think* it is immune to oom-killer because oom-killer looks for a >> victim task, not mm. >> I hope it is not an intended feature :P > > oom-killer would have to kill all sharers of an mshare region before the > mshare region itself could be freed, but I'm not sure that oom-killer > would be the one to free the region. An mshare region is essentially a > shared memory object not unlike a tmpfs or hugetlb file. I think some > higher level intelligence would have to be involved to release it if > appropriate when under oom conditions. I thought we discussed that at LSF/MM last year and the conclusion was that a process is needed (OOM kill, cgroup handling, ...), and it must remain running. Once it stops running, we can force-unmap all pages etc. Did you look at the summary/(recording if available) of that, or am I remembering things wrongly or was there actually such a discussion? I know, it's problematic that this feature switched owners, ...
On 10/7/24 12:41 PM, David Hildenbrand wrote: > On 07.10.24 21:23, Anthony Yznaga wrote: >> >> On 10/7/24 2:01 AM, Kirill A. Shutemov wrote: >>> On Tue, Sep 03, 2024 at 04:22:31PM -0700, Anthony Yznaga wrote: >>>> This patch series implements a mechanism that allows userspace >>>> processes to opt into sharing PTEs. It adds a new in-memory >>>> filesystem - msharefs. A file created on msharefs represents a >>>> shared region where all processes mapping that region will map >>>> objects within it with shared PTEs. When the file is created, >>>> a new host mm struct is created to hold the shared page tables >>>> and vmas for objects later mapped into the shared region. This >>>> host mm struct is associated with the file and not with a task. >>> Taskless mm_struct can be problematic. Like, we don't have access to >>> it's >>> counters because it is not represented in /proc. For instance, >>> there's no >>> way to check its smaps. >> >> Definitely needs exposure in /proc. One of the things I'm looking into >> is the feasibility of showing the mappings in maps/smaps/etc.. >> >> >>> >>> Also, I *think* it is immune to oom-killer because oom-killer looks >>> for a >>> victim task, not mm. >>> I hope it is not an intended feature :P >> >> oom-killer would have to kill all sharers of an mshare region before the >> mshare region itself could be freed, but I'm not sure that oom-killer >> would be the one to free the region. An mshare region is essentially a >> shared memory object not unlike a tmpfs or hugetlb file. I think some >> higher level intelligence would have to be involved to release it if >> appropriate when under oom conditions. > > > I thought we discussed that at LSF/MM last year and the conclusion was > that a process is needed (OOM kill, cgroup handling, ...), and it must > remain running. Once it stops running, we can force-unmap all pages etc. 
> > Did you look at the summary/(recording if available) of that, or am I > remembering things wrongly or was there actually such a discussion? You're likely right. I'll review the discussion. Anthony > > I know, it's problematic that this feature switched owners, ...
On 10/7/24 9:45 AM, Sean Christopherson wrote: > On Mon, Oct 07, 2024, David Hildenbrand wrote: >> On 07.10.24 17:58, Dave Hansen wrote: >>> On 10/7/24 01:44, David Hildenbrand wrote: >>>> On 02.10.24 19:35, Dave Hansen wrote: >>>>> We were just chatting about this on David Rientjes's MM alignment call. >>>> Unfortunately I was not able to attend this time, my body decided it's a >>>> good idea to stay in bed for a couple of days. >>>> >>>>> I thought I'd try to give a little brain >>>>> >>>>> Let's start by thinking about KVM and secondary MMUs. KVM has a primary >>>>> mm: the QEMU (or whatever) process mm. The virtualization (EPT/NPT) >>>>> tables get entries that effectively mirror the primary mm page tables >>>>> and constitute a secondary MMU. If the primary page tables change, >>>>> mmu_notifiers ensure that the changes get reflected into the >>>>> virtualization tables and also that the virtualization paging structure >>>>> caches are flushed. >>>>> >>>>> msharefs is doing something very similar. But, in the msharefs case, >>>>> the secondary MMUs are actually normal CPU MMUs. The page tables are >>>>> normal old page tables and the caches are the normal old TLB. That's >>>>> what makes it so confusing: we have lots of infrastructure for dealing >>>>> with that "stuff" (CPU page tables and TLB), but msharefs has >>>>> short-circuited the infrastructure and it doesn't work any more. >>>> It's quite different IMHO, to a degree that I believe they are different >>>> beasts: >>>> >>>> Secondary MMUs: >>>> * "Belongs" to same MM context and the primary MMU (process page tables) >>> I think you're speaking to the ratio here. For each secondary MMU, I >>> think you're saying that there's one and only one mm_struct. Is that right? >> Yes, that is my understanding (at least with KVM). It's a secondary MMU >> derived from exactly one primary MMU (MM context -> page table hierarchy). > I don't think the ratio is what's important. 
> I think the important takeaway is that the secondary MMU is explicitly tied
> to the primary MMU that it is tracking. This is enforced in code, as the
> list of mmu_notifiers is stored in mm_struct.
>
> The 1:1 ratio probably holds true today, e.g. for KVM, each VM is associated
> with exactly one mm_struct. But fundamentally, nothing would prevent a
> secondary MMU that manages a so called software TLB from tracking multiple
> primary MMUs.
>
> E.g. it wouldn't be all that hard to implement in KVM (a bit crazy, but not
> hard), because KVM's memslots disallow gfn aliases, i.e. each index into
> KVM's secondary MMU would be associated with at most one VMA and thus
> mm_struct.
>
> Pulling Dave's earlier comment in:
>
> : But the short of it is that the msharefs host mm represents a "secondary
> : MMU". I don't think it is really that special of an MMU other than the
> : fact that it has an mm_struct.
>
> and David's (so. many. Davids):
>
> : I better not think about the complexity of seconary MMUs + mshare (e.g.,
> : KVM with mshare in guest memory): MMU notifiers for all MMs must be
> : called ...
>
> mshare() is unique because it creates the possibility of chained "secondary"
> MMUs. I.e. the fact that it has an mm_struct makes it *very* special, IMO.

This is definitely a gap with the current mshare implementation. Mapping memory that relies on an mmu notifier in an mshare region will result in the notifier callbacks never being called. On the surface it seems like the mshare mm needs notifiers that go through every mm that has mapped the mshare region and call their notifiers.

> > > > * Maintains separate tables/PTEs, in completely separate page table
> > > > hierarchy
> > > This is the case for KVM and the VMX/SVM MMUs, but it's not generally
> > > true about hardware. IOMMUs can walk x86 page tables and populate the
> > > IOTLB from the _same_ page table hierarchy as the CPU.
> > Yes, of course.
> Yeah, the recent rework of invalidate_range() => arch_invalidate_secondary_tlbs() > sums things up nicely: > > commit 1af5a8109904b7f00828e7f9f63f5695b42f8215 > Author: Alistair Popple <apopple@nvidia.com> > AuthorDate: Tue Jul 25 23:42:07 2023 +1000 > Commit: Andrew Morton <akpm@linux-foundation.org> > CommitDate: Fri Aug 18 10:12:41 2023 -0700 > > mmu_notifiers: rename invalidate_range notifier > > There are two main use cases for mmu notifiers. One is by KVM which uses > mmu_notifier_invalidate_range_start()/end() to manage a software TLB. > > The other is to manage hardware TLBs which need to use the > invalidate_range() callback because HW can establish new TLB entries at > any time. Hence using start/end() can lead to memory corruption as these > callbacks happen too soon/late during page unmap. > > mmu notifier users should therefore either use the start()/end() callbacks > or the invalidate_range() callbacks. To make this usage clearer rename > the invalidate_range() callback to arch_invalidate_secondary_tlbs() and > update documention. I believe if I implemented an arch_invalidate_secondary_tlbs notifier that flushed all TLBs, that would solve the problem of ensuring TLBs are flushed before pages being unmapped from an mshare region are freed. However, it seems like there would be potentially be a lot more collateral damage from flushing everything since the flushes would happen more often.
On Wed, Sep 4, 2024 at 1:22 AM Anthony Yznaga <anthony.yznaga@oracle.com> wrote:
> One major issue to address for this series to function correctly
> is how to ensure proper TLB flushing when a page in a shared
> region is unmapped. For example, since the rmaps for pages in a
> shared region map back to host vmas which point to a host mm, TLB
> flushes won't be directed to the CPUs the sharing processes have
> run on. I am by no means an expert in this area. One idea is to
> install a mmu_notifier on the host mm that can gather the necessary
> data and do flushes similar to the batch flushing.

The mmu_notifier API has two ways you can use it:

First, there is the classic mode, where before you start modifying
PTEs in some range, you remove mirrored PTEs from some other context,
and until you're done with your PTE modification, you don't allow
creation of new mirrored PTEs. This is intended for cases where
individual PTE entries are copied over to some other context (such as
EPT tables for virtualization). When I last looked at that code, it
looked fine, and this is what KVM uses. But it probably doesn't match
your usecase, since you wouldn't want removal of a single page to
cause the entire page table containing it to be temporarily unmapped
from the processes that use it?

Second, there is a newer mode for IOMMUv2 stuff (using the
mmu_notifier_ops::invalidate_range callback), where the idea is that
you have secondary MMUs that share the normal page tables, and so you
basically send them invalidations at the same time you invalidate the
primary MMU for the process. I think that's the right fit for this
usecase; however, last I looked, this code was extremely broken (see
https://lore.kernel.org/lkml/CAG48ez2NQKVbv=yG_fq_jtZjf8Q=+Wy54FxcFrK_OujFg5BwSQ@mail.gmail.com/
for context). Unless that's changed in the meantime, I think someone
would have to fix that code before it can be relied on for new
usecases.
On 10/14/24 1:07 PM, Jann Horn wrote:
> On Wed, Sep 4, 2024 at 1:22 AM Anthony Yznaga <anthony.yznaga@oracle.com> wrote:
>> One major issue to address for this series to function correctly
>> is how to ensure proper TLB flushing when a page in a shared
>> region is unmapped. For example, since the rmaps for pages in a
>> shared region map back to host vmas which point to a host mm, TLB
>> flushes won't be directed to the CPUs the sharing processes have
>> run on. I am by no means an expert in this area. One idea is to
>> install a mmu_notifier on the host mm that can gather the necessary
>> data and do flushes similar to the batch flushing.
> The mmu_notifier API has two ways you can use it:
>
> First, there is the classic mode, where before you start modifying
> PTEs in some range, you remove mirrored PTEs from some other context,
> and until you're done with your PTE modification, you don't allow
> creation of new mirrored PTEs. This is intended for cases where
> individual PTE entries are copied over to some other context (such as
> EPT tables for virtualization). When I last looked at that code, it
> looked fine, and this is what KVM uses. But it probably doesn't match
> your usecase, since you wouldn't want removal of a single page to
> cause the entire page table containing it to be temporarily unmapped
> from the processes that use it?

No, definitely don't want to do that. :-)

> Second, there is a newer mode for IOMMUv2 stuff (using the
> mmu_notifier_ops::invalidate_range callback), where the idea is that
> you have secondary MMUs that share the normal page tables, and so you
> basically send them invalidations at the same time you invalidate the
> primary MMU for the process. I think that's the right fit for this
> usecase; however, last I looked, this code was extremely broken (see
> https://lore.kernel.org/lkml/CAG48ez2NQKVbv=yG_fq_jtZjf8Q=+Wy54FxcFrK_OujFg5BwSQ@mail.gmail.com/
> for context). Unless that's changed in the meantime, I think someone
> would have to fix that code before it can be relied on for new
> usecases.

Thank you for this background! Looks like there have since been some
changes to the mmu notifiers, and the invalidate_range callback became
arch_invalidate_secondary_tlbs. I'm currently looking into using it to
flush all TLBs.

Anthony
On Wed, Oct 16, 2024 at 2:59 AM Anthony Yznaga <anthony.yznaga@oracle.com> wrote:
> On 10/14/24 1:07 PM, Jann Horn wrote:
> > Second, there is a newer mode for IOMMUv2 stuff (using the
> > mmu_notifier_ops::invalidate_range callback), where the idea is that
> > you have secondary MMUs that share the normal page tables, and so you
> > basically send them invalidations at the same time you invalidate the
> > primary MMU for the process. I think that's the right fit for this
> > usecase; however, last I looked, this code was extremely broken (see
> > https://lore.kernel.org/lkml/CAG48ez2NQKVbv=yG_fq_jtZjf8Q=+Wy54FxcFrK_OujFg5BwSQ@mail.gmail.com/
> > for context). Unless that's changed in the meantime, I think someone
> > would have to fix that code before it can be relied on for new
> > usecases.
>
> Thank you for this background! Looks like there have since been some
> changes to the mmu notifiers, and the invalidate_range callback became
> arch_invalidate_secondary_tlbs. I'm currently looking into using it to
> flush all TLBs.

Ah, nice, that looks much better now. With the caveat that, from what I
can tell, the notifiers only work on x86/arm64/powerpc - I guess maybe
that infrastructure should be gated on a HAVE_... arch config flag...