| Message ID | 20230522063725.284686-1-apopple@nvidia.com (mailing list archive) |
|---|---|
| State | New |
| Series | mmu_notifiers: Notify on pte permission upgrades |
Hi Alistair, On 2023/5/22 14:37, Alistair Popple wrote: > Some architectures, specifically ARM and perhaps Sparc and IA64, > require TLB invalidates when upgrading pte permission from read-only > to read-write. > > The current mmu_notifier implementation assumes that upgrades do not > need notifications. Typically though mmu_notifiers are used to > implement TLB invalidations for secondary MMUs that comply with the > main CPU architecture. > > Therefore if the main CPU architecture requires an invalidation for > permission upgrade the secondary MMU will as well and an mmu_notifier > should be sent for the upgrade. > > Currently CPU invalidations for permission upgrade occur in > ptep_set_access_flags(). Unfortunately MMU notifiers cannot be called > directly from this architecture specific code as the notifier > callbacks can sleep, and ptep_set_access_flags() is usually called > whilst holding the PTL spinlock. Therefore add the notifier calls > after the PTL is dropped and only if the PTE actually changed. This > will allow secondary MMUs to obtain an updated PTE with appropriate > permissions. > > This problem was discovered during testing of an ARM SMMU > implementation that does not support broadcast TLB maintenance > (BTM). In this case the SMMU driver uses notifiers to issue TLB > invalidates. For read-only to read-write pte upgrades the SMMU > continually returned a read-only PTE to the device, even though the > CPU had a read-write PTE installed. > > Sending a mmu notifier event to the SMMU driver fixes the problem by > flushing secondary TLB entries. A new notifier event type is added so > drivers may filter out these invalidations if not required. Note a > driver should never upgrade or install a PTE in response to this mmu > notifier event as it is not synchronised against other PTE operations. > > Signed-off-by: Alistair Popple <apopple@nvidia.com> > --- > include/linux/mmu_notifier.h | 6 +++++ > mm/hugetlb.c | 24 ++++++++++++++++++- > mm/memory.c | 45 ++++++++++++++++++++++++++++++++++-- > 3 files changed, 72 insertions(+), 3 deletions(-) > > diff --git a/include/linux/mmu_notifier.h b/include/linux/mmu_notifier.h > index d6c06e140277..f14d68f119d8 100644 > --- a/include/linux/mmu_notifier.h > +++ b/include/linux/mmu_notifier.h > @@ -31,6 +31,11 @@ struct mmu_interval_notifier; > * pages in the range so to mirror those changes the user must inspect the CPU > * page table (from the end callback). > * > + * @MMU_NOTIFY_PROTECTION_UPGRAGE: update is due to a change from read-only to > + * read-write for pages in the range. This must not be used to upgrade > + * permissions on secondary PTEs, rather it should only be used to invalidate > + * caches such as secondary TLBs that may cache old read-only entries. > + * > * @MMU_NOTIFY_SOFT_DIRTY: soft dirty accounting (still same page and same > * access flags). 
User should soft dirty the page in the end callback to make > * sure that anyone relying on soft dirtiness catch pages that might be written > @@ -53,6 +58,7 @@ enum mmu_notifier_event { > MMU_NOTIFY_CLEAR, > MMU_NOTIFY_PROTECTION_VMA, > MMU_NOTIFY_PROTECTION_PAGE, > + MMU_NOTIFY_PROTECTION_UPGRADE, > MMU_NOTIFY_SOFT_DIRTY, > MMU_NOTIFY_RELEASE, > MMU_NOTIFY_MIGRATE, > diff --git a/mm/hugetlb.c b/mm/hugetlb.c > index bdbfeb6fb393..e5d467c7bff7 100644 > --- a/mm/hugetlb.c > +++ b/mm/hugetlb.c > @@ -5987,6 +5987,7 @@ vm_fault_t hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma, > vm_fault_t ret; > u32 hash; > pgoff_t idx; > + bool changed = false; > struct page *page = NULL; > struct page *pagecache_page = NULL; > struct hstate *h = hstate_vma(vma); > @@ -6122,6 +6123,9 @@ vm_fault_t hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma, > if (!huge_pte_write(entry)) { > ret = hugetlb_wp(mm, vma, address, ptep, flags, > pagecache_page, ptl); > + if (!ret) > + changed = true; > + > goto out_put_page; > } else if (likely(flags & FAULT_FLAG_WRITE)) { > entry = huge_pte_mkdirty(entry); > @@ -6129,8 +6133,11 @@ vm_fault_t hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma, > } > entry = pte_mkyoung(entry); > if (huge_ptep_set_access_flags(vma, haddr, ptep, entry, > - flags & FAULT_FLAG_WRITE)) > + flags & FAULT_FLAG_WRITE)) { > update_mmu_cache(vma, haddr, ptep); > + changed = true; > + } > + > out_put_page: > if (page != pagecache_page) > unlock_page(page); > @@ -6138,6 +6145,21 @@ vm_fault_t hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma, > out_ptl: > spin_unlock(ptl); > > + if (changed) { > + struct mmu_notifier_range range; > + unsigned long hpage_mask = huge_page_mask(h); > + unsigned long hpage_size = huge_page_size(h); > + > + update_mmu_cache(vma, haddr, ptep); > + > + mmu_notifier_range_init(&range, MMU_NOTIFY_PROTECTION_UPGRADE, > + 0, vma, mm, haddr & hpage_mask, > + (haddr & hpage_mask) + hpage_size); > + mmu_notifier_invalidate_range_start(&range); > + mmu_notifier_invalidate_range_end(&range); > + } > + > + > if (pagecache_page) { > unlock_page(pagecache_page); > put_page(pagecache_page); > diff --git a/mm/memory.c b/mm/memory.c > index f526b9152bef..0ac78c6a232c 100644 > --- a/mm/memory.c > +++ b/mm/memory.c > @@ -2098,6 +2098,7 @@ static vm_fault_t insert_pfn(struct vm_area_struct *vma, unsigned long addr, > struct mm_struct *mm = vma->vm_mm; > pte_t *pte, entry; > spinlock_t *ptl; > + bool changed = false; > > pte = get_locked_pte(mm, addr, &ptl); > if (!pte) > @@ -2120,8 +2121,10 @@ static vm_fault_t insert_pfn(struct vm_area_struct *vma, unsigned long addr, > } > entry = pte_mkyoung(*pte); > entry = maybe_mkwrite(pte_mkdirty(entry), vma); > - if (ptep_set_access_flags(vma, addr, pte, entry, 1)) > + if (ptep_set_access_flags(vma, addr, pte, entry, 1)) { > update_mmu_cache(vma, addr, pte); > + changed = true; > + } > } > goto out_unlock; > } > @@ -2142,6 +2145,17 @@ static vm_fault_t insert_pfn(struct vm_area_struct *vma, unsigned long addr, > > out_unlock: > pte_unmap_unlock(pte, ptl); > + > + if (changed) { > + struct mmu_notifier_range range; > + > + mmu_notifier_range_init(&range, MMU_NOTIFY_PROTECTION_UPGRADE, > + 0, vma, mm, addr & PAGE_MASK, > + (addr & PAGE_MASK) + PAGE_SIZE); > + mmu_notifier_invalidate_range_start(&range); > + mmu_notifier_invalidate_range_end(&range); > + } > + > return VM_FAULT_NOPAGE; > } > > @@ -2820,6 +2834,7 @@ static inline int __wp_page_copy_user(struct page *dst, struct page *src, > struct 
vm_area_struct *vma = vmf->vma; > struct mm_struct *mm = vma->vm_mm; > unsigned long addr = vmf->address; > + bool changed = false; > > if (likely(src)) { > if (copy_mc_user_highpage(dst, src, addr, vma)) { > @@ -2858,8 +2873,10 @@ static inline int __wp_page_copy_user(struct page *dst, struct page *src, > } > > entry = pte_mkyoung(vmf->orig_pte); > - if (ptep_set_access_flags(vma, addr, vmf->pte, entry, 0)) > + if (ptep_set_access_flags(vma, addr, vmf->pte, entry, 0)) { > update_mmu_cache(vma, addr, vmf->pte); > + changed = true; > + } > } > > /* > @@ -2897,6 +2914,16 @@ static inline int __wp_page_copy_user(struct page *dst, struct page *src, > } > } > > + if (changed) { > + struct mmu_notifier_range range; > + > + mmu_notifier_range_init(&range, MMU_NOTIFY_PROTECTION_UPGRADE, > + 0, vma, vma->vm_mm, addr & PAGE_MASK, > + (addr & PAGE_MASK) + PAGE_SIZE); > + mmu_notifier_invalidate_range_start(&range); > + mmu_notifier_invalidate_range_end(&range); > + } > + > ret = 0; > > pte_unlock: > @@ -4877,6 +4904,7 @@ static vm_fault_t wp_huge_pud(struct vm_fault *vmf, pud_t orig_pud) > static vm_fault_t handle_pte_fault(struct vm_fault *vmf) > { > pte_t entry; > + bool changed = false; > > if (unlikely(pmd_none(*vmf->pmd))) { > /* > @@ -4957,6 +4985,7 @@ static vm_fault_t handle_pte_fault(struct vm_fault *vmf) > if (ptep_set_access_flags(vmf->vma, vmf->address, vmf->pte, entry, > vmf->flags & FAULT_FLAG_WRITE)) { > update_mmu_cache(vmf->vma, vmf->address, vmf->pte); > + changed = true; > } else { > /* Skip spurious TLB flush for retried page fault */ > if (vmf->flags & FAULT_FLAG_TRIED) > @@ -4972,6 +5001,18 @@ static vm_fault_t handle_pte_fault(struct vm_fault *vmf) > } > unlock: > pte_unmap_unlock(vmf->pte, vmf->ptl); > + > + if (changed) { > + struct mmu_notifier_range range; > + > + mmu_notifier_range_init(&range, MMU_NOTIFY_PROTECTION_UPGRADE, > + 0, vmf->vma, vmf->vma->vm_mm, > + vmf->address & PAGE_MASK, > + (vmf->address & PAGE_MASK) + PAGE_SIZE); > + mmu_notifier_invalidate_range_start(&range); > + mmu_notifier_invalidate_range_end(&range); > + } There are four similar patterns, can we introduce a helper function to deduplicate them? > + > return 0; > } >
Qi Zheng <qi.zheng@linux.dev> writes:

> Hi Alistair,
>
> On 2023/5/22 14:37, Alistair Popple wrote:

[...]

>> +	if (changed) {
>> +		struct mmu_notifier_range range;
>> +
>> +		mmu_notifier_range_init(&range, MMU_NOTIFY_PROTECTION_UPGRADE,
>> +					0, vmf->vma, vmf->vma->vm_mm,
>> +					vmf->address & PAGE_MASK,
>> +					(vmf->address & PAGE_MASK) + PAGE_SIZE);
>> +		mmu_notifier_invalidate_range_start(&range);
>> +		mmu_notifier_invalidate_range_end(&range);
>> +	}
>
> There are four similar patterns, can we introduce a helper function to
> deduplicate them?

For sure. How about something like this?

void mmu_notifier_range_start_end(enum mmu_notifier_event event,
				  struct vm_area_struct *vma,
				  struct mm_struct *mm,
				  unsigned long start,
				  unsigned long end)

As an aside I didn't just use mmu_notifier_invalidate_range() as that
doesn't allow an event type to be set for interval notifiers which may
want to filter this.

>> +
>> 	return 0;
>> }
>>
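For illustration, the helper signature above could be filled in roughly as
follows. The body below is inferred from the four open-coded blocks in the
patch and was not posted to the thread:

void mmu_notifier_range_start_end(enum mmu_notifier_event event,
				  struct vm_area_struct *vma,
				  struct mm_struct *mm,
				  unsigned long start,
				  unsigned long end)
{
	struct mmu_notifier_range range;

	/* The same sequence the patch currently open-codes at each call site. */
	mmu_notifier_range_init(&range, event, 0, vma, mm, start, end);
	mmu_notifier_invalidate_range_start(&range);
	mmu_notifier_invalidate_range_end(&range);
}

Each of the four fixups would then collapse to a single call such as
mmu_notifier_range_start_end(MMU_NOTIFY_PROTECTION_UPGRADE, vma, mm,
addr & PAGE_MASK, (addr & PAGE_MASK) + PAGE_SIZE) made after the PTL is
dropped.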
On 2023/5/22 15:45, Alistair Popple wrote: > > Qi Zheng <qi.zheng@linux.dev> writes: > >> Hi Alistair, >> >> On 2023/5/22 14:37, Alistair Popple wrote: > > [...] > >>> + if (changed) { >>> + struct mmu_notifier_range range; >>> + >>> + mmu_notifier_range_init(&range, MMU_NOTIFY_PROTECTION_UPGRADE, >>> + 0, vmf->vma, vmf->vma->vm_mm, >>> + vmf->address & PAGE_MASK, >>> + (vmf->address & PAGE_MASK) + PAGE_SIZE); >>> + mmu_notifier_invalidate_range_start(&range); >>> + mmu_notifier_invalidate_range_end(&range); >>> + } >> >> There are four similar patterns, can we introduce a helper function to >> deduplicate them? > > For sure. How about something like this? > > void mmu_notifier_range_start_end(enum mmu_notifier_event event, > struct vm_area_struct *vma, > struct mm_struct *mm, > unsigned long start, > unsigned long end) LGTM. :) > > As an aside I didn't just use mmu_notifier_invalidate_range() as that > doesn't allow an event type to be set for interval notifiers which may > want to filter this. > >>> + >>> return 0; >>> } >>> >
On Mon, May 22, 2023 at 04:37:25PM +1000, Alistair Popple wrote:
> diff --git a/mm/memory.c b/mm/memory.c
> index f526b9152bef..0ac78c6a232c 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -2098,6 +2098,7 @@ static vm_fault_t insert_pfn(struct vm_area_struct *vma, unsigned long addr,
> 	struct mm_struct *mm = vma->vm_mm;
> 	pte_t *pte, entry;
> 	spinlock_t *ptl;
> +	bool changed = false;
>
> 	pte = get_locked_pte(mm, addr, &ptl);
> 	if (!pte)
> @@ -2120,8 +2121,10 @@ static vm_fault_t insert_pfn(struct vm_area_struct *vma, unsigned long addr,
> 		}
> 		entry = pte_mkyoung(*pte);
> 		entry = maybe_mkwrite(pte_mkdirty(entry), vma);
> -		if (ptep_set_access_flags(vma, addr, pte, entry, 1))
> +		if (ptep_set_access_flags(vma, addr, pte, entry, 1)) {
> 			update_mmu_cache(vma, addr, pte);
> +			changed = true;
> +		}
> 		}
> 		goto out_unlock;
> 	}

I haven't checked all the corner cases but can we not have a
ptep_set_access_flags_notify() that handles this (and the huge
equivalent)? It matches the other API like ptep_clear_flush_notify().
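For readers unfamiliar with the pattern Catalin refers to, a generic
fallback modelled on the existing ptep_clear_flush_notify() macro in
include/linux/mmu_notifier.h might look roughly like the sketch below.
This is illustrative only, not posted code; it cannot convey an event
type for filtering, and the thread later converges on putting the
notifier call inside the architecture's own ptep_set_access_flags()
instead:

#define ptep_set_access_flags_notify(__vma, __addr, __ptep, __entry, __dirty)	\
({										\
	int __changed = ptep_set_access_flags(__vma, __addr, __ptep,		\
					      __entry, __dirty);		\
	if (__changed)								\
		/* flush secondary TLBs alongside the CPU TLB flush */		\
		mmu_notifier_invalidate_range((__vma)->vm_mm,			\
					      (__addr) & PAGE_MASK,		\
					      ((__addr) & PAGE_MASK) + PAGE_SIZE); \
	__changed;								\
})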
On Mon, May 22, 2023, Alistair Popple wrote: > Some architectures, specifically ARM and perhaps Sparc and IA64, > require TLB invalidates when upgrading pte permission from read-only > to read-write. > > The current mmu_notifier implementation assumes that upgrades do not > need notifications. Typically though mmu_notifiers are used to > implement TLB invalidations for secondary MMUs that comply with the > main CPU architecture. > > Therefore if the main CPU architecture requires an invalidation for > permission upgrade the secondary MMU will as well and an mmu_notifier > should be sent for the upgrade. > > Currently CPU invalidations for permission upgrade occur in > ptep_set_access_flags(). Unfortunately MMU notifiers cannot be called > directly from this architecture specific code as the notifier > callbacks can sleep, and ptep_set_access_flags() is usually called > whilst holding the PTL spinlock. Therefore add the notifier calls > after the PTL is dropped and only if the PTE actually changed. This > will allow secondary MMUs to obtain an updated PTE with appropriate > permissions. > > This problem was discovered during testing of an ARM SMMU > implementation that does not support broadcast TLB maintenance > (BTM). In this case the SMMU driver uses notifiers to issue TLB > invalidates. For read-only to read-write pte upgrades the SMMU > continually returned a read-only PTE to the device, even though the > CPU had a read-write PTE installed. > > Sending a mmu notifier event to the SMMU driver fixes the problem by > flushing secondary TLB entries. A new notifier event type is added so > drivers may filter out these invalidations if not required. Note a > driver should never upgrade or install a PTE in response to this mmu > notifier event as it is not synchronised against other PTE operations. > > Signed-off-by: Alistair Popple <apopple@nvidia.com> > --- > include/linux/mmu_notifier.h | 6 +++++ > mm/hugetlb.c | 24 ++++++++++++++++++- > mm/memory.c | 45 ++++++++++++++++++++++++++++++++++-- > 3 files changed, 72 insertions(+), 3 deletions(-) > > diff --git a/include/linux/mmu_notifier.h b/include/linux/mmu_notifier.h > index d6c06e140277..f14d68f119d8 100644 > --- a/include/linux/mmu_notifier.h > +++ b/include/linux/mmu_notifier.h > @@ -31,6 +31,11 @@ struct mmu_interval_notifier; > * pages in the range so to mirror those changes the user must inspect the CPU > * page table (from the end callback). > * > + * @MMU_NOTIFY_PROTECTION_UPGRAGE: update is due to a change from read-only to > + * read-write for pages in the range. This must not be used to upgrade > + * permissions on secondary PTEs, rather it should only be used to invalidate > + * caches such as secondary TLBs that may cache old read-only entries. This is a poor fit for invalidate_range_{start,end}(). All other uses bookend the primary MMU update, i.e. call start() _before_ changing PTEs. The comments are somewhat stale as they talk only about "unmapped", but the contract between the primary MMU and the secondary MMU is otherwise quite clear on when the primary MMU will invoke start() and end(). * invalidate_range_start() is called when all pages in the * range are still mapped and have at least a refcount of one. * * invalidate_range_end() is called when all pages in the * range have been unmapped and the pages have been freed by * the VM. I'm also confused as to how this actually fixes ARM's SMMU. Unless I'm looking at the wrong SMMU implementation, the SMMU implemenents only invalidate_range(), not the start()/end() variants. 
static const struct mmu_notifier_ops arm_smmu_mmu_notifier_ops = { .invalidate_range = arm_smmu_mm_invalidate_range, .release = arm_smmu_mm_release, .free_notifier = arm_smmu_mmu_notifier_free, }; Again from include/linux/mmu_notifier.h, not implementing the start()/end() hooks is perfectly valid. And AFAICT, the existing invalidate_range() hook is pretty much a perfect fit for what you want to achieve. * If invalidate_range() is used to manage a non-CPU TLB with * shared page-tables, it not necessary to implement the * invalidate_range_start()/end() notifiers, as * invalidate_range() already catches the points in time when an * external TLB range needs to be flushed. For more in depth * discussion on this see Documentation/mm/mmu_notifier.rst Even worse, this change may silently regress performance for secondary MMUs that haven't yet taken advantage of the event type, e.g. KVM will zap all of KVM's PTEs in response to the upgrade, instead of waiting until the guest actually tries to utilize the new protections.
Sean Christopherson <seanjc@google.com> writes: > On Mon, May 22, 2023, Alistair Popple wrote: >> Some architectures, specifically ARM and perhaps Sparc and IA64, >> require TLB invalidates when upgrading pte permission from read-only >> to read-write. >> >> The current mmu_notifier implementation assumes that upgrades do not >> need notifications. Typically though mmu_notifiers are used to >> implement TLB invalidations for secondary MMUs that comply with the >> main CPU architecture. >> >> Therefore if the main CPU architecture requires an invalidation for >> permission upgrade the secondary MMU will as well and an mmu_notifier >> should be sent for the upgrade. >> >> Currently CPU invalidations for permission upgrade occur in >> ptep_set_access_flags(). Unfortunately MMU notifiers cannot be called >> directly from this architecture specific code as the notifier >> callbacks can sleep, and ptep_set_access_flags() is usually called >> whilst holding the PTL spinlock. Therefore add the notifier calls >> after the PTL is dropped and only if the PTE actually changed. This >> will allow secondary MMUs to obtain an updated PTE with appropriate >> permissions. >> >> This problem was discovered during testing of an ARM SMMU >> implementation that does not support broadcast TLB maintenance >> (BTM). In this case the SMMU driver uses notifiers to issue TLB >> invalidates. For read-only to read-write pte upgrades the SMMU >> continually returned a read-only PTE to the device, even though the >> CPU had a read-write PTE installed. >> >> Sending a mmu notifier event to the SMMU driver fixes the problem by >> flushing secondary TLB entries. A new notifier event type is added so >> drivers may filter out these invalidations if not required. Note a >> driver should never upgrade or install a PTE in response to this mmu >> notifier event as it is not synchronised against other PTE operations. >> >> Signed-off-by: Alistair Popple <apopple@nvidia.com> >> --- >> include/linux/mmu_notifier.h | 6 +++++ >> mm/hugetlb.c | 24 ++++++++++++++++++- >> mm/memory.c | 45 ++++++++++++++++++++++++++++++++++-- >> 3 files changed, 72 insertions(+), 3 deletions(-) >> >> diff --git a/include/linux/mmu_notifier.h b/include/linux/mmu_notifier.h >> index d6c06e140277..f14d68f119d8 100644 >> --- a/include/linux/mmu_notifier.h >> +++ b/include/linux/mmu_notifier.h >> @@ -31,6 +31,11 @@ struct mmu_interval_notifier; >> * pages in the range so to mirror those changes the user must inspect the CPU >> * page table (from the end callback). >> * >> + * @MMU_NOTIFY_PROTECTION_UPGRAGE: update is due to a change from read-only to >> + * read-write for pages in the range. This must not be used to upgrade >> + * permissions on secondary PTEs, rather it should only be used to invalidate >> + * caches such as secondary TLBs that may cache old read-only entries. > > This is a poor fit for invalidate_range_{start,end}(). All other uses bookend > the primary MMU update, i.e. call start() _before_ changing PTEs. The comments > are somewhat stale as they talk only about "unmapped", but the contract between > the primary MMU and the secondary MMU is otherwise quite clear on when the primary > MMU will invoke start() and end(). > > * invalidate_range_start() is called when all pages in the > * range are still mapped and have at least a refcount of one. > * > * invalidate_range_end() is called when all pages in the > * range have been unmapped and the pages have been freed by > * the VM. 
> > I'm also confused as to how this actually fixes ARM's SMMU. Unless I'm looking > at the wrong SMMU implementation, the SMMU implemenents only invalidate_range(), > not the start()/end() variants. mmu_invalidate_range_end() calls the invalidate_range() callback if the start()/end() variants aren't set. > static const struct mmu_notifier_ops arm_smmu_mmu_notifier_ops = { > .invalidate_range = arm_smmu_mm_invalidate_range, > .release = arm_smmu_mm_release, > .free_notifier = arm_smmu_mmu_notifier_free, > }; > > Again from include/linux/mmu_notifier.h, not implementing the start()/end() hooks > is perfectly valid. And AFAICT, the existing invalidate_range() hook is pretty > much a perfect fit for what you want to achieve. Right, I didn't take that approach because it doesn't allow an event type to be passed which would allow them to be filtered on platforms which don't require this. I had also assumed the invalidate_range() callbacks were allowed to sleep, hence couldn't be called under PTL. That's certainly true of mmu interval notifier callbacks, but Catalin reminded me that calls such as ptep_clear_flush_notify() already call invalidate_range() callback under PTL so I guess we already assume drivers don't sleep in their invalidate_range() callbacks. I will update the comments to reflect that. > * If invalidate_range() is used to manage a non-CPU TLB with > * shared page-tables, it not necessary to implement the > * invalidate_range_start()/end() notifiers, as > * invalidate_range() already catches the points in time when an > * external TLB range needs to be flushed. For more in depth > * discussion on this see Documentation/mm/mmu_notifier.rst > > Even worse, this change may silently regress performance for secondary MMUs that > haven't yet taken advantage of the event type, e.g. KVM will zap all of KVM's PTEs > in response to the upgrade, instead of waiting until the guest actually tries to > utilize the new protections. Yeah, I like the idea of introducing a ptep_set_access_flags_notify(). That way this won't regress performance on platforms that don't need it. Note this isn't a new feature but rather a bugfix. It's unclear to me why KVM on ARM hasn't already run into this issue, but I'm no KVM expert. Thanks for the feedback. - Alistair
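To illustrate the filtering argument made above: only the start()/end()
path passes a struct mmu_notifier_range, and with it range->event, so a
secondary-MMU driver that does not need upgrade flushes could skip them
along these lines. The example_dev structure and flush helper below are
hypothetical, not code from any driver:

static int example_invalidate_range_start(struct mmu_notifier *mn,
					  const struct mmu_notifier_range *range)
{
	/* hypothetical driver context embedding the notifier */
	struct example_dev *dev = container_of(mn, struct example_dev, mn);

	/* skip read-only -> read-write upgrades if this device doesn't need them */
	if (range->event == MMU_NOTIFY_PROTECTION_UPGRADE &&
	    !dev->needs_upgrade_flush)
		return 0;

	example_flush_tlb_range(dev, range->start, range->end);
	return 0;
}

The plain invalidate_range() callback only receives the mm and an address
range, which is why it cannot do this kind of filtering.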
Catalin Marinas <catalin.marinas@arm.com> writes: > On Mon, May 22, 2023 at 04:37:25PM +1000, Alistair Popple wrote: >> diff --git a/mm/memory.c b/mm/memory.c >> index f526b9152bef..0ac78c6a232c 100644 >> --- a/mm/memory.c >> +++ b/mm/memory.c >> @@ -2098,6 +2098,7 @@ static vm_fault_t insert_pfn(struct vm_area_struct *vma, unsigned long addr, >> struct mm_struct *mm = vma->vm_mm; >> pte_t *pte, entry; >> spinlock_t *ptl; >> + bool changed = false; >> >> pte = get_locked_pte(mm, addr, &ptl); >> if (!pte) >> @@ -2120,8 +2121,10 @@ static vm_fault_t insert_pfn(struct vm_area_struct *vma, unsigned long addr, >> } >> entry = pte_mkyoung(*pte); >> entry = maybe_mkwrite(pte_mkdirty(entry), vma); >> - if (ptep_set_access_flags(vma, addr, pte, entry, 1)) >> + if (ptep_set_access_flags(vma, addr, pte, entry, 1)) { >> update_mmu_cache(vma, addr, pte); >> + changed = true; >> + } >> } >> goto out_unlock; >> } > > I haven't checked all the corner cases but can we not have a > ptep_set_access_flags_notify() that handles this (and the huge > equivalent)? It matches the other API like ptep_clear_flush_notify(). Good suggestion, thanks! I can make the implementations architecture specific too which helps with filtering on platforms that don't need it. I had assumed the invalidate_range() callbacks could sleep and therefore couldn't be called under PTL. However ptep_clear_flush_notify() already calls invalidate_range() under PTL, so we already assume drivers don't sleep in the invalidate_range() callbacks.
On Tue, May 23, 2023, Alistair Popple wrote: > > Sean Christopherson <seanjc@google.com> writes: > >> diff --git a/include/linux/mmu_notifier.h b/include/linux/mmu_notifier.h > >> index d6c06e140277..f14d68f119d8 100644 > >> --- a/include/linux/mmu_notifier.h > >> +++ b/include/linux/mmu_notifier.h > >> @@ -31,6 +31,11 @@ struct mmu_interval_notifier; > >> * pages in the range so to mirror those changes the user must inspect the CPU > >> * page table (from the end callback). > >> * > >> + * @MMU_NOTIFY_PROTECTION_UPGRAGE: update is due to a change from read-only to > >> + * read-write for pages in the range. This must not be used to upgrade > >> + * permissions on secondary PTEs, rather it should only be used to invalidate > >> + * caches such as secondary TLBs that may cache old read-only entries. > > > > This is a poor fit for invalidate_range_{start,end}(). All other uses bookend > > the primary MMU update, i.e. call start() _before_ changing PTEs. The comments > > are somewhat stale as they talk only about "unmapped", but the contract between > > the primary MMU and the secondary MMU is otherwise quite clear on when the primary > > MMU will invoke start() and end(). > > > > * invalidate_range_start() is called when all pages in the > > * range are still mapped and have at least a refcount of one. > > * > > * invalidate_range_end() is called when all pages in the > > * range have been unmapped and the pages have been freed by > > * the VM. > > > > I'm also confused as to how this actually fixes ARM's SMMU. Unless I'm looking > > at the wrong SMMU implementation, the SMMU implemenents only invalidate_range(), > > not the start()/end() variants. > > mmu_invalidate_range_end() calls the invalidate_range() callback if > the start()/end() variants aren't set. Ahh, thanks. > > static const struct mmu_notifier_ops arm_smmu_mmu_notifier_ops = { > > .invalidate_range = arm_smmu_mm_invalidate_range, > > .release = arm_smmu_mm_release, > > .free_notifier = arm_smmu_mmu_notifier_free, > > }; > > > > Again from include/linux/mmu_notifier.h, not implementing the start()/end() hooks > > is perfectly valid. And AFAICT, the existing invalidate_range() hook is pretty > > much a perfect fit for what you want to achieve. > > Right, I didn't take that approach because it doesn't allow an event > type to be passed which would allow them to be filtered on platforms > which don't require this. > > I had also assumed the invalidate_range() callbacks were allowed to > sleep, hence couldn't be called under PTL. That's certainly true of mmu > interval notifier callbacks, but Catalin reminded me that calls such as > ptep_clear_flush_notify() already call invalidate_range() callback under > PTL so I guess we already assume drivers don't sleep in their > invalidate_range() callbacks. I will update the comments to reflect > that. > > > * If invalidate_range() is used to manage a non-CPU TLB with > > * shared page-tables, it not necessary to implement the > > * invalidate_range_start()/end() notifiers, as > > * invalidate_range() already catches the points in time when an > > * external TLB range needs to be flushed. For more in depth > > * discussion on this see Documentation/mm/mmu_notifier.rst > > > > Even worse, this change may silently regress performance for secondary MMUs that > > haven't yet taken advantage of the event type, e.g. KVM will zap all of KVM's PTEs > > in response to the upgrade, instead of waiting until the guest actually tries to > > utilize the new protections. 
> > Yeah, I like the idea of introducing a > ptep_set_access_flags_notify(). That way this won't regress performance > on platforms that don't need it. Note this isn't a new feature but > rather a bugfix. It's unclear to me why KVM on ARM hasn't already run > into this issue, but I'm no KVM expert. Thanks for the feedback. KVM manages its own page tables and so does its own TLB invalidations as needed, e.g. KVM can and does change KVM's stage-2 PTEs from read-only to read/write irrespective of mmu_notifiers. I assume the SMMU issue arises only because the SMMU is reusing the host kernel's (stage-1?) page tables.
Sean Christopherson <seanjc@google.com> writes:

> On Tue, May 23, 2023, Alistair Popple wrote:
>>
>> Sean Christopherson <seanjc@google.com> writes:

[...]

>> > 	* If invalidate_range() is used to manage a non-CPU TLB with
>> > 	* shared page-tables, it not necessary to implement the
>> > 	* invalidate_range_start()/end() notifiers, as
>> > 	* invalidate_range() already catches the points in time when an
>> > 	* external TLB range needs to be flushed. For more in depth
>> > 	* discussion on this see Documentation/mm/mmu_notifier.rst
>> >
>> > Even worse, this change may silently regress performance for secondary MMUs that
>> > haven't yet taken advantage of the event type, e.g. KVM will zap all of KVM's PTEs
>> > in response to the upgrade, instead of waiting until the guest actually tries to
>> > utilize the new protections.
>>
>> Yeah, I like the idea of introducing a
>> ptep_set_access_flags_notify(). That way this won't regress performance
>> on platforms that don't need it. Note this isn't a new feature but
>> rather a bugfix. It's unclear to me why KVM on ARM hasn't already run
>> into this issue, but I'm no KVM expert. Thanks for the feedback.
>
> KVM manages its own page tables and so does its own TLB invalidations as needed,
> e.g. KVM can and does change KVM's stage-2 PTEs from read-only to read/write
> irrespective of mmu_notifiers. I assume the SMMU issue arises only because the
> SMMU is reusing the host kernel's (stage-1?) page tables.

Argh, thanks. That makes sense. The SMMU issue arises because it is not
snooping CPU TLB invalidations and therefore relies entirely on notifier
callbacks to invalidate its TLB. If it was snooping invalidations it would
observe the TLB invalidation ARM64 does in ptep_set_access_flags()[1].

Now that I've figured out we can call invalidate_range() under the PTL I
think I can just add the notifier call there.

[1] - https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/arch/arm64/mm/fault.c?id=ae8373a5add4ea39f032563cf12a02946d1e3546#n229
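The direction described above would put the secondary-TLB flush next to
the CPU TLB maintenance that arm64's ptep_set_access_flags() already does
when the PTE changes. A rough, purely illustrative sketch follows; the
__arm64_update_pte_access_flags() helper is a hypothetical stand-in for
the existing arm64 PTE-update and CPU-TLB-flush logic, not a real
function:

int ptep_set_access_flags(struct vm_area_struct *vma, unsigned long address,
			  pte_t *ptep, pte_t entry, int dirty)
{
	/* hypothetical stand-in for the existing arm64 update + CPU TLB flush */
	int changed = __arm64_update_pte_access_flags(vma, address, ptep,
						      entry, dirty);

	if (changed)
		/* mirror the CPU-side invalidate to secondary TLBs such as the SMMU */
		mmu_notifier_invalidate_range(vma->vm_mm, address & PAGE_MASK,
					      (address & PAGE_MASK) + PAGE_SIZE);

	return changed;
}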
On 5/21/23 23:37, Alistair Popple wrote: > Some architectures, specifically ARM and perhaps Sparc and IA64, > require TLB invalidates when upgrading pte permission from read-only > to read-write. > > The current mmu_notifier implementation assumes that upgrades do not > need notifications. Typically though mmu_notifiers are used to > implement TLB invalidations for secondary MMUs that comply with the > main CPU architecture. > > Therefore if the main CPU architecture requires an invalidation for > permission upgrade the secondary MMU will as well and an mmu_notifier > should be sent for the upgrade. > > Currently CPU invalidations for permission upgrade occur in > ptep_set_access_flags(). Unfortunately MMU notifiers cannot be called > directly from this architecture specific code as the notifier > callbacks can sleep, and ptep_set_access_flags() is usually called > whilst holding the PTL spinlock. Therefore add the notifier calls > after the PTL is dropped and only if the PTE actually changed. This > will allow secondary MMUs to obtain an updated PTE with appropriate > permissions. > > This problem was discovered during testing of an ARM SMMU > implementation that does not support broadcast TLB maintenance > (BTM). In this case the SMMU driver uses notifiers to issue TLB > invalidates. For read-only to read-write pte upgrades the SMMU > continually returned a read-only PTE to the device, even though the > CPU had a read-write PTE installed. > > Sending a mmu notifier event to the SMMU driver fixes the problem by > flushing secondary TLB entries. A new notifier event type is added so > drivers may filter out these invalidations if not required. Note a > driver should never upgrade or install a PTE in response to this mmu > notifier event as it is not synchronised against other PTE operations. > > Signed-off-by: Alistair Popple <apopple@nvidia.com> > --- > include/linux/mmu_notifier.h | 6 +++++ > mm/hugetlb.c | 24 ++++++++++++++++++- > mm/memory.c | 45 ++++++++++++++++++++++++++++++++++-- > 3 files changed, 72 insertions(+), 3 deletions(-) Point of order: What is this based on? It would really help if you would either use --base with git-format-patch, or else just mention the base somewhere. Otherwise, actually applying the patch takes some hunting around...in particular for older stuff. This is from Feb, 2023, before hugetlb.c got converted to folios, right? thanks,
John Hubbard <jhubbard@nvidia.com> writes: > On 5/21/23 23:37, Alistair Popple wrote: >> Some architectures, specifically ARM and perhaps Sparc and IA64, >> require TLB invalidates when upgrading pte permission from read-only >> to read-write. >> The current mmu_notifier implementation assumes that upgrades do not >> need notifications. Typically though mmu_notifiers are used to >> implement TLB invalidations for secondary MMUs that comply with the >> main CPU architecture. >> Therefore if the main CPU architecture requires an invalidation for >> permission upgrade the secondary MMU will as well and an mmu_notifier >> should be sent for the upgrade. >> Currently CPU invalidations for permission upgrade occur in >> ptep_set_access_flags(). Unfortunately MMU notifiers cannot be called >> directly from this architecture specific code as the notifier >> callbacks can sleep, and ptep_set_access_flags() is usually called >> whilst holding the PTL spinlock. Therefore add the notifier calls >> after the PTL is dropped and only if the PTE actually changed. This >> will allow secondary MMUs to obtain an updated PTE with appropriate >> permissions. >> This problem was discovered during testing of an ARM SMMU >> implementation that does not support broadcast TLB maintenance >> (BTM). In this case the SMMU driver uses notifiers to issue TLB >> invalidates. For read-only to read-write pte upgrades the SMMU >> continually returned a read-only PTE to the device, even though the >> CPU had a read-write PTE installed. >> Sending a mmu notifier event to the SMMU driver fixes the problem by >> flushing secondary TLB entries. A new notifier event type is added so >> drivers may filter out these invalidations if not required. Note a >> driver should never upgrade or install a PTE in response to this mmu >> notifier event as it is not synchronised against other PTE operations. >> Signed-off-by: Alistair Popple <apopple@nvidia.com> >> --- >> include/linux/mmu_notifier.h | 6 +++++ >> mm/hugetlb.c | 24 ++++++++++++++++++- >> mm/memory.c | 45 ++++++++++++++++++++++++++++++++++-- >> 3 files changed, 72 insertions(+), 3 deletions(-) > > Point of order: > > What is this based on? It would really help if you would either use --base > with git-format-patch, or else just mention the base somewhere. Otherwise, > actually applying the patch takes some hunting around...in particular for > older stuff. This is from Feb, 2023, before hugetlb.c got converted to > folios, right? Yes, my bad. This is based on v6.2. I had meant to rebase it prior to sending but forgot. Based on the other review comments though I will be reworking this to put the notifier call in ptep_set_access_flags() on Arm anyway. > thanks,
On 5/22/23 16:50, Alistair Popple wrote: ... >> Again from include/linux/mmu_notifier.h, not implementing the start()/end() hooks >> is perfectly valid. And AFAICT, the existing invalidate_range() hook is pretty >> much a perfect fit for what you want to achieve. > > Right, I didn't take that approach because it doesn't allow an event > type to be passed which would allow them to be filtered on platforms > which don't require this. > > I had also assumed the invalidate_range() callbacks were allowed to > sleep, hence couldn't be called under PTL. That's certainly true of mmu > interval notifier callbacks, but Catalin reminded me that calls such as > ptep_clear_flush_notify() already call invalidate_range() callback under > PTL so I guess we already assume drivers don't sleep in their > invalidate_range() callbacks. I will update the comments to reflect This line of reasoning feels very fragile. The range notifiers generally do allow sleeping. They are using srcu (sleepable RCU) protection, btw. The fact that existing callers are calling these under PTL just means that so far, that has sorta worked. And yes, we can probably make this all work. That's not really the ideal way to deduce the API rules, though, and it would be good to clarify what they really are. Aside from those use cases, I don't see anything justifying a "not allowed to sleep" rule for .invalidate_range(), right? thanks,
John Hubbard <jhubbard@nvidia.com> writes: > On 5/22/23 16:50, Alistair Popple wrote: > ... >>> Again from include/linux/mmu_notifier.h, not implementing the start()/end() hooks >>> is perfectly valid. And AFAICT, the existing invalidate_range() hook is pretty >>> much a perfect fit for what you want to achieve. >> Right, I didn't take that approach because it doesn't allow an event >> type to be passed which would allow them to be filtered on platforms >> which don't require this. >> I had also assumed the invalidate_range() callbacks were allowed to >> sleep, hence couldn't be called under PTL. That's certainly true of mmu >> interval notifier callbacks, but Catalin reminded me that calls such as >> ptep_clear_flush_notify() already call invalidate_range() callback under >> PTL so I guess we already assume drivers don't sleep in their >> invalidate_range() callbacks. I will update the comments to reflect > > This line of reasoning feels very fragile. The range notifiers generally > do allow sleeping. They are using srcu (sleepable RCU) protection, btw. Regardless of how well documented this is or isn't (it isn't currently, but it used to be) it certainly seems to be a well established rule that the .invalidate_range() callback cannot sleep. The vast majority of callers do call this holding the PTL, and comments make it explicit that this is somewhat expected: Eg: In rmap.c: * No need to call mmu_notifier_invalidate_range() it has be * done above for all cases requiring it to happen under page * table lock before mmu_notifier_invalidate_range_end() > The fact that existing callers are calling these under PTL just means > that so far, that has sorta worked. And yes, we can probably make this > all work. That's not really the ideal way to deduce the API rules, though, > and it would be good to clarify what they really are. Of course not. I will update the documentation to clarify this, but see below for some history which clarifies how we got here. > Aside from those use cases, I don't see anything justifying a "not allowed > to sleep" rule for .invalidate_range(), right? Except that "those use cases" are approximately all of the use cases AFAICT, and as it turns out this was actually a rule when .invalidate_range() was added. Commit 0f0a327fa12c ("mmu_notifier: add the callback for mmu_notifier_invalidate_range()") included this in the documentation: * The invalidate_range() function is called under the ptl * spin-lock and not allowed to sleep. This was later removed in 5ff7091f5a2c ("mm, mmu_notifier: annotate mmu notifiers with blockable invalidate callbacks") which introduced the MMU_INVALIDATE_DOES_NOT_BLOCK flag: * If this [invalidate_range()] callback cannot block, and invalidate_range_{start,end} * cannot block, mmu_notifier_ops.flags should have * MMU_INVALIDATE_DOES_NOT_BLOCK set. However the removal of the original comment seems wrong - invalidate_range() was still getting called under the ptl and therefore could not block regardless of if MMU_INVALIDATE_DOES_NOT_BLOCK was set or not. Of course the flag and related documentation was removed shortly after by 93065ac753e4 ("mm, oom: distinguish blockable mode for mmu notifiers") and 4e15a073a168 ("Revert "mm, mmu_notifier: annotate mmu notifiers with blockable invalidate callbacks"") None of those changes actually made it safe for .invalidate_range() callbacks to sleep, nor was that their goal. They were all about making sure it was ok for .invalidate_range_start() to sleep AFAICT. 
So I think it's perfectly fine to require .invalidate_range() callbacks to be non-blocking, and if they are that's a driver bug. Note that this isn't talking about mmu *interval* notifiers - they are slightly different and don't hook into the mmu_notifier_invalidate_range() call. They use start()/end() and as such are allowed to sleep. - Alistair > thanks,
On Mon, May 22, 2023 at 04:37:25PM +1000, Alistair Popple wrote:
> This problem was discovered during testing of an ARM SMMU
> implementation that does not support broadcast TLB maintenance
> (BTM). In this case the SMMU driver uses notifiers to issue TLB
> invalidates. For read-only to read-write pte upgrades the SMMU
> continually returned a read-only PTE to the device, even though the
> CPU had a read-write PTE installed.
>
> Sending a mmu notifier event to the SMMU driver fixes the problem by
> flushing secondary TLB entries. A new notifier event type is added so
> drivers may filter out these invalidations if not required. Note a
> driver should never upgrade or install a PTE in response to this mmu
> notifier event as it is not synchronised against other PTE operations.

I don't see these SMMU driver changes anywhere. I.e. you're adding dead
code as far as I can tell.
diff --git a/include/linux/mmu_notifier.h b/include/linux/mmu_notifier.h
index d6c06e140277..f14d68f119d8 100644
--- a/include/linux/mmu_notifier.h
+++ b/include/linux/mmu_notifier.h
@@ -31,6 +31,11 @@ struct mmu_interval_notifier;
  * pages in the range so to mirror those changes the user must inspect the CPU
  * page table (from the end callback).
  *
+ * @MMU_NOTIFY_PROTECTION_UPGRAGE: update is due to a change from read-only to
+ * read-write for pages in the range. This must not be used to upgrade
+ * permissions on secondary PTEs, rather it should only be used to invalidate
+ * caches such as secondary TLBs that may cache old read-only entries.
+ *
  * @MMU_NOTIFY_SOFT_DIRTY: soft dirty accounting (still same page and same
  * access flags). User should soft dirty the page in the end callback to make
  * sure that anyone relying on soft dirtiness catch pages that might be written
@@ -53,6 +58,7 @@ enum mmu_notifier_event {
 	MMU_NOTIFY_CLEAR,
 	MMU_NOTIFY_PROTECTION_VMA,
 	MMU_NOTIFY_PROTECTION_PAGE,
+	MMU_NOTIFY_PROTECTION_UPGRADE,
 	MMU_NOTIFY_SOFT_DIRTY,
 	MMU_NOTIFY_RELEASE,
 	MMU_NOTIFY_MIGRATE,
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index bdbfeb6fb393..e5d467c7bff7 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -5987,6 +5987,7 @@ vm_fault_t hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
 	vm_fault_t ret;
 	u32 hash;
 	pgoff_t idx;
+	bool changed = false;
 	struct page *page = NULL;
 	struct page *pagecache_page = NULL;
 	struct hstate *h = hstate_vma(vma);
@@ -6122,6 +6123,9 @@ vm_fault_t hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
 	if (!huge_pte_write(entry)) {
 		ret = hugetlb_wp(mm, vma, address, ptep, flags,
 				 pagecache_page, ptl);
+		if (!ret)
+			changed = true;
+
 		goto out_put_page;
 	} else if (likely(flags & FAULT_FLAG_WRITE)) {
 		entry = huge_pte_mkdirty(entry);
@@ -6129,8 +6133,11 @@ vm_fault_t hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
 	}
 	entry = pte_mkyoung(entry);
 	if (huge_ptep_set_access_flags(vma, haddr, ptep, entry,
-						flags & FAULT_FLAG_WRITE))
+						flags & FAULT_FLAG_WRITE)) {
 		update_mmu_cache(vma, haddr, ptep);
+		changed = true;
+	}
+
 out_put_page:
 	if (page != pagecache_page)
 		unlock_page(page);
@@ -6138,6 +6145,21 @@ vm_fault_t hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
 out_ptl:
 	spin_unlock(ptl);
 
+	if (changed) {
+		struct mmu_notifier_range range;
+		unsigned long hpage_mask = huge_page_mask(h);
+		unsigned long hpage_size = huge_page_size(h);
+
+		update_mmu_cache(vma, haddr, ptep);
+
+		mmu_notifier_range_init(&range, MMU_NOTIFY_PROTECTION_UPGRADE,
+					0, vma, mm, haddr & hpage_mask,
+					(haddr & hpage_mask) + hpage_size);
+		mmu_notifier_invalidate_range_start(&range);
+		mmu_notifier_invalidate_range_end(&range);
+	}
+
+
 	if (pagecache_page) {
 		unlock_page(pagecache_page);
 		put_page(pagecache_page);
diff --git a/mm/memory.c b/mm/memory.c
index f526b9152bef..0ac78c6a232c 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -2098,6 +2098,7 @@ static vm_fault_t insert_pfn(struct vm_area_struct *vma, unsigned long addr,
 	struct mm_struct *mm = vma->vm_mm;
 	pte_t *pte, entry;
 	spinlock_t *ptl;
+	bool changed = false;
 
 	pte = get_locked_pte(mm, addr, &ptl);
 	if (!pte)
@@ -2120,8 +2121,10 @@ static vm_fault_t insert_pfn(struct vm_area_struct *vma, unsigned long addr,
 		}
 		entry = pte_mkyoung(*pte);
 		entry = maybe_mkwrite(pte_mkdirty(entry), vma);
-		if (ptep_set_access_flags(vma, addr, pte, entry, 1))
+		if (ptep_set_access_flags(vma, addr, pte, entry, 1)) {
 			update_mmu_cache(vma, addr, pte);
+			changed = true;
+		}
 		}
 		goto out_unlock;
 	}
@@ -2142,6 +2145,17 @@ static vm_fault_t insert_pfn(struct vm_area_struct *vma, unsigned long addr,
 
 out_unlock:
 	pte_unmap_unlock(pte, ptl);
+
+	if (changed) {
+		struct mmu_notifier_range range;
+
+		mmu_notifier_range_init(&range, MMU_NOTIFY_PROTECTION_UPGRADE,
+					0, vma, mm, addr & PAGE_MASK,
+					(addr & PAGE_MASK) + PAGE_SIZE);
+		mmu_notifier_invalidate_range_start(&range);
+		mmu_notifier_invalidate_range_end(&range);
+	}
+
 	return VM_FAULT_NOPAGE;
 }
 
@@ -2820,6 +2834,7 @@ static inline int __wp_page_copy_user(struct page *dst, struct page *src,
 	struct vm_area_struct *vma = vmf->vma;
 	struct mm_struct *mm = vma->vm_mm;
 	unsigned long addr = vmf->address;
+	bool changed = false;
 
 	if (likely(src)) {
 		if (copy_mc_user_highpage(dst, src, addr, vma)) {
@@ -2858,8 +2873,10 @@ static inline int __wp_page_copy_user(struct page *dst, struct page *src,
 		}
 
 		entry = pte_mkyoung(vmf->orig_pte);
-		if (ptep_set_access_flags(vma, addr, vmf->pte, entry, 0))
+		if (ptep_set_access_flags(vma, addr, vmf->pte, entry, 0)) {
 			update_mmu_cache(vma, addr, vmf->pte);
+			changed = true;
+		}
 	}
 
 	/*
@@ -2897,6 +2914,16 @@ static inline int __wp_page_copy_user(struct page *dst, struct page *src,
 		}
 	}
 
+	if (changed) {
+		struct mmu_notifier_range range;
+
+		mmu_notifier_range_init(&range, MMU_NOTIFY_PROTECTION_UPGRADE,
+					0, vma, vma->vm_mm, addr & PAGE_MASK,
+					(addr & PAGE_MASK) + PAGE_SIZE);
+		mmu_notifier_invalidate_range_start(&range);
+		mmu_notifier_invalidate_range_end(&range);
+	}
+
 	ret = 0;
 
 pte_unlock:
@@ -4877,6 +4904,7 @@ static vm_fault_t wp_huge_pud(struct vm_fault *vmf, pud_t orig_pud)
 static vm_fault_t handle_pte_fault(struct vm_fault *vmf)
 {
 	pte_t entry;
+	bool changed = false;
 
 	if (unlikely(pmd_none(*vmf->pmd))) {
 		/*
@@ -4957,6 +4985,7 @@ static vm_fault_t handle_pte_fault(struct vm_fault *vmf)
 	if (ptep_set_access_flags(vmf->vma, vmf->address, vmf->pte, entry,
 				  vmf->flags & FAULT_FLAG_WRITE)) {
 		update_mmu_cache(vmf->vma, vmf->address, vmf->pte);
+		changed = true;
 	} else {
 		/* Skip spurious TLB flush for retried page fault */
 		if (vmf->flags & FAULT_FLAG_TRIED)
@@ -4972,6 +5001,18 @@ static vm_fault_t handle_pte_fault(struct vm_fault *vmf)
 	}
 unlock:
 	pte_unmap_unlock(vmf->pte, vmf->ptl);
+
+	if (changed) {
+		struct mmu_notifier_range range;
+
+		mmu_notifier_range_init(&range, MMU_NOTIFY_PROTECTION_UPGRADE,
+					0, vmf->vma, vmf->vma->vm_mm,
+					vmf->address & PAGE_MASK,
+					(vmf->address & PAGE_MASK) + PAGE_SIZE);
+		mmu_notifier_invalidate_range_start(&range);
+		mmu_notifier_invalidate_range_end(&range);
+	}
+
 	return 0;
 }
Some architectures, specifically ARM and perhaps Sparc and IA64,
require TLB invalidates when upgrading pte permission from read-only
to read-write.

The current mmu_notifier implementation assumes that upgrades do not
need notifications. Typically though mmu_notifiers are used to
implement TLB invalidations for secondary MMUs that comply with the
main CPU architecture.

Therefore if the main CPU architecture requires an invalidation for
permission upgrade the secondary MMU will as well and an mmu_notifier
should be sent for the upgrade.

Currently CPU invalidations for permission upgrade occur in
ptep_set_access_flags(). Unfortunately MMU notifiers cannot be called
directly from this architecture specific code as the notifier
callbacks can sleep, and ptep_set_access_flags() is usually called
whilst holding the PTL spinlock. Therefore add the notifier calls
after the PTL is dropped and only if the PTE actually changed. This
will allow secondary MMUs to obtain an updated PTE with appropriate
permissions.

This problem was discovered during testing of an ARM SMMU
implementation that does not support broadcast TLB maintenance
(BTM). In this case the SMMU driver uses notifiers to issue TLB
invalidates. For read-only to read-write pte upgrades the SMMU
continually returned a read-only PTE to the device, even though the
CPU had a read-write PTE installed.

Sending a mmu notifier event to the SMMU driver fixes the problem by
flushing secondary TLB entries. A new notifier event type is added so
drivers may filter out these invalidations if not required. Note a
driver should never upgrade or install a PTE in response to this mmu
notifier event as it is not synchronised against other PTE operations.

Signed-off-by: Alistair Popple <apopple@nvidia.com>
---
 include/linux/mmu_notifier.h |  6 +++++
 mm/hugetlb.c                 | 24 ++++++++++++++++++-
 mm/memory.c                  | 45 ++++++++++++++++++++++++++++++++++--
 3 files changed, 72 insertions(+), 3 deletions(-)