| Message ID | 20211021122112.592634-3-namit@vmware.com |
|---|---|
| State | New |
| Series | mm/mprotect: avoid unnecessary TLB flushes |
On Thu, Oct 21, 2021 at 05:21:09AM -0700, Nadav Amit wrote:
> diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h
> index 448cd01eb3ec..18c3366f8f4d 100644
> --- a/arch/x86/include/asm/pgtable.h
> +++ b/arch/x86/include/asm/pgtable.h
> @@ -1146,6 +1146,14 @@ static inline pmd_t pmdp_establish(struct vm_area_struct *vma,
>  	}
>  }
>  #endif
> +
> +#define __HAVE_ARCH_PMDP_INVALIDATE_AD
> +static inline pmd_t pmdp_invalidate_ad(struct vm_area_struct *vma,
> +		unsigned long address, pmd_t *pmdp)
> +{
> +	return pmdp_establish(vma, address, pmdp, pmd_mkinvalid(*pmdp));

Did this want to be something like:

	pmd_t old = pmdp_establish(vma, address, pmdp, pmd_mkinvalid(*pmdp));
	if (cpu_feature_enabled(X86_BUG_PTE_LEAK))
		flush_pmd_tlb_range(vma, address, address + HPAGE_PMD_SIZE);
	return old;

instead?

> +}
> +
>  /*
>   * Page table pages are page-aligned. The lower half of the top
>   * level is used for userspace and the top half for the kernel.

> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index e5ea5f775d5c..435da011b1a2 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -1795,10 +1795,11 @@ int change_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd,
>  	 * The race makes MADV_DONTNEED miss the huge pmd and don't clear it
>  	 * which may break userspace.
>  	 *
> -	 * pmdp_invalidate() is required to make sure we don't miss
> -	 * dirty/young flags set by hardware.
> +	 * pmdp_invalidate_ad() is required to make sure we don't miss
> +	 * dirty/young flags (which are also known as access/dirty) cannot be
> +	 * further modifeid by the hardware.

"modified", I think is the more common spelling.

>  	 */
> -	entry = pmdp_invalidate(vma, addr, pmd);
> +	entry = pmdp_invalidate_ad(vma, addr, pmd);
>
>  	entry = pmd_modify(entry, newprot);
>  	if (preserve_write)
> diff --git a/mm/pgtable-generic.c b/mm/pgtable-generic.c
> index 4e640baf9794..b0ce6c7391bf 100644
> --- a/mm/pgtable-generic.c
> +++ b/mm/pgtable-generic.c
> @@ -200,6 +200,14 @@ pmd_t pmdp_invalidate(struct vm_area_struct *vma, unsigned long address,
>  }
>  #endif
>
> +#ifndef __HAVE_ARCH_PMDP_INVALIDATE_AD

/*
 * Does this deserve a comment to explain the intended difference vs
 * pmdp_invalidate() ?
 */

> +pmd_t pmdp_invalidate_ad(struct vm_area_struct *vma, unsigned long address,
> +		pmd_t *pmdp)
> +{
> +	return pmdp_invalidate(vma, address, pmdp);
> +}
> +#endif
> +
>  #ifndef pmdp_collapse_flush
>  pmd_t pmdp_collapse_flush(struct vm_area_struct *vma, unsigned long address,
>  		pmd_t *pmdp)
> --
> 2.25.1
> On Oct 25, 2021, at 3:52 AM, Peter Zijlstra <peterz@infradead.org> wrote:
>
> On Thu, Oct 21, 2021 at 05:21:09AM -0700, Nadav Amit wrote:
>> diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h
>> index 448cd01eb3ec..18c3366f8f4d 100644
>> --- a/arch/x86/include/asm/pgtable.h
>> +++ b/arch/x86/include/asm/pgtable.h
>> @@ -1146,6 +1146,14 @@ static inline pmd_t pmdp_establish(struct vm_area_struct *vma,
>>  	}
>>  }
>>  #endif
>> +
>> +#define __HAVE_ARCH_PMDP_INVALIDATE_AD
>> +static inline pmd_t pmdp_invalidate_ad(struct vm_area_struct *vma,
>> +		unsigned long address, pmd_t *pmdp)
>> +{
>> +	return pmdp_establish(vma, address, pmdp, pmd_mkinvalid(*pmdp));
>
> Did this want to be something like:
>
> 	pmd_t old = pmdp_establish(vma, address, pmdp, pmd_mkinvalid(*pmdp));
> 	if (cpu_feature_enabled(X86_BUG_PTE_LEAK))
> 		flush_pmd_tlb_range(vma, address, address + HPAGE_PMD_SIZE);
> 	return old;
>
> instead?

Yes. Of course. Where did my code go to? :(

>> +}
>> +
>>  /*
>>   * Page table pages are page-aligned. The lower half of the top
>>   * level is used for userspace and the top half for the kernel.
>
>> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
>> index e5ea5f775d5c..435da011b1a2 100644
>> --- a/mm/huge_memory.c
>> +++ b/mm/huge_memory.c
>> @@ -1795,10 +1795,11 @@ int change_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd,
>>  	 * The race makes MADV_DONTNEED miss the huge pmd and don't clear it
>>  	 * which may break userspace.
>>  	 *
>> -	 * pmdp_invalidate() is required to make sure we don't miss
>> -	 * dirty/young flags set by hardware.
>> +	 * pmdp_invalidate_ad() is required to make sure we don't miss
>> +	 * dirty/young flags (which are also known as access/dirty) cannot be
>> +	 * further modifeid by the hardware.
>
> "modified", I think is the more common spelling.

I tried to start a new trend. I will fix it.

>>  	 */
>> -	entry = pmdp_invalidate(vma, addr, pmd);
>> +	entry = pmdp_invalidate_ad(vma, addr, pmd);
>>
>>  	entry = pmd_modify(entry, newprot);
>>  	if (preserve_write)
>> diff --git a/mm/pgtable-generic.c b/mm/pgtable-generic.c
>> index 4e640baf9794..b0ce6c7391bf 100644
>> --- a/mm/pgtable-generic.c
>> +++ b/mm/pgtable-generic.c
>> @@ -200,6 +200,14 @@ pmd_t pmdp_invalidate(struct vm_area_struct *vma, unsigned long address,
>>  }
>>  #endif
>>
>> +#ifndef __HAVE_ARCH_PMDP_INVALIDATE_AD
>
> /*
> * Does this deserve a comment to explain the intended difference vs
> * pmdp_invalidate() ?
> */

I will add a comment.
On 10/21/21 5:21 AM, Nadav Amit wrote:
> The first TLB flush is only necessary to prevent the dirty bit (and with
> a lesser importance the access bit) from changing while the PTE is
> modified. However, this is not necessary as the x86 CPUs set the
> dirty-bit atomically with an additional check that the PTE is (still)
> present. One caveat is Intel's Knights Landing that has a bug and does
> not do so.

First, did I miss the check in this patch for X86_BUG_PTE_LEAK?  I don't
see it anywhere.

> -	 * pmdp_invalidate() is required to make sure we don't miss
> -	 * dirty/young flags set by hardware.

This got me thinking...  In here:

> https://lore.kernel.org/lkml/20160708001909.FB2443E2@viggo.jf.intel.com/

I wrote:

> These bits are truly "stray".  In the case of the Dirty bit, the
> thread associated with the stray set was *not* allowed to write to
> the page.  This means that we do not have to launder the bit(s); we
> can simply ignore them.

Is the goal of your proposed patch here to ensure that the dirty bit is
not set at *all*?  Or, is it to ensure that a dirty bit which we need to
*launder* is never set?
> On Oct 26, 2021, at 9:06 AM, Dave Hansen <dave.hansen@intel.com> wrote:
>
> On 10/21/21 5:21 AM, Nadav Amit wrote:
>> The first TLB flush is only necessary to prevent the dirty bit (and with
>> a lesser importance the access bit) from changing while the PTE is
>> modified. However, this is not necessary as the x86 CPUs set the
>> dirty-bit atomically with an additional check that the PTE is (still)
>> present. One caveat is Intel's Knights Landing that has a bug and does
>> not do so.
>
> First, did I miss the check in this patch for X86_BUG_PTE_LEAK?  I don't
> see it anywhere.

No, it is me who missed it. It should have been in pmdp_invalidate_ad():

diff --git a/arch/x86/mm/pgtable.c b/arch/x86/mm/pgtable.c
index 3481b35cb4ec..f14f64cc17b5 100644
--- a/arch/x86/mm/pgtable.c
+++ b/arch/x86/mm/pgtable.c
@@ -780,6 +780,30 @@ int pmd_clear_huge(pmd_t *pmd)
 	return 0;
 }
 
+/*
+ * pmdp_invalidate_ad() - prevents the access and dirty bits from being
+ * further updated by the CPU.
+ *
+ * Returns the original PTE.
+ *
+ * During an access to a page, x86 CPUs set the dirty and access bit
+ * atomically with an additional check of the present-bit. Therefore, it
+ * is possible to avoid the TLB flush if we change the PTE atomically, as
+ * pmdp_establish does.
+ *
+ * We do not make this optimization on certain CPUs that have a bug that
+ * violates this behavior (specifically Knights Landing).
+ */
+pmd_t pmdp_invalidate_ad(struct vm_area_struct *vma, unsigned long address,
+			 pmd_t *pmdp)
+{
+	pmd_t old = pmdp_establish(vma, address, pmdp, pmd_mkinvalid(*pmdp));
+
+	if (cpu_feature_enabled(X86_BUG_PTE_LEAK))
+		flush_pmd_tlb_range(vma, address, address + HPAGE_PMD_SIZE);
+	return old;
+}

>
>> -	 * pmdp_invalidate() is required to make sure we don't miss
>> -	 * dirty/young flags set by hardware.
>
> This got me thinking...  In here:
>
>> https://lore.kernel.org/lkml/20160708001909.FB2443E2@viggo.jf.intel.com/
>
> I wrote:
>
>> These bits are truly "stray".  In the case of the Dirty bit, the
>> thread associated with the stray set was *not* allowed to write to
>> the page.  This means that we do not have to launder the bit(s); we
>> can simply ignore them.
>
> Is the goal of your proposed patch here to ensure that the dirty bit is
> not set at *all*?  Or, is it to ensure that a dirty bit which we need to
> *launder* is never set?

At *all*.

Err… I remembered from our previous discussions that the dirty bit cannot
be set once the R/W bit is cleared atomically. But going back to the SDM,
I see the (relatively new?) note:

"If software on one logical processor writes to a page while software on
another logical processor concurrently clears the R/W flag in the
paging-structure entry that maps the page, execution on some processors may
result in the entry’s dirty flag being set (due to the write on the first
logical processor) and the entry’s R/W flag being clear (due to the update
to the entry on the second logical processor). This will never occur on a
processor that supports control-flow enforcement technology (CET)”

So I guess that this optimization can only be enabled when CET is enabled.

:(
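For illustration, the gating Nadav concludes would be needed could be folded into his function above. This is only a sketch of that mid-thread conclusion; X86_FEATURE_CET below is a placeholder name for whatever flag indicates the SDM's CET-era atomicity guarantee, not a real flag from this patch set:

/*
 * Sketch only: skip the flush only when the CPU is known to recheck the
 * PTE atomically while setting the dirty bit. X86_FEATURE_CET is an
 * assumed placeholder name.
 */
pmd_t pmdp_invalidate_ad(struct vm_area_struct *vma, unsigned long address,
			 pmd_t *pmdp)
{
	pmd_t old = pmdp_establish(vma, address, pmdp, pmd_mkinvalid(*pmdp));

	/* Flush on buggy (KNL) or pre-CET parts. */
	if (cpu_feature_enabled(X86_BUG_PTE_LEAK) ||
	    !cpu_feature_enabled(X86_FEATURE_CET))
		flush_pmd_tlb_range(vma, address, address + HPAGE_PMD_SIZE);
	return old;
}

As the rest of the thread shows, this extra condition turns out to be unnecessary for the present-bit case; the sketch reflects the concern at this point in the discussion.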
> On Oct 26, 2021, at 9:47 AM, Nadav Amit <namit@vmware.com> wrote:
>
>> On Oct 26, 2021, at 9:06 AM, Dave Hansen <dave.hansen@intel.com> wrote:
>>
>> On 10/21/21 5:21 AM, Nadav Amit wrote:
>>> The first TLB flush is only necessary to prevent the dirty bit (and with
>>> a lesser importance the access bit) from changing while the PTE is
>>> modified. However, this is not necessary as the x86 CPUs set the
>>> dirty-bit atomically with an additional check that the PTE is (still)
>>> present. One caveat is Intel's Knights Landing that has a bug and does
>>> not do so.
>>
>> First, did I miss the check in this patch for X86_BUG_PTE_LEAK?  I don't
>> see it anywhere.
>
> No, it is me who missed it. It should have been in pmdp_invalidate_ad():
>
> diff --git a/arch/x86/mm/pgtable.c b/arch/x86/mm/pgtable.c
> index 3481b35cb4ec..f14f64cc17b5 100644
> --- a/arch/x86/mm/pgtable.c
> +++ b/arch/x86/mm/pgtable.c
> @@ -780,6 +780,30 @@ int pmd_clear_huge(pmd_t *pmd)
>  	return 0;
>  }
>  
> +/*
> + * pmdp_invalidate_ad() - prevents the access and dirty bits from being
> + * further updated by the CPU.
> + *
> + * Returns the original PTE.
> + *
> + * During an access to a page, x86 CPUs set the dirty and access bit
> + * atomically with an additional check of the present-bit. Therefore, it
> + * is possible to avoid the TLB flush if we change the PTE atomically, as
> + * pmdp_establish does.
> + *
> + * We do not make this optimization on certain CPUs that have a bug that
> + * violates this behavior (specifically Knights Landing).
> + */
> +pmd_t pmdp_invalidate_ad(struct vm_area_struct *vma, unsigned long address,
> +			 pmd_t *pmdp)
> +{
> +	pmd_t old = pmdp_establish(vma, address, pmdp, pmd_mkinvalid(*pmdp));
> +
> +	if (cpu_feature_enabled(X86_BUG_PTE_LEAK))
> +		flush_pmd_tlb_range(vma, address, address + HPAGE_PMD_SIZE);
> +	return old;
> +}
>
>>
>>> -	 * pmdp_invalidate() is required to make sure we don't miss
>>> -	 * dirty/young flags set by hardware.
>>
>> This got me thinking...  In here:
>>
>>> https://lore.kernel.org/lkml/20160708001909.FB2443E2@viggo.jf.intel.com/
>>
>> I wrote:
>>
>>> These bits are truly "stray".  In the case of the Dirty bit, the
>>> thread associated with the stray set was *not* allowed to write to
>>> the page.  This means that we do not have to launder the bit(s); we
>>> can simply ignore them.
>>
>> Is the goal of your proposed patch here to ensure that the dirty bit is
>> not set at *all*?  Or, is it to ensure that a dirty bit which we need to
>> *launder* is never set?
>
> At *all*.
>
> Err… I remembered from our previous discussions that the dirty bit cannot
> be set once the R/W bit is cleared atomically. But going back to the SDM,
> I see the (relatively new?) note:
>
> "If software on one logical processor writes to a page while software on
> another logical processor concurrently clears the R/W flag in the
> paging-structure entry that maps the page, execution on some processors may
> result in the entry’s dirty flag being set (due to the write on the first
> logical processor) and the entry’s R/W flag being clear (due to the update
> to the entry on the second logical processor). This will never occur on a
> processor that supports control-flow enforcement technology (CET)”
>
> So I guess that this optimization can only be enabled when CET is enabled.
>
> :(

I still wonder whether the SDM comment applies to present bit vs dirty
bit atomicity as well.

On AMD’s APM I find:

"The processor never sets the Accessed bit or the Dirty bit for a not
present page (P = 0). The ordering of Accessed and Dirty bit updates
with respect to surrounding loads and stores is discussed below."

( The later comment regards ordering to WC memory ).

I don’t know if I read it too creatively...
On 10/26/21 10:44 AM, Nadav Amit wrote:
>> "If software on one logical processor writes to a page while software on
>> another logical processor concurrently clears the R/W flag in the
>> paging-structure entry that maps the page, execution on some processors may
>> result in the entry’s dirty flag being set (due to the write on the first
>> logical processor) and the entry’s R/W flag being clear (due to the update
>> to the entry on the second logical processor). This will never occur on a
>> processor that supports control-flow enforcement technology (CET)”
>>
>> So I guess that this optimization can only be enabled when CET is enabled.
>>
>> :(
> I still wonder whether the SDM comment applies to present bit vs dirty
> bit atomicity as well.

I think it's implicit.  From "4.8 ACCESSED AND DIRTY FLAGS":

	"Whenever there is a write to a linear address, the processor
	 sets the dirty flag (if it is not already set) in the paging-
	 structure entry"

There can't be a "write to a linear address" without a Present=1 PTE.
If it were a Dirty=1,Present=1 PTE, there's no race because there might
not be a write to the PTE at all.

There's also this from the "4.10.4.3 Optional Invalidation" section:

	"no TLB entry or paging-structure cache entry is created with
	 information from a paging-structure entry in which the P flag
	 is 0."

That means that we don't have to worry about the TLB doing something
bonkers like caching a Dirty=1 bit from a Present=0 PTE.

Is that what you were worried about?
> On Oct 26, 2021, at 11:44 AM, Dave Hansen <dave.hansen@intel.com> wrote:
>
> On 10/26/21 10:44 AM, Nadav Amit wrote:
>>> "If software on one logical processor writes to a page while software on
>>> another logical processor concurrently clears the R/W flag in the
>>> paging-structure entry that maps the page, execution on some processors may
>>> result in the entry’s dirty flag being set (due to the write on the first
>>> logical processor) and the entry’s R/W flag being clear (due to the update
>>> to the entry on the second logical processor). This will never occur on a
>>> processor that supports control-flow enforcement technology (CET)”
>>>
>>> So I guess that this optimization can only be enabled when CET is enabled.
>>>
>>> :(
>> I still wonder whether the SDM comment applies to present bit vs dirty
>> bit atomicity as well.
>
> I think it's implicit.  From "4.8 ACCESSED AND DIRTY FLAGS":
>
>	"Whenever there is a write to a linear address, the processor
>	 sets the dirty flag (if it is not already set) in the paging-
>	 structure entry"
>
> There can't be a "write to a linear address" without a Present=1 PTE.
> If it were a Dirty=1,Present=1 PTE, there's no race because there might
> not be a write to the PTE at all.
>
> There's also this from the "4.10.4.3 Optional Invalidation" section:
>
>	"no TLB entry or paging-structure cache entry is created with
>	 information from a paging-structure entry in which the P flag
>	 is 0."
>
> That means that we don't have to worry about the TLB doing something
> bonkers like caching a Dirty=1 bit from a Present=0 PTE.
>
> Is that what you were worried about?

Thanks Dave, but no - that is not my concern.

To make it very clear - consider the following scenario, in which
a volatile pointer p is mapped using a certain PTE, which is RW
(i.e., *p is writable):

	CPU0				CPU1
	----				----
	x = *p
	[ PTE cached in TLB;
	  PTE is not dirty ]
					clear_pte(PTE)
	*p = x
	[ needs to set dirty ]

Note that there is no TLB flush in this scenario. The question
is whether the write access to *p would succeed, setting the
dirty bit on the clear, non-present entry.

I was under the impression that the hardware AD-assist would
recheck the PTE atomically as it sets the dirty bit. But, as I
said, I am not sure anymore whether this is defined architecturally
(or at least would work in practice on all CPUs modulo the
Knights Landing thingy).
On 10/26/21 12:06 PM, Nadav Amit wrote:
>
> To make it very clear - consider the following scenario, in which
> a volatile pointer p is mapped using a certain PTE, which is RW
> (i.e., *p is writable):
>
>	CPU0				CPU1
>	----				----
>	x = *p
>	[ PTE cached in TLB;
>	  PTE is not dirty ]
>					clear_pte(PTE)
>	*p = x
>	[ needs to set dirty ]
>
> Note that there is no TLB flush in this scenario. The question
> is whether the write access to *p would succeed, setting the
> dirty bit on the clear, non-present entry.
>
> I was under the impression that the hardware AD-assist would
> recheck the PTE atomically as it sets the dirty bit. But, as I
> said, I am not sure anymore whether this is defined architecturally
> (or at least would work in practice on all CPUs modulo the
> Knights Landing thingy).

Practically, at "x=*p", the thing that gets cached in the TLB will be
Dirty=0.  At the "*p=x", the CPU will decide it needs to do a write,
find the Dirty=0 entry and will entirely discard it.  In other words,
it *acts* roughly like this:

	x = *p
	INVLPG(p)
	*p = x;

where the INVLPG() and the "*p = x" are atomic.  So, there's no
_practical_ problem with your scenario.  This specific behavior isn't
architectural as far as I know, though.

Although it's pretty much just academic, as for the architecture, are
you getting hung up on the difference between the description of
"Accessed":

	Whenever the processor uses a paging-structure entry as part of
	linear-address translation, it sets the accessed flag in that
	entry

and "Dirty":

	Whenever there is a write to a linear address, the processor
	sets the dirty flag (if it is not already set) in the paging-
	structure entry...

Accessed says "as part of linear-address translation", which means that
the address must have a translation.  But, the "Dirty" section doesn't
say that.  It talks about "a write to a linear address" but not whether
there is a linear address *translation* involved.

If that's it, we could probably add a bit like:

	In addition to setting the accessed flag, whenever there is a
	write...

before the dirty rules in the SDM.

Or am I being dense and continuing to miss your point? :)
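Dave's "acts roughly like" sequence can be made concrete with a small user-space model. This is a sketch only: the bit positions and all names are invented, and it models the well-behaved CPUs, not the KNL erratum. The "hardware" sets Dirty with a compare-and-swap that rechecks Present; "software" invalidates the entry with an atomic exchange, as pmdp_establish() does. Either the dirty bit lands before the exchange and is visible in the returned old value, or the assist gives up and the access refaults, so a non-present entry never ends up dirty:

/* Build with: cc -pthread model.c */
#include <pthread.h>
#include <stdatomic.h>
#include <stdint.h>
#include <stdio.h>

#define PTE_P (1ULL << 0)	/* present (invented layout) */
#define PTE_D (1ULL << 6)	/* dirty */

static _Atomic uint64_t pte = PTE_P;	/* present, clean */

/* "CPU0": the A/D assist setting the dirty bit for a write. */
static void *hw_set_dirty(void *arg)
{
	uint64_t old = atomic_load(&pte);

	do {
		if (!(old & PTE_P))	/* entry gone: refault, no stray bit */
			return NULL;
	} while (!atomic_compare_exchange_weak(&pte, &old, old | PTE_D));
	return NULL;
}

/* "CPU1": software invalidating the entry (pmd_mkinvalid analogue). */
static void *sw_invalidate(void *arg)
{
	uint64_t old = atomic_exchange(&pte, 0);

	/* Any dirty bit set before the exchange is visible here. */
	printf("software saw dirty=%d\n", !!(old & PTE_D));
	return NULL;
}

int main(void)
{
	pthread_t t0, t1;

	pthread_create(&t0, NULL, hw_set_dirty, NULL);
	pthread_create(&t1, NULL, sw_invalidate, NULL);
	pthread_join(t0, NULL);
	pthread_join(t1, NULL);

	/* Invariant: the invalidated entry never carries a stray dirty bit. */
	printf("final dirty=%d present=%d\n",
	       !!(atomic_load(&pte) & PTE_D), !!(atomic_load(&pte) & PTE_P));
	return 0;
}

Over many runs, "software saw dirty=" varies with the interleaving, but the final line never reports a dirty, non-present entry, which is the property the patch relies on.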
> On Oct 26, 2021, at 12:40 PM, Dave Hansen <dave.hansen@intel.com> wrote:
>
> On 10/26/21 12:06 PM, Nadav Amit wrote:
>>
>> To make it very clear - consider the following scenario, in which
>> a volatile pointer p is mapped using a certain PTE, which is RW
>> (i.e., *p is writable):
>>
>>	CPU0				CPU1
>>	----				----
>>	x = *p
>>	[ PTE cached in TLB;
>>	  PTE is not dirty ]
>>					clear_pte(PTE)
>>	*p = x
>>	[ needs to set dirty ]
>>
>> Note that there is no TLB flush in this scenario. The question
>> is whether the write access to *p would succeed, setting the
>> dirty bit on the clear, non-present entry.
>>
>> I was under the impression that the hardware AD-assist would
>> recheck the PTE atomically as it sets the dirty bit. But, as I
>> said, I am not sure anymore whether this is defined architecturally
>> (or at least would work in practice on all CPUs modulo the
>> Knights Landing thingy).
>
> Practically, at "x=*p", the thing that gets cached in the TLB will be
> Dirty=0.  At the "*p=x", the CPU will decide it needs to do a write,
> find the Dirty=0 entry and will entirely discard it.  In other words,
> it *acts* roughly like this:
>
>	x = *p
>	INVLPG(p)
>	*p = x;
>
> where the INVLPG() and the "*p = x" are atomic.  So, there's no
> _practical_ problem with your scenario.  This specific behavior isn't
> architectural as far as I know, though.
>
> Although it's pretty much just academic, as for the architecture, are
> you getting hung up on the difference between the description of
> "Accessed":
>
>	Whenever the processor uses a paging-structure entry as part of
>	linear-address translation, it sets the accessed flag in that
>	entry
>
> and "Dirty":
>
>	Whenever there is a write to a linear address, the processor
>	sets the dirty flag (if it is not already set) in the paging-
>	structure entry...
>
> Accessed says "as part of linear-address translation", which means that
> the address must have a translation.  But, the "Dirty" section doesn't
> say that.  It talks about "a write to a linear address" but not whether
> there is a linear address *translation* involved.
>
> If that's it, we could probably add a bit like:
>
>	In addition to setting the accessed flag, whenever there is a
>	write...
>
> before the dirty rules in the SDM.
>
> Or am I being dense and continuing to miss your point? :)

I think this time you got my question right. I was thrown off by the
SDM comment on RW permissions vs dirty that I mentioned before:

"If software on one logical processor writes to a page while software on
another logical processor concurrently clears the R/W flag in the
paging-structure entry that maps the page, execution on some processors
may result in the entry’s dirty flag being set (due to the write on the
first logical processor) and the entry’s R/W flag being clear (due to the
update to the entry on the second logical processor).”

I did not pay enough attention to these small differences that you
mentioned between access and dirty this time (although I did notice them
before).

I do not think that the change that you offered to the SDM really
clarifies the situation. Setting the access flag is done as part of
caching the PTE in the TLB. The SDM change you propose does not clarify
the atomicity of the permission/PTE-validity check and dirty-bit setting,
or the fact that the PTE is invalidated if the dirty bit needs to be set
and is cached as clear. [I do not presume you would want the latter in
the SDM, since it is an implementation detail.]

I just wonder how come the R/W-clearing and the P-clearing cause
concurrent dirty bit setting to behave differently. I am not a hardware
guy, but I would imagine they would be the same...
On 10/26/21 1:07 PM, Nadav Amit wrote:
> I just wonder how come the R/W-clearing and the P-clearing cause
> concurrent dirty bit setting to behave differently. I am not a hardware
> guy, but I would imagine they would be the same...

First of all, I think the non-atomic properties where a PTE can go:

	W=1,D=0 // original
	W=0,D=0 // software clears W
	W=0,D=1 // hardware sets D

were a total implementation accident.  It wasn't someone being clever,
and since the behavior was architecturally allowed and well-tolerated by
software it was around for a while.

I think I was the one that asked that it get fixed for shadow stacks,
and nobody pushed back on it too hard as far as I remember.  I don't
think it was super hard to fix.

Why do the Present/Accessed and Write/Dirty pairs act differently?  I
think it's a total implementation accident and wasn't by design.

The KNL erratum was an erratum and wasn't codified in the architecture
because it actually broke things.  The pre-CET Write/Dirty behavior
didn't break software to a level it was considered an erratum.  It gets
to live on as allowed in the architecture.
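To illustrate Dave's W=1,D=0 -> W=0,D=0 -> W=0,D=1 sequence, the model above only needs its assist weakened. This fragment reuses the invented PTE_P/PTE_D bits and pthread scaffolding from the earlier sketch; it models the pre-CET outcome (a blind OR of the dirty bit that rechecks nothing), not the actual micro-architectural mechanism:

#define PTE_W (1ULL << 1)	/* writable (invented layout) */

static _Atomic uint64_t pte2 = PTE_P | PTE_W;	/* present, writable, clean */

/* Pre-CET-style assist: sets Dirty without rechecking the rest of the
 * entry, unlike the CAS loop in the earlier model. */
static void *hw_set_dirty_blind(void *arg)
{
	atomic_fetch_or(&pte2, PTE_D);
	return NULL;
}

/* Software clearing the R/W flag, as in the SDM note Nadav quotes. */
static void *sw_clear_write(void *arg)
{
	atomic_fetch_and(&pte2, ~PTE_W);	/* clear W, keep P */
	return NULL;
}

/* Run concurrently: the dirty bit can land after W is already clear,
 * leaving the architecturally-allowed W=0,D=1 "stray" combination. */

The contrast between the two assists (CAS with a Present recheck versus blind fetch_or) is the difference Dave describes between the Present/Accessed pair and the pre-CET Write/Dirty pair.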
diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h
index 448cd01eb3ec..18c3366f8f4d 100644
--- a/arch/x86/include/asm/pgtable.h
+++ b/arch/x86/include/asm/pgtable.h
@@ -1146,6 +1146,14 @@ static inline pmd_t pmdp_establish(struct vm_area_struct *vma,
 	}
 }
 #endif
+
+#define __HAVE_ARCH_PMDP_INVALIDATE_AD
+static inline pmd_t pmdp_invalidate_ad(struct vm_area_struct *vma,
+		unsigned long address, pmd_t *pmdp)
+{
+	return pmdp_establish(vma, address, pmdp, pmd_mkinvalid(*pmdp));
+}
+
 /*
  * Page table pages are page-aligned. The lower half of the top
  * level is used for userspace and the top half for the kernel.
diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
index e24d2c992b11..622efe0a9ef0 100644
--- a/include/linux/pgtable.h
+++ b/include/linux/pgtable.h
@@ -561,6 +561,11 @@ extern pmd_t pmdp_invalidate(struct vm_area_struct *vma, unsigned long address,
 			     pmd_t *pmdp);
 #endif
 
+#ifndef __HAVE_ARCH_PMDP_INVALIDATE_AD
+extern pmd_t pmdp_invalidate_ad(struct vm_area_struct *vma,
+				unsigned long address, pmd_t *pmdp);
+#endif
+
 #ifndef __HAVE_ARCH_PTE_SAME
 static inline int pte_same(pte_t pte_a, pte_t pte_b)
 {
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index e5ea5f775d5c..435da011b1a2 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1795,10 +1795,11 @@ int change_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd,
 	 * The race makes MADV_DONTNEED miss the huge pmd and don't clear it
 	 * which may break userspace.
 	 *
-	 * pmdp_invalidate() is required to make sure we don't miss
-	 * dirty/young flags set by hardware.
+	 * pmdp_invalidate_ad() is required to make sure the dirty/young
+	 * flags (which are also known as access/dirty) cannot be further
+	 * modified by the hardware.
 	 */
-	entry = pmdp_invalidate(vma, addr, pmd);
+	entry = pmdp_invalidate_ad(vma, addr, pmd);
 
 	entry = pmd_modify(entry, newprot);
 	if (preserve_write)
diff --git a/mm/pgtable-generic.c b/mm/pgtable-generic.c
index 4e640baf9794..b0ce6c7391bf 100644
--- a/mm/pgtable-generic.c
+++ b/mm/pgtable-generic.c
@@ -200,6 +200,14 @@ pmd_t pmdp_invalidate(struct vm_area_struct *vma, unsigned long address,
 }
 #endif
 
+#ifndef __HAVE_ARCH_PMDP_INVALIDATE_AD
+pmd_t pmdp_invalidate_ad(struct vm_area_struct *vma, unsigned long address,
+			 pmd_t *pmdp)
+{
+	return pmdp_invalidate(vma, address, pmdp);
+}
+#endif
+
 #ifndef pmdp_collapse_flush
 pmd_t pmdp_collapse_flush(struct vm_area_struct *vma, unsigned long address,
 			  pmd_t *pmdp)
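The structure of the patch follows the kernel's usual __HAVE_ARCH_* convention: the x86 header defines the macro and supplies an inline override, and the generic definition in mm/pgtable-generic.c is compiled only when no architecture provided one. A minimal stand-alone illustration of that pattern, with all names invented for the demo:

#include <stdio.h>

/* "arch header": an architecture with a cheaper implementation defines
 * the __HAVE_ARCH_* macro and supplies its own inline. */
#define __HAVE_ARCH_FROBNICATE
static inline int frobnicate(int x)
{
	return x * 2;		/* "arch-optimized" version */
}

/* "generic code": only built when no architecture override exists. */
#ifndef __HAVE_ARCH_FROBNICATE
static int frobnicate(int x)
{
	return x + x;		/* generic fallback */
}
#endif

int main(void)
{
	printf("%d\n", frobnicate(21));	/* resolves to the arch version */
	return 0;
}

Because the dispatch happens at preprocessing time, architectures that cannot skip the flush simply never define the macro and automatically fall back to pmdp_invalidate(), which is exactly what the mm/pgtable-generic.c hunk above provides.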