Message ID | 20231204172646.2541916-1-jthoughton@google.com (mailing list archive) |
---|---|
Headers | show |
Series | arm64: hugetlb: Fix page fault loop for sw-dirty/hw-clean contiguous PTEs | expand |
On 04/12/2023 17:26, James Houghton wrote: > It is currently possible for a userspace application to enter a page > fault loop when using HugeTLB pages implemented with contiguous PTEs > when HAFDBS is not available. This happens because: > 1. The kernel may sometimes write PTEs that are sw-dirty but hw-clean > (PTE_DIRTY | PTE_RDONLY | PTE_WRITE). Hi James, Do you know how this happens? AFAIK, this is the set of valid bit combinations, and PTE_RDONLY|PTE_WRITE|PTE_DIRTY is not one of them. Perhaps the real solution is to understand how this is happening and prevent it? /* * PTE bits configuration in the presence of hardware Dirty Bit Management * (PTE_WRITE == PTE_DBM): * * Dirty Writable | PTE_RDONLY PTE_WRITE PTE_DIRTY (sw) * 0 0 | 1 0 0 * 0 1 | 1 1 0 * 1 0 | 1 0 1 * 1 1 | 0 1 x * * When hardware DBM is not present, the sofware PTE_DIRTY bit is updated via * the page fault mechanism. Checking the dirty status of a pte becomes: * * PTE_DIRTY || (PTE_WRITE && !PTE_RDONLY) */ Thanks, Ryan > 2. If, during a write, the CPU uses a sw-dirty, hw-clean PTE in handling > the memory access on a system without HAFDBS, we will get a page > fault. > 3. HugeTLB will check if it needs to update the dirty bits on the PTE. > For contiguous PTEs, it will check to see if the pgprot bits need > updating. In this case, HugeTLB wants to write a sequence of > sw-dirty, hw-dirty PTEs, but it finds that all the PTEs it is about > to overwrite are all pte_dirty() (pte_sw_dirty() => pte_dirty()), > so it thinks no update is necessary. > > Please see this[1] reproducer. > > I think (though I may be wrong) that both step (1) and step (3) are > buggy. > > The first patch in this series fixes step (3); instead of checking if > pte_dirty is matching in __cont_access_flags_changed, check pte_hw_dirty > and pte_sw_dirty separately. > > The second patch in this series makes step (1) less likely to occur. > Without this patch, we can get the kernel to write a sw-dirty, hw-clean > PTE with the following steps (showing the relevant VMA flags and pgprot > bits): > i. Create a valid, writable contiguous PTE. > VMA vmflags: VM_SHARED | VM_READ | VM_WRITE > VMA pgprot bits: PTE_RDONLY | PTE_WRITE > PTE pgprot bits: PTE_DIRTY | PTE_WRITE > ii. mprotect the VMA to PROT_NONE. > VMA vmflags: VM_SHARED > VMA pgprot bits: PTE_RDONLY > PTE pgprot bits: PTE_DIRTY | PTE_RDONLY > iii. mprotect the VMA back to PROT_READ | PROT_WRITE. > VMA vmflags: VM_SHARED | VM_READ | VM_WRITE > VMA pgprot bits: PTE_RDONLY | PTE_WRITE > PTE pgprot bits: PTE_DIRTY | PTE_WRITE | PTE_RDONLY > > Applying either one of the two patches in this patchset will fix the > particular issue with HugeTLB pages implemented with contiguous PTEs. > It's possible that only one of these patches should be taken, or that > the right fix is something else entirely. > > [1]: https://gist.github.com/48ca/11d1e466deee032cb35aa8c2280f93b0 > > James Houghton (2): > arm64: hugetlb: Distinguish between hw and sw dirtiness in > __cont_access_flags_changed > arm64: mm: Always make sw-dirty PTEs hw-dirty in pte_modify > > arch/arm64/include/asm/pgtable.h | 6 ++++++ > arch/arm64/mm/hugetlbpage.c | 5 ++++- > 2 files changed, 10 insertions(+), 1 deletion(-) > > > base-commit: 645a9a454fdb7e698a63a275edca6a17ef97afc4
On Tue, Dec 5, 2023 at 6:43 AM Ryan Roberts <ryan.roberts@arm.com> wrote: > > On 04/12/2023 17:26, James Houghton wrote: > > It is currently possible for a userspace application to enter a page > > fault loop when using HugeTLB pages implemented with contiguous PTEs > > when HAFDBS is not available. This happens because: > > 1. The kernel may sometimes write PTEs that are sw-dirty but hw-clean > > (PTE_DIRTY | PTE_RDONLY | PTE_WRITE). > > Hi James, > > Do you know how this happens? Hi Ryan, Thanks for taking a look! I do understand why this is happening. There is an explanation in the reproducer[1] and also in this cover letter (though I realize I could have been a little clearer). See below. > AFAIK, this is the set of valid bit combinations, and > PTE_RDONLY|PTE_WRITE|PTE_DIRTY is not one of them. Perhaps the real solution is > to understand how this is happening and prevent it? > > /* > * PTE bits configuration in the presence of hardware Dirty Bit Management > * (PTE_WRITE == PTE_DBM): > * > * Dirty Writable | PTE_RDONLY PTE_WRITE PTE_DIRTY (sw) > * 0 0 | 1 0 0 > * 0 1 | 1 1 0 > * 1 0 | 1 0 1 > * 1 1 | 0 1 x > * > * When hardware DBM is not present, the sofware PTE_DIRTY bit is updated via > * the page fault mechanism. Checking the dirty status of a pte becomes: > * > * PTE_DIRTY || (PTE_WRITE && !PTE_RDONLY) > */ Thanks for pointing this out. So (1) is definitely a bug. The second patch in this series makes it impossible to create such a PTE via pte_modify (by forcing sw-dirty PTEs to be hw-dirty as well). > > The second patch in this series makes step (1) less likely to occur. It makes it impossible to create this invalid set of bits via pte_modify(). Assuming all PTE pgprot updates are done via the proper interfaces, patch #2 might actually make this invalid bit combination impossible to produce (that's certainly the goal). So perhaps language stronger than "less likely" is appropriate. Here's the sequence of events to trigger this bug, via mprotect(): > > Without this patch, we can get the kernel to write a sw-dirty, hw-clean > > PTE with the following steps (showing the relevant VMA flags and pgprot > > bits): > > i. Create a valid, writable contiguous PTE. > > VMA vmflags: VM_SHARED | VM_READ | VM_WRITE > > VMA pgprot bits: PTE_RDONLY | PTE_WRITE > > PTE pgprot bits: PTE_DIRTY | PTE_WRITE > > ii. mprotect the VMA to PROT_NONE. > > VMA vmflags: VM_SHARED > > VMA pgprot bits: PTE_RDONLY > > PTE pgprot bits: PTE_DIRTY | PTE_RDONLY > > iii. mprotect the VMA back to PROT_READ | PROT_WRITE. > > VMA vmflags: VM_SHARED | VM_READ | VM_WRITE > > VMA pgprot bits: PTE_RDONLY | PTE_WRITE > > PTE pgprot bits: PTE_DIRTY | PTE_WRITE | PTE_RDONLY With patch #2, the PTE pgprot bits in step iii become PTE_DIRTY | PTE_WRITE (hw-dirtiness is set, as the PTE is sw-dirty). Thanks! > > [1]: https://gist.github.com/48ca/11d1e466deee032cb35aa8c2280f93b0
On 05/12/2023 17:54, James Houghton wrote: > On Tue, Dec 5, 2023 at 6:43 AM Ryan Roberts <ryan.roberts@arm.com> wrote: >> >> On 04/12/2023 17:26, James Houghton wrote: >>> It is currently possible for a userspace application to enter a page >>> fault loop when using HugeTLB pages implemented with contiguous PTEs >>> when HAFDBS is not available. This happens because: >>> 1. The kernel may sometimes write PTEs that are sw-dirty but hw-clean >>> (PTE_DIRTY | PTE_RDONLY | PTE_WRITE). >> >> Hi James, >> >> Do you know how this happens? > > Hi Ryan, > > Thanks for taking a look! I do understand why this is happening. There > is an explanation in the reproducer[1] and also in this cover letter > (though I realize I could have been a little clearer). See below. Sigh... sorry! I totally missed your (excellent) explanation. > >> AFAIK, this is the set of valid bit combinations, and >> PTE_RDONLY|PTE_WRITE|PTE_DIRTY is not one of them. Perhaps the real solution is >> to understand how this is happening and prevent it? >> >> /* >> * PTE bits configuration in the presence of hardware Dirty Bit Management >> * (PTE_WRITE == PTE_DBM): >> * >> * Dirty Writable | PTE_RDONLY PTE_WRITE PTE_DIRTY (sw) >> * 0 0 | 1 0 0 >> * 0 1 | 1 1 0 >> * 1 0 | 1 0 1 >> * 1 1 | 0 1 x >> * >> * When hardware DBM is not present, the sofware PTE_DIRTY bit is updated via >> * the page fault mechanism. Checking the dirty status of a pte becomes: >> * >> * PTE_DIRTY || (PTE_WRITE && !PTE_RDONLY) >> */ > > Thanks for pointing this out. So (1) is definitely a bug. The second > patch in this series makes it impossible to create such a PTE via > pte_modify (by forcing sw-dirty PTEs to be hw-dirty as well). Yes; I think the second patch should be sufficient; I took a quick look at the other helpers and I don't see anything else that could get the PTE to the invalid state. I have a series that starts using the contpte bit for (multi-size) THP opportunistically. This bug will affect that too I think. Your patch #2 will fix for both hugetlb and my series. I'd rather not apply an equivalent to your patch #1 because its not quite as straightforward in my code path. But I'm pretty confident that patch # is all that's needed here. Thanks, Ryan > >>> The second patch in this series makes step (1) less likely to occur. > > It makes it impossible to create this invalid set of bits via > pte_modify(). Assuming all PTE pgprot updates are done via the proper > interfaces, patch #2 might actually make this invalid bit combination > impossible to produce (that's certainly the goal). So perhaps language > stronger than "less likely" is appropriate. > > Here's the sequence of events to trigger this bug, via mprotect(): > >>> Without this patch, we can get the kernel to write a sw-dirty, hw-clean >>> PTE with the following steps (showing the relevant VMA flags and pgprot >>> bits): >>> i. Create a valid, writable contiguous PTE. >>> VMA vmflags: VM_SHARED | VM_READ | VM_WRITE >>> VMA pgprot bits: PTE_RDONLY | PTE_WRITE >>> PTE pgprot bits: PTE_DIRTY | PTE_WRITE >>> ii. mprotect the VMA to PROT_NONE. >>> VMA vmflags: VM_SHARED >>> VMA pgprot bits: PTE_RDONLY >>> PTE pgprot bits: PTE_DIRTY | PTE_RDONLY >>> iii. mprotect the VMA back to PROT_READ | PROT_WRITE. >>> VMA vmflags: VM_SHARED | VM_READ | VM_WRITE >>> VMA pgprot bits: PTE_RDONLY | PTE_WRITE >>> PTE pgprot bits: PTE_DIRTY | PTE_WRITE | PTE_RDONLY > > With patch #2, the PTE pgprot bits in step iii become PTE_DIRTY | > PTE_WRITE (hw-dirtiness is set, as the PTE is sw-dirty). > > Thanks! > >>> [1]: https://gist.github.com/48ca/11d1e466deee032cb35aa8c2280f93b0
On Wed, Dec 6, 2023 at 2:24 AM Ryan Roberts <ryan.roberts@arm.com> wrote: > > On 05/12/2023 17:54, James Houghton wrote: > > On Tue, Dec 5, 2023 at 6:43 AM Ryan Roberts <ryan.roberts@arm.com> wrote: > > Thanks for pointing this out. So (1) is definitely a bug. The second > > patch in this series makes it impossible to create such a PTE via > > pte_modify (by forcing sw-dirty PTEs to be hw-dirty as well). > > Yes; I think the second patch should be sufficient; I took a quick look at the > other helpers and I don't see anything else that could get the PTE to the > invalid state. > > I have a series that starts using the contpte bit for (multi-size) THP > opportunistically. This bug will affect that too I think. Your patch #2 will fix > for both hugetlb and my series. I'd rather not apply an equivalent to your patch > #1 because its not quite as straightforward in my code path. But I'm pretty > confident that patch # is all that's needed here. There is no need to apply a patch #1-equivalent for multi-size THPs. :) If multi-size THP has the same problem as HugeTLB, patch #2 will fix it too. I don't think multi-size THP has the equivalent problem -- in fact, I'm not sure how multi-size THP keeps the PTE_DIRTY, PTE_WRITE (DBM), and the PTE_RDONLY bits in sync (they do need to be in-sync with each other when the contiguous bit is being used, right?). I included patch #1 (with cc:stable) because it's a more direct fix for HugeTLB that might be slightly easier to backport. If you think that patch #1 should be dropped and patch #2 should be backported, please let me know. Thanks for the review!
On Mon, 04 Dec 2023 17:26:44 +0000, James Houghton wrote: > It is currently possible for a userspace application to enter a page > fault loop when using HugeTLB pages implemented with contiguous PTEs > when HAFDBS is not available. This happens because: > 1. The kernel may sometimes write PTEs that are sw-dirty but hw-clean > (PTE_DIRTY | PTE_RDONLY | PTE_WRITE). > 2. If, during a write, the CPU uses a sw-dirty, hw-clean PTE in handling > the memory access on a system without HAFDBS, we will get a page > fault. > 3. HugeTLB will check if it needs to update the dirty bits on the PTE. > For contiguous PTEs, it will check to see if the pgprot bits need > updating. In this case, HugeTLB wants to write a sequence of > sw-dirty, hw-dirty PTEs, but it finds that all the PTEs it is about > to overwrite are all pte_dirty() (pte_sw_dirty() => pte_dirty()), > so it thinks no update is necessary. > > [...] Applied to arm64 (for-next/fixes), thanks! [2/2] arm64: mm: Always make sw-dirty PTEs hw-dirty in pte_modify https://git.kernel.org/arm64/c/3c0696076aad I only picked up the second patch and added the description from the cover letter into the commit log.