mbox series

[0/2] arm64: hugetlb: Fix page fault loop for sw-dirty/hw-clean contiguous PTEs

Message ID 20231204172646.2541916-1-jthoughton@google.com (mailing list archive)
Headers show
Series arm64: hugetlb: Fix page fault loop for sw-dirty/hw-clean contiguous PTEs | expand

Message

James Houghton Dec. 4, 2023, 5:26 p.m. UTC
It is currently possible for a userspace application to enter a page
fault loop when using HugeTLB pages implemented with contiguous PTEs
when HAFDBS is not available. This happens because:
1. The kernel may sometimes write PTEs that are sw-dirty but hw-clean
   (PTE_DIRTY | PTE_RDONLY | PTE_WRITE).
2. If, during a write, the CPU uses a sw-dirty, hw-clean PTE in handling
   the memory access on a system without HAFDBS, we will get a page
   fault.
3. HugeTLB will check if it needs to update the dirty bits on the PTE.
   For contiguous PTEs, it will check to see if the pgprot bits need
   updating. In this case, HugeTLB wants to write a sequence of
   sw-dirty, hw-dirty PTEs, but it finds that all the PTEs it is about
   to overwrite are all pte_dirty() (pte_sw_dirty() => pte_dirty()),
   so it thinks no update is necessary.

Please see this[1] reproducer.

I think (though I may be wrong) that both step (1) and step (3) are
buggy.

The first patch in this series fixes step (3); instead of checking if
pte_dirty is matching in __cont_access_flags_changed, check pte_hw_dirty
and pte_sw_dirty separately.

The second patch in this series makes step (1) less likely to occur.
Without this patch, we can get the kernel to write a sw-dirty, hw-clean
PTE with the following steps (showing the relevant VMA flags and pgprot
bits):
i.   Create a valid, writable contiguous PTE.
       VMA vmflags:     VM_SHARED | VM_READ | VM_WRITE
       VMA pgprot bits: PTE_RDONLY | PTE_WRITE
       PTE pgprot bits: PTE_DIRTY | PTE_WRITE
ii.  mprotect the VMA to PROT_NONE.
       VMA vmflags:     VM_SHARED
       VMA pgprot bits: PTE_RDONLY
       PTE pgprot bits: PTE_DIRTY | PTE_RDONLY
iii. mprotect the VMA back to PROT_READ | PROT_WRITE.
       VMA vmflags:     VM_SHARED | VM_READ | VM_WRITE
       VMA pgprot bits: PTE_RDONLY | PTE_WRITE
       PTE pgprot bits: PTE_DIRTY | PTE_WRITE | PTE_RDONLY

Applying either one of the two patches in this patchset will fix the
particular issue with HugeTLB pages implemented with contiguous PTEs.
It's possible that only one of these patches should be taken, or that
the right fix is something else entirely.

[1]: https://gist.github.com/48ca/11d1e466deee032cb35aa8c2280f93b0

James Houghton (2):
  arm64: hugetlb: Distinguish between hw and sw dirtiness in
    __cont_access_flags_changed
  arm64: mm: Always make sw-dirty PTEs hw-dirty in pte_modify

 arch/arm64/include/asm/pgtable.h | 6 ++++++
 arch/arm64/mm/hugetlbpage.c      | 5 ++++-
 2 files changed, 10 insertions(+), 1 deletion(-)


base-commit: 645a9a454fdb7e698a63a275edca6a17ef97afc4

Comments

Ryan Roberts Dec. 5, 2023, 2:43 p.m. UTC | #1
On 04/12/2023 17:26, James Houghton wrote:
> It is currently possible for a userspace application to enter a page
> fault loop when using HugeTLB pages implemented with contiguous PTEs
> when HAFDBS is not available. This happens because:
> 1. The kernel may sometimes write PTEs that are sw-dirty but hw-clean
>    (PTE_DIRTY | PTE_RDONLY | PTE_WRITE).

Hi James,

Do you know how this happens?

AFAIK, this is the set of valid bit combinations, and
PTE_RDONLY|PTE_WRITE|PTE_DIRTY is not one of them. Perhaps the real solution is
to understand how this is happening and prevent it?

/*
 * PTE bits configuration in the presence of hardware Dirty Bit Management
 * (PTE_WRITE == PTE_DBM):
 *
 * Dirty  Writable | PTE_RDONLY  PTE_WRITE  PTE_DIRTY (sw)
 *   0      0      |   1           0          0
 *   0      1      |   1           1          0
 *   1      0      |   1           0          1
 *   1      1      |   0           1          x
 *
 * When hardware DBM is not present, the sofware PTE_DIRTY bit is updated via
 * the page fault mechanism. Checking the dirty status of a pte becomes:
 *
 *   PTE_DIRTY || (PTE_WRITE && !PTE_RDONLY)
 */


Thanks,
Ryan


> 2. If, during a write, the CPU uses a sw-dirty, hw-clean PTE in handling
>    the memory access on a system without HAFDBS, we will get a page
>    fault.
> 3. HugeTLB will check if it needs to update the dirty bits on the PTE.
>    For contiguous PTEs, it will check to see if the pgprot bits need
>    updating. In this case, HugeTLB wants to write a sequence of
>    sw-dirty, hw-dirty PTEs, but it finds that all the PTEs it is about
>    to overwrite are all pte_dirty() (pte_sw_dirty() => pte_dirty()),
>    so it thinks no update is necessary.
> 
> Please see this[1] reproducer.
> 
> I think (though I may be wrong) that both step (1) and step (3) are
> buggy.
> 
> The first patch in this series fixes step (3); instead of checking if
> pte_dirty is matching in __cont_access_flags_changed, check pte_hw_dirty
> and pte_sw_dirty separately.
> 
> The second patch in this series makes step (1) less likely to occur.
> Without this patch, we can get the kernel to write a sw-dirty, hw-clean
> PTE with the following steps (showing the relevant VMA flags and pgprot
> bits):
> i.   Create a valid, writable contiguous PTE.
>        VMA vmflags:     VM_SHARED | VM_READ | VM_WRITE
>        VMA pgprot bits: PTE_RDONLY | PTE_WRITE
>        PTE pgprot bits: PTE_DIRTY | PTE_WRITE
> ii.  mprotect the VMA to PROT_NONE.
>        VMA vmflags:     VM_SHARED
>        VMA pgprot bits: PTE_RDONLY
>        PTE pgprot bits: PTE_DIRTY | PTE_RDONLY
> iii. mprotect the VMA back to PROT_READ | PROT_WRITE.
>        VMA vmflags:     VM_SHARED | VM_READ | VM_WRITE
>        VMA pgprot bits: PTE_RDONLY | PTE_WRITE
>        PTE pgprot bits: PTE_DIRTY | PTE_WRITE | PTE_RDONLY
> 
> Applying either one of the two patches in this patchset will fix the
> particular issue with HugeTLB pages implemented with contiguous PTEs.
> It's possible that only one of these patches should be taken, or that
> the right fix is something else entirely.
> 
> [1]: https://gist.github.com/48ca/11d1e466deee032cb35aa8c2280f93b0
> 
> James Houghton (2):
>   arm64: hugetlb: Distinguish between hw and sw dirtiness in
>     __cont_access_flags_changed
>   arm64: mm: Always make sw-dirty PTEs hw-dirty in pte_modify
> 
>  arch/arm64/include/asm/pgtable.h | 6 ++++++
>  arch/arm64/mm/hugetlbpage.c      | 5 ++++-
>  2 files changed, 10 insertions(+), 1 deletion(-)
> 
> 
> base-commit: 645a9a454fdb7e698a63a275edca6a17ef97afc4
James Houghton Dec. 5, 2023, 5:54 p.m. UTC | #2
On Tue, Dec 5, 2023 at 6:43 AM Ryan Roberts <ryan.roberts@arm.com> wrote:
>
> On 04/12/2023 17:26, James Houghton wrote:
> > It is currently possible for a userspace application to enter a page
> > fault loop when using HugeTLB pages implemented with contiguous PTEs
> > when HAFDBS is not available. This happens because:
> > 1. The kernel may sometimes write PTEs that are sw-dirty but hw-clean
> >    (PTE_DIRTY | PTE_RDONLY | PTE_WRITE).
>
> Hi James,
>
> Do you know how this happens?

Hi Ryan,

Thanks for taking a look! I do understand why this is happening. There
is an explanation in the reproducer[1] and also in this cover letter
(though I realize I could have been a little clearer). See below.

> AFAIK, this is the set of valid bit combinations, and
> PTE_RDONLY|PTE_WRITE|PTE_DIRTY is not one of them. Perhaps the real solution is
> to understand how this is happening and prevent it?
>
> /*
>  * PTE bits configuration in the presence of hardware Dirty Bit Management
>  * (PTE_WRITE == PTE_DBM):
>  *
>  * Dirty  Writable | PTE_RDONLY  PTE_WRITE  PTE_DIRTY (sw)
>  *   0      0      |   1           0          0
>  *   0      1      |   1           1          0
>  *   1      0      |   1           0          1
>  *   1      1      |   0           1          x
>  *
>  * When hardware DBM is not present, the sofware PTE_DIRTY bit is updated via
>  * the page fault mechanism. Checking the dirty status of a pte becomes:
>  *
>  *   PTE_DIRTY || (PTE_WRITE && !PTE_RDONLY)
>  */

Thanks for pointing this out. So (1) is definitely a bug. The second
patch in this series makes it impossible to create such a PTE via
pte_modify (by forcing sw-dirty PTEs to be hw-dirty as well).

> > The second patch in this series makes step (1) less likely to occur.

It makes it impossible to create this invalid set of bits via
pte_modify(). Assuming all PTE pgprot updates are done via the proper
interfaces, patch #2 might actually make this invalid bit combination
impossible to produce (that's certainly the goal). So perhaps language
stronger than "less likely" is appropriate.

Here's the sequence of events to trigger this bug, via mprotect():

> > Without this patch, we can get the kernel to write a sw-dirty, hw-clean
> > PTE with the following steps (showing the relevant VMA flags and pgprot
> > bits):
> > i.   Create a valid, writable contiguous PTE.
> >        VMA vmflags:     VM_SHARED | VM_READ | VM_WRITE
> >        VMA pgprot bits: PTE_RDONLY | PTE_WRITE
> >        PTE pgprot bits: PTE_DIRTY | PTE_WRITE
> > ii.  mprotect the VMA to PROT_NONE.
> >        VMA vmflags:     VM_SHARED
> >        VMA pgprot bits: PTE_RDONLY
> >        PTE pgprot bits: PTE_DIRTY | PTE_RDONLY
> > iii. mprotect the VMA back to PROT_READ | PROT_WRITE.
> >        VMA vmflags:     VM_SHARED | VM_READ | VM_WRITE
> >        VMA pgprot bits: PTE_RDONLY | PTE_WRITE
> >        PTE pgprot bits: PTE_DIRTY | PTE_WRITE | PTE_RDONLY

With patch #2, the PTE pgprot bits in step iii become PTE_DIRTY |
PTE_WRITE (hw-dirtiness is set, as the PTE is sw-dirty).

Thanks!

> > [1]: https://gist.github.com/48ca/11d1e466deee032cb35aa8c2280f93b0
Ryan Roberts Dec. 6, 2023, 10:24 a.m. UTC | #3
On 05/12/2023 17:54, James Houghton wrote:
> On Tue, Dec 5, 2023 at 6:43 AM Ryan Roberts <ryan.roberts@arm.com> wrote:
>>
>> On 04/12/2023 17:26, James Houghton wrote:
>>> It is currently possible for a userspace application to enter a page
>>> fault loop when using HugeTLB pages implemented with contiguous PTEs
>>> when HAFDBS is not available. This happens because:
>>> 1. The kernel may sometimes write PTEs that are sw-dirty but hw-clean
>>>    (PTE_DIRTY | PTE_RDONLY | PTE_WRITE).
>>
>> Hi James,
>>
>> Do you know how this happens?
> 
> Hi Ryan,
> 
> Thanks for taking a look! I do understand why this is happening. There
> is an explanation in the reproducer[1] and also in this cover letter
> (though I realize I could have been a little clearer). See below.

Sigh... sorry! I totally missed your (excellent) explanation.


> 
>> AFAIK, this is the set of valid bit combinations, and
>> PTE_RDONLY|PTE_WRITE|PTE_DIRTY is not one of them. Perhaps the real solution is
>> to understand how this is happening and prevent it?
>>
>> /*
>>  * PTE bits configuration in the presence of hardware Dirty Bit Management
>>  * (PTE_WRITE == PTE_DBM):
>>  *
>>  * Dirty  Writable | PTE_RDONLY  PTE_WRITE  PTE_DIRTY (sw)
>>  *   0      0      |   1           0          0
>>  *   0      1      |   1           1          0
>>  *   1      0      |   1           0          1
>>  *   1      1      |   0           1          x
>>  *
>>  * When hardware DBM is not present, the sofware PTE_DIRTY bit is updated via
>>  * the page fault mechanism. Checking the dirty status of a pte becomes:
>>  *
>>  *   PTE_DIRTY || (PTE_WRITE && !PTE_RDONLY)
>>  */
> 
> Thanks for pointing this out. So (1) is definitely a bug. The second
> patch in this series makes it impossible to create such a PTE via
> pte_modify (by forcing sw-dirty PTEs to be hw-dirty as well).

Yes; I think the second patch should be sufficient; I took a quick look at the
other helpers and I don't see anything else that could get the PTE to the
invalid state.

I have a series that starts using the contpte bit for (multi-size) THP
opportunistically. This bug will affect that too I think. Your patch #2 will fix
for both hugetlb and my series. I'd rather not apply an equivalent to your patch
#1 because its not quite as straightforward in my code path. But I'm pretty
confident that patch # is all that's needed here.

Thanks,
Ryan

> 
>>> The second patch in this series makes step (1) less likely to occur.
> 
> It makes it impossible to create this invalid set of bits via
> pte_modify(). Assuming all PTE pgprot updates are done via the proper
> interfaces, patch #2 might actually make this invalid bit combination
> impossible to produce (that's certainly the goal). So perhaps language
> stronger than "less likely" is appropriate.
> 
> Here's the sequence of events to trigger this bug, via mprotect():
> 
>>> Without this patch, we can get the kernel to write a sw-dirty, hw-clean
>>> PTE with the following steps (showing the relevant VMA flags and pgprot
>>> bits):
>>> i.   Create a valid, writable contiguous PTE.
>>>        VMA vmflags:     VM_SHARED | VM_READ | VM_WRITE
>>>        VMA pgprot bits: PTE_RDONLY | PTE_WRITE
>>>        PTE pgprot bits: PTE_DIRTY | PTE_WRITE
>>> ii.  mprotect the VMA to PROT_NONE.
>>>        VMA vmflags:     VM_SHARED
>>>        VMA pgprot bits: PTE_RDONLY
>>>        PTE pgprot bits: PTE_DIRTY | PTE_RDONLY
>>> iii. mprotect the VMA back to PROT_READ | PROT_WRITE.
>>>        VMA vmflags:     VM_SHARED | VM_READ | VM_WRITE
>>>        VMA pgprot bits: PTE_RDONLY | PTE_WRITE
>>>        PTE pgprot bits: PTE_DIRTY | PTE_WRITE | PTE_RDONLY
> 
> With patch #2, the PTE pgprot bits in step iii become PTE_DIRTY |
> PTE_WRITE (hw-dirtiness is set, as the PTE is sw-dirty).
> 
> Thanks!
> 
>>> [1]: https://gist.github.com/48ca/11d1e466deee032cb35aa8c2280f93b0
James Houghton Dec. 6, 2023, 9:01 p.m. UTC | #4
On Wed, Dec 6, 2023 at 2:24 AM Ryan Roberts <ryan.roberts@arm.com> wrote:
>
> On 05/12/2023 17:54, James Houghton wrote:
> > On Tue, Dec 5, 2023 at 6:43 AM Ryan Roberts <ryan.roberts@arm.com> wrote:
> > Thanks for pointing this out. So (1) is definitely a bug. The second
> > patch in this series makes it impossible to create such a PTE via
> > pte_modify (by forcing sw-dirty PTEs to be hw-dirty as well).
>
> Yes; I think the second patch should be sufficient; I took a quick look at the
> other helpers and I don't see anything else that could get the PTE to the
> invalid state.
>
> I have a series that starts using the contpte bit for (multi-size) THP
> opportunistically. This bug will affect that too I think. Your patch #2 will fix
> for both hugetlb and my series. I'd rather not apply an equivalent to your patch
> #1 because its not quite as straightforward in my code path. But I'm pretty
> confident that patch # is all that's needed here.

There is no need to apply a patch #1-equivalent for multi-size THPs.
:) If multi-size THP has the same problem as HugeTLB, patch #2 will
fix it too. I don't think multi-size THP has the equivalent problem --
in fact, I'm not sure how multi-size THP keeps the PTE_DIRTY,
PTE_WRITE (DBM), and the PTE_RDONLY bits in sync (they do need to be
in-sync with each other when the contiguous bit is being used,
right?).

I included patch #1 (with cc:stable) because it's a more direct fix
for HugeTLB that might be slightly easier to backport. If you think
that patch #1 should be dropped and patch #2 should be backported,
please let me know.

Thanks for the review!
Catalin Marinas Dec. 12, 2023, 5:22 p.m. UTC | #5
On Mon, 04 Dec 2023 17:26:44 +0000, James Houghton wrote:
> It is currently possible for a userspace application to enter a page
> fault loop when using HugeTLB pages implemented with contiguous PTEs
> when HAFDBS is not available. This happens because:
> 1. The kernel may sometimes write PTEs that are sw-dirty but hw-clean
>    (PTE_DIRTY | PTE_RDONLY | PTE_WRITE).
> 2. If, during a write, the CPU uses a sw-dirty, hw-clean PTE in handling
>    the memory access on a system without HAFDBS, we will get a page
>    fault.
> 3. HugeTLB will check if it needs to update the dirty bits on the PTE.
>    For contiguous PTEs, it will check to see if the pgprot bits need
>    updating. In this case, HugeTLB wants to write a sequence of
>    sw-dirty, hw-dirty PTEs, but it finds that all the PTEs it is about
>    to overwrite are all pte_dirty() (pte_sw_dirty() => pte_dirty()),
>    so it thinks no update is necessary.
> 
> [...]

Applied to arm64 (for-next/fixes), thanks!

[2/2] arm64: mm: Always make sw-dirty PTEs hw-dirty in pte_modify
      https://git.kernel.org/arm64/c/3c0696076aad

I only picked up the second patch and added the description from the
cover letter into the commit log.