diff mbox series

mm: hugepage: mark splitted page dirty when needed

Message ID 20180904075510.22338-1-peterx@redhat.com (mailing list archive)
State New, archived
Headers show
Series mm: hugepage: mark splitted page dirty when needed | expand

Commit Message

Peter Xu Sept. 4, 2018, 7:55 a.m. UTC
When splitting a huge page, we should set all small pages as dirty if
the original huge page has the dirty bit set before.  Otherwise we'll
lose the original dirty bit.

CC: Andrea Arcangeli <aarcange@redhat.com>
CC: Andrew Morton <akpm@linux-foundation.org>
CC: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
CC: Michal Hocko <mhocko@suse.com>
CC: Zi Yan <zi.yan@cs.rutgers.edu>
CC: Huang Ying <ying.huang@intel.com>
CC: Dan Williams <dan.j.williams@intel.com>
CC: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
CC: "Jérôme Glisse" <jglisse@redhat.com>
CC: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
CC: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
CC: Souptick Joarder <jrdr.linux@gmail.com>
CC: linux-mm@kvack.org
CC: linux-kernel@vger.kernel.org
Signed-off-by: Peter Xu <peterx@redhat.com>
---

To the reviewers: I'm new to the mm world so sorry if this patch is
making silly mistakes, however it did solve a problem for me when
testing with a customized Linux tree mostly based on Andrea's userfault
write-protect work.  Without the change, my customized QEMU/tcg tree
will not be able to do correct UFFDIO_WRITEPROTECT and then QEMU will
get a SIGBUS when faulting multiple times.  With the change (or of
course disabling THP) then UFFDIO_WRITEPROTECT will be able to correctly
resolve the write protections then it runs well.  Any comment would be
welcomed.  TIA.
---
 mm/huge_memory.c | 2 ++
 1 file changed, 2 insertions(+)

Comments

Kirill A . Shutemov Sept. 4, 2018, 8:01 a.m. UTC | #1
On Tue, Sep 04, 2018 at 03:55:10PM +0800, Peter Xu wrote:
> When splitting a huge page, we should set all small pages as dirty if
> the original huge page has the dirty bit set before.  Otherwise we'll
> lose the original dirty bit.

We don't lose it. It got transfered to struct page flag:

	if (pmd_dirty(old_pmd))
		SetPageDirty(page);
Zi Yan Sept. 4, 2018, 2 p.m. UTC | #2
On 4 Sep 2018, at 4:01, Kirill A. Shutemov wrote:

> On Tue, Sep 04, 2018 at 03:55:10PM +0800, Peter Xu wrote:
>> When splitting a huge page, we should set all small pages as dirty if
>> the original huge page has the dirty bit set before.  Otherwise we'll
>> lose the original dirty bit.
>
> We don't lose it. It got transfered to struct page flag:
>
> 	if (pmd_dirty(old_pmd))
> 		SetPageDirty(page);
>

Plus, when split_huge_page_to_list() splits a THP, its subroutine __split_huge_page()
propagates the dirty bit in the head page flag to all subpages in __split_huge_page_tail().

--
Best Regards
Yan Zi
Peter Xu Sept. 5, 2018, 7:30 a.m. UTC | #3
On Tue, Sep 04, 2018 at 10:00:28AM -0400, Zi Yan wrote:
> On 4 Sep 2018, at 4:01, Kirill A. Shutemov wrote:
> 
> > On Tue, Sep 04, 2018 at 03:55:10PM +0800, Peter Xu wrote:
> >> When splitting a huge page, we should set all small pages as dirty if
> >> the original huge page has the dirty bit set before.  Otherwise we'll
> >> lose the original dirty bit.
> >
> > We don't lose it. It got transfered to struct page flag:
> >
> > 	if (pmd_dirty(old_pmd))
> > 		SetPageDirty(page);
> >
> 
> Plus, when split_huge_page_to_list() splits a THP, its subroutine __split_huge_page()
> propagates the dirty bit in the head page flag to all subpages in __split_huge_page_tail().

Hi, Kirill, Zi,

Thanks for your responses!

Though in my test the huge page seems to be splitted not by
split_huge_page_to_list() but by explicit calls to
change_protection().  The stack looks like this (again, this is a
customized kernel, and I added an explicit dump_stack() there):

  kernel:  dump_stack+0x5c/0x7b
  kernel:  __split_huge_pmd+0x192/0xdc0
  kernel:  ? update_load_avg+0x8b/0x550
  kernel:  ? update_load_avg+0x8b/0x550
  kernel:  ? account_entity_enqueue+0xc5/0xf0
  kernel:  ? enqueue_entity+0x112/0x650
  kernel:  change_protection+0x3a2/0xab0
  kernel:  mwriteprotect_range+0xdd/0x110
  kernel:  userfaultfd_ioctl+0x50b/0x1210
  kernel:  ? do_futex+0x2cf/0xb20
  kernel:  ? tty_write+0x1d2/0x2f0
  kernel:  ? do_vfs_ioctl+0x9f/0x610
  kernel:  do_vfs_ioctl+0x9f/0x610
  kernel:  ? __x64_sys_futex+0x88/0x180
  kernel:  ksys_ioctl+0x70/0x80
  kernel:  __x64_sys_ioctl+0x16/0x20
  kernel:  do_syscall_64+0x55/0x150
  kernel:  entry_SYSCALL_64_after_hwframe+0x44/0xa9

At the very time the userspace is sending an UFFDIO_WRITEPROTECT ioctl
to kernel space, which is handled by mwriteprotect_range().  In case
you'd like to refer to the kernel, it's basically this one from
Andrea's (with very trivial changes):

  https://git.kernel.org/pub/scm/linux/kernel/git/andrea/aa.git userfault

So... do we have two paths to split the huge pages separately?

Another (possibly very naive) question is: could any of you hint me
how the page dirty bit is finally applied to the PTEs?  These two
dirty flags confused me for a few days already (the SetPageDirty() one
which sets the page dirty flag, and the pte_mkdirty() which sets that
onto the real PTEs).

Regards,
Zi Yan Sept. 5, 2018, 12:49 p.m. UTC | #4
On 5 Sep 2018, at 3:30, Peter Xu wrote:

> On Tue, Sep 04, 2018 at 10:00:28AM -0400, Zi Yan wrote:
>> On 4 Sep 2018, at 4:01, Kirill A. Shutemov wrote:
>>
>>> On Tue, Sep 04, 2018 at 03:55:10PM +0800, Peter Xu wrote:
>>>> When splitting a huge page, we should set all small pages as dirty if
>>>> the original huge page has the dirty bit set before.  Otherwise we'll
>>>> lose the original dirty bit.
>>>
>>> We don't lose it. It got transfered to struct page flag:
>>>
>>> 	if (pmd_dirty(old_pmd))
>>> 		SetPageDirty(page);
>>>
>>
>> Plus, when split_huge_page_to_list() splits a THP, its subroutine __split_huge_page()
>> propagates the dirty bit in the head page flag to all subpages in __split_huge_page_tail().
>
> Hi, Kirill, Zi,
>
> Thanks for your responses!
>
> Though in my test the huge page seems to be splitted not by
> split_huge_page_to_list() but by explicit calls to
> change_protection().  The stack looks like this (again, this is a
> customized kernel, and I added an explicit dump_stack() there):
>
>   kernel:  dump_stack+0x5c/0x7b
>   kernel:  __split_huge_pmd+0x192/0xdc0
>   kernel:  ? update_load_avg+0x8b/0x550
>   kernel:  ? update_load_avg+0x8b/0x550
>   kernel:  ? account_entity_enqueue+0xc5/0xf0
>   kernel:  ? enqueue_entity+0x112/0x650
>   kernel:  change_protection+0x3a2/0xab0
>   kernel:  mwriteprotect_range+0xdd/0x110
>   kernel:  userfaultfd_ioctl+0x50b/0x1210
>   kernel:  ? do_futex+0x2cf/0xb20
>   kernel:  ? tty_write+0x1d2/0x2f0
>   kernel:  ? do_vfs_ioctl+0x9f/0x610
>   kernel:  do_vfs_ioctl+0x9f/0x610
>   kernel:  ? __x64_sys_futex+0x88/0x180
>   kernel:  ksys_ioctl+0x70/0x80
>   kernel:  __x64_sys_ioctl+0x16/0x20
>   kernel:  do_syscall_64+0x55/0x150
>   kernel:  entry_SYSCALL_64_after_hwframe+0x44/0xa9
>
> At the very time the userspace is sending an UFFDIO_WRITEPROTECT ioctl
> to kernel space, which is handled by mwriteprotect_range().  In case
> you'd like to refer to the kernel, it's basically this one from
> Andrea's (with very trivial changes):
>
>   https://git.kernel.org/pub/scm/linux/kernel/git/andrea/aa.git userfault
>
> So... do we have two paths to split the huge pages separately?
>
> Another (possibly very naive) question is: could any of you hint me
> how the page dirty bit is finally applied to the PTEs?  These two
> dirty flags confused me for a few days already (the SetPageDirty() one
> which sets the page dirty flag, and the pte_mkdirty() which sets that
> onto the real PTEs).

change_protection() only causes splitting a PMD entry into multiple PTEs
but not the physical compound page, so my answer does not apply to your case.
It is unclear how the dirty bit makes your QEMU get a SIGBUS. I think you
need to describe your problem with more details.

AFAIK, the PageDirty bit will not apply back to any PTEs. So for your case,
when reporting a page’s dirty bit information, some function in the kernel only checks
the PTE’s dirty bit but not the dirty bit in the struct page flags, which
might provide a wrong answer.


—
Best Regards,
Yan Zi
Kirill A . Shutemov Sept. 5, 2018, 12:55 p.m. UTC | #5
On Wed, Sep 05, 2018 at 03:30:37PM +0800, Peter Xu wrote:
> On Tue, Sep 04, 2018 at 10:00:28AM -0400, Zi Yan wrote:
> > On 4 Sep 2018, at 4:01, Kirill A. Shutemov wrote:
> > 
> > > On Tue, Sep 04, 2018 at 03:55:10PM +0800, Peter Xu wrote:
> > >> When splitting a huge page, we should set all small pages as dirty if
> > >> the original huge page has the dirty bit set before.  Otherwise we'll
> > >> lose the original dirty bit.
> > >
> > > We don't lose it. It got transfered to struct page flag:
> > >
> > > 	if (pmd_dirty(old_pmd))
> > > 		SetPageDirty(page);
> > >
> > 
> > Plus, when split_huge_page_to_list() splits a THP, its subroutine __split_huge_page()
> > propagates the dirty bit in the head page flag to all subpages in __split_huge_page_tail().
> 
> Hi, Kirill, Zi,
> 
> Thanks for your responses!
> 
> Though in my test the huge page seems to be splitted not by
> split_huge_page_to_list() but by explicit calls to
> change_protection().  The stack looks like this (again, this is a
> customized kernel, and I added an explicit dump_stack() there):
> 
>   kernel:  dump_stack+0x5c/0x7b
>   kernel:  __split_huge_pmd+0x192/0xdc0
>   kernel:  ? update_load_avg+0x8b/0x550
>   kernel:  ? update_load_avg+0x8b/0x550
>   kernel:  ? account_entity_enqueue+0xc5/0xf0
>   kernel:  ? enqueue_entity+0x112/0x650
>   kernel:  change_protection+0x3a2/0xab0
>   kernel:  mwriteprotect_range+0xdd/0x110
>   kernel:  userfaultfd_ioctl+0x50b/0x1210
>   kernel:  ? do_futex+0x2cf/0xb20
>   kernel:  ? tty_write+0x1d2/0x2f0
>   kernel:  ? do_vfs_ioctl+0x9f/0x610
>   kernel:  do_vfs_ioctl+0x9f/0x610
>   kernel:  ? __x64_sys_futex+0x88/0x180
>   kernel:  ksys_ioctl+0x70/0x80
>   kernel:  __x64_sys_ioctl+0x16/0x20
>   kernel:  do_syscall_64+0x55/0x150
>   kernel:  entry_SYSCALL_64_after_hwframe+0x44/0xa9
> 
> At the very time the userspace is sending an UFFDIO_WRITEPROTECT ioctl
> to kernel space, which is handled by mwriteprotect_range().  In case
> you'd like to refer to the kernel, it's basically this one from
> Andrea's (with very trivial changes):
> 
>   https://git.kernel.org/pub/scm/linux/kernel/git/andrea/aa.git userfault
> 
> So... do we have two paths to split the huge pages separately?

We have two entiries that can be split: page table enties and underlying
compound page.

split_huge_pmd() (and variants of it) split the PMD entry into a PTE page
table. It doens't touch underlying compound page. The page still can be
mapped in other place as huge.

split_huge_page() (and ivariants of it) split compound page into a number
of 4k (or whatever PAGE_SIZE is). The operation requires splitting all
PMD, but not other way around.

> 
> Another (possibly very naive) question is: could any of you hint me
> how the page dirty bit is finally applied to the PTEs?  These two
> dirty flags confused me for a few days already (the SetPageDirty() one
> which sets the page dirty flag, and the pte_mkdirty() which sets that
> onto the real PTEs).

Dirty bit from page table entries transferes to sturct page flug and used
for decision making in reclaim path.
Peter Xu Sept. 6, 2018, 11:39 a.m. UTC | #6
On Wed, Sep 05, 2018 at 03:55:22PM +0300, Kirill A. Shutemov wrote:
> On Wed, Sep 05, 2018 at 03:30:37PM +0800, Peter Xu wrote:
> > On Tue, Sep 04, 2018 at 10:00:28AM -0400, Zi Yan wrote:
> > > On 4 Sep 2018, at 4:01, Kirill A. Shutemov wrote:
> > > 
> > > > On Tue, Sep 04, 2018 at 03:55:10PM +0800, Peter Xu wrote:
> > > >> When splitting a huge page, we should set all small pages as dirty if
> > > >> the original huge page has the dirty bit set before.  Otherwise we'll
> > > >> lose the original dirty bit.
> > > >
> > > > We don't lose it. It got transfered to struct page flag:
> > > >
> > > > 	if (pmd_dirty(old_pmd))
> > > > 		SetPageDirty(page);
> > > >
> > > 
> > > Plus, when split_huge_page_to_list() splits a THP, its subroutine __split_huge_page()
> > > propagates the dirty bit in the head page flag to all subpages in __split_huge_page_tail().
> > 
> > Hi, Kirill, Zi,
> > 
> > Thanks for your responses!
> > 
> > Though in my test the huge page seems to be splitted not by
> > split_huge_page_to_list() but by explicit calls to
> > change_protection().  The stack looks like this (again, this is a
> > customized kernel, and I added an explicit dump_stack() there):
> > 
> >   kernel:  dump_stack+0x5c/0x7b
> >   kernel:  __split_huge_pmd+0x192/0xdc0
> >   kernel:  ? update_load_avg+0x8b/0x550
> >   kernel:  ? update_load_avg+0x8b/0x550
> >   kernel:  ? account_entity_enqueue+0xc5/0xf0
> >   kernel:  ? enqueue_entity+0x112/0x650
> >   kernel:  change_protection+0x3a2/0xab0
> >   kernel:  mwriteprotect_range+0xdd/0x110
> >   kernel:  userfaultfd_ioctl+0x50b/0x1210
> >   kernel:  ? do_futex+0x2cf/0xb20
> >   kernel:  ? tty_write+0x1d2/0x2f0
> >   kernel:  ? do_vfs_ioctl+0x9f/0x610
> >   kernel:  do_vfs_ioctl+0x9f/0x610
> >   kernel:  ? __x64_sys_futex+0x88/0x180
> >   kernel:  ksys_ioctl+0x70/0x80
> >   kernel:  __x64_sys_ioctl+0x16/0x20
> >   kernel:  do_syscall_64+0x55/0x150
> >   kernel:  entry_SYSCALL_64_after_hwframe+0x44/0xa9
> > 
> > At the very time the userspace is sending an UFFDIO_WRITEPROTECT ioctl
> > to kernel space, which is handled by mwriteprotect_range().  In case
> > you'd like to refer to the kernel, it's basically this one from
> > Andrea's (with very trivial changes):
> > 
> >   https://git.kernel.org/pub/scm/linux/kernel/git/andrea/aa.git userfault
> > 
> > So... do we have two paths to split the huge pages separately?
> 
> We have two entiries that can be split: page table enties and underlying
> compound page.
> 
> split_huge_pmd() (and variants of it) split the PMD entry into a PTE page
> table. It doens't touch underlying compound page. The page still can be
> mapped in other place as huge.
> 
> split_huge_page() (and ivariants of it) split compound page into a number
> of 4k (or whatever PAGE_SIZE is). The operation requires splitting all
> PMD, but not other way around.
> 
> > 
> > Another (possibly very naive) question is: could any of you hint me
> > how the page dirty bit is finally applied to the PTEs?  These two
> > dirty flags confused me for a few days already (the SetPageDirty() one
> > which sets the page dirty flag, and the pte_mkdirty() which sets that
> > onto the real PTEs).
> 
> Dirty bit from page table entries transferes to sturct page flug and used
> for decision making in reclaim path.

Thanks for explaining.  It's much clearer for me.

Though for the issue I have encountered, I am still confused on why
that dirty bit can be ignored for the splitted PTEs.  Indeed we have:

	if (pmd_dirty(old_pmd))
		SetPageDirty(page);

However to me this only transfers (as you explained above) the dirty
bit (AFAIU it's possibly set by the hardware when the page is written)
to the page struct of the compound page.  It did not really apply to
every small page of the splitted huge page.  As you also explained,
this __split_huge_pmd() only splits the PMD entry but it keeps the
compound huge page there, then IMHO it should also apply the dirty
bits from the huge page to all the small page entries, no?

These dirty bits are really important to my scenario since AFAIU the
change_protection() call is using these dirty bits to decide whether
it should append the WRITE bit - it finally corresponds to the lines
in change_pte_range():

        /* Avoid taking write faults for known dirty pages */
        if (dirty_accountable && pte_dirty(ptent) &&
                        (pte_soft_dirty(ptent) ||
                                !(vma->vm_flags & VM_SOFTDIRTY))) {
                ptent = pte_mkwrite(ptent);
        }

So when mprotect() with that range (my case is UFFDIO_WRITEPROTECT,
which is similar) although we pass in the new protocol with VM_WRITE
here it'll still mask it since the dirty bit is not set, then the
userspace program (in my case, the QEMU thread that handles write
protect failures) can never fixup the write-protected page fault.

Am I missing anything important here?

Regards,
Peter Xu Sept. 6, 2018, 11:43 a.m. UTC | #7
On Wed, Sep 05, 2018 at 08:49:20AM -0400, Zi Yan wrote:
> On 5 Sep 2018, at 3:30, Peter Xu wrote:
> 
> > On Tue, Sep 04, 2018 at 10:00:28AM -0400, Zi Yan wrote:
> >> On 4 Sep 2018, at 4:01, Kirill A. Shutemov wrote:
> >>
> >>> On Tue, Sep 04, 2018 at 03:55:10PM +0800, Peter Xu wrote:
> >>>> When splitting a huge page, we should set all small pages as dirty if
> >>>> the original huge page has the dirty bit set before.  Otherwise we'll
> >>>> lose the original dirty bit.
> >>>
> >>> We don't lose it. It got transfered to struct page flag:
> >>>
> >>> 	if (pmd_dirty(old_pmd))
> >>> 		SetPageDirty(page);
> >>>
> >>
> >> Plus, when split_huge_page_to_list() splits a THP, its subroutine __split_huge_page()
> >> propagates the dirty bit in the head page flag to all subpages in __split_huge_page_tail().
> >
> > Hi, Kirill, Zi,
> >
> > Thanks for your responses!
> >
> > Though in my test the huge page seems to be splitted not by
> > split_huge_page_to_list() but by explicit calls to
> > change_protection().  The stack looks like this (again, this is a
> > customized kernel, and I added an explicit dump_stack() there):
> >
> >   kernel:  dump_stack+0x5c/0x7b
> >   kernel:  __split_huge_pmd+0x192/0xdc0
> >   kernel:  ? update_load_avg+0x8b/0x550
> >   kernel:  ? update_load_avg+0x8b/0x550
> >   kernel:  ? account_entity_enqueue+0xc5/0xf0
> >   kernel:  ? enqueue_entity+0x112/0x650
> >   kernel:  change_protection+0x3a2/0xab0
> >   kernel:  mwriteprotect_range+0xdd/0x110
> >   kernel:  userfaultfd_ioctl+0x50b/0x1210
> >   kernel:  ? do_futex+0x2cf/0xb20
> >   kernel:  ? tty_write+0x1d2/0x2f0
> >   kernel:  ? do_vfs_ioctl+0x9f/0x610
> >   kernel:  do_vfs_ioctl+0x9f/0x610
> >   kernel:  ? __x64_sys_futex+0x88/0x180
> >   kernel:  ksys_ioctl+0x70/0x80
> >   kernel:  __x64_sys_ioctl+0x16/0x20
> >   kernel:  do_syscall_64+0x55/0x150
> >   kernel:  entry_SYSCALL_64_after_hwframe+0x44/0xa9
> >
> > At the very time the userspace is sending an UFFDIO_WRITEPROTECT ioctl
> > to kernel space, which is handled by mwriteprotect_range().  In case
> > you'd like to refer to the kernel, it's basically this one from
> > Andrea's (with very trivial changes):
> >
> >   https://git.kernel.org/pub/scm/linux/kernel/git/andrea/aa.git userfault
> >
> > So... do we have two paths to split the huge pages separately?
> >
> > Another (possibly very naive) question is: could any of you hint me
> > how the page dirty bit is finally applied to the PTEs?  These two
> > dirty flags confused me for a few days already (the SetPageDirty() one
> > which sets the page dirty flag, and the pte_mkdirty() which sets that
> > onto the real PTEs).
> 
> change_protection() only causes splitting a PMD entry into multiple PTEs
> but not the physical compound page, so my answer does not apply to your case.
> It is unclear how the dirty bit makes your QEMU get a SIGBUS. I think you
> need to describe your problem with more details.

Hi, Zi,

I explained with some more details on my problem in my other reply to
Kirill.  Please have a look.

> 
> AFAIK, the PageDirty bit will not apply back to any PTEs. So for your case,
> when reporting a page’s dirty bit information, some function in the kernel only checks
> the PTE’s dirty bit but not the dirty bit in the struct page flags, which
> might provide a wrong answer.

Are you suggesting that we should always check both places (the PTE
dirty bit) and also the page flag to know whether a page is dirty
(hence, either of the bit set should mean the page is dirty)?

Thanks,
Kirill A . Shutemov Sept. 6, 2018, 2:08 p.m. UTC | #8
On Thu, Sep 06, 2018 at 07:39:33PM +0800, Peter Xu wrote:
> On Wed, Sep 05, 2018 at 03:55:22PM +0300, Kirill A. Shutemov wrote:
> > On Wed, Sep 05, 2018 at 03:30:37PM +0800, Peter Xu wrote:
> > > On Tue, Sep 04, 2018 at 10:00:28AM -0400, Zi Yan wrote:
> > > > On 4 Sep 2018, at 4:01, Kirill A. Shutemov wrote:
> > > > 
> > > > > On Tue, Sep 04, 2018 at 03:55:10PM +0800, Peter Xu wrote:
> > > > >> When splitting a huge page, we should set all small pages as dirty if
> > > > >> the original huge page has the dirty bit set before.  Otherwise we'll
> > > > >> lose the original dirty bit.
> > > > >
> > > > > We don't lose it. It got transfered to struct page flag:
> > > > >
> > > > > 	if (pmd_dirty(old_pmd))
> > > > > 		SetPageDirty(page);
> > > > >
> > > > 
> > > > Plus, when split_huge_page_to_list() splits a THP, its subroutine __split_huge_page()
> > > > propagates the dirty bit in the head page flag to all subpages in __split_huge_page_tail().
> > > 
> > > Hi, Kirill, Zi,
> > > 
> > > Thanks for your responses!
> > > 
> > > Though in my test the huge page seems to be splitted not by
> > > split_huge_page_to_list() but by explicit calls to
> > > change_protection().  The stack looks like this (again, this is a
> > > customized kernel, and I added an explicit dump_stack() there):
> > > 
> > >   kernel:  dump_stack+0x5c/0x7b
> > >   kernel:  __split_huge_pmd+0x192/0xdc0
> > >   kernel:  ? update_load_avg+0x8b/0x550
> > >   kernel:  ? update_load_avg+0x8b/0x550
> > >   kernel:  ? account_entity_enqueue+0xc5/0xf0
> > >   kernel:  ? enqueue_entity+0x112/0x650
> > >   kernel:  change_protection+0x3a2/0xab0
> > >   kernel:  mwriteprotect_range+0xdd/0x110
> > >   kernel:  userfaultfd_ioctl+0x50b/0x1210
> > >   kernel:  ? do_futex+0x2cf/0xb20
> > >   kernel:  ? tty_write+0x1d2/0x2f0
> > >   kernel:  ? do_vfs_ioctl+0x9f/0x610
> > >   kernel:  do_vfs_ioctl+0x9f/0x610
> > >   kernel:  ? __x64_sys_futex+0x88/0x180
> > >   kernel:  ksys_ioctl+0x70/0x80
> > >   kernel:  __x64_sys_ioctl+0x16/0x20
> > >   kernel:  do_syscall_64+0x55/0x150
> > >   kernel:  entry_SYSCALL_64_after_hwframe+0x44/0xa9
> > > 
> > > At the very time the userspace is sending an UFFDIO_WRITEPROTECT ioctl
> > > to kernel space, which is handled by mwriteprotect_range().  In case
> > > you'd like to refer to the kernel, it's basically this one from
> > > Andrea's (with very trivial changes):
> > > 
> > >   https://git.kernel.org/pub/scm/linux/kernel/git/andrea/aa.git userfault
> > > 
> > > So... do we have two paths to split the huge pages separately?
> > 
> > We have two entiries that can be split: page table enties and underlying
> > compound page.
> > 
> > split_huge_pmd() (and variants of it) split the PMD entry into a PTE page
> > table. It doens't touch underlying compound page. The page still can be
> > mapped in other place as huge.
> > 
> > split_huge_page() (and ivariants of it) split compound page into a number
> > of 4k (or whatever PAGE_SIZE is). The operation requires splitting all
> > PMD, but not other way around.
> > 
> > > 
> > > Another (possibly very naive) question is: could any of you hint me
> > > how the page dirty bit is finally applied to the PTEs?  These two
> > > dirty flags confused me for a few days already (the SetPageDirty() one
> > > which sets the page dirty flag, and the pte_mkdirty() which sets that
> > > onto the real PTEs).
> > 
> > Dirty bit from page table entries transferes to sturct page flug and used
> > for decision making in reclaim path.
> 
> Thanks for explaining.  It's much clearer for me.
> 
> Though for the issue I have encountered, I am still confused on why
> that dirty bit can be ignored for the splitted PTEs.  Indeed we have:
> 
> 	if (pmd_dirty(old_pmd))
> 		SetPageDirty(page);
> 
> However to me this only transfers (as you explained above) the dirty
> bit (AFAIU it's possibly set by the hardware when the page is written)
> to the page struct of the compound page.  It did not really apply to
> every small page of the splitted huge page.  As you also explained,
> this __split_huge_pmd() only splits the PMD entry but it keeps the
> compound huge page there, then IMHO it should also apply the dirty
> bits from the huge page to all the small page entries, no?

The bit on compound page represents all small subpages. PageDirty() on any
subpage will return you true if the compound page is dirty.

> These dirty bits are really important to my scenario since AFAIU the
> change_protection() call is using these dirty bits to decide whether
> it should append the WRITE bit - it finally corresponds to the lines
> in change_pte_range():
> 
>         /* Avoid taking write faults for known dirty pages */
>         if (dirty_accountable && pte_dirty(ptent) &&
>                         (pte_soft_dirty(ptent) ||
>                                 !(vma->vm_flags & VM_SOFTDIRTY))) {
>                 ptent = pte_mkwrite(ptent);
>         }
> 
> So when mprotect() with that range (my case is UFFDIO_WRITEPROTECT,
> which is similar) although we pass in the new protocol with VM_WRITE
> here it'll still mask it since the dirty bit is not set, then the
> userspace program (in my case, the QEMU thread that handles write
> protect failures) can never fixup the write-protected page fault.

I don't follow here.

The code you quoting above is an apportunistic optimization and should not
be mission-critical. The dirty and writable bits can go away as soon as
you drop page table lock for the page.
Jerome Glisse Sept. 6, 2018, 2:17 p.m. UTC | #9
On Thu, Sep 06, 2018 at 07:39:33PM +0800, Peter Xu wrote:
> On Wed, Sep 05, 2018 at 03:55:22PM +0300, Kirill A. Shutemov wrote:
> > On Wed, Sep 05, 2018 at 03:30:37PM +0800, Peter Xu wrote:
> > > On Tue, Sep 04, 2018 at 10:00:28AM -0400, Zi Yan wrote:
> > > > On 4 Sep 2018, at 4:01, Kirill A. Shutemov wrote:
> > > > 
> > > > > On Tue, Sep 04, 2018 at 03:55:10PM +0800, Peter Xu wrote:
> > > > >> When splitting a huge page, we should set all small pages as dirty if
> > > > >> the original huge page has the dirty bit set before.  Otherwise we'll
> > > > >> lose the original dirty bit.
> > > > >
> > > > > We don't lose it. It got transfered to struct page flag:
> > > > >
> > > > > 	if (pmd_dirty(old_pmd))
> > > > > 		SetPageDirty(page);
> > > > >
> > > > 
> > > > Plus, when split_huge_page_to_list() splits a THP, its subroutine __split_huge_page()
> > > > propagates the dirty bit in the head page flag to all subpages in __split_huge_page_tail().
> > > 
> > > Hi, Kirill, Zi,
> > > 
> > > Thanks for your responses!
> > > 
> > > Though in my test the huge page seems to be splitted not by
> > > split_huge_page_to_list() but by explicit calls to
> > > change_protection().  The stack looks like this (again, this is a
> > > customized kernel, and I added an explicit dump_stack() there):
> > > 
> > >   kernel:  dump_stack+0x5c/0x7b
> > >   kernel:  __split_huge_pmd+0x192/0xdc0
> > >   kernel:  ? update_load_avg+0x8b/0x550
> > >   kernel:  ? update_load_avg+0x8b/0x550
> > >   kernel:  ? account_entity_enqueue+0xc5/0xf0
> > >   kernel:  ? enqueue_entity+0x112/0x650
> > >   kernel:  change_protection+0x3a2/0xab0
> > >   kernel:  mwriteprotect_range+0xdd/0x110
> > >   kernel:  userfaultfd_ioctl+0x50b/0x1210
> > >   kernel:  ? do_futex+0x2cf/0xb20
> > >   kernel:  ? tty_write+0x1d2/0x2f0
> > >   kernel:  ? do_vfs_ioctl+0x9f/0x610
> > >   kernel:  do_vfs_ioctl+0x9f/0x610
> > >   kernel:  ? __x64_sys_futex+0x88/0x180
> > >   kernel:  ksys_ioctl+0x70/0x80
> > >   kernel:  __x64_sys_ioctl+0x16/0x20
> > >   kernel:  do_syscall_64+0x55/0x150
> > >   kernel:  entry_SYSCALL_64_after_hwframe+0x44/0xa9
> > > 
> > > At the very time the userspace is sending an UFFDIO_WRITEPROTECT ioctl
> > > to kernel space, which is handled by mwriteprotect_range().  In case
> > > you'd like to refer to the kernel, it's basically this one from
> > > Andrea's (with very trivial changes):
> > > 
> > >   https://git.kernel.org/pub/scm/linux/kernel/git/andrea/aa.git userfault
> > > 
> > > So... do we have two paths to split the huge pages separately?
> > 
> > We have two entiries that can be split: page table enties and underlying
> > compound page.
> > 
> > split_huge_pmd() (and variants of it) split the PMD entry into a PTE page
> > table. It doens't touch underlying compound page. The page still can be
> > mapped in other place as huge.
> > 
> > split_huge_page() (and ivariants of it) split compound page into a number
> > of 4k (or whatever PAGE_SIZE is). The operation requires splitting all
> > PMD, but not other way around.
> > 
> > > 
> > > Another (possibly very naive) question is: could any of you hint me
> > > how the page dirty bit is finally applied to the PTEs?  These two
> > > dirty flags confused me for a few days already (the SetPageDirty() one
> > > which sets the page dirty flag, and the pte_mkdirty() which sets that
> > > onto the real PTEs).
> > 
> > Dirty bit from page table entries transferes to sturct page flug and used
> > for decision making in reclaim path.
> 
> Thanks for explaining.  It's much clearer for me.
> 
> Though for the issue I have encountered, I am still confused on why
> that dirty bit can be ignored for the splitted PTEs.  Indeed we have:
> 
> 	if (pmd_dirty(old_pmd))
> 		SetPageDirty(page);
> 
> However to me this only transfers (as you explained above) the dirty
> bit (AFAIU it's possibly set by the hardware when the page is written)
> to the page struct of the compound page.  It did not really apply to
> every small page of the splitted huge page.  As you also explained,
> this __split_huge_pmd() only splits the PMD entry but it keeps the
> compound huge page there, then IMHO it should also apply the dirty
> bits from the huge page to all the small page entries, no?
> 
> These dirty bits are really important to my scenario since AFAIU the
> change_protection() call is using these dirty bits to decide whether
> it should append the WRITE bit - it finally corresponds to the lines
> in change_pte_range():
> 
>         /* Avoid taking write faults for known dirty pages */
>         if (dirty_accountable && pte_dirty(ptent) &&
>                         (pte_soft_dirty(ptent) ||
>                                 !(vma->vm_flags & VM_SOFTDIRTY))) {
>                 ptent = pte_mkwrite(ptent);
>         }
> 
> So when mprotect() with that range (my case is UFFDIO_WRITEPROTECT,
> which is similar) although we pass in the new protocol with VM_WRITE
> here it'll still mask it since the dirty bit is not set, then the
> userspace program (in my case, the QEMU thread that handles write
> protect failures) can never fixup the write-protected page fault.
> 
> Am I missing anything important here?
> 

For reference mwriteprotect_range code:
https://git.kernel.org/pub/scm/linux/kernel/git/andrea/aa.git/commit/?id=b16cb9fcb76bec59cbe1427e73246dc81a4942e2

mwriteprotect_range usage:
https://git.kernel.org/pub/scm/linux/kernel/git/andrea/aa.git/commit/?id=aa97daa6e54f2cfed1a6f1f38f9629608b8aadcc

Maybe you can describe the issues you are having because i admit
not seing what is wrong here. When mwriteprotect_range is call
with UFFDIO_WRITEPROTECT_MODE_WP then dirty_accountable is false
and thus above if is not taken and pte is properly write protected
and thus UFFDIO_WRITEPROTECT_MODE_WP do what its name suggest no
matter what is the pte dirty state.

I am not sure what UFFDIO_WRITEPROTECT_MODE_DONTWAKE means as this
is the one that might depends on the pte dirty state. So without
knowing what UFFDIO_WRITEPROTECT_MODE_DONTWAKE do, i am not sure
i see any bug here.

Cheers,
Jérôme
Peter Xu Sept. 7, 2018, 4:35 a.m. UTC | #10
On Thu, Sep 06, 2018 at 05:08:42PM +0300, Kirill A. Shutemov wrote:
> On Thu, Sep 06, 2018 at 07:39:33PM +0800, Peter Xu wrote:
> > On Wed, Sep 05, 2018 at 03:55:22PM +0300, Kirill A. Shutemov wrote:
> > > On Wed, Sep 05, 2018 at 03:30:37PM +0800, Peter Xu wrote:
> > > > On Tue, Sep 04, 2018 at 10:00:28AM -0400, Zi Yan wrote:
> > > > > On 4 Sep 2018, at 4:01, Kirill A. Shutemov wrote:
> > > > > 
> > > > > > On Tue, Sep 04, 2018 at 03:55:10PM +0800, Peter Xu wrote:
> > > > > >> When splitting a huge page, we should set all small pages as dirty if
> > > > > >> the original huge page has the dirty bit set before.  Otherwise we'll
> > > > > >> lose the original dirty bit.
> > > > > >
> > > > > > We don't lose it. It got transfered to struct page flag:
> > > > > >
> > > > > > 	if (pmd_dirty(old_pmd))
> > > > > > 		SetPageDirty(page);
> > > > > >
> > > > > 
> > > > > Plus, when split_huge_page_to_list() splits a THP, its subroutine __split_huge_page()
> > > > > propagates the dirty bit in the head page flag to all subpages in __split_huge_page_tail().
> > > > 
> > > > Hi, Kirill, Zi,
> > > > 
> > > > Thanks for your responses!
> > > > 
> > > > Though in my test the huge page seems to be splitted not by
> > > > split_huge_page_to_list() but by explicit calls to
> > > > change_protection().  The stack looks like this (again, this is a
> > > > customized kernel, and I added an explicit dump_stack() there):
> > > > 
> > > >   kernel:  dump_stack+0x5c/0x7b
> > > >   kernel:  __split_huge_pmd+0x192/0xdc0
> > > >   kernel:  ? update_load_avg+0x8b/0x550
> > > >   kernel:  ? update_load_avg+0x8b/0x550
> > > >   kernel:  ? account_entity_enqueue+0xc5/0xf0
> > > >   kernel:  ? enqueue_entity+0x112/0x650
> > > >   kernel:  change_protection+0x3a2/0xab0
> > > >   kernel:  mwriteprotect_range+0xdd/0x110
> > > >   kernel:  userfaultfd_ioctl+0x50b/0x1210
> > > >   kernel:  ? do_futex+0x2cf/0xb20
> > > >   kernel:  ? tty_write+0x1d2/0x2f0
> > > >   kernel:  ? do_vfs_ioctl+0x9f/0x610
> > > >   kernel:  do_vfs_ioctl+0x9f/0x610
> > > >   kernel:  ? __x64_sys_futex+0x88/0x180
> > > >   kernel:  ksys_ioctl+0x70/0x80
> > > >   kernel:  __x64_sys_ioctl+0x16/0x20
> > > >   kernel:  do_syscall_64+0x55/0x150
> > > >   kernel:  entry_SYSCALL_64_after_hwframe+0x44/0xa9
> > > > 
> > > > At the very time the userspace is sending an UFFDIO_WRITEPROTECT ioctl
> > > > to kernel space, which is handled by mwriteprotect_range().  In case
> > > > you'd like to refer to the kernel, it's basically this one from
> > > > Andrea's (with very trivial changes):
> > > > 
> > > >   https://git.kernel.org/pub/scm/linux/kernel/git/andrea/aa.git userfault
> > > > 
> > > > So... do we have two paths to split the huge pages separately?
> > > 
> > > We have two entiries that can be split: page table enties and underlying
> > > compound page.
> > > 
> > > split_huge_pmd() (and variants of it) split the PMD entry into a PTE page
> > > table. It doens't touch underlying compound page. The page still can be
> > > mapped in other place as huge.
> > > 
> > > split_huge_page() (and ivariants of it) split compound page into a number
> > > of 4k (or whatever PAGE_SIZE is). The operation requires splitting all
> > > PMD, but not other way around.
> > > 
> > > > 
> > > > Another (possibly very naive) question is: could any of you hint me
> > > > how the page dirty bit is finally applied to the PTEs?  These two
> > > > dirty flags confused me for a few days already (the SetPageDirty() one
> > > > which sets the page dirty flag, and the pte_mkdirty() which sets that
> > > > onto the real PTEs).
> > > 
> > > Dirty bit from page table entries transferes to sturct page flug and used
> > > for decision making in reclaim path.
> > 
> > Thanks for explaining.  It's much clearer for me.
> > 
> > Though for the issue I have encountered, I am still confused on why
> > that dirty bit can be ignored for the splitted PTEs.  Indeed we have:
> > 
> > 	if (pmd_dirty(old_pmd))
> > 		SetPageDirty(page);
> > 
> > However to me this only transfers (as you explained above) the dirty
> > bit (AFAIU it's possibly set by the hardware when the page is written)
> > to the page struct of the compound page.  It did not really apply to
> > every small page of the splitted huge page.  As you also explained,
> > this __split_huge_pmd() only splits the PMD entry but it keeps the
> > compound huge page there, then IMHO it should also apply the dirty
> > bits from the huge page to all the small page entries, no?
> 
> The bit on compound page represents all small subpages. PageDirty() on any
> subpage will return you true if the compound page is dirty.

Ah I didn't notice this before (since PageDirty is defined with
PF_HEAD), thanks for pointing out.

> 
> > These dirty bits are really important to my scenario since AFAIU the
> > change_protection() call is using these dirty bits to decide whether
> > it should append the WRITE bit - it finally corresponds to the lines
> > in change_pte_range():
> > 
> >         /* Avoid taking write faults for known dirty pages */
> >         if (dirty_accountable && pte_dirty(ptent) &&
> >                         (pte_soft_dirty(ptent) ||
> >                                 !(vma->vm_flags & VM_SOFTDIRTY))) {
> >                 ptent = pte_mkwrite(ptent);
> >         }
> > 
> > So when mprotect() with that range (my case is UFFDIO_WRITEPROTECT,
> > which is similar) although we pass in the new protocol with VM_WRITE
> > here it'll still mask it since the dirty bit is not set, then the
> > userspace program (in my case, the QEMU thread that handles write
> > protect failures) can never fixup the write-protected page fault.
> 
> I don't follow here.
> 
> The code you quoting above is an apportunistic optimization and should not
> be mission-critical. The dirty and writable bits can go away as soon as
> you drop page table lock for the page.

Indeed it's an optimization, IIUC it tries to avoid an extra but
possibly useless write-protect page fault when the dirty bits are
already set after all.  However that's a bit trickly here - in my use
case the write-protect page faults will be handled by one of the QEMU
thread that reads the userfaultfd handle, so the fault must be handled
there instead of inside kernel otherwise there'll be nested page
faults forever (and userfaultfd will detect that then send a SIGBUS
instead).

I'll try to explain with some more details on how I understand what
happened.  This should also answer Zi's question so I'll avoid
replying twice there.  Please feel free to correct me.

Firstly, below should be the correct steps to handle a userspace write
protect page fault using Andrea's userfault-wp tree (I only mentioned
the page fault steps and ignored most of the irrelevant setup
procedures):

1. QEMU write-protects page P using UFFDIO_WRITEPROTECT ioctl, then
   the write bit removed from PTE, so QEMU can capture any further
   writes to the page

   ... (time passes)...

2. [vCPU thread] Guest writes to the page P, trigger wp page fault

3. [vCPU thread] Since the page (and the whole vma) is tracked by
   userfault-wp, it goes into handle_userfault() to notify userspace
   about the page fault and waits...

4. [userfault thread] Gets the message about the page fault, do
   anything it like with the page P (mostly copy it somewhere), and
   fixup the page fault by another UFFDIO_WRITEPROTECT ioctl, this
   time to reset the write bit.  After that, it'll wake up the vCPU
   thread

5. [vCPU thread] Got waked up, retry the page fault by returning a
   VM_FAULT_RETRY in handle_userfault().  Then this time we'll see the
   PTE with write bit set correctly.  vCPU continues execution.

Then let's consider THP here, where we might miss the dirty page for
the PTE of the small page P.  In that case at step (4) when we want to
recover the write bit we'll fail since the dirty bit is missing in the
small PTE, so the write bit will still be cleared (expecting that the
next page fault will fill it up).  However in step (5) we can't really
fill in the write bit since we'll fault again into the
handle_userfault() before that happens and then it goes back to step
(3) then it can actualy loop forever if without the loop detection
code in handle_userfault().

So I think now I understand that setting up the dirty bit in the
compound page should be enough, then would below change acceptable
instead?

diff --git a/mm/mprotect.c b/mm/mprotect.c
index 6d331620b9e5..0d4a8129a5e7 100644
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -73,6 +73,7 @@ static unsigned long change_pte_range(struct vm_area_struct *vma, pmd_t *pmd,                                                                             
                if (pte_present(oldpte)) {
                        pte_t ptent;
                        bool preserve_write = prot_numa && pte_write(oldpte);
+                       bool dirty;

                        /*
                         * Avoid trapping faults against the zero or KSM
@@ -115,8 +116,18 @@ static unsigned long change_pte_range(struct vm_area_struct *vma, pmd_t *pmd,                                                                          
                        if (preserve_write)
                                ptent = pte_mk_savedwrite(ptent);

+                       /*
+                        * The extra PageDirty() check will make sure
+                        * we'll capture the dirty page even if the
+                        * PTE dirty bit unset.  One case is when the
+                        * PTE is splitted from a huge PMD, in that
+                        * case the dirty flag might only be set on
+                        * the compound page instead of this PTE.
+                        */
+                       dirty = pte_dirty(ptent) || PageDirty(pte_page(ptent));
+
                        /* Avoid taking write faults for known dirty pages */
-                       if (dirty_accountable && pte_dirty(ptent) &&
+                       if (dirty_accountable && dirty &&
                                        (pte_soft_dirty(ptent) ||
                                         !(vma->vm_flags & VM_SOFTDIRTY))) {
                                ptent = pte_mkwrite(ptent);

I tested that this change can also fix my problem (QEMU will not get
SIGBUS after write protection starts).

Thanks,
Jerome Glisse Sept. 7, 2018, 5:54 p.m. UTC | #11
On Fri, Sep 07, 2018 at 12:35:24PM +0800, Peter Xu wrote:
> On Thu, Sep 06, 2018 at 05:08:42PM +0300, Kirill A. Shutemov wrote:
> > On Thu, Sep 06, 2018 at 07:39:33PM +0800, Peter Xu wrote:
> > > On Wed, Sep 05, 2018 at 03:55:22PM +0300, Kirill A. Shutemov wrote:
> > > > On Wed, Sep 05, 2018 at 03:30:37PM +0800, Peter Xu wrote:
> > > > > On Tue, Sep 04, 2018 at 10:00:28AM -0400, Zi Yan wrote:
> > > > > > On 4 Sep 2018, at 4:01, Kirill A. Shutemov wrote:
> > > > > > 
> > > > > > > On Tue, Sep 04, 2018 at 03:55:10PM +0800, Peter Xu wrote:
> > > > > > >> When splitting a huge page, we should set all small pages as dirty if
> > > > > > >> the original huge page has the dirty bit set before.  Otherwise we'll
> > > > > > >> lose the original dirty bit.
> > > > > > >
> > > > > > > We don't lose it. It got transfered to struct page flag:
> > > > > > >
> > > > > > > 	if (pmd_dirty(old_pmd))
> > > > > > > 		SetPageDirty(page);
> > > > > > >
> > > > > > 
> > > > > > Plus, when split_huge_page_to_list() splits a THP, its subroutine __split_huge_page()
> > > > > > propagates the dirty bit in the head page flag to all subpages in __split_huge_page_tail().
> > > > > 
> > > > > Hi, Kirill, Zi,
> > > > > 
> > > > > Thanks for your responses!
> > > > > 
> > > > > Though in my test the huge page seems to be splitted not by
> > > > > split_huge_page_to_list() but by explicit calls to
> > > > > change_protection().  The stack looks like this (again, this is a
> > > > > customized kernel, and I added an explicit dump_stack() there):
> > > > > 
> > > > >   kernel:  dump_stack+0x5c/0x7b
> > > > >   kernel:  __split_huge_pmd+0x192/0xdc0
> > > > >   kernel:  ? update_load_avg+0x8b/0x550
> > > > >   kernel:  ? update_load_avg+0x8b/0x550
> > > > >   kernel:  ? account_entity_enqueue+0xc5/0xf0
> > > > >   kernel:  ? enqueue_entity+0x112/0x650
> > > > >   kernel:  change_protection+0x3a2/0xab0
> > > > >   kernel:  mwriteprotect_range+0xdd/0x110
> > > > >   kernel:  userfaultfd_ioctl+0x50b/0x1210
> > > > >   kernel:  ? do_futex+0x2cf/0xb20
> > > > >   kernel:  ? tty_write+0x1d2/0x2f0
> > > > >   kernel:  ? do_vfs_ioctl+0x9f/0x610
> > > > >   kernel:  do_vfs_ioctl+0x9f/0x610
> > > > >   kernel:  ? __x64_sys_futex+0x88/0x180
> > > > >   kernel:  ksys_ioctl+0x70/0x80
> > > > >   kernel:  __x64_sys_ioctl+0x16/0x20
> > > > >   kernel:  do_syscall_64+0x55/0x150
> > > > >   kernel:  entry_SYSCALL_64_after_hwframe+0x44/0xa9
> > > > > 
> > > > > At the very time the userspace is sending an UFFDIO_WRITEPROTECT ioctl
> > > > > to kernel space, which is handled by mwriteprotect_range().  In case
> > > > > you'd like to refer to the kernel, it's basically this one from
> > > > > Andrea's (with very trivial changes):
> > > > > 
> > > > >   https://git.kernel.org/pub/scm/linux/kernel/git/andrea/aa.git userfault
> > > > > 
> > > > > So... do we have two paths to split the huge pages separately?
> > > > 
> > > > We have two entiries that can be split: page table enties and underlying
> > > > compound page.
> > > > 
> > > > split_huge_pmd() (and variants of it) split the PMD entry into a PTE page
> > > > table. It doens't touch underlying compound page. The page still can be
> > > > mapped in other place as huge.
> > > > 
> > > > split_huge_page() (and ivariants of it) split compound page into a number
> > > > of 4k (or whatever PAGE_SIZE is). The operation requires splitting all
> > > > PMD, but not other way around.
> > > > 
> > > > > 
> > > > > Another (possibly very naive) question is: could any of you hint me
> > > > > how the page dirty bit is finally applied to the PTEs?  These two
> > > > > dirty flags confused me for a few days already (the SetPageDirty() one
> > > > > which sets the page dirty flag, and the pte_mkdirty() which sets that
> > > > > onto the real PTEs).
> > > > 
> > > > Dirty bit from page table entries transferes to sturct page flug and used
> > > > for decision making in reclaim path.
> > > 
> > > Thanks for explaining.  It's much clearer for me.
> > > 
> > > Though for the issue I have encountered, I am still confused on why
> > > that dirty bit can be ignored for the splitted PTEs.  Indeed we have:
> > > 
> > > 	if (pmd_dirty(old_pmd))
> > > 		SetPageDirty(page);
> > > 
> > > However to me this only transfers (as you explained above) the dirty
> > > bit (AFAIU it's possibly set by the hardware when the page is written)
> > > to the page struct of the compound page.  It did not really apply to
> > > every small page of the splitted huge page.  As you also explained,
> > > this __split_huge_pmd() only splits the PMD entry but it keeps the
> > > compound huge page there, then IMHO it should also apply the dirty
> > > bits from the huge page to all the small page entries, no?
> > 
> > The bit on compound page represents all small subpages. PageDirty() on any
> > subpage will return you true if the compound page is dirty.
> 
> Ah I didn't notice this before (since PageDirty is defined with
> PF_HEAD), thanks for pointing out.
> 
> > 
> > > These dirty bits are really important to my scenario since AFAIU the
> > > change_protection() call is using these dirty bits to decide whether
> > > it should append the WRITE bit - it finally corresponds to the lines
> > > in change_pte_range():
> > > 
> > >         /* Avoid taking write faults for known dirty pages */
> > >         if (dirty_accountable && pte_dirty(ptent) &&
> > >                         (pte_soft_dirty(ptent) ||
> > >                                 !(vma->vm_flags & VM_SOFTDIRTY))) {
> > >                 ptent = pte_mkwrite(ptent);
> > >         }
> > > 
> > > So when mprotect() with that range (my case is UFFDIO_WRITEPROTECT,
> > > which is similar) although we pass in the new protocol with VM_WRITE
> > > here it'll still mask it since the dirty bit is not set, then the
> > > userspace program (in my case, the QEMU thread that handles write
> > > protect failures) can never fixup the write-protected page fault.
> > 
> > I don't follow here.
> > 
> > The code you quoting above is an apportunistic optimization and should not
> > be mission-critical. The dirty and writable bits can go away as soon as
> > you drop page table lock for the page.
> 
> Indeed it's an optimization, IIUC it tries to avoid an extra but
> possibly useless write-protect page fault when the dirty bits are
> already set after all.  However that's a bit trickly here - in my use
> case the write-protect page faults will be handled by one of the QEMU
> thread that reads the userfaultfd handle, so the fault must be handled
> there instead of inside kernel otherwise there'll be nested page
> faults forever (and userfaultfd will detect that then send a SIGBUS
> instead).
> 
> I'll try to explain with some more details on how I understand what
> happened.  This should also answer Zi's question so I'll avoid
> replying twice there.  Please feel free to correct me.
> 
> Firstly, below should be the correct steps to handle a userspace write
> protect page fault using Andrea's userfault-wp tree (I only mentioned
> the page fault steps and ignored most of the irrelevant setup
> procedures):
> 
> 1. QEMU write-protects page P using UFFDIO_WRITEPROTECT ioctl, then
>    the write bit removed from PTE, so QEMU can capture any further
>    writes to the page
> 
>    ... (time passes)...

UFFDIO_WRITEPROTECT with UFFDIO_WRITEPROTECT_MODE_WP

> 
> 2. [vCPU thread] Guest writes to the page P, trigger wp page fault
> 
> 3. [vCPU thread] Since the page (and the whole vma) is tracked by
>    userfault-wp, it goes into handle_userfault() to notify userspace
>    about the page fault and waits...
> 
> 4. [userfault thread] Gets the message about the page fault, do
>    anything it like with the page P (mostly copy it somewhere), and
>    fixup the page fault by another UFFDIO_WRITEPROTECT ioctl, this
>    time to reset the write bit.  After that, it'll wake up the vCPU
>    thread

UFFDIO_WRITEPROTECT with !UFFDIO_WRITEPROTECT_MODE_WP

It confused me when looking at code:
https://git.kernel.org/pub/scm/linux/kernel/git/andrea/aa.git/commit/?id=aa97daa6e54f2cfed1a6f1f38f9629608b8aadcc

> 
> 5. [vCPU thread] Got waked up, retry the page fault by returning a
>    VM_FAULT_RETRY in handle_userfault().  Then this time we'll see the
>    PTE with write bit set correctly.  vCPU continues execution.
> 
> Then let's consider THP here, where we might miss the dirty page for
> the PTE of the small page P.  In that case at step (4) when we want to
> recover the write bit we'll fail since the dirty bit is missing in the
> small PTE, so the write bit will still be cleared (expecting that the
> next page fault will fill it up).  However in step (5) we can't really
> fill in the write bit since we'll fault again into the
> handle_userfault() before that happens and then it goes back to step
> (3) then it can actualy loop forever if without the loop detection
> code in handle_userfault().
> 
> So I think now I understand that setting up the dirty bit in the
> compound page should be enough, then would below change acceptable
> instead?
> 
> diff --git a/mm/mprotect.c b/mm/mprotect.c
> index 6d331620b9e5..0d4a8129a5e7 100644
> --- a/mm/mprotect.c
> +++ b/mm/mprotect.c
> @@ -73,6 +73,7 @@ static unsigned long change_pte_range(struct vm_area_struct *vma, pmd_t *pmd,                                                                             
>                 if (pte_present(oldpte)) {
>                         pte_t ptent;
>                         bool preserve_write = prot_numa && pte_write(oldpte);
> +                       bool dirty;
> 
>                         /*
>                          * Avoid trapping faults against the zero or KSM
> @@ -115,8 +116,18 @@ static unsigned long change_pte_range(struct vm_area_struct *vma, pmd_t *pmd,                                                                          
>                         if (preserve_write)
>                                 ptent = pte_mk_savedwrite(ptent);
> 
> +                       /*
> +                        * The extra PageDirty() check will make sure
> +                        * we'll capture the dirty page even if the
> +                        * PTE dirty bit unset.  One case is when the
> +                        * PTE is splitted from a huge PMD, in that
> +                        * case the dirty flag might only be set on
> +                        * the compound page instead of this PTE.
> +                        */
> +                       dirty = pte_dirty(ptent) || PageDirty(pte_page(ptent));
> +
>                         /* Avoid taking write faults for known dirty pages */
> -                       if (dirty_accountable && pte_dirty(ptent) &&
> +                       if (dirty_accountable && dirty &&
>                                         (pte_soft_dirty(ptent) ||
>                                          !(vma->vm_flags & VM_SOFTDIRTY))) {
>                                 ptent = pte_mkwrite(ptent);
> 
> I tested that this change can also fix my problem (QEMU will not get
> SIGBUS after write protection starts).

This is wrong mwriteprotect_range() should already properly set pte
entry to non write protect:

https://git.kernel.org/pub/scm/linux/kernel/git/andrea/aa.git/commit/?id=b16cb9fcb76bec59cbe1427e73246dc81a4942e2

	if (enable_wp)
		newprot = vm_get_page_prot(dst_vma->vm_flags & ~(VM_WRITE));
	else
		newprot = vm_get_page_prot(dst_vma->vm_flags);

So it seems that the vm_flags do not have VM_WRITE set.

To me this all points out so a bug somewhere in userspace or a miss use
of userfaultfd. Here is what i believe to be the chain of event:

  1 QEMU or vCPU write to the affected ufaultfd range and this set the pte
    dirty bit on all the entry in the affected range

  ...

  2 QEMU write protect the affected range with UFFDIO_WRITEPROTECT and
    UFFDIO_WRITEPROTECT_MODE_WP flag set. This clear the pte write bit
    and thus write protect the range. Because it is anonymous memory and
    soft dirty is likely disabled, the dirty bit set in 1 is still there
    and is preserved.

  3 vCPU tries to write to the affected range. This trigger a userfaultfd
    and QEMU handle it and call UFFDIO_WRITEPROTECT but this time without
    UFFDIO_WRITEPROTECT_MODE_WP flag (ie to unprotect).

    For some reasons the affected vma do not have the VM_WRITE flags set
    anymore probably through mprotect() syscall by QEMU. So that the new
    prot for the pte do not have the write bit set.

    But because of the change_pte_range() optimization and because the
    pte dirty bit is set from 1 then the pte write bit set which is wrong
    as the VM_WRITE have been clear.


So hence there is a bug in QEMU somewhere is my best guess.

Note that this means that the:
  https://git.kernel.org/pub/scm/linux/kernel/git/andrea/aa.git/commit/?id=b16cb9fcb76bec59cbe1427e73246dc81a4942e2

Needs to be updated with:
-change_protection(dst_vma, start, start + len, newprot, !enable_wp, 0);
+change_protection(dst_vma, start, start + len, newprot, 0, 0);

To avoid such bug to creep in and have bad side effect like allowing
someone to unwrite protect a range of memory.

Cheers,
Jérôme
Peter Xu Sept. 10, 2018, 4:07 a.m. UTC | #12
On Fri, Sep 07, 2018 at 01:54:35PM -0400, Jerome Glisse wrote:
> On Fri, Sep 07, 2018 at 12:35:24PM +0800, Peter Xu wrote:
> > On Thu, Sep 06, 2018 at 05:08:42PM +0300, Kirill A. Shutemov wrote:
> > > On Thu, Sep 06, 2018 at 07:39:33PM +0800, Peter Xu wrote:
> > > > On Wed, Sep 05, 2018 at 03:55:22PM +0300, Kirill A. Shutemov wrote:
> > > > > On Wed, Sep 05, 2018 at 03:30:37PM +0800, Peter Xu wrote:
> > > > > > On Tue, Sep 04, 2018 at 10:00:28AM -0400, Zi Yan wrote:
> > > > > > > On 4 Sep 2018, at 4:01, Kirill A. Shutemov wrote:
> > > > > > > 
> > > > > > > > On Tue, Sep 04, 2018 at 03:55:10PM +0800, Peter Xu wrote:
> > > > > > > >> When splitting a huge page, we should set all small pages as dirty if
> > > > > > > >> the original huge page has the dirty bit set before.  Otherwise we'll
> > > > > > > >> lose the original dirty bit.
> > > > > > > >
> > > > > > > > We don't lose it. It got transfered to struct page flag:
> > > > > > > >
> > > > > > > > 	if (pmd_dirty(old_pmd))
> > > > > > > > 		SetPageDirty(page);
> > > > > > > >
> > > > > > > 
> > > > > > > Plus, when split_huge_page_to_list() splits a THP, its subroutine __split_huge_page()
> > > > > > > propagates the dirty bit in the head page flag to all subpages in __split_huge_page_tail().
> > > > > > 
> > > > > > Hi, Kirill, Zi,
> > > > > > 
> > > > > > Thanks for your responses!
> > > > > > 
> > > > > > Though in my test the huge page seems to be splitted not by
> > > > > > split_huge_page_to_list() but by explicit calls to
> > > > > > change_protection().  The stack looks like this (again, this is a
> > > > > > customized kernel, and I added an explicit dump_stack() there):
> > > > > > 
> > > > > >   kernel:  dump_stack+0x5c/0x7b
> > > > > >   kernel:  __split_huge_pmd+0x192/0xdc0
> > > > > >   kernel:  ? update_load_avg+0x8b/0x550
> > > > > >   kernel:  ? update_load_avg+0x8b/0x550
> > > > > >   kernel:  ? account_entity_enqueue+0xc5/0xf0
> > > > > >   kernel:  ? enqueue_entity+0x112/0x650
> > > > > >   kernel:  change_protection+0x3a2/0xab0
> > > > > >   kernel:  mwriteprotect_range+0xdd/0x110
> > > > > >   kernel:  userfaultfd_ioctl+0x50b/0x1210
> > > > > >   kernel:  ? do_futex+0x2cf/0xb20
> > > > > >   kernel:  ? tty_write+0x1d2/0x2f0
> > > > > >   kernel:  ? do_vfs_ioctl+0x9f/0x610
> > > > > >   kernel:  do_vfs_ioctl+0x9f/0x610
> > > > > >   kernel:  ? __x64_sys_futex+0x88/0x180
> > > > > >   kernel:  ksys_ioctl+0x70/0x80
> > > > > >   kernel:  __x64_sys_ioctl+0x16/0x20
> > > > > >   kernel:  do_syscall_64+0x55/0x150
> > > > > >   kernel:  entry_SYSCALL_64_after_hwframe+0x44/0xa9
> > > > > > 
> > > > > > At the very time the userspace is sending an UFFDIO_WRITEPROTECT ioctl
> > > > > > to kernel space, which is handled by mwriteprotect_range().  In case
> > > > > > you'd like to refer to the kernel, it's basically this one from
> > > > > > Andrea's (with very trivial changes):
> > > > > > 
> > > > > >   https://git.kernel.org/pub/scm/linux/kernel/git/andrea/aa.git userfault
> > > > > > 
> > > > > > So... do we have two paths to split the huge pages separately?
> > > > > 
> > > > > We have two entiries that can be split: page table enties and underlying
> > > > > compound page.
> > > > > 
> > > > > split_huge_pmd() (and variants of it) split the PMD entry into a PTE page
> > > > > table. It doens't touch underlying compound page. The page still can be
> > > > > mapped in other place as huge.
> > > > > 
> > > > > split_huge_page() (and ivariants of it) split compound page into a number
> > > > > of 4k (or whatever PAGE_SIZE is). The operation requires splitting all
> > > > > PMD, but not other way around.
> > > > > 
> > > > > > 
> > > > > > Another (possibly very naive) question is: could any of you hint me
> > > > > > how the page dirty bit is finally applied to the PTEs?  These two
> > > > > > dirty flags confused me for a few days already (the SetPageDirty() one
> > > > > > which sets the page dirty flag, and the pte_mkdirty() which sets that
> > > > > > onto the real PTEs).
> > > > > 
> > > > > Dirty bit from page table entries transferes to sturct page flug and used
> > > > > for decision making in reclaim path.
> > > > 
> > > > Thanks for explaining.  It's much clearer for me.
> > > > 
> > > > Though for the issue I have encountered, I am still confused on why
> > > > that dirty bit can be ignored for the splitted PTEs.  Indeed we have:
> > > > 
> > > > 	if (pmd_dirty(old_pmd))
> > > > 		SetPageDirty(page);
> > > > 
> > > > However to me this only transfers (as you explained above) the dirty
> > > > bit (AFAIU it's possibly set by the hardware when the page is written)
> > > > to the page struct of the compound page.  It did not really apply to
> > > > every small page of the splitted huge page.  As you also explained,
> > > > this __split_huge_pmd() only splits the PMD entry but it keeps the
> > > > compound huge page there, then IMHO it should also apply the dirty
> > > > bits from the huge page to all the small page entries, no?
> > > 
> > > The bit on compound page represents all small subpages. PageDirty() on any
> > > subpage will return you true if the compound page is dirty.
> > 
> > Ah I didn't notice this before (since PageDirty is defined with
> > PF_HEAD), thanks for pointing out.
> > 
> > > 
> > > > These dirty bits are really important to my scenario since AFAIU the
> > > > change_protection() call is using these dirty bits to decide whether
> > > > it should append the WRITE bit - it finally corresponds to the lines
> > > > in change_pte_range():
> > > > 
> > > >         /* Avoid taking write faults for known dirty pages */
> > > >         if (dirty_accountable && pte_dirty(ptent) &&
> > > >                         (pte_soft_dirty(ptent) ||
> > > >                                 !(vma->vm_flags & VM_SOFTDIRTY))) {
> > > >                 ptent = pte_mkwrite(ptent);
> > > >         }
> > > > 
> > > > So when mprotect() with that range (my case is UFFDIO_WRITEPROTECT,
> > > > which is similar) although we pass in the new protocol with VM_WRITE
> > > > here it'll still mask it since the dirty bit is not set, then the
> > > > userspace program (in my case, the QEMU thread that handles write
> > > > protect failures) can never fixup the write-protected page fault.
> > > 
> > > I don't follow here.
> > > 
> > > The code you quoting above is an apportunistic optimization and should not
> > > be mission-critical. The dirty and writable bits can go away as soon as
> > > you drop page table lock for the page.
> > 
> > Indeed it's an optimization, IIUC it tries to avoid an extra but
> > possibly useless write-protect page fault when the dirty bits are
> > already set after all.  However that's a bit trickly here - in my use
> > case the write-protect page faults will be handled by one of the QEMU
> > thread that reads the userfaultfd handle, so the fault must be handled
> > there instead of inside kernel otherwise there'll be nested page
> > faults forever (and userfaultfd will detect that then send a SIGBUS
> > instead).
> > 
> > I'll try to explain with some more details on how I understand what
> > happened.  This should also answer Zi's question so I'll avoid
> > replying twice there.  Please feel free to correct me.
> > 
> > Firstly, below should be the correct steps to handle a userspace write
> > protect page fault using Andrea's userfault-wp tree (I only mentioned
> > the page fault steps and ignored most of the irrelevant setup
> > procedures):
> > 
> > 1. QEMU write-protects page P using UFFDIO_WRITEPROTECT ioctl, then
> >    the write bit removed from PTE, so QEMU can capture any further
> >    writes to the page
> > 
> >    ... (time passes)...
> 
> UFFDIO_WRITEPROTECT with UFFDIO_WRITEPROTECT_MODE_WP
> 
> > 
> > 2. [vCPU thread] Guest writes to the page P, trigger wp page fault
> > 
> > 3. [vCPU thread] Since the page (and the whole vma) is tracked by
> >    userfault-wp, it goes into handle_userfault() to notify userspace
> >    about the page fault and waits...
> > 
> > 4. [userfault thread] Gets the message about the page fault, do
> >    anything it like with the page P (mostly copy it somewhere), and
> >    fixup the page fault by another UFFDIO_WRITEPROTECT ioctl, this
> >    time to reset the write bit.  After that, it'll wake up the vCPU
> >    thread
> 
> UFFDIO_WRITEPROTECT with !UFFDIO_WRITEPROTECT_MODE_WP
> 
> It confused me when looking at code:
> https://git.kernel.org/pub/scm/linux/kernel/git/andrea/aa.git/commit/?id=aa97daa6e54f2cfed1a6f1f38f9629608b8aadcc
> 
> > 
> > 5. [vCPU thread] Got waked up, retry the page fault by returning a
> >    VM_FAULT_RETRY in handle_userfault().  Then this time we'll see the
> >    PTE with write bit set correctly.  vCPU continues execution.
> > 
> > Then let's consider THP here, where we might miss the dirty page for
> > the PTE of the small page P.  In that case at step (4) when we want to
> > recover the write bit we'll fail since the dirty bit is missing in the
> > small PTE, so the write bit will still be cleared (expecting that the
> > next page fault will fill it up).  However in step (5) we can't really
> > fill in the write bit since we'll fault again into the
> > handle_userfault() before that happens and then it goes back to step
> > (3) then it can actualy loop forever if without the loop detection
> > code in handle_userfault().
> > 
> > So I think now I understand that setting up the dirty bit in the
> > compound page should be enough, then would below change acceptable
> > instead?
> > 
> > diff --git a/mm/mprotect.c b/mm/mprotect.c
> > index 6d331620b9e5..0d4a8129a5e7 100644
> > --- a/mm/mprotect.c
> > +++ b/mm/mprotect.c
> > @@ -73,6 +73,7 @@ static unsigned long change_pte_range(struct vm_area_struct *vma, pmd_t *pmd,                                                                             
> >                 if (pte_present(oldpte)) {
> >                         pte_t ptent;
> >                         bool preserve_write = prot_numa && pte_write(oldpte);
> > +                       bool dirty;
> > 
> >                         /*
> >                          * Avoid trapping faults against the zero or KSM
> > @@ -115,8 +116,18 @@ static unsigned long change_pte_range(struct vm_area_struct *vma, pmd_t *pmd,                                                                          
> >                         if (preserve_write)
> >                                 ptent = pte_mk_savedwrite(ptent);
> > 
> > +                       /*
> > +                        * The extra PageDirty() check will make sure
> > +                        * we'll capture the dirty page even if the
> > +                        * PTE dirty bit unset.  One case is when the
> > +                        * PTE is splitted from a huge PMD, in that
> > +                        * case the dirty flag might only be set on
> > +                        * the compound page instead of this PTE.
> > +                        */
> > +                       dirty = pte_dirty(ptent) || PageDirty(pte_page(ptent));
> > +
> >                         /* Avoid taking write faults for known dirty pages */
> > -                       if (dirty_accountable && pte_dirty(ptent) &&
> > +                       if (dirty_accountable && dirty &&
> >                                         (pte_soft_dirty(ptent) ||
> >                                          !(vma->vm_flags & VM_SOFTDIRTY))) {
> >                                 ptent = pte_mkwrite(ptent);
> > 
> > I tested that this change can also fix my problem (QEMU will not get
> > SIGBUS after write protection starts).
> 
> This is wrong mwriteprotect_range() should already properly set pte
> entry to non write protect:
> 
> https://git.kernel.org/pub/scm/linux/kernel/git/andrea/aa.git/commit/?id=b16cb9fcb76bec59cbe1427e73246dc81a4942e2
> 
> 	if (enable_wp)
> 		newprot = vm_get_page_prot(dst_vma->vm_flags & ~(VM_WRITE));
> 	else
> 		newprot = vm_get_page_prot(dst_vma->vm_flags);
> 
> So it seems that the vm_flags do not have VM_WRITE set.

Hi, Jerome,

I think the vma has correct VM_WRITE flag there.  I added some prints
into mwriteprotect_range() to trap more information when coredump
happens:

diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c
index d3d0a13a636f..ebdcd76887df 100644
--- a/mm/userfaultfd.c
+++ b/mm/userfaultfd.c
@@ -606,6 +606,7 @@ int mwriteprotect_range(struct mm_struct *dst_mm, unsigned long start,
        struct vm_area_struct *dst_vma;
        pgprot_t newprot;
        int err;
+       unsigned long pages;
 
        /*
         * Sanitize the command parameters:
@@ -638,13 +639,17 @@ int mwriteprotect_range(struct mm_struct *dst_mm, unsigned long start,
        if (!vma_is_anonymous(dst_vma))
                goto out_unlock;
 
+       pr_info("%s: vm_flags: 0x%lx\n", __func__, dst_vma->vm_flags);
+
        if (enable_wp)
                newprot = vm_get_page_prot(dst_vma->vm_flags & ~(VM_WRITE));
        else
                newprot = vm_get_page_prot(dst_vma->vm_flags);
 
-       change_protection(dst_vma, start, start + len, newprot,
-                               !enable_wp, 0);
+       pages = change_protection(dst_vma, start, start + len, newprot,
+                                 !enable_wp, 0);
+       pr_info("%s: 0x%lx-0x%lx changed %ld pages (newprot=%lx, wp=%d)\n",
+               __func__, start, start + len, pages, newprot.pgprot, enable_wp);
 
        err = 0;
 out_unlock:

Here's what I got starting from when QEMU starts until QEMU coredumps:

Sep 10 11:40:24 px-ws kernel: mwriteprotect_range: vm_flags: 0xa0121073
Sep 10 11:40:24 px-ws kernel: mwriteprotect_range: 0x7f205fe00000-0x7f209fe00000 changed 1024 pages (newprot=8000000000000025, wp=1)                                        
Sep 10 11:40:24 px-ws kernel: mwriteprotect_range: vm_flags: 0xa0121073
Sep 10 11:40:24 px-ws kernel: mwriteprotect_range: 0x7f20aa000000-0x7f20ab000000 changed 512 pages (newprot=8000000000000025, wp=1)                                         
Sep 10 11:40:24 px-ws kernel: mwriteprotect_range: vm_flags: 0xa0121073   <----------------------------------- [1]
Sep 10 11:40:24 px-ws kernel: mwriteprotect_range: 0x7f205fe7f000-0x7f205fe80000 changed 1 pages (newprot=8000000000000025, wp=0)                                           
Sep 10 11:40:24 px-ws kernel: FAULT_FLAG_ALLOW_RETRY missing 71
Sep 10 11:40:24 px-ws kernel: CPU: 7 PID: 1637 Comm: qemu-system-x86 Not tainted 4.19.0-rc2+ #27                                                                            
Sep 10 11:40:24 px-ws kernel: Hardware name: LENOVO ThinkCentre M8500t-N000/SHARKBAY, BIOS FBKTC6AUS 06/22/2016                                                             
Sep 10 11:40:24 px-ws kernel: Call Trace:
Sep 10 11:40:24 px-ws kernel:  dump_stack+0x5c/0x7b
Sep 10 11:40:24 px-ws kernel:  handle_userfault+0x4b5/0x780
Sep 10 11:40:24 px-ws kernel:  ? schedule+0x32/0x80
Sep 10 11:40:24 px-ws kernel:  ? handle_userfault+0x47e/0x780
Sep 10 11:40:24 px-ws kernel:  do_wp_page+0x1bd/0x5a0
Sep 10 11:40:24 px-ws kernel:  __handle_mm_fault+0x7f9/0x1250
Sep 10 11:40:24 px-ws kernel:  handle_mm_fault+0xfc/0x1f0
Sep 10 11:40:24 px-ws kernel:  __do_page_fault+0x255/0x520
Sep 10 11:40:24 px-ws kernel:  do_page_fault+0x32/0x110
Sep 10 11:40:24 px-ws kernel:  ? page_fault+0x8/0x30
Sep 10 11:40:24 px-ws kernel:  page_fault+0x1e/0x30
Sep 10 11:40:24 px-ws kernel: RIP: 0033:0x7f20ac9108c9
Sep 10 11:40:24 px-ws kernel: Code: 75 03 c1 ef 07 48 81 e6 00 f0 ff ff 81 e7 e0 1f 00 00 49 8d bc 3e 40 57 00 00 48 3b 37 48 8b f5 0f 85 32 00 00 00 48 03 77 10 <89> 1e 49f
Sep 10 11:40:24 px-ws kernel: RSP: 002b:00007f20abdda390 EFLAGS: 00010202
Sep 10 11:40:24 px-ws kernel: RAX: 00007f20ac915b80 RBX: 000000003fec3baa RCX: 0000000000007318                                                                             
Sep 10 11:40:24 px-ws kernel: RDX: 0000000000000000 RSI: 00007f205fe7fed0 RDI: 000055cc4efef980                                                                             
Sep 10 11:40:24 px-ws kernel: RBP: 000000000007fed0 R08: 0000000000000000 R09: 0000000000000007                                                                             
Sep 10 11:40:24 px-ws kernel: R10: 0000000000000030 R11: 0000000000000000 R12: 0000000000000242                                                                             
Sep 10 11:40:24 px-ws kernel: R13: 0000000000000000 R14: 000055cc4efe9260 R15: 00007f20abddb700                                                                             

The first four lines of mwriteprotect_range() are trying to set things
up (they have wp=1) which seems fine.  Note that lines 5-6 of
mwriteprotect_range() entry (marked with [1]) is the wp=0 one where
QEMU tries to recover a write protect page fault for a 4K page.  We
can see that the vm_flags has VM_WRITE properly set (bit 2 of
0xa0121073) rather than missing.

Though we can see the newprot didn't really have that VM_WRITE set but
it's expected since in vm_get_page_prot (further, which is actually
protection_map) we'll drop that write bit due to the protect mapping
(READ+WRITE will map to PAGE_COPY, which does not have VM_WRITE set).

> 
> To me this all points out so a bug somewhere in userspace or a miss use
> of userfaultfd. Here is what i believe to be the chain of event:
> 
>   1 QEMU or vCPU write to the affected ufaultfd range and this set the pte
>     dirty bit on all the entry in the affected range
> 
>   ...
> 
>   2 QEMU write protect the affected range with UFFDIO_WRITEPROTECT and
>     UFFDIO_WRITEPROTECT_MODE_WP flag set. This clear the pte write bit
>     and thus write protect the range. Because it is anonymous memory and
>     soft dirty is likely disabled, the dirty bit set in 1 is still there
>     and is preserved.
> 
>   3 vCPU tries to write to the affected range. This trigger a userfaultfd
>     and QEMU handle it and call UFFDIO_WRITEPROTECT but this time without
>     UFFDIO_WRITEPROTECT_MODE_WP flag (ie to unprotect).
> 
>     For some reasons the affected vma do not have the VM_WRITE flags set
>     anymore probably through mprotect() syscall by QEMU. So that the new
>     prot for the pte do not have the write bit set.

Please refers to [1] above.  The crash happens stably every time if
without my fix patch applied.

> 
>     But because of the change_pte_range() optimization and because the
>     pte dirty bit is set from 1 then the pte write bit set which is wrong
>     as the VM_WRITE have been clear.

I actually have dumped the dirty flag there too and it's missing, and
that's why I think we should have the bit.  It's indeed a bit awkward
at least to me since when running with the userfault-wp tree the dirty
bit optimization becomes more like a correctness issue rather than a
performance issue.

> 
> 
> So hence there is a bug in QEMU somewhere is my best guess.
> 
> Note that this means that the:
>   https://git.kernel.org/pub/scm/linux/kernel/git/andrea/aa.git/commit/?id=b16cb9fcb76bec59cbe1427e73246dc81a4942e2
> 
> Needs to be updated with:
> -change_protection(dst_vma, start, start + len, newprot, !enable_wp, 0);
> +change_protection(dst_vma, start, start + len, newprot, 0, 0);

I'm not sure about this suggestion as well - if we keep the
change_pte_range() with dirty_accountable=false, then how could we
further apply the WM_WRITE flag at all with current implementation
(note that with the dirty bit optimization, we'll just ignore the
VM_WRITE bit if dirty_accountable is false...)?  If without VM_WRITE,
how could we fix a write-protect page fault after all from userspace?

Please correct me if I missed anything important.

Thanks!
diff mbox series

Patch

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index c3bc7e9c9a2a..0754a16923d5 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -2176,6 +2176,8 @@  static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
 				entry = pte_mkold(entry);
 			if (soft_dirty)
 				entry = pte_mksoft_dirty(entry);
+			if (dirty)
+				entry = pte_mkdirty(entry);
 		}
 		pte = pte_offset_map(&_pmd, addr);
 		BUG_ON(!pte_none(*pte));