KVM: use set_page_dirty rather than SetPageDirty

Message ID 08b5b2c516b81788ca411dc031d403de4594755e.1643226777.git.boris@bur.io
State New, archived
Series KVM: use set_page_dirty rather than SetPageDirty

Commit Message

Boris Burkov Jan. 26, 2022, 7:54 p.m. UTC
At Facebook, we have hit a bug in an interaction between KVM and btrfs
while running Android emulators in a build/test environment that uses
the Android emulator's RAM snapshot features (-snapshot and
-snapstorage). The important aspect of those features is that they
result in qemu mmap-ing a file, rather than anonymous memory, for the
guest's memory.

Ultimately, we observe (with drgn) pages of the mapped file stuck in
btrfs writeback because the mapping's xarray lacks the expected dirty
tags. I have not yet been able to pin down the exact vm behavior that
results in these bad kvm_set_pfn_dirty calls, but I caught them by
instrumenting SetPageDirty with a warning, getting a stack trace like:

RIP: 0010:kvm_set_pfn_dirty+0xaf/0xd0 [kvm]
<snip>
 Call Trace:
  kvm_release_pfn+0x2d/0x40 [kvm]
  __kvm_map_gfn+0x115/0x2b0 [kvm]
  kvm_arch_vcpu_ioctl_run+0x1538/0x1b30 [kvm]
  ? call_function_interrupt+0xa/0x20
  kvm_vcpu_ioctl+0x232/0x5e0 [kvm]
  ksys_ioctl+0x83/0xc0
  __x64_sys_ioctl+0x16/0x20
  do_syscall_64+0x42/0x110
  entry_SYSCALL_64_after_hwframe+0x44/0xa9

kvm_arch_vcpu_ioctl_run+0x1538 is the call to complete_userspace_io on
line 8728, for what it's worth. I also confirmed that the page being
dirtied in this codepath is the one we end up stuck on.

This is on a kernel based on 5.6, but as far as I can tell, the
behavior in KVM is still incorrect today, as it doesn't account for
file-backed pages.

I tested this fix on the workload and it did prevent the hangs. However,
I am unsure if the fix is appropriate from a locking perspective, so I
hope to draw some extra attention to that aspect. set_page_dirty_lock in
mm/page-writeback.c has a comment about locking that says set_page_dirty
should be called with the page locked or while definitely holding a
reference to the mapping's host inode. I believe that the mmap should
have that reference, so for fear of hurting KVM performance or
introducing a deadlock, I opted for the unlocked variant.
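
For reference, the locked variant is essentially just set_page_dirty()
wrapped in the page lock; a paraphrased sketch of what
set_page_dirty_lock() in mm/page-writeback.c boils down to (not a
verbatim copy):

int set_page_dirty_lock(struct page *page)
{
	int ret;

	lock_page(page);		/* may sleep; the cost I want to avoid on hot paths */
	ret = set_page_dirty(page);
	unlock_page(page);
	return ret;
}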

Signed-off-by: Boris Burkov <boris@bur.io>
---
 virt/kvm/kvm_main.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

Comments

Sean Christopherson Jan. 26, 2022, 9:59 p.m. UTC | #1
On Wed, Jan 26, 2022, Boris Burkov wrote:
> I tested this fix on the workload and it did prevent the hangs. However,
> I am unsure if the fix is appropriate from a locking perspective, so I
> hope to draw some extra attention to that aspect. set_page_dirty_lock in
> mm/page-writeback.c has a comment about locking that says set_page_dirty
> should be called with the page locked or while definitely holding a
> reference to the mapping's host inode. I believe that the mmap should
> have that reference, so for fear of hurting KVM performance or
> introducing a deadlock, I opted for the unlocked variant.

KVM doesn't hold a reference per se, but it does subscribe to mmu_notifier events
and will not mark the page dirty after KVM has been instructed to unmap the page
(barring bugs, which we've had a slew of).  So yeah, the unlocked variant should
be safe.

Is it feasible to trigger this behavior in a selftest?  KVM has had, and probably
still has, many bugs that all boil down to KVM assuming guest memory is backed by
either anonymous memory or something like shmem/HugeTLBFS/memfd that isn't typically
truncated by the host.
Boris Burkov Jan. 26, 2022, 11:11 p.m. UTC | #2
On Wed, Jan 26, 2022 at 09:59:02PM +0000, Sean Christopherson wrote:
> On Wed, Jan 26, 2022, Boris Burkov wrote:
> > I tested this fix on the workload and it did prevent the hangs. However,
> > I am unsure if the fix is appropriate from a locking perspective, so I
> > hope to draw some extra attention to that aspect. set_page_dirty_lock in
> > mm/page-writeback.c has a comment about locking that says set_page_dirty
> > should be called with the page locked or while definitely holding a
> > reference to the mapping's host inode. I believe that the mmap should
> > have that reference, so for fear of hurting KVM performance or
> > introducing a deadlock, I opted for the unlocked variant.
> 
> KVM doesn't hold a reference per se, but it does subscribe to mmu_notifier events
> and will not mark the page dirty after KVM has been instructed to unmap the page
> (barring bugs, which we've had a slew of).  So yeah, the unlocked variant should
> be safe.
> 
> Is it feasible to trigger this behavior in a selftest?  KVM has had, and probably
> still has, many bugs that all boil down to KVM assuming guest memory is backed by
> either anonymous memory or something like shmem/HugeTLBFS/memfd that isn't typically
> truncated by the host.

I haven't been able to isolate a reproducer, yet. I am a bit stumped
because there isn't a lot for me to go off from that stack I shared--the
best I have so far is that I need to trick KVM into emulating
instructions at some point to get to this 'complete_userspace_io'
codepath? I will keep trying, since I think it would be valuable to know
what exactly happened. Open to try any suggestions you might have as
well.

Thanks for the response,
Boris
Chris Mason Jan. 27, 2022, 12:02 a.m. UTC | #3
> On Jan 26, 2022, at 6:11 PM, Boris Burkov <boris@bur.io> wrote:
> 
> On Wed, Jan 26, 2022 at 09:59:02PM +0000, Sean Christopherson wrote:
>> On Wed, Jan 26, 2022, Boris Burkov wrote:
>>> I tested this fix on the workload and it did prevent the hangs. However,
>>> I am unsure if the fix is appropriate from a locking perspective, so I
>>> hope to draw some extra attention to that aspect. set_page_dirty_lock in
>>> mm/page-writeback.c has a comment about locking that says set_page_dirty
>>> should be called with the page locked or while definitely holding a
>>> reference to the mapping's host inode. I believe that the mmap should
>>> have that reference, so for fear of hurting KVM performance or
>>> introducing a deadlock, I opted for the unlocked variant.
>> 
>> KVM doesn't hold a reference per se, but it does subscribe to mmu_notifier events
>> and will not mark the page dirty after KVM has been instructed to unmap the page
>> (barring bugs, which we've had a slew of).  So yeah, the unlocked variant should
>> be safe.
>> 
>> Is it feasible to trigger this behavior in a selftest?  KVM has had, and probably
>> still has, many bugs that all boil down to KVM assuming guest memory is backed by
>> either anonymous memory or something like shmem/HugeTLBFS/memfd that isn't typically
>> truncated by the host.
> 
> I haven't been able to isolate a reproducer, yet. I am a bit stumped
> because there isn't a lot for me to go off from that stack I shared--the
> best I have so far is that I need to trick KVM into emulating
> instructions at some point to get to this 'complete_userspace_io'
> codepath? I will keep trying, since I think it would be valuable to know
> what exactly happened. Open to try any suggestions you might have as
> well.

From the btrfs side, bare calls to set_page_dirty() are suboptimal, since it doesn’t go through the ->page_mkwrite() dance that we use to properly COW things.  It’s still much better than SetPageDirty(), but I’d love to understand why kvm needs to dirty the page so we can figure out how to go through the normal mmap file io paths.

-chris
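
The "->page_mkwrite() dance" above is the hook a filesystem registers in its
vm_operations_struct; it runs on the first write fault to a shared file-backed
page. An illustrative sketch of the common pattern (example_page_mkwrite is a
made-up name; btrfs's real handler, btrfs_page_mkwrite(), additionally does
space reservation and COW/delalloc setup):

static vm_fault_t example_page_mkwrite(struct vm_fault *vmf)
{
	struct page *page = vmf->page;
	struct inode *inode = file_inode(vmf->vma->vm_file);

	lock_page(page);
	/* The page may have been truncated while we waited for the lock. */
	if (page->mapping != inode->i_mapping) {
		unlock_page(page);
		return VM_FAULT_NOPAGE;
	}
	/* Stable pages: don't let anyone scribble on a page under writeback. */
	wait_on_page_writeback(page);
	/* ...filesystem-specific COW / space reservation would go here... */
	set_page_dirty(page);		/* sets PG_dirty *and* tags the pagecache entry */
	return VM_FAULT_LOCKED;		/* page is returned still locked */
}

Neither SetPageDirty() nor a bare set_page_dirty() from KVM goes through this
handler; set_page_dirty() at least does the proper dirty accounting and tagging.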
Sean Christopherson Jan. 27, 2022, 1:36 a.m. UTC | #4
On Thu, Jan 27, 2022, Chris Mason wrote:
> 
> 
> > On Jan 26, 2022, at 6:11 PM, Boris Burkov <boris@bur.io> wrote:
> > 
> > On Wed, Jan 26, 2022 at 09:59:02PM +0000, Sean Christopherson wrote:
> >> On Wed, Jan 26, 2022, Boris Burkov wrote:
> >>> I tested this fix on the workload and it did prevent the hangs. However,
> >>> I am unsure if the fix is appropriate from a locking perspective, so I
> >>> hope to draw some extra attention to that aspect. set_page_dirty_lock in
> >>> mm/page-writeback.c has a comment about locking that says set_page_dirty
> >>> should be called with the page locked or while definitely holding a
> >>> reference to the mapping's host inode. I believe that the mmap should
> >>> have that reference, so for fear of hurting KVM performance or
> >>> introducing a deadlock, I opted for the unlocked variant.
> >> 
> >> KVM doesn't hold a reference per se, but it does subscribe to mmu_notifier events
> >> and will not mark the page dirty after KVM has been instructed to unmap the page
> >> (barring bugs, which we've had a slew of).  So yeah, the unlocked variant should
> >> be safe.
> >> 
> >> Is it feasible to trigger this behavior in a selftest?  KVM has had, and probably
> >> still has, many bugs that all boil down to KVM assuming guest memory is backed by
> >> either anonymous memory or something like shmem/HugeTLBFS/memfd that isn't typically
> >> truncated by the host.
> > 
> > I haven't been able to isolate a reproducer, yet. I am a bit stumped
> > because there isn't a lot for me to go off from that stack I shared--the
> > best I have so far is that I need to trick KVM into emulating
> > instructions at some point to get to this 'complete_userspace_io'
> > codepath? I will keep trying, since I think it would be valuable to know
> > what exactly happened. Open to try any suggestions you might have as
> > well.
> 
> From the btrfs side, bare calls to set_page_dirty() are suboptimal, since it
> doesn’t go through the ->page_mkwrite() dance that we use to properly COW
> things.  It’s still much better than SetPageDirty(), but I’d love to
> understand why kvm needs to dirty the page so we can figure out how to go
> through the normal mmap file io paths.

Ah, is the issue that writeback gets stuck because KVM perpetually marks the
page as dirty?  The page in question should have already gone through ->page_mkwrite().
Outside of one or two internal mmaps that KVM fully controls and are anonymous memory,
KVM doesn't modify VMAs.  KVM is calling SetPageDirty() to mark that it has written
to the page, either when it unmaps the page from the guest or, in this case, when
it kunmap()'s a page that KVM itself accessed.

Based on the call stack, my best guess is that KVM is updating steal_time info.
That's triggered when the vCPU is (re)loaded, which would explain the correlation
to complete_userspace_io(), as KVM unloads=>reloads the vCPU before/after exiting
to userspace to handle emulated I/O.

Oh!  I assume that the page is either unmapped or made read-only before writeback?
v5.6 (and many kernels since) had a bug where KVM would "miss" mmu_notifier events
for the steal_time cache.  It's basically a use-after-free issue at that point.  Commit
7e2175ebd695 ("KVM: x86: Fix recording of guest steal time / preempted status").
Paolo Bonzini Jan. 27, 2022, 12:20 p.m. UTC | #5
On 1/27/22 01:02, Chris Mason wrote:
> From the btrfs side, bare calls to set_page_dirty() are suboptimal,
> since it doesn’t go through the ->page_mkwrite() dance that we use to
> properly COW things.  It’s still much better than SetPageDirty(), but
> I’d love to understand why kvm needs to dirty the page so we can
> figure out how to go through the normal mmap file io paths.
Shouldn't ->page_mkwrite() occur at the point of get_user_pages, such as 
via handle_mm_fault->handle_pte_fault->do_fault->do_shared_fault?  That 
always happens before SetPageDirty(), or set_page_dirty() after Boris's 
patch.

Thanks,

Paolo
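
For reference, the write-fault path Paolo is describing looks roughly like this
in v5.x-era kernels (simplified and from memory; the exact helper names vary by
version):

/*
 * handle_mm_fault()
 *   handle_pte_fault()
 *     do_fault()
 *       do_shared_fault()
 *         do_page_mkwrite()          -> vma->vm_ops->page_mkwrite() (COW setup, etc.)
 *         finish_fault()             -> install the new PTE
 *         fault_dirty_shared_page()  -> set_page_dirty() + dirty throttling
 */

So if the guest page was faulted in writable through this path, the
->page_mkwrite() step has already happened by the time KVM touches the page.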
Chris Mason Jan. 27, 2022, 2:52 p.m. UTC | #6
> On Jan 27, 2022, at 7:20 AM, Paolo Bonzini <pbonzini@redhat.com> wrote:
> 
> On 1/27/22 01:02, Chris Mason wrote:
>> From the btrfs side, bare calls to set_page_dirty() are suboptimal,
>> since it doesn’t go through the ->page_mkwrite() dance that we use to
>> properly COW things.  It’s still much better than SetPageDirty(), but
>> I’d love to understand why kvm needs to dirty the page so we can
>> figure out how to go through the normal mmap file io paths.
> Shouldn't ->page_mkwrite() occur at the point of get_user_pages, such as via handle_mm_fault->handle_pte_fault->do_fault->do_shared_fault?  That always happens before SetPageDirty(), or set_page_dirty() after Boris's patch.

page_mkwrite() is where btrfs does its COW setup, waits for IO in flight, and also sets the page dirty.  If that’s already happening for these pages, do we need an additional set_page_dirty() at all?

Boris found https://lists.openwall.net/linux-kernel/2016/02/11/702, where Maxim suggests just dropping the SetPageDirty() on file-backed pages.

The problem with bare set_page_dirty() calls is that they bypass our synchronization for stable pages.  We have to support them because of weird get_user_pages() corners, but page_mkwrite() is much preferred.  Hopefully our use of clear_page_dirty_for_io() makes sure that any modifications to the page go through page_mkwrite() again, so I think Maxim’s patch might just be correct.

-chris
Chris Mason Jan. 27, 2022, 3 p.m. UTC | #7
> On Jan 26, 2022, at 8:36 PM, Sean Christopherson <seanjc@google.com> wrote:
> 
> On Thu, Jan 27, 2022, Chris Mason wrote:
>> 
>> 
>>> On Jan 26, 2022, at 6:11 PM, Boris Burkov <boris@bur.io> wrote:
>>> 
>>> On Wed, Jan 26, 2022 at 09:59:02PM +0000, Sean Christopherson wrote:
>>>> On Wed, Jan 26, 2022, Boris Burkov wrote:
>>>>> I tested this fix on the workload and it did prevent the hangs. However,
>>>>> I am unsure if the fix is appropriate from a locking perspective, so I
>>>>> hope to draw some extra attention to that aspect. set_page_dirty_lock in
>>>>> mm/page-writeback.c has a comment about locking that says set_page_dirty
>>>>> should be called with the page locked or while definitely holding a
>>>>> reference to the mapping's host inode. I believe that the mmap should
>>>>> have that reference, so for fear of hurting KVM performance or
>>>>> introducing a deadlock, I opted for the unlocked variant.
>>>> 
>>>> KVM doesn't hold a reference per se, but it does subscribe to mmu_notifier events
>>>> and will not mark the page dirty after KVM has been instructed to unmap the page
>>>> (barring bugs, which we've had a slew of).  So yeah, the unlocked variant should
>>>> be safe.
>>>> 
>>>> Is it feasible to trigger this behavior in a selftest?  KVM has had, and probably
>>>> still has, many bugs that all boil down to KVM assuming guest memory is backed by
>>>> either anonymous memory or something like shmem/HugeTLBFS/memfd that isn't typically
>>>> truncated by the host.
>>> 
>>> I haven't been able to isolate a reproducer, yet. I am a bit stumped
>>> because there isn't a lot for me to go off from that stack I shared--the
>>> best I have so far is that I need to trick KVM into emulating
>>> instructions at some point to get to this 'complete_userspace_io'
>>> codepath? I will keep trying, since I think it would be valuable to know
>>> what exactly happened. Open to try any suggestions you might have as
>>> well.
>> 
>> From the btrfs side, bare calls to set_page_dirty() are suboptimal, since it
>> doesn’t go through the ->page_mkwrite() dance that we use to properly COW
>> things.  It’s still much better than SetPageDirty(), but I’d love to
>> understand why kvm needs to dirty the page so we can figure out how to go
>> through the normal mmap file io paths.
> 
> Ah, is the issue that writeback gets stuck because KVM perpetually marks the
> page as dirty?  The page in question should have already gone through ->page_mkwrite().
> Outside of one or two internal mmaps that KVM fully controls and are anonymous memory,
> KVM doesn't modify VMAs.  KVM is calling SetPageDirty() to mark that it has written
> to the page, either when it unmaps the page from the guest or, in this case, when
> it kunmap()'s a page that KVM itself accessed.
> 

I think KVM is just calling SetPageDirty() once.  The problem is that SetPageDirty() just flips the bit and doesn’t set any of the tags in the radix tree, so we can easily hit this check in filemap_fdatawrite_wbc():

        if (!mapping_can_writeback(mapping) ||
            !mapping_tagged(mapping, PAGECACHE_TAG_DIRTY))
                return 0;

Since almost everyone writing dirty pages to disk wanders through a check or search for tagged pages, the page just never gets written at all.
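
Concretely, the difference is roughly the following (a simplified sketch in the
spirit of __set_page_dirty_nobuffers(), with accounting and corner cases elided;
not verbatim kernel code):

static void sketch_mark_page_dirty(struct page *page)
{
	struct address_space *mapping = page_mapping(page);
	unsigned long flags;

	/* What SetPageDirty() does: set PG_dirty on the page, and nothing else. */
	if (TestSetPageDirty(page))
		return;			/* already dirty */
	if (!mapping)
		return;

	/* What set_page_dirty() does in addition: tag the pagecache entry... */
	xa_lock_irqsave(&mapping->i_pages, flags);
	/* (dirty accounting via account_page_dirtied() elided) */
	__xa_set_mark(&mapping->i_pages, page_index(page),
		      PAGECACHE_TAG_DIRTY);	/* what mapping_tagged() looks for */
	xa_unlock_irqrestore(&mapping->i_pages, flags);

	/* ...and mark the inode dirty so writeback knows to visit it. */
	__mark_inode_dirty(mapping->host, I_DIRTY_PAGES);
}

Without the PAGECACHE_TAG_DIRTY tag and the dirty inode, the writeback paths
never even look at the page.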

> Based on the call stack, my best guess is that KVM is updating steal_time info.
> That's triggered when the vCPU is (re)loaded, which would explain the correlation
> to complete_userspace_io(), as KVM unloads=>reloads the vCPU before/after exiting
> to userspace to handle emulated I/O.
> 
> Oh!  I assume that the page is either unmapped or made read-only before writeback?
> v5.6 (and many kernels since) had a bug where KVM would "miss" mmu_notifier events
> for the steal_time cache.  It's basically a use-after-free issue at that point.  Commit
> 7e2175ebd695 ("KVM: x86: Fix recording of guest steal time / preempted status").

Oh, looks like we are missing that one, interesting.  We use clear_page_dirty_for_io() before writing pages, so yes, it does get set read-only via page_mkclean().

-chris

Patch

diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 2755ba4177d6..432c109664c3 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -2811,7 +2811,7 @@  EXPORT_SYMBOL_GPL(kvm_release_pfn_dirty);
 void kvm_set_pfn_dirty(kvm_pfn_t pfn)
 {
 	if (!kvm_is_reserved_pfn(pfn) && !kvm_is_zone_device_pfn(pfn))
-		SetPageDirty(pfn_to_page(pfn));
+		set_page_dirty(pfn_to_page(pfn));
 }
 EXPORT_SYMBOL_GPL(kvm_set_pfn_dirty);