Message ID | 20211119235759.1304274-14-dmatlack@google.com
---|---
State | New, archived
Series | KVM: x86/mmu: Eager Page Splitting for the TDP MMU
On Fri, Nov 19, 2021 at 11:57:57PM +0000, David Matlack wrote:
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index 6768ef9c0891..4e78ef2dd352 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -1448,6 +1448,12 @@ void kvm_arch_mmu_enable_log_dirty_pt_masked(struct kvm *kvm,
> gfn_t start = slot->base_gfn + gfn_offset + __ffs(mask);
> gfn_t end = slot->base_gfn + gfn_offset + __fls(mask);
>
> + /*
> + * Try to proactively split any large pages down to 4KB so that
> + * vCPUs don't have to take write-protection faults.
> + */
> + kvm_mmu_try_split_large_pages(kvm, slot, start, end, PG_LEVEL_4K);
> +
> kvm_mmu_slot_gfn_write_protect(kvm, slot, start, PG_LEVEL_2M);
>
> /* Cross two large pages? */

Is it intended to try split every time even if we could have split it already?
As I remember Paolo mentioned that we can skip split if it's not the 1st
CLEAR_LOG on the same range, and IIUC that makes sense.

But indeed I don't see a trivial way to know whether this is the first clear of
this range. Maybe we can maintain "how many huge pages are there under current
kvm_mmu_page node" somehow? Then if root sp has the counter==0, then we can
skip it. Just a wild idea..

Or maybe it's intended to try split unconditionally for some reason? If so
it'll be great to mention that either in the commit message or in comments.

Thanks,
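A rough sketch of the per-kvm_mmu_page accounting idea floated above. The nr_huge_pages field and both helpers are hypothetical additions for illustration only, not something the posted series (or current KVM) implements:

```c
/*
 * Hypothetical: track how many huge SPTEs live under each root so that
 * a later CLEAR_LOG can skip the split walk once the count hits zero.
 */
static void tdp_account_huge_spte(struct kvm_mmu_page *root,
				  u64 old_spte, u64 new_spte)
{
	bool was_huge = is_shadow_present_pte(old_spte) && is_large_pte(old_spte);
	bool is_huge = is_shadow_present_pte(new_spte) && is_large_pte(new_spte);

	if (was_huge == is_huge)
		return;

	if (is_huge)
		atomic64_inc(&root->nr_huge_pages);	/* hypothetical field */
	else
		atomic64_dec(&root->nr_huge_pages);
}

/* CLEAR_LOG could consult this before walking the range to split. */
static bool root_may_have_huge_pages(struct kvm_mmu_page *root)
{
	return atomic64_read(&root->nr_huge_pages) != 0;
}
```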
On Fri, Nov 26, 2021 at 4:17 AM Peter Xu <peterx@redhat.com> wrote:
>
> On Fri, Nov 19, 2021 at 11:57:57PM +0000, David Matlack wrote:
> > diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> > index 6768ef9c0891..4e78ef2dd352 100644
> > --- a/arch/x86/kvm/mmu/mmu.c
> > +++ b/arch/x86/kvm/mmu/mmu.c
> > @@ -1448,6 +1448,12 @@ void kvm_arch_mmu_enable_log_dirty_pt_masked(struct kvm *kvm,
> > gfn_t start = slot->base_gfn + gfn_offset + __ffs(mask);
> > gfn_t end = slot->base_gfn + gfn_offset + __fls(mask);
> >
> > + /*
> > + * Try to proactively split any large pages down to 4KB so that
> > + * vCPUs don't have to take write-protection faults.
> > + */
> > + kvm_mmu_try_split_large_pages(kvm, slot, start, end, PG_LEVEL_4K);
> > +
> > kvm_mmu_slot_gfn_write_protect(kvm, slot, start, PG_LEVEL_2M);
> >
> > /* Cross two large pages? */
>
> Is it intended to try split every time even if we could have split it already?
> As I remember Paolo mentioned that we can skip split if it's not the 1st
> CLEAR_LOG on the same range, and IIUC that makes sense.
>
> But indeed I don't see a trivial way to know whether this is the first clear of
> this range. Maybe we can maintain "how many huge pages are there under current
> kvm_mmu_page node" somehow? Then if root sp has the counter==0, then we can
> skip it. Just a wild idea..
>
> Or maybe it's intended to try split unconditionally for some reason? If so
> it'll be great to mention that either in the commit message or in comments.

Thanks for calling this out. Could the same be said about the existing
code that unconditionally tries to write-protect 2M+ pages? I aimed to
keep parity with the write-protection calls (always try to split
before write-protecting) but I agree there might be opportunities
available to skip altogether.

By the way, looking at this code again I think I see some potential bugs:
- I don't think I ever free split_caches in the initially-all-set case.
- What happens if splitting fails the CLEAR_LOG but succeeds the
CLEAR_LOG? We would end up propagating the write-protection on the 2M
page down to the 4K page. This might cause issues if using PML.

>
> Thanks,
>
> --
> Peter Xu
>
On Tue, Nov 30, 2021 at 4:16 PM David Matlack <dmatlack@google.com> wrote:
>
> On Fri, Nov 26, 2021 at 4:17 AM Peter Xu <peterx@redhat.com> wrote:
> >
> > On Fri, Nov 19, 2021 at 11:57:57PM +0000, David Matlack wrote:
> > > diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> > > index 6768ef9c0891..4e78ef2dd352 100644
> > > --- a/arch/x86/kvm/mmu/mmu.c
> > > +++ b/arch/x86/kvm/mmu/mmu.c
> > > @@ -1448,6 +1448,12 @@ void kvm_arch_mmu_enable_log_dirty_pt_masked(struct kvm *kvm,
> > > gfn_t start = slot->base_gfn + gfn_offset + __ffs(mask);
> > > gfn_t end = slot->base_gfn + gfn_offset + __fls(mask);
> > >
> > > + /*
> > > + * Try to proactively split any large pages down to 4KB so that
> > > + * vCPUs don't have to take write-protection faults.
> > > + */
> > > + kvm_mmu_try_split_large_pages(kvm, slot, start, end, PG_LEVEL_4K);
> > > +
> > > kvm_mmu_slot_gfn_write_protect(kvm, slot, start, PG_LEVEL_2M);
> > >
> > > /* Cross two large pages? */
> >
> > Is it intended to try split every time even if we could have split it already?
> > As I remember Paolo mentioned that we can skip split if it's not the 1st
> > CLEAR_LOG on the same range, and IIUC that makes sense.
> >
> > But indeed I don't see a trivial way to know whether this is the first clear of
> > this range. Maybe we can maintain "how many huge pages are there under current
> > kvm_mmu_page node" somehow? Then if root sp has the counter==0, then we can
> > skip it. Just a wild idea..
> >
> > Or maybe it's intended to try split unconditionally for some reason? If so
> > it'll be great to mention that either in the commit message or in comments.
>
> Thanks for calling this out. Could the same be said about the existing
> code that unconditionally tries to write-protect 2M+ pages? I aimed to
> keep parity with the write-protection calls (always try to split
> before write-protecting) but I agree there might be opportunities
> available to skip altogether.
>
> By the way, looking at this code again I think I see some potential bugs:
> - I don't think I ever free split_caches in the initially-all-set case.
> - What happens if splitting fails the CLEAR_LOG but succeeds the
> CLEAR_LOG?

Gah, meant to say "first CLEAR_LOG" and "second CLEAR_LOG" here.

> We would end up propagating the write-protection on the 2M
> page down to the 4K page. This might cause issues if using PML.
>
> >
> > Thanks,
> >
> > --
> > Peter Xu
> >
On Tue, Nov 30, 2021 at 04:17:01PM -0800, David Matlack wrote:
> On Tue, Nov 30, 2021 at 4:16 PM David Matlack <dmatlack@google.com> wrote:
> >
> > On Fri, Nov 26, 2021 at 4:17 AM Peter Xu <peterx@redhat.com> wrote:
> > >
> > > On Fri, Nov 19, 2021 at 11:57:57PM +0000, David Matlack wrote:
> > > > diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> > > > index 6768ef9c0891..4e78ef2dd352 100644
> > > > --- a/arch/x86/kvm/mmu/mmu.c
> > > > +++ b/arch/x86/kvm/mmu/mmu.c
> > > > @@ -1448,6 +1448,12 @@ void kvm_arch_mmu_enable_log_dirty_pt_masked(struct kvm *kvm,
> > > > gfn_t start = slot->base_gfn + gfn_offset + __ffs(mask);
> > > > gfn_t end = slot->base_gfn + gfn_offset + __fls(mask);
> > > >
> > > > + /*
> > > > + * Try to proactively split any large pages down to 4KB so that
> > > > + * vCPUs don't have to take write-protection faults.
> > > > + */
> > > > + kvm_mmu_try_split_large_pages(kvm, slot, start, end, PG_LEVEL_4K);
> > > > +
> > > > kvm_mmu_slot_gfn_write_protect(kvm, slot, start, PG_LEVEL_2M);
> > > >
> > > > /* Cross two large pages? */
> > >
> > > Is it intended to try split every time even if we could have split it already?
> > > As I remember Paolo mentioned that we can skip split if it's not the 1st
> > > CLEAR_LOG on the same range, and IIUC that makes sense.
> > >
> > > But indeed I don't see a trivial way to know whether this is the first clear of
> > > this range. Maybe we can maintain "how many huge pages are there under current
> > > kvm_mmu_page node" somehow? Then if root sp has the counter==0, then we can
> > > skip it. Just a wild idea..
> > >
> > > Or maybe it's intended to try split unconditionally for some reason? If so
> > > it'll be great to mention that either in the commit message or in comments.
> >
> > Thanks for calling this out. Could the same be said about the existing
> > code that unconditionally tries to write-protect 2M+ pages?

They're different because wr-protect can be restored (to be not-wr-protected)
when vcpu threads write to the pages, so they need to be always done.

For huge page split - when it happened during dirty tracking it'll not be
recovered anymore, so it's a one-time thing.

> > I aimed to keep parity with the write-protection calls (always try to split
> > before write-protecting) but I agree there might be opportunities available
> > to skip altogether.

So IMHO it's not about parity but it could be about how easy can it be
implemented, and whether it'll be worth it to add that complexity.

Besides the above accounting idea per-sp, we can have other ways to do this
too, e.g., keeping a bitmap showing which range has been split: that bitmap
will be 2M in granule for x86 because that'll be enough. We init-all-ones for
the bitmap when start logging for a memslot.

But again maybe it turns out we don't really want that complexity.

IMHO a good start could be the perf numbers (which I asked in the cover letter)
comparing the overhead of 2nd+ iterations of CLEAR_LOG with/without eager page
split.

> >
> > By the way, looking at this code again I think I see some potential bugs:
> > - I don't think I ever free split_caches in the initially-all-set case.

I saw that it's freed in kvm_mmu_try_split_large_pages(), no?

> > - What happens if splitting fails the CLEAR_LOG but succeeds the
> > CLEAR_LOG?
>
> Gah, meant to say "first CLEAR_LOG" and "second CLEAR_LOG" here.
>
> > We would end up propagating the write-protection on the 2M
> > page down to the 4K page. This might cause issues if using PML.

Hmm looks correct.. I'm wondering what will happen with that.

Firstly this should be rare as the 1st split should in 99% cases succeed.

Then if split failed at the 1st attempt, we wr-protected sptes even during pml
during the split. When written, we'll go the fast page fault and record the
writes too, I think, as we'll apply dirty bit to the new spte so I think it'll
just skip pml. Looks like we'll be using a mixture of pml+wp but all dirty
will still be captured as expected?..

There could be leftover wp when stopping dirty logging, but that seems not
directly harmful too. It'll make things a bit messed up, at least.
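A minimal sketch of the 2M-granule bitmap Peter describes above, assuming a hypothetical split_pending bitmap added to struct kvm_arch_memory_slot and initialized to all-ones when dirty logging is enabled for the slot; none of this is in the posted series:

```c
/*
 * Hypothetical: one bit per 2M region of the slot.  A set bit means the
 * region has not been split yet; CLEAR_LOG clears bits as it splits, so
 * later CLEAR_LOG calls on the same range can skip the split walk.
 */
static bool slot_needs_split(const struct kvm_memory_slot *slot,
			     gfn_t start, gfn_t end)
{
	unsigned long pages_per_2m = KVM_PAGES_PER_HPAGE(PG_LEVEL_2M);
	unsigned long i, first, last;

	first = (start - slot->base_gfn) / pages_per_2m;
	last = (end - slot->base_gfn) / pages_per_2m;

	for (i = first; i <= last; i++) {
		if (test_bit(i, slot->arch.split_pending))	/* hypothetical field */
			return true;
	}
	return false;
}

static void slot_mark_split(struct kvm_memory_slot *slot, gfn_t start, gfn_t end)
{
	unsigned long pages_per_2m = KVM_PAGES_PER_HPAGE(PG_LEVEL_2M);
	unsigned long i, first, last;

	first = (start - slot->base_gfn) / pages_per_2m;
	last = (end - slot->base_gfn) / pages_per_2m;

	for (i = first; i <= last; i++)
		clear_bit(i, slot->arch.split_pending);
}
```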
On Fri, Nov 19, 2021, David Matlack wrote:
> When using initially-all-set, large pages are not write-protected when
> dirty logging is enabled on the memslot. Instead they are
> write-protected once userspace invoked CLEAR_DIRTY_LOG for the first
> time, and only for the specific sub-region of the memslot that userspace
> whishes to clear.
>
> Enhance CLEAR_DIRTY_LOG to also try to split large pages prior to
> write-protecting to avoid causing write-protection faults on vCPU
> threads. This also allows userspace to smear the cost of large page
> splitting across multiple ioctls rather than splitting the entire
> memslot when not using initially-all-set.
>
> Signed-off-by: David Matlack <dmatlack@google.com>
> ---
> arch/x86/include/asm/kvm_host.h | 4 ++++
> arch/x86/kvm/mmu/mmu.c | 30 ++++++++++++++++++++++--------
> 2 files changed, 26 insertions(+), 8 deletions(-)
>
> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> index 432a4df817ec..6b5bf99f57af 100644
> --- a/arch/x86/include/asm/kvm_host.h
> +++ b/arch/x86/include/asm/kvm_host.h
> @@ -1591,6 +1591,10 @@ void kvm_mmu_reset_context(struct kvm_vcpu *vcpu);
> void kvm_mmu_slot_remove_write_access(struct kvm *kvm,
> const struct kvm_memory_slot *memslot,
> int start_level);
> +void kvm_mmu_try_split_large_pages(struct kvm *kvm,

I would prefer we use hugepage when possible, mostly because that's the terminology
used by the kernel. KVM is comically inconsistent, but if we make an effort to use
hugepage when adding new code, hopefully someday we'll have enough inertia to commit
fully to hugepage.

> + const struct kvm_memory_slot *memslot,
> + u64 start, u64 end,
> + int target_level);
> void kvm_mmu_slot_try_split_large_pages(struct kvm *kvm,
> const struct kvm_memory_slot *memslot,
> int target_level);
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index 6768ef9c0891..4e78ef2dd352 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -1448,6 +1448,12 @@ void kvm_arch_mmu_enable_log_dirty_pt_masked(struct kvm *kvm,
> gfn_t start = slot->base_gfn + gfn_offset + __ffs(mask);
> gfn_t end = slot->base_gfn + gfn_offset + __fls(mask);
>
> + /*
> + * Try to proactively split any large pages down to 4KB so that
> + * vCPUs don't have to take write-protection faults.
> + */
> + kvm_mmu_try_split_large_pages(kvm, slot, start, end, PG_LEVEL_4K);

This should return a value. If splitting succeeds, there should be no hugepages
and so walking the page tables to write-protect 2M is unnecessary. Same for the
previous patch, although skipping the write-protect path is a little less
straightforward in that case.

> +
> kvm_mmu_slot_gfn_write_protect(kvm, slot, start, PG_LEVEL_2M);
>
> /* Cross two large pages? */
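One way the call site in kvm_arch_mmu_enable_log_dirty_pt_masked() could look if the split helper reported success, as suggested above; the bool return type is an assumption for illustration, not what this series implements:

```c
	/*
	 * Sketch only: skip the 2M write-protection walk when the split
	 * helper reports that every hugepage in [start, end] was split,
	 * since the walk would then find nothing to write-protect.
	 */
	if (!kvm_mmu_try_split_large_pages(kvm, slot, start, end, PG_LEVEL_4K))
		kvm_mmu_slot_gfn_write_protect(kvm, slot, start, PG_LEVEL_2M);
```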
On Wed, Dec 1, 2021 at 11:22 AM Sean Christopherson <seanjc@google.com> wrote:
>
> On Fri, Nov 19, 2021, David Matlack wrote:
> > When using initially-all-set, large pages are not write-protected when
> > dirty logging is enabled on the memslot. Instead they are
> > write-protected once userspace invoked CLEAR_DIRTY_LOG for the first
> > time, and only for the specific sub-region of the memslot that userspace
> > whishes to clear.
> >
> > Enhance CLEAR_DIRTY_LOG to also try to split large pages prior to
> > write-protecting to avoid causing write-protection faults on vCPU
> > threads. This also allows userspace to smear the cost of large page
> > splitting across multiple ioctls rather than splitting the entire
> > memslot when not using initially-all-set.
> >
> > Signed-off-by: David Matlack <dmatlack@google.com>
> > ---
> > arch/x86/include/asm/kvm_host.h | 4 ++++
> > arch/x86/kvm/mmu/mmu.c | 30 ++++++++++++++++++++++--------
> > 2 files changed, 26 insertions(+), 8 deletions(-)
> >
> > diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> > index 432a4df817ec..6b5bf99f57af 100644
> > --- a/arch/x86/include/asm/kvm_host.h
> > +++ b/arch/x86/include/asm/kvm_host.h
> > @@ -1591,6 +1591,10 @@ void kvm_mmu_reset_context(struct kvm_vcpu *vcpu);
> > void kvm_mmu_slot_remove_write_access(struct kvm *kvm,
> > const struct kvm_memory_slot *memslot,
> > int start_level);
> > +void kvm_mmu_try_split_large_pages(struct kvm *kvm,
>
> I would prefer we use hugepage when possible, mostly because that's the terminology
> used by the kernel. KVM is comically inconsistent, but if we make an effort to use
> hugepage when adding new code, hopefully someday we'll have enough inertia to commit
> fully to hugepage.

In my mind "huge page" implies 2M and "large page" is generic to 2m
and 1g. (IDK if we settled on a name for 1G pages)

I've definitely been guilty of reinforcing this inconsistent
terminology. (Though it was consistent in my head, of course.) If we
want to pick one and use it everywhere, I'm happy to get onboard with
a standard terminology.

>
> > + const struct kvm_memory_slot *memslot,
> > + u64 start, u64 end,
> > + int target_level);
> > void kvm_mmu_slot_try_split_large_pages(struct kvm *kvm,
> > const struct kvm_memory_slot *memslot,
> > int target_level);
> > diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> > index 6768ef9c0891..4e78ef2dd352 100644
> > --- a/arch/x86/kvm/mmu/mmu.c
> > +++ b/arch/x86/kvm/mmu/mmu.c
> > @@ -1448,6 +1448,12 @@ void kvm_arch_mmu_enable_log_dirty_pt_masked(struct kvm *kvm,
> > gfn_t start = slot->base_gfn + gfn_offset + __ffs(mask);
> > gfn_t end = slot->base_gfn + gfn_offset + __fls(mask);
> >
> > + /*
> > + * Try to proactively split any large pages down to 4KB so that
> > + * vCPUs don't have to take write-protection faults.
> > + */
> > + kvm_mmu_try_split_large_pages(kvm, slot, start, end, PG_LEVEL_4K);
>
> This should return a value. If splitting succeeds, there should be no hugepages
> and so walking the page tables to write-protect 2M is unnecessary. Same for the
> previous patch, although skipping the write-protect path is a little less
> straightforward in that case.
>
> > +
> > kvm_mmu_slot_gfn_write_protect(kvm, slot, start, PG_LEVEL_2M);
> >
> > /* Cross two large pages? */
On Wed, Dec 01, 2021, Ben Gardon wrote:
> On Wed, Dec 1, 2021 at 11:22 AM Sean Christopherson <seanjc@google.com> wrote:
> > I would prefer we use hugepage when possible, mostly because that's the terminology
> > used by the kernel. KVM is comically inconsistent, but if we make an effort to use
> > hugepage when adding new code, hopefully someday we'll have enough inertia to commit
> > fully to hugepage.
>
> In my mind "huge page" implies 2M and "large page" is generic to 2m
> and 1g. (IDK if we settled on a name for 1G pages)

What about 4m PSE pages? :-)

I'm mostly joking, but it does raise the point that trying to provide unique names
for each size is a bit of a fools errand, especially on non-x86 architectures that
support a broader variety of hugepage sizes. IMO, the least ambiguous way to refer
to hugepages is to say that everything that isn't a 4k page (or whatever PAGE_SIZE
is on the architecture) is a hugepage, and then explicitly state the size of the
page if it matters.

> I've definitely been guilty of reinforcing this inconsistent
> terminology. (Though it was consistent in my head, of course.) If we
> want to pick one and use it everywhere, I'm happy to get onboard with
> a standard terminology.

I hear you on using "large page", I've had to undo a solid decade of "large page"
terminology from my pre-Linux days. But for better or worse, the kernel uses
hugepage, e.g. hugetlbfs supports 1gb and 2mb pages. I think we should follow
the kernel, especially since we have aspirations of unifying more of KVM's MMU
across multiple architectures.
On Wed, Dec 1, 2021 at 12:17 PM Sean Christopherson <seanjc@google.com> wrote:
>
> On Wed, Dec 01, 2021, Ben Gardon wrote:
> > On Wed, Dec 1, 2021 at 11:22 AM Sean Christopherson <seanjc@google.com> wrote:
> > > I would prefer we use hugepage when possible, mostly because that's the terminology
> > > used by the kernel. KVM is comically inconsistent, but if we make an effort to use
> > > hugepage when adding new code, hopefully someday we'll have enough inertia to commit
> > > fully to hugepage.
> >
> > In my mind "huge page" implies 2M and "large page" is generic to 2m
> > and 1g. (IDK if we settled on a name for 1G pages)
>
> What about 4m PSE pages? :-)
>
> I'm mostly joking, but it does raise the point that trying to provide unique names
> for each size is a bit of a fools errand, especially on non-x86 architectures that
> support a broader variety of hugepage sizes. IMO, the least ambiguous way to refer
> to hugepages is to say that everything that isn't a 4k page (or whatever PAGE_SIZE
> is on the architecture) is a hugepage, and then explicitly state the size of the
> page if it matters.
>
> > I've definitely been guilty of reinforcing this inconsistent
> > terminology. (Though it was consistent in my head, of course.) If we
> > want to pick one and use it everywhere, I'm happy to get onboard with
> > a standard terminology.
>
> I hear you on using "large page", I've had to undo a solid decade of "large page"
> terminology from my pre-Linux days. But for better or worse, the kernel uses
> hugepage, e.g. hugetlbfs supports 1gb and 2mb pages. I think we should follow
> the kernel, especially since we have aspirations of unifying more of KVM's MMU
> across multiple architectures.

Sounds good to me. I'll keep that in mind in future patches. I'm happy
to call them anything as long as we all use the same terms.
On Tue, Nov 30, 2021 at 8:04 PM Peter Xu <peterx@redhat.com> wrote:
>
> On Tue, Nov 30, 2021 at 04:17:01PM -0800, David Matlack wrote:
> > On Tue, Nov 30, 2021 at 4:16 PM David Matlack <dmatlack@google.com> wrote:
> > >
> > > On Fri, Nov 26, 2021 at 4:17 AM Peter Xu <peterx@redhat.com> wrote:
> > > >
> > > > On Fri, Nov 19, 2021 at 11:57:57PM +0000, David Matlack wrote:
> > > > > diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> > > > > index 6768ef9c0891..4e78ef2dd352 100644
> > > > > --- a/arch/x86/kvm/mmu/mmu.c
> > > > > +++ b/arch/x86/kvm/mmu/mmu.c
> > > > > @@ -1448,6 +1448,12 @@ void kvm_arch_mmu_enable_log_dirty_pt_masked(struct kvm *kvm,
> > > > > gfn_t start = slot->base_gfn + gfn_offset + __ffs(mask);
> > > > > gfn_t end = slot->base_gfn + gfn_offset + __fls(mask);
> > > > >
> > > > > + /*
> > > > > + * Try to proactively split any large pages down to 4KB so that
> > > > > + * vCPUs don't have to take write-protection faults.
> > > > > + */
> > > > > + kvm_mmu_try_split_large_pages(kvm, slot, start, end, PG_LEVEL_4K);
> > > > > +
> > > > > kvm_mmu_slot_gfn_write_protect(kvm, slot, start, PG_LEVEL_2M);
> > > > >
> > > > > /* Cross two large pages? */
> > > >
> > > > Is it intended to try split every time even if we could have split it already?
> > > > As I remember Paolo mentioned that we can skip split if it's not the 1st
> > > > CLEAR_LOG on the same range, and IIUC that makes sense.
> > > >
> > > > But indeed I don't see a trivial way to know whether this is the first clear of
> > > > this range. Maybe we can maintain "how many huge pages are there under current
> > > > kvm_mmu_page node" somehow? Then if root sp has the counter==0, then we can
> > > > skip it. Just a wild idea..
> > > >
> > > > Or maybe it's intended to try split unconditionally for some reason? If so
> > > > it'll be great to mention that either in the commit message or in comments.
> > >
> > > Thanks for calling this out. Could the same be said about the existing
> > > code that unconditionally tries to write-protect 2M+ pages?
>
> They're different because wr-protect can be restored (to be not-wr-protected)
> when vcpu threads write to the pages, so they need to be always done.

That's true for 4K pages, but not for write-protecting 2M+ pages
(which is what we're discussing here). Once KVM write-protects a 2M+
page, it should never need to write-protect it again, but we always
try to here. Same goes with splitting.

> For huge page split - when it happened during dirty tracking it'll not be
> recovered anymore, so it's a one-time thing.
>
> > > I aimed to keep parity with the write-protection calls (always try to split
> > > before write-protecting) but I agree there might be opportunities available
> > > to skip altogether.
>
> So IMHO it's not about parity but it could be about how easy can it be
> implemented, and whether it'll be worth it to add that complexity.

Agreed.

> Besides the above accounting idea per-sp, we can have other ways to do this
> too, e.g., keeping a bitmap showing which range has been split: that bitmap
> will be 2M in granule for x86 because that'll be enough. We init-all-ones for
> the bitmap when start logging for a memslot.
>
> But again maybe it turns out we don't really want that complexity.
>
> IMHO a good start could be the perf numbers (which I asked in the cover letter)
> comparing the overhead of 2nd+ iterations of CLEAR_LOG with/without eager page
> split.

Ack. I'll be sure to include these in v1!

> > >
> > > By the way, looking at this code again I think I see some potential bugs:
> > > - I don't think I ever free split_caches in the initially-all-set case.
>
> I saw that it's freed in kvm_mmu_try_split_large_pages(), no?

Ah yes you are right. I misremembered how I implemented it and thought
we kept the split_caches around across calls to CLEAR_LOG. (We probably
should TBH. The current implementation is quite wasteful.)

> > > - What happens if splitting fails the CLEAR_LOG but succeeds the
> > > CLEAR_LOG?
> >
> > Gah, meant to say "first CLEAR_LOG" and "second CLEAR_LOG" here.
> >
> > > We would end up propagating the write-protection on the 2M
> > > page down to the 4K page. This might cause issues if using PML.
>
> Hmm looks correct.. I'm wondering what will happen with that.
>
> Firstly this should be rare as the 1st split should in 99% cases succeed.
>
> Then if split failed at the 1st attempt, we wr-protected sptes even during pml
> during the split. When written, we'll go the fast page fault and record the
> writes too, I think, as we'll apply dirty bit to the new spte so I think it'll
> just skip pml. Looks like we'll be using a mixture of pml+wp but all dirty
> will still be captured as exptected?..

That's what I was hoping for. I'll double check for v1.

> There could be leftover wp when stopping dirty logging, but that seems not
> directly harmful too. It'll make things a bit messed up, at least.
>
> --
> Peter Xu
>
On Wed, Dec 1, 2021 at 11:22 AM Sean Christopherson <seanjc@google.com> wrote:
>
> On Fri, Nov 19, 2021, David Matlack wrote:
> > When using initially-all-set, large pages are not write-protected when
> > dirty logging is enabled on the memslot. Instead they are
> > write-protected once userspace invoked CLEAR_DIRTY_LOG for the first
> > time, and only for the specific sub-region of the memslot that userspace
> > whishes to clear.
> >
> > Enhance CLEAR_DIRTY_LOG to also try to split large pages prior to
> > write-protecting to avoid causing write-protection faults on vCPU
> > threads. This also allows userspace to smear the cost of large page
> > splitting across multiple ioctls rather than splitting the entire
> > memslot when not using initially-all-set.
> >
> > Signed-off-by: David Matlack <dmatlack@google.com>
> > ---
> > arch/x86/include/asm/kvm_host.h | 4 ++++
> > arch/x86/kvm/mmu/mmu.c | 30 ++++++++++++++++++++++--------
> > 2 files changed, 26 insertions(+), 8 deletions(-)
> >
> > diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> > index 432a4df817ec..6b5bf99f57af 100644
> > --- a/arch/x86/include/asm/kvm_host.h
> > +++ b/arch/x86/include/asm/kvm_host.h
> > @@ -1591,6 +1591,10 @@ void kvm_mmu_reset_context(struct kvm_vcpu *vcpu);
> > void kvm_mmu_slot_remove_write_access(struct kvm *kvm,
> > const struct kvm_memory_slot *memslot,
> > int start_level);
> > +void kvm_mmu_try_split_large_pages(struct kvm *kvm,
>
> I would prefer we use hugepage when possible, mostly because that's the terminology
> used by the kernel. KVM is comically inconsistent, but if we make an effort to use
> hugepage when adding new code, hopefully someday we'll have enough inertia to commit
> fully to hugepage.

Will do.

>
> > + const struct kvm_memory_slot *memslot,
> > + u64 start, u64 end,
> > + int target_level);
> > void kvm_mmu_slot_try_split_large_pages(struct kvm *kvm,
> > const struct kvm_memory_slot *memslot,
> > int target_level);
> > diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> > index 6768ef9c0891..4e78ef2dd352 100644
> > --- a/arch/x86/kvm/mmu/mmu.c
> > +++ b/arch/x86/kvm/mmu/mmu.c
> > @@ -1448,6 +1448,12 @@ void kvm_arch_mmu_enable_log_dirty_pt_masked(struct kvm *kvm,
> > gfn_t start = slot->base_gfn + gfn_offset + __ffs(mask);
> > gfn_t end = slot->base_gfn + gfn_offset + __fls(mask);
> >
> > + /*
> > + * Try to proactively split any large pages down to 4KB so that
> > + * vCPUs don't have to take write-protection faults.
> > + */
> > + kvm_mmu_try_split_large_pages(kvm, slot, start, end, PG_LEVEL_4K);
>
> This should return a value. If splitting succeeds, there should be no hugepages
> and so walking the page tables to write-protect 2M is unnecessary. Same for the
> previous patch, although skipping the write-protect path is a little less
> straightforward in that case.

Great idea! Will do.

>
> > +
> > kvm_mmu_slot_gfn_write_protect(kvm, slot, start, PG_LEVEL_2M);
> >
> > /* Cross two large pages? */
On Wed, Dec 01, 2021 at 02:14:27PM -0800, David Matlack wrote:
> > > > Thanks for calling this out. Could the same be said about the existing
> > > > code that unconditionally tries to write-protect 2M+ pages?
> >
> > They're different because wr-protect can be restored (to be not-wr-protected)
> > when vcpu threads write to the pages, so they need to be always done.
>
> That's true for 4K pages, but not for write-protecting 2M+ pages
> (which is what we're discussing here). Once KVM write-protects a 2M+
> page, it should never need to write-protect it again, but we always
> try to here. Same goes with splitting.

Ah I see, that's fair point. :)

Yeah let's see how it goes with the numbers, I'd hope it's trivial to do both
wr-protect 2m and the split unconditionally, because for CLEAR_LOG the major
overhead should be walking the small pages instead, afaiu.
diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 432a4df817ec..6b5bf99f57af 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -1591,6 +1591,10 @@ void kvm_mmu_reset_context(struct kvm_vcpu *vcpu);
 void kvm_mmu_slot_remove_write_access(struct kvm *kvm,
                                       const struct kvm_memory_slot *memslot,
                                       int start_level);
+void kvm_mmu_try_split_large_pages(struct kvm *kvm,
+                                   const struct kvm_memory_slot *memslot,
+                                   u64 start, u64 end,
+                                   int target_level);
 void kvm_mmu_slot_try_split_large_pages(struct kvm *kvm,
                                         const struct kvm_memory_slot *memslot,
                                         int target_level);
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 6768ef9c0891..4e78ef2dd352 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -1448,6 +1448,12 @@ void kvm_arch_mmu_enable_log_dirty_pt_masked(struct kvm *kvm,
 		gfn_t start = slot->base_gfn + gfn_offset + __ffs(mask);
 		gfn_t end = slot->base_gfn + gfn_offset + __fls(mask);
 
+		/*
+		 * Try to proactively split any large pages down to 4KB so that
+		 * vCPUs don't have to take write-protection faults.
+		 */
+		kvm_mmu_try_split_large_pages(kvm, slot, start, end, PG_LEVEL_4K);
+
 		kvm_mmu_slot_gfn_write_protect(kvm, slot, start, PG_LEVEL_2M);
 
 		/* Cross two large pages? */
@@ -5880,21 +5886,17 @@ void kvm_mmu_slot_remove_write_access(struct kvm *kvm,
 		kvm_arch_flush_remote_tlbs_memslot(kvm, memslot);
 }
 
-void kvm_mmu_slot_try_split_large_pages(struct kvm *kvm,
-					const struct kvm_memory_slot *memslot,
-					int target_level)
+void kvm_mmu_try_split_large_pages(struct kvm *kvm,
+				   const struct kvm_memory_slot *memslot,
+				   u64 start, u64 end,
+				   int target_level)
 {
-	u64 start, end;
-
 	if (!is_tdp_mmu_enabled(kvm))
 		return;
 
 	if (mmu_topup_split_caches(kvm))
 		return;
 
-	start = memslot->base_gfn;
-	end = start + memslot->npages;
-
 	read_lock(&kvm->mmu_lock);
 	kvm_tdp_mmu_try_split_large_pages(kvm, memslot, start, end, target_level);
 	read_unlock(&kvm->mmu_lock);
@@ -5902,6 +5904,18 @@ void kvm_mmu_slot_try_split_large_pages(struct kvm *kvm,
 	mmu_free_split_caches(kvm);
 }
 
+void kvm_mmu_slot_try_split_large_pages(struct kvm *kvm,
+					const struct kvm_memory_slot *memslot,
+					int target_level)
+{
+	u64 start, end;
+
+	start = memslot->base_gfn;
+	end = start + memslot->npages;
+
+	kvm_mmu_try_split_large_pages(kvm, memslot, start, end, target_level);
+}
+
 static bool kvm_mmu_zap_collapsible_spte(struct kvm *kvm,
 					 struct kvm_rmap_head *rmap_head,
 					 const struct kvm_memory_slot *slot)
When using initially-all-set, large pages are not write-protected when
dirty logging is enabled on the memslot. Instead they are write-protected
once userspace invokes CLEAR_DIRTY_LOG for the first time, and only for
the specific sub-region of the memslot that userspace wishes to clear.

Enhance CLEAR_DIRTY_LOG to also try to split large pages prior to
write-protecting to avoid causing write-protection faults on vCPU
threads. This also allows userspace to smear the cost of large page
splitting across multiple ioctls rather than splitting the entire
memslot when not using initially-all-set.

Signed-off-by: David Matlack <dmatlack@google.com>
---
 arch/x86/include/asm/kvm_host.h |  4 ++++
 arch/x86/kvm/mmu/mmu.c          | 30 ++++++++++++++++++++++--------
 2 files changed, 26 insertions(+), 8 deletions(-)