Message ID: 20230206165851.3106338-1-ricarkol@google.com (mailing list archive)
Series: Implement Eager Page Splitting for ARM.
Hi Ricardo,

On 2/7/23 3:58 AM, Ricardo Koller wrote:
> Eager Page Splitting improves the performance of dirty-logging (used
> in live migrations) when guest memory is backed by huge-pages. It's
> an optimization used in Google Cloud since 2016 on x86, and for the
> last couple of months on ARM.
>
> Background and motivation
> =========================
> Dirty logging is typically used for live-migration iterative copying.
> KVM implements dirty-logging at the PAGE_SIZE granularity (will refer
> to 4K pages from now on). It does so by faulting on write-protected
> 4K pages. Therefore, enabling dirty-logging on a huge-page requires
> breaking it into 4K pages in the first place. KVM does this breaking
> on fault, and because it's in the critical path it only maps the 4K
> page that faulted; every other 4K page is left unmapped. This is not
> great for performance on ARM for a couple of reasons:
>
> - Splitting on fault can halt vcpus for milliseconds in some
>   implementations. Splitting a block PTE requires using a broadcasted
>   TLB invalidation (TLBI) for every huge-page (due to the
>   break-before-make requirement). Note that x86 doesn't need this. We
>   observed some implementations that take milliseconds to complete
>   broadcasted TLBIs when done in parallel from multiple vcpus. And
>   that's exactly what happens when doing it on fault: multiple vcpus
>   fault at the same time, triggering TLBIs in parallel.
>
> - Read-intensive guest workloads end up paying for dirty-logging.
>   Only mapping the faulting 4K page means that all the other pages
>   that were part of the huge-page will now be unmapped. The effect is
>   that any access, including reads, now has to fault.
>
> Eager Page Splitting (on ARM)
> =============================
> Eager Page Splitting fixes the above two issues by eagerly splitting
> huge-pages when enabling dirty logging. The goal is to avoid doing it
> while faulting on write-protected pages. This is what the TDP MMU does
> for x86 [0], except that x86 does it for different reasons: to avoid
> grabbing the MMU lock on fault. Note that taking care of
> write-protection faults still requires grabbing the MMU lock on ARM,
> but not on x86 (with the fast_page_fault path).
>
> An additional benefit of eagerly splitting huge-pages is that it can
> be done in a controlled way (e.g., via an IOCTL). This series provides
> two knobs for doing it, just like its x86 counterpart: when enabling
> dirty logging, and when using the KVM_CLEAR_DIRTY_LOG ioctl. The
> benefit of doing it on KVM_CLEAR_DIRTY_LOG is that this ioctl takes
> ranges, and not complete memslots like when enabling dirty logging.
> This means that the cost of splitting (mainly broadcasted TLBIs) can
> be throttled: split a range, wait for a bit, split another range, etc.
> The benefits of this approach were presented by Oliver Upton at KVM
> Forum 2022 [1].
> [...]

Sorry for raising questions about the design this late. There are two
operations on an existing huge-page mapping. Let's take PMD and PTE
mappings as an example for discussion:

(a) The existing PMD mapping is split into 512 contiguous PTE mappings
    when all sub-pages are written in sequence and dirty logging has
    been enabled.

(b) The 512 contiguous PTE mappings are combined back into one PMD
    mapping when dirty logging is disabled.

Before this series is applied, both (a) and (b) are handled by the page
fault handler. After this series is applied, (a) is handled in the
ioctl handler while (b) is still handled in the page fault handler.
I'm not sure why we can't eagerly split the PMD mapping into 512 PTE
mappings in the page fault handler? In that way, the implementation
might be simplified by extending kvm_pgtable_stage2_map(). In the
current implementation, the newly introduced API
kvm_pgtable_stage2_split() calls kvm_pgtable_stage2_create_unlinked()
and then stage2_map_walker(), which is part of kvm_pgtable_stage2_map(),
to create the unlinked page tables. That's why I have the question.

Thanks,
Gavin
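For concreteness, the throttled KVM_CLEAR_DIRTY_LOG approach described in
the cover letter quoted above could look roughly like the following from
userspace. This is a minimal sketch, assuming a single memslot whose dirty
bitmap was already fetched with KVM_GET_DIRTY_LOG; the slot number, chunk
size, and pacing below are illustrative assumptions, not values taken from
the series.

	/*
	 * Hedged sketch: clear (and, with this series, eagerly split) one
	 * sub-range of a memslot at a time instead of the whole thing at once.
	 */
	#include <string.h>
	#include <unistd.h>
	#include <sys/ioctl.h>
	#include <linux/kvm.h>

	#define CHUNK_PAGES	(64 * 1024)	/* 256 MiB of 4K pages per step (assumed) */

	static int clear_dirty_throttled(int vm_fd, __u32 slot, __u64 slot_pages,
					 unsigned long *bitmap)
	{
		struct kvm_clear_dirty_log clr;

		for (__u64 first = 0; first < slot_pages; first += CHUNK_PAGES) {
			__u64 n = slot_pages - first;

			if (n > CHUNK_PAGES)
				n = CHUNK_PAGES;

			memset(&clr, 0, sizeof(clr));
			clr.slot = slot;
			clr.first_page = first;
			clr.num_pages = n;
			/* Bitmap of pages whose dirty state should be cleared. */
			clr.dirty_bitmap = (void *)((char *)bitmap + first / 8);

			if (ioctl(vm_fd, KVM_CLEAR_DIRTY_LOG, &clr) < 0)
				return -1;

			usleep(1000);	/* wait a bit before splitting the next range */
		}
		return 0;
	}

Since, with this series, clearing the dirty log for a range is also where
huge-pages covering that range get eagerly split, pacing the clears also
paces the broadcast TLBIs.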
Hi Gavin,

On Tue, Feb 14, 2023 at 04:57:59PM +1100, Gavin Shan wrote:
> On 2/7/23 3:58 AM, Ricardo Koller wrote:

<snip>

> > Eager Page Splitting fixes the above two issues by eagerly splitting
> > huge-pages when enabling dirty logging. The goal is to avoid doing it
> > while faulting on write-protected pages.

</snip>

> I'm not sure why we can't eagerly split the PMD mapping into 512 PTE
> mappings in the page fault handler?

The entire goal of the series is to avoid page splitting at all on the
stage-2 abort path. Ideally we want to minimize the time taken to handle
a fault so we can get back to running the guest. The requirement to
perform a break-before-make operation to change the mapping granularity
can, as Ricardo points out, be a bottleneck on contemporary
implementations.

There is a clear uplift with the proposed implementation already, and I
would expect that margin to widen if/when we add support for lockless
(i.e. RCU-protected) permission relaxation.

> In the current implementation, the newly introduced API
> kvm_pgtable_stage2_split() calls kvm_pgtable_stage2_create_unlinked()
> and then stage2_map_walker(), which is part of kvm_pgtable_stage2_map(),
> to create the unlinked page tables.

This is deliberate code reuse. Page table construction in the fault path
is largely similar to that of eager split, besides the fact that one is
working on 'live' page tables whereas the other is not. As such I gave
the suggestion to Ricardo to reuse what we have today for the sake of
eager splitting.
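To make the cost Oliver refers to concrete, here is a rough, hypothetical
sketch of the break-before-make sequence needed to replace a live stage-2
block entry with a table of smaller mappings. The helpers
stage2_alloc_child_table(), tlbi_ipa_range() and table_desc() are made up
for illustration only; the series implements this through the kvm_pgtable
walker infrastructure rather than open-coding it like this.

	#include <linux/types.h>
	#include <linux/errno.h>
	#include <linux/compiler.h>
	#include <asm/barrier.h>

	/* Hypothetical helpers, assumed to exist only for this sketch. */
	u64 *stage2_alloc_child_table(u64 block_pte, int child_level);
	void tlbi_ipa_range(u64 ipa, int level);	/* broadcast (inner-shareable) TLBI */
	u64 table_desc(u64 *child_table);

	static int split_block_bbm(u64 *ptep, u64 ipa, int level)
	{
		u64 *child;

		/*
		 * 1. Build an unreachable ("unlinked") child table covering the
		 *    same IPA range one level down, with the block's attributes.
		 */
		child = stage2_alloc_child_table(*ptep, level + 1);
		if (!child)
			return -ENOMEM;

		/* 2. "Break": zap the live block entry... */
		WRITE_ONCE(*ptep, 0);
		dsb(ishst);

		/*
		 * ...and broadcast-invalidate its TLB entries. This TLBI is the
		 * expensive step when many vcpus perform it concurrently from
		 * the fault path.
		 */
		tlbi_ipa_range(ipa, level);
		dsb(ish);

		/* 3. "Make": install a table descriptor pointing at the child. */
		WRITE_ONCE(*ptep, table_desc(child));
		dsb(ishst);

		return 0;
	}

Doing this once per range from an ioctl, rather than once per faulting vcpu,
is what keeps the TLBIs off the stage-2 abort path.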
On Mon, Feb 13, 2023 at 11:33 PM Oliver Upton <oliver.upton@linux.dev> wrote:
>
> Hi Gavin,
>
> On Tue, Feb 14, 2023 at 04:57:59PM +1100, Gavin Shan wrote:
> > On 2/7/23 3:58 AM, Ricardo Koller wrote:
>
> <snip>
>
> > > Eager Page Splitting fixes the above two issues by eagerly splitting
> > > huge-pages when enabling dirty logging. The goal is to avoid doing it
> > > while faulting on write-protected pages.
>
> </snip>
>
> > I'm not sure why we can't eagerly split the PMD mapping into 512 PTE
> > mappings in the page fault handler?
>
> The entire goal of the series is to avoid page splitting at all on the
> stage-2 abort path. Ideally we want to minimize the time taken to handle
> a fault so we can get back to running the guest. The requirement to
> perform a break-before-make operation to change the mapping granularity
> can, as Ricardo points out, be a bottleneck on contemporary
> implementations.
>
> There is a clear uplift with the proposed implementation already, and I
> would expect that margin to widen if/when we add support for lockless
> (i.e. RCU-protected) permission relaxation.

There's also the issue of allocating 513 pages on fault when splitting
PUDs.

> > In the current implementation, the newly introduced API
> > kvm_pgtable_stage2_split() calls kvm_pgtable_stage2_create_unlinked()
> > and then stage2_map_walker(), which is part of kvm_pgtable_stage2_map(),
> > to create the unlinked page tables.
>
> This is deliberate code reuse. Page table construction in the fault path
> is largely similar to that of eager split, besides the fact that one is
> working on 'live' page tables whereas the other is not. As such I gave
> the suggestion to Ricardo to reuse what we have today for the sake of
> eager splitting.
>
> --
> Thanks,
> Oliver
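For reference, the 513-page figure follows directly from the standard
4K-granule stage-2 geometry: fully splitting one 1GiB PUD block requires
one PMD-level table page plus one PTE-level table page for each of the 512
resulting PMD entries, whereas splitting a single 2MiB PMD block needs just
one PTE table page. A minimal arithmetic check (table geometry assumed to
be the usual 512-entries-per-level layout):

	#include <stdio.h>

	#define PTRS_PER_TABLE 512	/* entries per table level with a 4K granule */

	int main(void)
	{
		/* Splitting one 2MiB PMD block down to PTEs: one PTE table page. */
		int pmd_split_pages = 1;

		/*
		 * Splitting one 1GiB PUD block down to PTEs: one PMD table page,
		 * plus one PTE table page per resulting PMD entry.
		 */
		int pud_split_pages = 1 + PTRS_PER_TABLE;

		printf("PMD split needs %d page(s), PUD split needs %d page(s)\n",
		       pmd_split_pages, pud_split_pages);	/* prints 1 and 513 */
		return 0;
	}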