Message ID | 20220415215901.1737897-10-oupton@google.com (mailing list archive)
---|---
State | New, archived
Series | KVM: arm64: Parallelize stage 2 fault handling
Hi Oliver,

On Friday 15 Apr 2022 at 21:58:53 (+0000), Oliver Upton wrote:
> Breaking a table pte is insufficient to guarantee ownership of an
> unlinked subtree. Parallel software walkers could be traversing
> substructures and changing their mappings.
>
> Recurse through the unlinked subtree and lock all descendant ptes
> to take ownership of the subtree. Since the ptes are actually being
> evicted, return table ptes back to the table walker to ensure child
> tables are also traversed. Note that this is done in both the
> pre-order and leaf visitors as the underlying pte remains volatile
> until it is unlinked.

Still trying to get the full picture of the series, so bear with me.
IIUC the case you're dealing with here is when we're coalescing a table
into a block with concurrent walkers making changes in the sub-tree. I
believe this should happen when turning dirty logging off?

Why do we need to recursively lock the entire sub-tree at all in this
case? As long as the table is turned into a locked invalid PTE, what
concurrent walkers are doing in the sub-tree should be irrelevant, no?
None of the changes they make will be made visible to the hardware
anyway. So as long as the sub-tree isn't freed under their feet (which
should be the point of the RCU protection) this should all be fine? Is
there a case where this is not actually true?

Thanks,
Quentin
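For readers following the thread, here is a minimal sketch of the scheme
Quentin is suggesting, under stated assumptions: the helper name and the
RCU callback are invented for illustration, while KVM_INVALID_PTE_LOCKED,
kvm_pte_follow() and kvm_pgtable_mm_ops come from the series. Break the
table pte with an atomic exchange, then defer freeing of the
now-unreachable subtree past an RCU grace period:

static void stage2_break_and_defer_free(kvm_pte_t *ptep,
                                        struct kvm_pgtable_mm_ops *mm_ops)
{
        /* Unlink the subtree from hardware and software walkers alike. */
        kvm_pte_t old = xchg(ptep, KVM_INVALID_PTE_LOCKED);
        struct page *page = virt_to_page(kvm_pte_follow(old, mm_ops));

        /* Hypothetical callback: free the subtree only after all
         * pre-existing walkers are guaranteed to have finished. */
        call_rcu(&page->rcu_head, stage2_free_unlinked_table_rcu_cb);
}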
On Thu, Apr 21, 2022 at 01:21:54PM +0000, Quentin Perret wrote:
> Hi Oliver,
>
> On Friday 15 Apr 2022 at 21:58:53 (+0000), Oliver Upton wrote:
> > Breaking a table pte is insufficient to guarantee ownership of an
> > unlinked subtree. Parallel software walkers could be traversing
> > substructures and changing their mappings.
> >
> > Recurse through the unlinked subtree and lock all descendant ptes
> > to take ownership of the subtree. Since the ptes are actually being
> > evicted, return table ptes back to the table walker to ensure child
> > tables are also traversed. Note that this is done in both the
> > pre-order and leaf visitors as the underlying pte remains volatile
> > until it is unlinked.
>
> Still trying to get the full picture of the series, so bear with me.
> IIUC the case you're dealing with here is when we're coalescing a table
> into a block with concurrent walkers making changes in the sub-tree. I
> believe this should happen when turning dirty logging off?

Yup, I think that's the only time we wind up collapsing tables.

> Why do we need to recursively lock the entire sub-tree at all in this
> case? As long as the table is turned into a locked invalid PTE, what
> concurrent walkers are doing in the sub-tree should be irrelevant, no?
> None of the changes they make will be made visible to the hardware
> anyway. So as long as the sub-tree isn't freed under their feet (which
> should be the point of the RCU protection) this should all be fine? Is
> there a case where this is not actually true?

The problem arises when you're trying to actually free an unlinked
subtree. All bets are off until the next RCU grace period. What would
stop another software walker from installing a table at a PTE that I've
already visited? I think we'd wind up leaking a table page in this case,
as the walker doing the table collapse assumes it has successfully freed
everything underneath.

The other option would be to not touch the subtree at all until the RCU
callback, as at that point software will not tweak the tables any more.
No need for atomics/spinning; we can just do a boring traversal. Of
course, I lazily avoided this option because it would be a bit more
code, but it isn't too awfully complicated.

Does this paint a better picture, or have I only managed to confuse even
more? :)

--
Thanks,
Oliver
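A rough sketch of the recursive locking Oliver describes, for
illustration only: the actual patch drives this through the pre-order
and leaf visitors rather than open-coded recursion, and its
stage2_unlink_pte_shared() additionally spins when it races with a
walker that already holds the lock. PTRS_PER_PTE is assumed here to be
the per-table entry count; kvm_pte_table(), kvm_pte_follow() and
KVM_INVALID_PTE_LOCKED are from the series.

static void stage2_lock_subtree(kvm_pte_t *ptep, u32 level,
                                struct kvm_pgtable_mm_ops *mm_ops)
{
        /* Take ownership of this pte so no concurrent walker can
         * install a new table behind the collapse. */
        kvm_pte_t old = xchg(ptep, KVM_INVALID_PTE_LOCKED);
        kvm_pte_t *childp;
        int i;

        if (!kvm_pte_table(old, level))
                return;

        /* Recurse so child tables cannot be re-populated either. */
        childp = kvm_pte_follow(old, mm_ops);
        for (i = 0; i < PTRS_PER_PTE; i++)
                stage2_lock_subtree(&childp[i], level + 1, mm_ops);
}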
On Thursday 21 Apr 2022 at 16:40:56 (+0000), Oliver Upton wrote:
> The other option would be to not touch the subtree at all until the RCU
> callback, as at that point software will not tweak the tables any more.
> No need for atomics/spinning; we can just do a boring traversal.

Right, that is sort of what I had in mind. Note that I'm still trying to
make up my mind about the overall approach -- I can see how RCU
protection provides a rather elegant solution to this problem, but it
makes the whole thing inaccessible to e.g. pKVM, where RCU is a
non-starter.

A possible alternative that comes to mind would be to have all walkers
take references on the pages as they walk down, and release them on
their way back up, but I'm still not sure how to make this race-safe.
I'll have a think...
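A hypothetical illustration of the refcounting alternative; every name
below is invented for this sketch (only kvm_pte_follow() and the
get_page/put_page mm ops exist in the series), and the race Quentin is
worried about -- a get_page() racing with the final put_page() of a
concurrent unlink -- is deliberately left unsolved:

static int walk_child_table(kvm_pte_t pte, u32 level,
                            struct kvm_pgtable_mm_ops *mm_ops)
{
        kvm_pte_t *childp = kvm_pte_follow(pte, mm_ops);
        int ret;

        mm_ops->get_page(childp);                  /* pin the table page */
        ret = walk_table_level(childp, level + 1); /* hypothetical walk */
        mm_ops->put_page(childp);                  /* unpin on the way back up */

        return ret;
}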
On Fri, Apr 22, 2022 at 04:00:45PM +0000, Quentin Perret wrote:
> On Thursday 21 Apr 2022 at 16:40:56 (+0000), Oliver Upton wrote:
> > The other option would be to not touch the subtree at all until the
> > RCU callback, as at that point software will not tweak the tables
> > any more. No need for atomics/spinning; we can just do a boring
> > traversal.
>
> Right, that is sort of what I had in mind. Note that I'm still trying
> to make up my mind about the overall approach -- I can see how RCU
> protection provides a rather elegant solution to this problem, but it
> makes the whole thing inaccessible to e.g. pKVM, where RCU is a
> non-starter.

Heh, figuring out how to do this for pKVM seemed hard, hence my lazy
attempt :)

> A possible alternative that comes to mind would be to have all walkers
> take references on the pages as they walk down, and release them on
> their way back up, but I'm still not sure how to make this race-safe.
> I'll have a think...

Does pKVM ever collapse tables into blocks? That is the only reason any
of this mess ever gets roped in. If not, I think it is possible to get
away with a rwlock, with unmap on the write side and everything else on
the read side, right?

As far as regular KVM goes, we get into this business when disabling
dirty logging on a memslot. Guest faults will lazily collapse the tables
back into blocks. An equally valid implementation would be just to unmap
the whole memslot and have the guest build out the tables again, which
could work with the aforementioned rwlock.

Do any of my ramblings sound workable? :)

--
Thanks,
Oliver
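A sketch of the rwlock arrangement floated here, with hypothetical
names. The point is that unmap -- the only path that would free table
pages in a scheme without table collapse -- excludes all other walkers,
while faults still run in parallel with one another:

static DEFINE_RWLOCK(stage2_walk_lock);

static void stage2_unmap_memslot(u64 addr, u64 size)
{
        write_lock(&stage2_walk_lock);  /* exclusive: may free tables */
        /* ... unmap the range and free page table pages ... */
        write_unlock(&stage2_walk_lock);
}

static int stage2_fault_walk(u64 addr)
{
        read_lock(&stage2_walk_lock);   /* parallel with other faults */
        /* ... install mappings; never frees a live table ... */
        read_unlock(&stage2_walk_lock);

        return 0;
}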
On Friday 22 Apr 2022 at 20:41:47 (+0000), Oliver Upton wrote:
> On Fri, Apr 22, 2022 at 04:00:45PM +0000, Quentin Perret wrote:
> > On Thursday 21 Apr 2022 at 16:40:56 (+0000), Oliver Upton wrote:
> > > The other option would be to not touch the subtree at all until
> > > the RCU callback, as at that point software will not tweak the
> > > tables any more. No need for atomics/spinning; we can just do a
> > > boring traversal.
> >
> > Right, that is sort of what I had in mind. Note that I'm still
> > trying to make up my mind about the overall approach -- I can see
> > how RCU protection provides a rather elegant solution to this
> > problem, but it makes the whole thing inaccessible to e.g. pKVM,
> > where RCU is a non-starter.
>
> Heh, figuring out how to do this for pKVM seemed hard, hence my lazy
> attempt :)
>
> > A possible alternative that comes to mind would be to have all
> > walkers take references on the pages as they walk down, and release
> > them on their way back up, but I'm still not sure how to make this
> > race-safe. I'll have a think...
>
> Does pKVM ever collapse tables into blocks? That is the only reason
> any of this mess ever gets roped in. If not, I think it is possible
> to get away with a rwlock, with unmap on the write side and
> everything else on the read side, right?
>
> As far as regular KVM goes, we get into this business when disabling
> dirty logging on a memslot. Guest faults will lazily collapse the
> tables back into blocks. An equally valid implementation would be
> just to unmap the whole memslot and have the guest build out the
> tables again, which could work with the aforementioned rwlock.

Apologies for the delay on this one, I was away for a while.

Yup, that all makes sense. FWIW the pKVM use-case I have in mind is
slightly different. Specifically, in the pKVM world the hypervisor
maintains a stage-2 for the host that is all identity mapped, so we use
nice big block mappings as much as we can. But when a protected guest
starts, the hypervisor needs to break down the host stage-2 blocks to
unmap the 4K guest pages from the host (which is where the protection
comes from in pKVM). And when the guest is torn down, the host can
reclaim its pages, hence putting us in a position to coalesce its
stage-2 into nice big blocks again. Note that none of this coalescing is
currently implemented even in our pKVM prototype, so it's a bit unfair
to ask you to deal with this stuff now, but clearly it'd be cool if
there was a way we could make these things coexist and even ideally
share some code...
On Tue, May 03, 2022 at 02:17:25PM +0000, Quentin Perret wrote:
> On Friday 22 Apr 2022 at 20:41:47 (+0000), Oliver Upton wrote:
> > On Fri, Apr 22, 2022 at 04:00:45PM +0000, Quentin Perret wrote:
> > > On Thursday 21 Apr 2022 at 16:40:56 (+0000), Oliver Upton wrote:
> > > > The other option would be to not touch the subtree at all until
> > > > the RCU callback, as at that point software will not tweak the
> > > > tables any more. No need for atomics/spinning; we can just do a
> > > > boring traversal.
> > >
> > > Right, that is sort of what I had in mind. Note that I'm still
> > > trying to make up my mind about the overall approach -- I can see
> > > how RCU protection provides a rather elegant solution to this
> > > problem, but it makes the whole thing inaccessible to e.g. pKVM,
> > > where RCU is a non-starter.
> >
> > Heh, figuring out how to do this for pKVM seemed hard, hence my lazy
> > attempt :)
> >
> > > A possible alternative that comes to mind would be to have all
> > > walkers take references on the pages as they walk down, and
> > > release them on their way back up, but I'm still not sure how to
> > > make this race-safe. I'll have a think...
> >
> > Does pKVM ever collapse tables into blocks? That is the only reason
> > any of this mess ever gets roped in. If not, I think it is possible
> > to get away with a rwlock, with unmap on the write side and
> > everything else on the read side, right?
> >
> > As far as regular KVM goes, we get into this business when disabling
> > dirty logging on a memslot. Guest faults will lazily collapse the
> > tables back into blocks. An equally valid implementation would be
> > just to unmap the whole memslot and have the guest build out the
> > tables again, which could work with the aforementioned rwlock.
>
> Apologies for the delay on this one, I was away for a while.
>
> Yup, that all makes sense. FWIW the pKVM use-case I have in mind is
> slightly different. Specifically, in the pKVM world the hypervisor
> maintains a stage-2 for the host that is all identity mapped, so we
> use nice big block mappings as much as we can. But when a protected
> guest starts, the hypervisor needs to break down the host stage-2
> blocks to unmap the 4K guest pages from the host (which is where the
> protection comes from in pKVM). And when the guest is torn down, the
> host can reclaim its pages, hence putting us in a position to coalesce
> its stage-2 into nice big blocks again. Note that none of this
> coalescing is currently implemented even in our pKVM prototype, so
> it's a bit unfair to ask you to deal with this stuff now, but clearly
> it'd be cool if there was a way we could make these things coexist and
> even ideally share some code...

Oh, it certainly isn't unfair to make sure we've got good constructs
landing for everyone to use :-) I'll need to chew on this a bit more to
have a better answer.

The reason I hesitate to do the giant unmap for non-pKVM is that I
believe we'd be leaving some performance on the table for newer
implementations of the architecture. Having said that, avoiding a tlbi
vmalls12e1is on every collapsed table is highly desirable.

Support for FEAT_BBM=2 semantics in the MMU is also on the todo list. In
that case we'd do a direct table->block transformation on the PTE and
elide that nasty tlbi.

Unless there are objections, I'll probably hobble this series along
as-is for the time being. My hope is that other table walkers can join
in on the parallel party later down the road.

Thanks for getting back to me.

--
Best,
Oliver
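For the FEAT_BBM=2 idea, a hedged sketch under stated assumptions: with
level 2 break-before-make support the architecture tolerates rewriting a
table pte to a block pte without transitioning through an invalid pte,
which is what lets the heavyweight tlbi vmalls12e1is be elided. The
function name is invented here, and the exact remaining invalidation
requirements are deliberately left open:

static void stage2_collapse_to_block(kvm_pte_t *ptep, kvm_pte_t block_pte)
{
        /* Assumes FEAT_BBM level 2: direct table->block transformation,
         * no break-before-make and no full stage-2 invalidation. A
         * lighter leaf invalidation may still be needed. */
        WRITE_ONCE(*ptep, block_pte);

        /* Ensure the update is visible to the hardware table walker. */
        dsb(ishst);
}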
diff --git a/arch/arm64/kvm/hyp/pgtable.c b/arch/arm64/kvm/hyp/pgtable.c
index ffdfd5ee9642..146fc44acf31 100644
--- a/arch/arm64/kvm/hyp/pgtable.c
+++ b/arch/arm64/kvm/hyp/pgtable.c
@@ -838,6 +838,54 @@ static void stage2_make_pte(kvm_pte_t *ptep, kvm_pte_t new, struct kvm_pgtable_m
 	}
 }
 
+static kvm_pte_t stage2_unlink_pte_shared(kvm_pte_t *ptep)
+{
+	kvm_pte_t old;
+
+	while (true) {
+		old = xchg(ptep, KVM_INVALID_PTE_LOCKED);
+		if (old != KVM_INVALID_PTE_LOCKED)
+			return old;
+
+		cpu_relax();
+	}
+}
+
+
+/**
+ * stage2_unlink_pte() - Tears down an unreachable pte, returning the next pte
+ *			 to visit (if any).
+ *
+ * @ptep: pointer to the pte to unlink
+ * @level: page table level of the pte
+ * @shared: true if the tables are shared by multiple software walkers
+ * @mm_ops: pointer to the mm ops table
+ *
+ * Return: a table pte if another level of recursion is necessary, 0 otherwise.
+ */
+static kvm_pte_t stage2_unlink_pte(kvm_pte_t *ptep, u32 level, bool shared,
+				   struct kvm_pgtable_mm_ops *mm_ops)
+{
+	kvm_pte_t old;
+
+	if (shared) {
+		old = stage2_unlink_pte_shared(ptep);
+	} else {
+		old = *ptep;
+		WRITE_ONCE(*ptep, KVM_INVALID_PTE_LOCKED);
+	}
+
+	WARN_ON(stage2_pte_is_locked(old));
+
+	if (kvm_pte_table(old, level))
+		return old;
+
+	if (stage2_pte_is_counted(old))
+		mm_ops->put_page(ptep);
+
+	return 0;
+}
+
 static void stage2_put_pte(kvm_pte_t *ptep, struct kvm_s2_mmu *mmu, u64 addr,
 			   u32 level, struct kvm_pgtable_mm_ops *mm_ops)
 {
@@ -922,8 +970,10 @@ static int stage2_map_walk_table_pre(u64 addr, u64 end, u32 level,
 				     kvm_pte_t *ptep, kvm_pte_t *old,
 				     struct stage2_map_data *data, bool shared)
 {
-	if (data->anchor)
+	if (data->anchor) {
+		*old = stage2_unlink_pte(ptep, level, shared, data->mm_ops);
 		return 0;
+	}
 
 	if (!stage2_leaf_mapping_allowed(addr, end, level, data))
 		return 0;
@@ -944,9 +994,7 @@ static int stage2_map_walk_leaf(u64 addr, u64 end, u32 level, kvm_pte_t *ptep,
 	int ret;
 
 	if (data->anchor) {
-		if (stage2_pte_is_counted(*old))
-			mm_ops->put_page(ptep);
-
+		*old = stage2_unlink_pte(ptep, level, shared, data->mm_ops);
 		return 0;
 	}
 
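As a usage note on the patch above (editorial illustration, not part of
the series): KVM_INVALID_PTE_LOCKED is what makes the xchg() loop in
stage2_unlink_pte_shared() converge. A concurrent walker that observes a
locked pte is expected to back off and retry rather than descend, along
these hypothetical lines:

static bool stage2_try_read_pte(kvm_pte_t *ptep, kvm_pte_t *pte)
{
        *pte = READ_ONCE(*ptep);

        /* Another walker owns this subtree; the caller should
         * cpu_relax() and retry instead of descending. */
        return *pte != KVM_INVALID_PTE_LOCKED;
}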
Breaking a table pte is insufficient to guarantee ownership of an
unlinked subtree. Parallel software walkers could be traversing
substructures and changing their mappings.

Recurse through the unlinked subtree and lock all descendant ptes to
take ownership of the subtree. Since the ptes are actually being
evicted, return table ptes back to the table walker to ensure child
tables are also traversed. Note that this is done in both the pre-order
and leaf visitors as the underlying pte remains volatile until it is
unlinked.

Signed-off-by: Oliver Upton <oupton@google.com>
---
 arch/arm64/kvm/hyp/pgtable.c | 56 +++++++++++++++++++++++++++++++++---
 1 file changed, 52 insertions(+), 4 deletions(-)