[RFC,09/17] KVM: arm64: Tear down unlinked page tables in parallel walk

Message ID 20220415215901.1737897-10-oupton@google.com (mailing list archive)
State New, archived
Series KVM: arm64: Parallelize stage 2 fault handling

Commit Message

Oliver Upton April 15, 2022, 9:58 p.m. UTC
Breaking a table pte is insufficient to guarantee ownership of an
unlinked subtree. Parallel software walkers could be traversing
substructures and changing their mappings.

Recurse through the unlinked subtree and lock all descendant ptes
to take ownership of the subtree. Since the ptes are actually being
evicted, return table ptes back to the table walker to ensure child
tables are also traversed. Note that this is done in both the
pre-order and leaf visitors, as the underlying pte remains volatile until
it is unlinked.
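
To illustrate the shape of the recursion, here is a userspace model of
the walk (made-up types and pte encoding, not the code in this patch):

#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

#define PTE_LOCKED	((uint64_t)1)	/* stands in for KVM_INVALID_PTE_LOCKED */
#define PTES_PER_TABLE	512

typedef _Atomic uint64_t pte_t;

/* made-up encoding: bit 1 marks a table pte at non-leaf levels */
static bool pte_is_table(uint64_t pte, int level)
{
	return level < 3 && (pte & 2);
}

static pte_t *table_from_pte(uint64_t pte)
{
	return (pte_t *)(uintptr_t)(pte & ~0xfffULL);
}

/* swap in the locked marker; spin if another walker beat us to it */
static uint64_t unlink_pte(pte_t *ptep)
{
	uint64_t old;

	for (;;) {
		old = atomic_exchange(ptep, PTE_LOCKED);
		if (old != PTE_LOCKED)
			return old;
	}
}

/*
 * Lock every descendant pte so no concurrent walker can install new
 * tables behind us, recursing into child tables as they are found.
 */
static void tear_down(pte_t *table, int level)
{
	for (int i = 0; i < PTES_PER_TABLE; i++) {
		uint64_t old = unlink_pte(&table[i]);

		if (pte_is_table(old, level))
			tear_down(table_from_pte(old), level + 1);
	}
	/* the table page itself is freed after the next RCU grace period */
}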

Signed-off-by: Oliver Upton <oupton@google.com>
---
 arch/arm64/kvm/hyp/pgtable.c | 56 +++++++++++++++++++++++++++++++++---
 1 file changed, 52 insertions(+), 4 deletions(-)

Comments

Quentin Perret April 21, 2022, 1:21 p.m. UTC | #1
Hi Oliver,

On Friday 15 Apr 2022 at 21:58:53 (+0000), Oliver Upton wrote:
> Breaking a table pte is insufficient to guarantee ownership of an
> unlinked subtree. Parallel software walkers could be traversing
> substructures and changing their mappings.
> 
> Recurse through the unlinked subtree and lock all descendant ptes
> to take ownership of the subtree. Since the ptes are actually being
> evicted, return table ptes back to the table walker to ensure child
> tables are also traversed. Note that this is done in both the
> pre-order and leaf visitors, as the underlying pte remains volatile until
> it is unlinked.

Still trying to get the full picture of the series, so bear with me. IIUC
the case you're dealing with here is when we're coalescing a table into
a block with concurrent walkers making changes in the sub-tree. I
believe this should happen when turning dirty logging off?

Why do we need to recursively lock the entire sub-tree at all in this
case? As long as the table is turned into a locked invalid PTE, what
concurrent walkers are doing in the sub-tree should be irrelevant, no?
None of the changes they do will be made visible to the hardware anyway.
So as long as the sub-tree isn't freed under their feet (which should be
the point of the RCU protection) this should be all fine? Is there a
case where this is not actually true?

Thanks,
Quentin
Oliver Upton April 21, 2022, 4:40 p.m. UTC | #2
On Thu, Apr 21, 2022 at 01:21:54PM +0000, Quentin Perret wrote:
> Hi Oliver,
> 
> On Friday 15 Apr 2022 at 21:58:53 (+0000), Oliver Upton wrote:
> > Breaking a table pte is insufficient to guarantee ownership of an
> > unlinked subtree. Parallel software walkers could be traversing
> > substructures and changing their mappings.
> > 
> > Recurse through the unlinked subtree and lock all descendant ptes
> > to take ownership of the subtree. Since the ptes are actually being
> > evicted, return table ptes back to the table walker to ensure child
> > tables are also traversed. Note that this is done in both the
> > pre-order and leaf visitors, as the underlying pte remains volatile until
> > it is unlinked.
> 
> Still trying to get the full picture of the series, so bear with me. IIUC
> the case you're dealing with here is when we're coalescing a table into
> a block with concurrent walkers making changes in the sub-tree. I
> believe this should happen when turning dirty logging off?

Yup, I think that's the only time we wind up collapsing tables.

> Why do we need to recursively lock the entire sub-tree at all in this
> case? As long as the table is turned into a locked invalid PTE, what
> concurrent walkers are doing in the sub-tree should be irrelevant, no?
> None of the changes they do will be made visible to the hardware anyway.
> So as long as the sub-tree isn't freed under their feet (which should be
> the point of the RCU protection) this should be all fine? Is there a
> case where this is not actually true?

The problem arises when you're trying to actually free an unlinked
subtree. All bets are off until the next RCU grace period. What would
stop another software walker from installing a table at a PTE that I've
already visited? I think we'd wind up leaking a table page in this case
as the walker doing the table collapse assumes it has successfully freed
everything underneath.
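
To make the leak concrete, here is a userspace model of the interleaving
(made-up names, just two racing walkers on a single pte slot):

#include <stdatomic.h>
#include <stdint.h>

static _Atomic uint64_t slot;

/* collapse walker: plain read, put_page(), never looks at the slot again */
static void collapser_visit(void)
{
	uint64_t old = atomic_load(&slot);
	(void)old;
}

/* fault handler: can still install a table after collapser_visit() ran,
 * and nothing will ever free that table page */
static void fault_handler(uint64_t expected, uint64_t new_table)
{
	atomic_compare_exchange_strong(&slot, &expected, new_table);
}

With the xchg() to KVM_INVALID_PTE_LOCKED in this patch, that
compare-exchange can never succeed against a slot we have already visited.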

The other option would be to not touch the subtree at all until the RCU
callback, as at that point software will not tweak the tables any more.
No need for atomics/spinning, and we can just do a boring traversal. Of
course, I lazily avoided this option because it would be a bit more code,
but it isn't too awfully complicated.
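
Roughly like so -- hand-waving, the context struct and helper names below
are made up just to show the shape:

struct stage2_free_ctx {
	struct rcu_head rcu;
	kvm_pte_t pte;
	u32 level;
	struct kvm_pgtable_mm_ops *mm_ops;
};

static void stage2_free_subtree(kvm_pte_t pte, u32 level,
				struct kvm_pgtable_mm_ops *mm_ops)
{
	kvm_pte_t *childp = kvm_pte_follow(pte, mm_ops);
	u32 i;

	for (i = 0; i < PTRS_PER_PTE; i++) {
		if (kvm_pte_table(childp[i], level + 1))
			stage2_free_subtree(childp[i], level + 1, mm_ops);
		else if (stage2_pte_is_counted(childp[i]))
			mm_ops->put_page(&childp[i]);
	}

	mm_ops->put_page(childp);
}

static void stage2_free_unlinked_cb(struct rcu_head *head)
{
	struct stage2_free_ctx *ctx =
		container_of(head, struct stage2_free_ctx, rcu);

	/* grace period has elapsed: plain reads, no atomics or spinning */
	stage2_free_subtree(ctx->pte, ctx->level, ctx->mm_ops);
	kfree(ctx);
}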

Does this paint a better picture, or have I only managed to confuse even
more? :)

--
Thanks,
Oliver
Quentin Perret April 22, 2022, 4 p.m. UTC | #3
On Thursday 21 Apr 2022 at 16:40:56 (+0000), Oliver Upton wrote:
> The other option would be to not touch the subtree at all until the RCU
> callback, as at that point software will not tweak the tables any more.
> No need for atomics/spinning, and we can just do a boring traversal.

Right, that is sort of what I had in mind. Note that I'm still trying to
make up my mind about the overall approach -- I can see how RCU protection
provides a rather elegant solution to this problem, but this makes the
whole thing inaccessible to e.g. pKVM, where RCU is a non-starter. A
possible alternative that comes to mind would be to have all walkers
take references on the pages as they walk down, and release them on
their way back, but I'm still not sure how to make this race-safe. I'll
have a think ...
Oliver Upton April 22, 2022, 8:41 p.m. UTC | #4
On Fri, Apr 22, 2022 at 04:00:45PM +0000, Quentin Perret wrote:
> On Thursday 21 Apr 2022 at 16:40:56 (+0000), Oliver Upton wrote:
> > The other option would be to not touch the subtree at all until the RCU
> > callback, as at that point software will not tweak the tables any more.
> > No need for atomics/spinning, and we can just do a boring traversal.
> 
> Right, that is sort of what I had in mind. Note that I'm still trying to
> make up my mind about the overall approach -- I can see how RCU protection
> provides a rather elegant solution to this problem, but this makes the
> whole thing inaccessible to e.g. pKVM, where RCU is a non-starter.

Heh, figuring out how to do this for pKVM seemed hard, hence my lazy
attempt :)

> A
> possible alternative that comes to mind would be to have all walkers
> take references on the pages as they walk down, and release them on
> their way back, but I'm still not sure how to make this race-safe. I'll
> have a think ...

Does pKVM ever collapse tables into blocks? That is the only reason any
of this mess ever gets roped in. If not, I think it is possible to get
away with an rwlock, with unmap on the write side and everything else on
the read side, right?

As far as regular KVM goes, we get into this business when disabling dirty
logging on a memslot. Guest faults will lazily collapse the tables back
into blocks. An equally valid implementation would be just to unmap the
whole memslot and have the guest build out the tables again, which could
work with the aforementioned rwlock.
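
Something shaped like this is what I have in mind (entirely hypothetical,
no such lock exists in the code today):

static DEFINE_RWLOCK(stage2_lock);

static void fault_path_walk(void)
{
	read_lock(&stage2_lock);
	/* map / permission-relax walks, any number in parallel */
	read_unlock(&stage2_lock);
}

static void unmap_walk(void)
{
	write_lock(&stage2_lock);
	/* exclusive: table pages can be freed immediately */
	write_unlock(&stage2_lock);
}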

Do any of my ramblings sound workable? :)

--
Thanks,
Oliver
Quentin Perret May 3, 2022, 2:17 p.m. UTC | #5
On Friday 22 Apr 2022 at 20:41:47 (+0000), Oliver Upton wrote:
> On Fri, Apr 22, 2022 at 04:00:45PM +0000, Quentin Perret wrote:
> > On Thursday 21 Apr 2022 at 16:40:56 (+0000), Oliver Upton wrote:
> > > The other option would be to not touch the subtree at all until the RCU
> > > callback, as at that point software will not tweak the tables any more.
> > > No need for atomics/spinning, and we can just do a boring traversal.
> > 
> > Right, that is sort of what I had in mind. Note that I'm still trying to
> > make up my mind about the overall approach -- I can see how RCU protection
> > provides a rather elegant solution to this problem, but this makes the
> > whole thing inaccessible to e.g. pKVM, where RCU is a non-starter.
> 
> Heh, figuring out how to do this for pKVM seemed hard, hence my lazy
> attempt :)
> 
> > A
> > possible alternative that comes to mind would be to have all walkers
> > take references on the pages as they walk down, and release them on
> > their way back, but I'm still not sure how to make this race-safe. I'll
> > have a think ...
> 
> Does pKVM ever collapse tables into blocks? That is the only reason any
> of this mess ever gets roped in. If not, I think it is possible to get
> away with an rwlock, with unmap on the write side and everything else on
> the read side, right?
> 
> As far as regular KVM goes, we get into this business when disabling dirty
> logging on a memslot. Guest faults will lazily collapse the tables back
> into blocks. An equally valid implementation would be just to unmap the
> whole memslot and have the guest build out the tables again, which could
> work with the aforementioned rwlock.

Apologies for the delay on this one; I was away for a while.

Yup, that all makes sense. FWIW the pKVM use-case I have in mind is
slightly different. Specifically, in the pKVM world the hypervisor
maintains a stage-2 for the host, which is all identity mapped. So we use
nice big block mappings as much as we can. But when a protected guest
starts, the hypervisor needs to break down the host stage-2 blocks to
unmap the 4K guest pages from the host (which is where the protection
comes from in pKVM). And when the guest is torn down, the host can
reclaim its pages, hence putting us in a position to coalesce its
stage-2 into nice big blocks again. Note that none of this coalescing
is currently implemented even in our pKVM prototype, so it's a bit
unfair to ask you to deal with this stuff now, but clearly it'd be cool
if there was a way we could make these things coexist and even ideally
share some code...
Oliver Upton May 4, 2022, 6:03 a.m. UTC | #6
On Tue, May 03, 2022 at 02:17:25PM +0000, Quentin Perret wrote:
> On Friday 22 Apr 2022 at 20:41:47 (+0000), Oliver Upton wrote:
> > On Fri, Apr 22, 2022 at 04:00:45PM +0000, Quentin Perret wrote:
> > > On Thursday 21 Apr 2022 at 16:40:56 (+0000), Oliver Upton wrote:
> > > > The other option would be to not touch the subtree at all until the RCU
> > > > callback, as at that point software will not tweak the tables any more.
> > > > No need for atomics/spinning, and we can just do a boring traversal.
> > > 
> > > Right, that is sort of what I had in mind. Note that I'm still trying to
> > > make up my mind about the overall approach -- I can see how RCU protection
> > > provides a rather elegant solution to this problem, but this makes the
> > > whole thing inaccessible to e.g. pKVM, where RCU is a non-starter.
> > 
> > Heh, figuring out how to do this for pKVM seemed hard, hence my lazy
> > attempt :)
> > 
> > > A
> > > possible alternative that comes to mind would be to have all walkers
> > > take references on the pages as they walk down, and release them on
> > > their way back, but I'm still not sure how to make this race-safe. I'll
> > > have a think ...
> > 
> > Does pKVM ever collapse tables into blocks? That is the only reason any
> > of this mess ever gets roped in. If not, I think it is possible to get
> > away with an rwlock, with unmap on the write side and everything else on
> > the read side, right?
> > 
> > As far as regular KVM goes, we get into this business when disabling dirty
> > logging on a memslot. Guest faults will lazily collapse the tables back
> > into blocks. An equally valid implementation would be just to unmap the
> > whole memslot and have the guest build out the tables again, which could
> > work with the aforementioned rwlock.
> 
> Apologies for the delay on this one; I was away for a while.
> 
> Yup, that all makes sense. FWIW the pKVM use-case I have in mind is
> slightly different. Specifically, in the pKVM world the hypervisor
> maintains a stage-2 for the host, which is all identity mapped. So we use
> nice big block mappings as much as we can. But when a protected guest
> starts, the hypervisor needs to break down the host stage-2 blocks to
> unmap the 4K guest pages from the host (which is where the protection
> comes from in pKVM). And when the guest is torn down, the host can
> reclaim its pages, hence putting us in a position to coalesce its
> stage-2 into nice big blocks again. Note that none of this coalescing
> is currently implemented even in our pKVM prototype, so it's a bit
> unfair to ask you to deal with this stuff now, but clearly it'd be cool
> if there was a way we could make these things coexist and even ideally
> share some code...

Oh, it certainly isn't unfair to make sure we've got good constructs
landing for everyone to use :-)

I'll need to chew on this a bit more to have a better answer. The reason
I hesitate to do the giant unmap for non-pKVM is that I believe we'd be
leaving some performance on the table for newer implementations of the
architecture. Having said that, avoiding a tlbi vmalls12e1is on every
collapsed table is highly desirable.

FEAT_BBM level 2 semantics in the MMU are also on the todo list. In that
case we'd do a direct table->block transformation on the PTE and elide
that nasty tlbi.
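
Roughly (the capability check is invented for illustration, and barriers
are omitted):

static void stage2_coalesce_block(kvm_pte_t *ptep, kvm_pte_t block_pte,
				  struct kvm_s2_mmu *mmu)
{
	if (cpus_have_bbm_level2()) {
		/* FEAT_BBM level 2: direct table -> block replacement */
		WRITE_ONCE(*ptep, block_pte);
	} else {
		/* classic break-before-make */
		WRITE_ONCE(*ptep, 0);
		kvm_call_hyp(__kvm_tlb_flush_vmid, mmu);
		WRITE_ONCE(*ptep, block_pte);
	}
}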

Unless there are objections, I'll probably hobble this series along as-is
for the time being. My hope is that other table walkers can join in on
the parallel party later down the road.

Thanks for getting back to me.

--
Best,
Oliver

Patch

diff --git a/arch/arm64/kvm/hyp/pgtable.c b/arch/arm64/kvm/hyp/pgtable.c
index ffdfd5ee9642..146fc44acf31 100644
--- a/arch/arm64/kvm/hyp/pgtable.c
+++ b/arch/arm64/kvm/hyp/pgtable.c
@@ -838,6 +838,54 @@  static void stage2_make_pte(kvm_pte_t *ptep, kvm_pte_t new, struct kvm_pgtable_m
 	}
 }
 
+static kvm_pte_t stage2_unlink_pte_shared(kvm_pte_t *ptep)
+{
+	kvm_pte_t old;
+
+	while (true) {
+		old = xchg(ptep, KVM_INVALID_PTE_LOCKED);
+		if (old != KVM_INVALID_PTE_LOCKED)
+			return old;
+
+		cpu_relax();
+	}
+}
+
+
+/**
+ * stage2_unlink_pte() - Tears down an unreachable pte, returning the next pte
+ *			 to visit (if any).
+ *
+ * @ptep: pointer to the pte to unlink
+ * @level: page table level of the pte
+ * @shared: true if the tables are shared by multiple software walkers
+ * @mm_ops: pointer to the mm ops table
+ *
+ * Return: a table pte if another level of recursion is necessary, 0 otherwise.
+ */
+static kvm_pte_t stage2_unlink_pte(kvm_pte_t *ptep, u32 level, bool shared,
+				   struct kvm_pgtable_mm_ops *mm_ops)
+{
+	kvm_pte_t old;
+
+	if (shared) {
+		old = stage2_unlink_pte_shared(ptep);
+	} else {
+		old = *ptep;
+		WRITE_ONCE(*ptep, KVM_INVALID_PTE_LOCKED);
+	}
+
+	WARN_ON(stage2_pte_is_locked(old));
+
+	if (kvm_pte_table(old, level))
+		return old;
+
+	if (stage2_pte_is_counted(old))
+		mm_ops->put_page(ptep);
+
+	return 0;
+}
+
 static void stage2_put_pte(kvm_pte_t *ptep, struct kvm_s2_mmu *mmu, u64 addr,
 			   u32 level, struct kvm_pgtable_mm_ops *mm_ops)
 {
@@ -922,8 +970,10 @@  static int stage2_map_walk_table_pre(u64 addr, u64 end, u32 level,
 				     struct stage2_map_data *data,
 				     bool shared)
 {
-	if (data->anchor)
+	if (data->anchor) {
+		*old = stage2_unlink_pte(ptep, level, shared, data->mm_ops);
 		return 0;
+	}
 
 	if (!stage2_leaf_mapping_allowed(addr, end, level, data))
 		return 0;
@@ -944,9 +994,7 @@  static int stage2_map_walk_leaf(u64 addr, u64 end, u32 level, kvm_pte_t *ptep,
 	int ret;
 
 	if (data->anchor) {
-		if (stage2_pte_is_counted(*old))
-			mm_ops->put_page(ptep);
-
+		*old = stage2_unlink_pte(ptep, level, shared, data->mm_ops);
 		return 0;
 	}