[03/14] KVM: arm64: Continue stage-2 map when re-creating mappings

Message ID 20210719104735.3681732-4-qperret@google.com
State New, archived
Series Track shared pages at EL2 in protected mode

Commit Message

Quentin Perret July 19, 2021, 10:47 a.m. UTC
The stage-2 map walkers currently return -EAGAIN when re-creating
identical mappings or only changing access permissions. This makes it
possible to optimize the mapping of pages when concurrent (v)CPUs fault
on the same page.

While this works as expected when touching one page-table leaf at a
time, this can lead to difficult situations when mapping larger ranges.
Indeed, a large map operation can fail in the middle if an existing
mapping is found in the range, even if it has compatible attributes,
hence leaving only half of the range mapped.

To avoid having to deal with such failures in the caller, don't
interrupt the map operation when hitting existing PTEs, but make sure to
still return -EAGAIN so that user_mem_abort() can mark the page dirty
when needed.

Cc: Yanan Wang <wangyanan55@huawei.com>
Signed-off-by: Quentin Perret <qperret@google.com>
---
 arch/arm64/include/asm/kvm_pgtable.h |  2 +-
 arch/arm64/kvm/hyp/pgtable.c         | 21 +++++++++++++++++----
 2 files changed, 18 insertions(+), 5 deletions(-)

Comments

Marc Zyngier July 19, 2021, 12:14 p.m. UTC | #1
On Mon, 19 Jul 2021 11:47:24 +0100,
Quentin Perret <qperret@google.com> wrote:
> 
> The stage-2 map walkers currently return -EAGAIN when re-creating
> identical mappings or only changing access permissions. This makes it
> possible to optimize the mapping of pages when concurrent (v)CPUs fault
> on the same page.
> 
> While this works as expected when touching one page-table leaf at a
> time, this can lead to difficult situations when mapping larger ranges.
> Indeed, a large map operation can fail in the middle if an existing
> mapping is found in the range, even if it has compatible attributes,
> hence leaving only half of the range mapped.

I'm curious when this can happen. We normally map a single leaf at
a time, and we don't have a way to map multiple leaves at once: we
either use the VMA base size or try to upgrade it to a THP, but the
result is always a single leaf entry. What changed?

> To avoid having to deal with such failures in the caller, don't
> interrupt the map operation when hitting existing PTEs, but make sure to
> still return -EAGAIN so that user_mem_abort() can mark the page dirty
> when needed.

I don't follow you here: if you return -EAGAIN for a writable mapping,
we don't account for the page to be dirty on the assumption that
nothing has been mapped. But if there is a way to map more than a
single entry and to get -EAGAIN at the same time, then we're bound to
lose data on page eviction.

Can you shed some light on this?

Thanks,

	M.
Quentin Perret July 19, 2021, 1:32 p.m. UTC | #2
On Monday 19 Jul 2021 at 13:14:48 (+0100), Marc Zyngier wrote:
> On Mon, 19 Jul 2021 11:47:24 +0100,
> Quentin Perret <qperret@google.com> wrote:
> > 
> > The stage-2 map walkers currently return -EAGAIN when re-creating
> > identical mappings or only changing access permissions. This makes it
> > possible to optimize the mapping of pages when concurrent (v)CPUs fault
> > on the same page.
> > 
> > While this works as expected when touching one page-table leaf at a
> > time, this can lead to difficult situations when mapping larger ranges.
> > Indeed, a large map operation can fail in the middle if an existing
> > mapping is found in the range, even if it has compatible attributes,
> > hence leaving only half of the range mapped.
> 
> I'm curious when this can happen. We normally map a single leaf at
> a time, and we don't have a way to map multiple leaves at once: we
> either use the VMA base size or try to upgrade it to a THP, but the
> result is always a single leaf entry. What changed?

Nothing _yet_ :-)

The 'share' hypercall introduced near the end of the series makes it
possible to share multiple physically contiguous pages in one go -- this
is mostly to allow sharing data structures that are larger than a page.

So if one of the pages happens to be already mapped by the time the
hypercall is issued, mapping the range with the right SW bits becomes
difficult as kvm_pgtable_stage2_map() will fail halfway through, which
is tricky to handle.
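
To make that concrete, the caller side with this patch applied can be as
simple as the sketch below -- host_share_range() is just an invented name
for illustration here, not the helper the series actually adds:

/*
 * Illustrative sketch only: map a physically contiguous range with a
 * single kvm_pgtable_stage2_map() call. Without this patch, hitting a
 * compatible pre-existing mapping part-way through returns -EAGAIN and
 * leaves the rest of the range unmapped; with it, such PTEs are skipped
 * and the whole range ends up mapped either way.
 */
static int host_share_range(struct kvm_pgtable *pgt, u64 addr, u64 phys,
			    u64 size, enum kvm_pgtable_prot prot, void *mc)
{
	int ret = kvm_pgtable_stage2_map(pgt, addr, size, phys, prot, mc);

	/* -EAGAIN now only means "some PTEs were already up to date". */
	return ret == -EAGAIN ? 0 : ret;
}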

This patch shouldn't change anything for existing users that only map
things that are nicely aligned at block/page granularity, but should
make the life of new users easier, so that seemed like a win.

> > To avoid having to deal with such failures in the caller, don't
> > interrupt the map operation when hitting existing PTEs, but make sure to
> > still return -EAGAIN so that user_mem_abort() can mark the page dirty
> > when needed.
> 
> I don't follow you here: if you return -EAGAIN for a writable mapping,
> we don't account for the page to be dirty on the assumption that
> nothing has been mapped. But if there is a way to map more than a
> single entry and to get -EAGAIN at the same time, then we're bound to
> lose data on page eviction.
> 
> Can you shed some light on this?

Sure. For guests, hitting the -EAGAIN case means we've lost the race
with another vCPU that faulted the same page. In this case the other
vCPU either mapped the page RO, which means that our vCPU will then get
a permission fault next time we run it which will lead to the page being
marked dirty, or the other vCPU mapped the page RW in which case it
already marked the page dirty for us and we can safely re-enter the
guest without doing anything else.

So what I meant by "still return -EAGAIN so that user_mem_abort() can
mark the page dirty when needed" is "make sure to mark the page dirty
only when necessary: if winning the race and marking the page RW, or
in the permission fault path". That is, by keeping the -EAGAIN I want to
make sure we don't mark the page dirty twice. (This might be fine, but this
would be new behaviour, and it was not clear that it would scale well to
many vCPUs faulting on the same page).
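
For reference, the tail of user_mem_abort() that this relies on looks
roughly like the sketch below (simplified and paraphrased rather than
quoted verbatim, with handle_map_result() being an invented wrapper just
to keep it short):

/*
 * Simplified sketch, not the verbatim kernel code: a non-zero map
 * result, including -EAGAIN, skips the dirty accounting, and -EAGAIN
 * is then swallowed so the vCPU simply re-enters the guest.
 */
static int handle_map_result(struct kvm *kvm, struct kvm_memory_slot *memslot,
			     gfn_t gfn, kvm_pfn_t pfn, bool writable, int ret)
{
	/* Mark the page dirty only if the fault was handled successfully. */
	if (writable && !ret) {
		kvm_set_pfn_dirty(pfn);
		mark_page_dirty_in_slot(kvm, memslot, gfn);
	}

	return ret != -EAGAIN ? ret : 0;
}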

I see how this wording can be highly confusing though, so I'll re-word it
for the next version.

Cheers,
Quentin
Marc Zyngier July 20, 2021, 8:26 a.m. UTC | #3
On Mon, 19 Jul 2021 14:32:10 +0100,
Quentin Perret <qperret@google.com> wrote:
> 
> On Monday 19 Jul 2021 at 13:14:48 (+0100), Marc Zyngier wrote:
> > On Mon, 19 Jul 2021 11:47:24 +0100,
> > Quentin Perret <qperret@google.com> wrote:
> > > 
> > > The stage-2 map walkers currently return -EAGAIN when re-creating
> > > identical mappings or only changing access permissions. This makes it
> > > possible to optimize the mapping of pages when concurrent (v)CPUs fault
> > > on the same page.
> > > 
> > > While this works as expected when touching one page-table leaf at a
> > > time, this can lead to difficult situations when mapping larger ranges.
> > > Indeed, a large map operation can fail in the middle if an existing
> > > mapping is found in the range, even if it has compatible attributes,
> > > hence leaving only half of the range mapped.
> > 
> > I'm curious when this can happen. We normally map a single leaf at
> > a time, and we don't have a way to map multiple leaves at once: we
> > either use the VMA base size or try to upgrade it to a THP, but the
> > result is always a single leaf entry. What changed?
> 
> Nothing _yet_ :-)
> 
> The 'share' hypercall introduced near the end of the series makes it
> possible to share multiple physically contiguous pages in one go -- this
> is mostly to allow sharing data structures that are larger than a page.
> 
> So if one of the pages happens to be already mapped by the time the
> hypercall is issued, mapping the range with the right SW bits becomes
> difficult as kvm_pgtable_stage2_map() will fail halfway through, which
> is tricky to handle.
> 
> This patch shouldn't change anything for existing users that only map
> things that are nicely aligned at block/page granularity, but should
> make the life of new users easier, so that seemed like a win.

Right, but this is on a different path, right? Guests can never fault
multiple mappings at once, and it takes you a host hypercall to
perform this "multiple leaves at once".

Is there any way we can restrict this to the hypercall? Or even
better, keep the hypercall as a "one page at a time" thing? I can't
imagine it being performance critical (it is a once-off, and only used
over a rather small region of memory). Then, the caller doesn't have
to worry about the page already being mapped or not. This would also
match the behaviour of what I do on the MMIO side.

Or do you anticipate much gain from this being able to use block
mappings?

> 
> > > To avoid having to deal with such failures in the caller, don't
> > > interrupt the map operation when hitting existing PTEs, but make sure to
> > > still return -EAGAIN so that user_mem_abort() can mark the page dirty
> > > when needed.
> > 
> > I don't follow you here: if you return -EAGAIN for a writable mapping,
> > we don't account for the page to be dirty on the assumption that
> > nothing has been mapped. But if there is a way to map more than a
> > single entry and to get -EAGAIN at the same time, then we're bound to
> > lose data on page eviction.
> > 
> > Can you shed some light on this?
> 
> Sure. For guests, hitting the -EAGAIN case means we've lost the race
> with another vCPU that faulted the same page. In this case the other
> vCPU either mapped the page RO, which means that our vCPU will then get
> a permission fault next time we run it which will lead to the page being
> marked dirty, or the other vCPU mapped the page RW in which case it
> already marked the page dirty for us and we can safely re-enter the
> guest without doing anything else.
> 
> So what I meant by "still return -EAGAIN so that user_mem_abort() can
> mark the page dirty when needed" is "make sure to mark the page dirty
> only when necessary: if winning the race and marking the page RW, or
> in the permission fault path". That is, by keeping the -EAGAIN I want to
> make sure we don't mark the page dirty twice. (This might be fine, but this
> would be new behaviour, and it was not clear that it would scale well to
> many vCPUs faulting on the same page).
> 
> I see how this wording can be highly confusing though, so I'll re-word it
> for the next version.

I indeed found it pretty confusing. A reword would be much appreciated.

Thanks,

	M.
Quentin Perret July 20, 2021, 11:56 a.m. UTC | #4
On Tuesday 20 Jul 2021 at 09:26:10 (+0100), Marc Zyngier wrote:
> Right, but this is on a different path, right? Guests can never fault
> multiple mappings at once, and it takes you a host hypercall to
> perform this "multiple leaves at once".

Right.

> Is there any way we can restrict this to the hypercall? Or even
> better, keep the hypercall as a "one page at a time" thing? I can't
> imagine it being performance critical (it is a once-off, and only used
> over a rather small region of memory). Then, the caller doesn't have
> to worry about the page already being mapped or not. This would also
> match the behaviour of what I do on the MMIO side.
> 
> Or do you anticipate much gain from this being able to use block
> mappings?

Not really, no, especially given that mappings of shared pages will be
forced to page granularity thanks to the other patch we discussed in
this series. I was just hoping to reduce the overhead a bit by reducing
the number of hypercalls. But as you said, this probably doesn't matter
all that much, so happy to rework that. I'll look into making the hcall
use one page at a time, and hopefully that'll simplify a few other
things in the check_host_share_hyp() path near the end of this series.
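
Roughly, I'm thinking of something like the sketch below on the caller
side -- very much an assumption at this stage, including the hypercall
name and the pfn-based argument:

/*
 * Rough sketch only: issue the share hypercall one page at a time. The
 * hypercall name and pfn argument are assumptions here, not necessarily
 * what the reworked series will end up using.
 */
static int share_range_with_hyp(void *addr, size_t size)
{
	phys_addr_t phys = __pa(addr);
	phys_addr_t end = phys + PAGE_ALIGN(size);
	int ret;

	for (; phys < end; phys += PAGE_SIZE) {
		ret = kvm_call_hyp_nvhe(__pkvm_host_share_hyp,
					phys >> PAGE_SHIFT);
		if (ret)
			return ret;
	}

	return 0;
}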

Thanks,
Quentin

Patch

diff --git a/arch/arm64/include/asm/kvm_pgtable.h b/arch/arm64/include/asm/kvm_pgtable.h
index d6649352c8b3..af62203d2f7a 100644
--- a/arch/arm64/include/asm/kvm_pgtable.h
+++ b/arch/arm64/include/asm/kvm_pgtable.h
@@ -258,7 +258,7 @@  void kvm_pgtable_stage2_destroy(struct kvm_pgtable *pgt);
  * If device attributes are not explicitly requested in @prot, then the
  * mapping will be normal, cacheable.
  *
- * Note that the update of a valid leaf PTE in this function will be aborted,
+ * Note that the update of a valid leaf PTE in this function will be skipped,
  * if it's trying to recreate the exact same mapping or only change the access
  * permissions. Instead, the vCPU will exit one more time from guest if still
  * needed and then go through the path of relaxing permissions.
diff --git a/arch/arm64/kvm/hyp/pgtable.c b/arch/arm64/kvm/hyp/pgtable.c
index 978f341d02ca..bb73c5331b7c 100644
--- a/arch/arm64/kvm/hyp/pgtable.c
+++ b/arch/arm64/kvm/hyp/pgtable.c
@@ -475,6 +475,8 @@  struct stage2_map_data {
 	void				*memcache;
 
 	struct kvm_pgtable_mm_ops	*mm_ops;
+
+	int				ret;
 };
 
 u64 kvm_get_vtcr(u64 mmfr0, u64 mmfr1, u32 phys_shift)
@@ -612,8 +614,10 @@  static int stage2_map_walker_try_leaf(u64 addr, u64 end, u32 level,
 		 * the vCPU will exit one more time from guest if still needed
 		 * and then go through the path of relaxing permissions.
 		 */
-		if (!stage2_pte_needs_update(old, new))
-			return -EAGAIN;
+		if (!stage2_pte_needs_update(old, new)) {
+			data->ret = -EAGAIN;
+			goto out;
+		}
 
 		stage2_put_pte(ptep, data->mmu, addr, level, mm_ops);
 	}
@@ -629,6 +633,7 @@  static int stage2_map_walker_try_leaf(u64 addr, u64 end, u32 level,
 	smp_store_release(ptep, new);
 	if (stage2_pte_is_counted(new))
 		mm_ops->get_page(ptep);
+out:
 	if (kvm_phys_is_valid(phys))
 		data->phys += granule;
 	return 0;
@@ -771,6 +776,7 @@  int kvm_pgtable_stage2_map(struct kvm_pgtable *pgt, u64 addr, u64 size,
 		.mmu		= pgt->mmu,
 		.memcache	= mc,
 		.mm_ops		= pgt->mm_ops,
+		.ret		= 0,
 	};
 	struct kvm_pgtable_walker walker = {
 		.cb		= stage2_map_walker,
@@ -789,7 +795,10 @@  int kvm_pgtable_stage2_map(struct kvm_pgtable *pgt, u64 addr, u64 size,
 
 	ret = kvm_pgtable_walk(pgt, addr, size, &walker);
 	dsb(ishst);
-	return ret;
+	if (ret)
+		return ret;
+
+	return map_data.ret;
 }
 
 int kvm_pgtable_stage2_set_owner(struct kvm_pgtable *pgt, u64 addr, u64 size,
@@ -802,6 +811,7 @@  int kvm_pgtable_stage2_set_owner(struct kvm_pgtable *pgt, u64 addr, u64 size,
 		.memcache	= mc,
 		.mm_ops		= pgt->mm_ops,
 		.owner_id	= owner_id,
+		.ret		= 0,
 	};
 	struct kvm_pgtable_walker walker = {
 		.cb		= stage2_map_walker,
@@ -815,7 +825,10 @@  int kvm_pgtable_stage2_set_owner(struct kvm_pgtable *pgt, u64 addr, u64 size,
 		return -EINVAL;
 
 	ret = kvm_pgtable_walk(pgt, addr, size, &walker);
-	return ret;
+	if (ret)
+		return ret;
+
+	return map_data.ret;
 }
 
 static int stage2_unmap_walker(u64 addr, u64 end, u32 level, kvm_pte_t *ptep,