[v3,0/2] s390/mm: shared zeropage + KVM fixes

Message ID 20240411161441.910170-1-david@redhat.com (mailing list archive)

David Hildenbrand April 11, 2024, 4:14 p.m. UTC
This series fixes one issue with uffd + shared zeropages on s390x and
ensures that "ordinary" KVM guests can make use of shared zeropages again.

userfaultfd could currently end up mapping shared zeropages into processes
that forbid shared zeropages. This only applies to s390x, which has to
handle PV guests and guests that use storage keys correctly. Fix it by
placing a zeroed folio instead of the shared zeropage during
UFFDIO_ZEROPAGE.

I stumbled over this issue while looking into a customer scenario that
is using:

(1) Memory ballooning for dynamic resizing. Start a VM with, say, 100 GiB
    and inflate the balloon during boot to 60 GiB. The VM has ~40 GiB
    available and additional memory can be "fake hotplugged" to the VM
    later on demand by deflating the balloon. Actual memory overcommit is
    not desired, so physical memory would only be moved between VMs.

(2) Live migration of VMs between sites to evacuate servers in case of
    emergency.

Without the shared zeropage, during (2), the VM would suddenly consume
100 GiB on the migration source and destination. On the migration source,
where we don't expect memory overcommit, we could easily end up crashing
the VM during migration.

Independent of that, memory handed back to the hypervisor using "free page
reporting" would end up consuming actual memory after the migration on the
destination, not getting freed up until reused+freed again.

While there might be ways to optimize parts of this in QEMU, we really
should just support the shared zeropage again for ordinary VMs.

We only expect legacy guests to make use of storage keys, so let's handle
zeropages again when enabling storage keys or when enabling PV. To not
break userfaultfd like we did in the past, don't zap the shared zeropages,
but instead trigger unsharing faults, just like we do for unsharing
KSM pages in break_ksm().

Unsharing faults will simply replace the shared zeropage by a zeroed
anonymous folio. We can already trigger the same fault path using GUP,
when trying to long-term pin a shared zeropage, but also when unmerging
a KSM-placed zeropage, so this is nothing new.

Patch #1 tested on x86-64 by forcing mm_forbids_zeropage() to be 1, and
running the uffd selftests.

Patch #2 tested on s390x: the live migration scenario now works as
expected, and kvm-unit-tests that trigger usage of skeys work well; I can
see detection and unsharing of shared zeropages.

Further (this was broken in v2), I tested that the shared zeropage is no
longer populated after skeys are used -- that mm_forbids_zeropage() works
as expected:
  ./s390x-run s390x/skey.elf \
   -no-shutdown \
   -chardev socket,id=monitor,path=/var/tmp/mon,server,nowait \
   -mon chardev=monitor,mode=readline

  Then, in another shell:

  # cat /proc/`pgrep qemu`/smaps_rollup | grep Rss
  Rss:               31484 kB
  #  echo "dump-guest-memory tmp" | sudo nc -U /var/tmp/mon
  ...
  # cat /proc/`pgrep qemu`/smaps_rollup | grep Rss
  Rss:              160452 kB

  -> Reading guest memory does not populate the shared zeropage

  Doing the same with selftest.elf (no skeys)

  # cat /proc/`pgrep qemu`/smaps_rollup | grep Rss
  Rss:               30900 kB
  #  echo "dump-guest-memory tmp" | sudo nc -U /var/tmp/mon
  ...
  # cat /proc/`pgrep qemu`/smaps_rollup | grep Rss
  Rss:               30924 kB

  -> Reading guest memory does populate the shared zeropage
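For reference, the before/after RSS comparison above can be wrapped in a small helper. A hedged sketch: the function names are mine, the `qemu` process name and `/var/tmp/mon` socket path are just the values from the transcript, and it assumes a Linux kernel new enough to provide /proc/PID/smaps_rollup.

```shell
# Print a process's Rss in kB from smaps_rollup.
rss_kb() {
    awk '/^Rss:/ {print $2}' "/proc/$1/smaps_rollup"
}

# Run a command and report how much the target process's Rss changed.
rss_delta() {
    pid=$1; shift
    before=$(rss_kb "$pid")
    "$@"
    after=$(rss_kb "$pid")
    echo "$((after - before)) kB"
}

# Usage sketch, matching the transcript:
#   rss_delta "$(pgrep qemu)" \
#       sh -c 'echo "dump-guest-memory tmp" | nc -U /var/tmp/mon'
```

With skeys in use, the delta is large (real pages get populated); without skeys, it stays near zero (the shared zeropage is used).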

Based on s390/features. Andrew agreed that both patches can go via the
s390x tree.

v2 -> v3:
* "mm/userfaultfd: don't place zeropages when zeropages are disallowed"
 -> Fix wrong mm_forbids_zeropage check
* "s390/mm: re-enable the shared zeropage for !PV and !skeys KVM guests"
 -> Fix wrong mm_forbids_zeropage define

v1 -> v2:
* "mm/userfaultfd: don't place zeropages when zeropages are disallowed"
 -> Minor "ret" handling tweaks
* "s390/mm: re-enable the shared zeropage for !PV and !skeys KVM guests"
 -> Added Fixes: tag

Cc: Christian Borntraeger <borntraeger@linux.ibm.com>
Cc: Janosch Frank <frankja@linux.ibm.com>
Cc: Claudio Imbrenda <imbrenda@linux.ibm.com>
Cc: Heiko Carstens <hca@linux.ibm.com>
Cc: Vasily Gorbik <gor@linux.ibm.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Peter Xu <peterx@redhat.com>
Cc: Alexander Gordeev <agordeev@linux.ibm.com>
Cc: Sven Schnelle <svens@linux.ibm.com>
Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: kvm@vger.kernel.org
Cc: linux-s390@vger.kernel.org


David Hildenbrand (2):
  mm/userfaultfd: don't place zeropages when zeropages are disallowed
  s390/mm: re-enable the shared zeropage for !PV and !skeys KVM guests

 arch/s390/include/asm/gmap.h        |   2 +-
 arch/s390/include/asm/mmu.h         |   5 +
 arch/s390/include/asm/mmu_context.h |   1 +
 arch/s390/include/asm/pgtable.h     |  16 ++-
 arch/s390/kvm/kvm-s390.c            |   4 +-
 arch/s390/mm/gmap.c                 | 163 +++++++++++++++++++++-------
 mm/userfaultfd.c                    |  34 ++++++
 7 files changed, 178 insertions(+), 47 deletions(-)

Comments

Andrew Morton April 11, 2024, 9:28 p.m. UTC | #1
On Thu, 11 Apr 2024 18:14:39 +0200 David Hildenbrand <david@redhat.com> wrote:

> This series fixes one issue with uffd + shared zeropages on s390x and
> fixes that "ordinary" KVM guests can make use of shared zeropages again.
> 
> ...
>
> Without the shared zeropage, during (2), the VM would suddenly consume
> 100 GiB on the migration source and destination. On the migration source,
> where we don't expect memory overcommit, we could easily end up crashing
> the VM during migration.
> 
> Independent of that, memory handed back to the hypervisor using "free page
> reporting" would end up consuming actual memory after the migration on the
> destination, not getting freed up until reused+freed again.
> 

Is a backport desirable?

If so, the [1/2] Fixes dates back to 2015 and the [2/2] Fixes is from
2017.  Is it appropriate that the patches be backported so far back,
and into different kernel versions?
David Hildenbrand April 11, 2024, 9:56 p.m. UTC | #2
On 11.04.24 23:28, Andrew Morton wrote:
> On Thu, 11 Apr 2024 18:14:39 +0200 David Hildenbrand <david@redhat.com> wrote:
> 
>> This series fixes one issue with uffd + shared zeropages on s390x and
>> fixes that "ordinary" KVM guests can make use of shared zeropages again.
>>
>> ...
>>
>> Without the shared zeropage, during (2), the VM would suddenly consume
>> 100 GiB on the migration source and destination. On the migration source,
>> where we don't expect memory overcommit, we could easily end up crashing
>> the VM during migration.
>>
>> Independent of that, memory handed back to the hypervisor using "free page
>> reporting" would end up consuming actual memory after the migration on the
>> destination, not getting freed up until reused+freed again.
>>
> 
> Is a backport desirable?
> 
> If so, the [1/2] Fixes dates back to 2015 and the [2/2] Fixes is from
> 2017.  Is it appropriate that the patches be backported so far back,
> and into different kernel versions?
> 

[2/2] won't be easy to backport to kernels without FAULT_FLAG_UNSHARE, 
so I wouldn't really suggest backports to kernels before that. [1/2] 
might be reasonable to backport, but might require some tweaking (page 
vs. folio).
Alexander Gordeev April 12, 2024, 1:25 p.m. UTC | #3
> David Hildenbrand (2):
>   mm/userfaultfd: don't place zeropages when zeropages are disallowed
>   s390/mm: re-enable the shared zeropage for !PV and !skeys KVM guests
> 
>  arch/s390/include/asm/gmap.h        |   2 +-
>  arch/s390/include/asm/mmu.h         |   5 +
>  arch/s390/include/asm/mmu_context.h |   1 +
>  arch/s390/include/asm/pgtable.h     |  16 ++-
>  arch/s390/kvm/kvm-s390.c            |   4 +-
>  arch/s390/mm/gmap.c                 | 163 +++++++++++++++++++++-------
>  mm/userfaultfd.c                    |  34 ++++++
>  7 files changed, 178 insertions(+), 47 deletions(-)

Applied.
Thanks, David!
Alexander Gordeev April 17, 2024, 12:46 p.m. UTC | #4
On Thu, Apr 11, 2024 at 06:14:39PM +0200, David Hildenbrand wrote:

Hi David,

> Based on s390/features. Andrew agreed that both patches can go via the
> s390x tree.

I am going to put this series on a branch together with the selftest:
https://lore.kernel.org/r/20240412084329.30315-1-david@redhat.com

Is there something in s390/features your three patches depend on?
Or does v6.9-rc2 contain everything needed already?

Thanks!
David Hildenbrand April 17, 2024, 12:47 p.m. UTC | #5
On 17.04.24 14:46, Alexander Gordeev wrote:
> On Thu, Apr 11, 2024 at 06:14:39PM +0200, David Hildenbrand wrote:
> 
> Hi David,
> 
>> Based on s390/features. Andrew agreed that both patches can go via the
>> s390x tree.
> 
> I am going to put on a branch this series together with the selftest:
> https://lore.kernel.org/r/20240412084329.30315-1-david@redhat.com
> 
> Is there something in s390/features your three patches depend on?
> Or does v6.9-rc2 contain everything needed already?

v6.9-rc2 should have all we need IIRC.