
[v1,00/12] virtio-mem: Expose device memory via multiple memslots

Message ID 20211027124531.57561-1-david@redhat.com (mailing list archive)

Message

David Hildenbrand Oct. 27, 2021, 12:45 p.m. UTC
This is the follow-up of [1], dropping auto-detection and vhost-user
changes from the initial RFC.

Based-on: 20211011175346.15499-1-david@redhat.com

A virtio-mem device is represented by a single large RAM memory region
backed by a single large mmap.

Right now, we map that complete memory region into guest physical address
space, resulting in a very large memory mapping, KVM memory slot, ...
although only a small amount of memory might actually be exposed to the VM.

For example, when starting a VM with a 1 TiB virtio-mem device that only
exposes little device memory (e.g., 1 GiB) towards the VM initially,
in order to hotplug more memory later, we waste a lot of memory on metadata
for KVM memory slots (> 2 GiB!) and the accompanying bitmaps. Although some
optimizations in KVM are being worked on to reduce this metadata overhead
on x86-64 in some cases, it remains a problem with nested VMs and there are
other reasons why we would want to reduce the total memory slot size to a
reasonable minimum.

We want to:
a) Reduce the metadata overhead, including bitmap sizes inside KVM but also
   inside QEMU KVM code where possible.
b) Not always expose all device-memory to the VM, to reduce the attack
   surface of malicious VMs without using userfaultfd.

So instead, expose the RAM memory region not via a single large mapping
(consuming one memslot) but via multiple mappings, each consuming
one memslot. To do that, we divide the RAM memory region into separate parts
via aliases and only map the aliases we actually need into a device
container. We have to make sure that QEMU won't silently merge the memory
sections corresponding to the aliases (and thereby also memslots),
otherwise we lose atomic updates with KVM and vhost-user, which we deeply
care about when adding/removing memory. Further, to get memslot accounting
right, such merging is better avoided.
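
To illustrate the idea, here is a rough sketch (not the actual patch code;
it assumes the virtio-mem.c context, and the container parameter and the
simple usable-region check are simplifications) of creating one alias per
memslot and mapping only the aliases that are currently needed:

    /*
     * Sketch only: create one alias per memslot, each covering a
     * memslot-sized window of the backing RAM region, and map only the
     * windows that fall into the usable device region.
     */
    static void virtio_mem_map_memslots_sketch(VirtIOMEM *vmem,
                                               MemoryRegion *container,
                                               uint64_t memslot_size,
                                               unsigned int nb_memslots)
    {
        unsigned int i;

        for (i = 0; i < nb_memslots; i++) {
            MemoryRegion *alias = g_new0(MemoryRegion, 1);
            char *name = g_strdup_printf("virtio-mem-memslot-%u", i);
            const hwaddr offset = (hwaddr)i * memslot_size;

            /* Alias a memslot-sized window of the device's RAM region. */
            memory_region_init_alias(alias, OBJECT(vmem), name,
                                     &vmem->memdev->mr, offset, memslot_size);
            g_free(name);

            /* Only map aliases that fall into the usable device region. */
            if (offset < vmem->usable_region_size) {
                memory_region_add_subregion(container, offset, alias);
            }
        }
    }

The series additionally marks such aliases unmergeable, so that adjacent
memory sections cannot be folded back into a single section behind our back.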

Within the memslots, virtio-mem can (un)plug memory dynamically at a smaller
granularity. So memslots are a pure optimization to tackle a) and b) above.

The user configures how many memslots a virtio-mem device should use, the
default is "1" -- essentially corresponding to the old behavior.

Right now, memslots are mapped once they fall into the usable device region
(which grows/shrinks on demand, either when requesting to hotplug more
memory or during/after reboots). In the future, with
VIRTIO_MEM_F_UNPLUGGED_INACCESSIBLE, we'll be able to (un)map aliases even
more dynamically when (un)plugging device blocks.


Adding a 500 GiB virtio-mem device with "memslots=500" and not hotplugging
any memory results in:
    0000000140000000-000001047fffffff (prio 0, i/o): device-memory
      0000000140000000-0000007e3fffffff (prio 0, i/o): virtio-mem-memslots

Requesting the VM to consume 2 GiB results in (note: the usable region size
is bigger than 2 GiB, so 3 * 1 GiB memslots are required):
    0000000140000000-000001047fffffff (prio 0, i/o): device-memory
      0000000140000000-0000007e3fffffff (prio 0, i/o): virtio-mem-memslots
        0000000140000000-000000017fffffff (prio 0, ram): alias virtio-mem-memslot-0 @mem0 0000000000000000-000000003fffffff
        0000000180000000-00000001bfffffff (prio 0, ram): alias virtio-mem-memslot-1 @mem0 0000000040000000-000000007fffffff
        00000001c0000000-00000001ffffffff (prio 0, ram): alias virtio-mem-memslot-2 @mem0 0000000080000000-00000000bfffffff

Requesting the VM to consume 20 GiB results in:
    0000000140000000-000001047fffffff (prio 0, i/o): device-memory
      0000000140000000-0000007e3fffffff (prio 0, i/o): virtio-mem-memslots
        0000000140000000-000000017fffffff (prio 0, ram): alias virtio-mem-memslot-0 @mem0 0000000000000000-000000003fffffff
        0000000180000000-00000001bfffffff (prio 0, ram): alias virtio-mem-memslot-1 @mem0 0000000040000000-000000007fffffff
        00000001c0000000-00000001ffffffff (prio 0, ram): alias virtio-mem-memslot-2 @mem0 0000000080000000-00000000bfffffff
        0000000200000000-000000023fffffff (prio 0, ram): alias virtio-mem-memslot-3 @mem0 00000000c0000000-00000000ffffffff
        0000000240000000-000000027fffffff (prio 0, ram): alias virtio-mem-memslot-4 @mem0 0000000100000000-000000013fffffff
        0000000280000000-00000002bfffffff (prio 0, ram): alias virtio-mem-memslot-5 @mem0 0000000140000000-000000017fffffff
        00000002c0000000-00000002ffffffff (prio 0, ram): alias virtio-mem-memslot-6 @mem0 0000000180000000-00000001bfffffff
        0000000300000000-000000033fffffff (prio 0, ram): alias virtio-mem-memslot-7 @mem0 00000001c0000000-00000001ffffffff
        0000000340000000-000000037fffffff (prio 0, ram): alias virtio-mem-memslot-8 @mem0 0000000200000000-000000023fffffff
        0000000380000000-00000003bfffffff (prio 0, ram): alias virtio-mem-memslot-9 @mem0 0000000240000000-000000027fffffff
        00000003c0000000-00000003ffffffff (prio 0, ram): alias virtio-mem-memslot-10 @mem0 0000000280000000-00000002bfffffff
        0000000400000000-000000043fffffff (prio 0, ram): alias virtio-mem-memslot-11 @mem0 00000002c0000000-00000002ffffffff
        0000000440000000-000000047fffffff (prio 0, ram): alias virtio-mem-memslot-12 @mem0 0000000300000000-000000033fffffff
        0000000480000000-00000004bfffffff (prio 0, ram): alias virtio-mem-memslot-13 @mem0 0000000340000000-000000037fffffff
        00000004c0000000-00000004ffffffff (prio 0, ram): alias virtio-mem-memslot-14 @mem0 0000000380000000-00000003bfffffff
        0000000500000000-000000053fffffff (prio 0, ram): alias virtio-mem-memslot-15 @mem0 00000003c0000000-00000003ffffffff
        0000000540000000-000000057fffffff (prio 0, ram): alias virtio-mem-memslot-16 @mem0 0000000400000000-000000043fffffff
        0000000580000000-00000005bfffffff (prio 0, ram): alias virtio-mem-memslot-17 @mem0 0000000440000000-000000047fffffff
        00000005c0000000-00000005ffffffff (prio 0, ram): alias virtio-mem-memslot-18 @mem0 0000000480000000-00000004bfffffff
        0000000600000000-000000063fffffff (prio 0, ram): alias virtio-mem-memslot-19 @mem0 00000004c0000000-00000004ffffffff
        0000000640000000-000000067fffffff (prio 0, ram): alias virtio-mem-memslot-20 @mem0 0000000500000000-000000053fffffff

Requesting the VM to consume 5 GiB and rebooting (note: usable region size
will change during reboots) results in:
    0000000140000000-000001047fffffff (prio 0, i/o): device-memory
      0000000140000000-0000007e3fffffff (prio 0, i/o): virtio-mem-memslots
        0000000140000000-000000017fffffff (prio 0, ram): alias virtio-mem-memslot-0 @mem0 0000000000000000-000000003fffffff
        0000000180000000-00000001bfffffff (prio 0, ram): alias virtio-mem-memslot-1 @mem0 0000000040000000-000000007fffffff
        00000001c0000000-00000001ffffffff (prio 0, ram): alias virtio-mem-memslot-2 @mem0 0000000080000000-00000000bfffffff
        0000000200000000-000000023fffffff (prio 0, ram): alias virtio-mem-memslot-3 @mem0 00000000c0000000-00000000ffffffff
        0000000240000000-000000027fffffff (prio 0, ram): alias virtio-mem-memslot-4 @mem0 0000000100000000-000000013fffffff
        0000000280000000-00000002bfffffff (prio 0, ram): alias virtio-mem-memslot-5 @mem0 0000000140000000-000000017fffffff

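In other words, the number of mapped memslots follows from the usable region
size, not from the amount of plugged memory. A sketch of that arithmetic
(not code from the series; DIV_ROUND_UP is QEMU's helper from osdep.h):

    /*
     * Sketch: every memslot that intersects the usable device region gets
     * mapped. With 1 GiB memslots and a usable region slightly larger than
     * 2 GiB, this yields the 3 mapped memslots shown above.
     */
    static unsigned int virtio_mem_mapped_memslots(uint64_t usable_region_size,
                                                   uint64_t memslot_size)
    {
        return DIV_ROUND_UP(usable_region_size, memslot_size);
    }
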

In addition to other factors (e.g., device block size), we limit the number
of memslots to 1024 per device and the size of one memslot to at least
128 MiB. Further, we make sure internally to align the memslot size to at
least 128 MiB. For now, we limit the total number of memslots that can
be used by memory devices to 2048, to avoid going overboard with individual
RAM mappings in our address spaces.
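
A sketch of how these limits could combine when deriving the per-memslot
size (illustrative only; the exact computation in the patches may differ,
and the interaction with the device block size is omitted here):

    /*
     * Sketch: derive the memslot size from the device region size and the
     * user-requested number of memslots (capped at 1024 per device),
     * enforcing the 128 MiB minimum and 128 MiB alignment.
     */
    static uint64_t virtio_mem_memslot_size_sketch(uint64_t region_size,
                                                   unsigned int memslots)
    {
        uint64_t size = MAX(region_size / memslots, 128 * MiB);

        return QEMU_ALIGN_UP(size, 128 * MiB);
    }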

Future work:
- vhost-user and libvhost-user/vhost-user-backend changes to support more than
  32 memslots.
- "memslots=0" mode to allow for auto-determining the number of memslots to
  use.
- Eventually have an interface to query the memslot limit for a QEMU
  instance. But vhost-* devices complicate the matter.

RFC -> v1:
- Dropped "max-memslots=" parameter and converted to "memslots=" parameter
- Dropped auto-determining the number of memslots to use
- Dropped vhost* memslot changes
- Improved error messages regarding memory slot limits
- Reshuffled, cleaned up patches, rewrote patch descriptions

Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Eduardo Habkost <ehabkost@redhat.com>
Cc: Marcel Apfelbaum <marcel.apfelbaum@gmail.com>
Cc: "Michael S. Tsirkin" <mst@redhat.com>
Cc: Igor Mammedov <imammedo@redhat.com>
Cc: Ani Sinha <ani@anisinha.ca>
Cc: Peter Xu <peterx@redhat.com>
Cc: Dr. David Alan Gilbert <dgilbert@redhat.com>
Cc: Stefan Hajnoczi <stefanha@redhat.com>
Cc: Richard Henderson <richard.henderson@linaro.org>
Cc: Philippe Mathieu-Daudé <f4bug@amsat.org>
Cc: Hui Zhu <teawater@gmail.com>
Cc: Sebastien Boeuf <sebastien.boeuf@intel.com>
Cc: kvm@vger.kernel.org

[1] https://lkml.kernel.org/r/20211013103330.26869-1-david@redhat.com

David Hildenbrand (12):
  kvm: Return number of free memslots
  vhost: Return number of free memslots
  memory: Allow for marking memory region aliases unmergeable
  vhost: Don't merge unmergeable memory sections
  memory-device: Move memory_device_check_addable() directly into
    memory_device_pre_plug()
  memory-device: Generalize memory_device_used_region_size()
  memory-device: Support memory devices that dynamically consume
    multiple memslots
  vhost: Respect reserved memslots for memory devices when realizing a
    vhost device
  memory: Drop mapping check from
    memory_region_get_ram_discard_manager()
  virtio-mem: Fix typo in virito_mem_intersect_memory_section() function
    name
  virtio-mem: Set the RamDiscardManager for the RAM memory region
    earlier
  virtio-mem: Expose device memory via multiple memslots

 accel/kvm/kvm-all.c            |  24 ++--
 accel/stubs/kvm-stub.c         |   4 +-
 hw/mem/memory-device.c         | 115 ++++++++++++++----
 hw/virtio/vhost-stub.c         |   2 +-
 hw/virtio/vhost.c              |  21 ++--
 hw/virtio/virtio-mem-pci.c     |  23 ++++
 hw/virtio/virtio-mem.c         | 212 +++++++++++++++++++++++++++++----
 include/exec/memory.h          |  23 ++++
 include/hw/mem/memory-device.h |  33 +++++
 include/hw/virtio/vhost.h      |   2 +-
 include/hw/virtio/virtio-mem.h |  25 +++-
 include/sysemu/kvm.h           |   2 +-
 softmmu/memory.c               |  35 ++++--
 stubs/qmp_memory_device.c      |   5 +
 14 files changed, 449 insertions(+), 77 deletions(-)

Comments

Michael S. Tsirkin Nov. 1, 2021, 10:15 p.m. UTC | #1
On Wed, Oct 27, 2021 at 02:45:19PM +0200, David Hildenbrand wrote:
> This is the follow-up of [1], dropping auto-detection and vhost-user
> changes from the initial RFC.
> 
> Based-on: 20211011175346.15499-1-david@redhat.com
> 
> A virtio-mem device is represented by a single large RAM memory region
> backed by a single large mmap.
> 
> Right now, we map that complete memory region into guest physical addres
> space, resulting in a very large memory mapping, KVM memory slot, ...
> although only a small amount of memory might actually be exposed to the VM.
> 
> For example, when starting a VM with a 1 TiB virtio-mem device that only
> exposes little device memory (e.g., 1 GiB) towards the VM initialliy,
> in order to hotplug more memory later, we waste a lot of memory on metadata
> for KVM memory slots (> 2 GiB!) and accompanied bitmaps. Although some
> optimizations in KVM are being worked on to reduce this metadata overhead
> on x86-64 in some cases, it remains a problem with nested VMs and there are
> other reasons why we would want to reduce the total memory slot to a
> reasonable minimum.
> 
> We want to:
> a) Reduce the metadata overhead, including bitmap sizes inside KVM but also
>    inside QEMU KVM code where possible.
> b) Not always expose all device-memory to the VM, to reduce the attack
>    surface of malicious VMs without using userfaultfd.

I'm confused by the mention of these security considerations,
and I expect users will be just as confused.
So let's say a user wants to not be exposed. What value should be
used for the option? What if a lower value is used?
Is there still some security advantage?

> So instead, expose the RAM memory region not by a single large mapping
> (consuming one memslot) but instead by multiple mappings, each consuming
> one memslot. To do that, we divide the RAM memory region via aliases into
> separate parts and only map the aliases into a device container we actually
> need. We have to make sure that QEMU won't silently merge the memory
> sections corresponding to the aliases (and thereby also memslots),
> otherwise we lose atomic updates with KVM and vhost-user, which we deeply
> care about when adding/removing memory. Further, to get memslot accounting
> right, such merging is better avoided.
> 
> Within the memslots, virtio-mem can (un)plug memory in smaller granularity
> dynamically. So memslots are a pure optimization to tackle a) and b) above.
> 
> The user configures how many memslots a virtio-mem device should use, the
> default is "1" -- essentially corresponding to the old behavior.
> 
> Memslots are right now mapped once they fall into the usable device region
> (which grows/shrinks on demand right now either when requesting to
>  hotplug more memory or during/after reboots). In the future, with
> VIRTIO_MEM_F_UNPLUGGED_INACCESSIBLE, we'll be able to (un)map aliases even
> more dynamically when (un)plugging device blocks.
> 
> 
> Adding a 500GiB virtio-mem device with "memslots=500" and not hotplugging
> any memory results in:
>     0000000140000000-000001047fffffff (prio 0, i/o): device-memory
>       0000000140000000-0000007e3fffffff (prio 0, i/o): virtio-mem-memslots
> 
> Requesting the VM to consume 2 GiB results in (note: the usable region size
> is bigger than 2 GiB, so 3 * 1 GiB memslots are required):
>     0000000140000000-000001047fffffff (prio 0, i/o): device-memory
>       0000000140000000-0000007e3fffffff (prio 0, i/o): virtio-mem-memslots
>         0000000140000000-000000017fffffff (prio 0, ram): alias virtio-mem-memslot-0 @mem0 0000000000000000-000000003fffffff
>         0000000180000000-00000001bfffffff (prio 0, ram): alias virtio-mem-memslot-1 @mem0 0000000040000000-000000007fffffff
>         00000001c0000000-00000001ffffffff (prio 0, ram): alias virtio-mem-memslot-2 @mem0 0000000080000000-00000000bfffffff
> 
> Requesting the VM to consume 20 GiB results in:
>     0000000140000000-000001047fffffff (prio 0, i/o): device-memory
>       0000000140000000-0000007e3fffffff (prio 0, i/o): virtio-mem-memslots
>         0000000140000000-000000017fffffff (prio 0, ram): alias virtio-mem-memslot-0 @mem0 0000000000000000-000000003fffffff
>         0000000180000000-00000001bfffffff (prio 0, ram): alias virtio-mem-memslot-1 @mem0 0000000040000000-000000007fffffff
>         00000001c0000000-00000001ffffffff (prio 0, ram): alias virtio-mem-memslot-2 @mem0 0000000080000000-00000000bfffffff
>         0000000200000000-000000023fffffff (prio 0, ram): alias virtio-mem-memslot-3 @mem0 00000000c0000000-00000000ffffffff
>         0000000240000000-000000027fffffff (prio 0, ram): alias virtio-mem-memslot-4 @mem0 0000000100000000-000000013fffffff
>         0000000280000000-00000002bfffffff (prio 0, ram): alias virtio-mem-memslot-5 @mem0 0000000140000000-000000017fffffff
>         00000002c0000000-00000002ffffffff (prio 0, ram): alias virtio-mem-memslot-6 @mem0 0000000180000000-00000001bfffffff
>         0000000300000000-000000033fffffff (prio 0, ram): alias virtio-mem-memslot-7 @mem0 00000001c0000000-00000001ffffffff
>         0000000340000000-000000037fffffff (prio 0, ram): alias virtio-mem-memslot-8 @mem0 0000000200000000-000000023fffffff
>         0000000380000000-00000003bfffffff (prio 0, ram): alias virtio-mem-memslot-9 @mem0 0000000240000000-000000027fffffff
>         00000003c0000000-00000003ffffffff (prio 0, ram): alias virtio-mem-memslot-10 @mem0 0000000280000000-00000002bfffffff
>         0000000400000000-000000043fffffff (prio 0, ram): alias virtio-mem-memslot-11 @mem0 00000002c0000000-00000002ffffffff
>         0000000440000000-000000047fffffff (prio 0, ram): alias virtio-mem-memslot-12 @mem0 0000000300000000-000000033fffffff
>         0000000480000000-00000004bfffffff (prio 0, ram): alias virtio-mem-memslot-13 @mem0 0000000340000000-000000037fffffff
>         00000004c0000000-00000004ffffffff (prio 0, ram): alias virtio-mem-memslot-14 @mem0 0000000380000000-00000003bfffffff
>         0000000500000000-000000053fffffff (prio 0, ram): alias virtio-mem-memslot-15 @mem0 00000003c0000000-00000003ffffffff
>         0000000540000000-000000057fffffff (prio 0, ram): alias virtio-mem-memslot-16 @mem0 0000000400000000-000000043fffffff
>         0000000580000000-00000005bfffffff (prio 0, ram): alias virtio-mem-memslot-17 @mem0 0000000440000000-000000047fffffff
>         00000005c0000000-00000005ffffffff (prio 0, ram): alias virtio-mem-memslot-18 @mem0 0000000480000000-00000004bfffffff
>         0000000600000000-000000063fffffff (prio 0, ram): alias virtio-mem-memslot-19 @mem0 00000004c0000000-00000004ffffffff
>         0000000640000000-000000067fffffff (prio 0, ram): alias virtio-mem-memslot-20 @mem0 0000000500000000-000000053fffffff
> 
> Requesting the VM to consume 5 GiB and rebooting (note: usable region size
> will change during reboots) results in:
>     0000000140000000-000001047fffffff (prio 0, i/o): device-memory
>       0000000140000000-0000007e3fffffff (prio 0, i/o): virtio-mem-memslots
>         0000000140000000-000000017fffffff (prio 0, ram): alias virtio-mem-memslot-0 @mem0 0000000000000000-000000003fffffff
>         0000000180000000-00000001bfffffff (prio 0, ram): alias virtio-mem-memslot-1 @mem0 0000000040000000-000000007fffffff
>         00000001c0000000-00000001ffffffff (prio 0, ram): alias virtio-mem-memslot-2 @mem0 0000000080000000-00000000bfffffff
>         0000000200000000-000000023fffffff (prio 0, ram): alias virtio-mem-memslot-3 @mem0 00000000c0000000-00000000ffffffff
>         0000000240000000-000000027fffffff (prio 0, ram): alias virtio-mem-memslot-4 @mem0 0000000100000000-000000013fffffff
>         0000000280000000-00000002bfffffff (prio 0, ram): alias virtio-mem-memslot-5 @mem0 0000000140000000-000000017fffffff
> 
> 
> In addition to other factors (e.g., device block size), we limit the number
> of memslots to 1024 per devices and the size of one memslot to at least
> 128 MiB. Further, we make sure internally to align the memslot size to at
> least 128 MiB. For now, we limit the total number of memslots that can
> be used by memory devices to 2048, to no go crazy on individual RAM
> mappings in our address spaces.
> 
> Future work:
> - vhost-user and libvhost-user/vhost-user-backend changes to support more than
>   32 memslots.
> - "memslots=0" mode to allow for auto-determining the number of memslots to
>   use.
> - Eventually have an interface to query the memslot limit for a QEMU
>   instance. But vhost-* devices complicate that matter.
> 
> RCF -> v1:
> - Dropped "max-memslots=" parameter and converted to "memslots=" parameter
> - Dropped auto-determining the number of memslots to use
> - Dropped vhost* memslot changes
> - Improved error messages regarding memory slot limits
> - Reshuffled, cleaned up patches, rewrote patch descriptions
> 
> Cc: Paolo Bonzini <pbonzini@redhat.com>
> Cc: Eduardo Habkost <ehabkost@redhat.com>
> Cc: Marcel Apfelbaum <marcel.apfelbaum@gmail.com>
> Cc: "Michael S. Tsirkin" <mst@redhat.com>
> Cc: Igor Mammedov <imammedo@redhat.com>
> Cc: Ani Sinha <ani@anisinha.ca>
> Cc: Peter Xu <peterx@redhat.com>
> Cc: Dr. David Alan Gilbert <dgilbert@redhat.com>
> Cc: Stefan Hajnoczi <stefanha@redhat.com>
> Cc: Richard Henderson <richard.henderson@linaro.org>
> Cc: Philippe Mathieu-Daudé <f4bug@amsat.org>
> Cc: Hui Zhu <teawater@gmail.com>
> Cc: Sebastien Boeuf <sebastien.boeuf@intel.com>
> Cc: kvm@vger.kernel.org
> 
> [1] https://lkml.kernel.org/r/20211013103330.26869-1-david@redhat.com
> 
> David Hildenbrand (12):
>   kvm: Return number of free memslots
>   vhost: Return number of free memslots
>   memory: Allow for marking memory region aliases unmergeable
>   vhost: Don't merge unmergeable memory sections
>   memory-device: Move memory_device_check_addable() directly into
>     memory_device_pre_plug()
>   memory-device: Generalize memory_device_used_region_size()
>   memory-device: Support memory devices that dynamically consume
>     multiple memslots
>   vhost: Respect reserved memslots for memory devices when realizing a
>     vhost device
>   memory: Drop mapping check from
>     memory_region_get_ram_discard_manager()
>   virtio-mem: Fix typo in virito_mem_intersect_memory_section() function
>     name
>   virtio-mem: Set the RamDiscardManager for the RAM memory region
>     earlier
>   virtio-mem: Expose device memory via multiple memslots
> 
>  accel/kvm/kvm-all.c            |  24 ++--
>  accel/stubs/kvm-stub.c         |   4 +-
>  hw/mem/memory-device.c         | 115 ++++++++++++++----
>  hw/virtio/vhost-stub.c         |   2 +-
>  hw/virtio/vhost.c              |  21 ++--
>  hw/virtio/virtio-mem-pci.c     |  23 ++++
>  hw/virtio/virtio-mem.c         | 212 +++++++++++++++++++++++++++++----
>  include/exec/memory.h          |  23 ++++
>  include/hw/mem/memory-device.h |  33 +++++
>  include/hw/virtio/vhost.h      |   2 +-
>  include/hw/virtio/virtio-mem.h |  25 +++-
>  include/sysemu/kvm.h           |   2 +-
>  softmmu/memory.c               |  35 ++++--
>  stubs/qmp_memory_device.c      |   5 +
>  14 files changed, 449 insertions(+), 77 deletions(-)
> 
> -- 
> 2.31.1
David Hildenbrand Nov. 2, 2021, 8:33 a.m. UTC | #2
On 01.11.21 23:15, Michael S. Tsirkin wrote:
> On Wed, Oct 27, 2021 at 02:45:19PM +0200, David Hildenbrand wrote:
>> This is the follow-up of [1], dropping auto-detection and vhost-user
>> changes from the initial RFC.
>>
>> Based-on: 20211011175346.15499-1-david@redhat.com
>>
>> A virtio-mem device is represented by a single large RAM memory region
>> backed by a single large mmap.
>>
>> Right now, we map that complete memory region into guest physical addres
>> space, resulting in a very large memory mapping, KVM memory slot, ...
>> although only a small amount of memory might actually be exposed to the VM.
>>
>> For example, when starting a VM with a 1 TiB virtio-mem device that only
>> exposes little device memory (e.g., 1 GiB) towards the VM initialliy,
>> in order to hotplug more memory later, we waste a lot of memory on metadata
>> for KVM memory slots (> 2 GiB!) and accompanied bitmaps. Although some
>> optimizations in KVM are being worked on to reduce this metadata overhead
>> on x86-64 in some cases, it remains a problem with nested VMs and there are
>> other reasons why we would want to reduce the total memory slot to a
>> reasonable minimum.
>>
>> We want to:
>> a) Reduce the metadata overhead, including bitmap sizes inside KVM but also
>>    inside QEMU KVM code where possible.
>> b) Not always expose all device-memory to the VM, to reduce the attack
>>    surface of malicious VMs without using userfaultfd.
> 
> I'm confused by the mention of these security considerations,
> and I expect users will be just as confused.

A malicious VM wanting to consume more memory than desired is only
relevant when running untrusted VMs in some environments, and it can
also be caught differently, for example, by carefully monitoring and
limiting the maximum memory consumption of a VM. We have the same issue
already when using virtio-balloon to logically unplug memory. For me, it's a
secondary concern (optimizing a) is much more important).

Some users showed interest in having QEMU disallow access to unplugged
memory, because coming up with a maximum memory consumption for a VM is
hard. This is one step in that direction without having to run with
uffd enabled all of the time.

("Security" is somewhat the wrong word: a malicious VM won't be able to
steal any information from the hypervisor.)


> So let's say user wants to not be exposed. What value for
> the option should be used? What if a lower option is used?
> Is there still some security advantage?

My recommendation will be to use 1 memslot per gigabyte as the default if
possible in the configuration. If we have a virtio-mem device with a
maximum size of 128 GiB, the suggestion will be to use memslots=128.
Some setups will require less (e.g., vhost-user until adjusted, old
KVM), some setups can allow for more. I assume that most users will
later set "memslots=0", to enable auto-detection mode.


Assume we have a virtio-mem device with a maximum size of 1 TiB and we
hotplugged 1 GiB to the VM. With "memslots=1", the malicious VM could
actually access the whole 1 TiB. With "memslots=1024", the malicious VM
could only access additional ~ 1 GiB. With "memslots=512", ~ 2 GiB.
That's the reduced attack surface.

Of course, it's different after we hot-unplugged memory: before we have
VIRTIO_MEM_F_UNPLUGGED_INACCESSIBLE support in QEMU, all memory
inside the usable region has to remain accessible and we cannot "unplug" the
memslots.


Note: With the upcoming VIRTIO_MEM_F_UNPLUGGED_INACCESSIBLE changes in QEMU,
one will be able to disallow any access for malicious VMs by making the
memslot size just as big as the device block size.

So with a 128 GiB virtio-mem device with memslots=128,block-size=1G, or
with memslots=1024,block-size=128M we could make it impossible for a
malicious VM to consume more memory than intended. But we lose
flexibility due to the block size and the limited number of available
memslots.

But again, for "full protection against malicious VMs" I consider
userfaultfd protection more flexible. This approach here gives some
advantage, especially with large virtio-mem devices that start
out small.
Michael S. Tsirkin Nov. 2, 2021, 11:35 a.m. UTC | #3
On Tue, Nov 02, 2021 at 09:33:55AM +0100, David Hildenbrand wrote:
> On 01.11.21 23:15, Michael S. Tsirkin wrote:
> > On Wed, Oct 27, 2021 at 02:45:19PM +0200, David Hildenbrand wrote:
> >> This is the follow-up of [1], dropping auto-detection and vhost-user
> >> changes from the initial RFC.
> >>
> >> Based-on: 20211011175346.15499-1-david@redhat.com
> >>
> >> A virtio-mem device is represented by a single large RAM memory region
> >> backed by a single large mmap.
> >>
> >> Right now, we map that complete memory region into guest physical addres
> >> space, resulting in a very large memory mapping, KVM memory slot, ...
> >> although only a small amount of memory might actually be exposed to the VM.
> >>
> >> For example, when starting a VM with a 1 TiB virtio-mem device that only
> >> exposes little device memory (e.g., 1 GiB) towards the VM initialliy,
> >> in order to hotplug more memory later, we waste a lot of memory on metadata
> >> for KVM memory slots (> 2 GiB!) and accompanied bitmaps. Although some
> >> optimizations in KVM are being worked on to reduce this metadata overhead
> >> on x86-64 in some cases, it remains a problem with nested VMs and there are
> >> other reasons why we would want to reduce the total memory slot to a
> >> reasonable minimum.
> >>
> >> We want to:
> >> a) Reduce the metadata overhead, including bitmap sizes inside KVM but also
> >>    inside QEMU KVM code where possible.
> >> b) Not always expose all device-memory to the VM, to reduce the attack
> >>    surface of malicious VMs without using userfaultfd.
> > 
> > I'm confused by the mention of these security considerations,
> > and I expect users will be just as confused.
> 
> Malicious VMs wanting to consume more memory than desired is only
> relevant when running untrusted VMs in some environments, and it can be
> caught differently, for example, by carefully monitoring and limiting
> the maximum memory consumption of a VM. We have the same issue already
> when using virtio-balloon to logically unplug memory. For me, it's a
> secondary concern ( optimizing a is much more important ).
> 
> Some users showed interest in having QEMU disallow access to unplugged
> memory, because coming up with a maximum memory consumption for a VM is
> hard. This is one step into that direction without having to run with
> uffd enabled all of the time.

Sorry about missing the memo - is there a lot of overhead associated
with uffd then?

> ("security is somewhat the wrong word. we won't be able to steal any
> information from the hypervisor.)

Right. Let's just spell it out.
Further, removing memory still requires guest cooperation.

> 
> > So let's say user wants to not be exposed. What value for
> > the option should be used? What if a lower option is used?
> > Is there still some security advantage?
> 
> My recommendation will be to use 1 memslot per gigabyte as default if
> possible in the configuration. If we have a virtio-mem devices with a
> maximum size of 128 GiB, the suggestion will be to use memslots=128.
> Some setups will require less (e.g., vhost-user until adjusted, old
> KVM), some setups can allow for more. I assume that most users will
> later set "memslots=0", to enable auto-detection mode.
> 
> 
> Assume we have a virtio-mem device with a maximum size of 1 TiB and we
> hotplugged 1 GiB to the VM. With "memslots=1", the malicious VM could
> actually access the whole 1 TiB. With "memslots=1024", the malicious VM
> could only access additional ~ 1 GiB. With "memslots=512", ~ 2 GiB.
> That's the reduced attack surface.
> 
> Of course, it's different after we hotunplugged memory, before we have
> VIRTIO_MEM_F_UNPLUGGED_INACCESSIBLE support in QEMU, because all memory
> inside the usable region has to be accessible and we cannot "unplug" the
> memslots.
> 
> 
> Note: With upcoming VIRTIO_MEM_F_UNPLUGGED_INACCESSIBLE changes in QEMU,
> one will be able to disallow any access for malicious VMs by setting the
> memblock size just as big as the device block size.
> 
> So with a 128 GiB virtio-mem device with memslots=128,block-size=1G, or
> with memslots=1024,block-size=128M we could make it impossible for a
> malicious VM to consume more memory than intended. But we lose
> flexibility due to the block size and the limited number of available
> memslots.
> 
> But again, for "full protection against malicious VMs" I consider
> userfaultfd protection more flexible. This approach here gives some
> advantage, especially when having large virtio-mem devices that start
> out small.
> 
> -- 
> Thanks,
> 
> David / dhildenb
David Hildenbrand Nov. 2, 2021, 11:55 a.m. UTC | #4
On 02.11.21 12:35, Michael S. Tsirkin wrote:
> On Tue, Nov 02, 2021 at 09:33:55AM +0100, David Hildenbrand wrote:
>> On 01.11.21 23:15, Michael S. Tsirkin wrote:
>>> On Wed, Oct 27, 2021 at 02:45:19PM +0200, David Hildenbrand wrote:
>>>> This is the follow-up of [1], dropping auto-detection and vhost-user
>>>> changes from the initial RFC.
>>>>
>>>> Based-on: 20211011175346.15499-1-david@redhat.com
>>>>
>>>> A virtio-mem device is represented by a single large RAM memory region
>>>> backed by a single large mmap.
>>>>
>>>> Right now, we map that complete memory region into guest physical addres
>>>> space, resulting in a very large memory mapping, KVM memory slot, ...
>>>> although only a small amount of memory might actually be exposed to the VM.
>>>>
>>>> For example, when starting a VM with a 1 TiB virtio-mem device that only
>>>> exposes little device memory (e.g., 1 GiB) towards the VM initialliy,
>>>> in order to hotplug more memory later, we waste a lot of memory on metadata
>>>> for KVM memory slots (> 2 GiB!) and accompanied bitmaps. Although some
>>>> optimizations in KVM are being worked on to reduce this metadata overhead
>>>> on x86-64 in some cases, it remains a problem with nested VMs and there are
>>>> other reasons why we would want to reduce the total memory slot to a
>>>> reasonable minimum.
>>>>
>>>> We want to:
>>>> a) Reduce the metadata overhead, including bitmap sizes inside KVM but also
>>>>    inside QEMU KVM code where possible.
>>>> b) Not always expose all device-memory to the VM, to reduce the attack
>>>>    surface of malicious VMs without using userfaultfd.
>>>
>>> I'm confused by the mention of these security considerations,
>>> and I expect users will be just as confused.
>>
>> Malicious VMs wanting to consume more memory than desired is only
>> relevant when running untrusted VMs in some environments, and it can be
>> caught differently, for example, by carefully monitoring and limiting
>> the maximum memory consumption of a VM. We have the same issue already
>> when using virtio-balloon to logically unplug memory. For me, it's a
>> secondary concern ( optimizing a is much more important ).
>>
>> Some users showed interest in having QEMU disallow access to unplugged
>> memory, because coming up with a maximum memory consumption for a VM is
>> hard. This is one step into that direction without having to run with
>> uffd enabled all of the time.
> 
> Sorry about missing the memo - is there a lot of overhead associated
> with uffd then?

When used with huge/gigantic pages, we don't particularly care.

For other memory backends, we'll have to route any population via the
uffd handler: guest accesses a 4k page -> place a 4k page from user
space. Instead of the kernel automatically placing a THP, we'd be
placing single 4k pages and have to hope the kernel will collapse them
into a THP later.
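
For reference, a minimal sketch of what such a fault-handling loop looks
like in general (plain userfaultfd usage, not code from this series or the
mentioned prototypes; the handler repeatedly places the same prepared
source page, e.g. a zeroed page):

    #include <linux/userfaultfd.h>
    #include <poll.h>
    #include <stdint.h>
    #include <sys/ioctl.h>
    #include <unistd.h>

    /* Place one 4k page from user space whenever the guest faults on one. */
    static void handle_uffd_faults(int uffd, void *src_page, size_t page_size)
    {
        struct uffd_msg msg;
        struct pollfd pfd = { .fd = uffd, .events = POLLIN };

        for (;;) {
            poll(&pfd, 1, -1);
            if (read(uffd, &msg, sizeof(msg)) != sizeof(msg) ||
                msg.event != UFFD_EVENT_PAGEFAULT) {
                continue;
            }

            /* Resolve the fault by copying in a single page. */
            struct uffdio_copy copy = {
                .dst = msg.arg.pagefault.address & ~(uint64_t)(page_size - 1),
                .src = (uint64_t)(uintptr_t)src_page,
                .len = page_size,
                .mode = 0,
            };
            ioctl(uffd, UFFDIO_COPY, &copy);
        }
    }

The concern is that every first touch of a 4k page in the VM takes this
round trip through user space instead of the kernel transparently backing
the access with a THP.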

khugepaged will only collapse into a THP if all affected page table
entries are present and don't map the zero page, though.

So we'll most certainly use fewer THPs for our VM, and VM startup time
("first memory access after plugging memory") can be slower.

I have prototypes for it, with some optimizations (e.g., on a 4k guest
access, populate the whole THP area), but we might not want to enable it
all of the time. (The interaction with postcopy has to be fixed, but it's
not a fundamental issue.)


Extending uffd-based protection for virtio-mem to other processes
(vhost-user) is a bit more complicated, and I am not 100% sure if it's
worth the trouble for now. memslots provide at least some high-level
protection for the important case of having a virtio-mem device that is
meant to eventually hotplug a lot of memory later.

> 
>> ("security is somewhat the wrong word. we won't be able to steal any
>> information from the hypervisor.)
> 
> Right. Let's just spell it out.
> Further, removing memory still requires guest cooperation.

Right.
Michael S. Tsirkin Nov. 2, 2021, 5:06 p.m. UTC | #5
On Tue, Nov 02, 2021 at 12:55:17PM +0100, David Hildenbrand wrote:
> On 02.11.21 12:35, Michael S. Tsirkin wrote:
> > On Tue, Nov 02, 2021 at 09:33:55AM +0100, David Hildenbrand wrote:
> >> On 01.11.21 23:15, Michael S. Tsirkin wrote:
> >>> On Wed, Oct 27, 2021 at 02:45:19PM +0200, David Hildenbrand wrote:
> >>>> This is the follow-up of [1], dropping auto-detection and vhost-user
> >>>> changes from the initial RFC.
> >>>>
> >>>> Based-on: 20211011175346.15499-1-david@redhat.com
> >>>>
> >>>> A virtio-mem device is represented by a single large RAM memory region
> >>>> backed by a single large mmap.
> >>>>
> >>>> Right now, we map that complete memory region into guest physical addres
> >>>> space, resulting in a very large memory mapping, KVM memory slot, ...
> >>>> although only a small amount of memory might actually be exposed to the VM.
> >>>>
> >>>> For example, when starting a VM with a 1 TiB virtio-mem device that only
> >>>> exposes little device memory (e.g., 1 GiB) towards the VM initialliy,
> >>>> in order to hotplug more memory later, we waste a lot of memory on metadata
> >>>> for KVM memory slots (> 2 GiB!) and accompanied bitmaps. Although some
> >>>> optimizations in KVM are being worked on to reduce this metadata overhead
> >>>> on x86-64 in some cases, it remains a problem with nested VMs and there are
> >>>> other reasons why we would want to reduce the total memory slot to a
> >>>> reasonable minimum.
> >>>>
> >>>> We want to:
> >>>> a) Reduce the metadata overhead, including bitmap sizes inside KVM but also
> >>>>    inside QEMU KVM code where possible.
> >>>> b) Not always expose all device-memory to the VM, to reduce the attack
> >>>>    surface of malicious VMs without using userfaultfd.
> >>>
> >>> I'm confused by the mention of these security considerations,
> >>> and I expect users will be just as confused.
> >>
> >> Malicious VMs wanting to consume more memory than desired is only
> >> relevant when running untrusted VMs in some environments, and it can be
> >> caught differently, for example, by carefully monitoring and limiting
> >> the maximum memory consumption of a VM. We have the same issue already
> >> when using virtio-balloon to logically unplug memory. For me, it's a
> >> secondary concern ( optimizing a is much more important ).
> >>
> >> Some users showed interest in having QEMU disallow access to unplugged
> >> memory, because coming up with a maximum memory consumption for a VM is
> >> hard. This is one step into that direction without having to run with
> >> uffd enabled all of the time.
> > 
> > Sorry about missing the memo - is there a lot of overhead associated
> > with uffd then?
> 
> When used with huge/gigantic pages, we don't particularly care.
> 
> For other memory backends, we'll have to route any population via the
> uffd handler: guest accesses a 4k page -> place a 4k page from user
> space. Instead of the kernel automatically placing a THP, we'd be
> placing single 4k pages and have to hope the kernel will collapse them
> into a THP later.

How much value is there in a THP given it's not present?


> khugepagd will only collapse into a THP if all affected page table
> entries are present and don't map the zero page, though.
> 
> So we'll most certainly use less THP for our VM and VM startup time
> ("first memory access after plugging memory") can be slower.
> 
> I have prototypes for it, with some optimizations (e.g., on 4k guest
> access, populate the whole THP area), but we might not want to enable it
> all of the time. (interaction with postcopy has to be fixed, but it's
> not a fundamental issue)
> 
> 
> Extending uffd-based protection for virtio-mem to other processes
> (vhost-user), is a bit more complicated, and I am not 100% sure if it's
> worth the trouble for now. memslots provide at least some high-level
> protection for the important case of having a virtio-mem device to
> eventually hotplug a lot of memory later.
> 
> > 
> >> ("security is somewhat the wrong word. we won't be able to steal any
> >> information from the hypervisor.)
> > 
> > Right. Let's just spell it out.
> > Further, removing memory still requires guest cooperation.
> 
> Right.
> 
> 
> -- 
> Thanks,
> 
> David / dhildenb
David Hildenbrand Nov. 2, 2021, 5:10 p.m. UTC | #6
On 02.11.21 18:06, Michael S. Tsirkin wrote:
> On Tue, Nov 02, 2021 at 12:55:17PM +0100, David Hildenbrand wrote:
>> On 02.11.21 12:35, Michael S. Tsirkin wrote:
>>> On Tue, Nov 02, 2021 at 09:33:55AM +0100, David Hildenbrand wrote:
>>>> On 01.11.21 23:15, Michael S. Tsirkin wrote:
>>>>> On Wed, Oct 27, 2021 at 02:45:19PM +0200, David Hildenbrand wrote:
>>>>>> This is the follow-up of [1], dropping auto-detection and vhost-user
>>>>>> changes from the initial RFC.
>>>>>>
>>>>>> Based-on: 20211011175346.15499-1-david@redhat.com
>>>>>>
>>>>>> A virtio-mem device is represented by a single large RAM memory region
>>>>>> backed by a single large mmap.
>>>>>>
>>>>>> Right now, we map that complete memory region into guest physical addres
>>>>>> space, resulting in a very large memory mapping, KVM memory slot, ...
>>>>>> although only a small amount of memory might actually be exposed to the VM.
>>>>>>
>>>>>> For example, when starting a VM with a 1 TiB virtio-mem device that only
>>>>>> exposes little device memory (e.g., 1 GiB) towards the VM initialliy,
>>>>>> in order to hotplug more memory later, we waste a lot of memory on metadata
>>>>>> for KVM memory slots (> 2 GiB!) and accompanied bitmaps. Although some
>>>>>> optimizations in KVM are being worked on to reduce this metadata overhead
>>>>>> on x86-64 in some cases, it remains a problem with nested VMs and there are
>>>>>> other reasons why we would want to reduce the total memory slot to a
>>>>>> reasonable minimum.
>>>>>>
>>>>>> We want to:
>>>>>> a) Reduce the metadata overhead, including bitmap sizes inside KVM but also
>>>>>>    inside QEMU KVM code where possible.
>>>>>> b) Not always expose all device-memory to the VM, to reduce the attack
>>>>>>    surface of malicious VMs without using userfaultfd.
>>>>>
>>>>> I'm confused by the mention of these security considerations,
>>>>> and I expect users will be just as confused.
>>>>
>>>> Malicious VMs wanting to consume more memory than desired is only
>>>> relevant when running untrusted VMs in some environments, and it can be
>>>> caught differently, for example, by carefully monitoring and limiting
>>>> the maximum memory consumption of a VM. We have the same issue already
>>>> when using virtio-balloon to logically unplug memory. For me, it's a
>>>> secondary concern ( optimizing a is much more important ).
>>>>
>>>> Some users showed interest in having QEMU disallow access to unplugged
>>>> memory, because coming up with a maximum memory consumption for a VM is
>>>> hard. This is one step into that direction without having to run with
>>>> uffd enabled all of the time.
>>>
>>> Sorry about missing the memo - is there a lot of overhead associated
>>> with uffd then?
>>
>> When used with huge/gigantic pages, we don't particularly care.
>>
>> For other memory backends, we'll have to route any population via the
>> uffd handler: guest accesses a 4k page -> place a 4k page from user
>> space. Instead of the kernel automatically placing a THP, we'd be
>> placing single 4k pages and have to hope the kernel will collapse them
>> into a THP later.
> 
> How much value there is in a THP given it's not present?

If you don't place a THP during the first page fault inside the
THP region, you'll have to rely on khugepaged to eventually place a huge
page later -- and manually fault in each and every 4k page. I haven't
done any performance measurements so far. Going via user space on every
4k fault will most certainly hurt performance when first touching memory.
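
To make the "populate the whole THP area on a 4k access" idea mentioned
earlier concrete, the placement in such a handler could be widened roughly
like this (a sketch with the same includes as the earlier one, not the
actual prototype; place_thp_area() and src_area are made-up names, and
src_area points at 2 MiB of prepared data):

    /*
     * Sketch: instead of placing a single 4k page, populate the whole 2 MiB
     * area the fault falls into, so that all PTEs are present and khugepaged
     * can later collapse the range into a THP.
     */
    static void place_thp_area(int uffd, uint64_t fault_addr, void *src_area)
    {
        const uint64_t thp_size = 2 * 1024 * 1024;   /* x86-64 THP size */
        struct uffdio_copy copy = {
            .dst = fault_addr & ~(thp_size - 1),
            .src = (uint64_t)(uintptr_t)src_area,
            .len = thp_size,
            .mode = 0,
        };

        /* The 2 MiB destination must lie within the uffd-registered range. */
        ioctl(uffd, UFFDIO_COPY, &copy);
    }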
Michael S. Tsirkin Nov. 7, 2021, 8:14 a.m. UTC | #7
On Tue, Nov 02, 2021 at 06:10:13PM +0100, David Hildenbrand wrote:
> On 02.11.21 18:06, Michael S. Tsirkin wrote:
> > On Tue, Nov 02, 2021 at 12:55:17PM +0100, David Hildenbrand wrote:
> >> On 02.11.21 12:35, Michael S. Tsirkin wrote:
> >>> On Tue, Nov 02, 2021 at 09:33:55AM +0100, David Hildenbrand wrote:
> >>>> On 01.11.21 23:15, Michael S. Tsirkin wrote:
> >>>>> On Wed, Oct 27, 2021 at 02:45:19PM +0200, David Hildenbrand wrote:
> >>>>>> This is the follow-up of [1], dropping auto-detection and vhost-user
> >>>>>> changes from the initial RFC.
> >>>>>>
> >>>>>> Based-on: 20211011175346.15499-1-david@redhat.com
> >>>>>>
> >>>>>> A virtio-mem device is represented by a single large RAM memory region
> >>>>>> backed by a single large mmap.
> >>>>>>
> >>>>>> Right now, we map that complete memory region into guest physical addres
> >>>>>> space, resulting in a very large memory mapping, KVM memory slot, ...
> >>>>>> although only a small amount of memory might actually be exposed to the VM.
> >>>>>>
> >>>>>> For example, when starting a VM with a 1 TiB virtio-mem device that only
> >>>>>> exposes little device memory (e.g., 1 GiB) towards the VM initialliy,
> >>>>>> in order to hotplug more memory later, we waste a lot of memory on metadata
> >>>>>> for KVM memory slots (> 2 GiB!) and accompanied bitmaps. Although some
> >>>>>> optimizations in KVM are being worked on to reduce this metadata overhead
> >>>>>> on x86-64 in some cases, it remains a problem with nested VMs and there are
> >>>>>> other reasons why we would want to reduce the total memory slot to a
> >>>>>> reasonable minimum.
> >>>>>>
> >>>>>> We want to:
> >>>>>> a) Reduce the metadata overhead, including bitmap sizes inside KVM but also
> >>>>>>    inside QEMU KVM code where possible.
> >>>>>> b) Not always expose all device-memory to the VM, to reduce the attack
> >>>>>>    surface of malicious VMs without using userfaultfd.
> >>>>>
> >>>>> I'm confused by the mention of these security considerations,
> >>>>> and I expect users will be just as confused.
> >>>>
> >>>> Malicious VMs wanting to consume more memory than desired is only
> >>>> relevant when running untrusted VMs in some environments, and it can be
> >>>> caught differently, for example, by carefully monitoring and limiting
> >>>> the maximum memory consumption of a VM. We have the same issue already
> >>>> when using virtio-balloon to logically unplug memory. For me, it's a
> >>>> secondary concern ( optimizing a is much more important ).
> >>>>
> >>>> Some users showed interest in having QEMU disallow access to unplugged
> >>>> memory, because coming up with a maximum memory consumption for a VM is
> >>>> hard. This is one step into that direction without having to run with
> >>>> uffd enabled all of the time.
> >>>
> >>> Sorry about missing the memo - is there a lot of overhead associated
> >>> with uffd then?
> >>
> >> When used with huge/gigantic pages, we don't particularly care.
> >>
> >> For other memory backends, we'll have to route any population via the
> >> uffd handler: guest accesses a 4k page -> place a 4k page from user
> >> space. Instead of the kernel automatically placing a THP, we'd be
> >> placing single 4k pages and have to hope the kernel will collapse them
> >> into a THP later.
> > 
> > How much value there is in a THP given it's not present?
> 
> If you don't place a THP right during the first page fault inside the
> THP region, you'll have to rely on khugepagd to eventually place a huge
> page later -- and manually fault in each and every 4k page. I haven't
> done any performance measurements so far. Going via userspace on every
> 4k fault will most certainly hurt performance when first touching memory.

So, if the focus is performance improvement, maybe show the speedup?


> -- 
> Thanks,
> 
> David / dhildenb
David Hildenbrand Nov. 7, 2021, 9:21 a.m. UTC | #8
On 07.11.21 09:14, Michael S. Tsirkin wrote:
> On Tue, Nov 02, 2021 at 06:10:13PM +0100, David Hildenbrand wrote:
>> On 02.11.21 18:06, Michael S. Tsirkin wrote:
>>> On Tue, Nov 02, 2021 at 12:55:17PM +0100, David Hildenbrand wrote:
>>>> On 02.11.21 12:35, Michael S. Tsirkin wrote:
>>>>> On Tue, Nov 02, 2021 at 09:33:55AM +0100, David Hildenbrand wrote:
>>>>>> On 01.11.21 23:15, Michael S. Tsirkin wrote:
>>>>>>> On Wed, Oct 27, 2021 at 02:45:19PM +0200, David Hildenbrand wrote:
>>>>>>>> This is the follow-up of [1], dropping auto-detection and vhost-user
>>>>>>>> changes from the initial RFC.
>>>>>>>>
>>>>>>>> Based-on: 20211011175346.15499-1-david@redhat.com
>>>>>>>>
>>>>>>>> A virtio-mem device is represented by a single large RAM memory region
>>>>>>>> backed by a single large mmap.
>>>>>>>>
>>>>>>>> Right now, we map that complete memory region into guest physical addres
>>>>>>>> space, resulting in a very large memory mapping, KVM memory slot, ...
>>>>>>>> although only a small amount of memory might actually be exposed to the VM.
>>>>>>>>
>>>>>>>> For example, when starting a VM with a 1 TiB virtio-mem device that only
>>>>>>>> exposes little device memory (e.g., 1 GiB) towards the VM initialliy,
>>>>>>>> in order to hotplug more memory later, we waste a lot of memory on metadata
>>>>>>>> for KVM memory slots (> 2 GiB!) and accompanied bitmaps. Although some
>>>>>>>> optimizations in KVM are being worked on to reduce this metadata overhead
>>>>>>>> on x86-64 in some cases, it remains a problem with nested VMs and there are
>>>>>>>> other reasons why we would want to reduce the total memory slot to a
>>>>>>>> reasonable minimum.
>>>>>>>>
>>>>>>>> We want to:
>>>>>>>> a) Reduce the metadata overhead, including bitmap sizes inside KVM but also
>>>>>>>>    inside QEMU KVM code where possible.
>>>>>>>> b) Not always expose all device-memory to the VM, to reduce the attack
>>>>>>>>    surface of malicious VMs without using userfaultfd.
>>>>>>>
>>>>>>> I'm confused by the mention of these security considerations,
>>>>>>> and I expect users will be just as confused.
>>>>>>
>>>>>> Malicious VMs wanting to consume more memory than desired is only
>>>>>> relevant when running untrusted VMs in some environments, and it can be
>>>>>> caught differently, for example, by carefully monitoring and limiting
>>>>>> the maximum memory consumption of a VM. We have the same issue already
>>>>>> when using virtio-balloon to logically unplug memory. For me, it's a
>>>>>> secondary concern ( optimizing a is much more important ).
>>>>>>
>>>>>> Some users showed interest in having QEMU disallow access to unplugged
>>>>>> memory, because coming up with a maximum memory consumption for a VM is
>>>>>> hard. This is one step into that direction without having to run with
>>>>>> uffd enabled all of the time.
>>>>>
>>>>> Sorry about missing the memo - is there a lot of overhead associated
>>>>> with uffd then?
>>>>
>>>> When used with huge/gigantic pages, we don't particularly care.
>>>>
>>>> For other memory backends, we'll have to route any population via the
>>>> uffd handler: guest accesses a 4k page -> place a 4k page from user
>>>> space. Instead of the kernel automatically placing a THP, we'd be
>>>> placing single 4k pages and have to hope the kernel will collapse them
>>>> into a THP later.
>>>
>>> How much value there is in a THP given it's not present?
>>
>> If you don't place a THP right during the first page fault inside the
>> THP region, you'll have to rely on khugepagd to eventually place a huge
>> page later -- and manually fault in each and every 4k page. I haven't
>> done any performance measurements so far. Going via userspace on every
>> 4k fault will most certainly hurt performance when first touching memory.
> 
> So, if the focus is performance improvement, maybe show the speedup?

Let's not focus on b); a) is the primary goal of this series:

"
a) Reduce the metadata overhead, including bitmap sizes inside KVM but
also inside QEMU KVM code where possible.
"

Because:

"
For example, when starting a VM with a 1 TiB virtio-mem device that only
exposes little device memory (e.g., 1 GiB) towards the VM initialliy,
in order to hotplug more memory later, we waste a lot of memory on
metadata for KVM memory slots (> 2 GiB!) and accompanied bitmaps.
"

Partially tackling b) is just a nice side effect of this series. In the
long term, we'll want userfaultfd-based protection, and I'll do a
performance evaluation then of how userfaultfd vs. !userfaultfd compares
(boot time, run time, THP consumption).

I'll adjust the cover letter for the next version to make this clearer.
Michael S. Tsirkin Nov. 7, 2021, 10:21 a.m. UTC | #9
On Sun, Nov 07, 2021 at 10:21:33AM +0100, David Hildenbrand wrote:
> Let's not focus on b), a) is the primary goal of this series:
> 
> "
> a) Reduce the metadata overhead, including bitmap sizes inside KVM but
> also inside QEMU KVM code where possible.
> "
> 
> Because:
> 
> "
> For example, when starting a VM with a 1 TiB virtio-mem device that only
> exposes little device memory (e.g., 1 GiB) towards the VM initialliy,
> in order to hotplug more memory later, we waste a lot of memory on
> metadata for KVM memory slots (> 2 GiB!) and accompanied bitmaps.
> "
> 
> Partially tackling b) is just a nice side effect of this series. In the
> long term, we'll want userfaultfd-based protection, and I'll do a
> performance evaluation then, how userfaultf vs. !userfaultfd compares
> (boot time, run time, THP consumption).
> 
> I'll adjust the cover letter for the next version to make this clearer.

So given this is short-term, and long-term we'll use uffd, possibly with
some extension (a syscall to populate 1G in one go?), isn't there some
way to hide this from management? It's a one-way street: once we get
management involved in playing with memory slots, we can no longer go
back and control them ourselves. Not to mention it's a lot of
complexity to push out to management.
David Hildenbrand Nov. 7, 2021, 10:53 a.m. UTC | #10
On 07.11.21 11:21, Michael S. Tsirkin wrote:
> On Sun, Nov 07, 2021 at 10:21:33AM +0100, David Hildenbrand wrote:
>> Let's not focus on b), a) is the primary goal of this series:
>>
>> "
>> a) Reduce the metadata overhead, including bitmap sizes inside KVM but
>> also inside QEMU KVM code where possible.
>> "
>>
>> Because:
>>
>> "
>> For example, when starting a VM with a 1 TiB virtio-mem device that only
>> exposes little device memory (e.g., 1 GiB) towards the VM initialliy,
>> in order to hotplug more memory later, we waste a lot of memory on
>> metadata for KVM memory slots (> 2 GiB!) and accompanied bitmaps.
>> "
>>
>> Partially tackling b) is just a nice side effect of this series. In the
>> long term, we'll want userfaultfd-based protection, and I'll do a
>> performance evaluation then, how userfaultf vs. !userfaultfd compares
>> (boot time, run time, THP consumption).
>>
>> I'll adjust the cover letter for the next version to make this clearer.
> 
> So given this is short-term, and long term we'll use uffd possibly with
> some extension (a syscall to populate 1G in one go?) isn't there some
> way to hide this from management? It's a one way street: once we get
> management involved in playing with memory slots we no longer can go
> back and control them ourselves. Not to mention it's a lot of
> complexity to push out to management.

For b), userfaultfd + optimizations is the way to go long term.
For a), userfaultfd does not help in any way, and that's what I currently
care about most.

1) For the management layer it will be as simple as providing a
"memslots" parameter to the user. I don't expect management to do manual
memslot detection+calculation -- the management layer is the wrong place
because it has limited insight. Either QEMU will do it automatically or
the user will do it manually. For QEMU to do it reliably, we'll have to
teach the management layer to specify any vhost* devices before
virtio-mem* devices on the QEMU cmdline -- that is the only real
complexity I see.

2) "Control them ourselves" will essentially be enabled via "memslots=0"
(auto-detect mode). The user has to opt in.

"memslots" is a pure optimization mechanism. While I'd love to hide this
complexity from user space and always use the auto-detect mode, hotplug
of vhost devices in particular is a real problem and requires users
to opt in.

I assume once we have "memslots=0" (auto-detect) mode, most people will:
* Set "memslots=0" to enable the optimization and essentially let QEMU
  control it. Will work in most cases and we can document perfectly
  where it won't. We'll always fail gracefully.
* Leave "memslots=1" if they don't care about the optimization or run a
  problematic setup.
* Set "memslots=X" if they run a problematic setup and still care about
  the optimization.


To be precise, we could have a "memslots-optimization=true|false" toggle
instead. IMHO that could be limiting for these corner-case setups where
auto-detection is problematic and users still want to optimize --
especially when eventually hotplugging vhost devices. But as I assume
99.9999% of all setups will enable auto-detect mode, I don't have a
strong opinion.