[RFC,0/5] KVM: x86: KVM_MEM_ALLONES memory

Message ID 20200514180540.52407-1-vkuznets@redhat.com (mailing list archive)

Message

Vitaly Kuznetsov May 14, 2020, 6:05 p.m. UTC
The idea of the patchset was suggested by Michael S. Tsirkin.

PCIe config space can (depending on the configuration) be quite big but
is usually sparsely populated. A guest may scan it by accessing each
individual device's page which, when the device is missing, is supposed to
have 'PCI hole' semantics: reads return '0xff' and writes get discarded.
Currently, userspace has to allocate real memory for these holes and fill
them with '0xff'. Moreover, each VM usually requires its own copy of this
memory.

The idea behind the feature introduced by this patchset is: let's have a
single read-only page filled with '0xff' in KVM and map it to all such
PCI holes in all VMs. This will free userspace from the obligation to
allocate real memory and also allow us to speed up access to these holes,
as we can aggressively map the whole slot upon the first fault.
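
For illustration, a VMM would then be expected to register such a slot
roughly like this (a sketch only - the exact UAPI, including the flag's
value and whether userspace_addr is ignored for such slots, is defined by
PATCH2):

#include <linux/kvm.h>
#include <sys/ioctl.h>

#ifndef KVM_MEM_ALLONES
#define KVM_MEM_ALLONES	(1UL << 2)	/* assumed value, see PATCH2 */
#endif

/* Register [gpa, gpa + size) as a 'PCI hole': guest reads return 0xff,
 * guest writes are discarded, and no real memory is allocated. */
static int set_allones_region(int vm_fd, __u32 slot, __u64 gpa, __u64 size)
{
	struct kvm_userspace_memory_region region = {
		.slot            = slot,
		.flags           = KVM_MEM_ALLONES,
		.guest_phys_addr = gpa,
		.memory_size     = size,
		.userspace_addr  = 0,	/* assumed ignored for such slots */
	};

	return ioctl(vm_fd, KVM_SET_USER_MEMORY_REGION, &region);
}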

RFC. I've only tested the feature with the selftest (PATCH5) on Intel/AMD
with and without EPT/NPT. I haven't tested memslot modifications yet.

Patches are against kvm/next.

Vitaly Kuznetsov (5):
  KVM: rename labels in kvm_init()
  KVM: x86: introduce KVM_MEM_ALLONES memory
  KVM: x86: move kvm_vcpu_gfn_to_memslot() out of try_async_pf()
  KVM: x86: aggressively map PTEs in KVM_MEM_ALLONES slots
  KVM: selftests: add KVM_MEM_ALLONES test

 Documentation/virt/kvm/api.rst                |  22 ++--
 arch/x86/include/uapi/asm/kvm.h               |   1 +
 arch/x86/kvm/mmu/mmu.c                        |  34 ++++--
 arch/x86/kvm/mmu/paging_tmpl.h                |  30 ++++-
 arch/x86/kvm/x86.c                            |   9 +-
 include/linux/kvm_host.h                      |  15 ++-
 include/uapi/linux/kvm.h                      |   2 +
 tools/testing/selftests/kvm/Makefile          |   1 +
 .../testing/selftests/kvm/include/kvm_util.h  |   1 +
 tools/testing/selftests/kvm/lib/kvm_util.c    |  81 +++++++------
 .../kvm/x86_64/memory_region_allones.c        | 112 ++++++++++++++++++
 virt/kvm/kvm_main.c                           | 110 +++++++++++++----
 12 files changed, 342 insertions(+), 76 deletions(-)
 create mode 100644 tools/testing/selftests/kvm/x86_64/memory_region_allones.c

Comments

Peter Xu May 14, 2020, 10:05 p.m. UTC | #1
On Thu, May 14, 2020 at 08:05:35PM +0200, Vitaly Kuznetsov wrote:
> The idea of the patchset was suggested by Michael S. Tsirkin.
> 
> PCIe config space can (depending on the configuration) be quite big but
> usually is sparsely populated. Guest may scan it by accessing individual
> device's page which, when device is missing, is supposed to have 'pci
> holes' semantics: reads return '0xff' and writes get discarded. Currently,
> userspace has to allocate real memory for these holes and fill them with
> '0xff'. Moreover, different VMs usually require different memory.
> 
> The idea behind the feature introduced by this patch is: let's have a
> single read-only page filled with '0xff' in KVM and map it to all such
> PCI holes in all VMs. This will free userspace of obligation to allocate
> real memory and also allow us to speed up access to these holes as we
> can aggressively map the whole slot upon first fault.
> 
> RFC. I've only tested the feature with the selftest (PATCH5) on Intel/AMD
> with and without EPT/NPT. I haven't tested memslot modifications yet.
> 
> Patches are against kvm/next.

Hi, Vitaly,

Could this be done in userspace with existing techniques?

E.g., shm_open() with a handle and fill one 0xff page, then remap it to
anywhere needed in QEMU?
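
For illustration, roughly something like this (the shm name and the lack of
error handling are just placeholders):

#include <fcntl.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/* Create a single shared page filled with 0xff. */
static int ones_fd_create(void)
{
	int fd = shm_open("/qemu-pci-hole-ones", O_CREAT | O_RDWR, 0600);
	void *p;

	ftruncate(fd, 4096);
	p = mmap(NULL, 4096, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
	memset(p, 0xff, 4096);
	munmap(p, 4096);
	return fd;
}

/* Map the template page read-only at one hole page in the HVA space.
 * Every mapping reuses file offset 0, so each 4k hole page ends up with
 * its own VMA. */
static void *map_hole_page(int ones_fd, void *hva)
{
	return mmap(hva, 4096, PROT_READ, MAP_SHARED | MAP_FIXED, ones_fd, 0);
}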

Thanks,
Sean Christopherson May 14, 2020, 10:56 p.m. UTC | #2
On Thu, May 14, 2020 at 06:05:16PM -0400, Peter Xu wrote:
> On Thu, May 14, 2020 at 08:05:35PM +0200, Vitaly Kuznetsov wrote:
> > The idea of the patchset was suggested by Michael S. Tsirkin.
> > 
> > PCIe config space can (depending on the configuration) be quite big but
> > usually is sparsely populated. Guest may scan it by accessing individual
> > device's page which, when device is missing, is supposed to have 'pci
> > holes' semantics: reads return '0xff' and writes get discarded. Currently,
> > userspace has to allocate real memory for these holes and fill them with
> > '0xff'. Moreover, different VMs usually require different memory.
> > 
> > The idea behind the feature introduced by this patch is: let's have a
> > single read-only page filled with '0xff' in KVM and map it to all such
> > PCI holes in all VMs. This will free userspace of obligation to allocate
> > real memory and also allow us to speed up access to these holes as we
> > can aggressively map the whole slot upon first fault.
> > 
> > RFC. I've only tested the feature with the selftest (PATCH5) on Intel/AMD
> > with and without EPT/NPT. I haven't tested memslot modifications yet.
> > 
> > Patches are against kvm/next.
> 
> Hi, Vitaly,
> 
> Could this be done in userspace with existing techniques?
> 
> E.g., shm_open() with a handle and fill one 0xff page, then remap it to
> anywhere needed in QEMU?

Mapping that 4k page over and over is going to get expensive, e.g. each
duplicate will need a VMA and a memslot, plus any PTE overhead.  If the
total sum of the holes is >2mb it'll even overflow the number of allowed
memslots.
Peter Xu May 14, 2020, 11:22 p.m. UTC | #3
On Thu, May 14, 2020 at 03:56:24PM -0700, Sean Christopherson wrote:
> On Thu, May 14, 2020 at 06:05:16PM -0400, Peter Xu wrote:
> > On Thu, May 14, 2020 at 08:05:35PM +0200, Vitaly Kuznetsov wrote:
> > > The idea of the patchset was suggested by Michael S. Tsirkin.
> > > 
> > > PCIe config space can (depending on the configuration) be quite big but
> > > usually is sparsely populated. Guest may scan it by accessing individual
> > > device's page which, when device is missing, is supposed to have 'pci
> > > holes' semantics: reads return '0xff' and writes get discarded. Currently,
> > > userspace has to allocate real memory for these holes and fill them with
> > > '0xff'. Moreover, different VMs usually require different memory.
> > > 
> > > The idea behind the feature introduced by this patch is: let's have a
> > > single read-only page filled with '0xff' in KVM and map it to all such
> > > PCI holes in all VMs. This will free userspace of obligation to allocate
> > > real memory and also allow us to speed up access to these holes as we
> > > can aggressively map the whole slot upon first fault.
> > > 
> > > RFC. I've only tested the feature with the selftest (PATCH5) on Intel/AMD
> > > with and without EPT/NPT. I haven't tested memslot modifications yet.
> > > 
> > > Patches are against kvm/next.
> > 
> > Hi, Vitaly,
> > 
> > Could this be done in userspace with existing techniques?
> > 
> > E.g., shm_open() with a handle and fill one 0xff page, then remap it to
> > anywhere needed in QEMU?
> 
> Mapping that 4k page over and over is going to get expensive, e.g. each
> duplicate will need a VMA and a memslot, plus any PTE overhead.  If the
> total sum of the holes is >2mb it'll even overflow the number of allowed
> memslots.

What's the PTE overhead you mentioned?  We need to fill PTEs one by one on
fault even if the page is allocated in the kernel, am I right?

4K is only an example - we can also use more pages as the template.  However I
guess the kvm memslot count could be a limit..  Could I ask what's the normal
size of this 0xff region, and its distribution?

Thanks,
Sean Christopherson May 14, 2020, 11:32 p.m. UTC | #4
On Thu, May 14, 2020 at 07:22:50PM -0400, Peter Xu wrote:
> On Thu, May 14, 2020 at 03:56:24PM -0700, Sean Christopherson wrote:
> > On Thu, May 14, 2020 at 06:05:16PM -0400, Peter Xu wrote:
> > > E.g., shm_open() with a handle and fill one 0xff page, then remap it to
> > > anywhere needed in QEMU?
> > 
> > Mapping that 4k page over and over is going to get expensive, e.g. each
> > duplicate will need a VMA and a memslot, plus any PTE overhead.  If the
> > total sum of the holes is >2mb it'll even overflow the number of allowed
> > memslots.
> 
> What's the PTE overhead you mentioned?  We need to fill PTEs one by one on
> fault even if the page is allocated in the kernel, am I right?

It won't require host PTEs for every page if it's a kernel page.  I doubt
PTEs are a significant overhead, especially compared to memslots, but it's
still worth considering.

My thought was to skimp on both host PTEs _and_ KVM SPTEs by always sending
the PCI hole accesses down the slow MMIO path[*].

[*] https://lkml.kernel.org/r/20200514194624.GB15847@linux.intel.com
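
I.e., leave the holes unbacked so that accesses exit with KVM_EXIT_MMIO and
complete them in the VMM - a rough sketch, where gpa_is_pci_hole() stands in
for the VMM's own bookkeeping:

#include <linux/kvm.h>
#include <stdbool.h>
#include <string.h>

bool gpa_is_pci_hole(__u64 gpa);	/* hypothetical VMM helper */

static void handle_exit(struct kvm_run *run)
{
	if (run->exit_reason != KVM_EXIT_MMIO)
		return;		/* ... other exit reasons ... */

	if (gpa_is_pci_hole(run->mmio.phys_addr)) {
		/* Reads of a hole return all ones, writes are simply dropped. */
		if (!run->mmio.is_write)
			memset(run->mmio.data, 0xff, run->mmio.len);
		return;
	}

	/* ... dispatch to emulated MMIO devices ... */
}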

> 4K is only an example - we can also use more pages as the template.  However I
> guess the kvm memslot count could be a limit..  Could I ask what's the normal
> size of this 0xff region, and its distribution?
> 
> Thanks,
> 
Andy Lutomirski May 15, 2020, 1:03 a.m. UTC | #5
On Thu, May 14, 2020 at 3:56 PM Sean Christopherson
<sean.j.christopherson@intel.com> wrote:
>
> On Thu, May 14, 2020 at 06:05:16PM -0400, Peter Xu wrote:
> > On Thu, May 14, 2020 at 08:05:35PM +0200, Vitaly Kuznetsov wrote:
> > > The idea of the patchset was suggested by Michael S. Tsirkin.
> > >
> > > PCIe config space can (depending on the configuration) be quite big but
> > > usually is sparsely populated. Guest may scan it by accessing individual
> > > device's page which, when device is missing, is supposed to have 'pci
> > > holes' semantics: reads return '0xff' and writes get discarded. Currently,
> > > userspace has to allocate real memory for these holes and fill them with
> > > '0xff'. Moreover, different VMs usually require different memory.
> > >
> > > The idea behind the feature introduced by this patch is: let's have a
> > > single read-only page filled with '0xff' in KVM and map it to all such
> > > PCI holes in all VMs. This will free userspace of obligation to allocate
> > > real memory and also allow us to speed up access to these holes as we
> > > can aggressively map the whole slot upon first fault.
> > >
> > > RFC. I've only tested the feature with the selftest (PATCH5) on Intel/AMD
> > > with and without EPT/NPT. I haven't tested memslot modifications yet.
> > >
> > > Patches are against kvm/next.
> >
> > Hi, Vitaly,
> >
> > Could this be done in userspace with existing techniques?
> >
> > E.g., shm_open() with a handle and fill one 0xff page, then remap it to
> > anywhere needed in QEMU?
>
> Mapping that 4k page over and over is going to get expensive, e.g. each
> duplicate will need a VMA and a memslot, plus any PTE overhead.  If the
> total sum of the holes is >2mb it'll even overflow the number of allowed
> memslots.

How about a tiny character device driver /dev/ones?
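
Purely as a sketch of the idea (not part of this series): a minimal
miscdevice backed by one 0xff page, which every read-only mmap() of the
device shares:

#include <linux/module.h>
#include <linux/miscdevice.h>
#include <linux/fs.h>
#include <linux/mm.h>
#include <linux/string.h>

static struct page *ones_page;

static vm_fault_t ones_fault(struct vm_fault *vmf)
{
	/* Every faulting address in every mapping resolves to the same page. */
	get_page(ones_page);
	vmf->page = ones_page;
	return 0;
}

static const struct vm_operations_struct ones_vm_ops = {
	.fault = ones_fault,
};

static int ones_mmap(struct file *file, struct vm_area_struct *vma)
{
	/* Simplified: a real driver would also clear VM_MAYWRITE. */
	if (vma->vm_flags & VM_WRITE)
		return -EPERM;
	vma->vm_ops = &ones_vm_ops;
	return 0;
}

static const struct file_operations ones_fops = {
	.owner = THIS_MODULE,
	.mmap  = ones_mmap,
};

static struct miscdevice ones_dev = {
	.minor = MISC_DYNAMIC_MINOR,
	.name  = "ones",
	.fops  = &ones_fops,
};

static int __init ones_init(void)
{
	ones_page = alloc_page(GFP_KERNEL);
	if (!ones_page)
		return -ENOMEM;
	memset(page_address(ones_page), 0xff, PAGE_SIZE);
	return misc_register(&ones_dev);
}

static void __exit ones_exit(void)
{
	misc_deregister(&ones_dev);
	__free_page(ones_page);
}

module_init(ones_init);
module_exit(ones_exit);
MODULE_LICENSE("GPL");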
Vitaly Kuznetsov May 15, 2020, 8:42 a.m. UTC | #6
Sean Christopherson <sean.j.christopherson@intel.com> writes:

> On Thu, May 14, 2020 at 07:22:50PM -0400, Peter Xu wrote:
>> On Thu, May 14, 2020 at 03:56:24PM -0700, Sean Christopherson wrote:
>> > On Thu, May 14, 2020 at 06:05:16PM -0400, Peter Xu wrote:
>> > > E.g., shm_open() with a handle and fill one 0xff page, then remap it to
>> > > anywhere needed in QEMU?
>> > 
>> > Mapping that 4k page over and over is going to get expensive, e.g. each
>> > duplicate will need a VMA and a memslot, plus any PTE overhead.  If the
>> > total sum of the holes is >2mb it'll even overflow the number of allowed
>> > memslots.
>> 
>> What's the PTE overhead you mentioned?  We need to fill PTEs one by one on
>> fault even if the page is allocated in the kernel, am I right?
>
> It won't require host PTEs for every page if it's a kernel page.  I doubt
> PTEs are a significant overhead, especially compared to memslots, but it's
> still worth considering.
>
> My thought was to skimp on both host PTEs _and_ KVM SPTEs by always sending
> the PCI hole accesses down the slow MMIO path[*].
>
> [*] https://lkml.kernel.org/r/20200514194624.GB15847@linux.intel.com
>

If we drop the 'aggressive' patch from this patchset we can probably get
away with KVM_MEM_READONLY and userspace VMAs, but this will only help us
save some memory; it won't speed things up.
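
Roughly, that fallback would be: back every hole with a shared userspace
mapping of a 0xff page and register it with the existing KVM_MEM_READONLY
flag, so reads are served from that page while guest writes exit to
userspace as MMIO and get dropped there. A sketch:

#include <linux/kvm.h>
#include <sys/ioctl.h>

static int set_readonly_hole(int vm_fd, __u32 slot, __u64 gpa, __u64 size,
			     void *ones_hva)	/* shared 0xff-filled mapping */
{
	struct kvm_userspace_memory_region region = {
		.slot            = slot,
		.flags           = KVM_MEM_READONLY,
		.guest_phys_addr = gpa,
		.memory_size     = size,
		.userspace_addr  = (__u64)(unsigned long)ones_hva,
	};

	return ioctl(vm_fd, KVM_SET_USER_MEMORY_REGION, &region);
}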

>> 4K is only an example - we can also use more pages as the template.  However I
>> guess the kvm memslot count could be a limit..  Could I ask what's the normal
>> size of this 0xff region, and its distribution?

Julia/Michael, could you please provide some 'normal' configuration for
a Q35 machine and its PCIe config space?
Peter Xu May 15, 2020, 11:15 a.m. UTC | #7
On Thu, May 14, 2020 at 06:03:20PM -0700, Andy Lutomirski wrote:
> On Thu, May 14, 2020 at 3:56 PM Sean Christopherson
> <sean.j.christopherson@intel.com> wrote:
> >
> > On Thu, May 14, 2020 at 06:05:16PM -0400, Peter Xu wrote:
> > > On Thu, May 14, 2020 at 08:05:35PM +0200, Vitaly Kuznetsov wrote:
> > > > The idea of the patchset was suggested by Michael S. Tsirkin.
> > > >
> > > > PCIe config space can (depending on the configuration) be quite big but
> > > > usually is sparsely populated. Guest may scan it by accessing individual
> > > > device's page which, when device is missing, is supposed to have 'pci
> > > > holes' semantics: reads return '0xff' and writes get discarded. Currently,
> > > > userspace has to allocate real memory for these holes and fill them with
> > > > '0xff'. Moreover, different VMs usually require different memory.
> > > >
> > > > The idea behind the feature introduced by this patch is: let's have a
> > > > single read-only page filled with '0xff' in KVM and map it to all such
> > > > PCI holes in all VMs. This will free userspace of obligation to allocate
> > > > real memory and also allow us to speed up access to these holes as we
> > > > can aggressively map the whole slot upon first fault.
> > > >
> > > > RFC. I've only tested the feature with the selftest (PATCH5) on Intel/AMD
> > > > with and without EPT/NPT. I haven't tested memslot modifications yet.
> > > >
> > > > Patches are against kvm/next.
> > >
> > > Hi, Vitaly,
> > >
> > > Could this be done in userspace with existing techniques?
> > >
> > > E.g., shm_open() with a handle and fill one 0xff page, then remap it to
> > > anywhere needed in QEMU?
> >
> > Mapping that 4k page over and over is going to get expensive, e.g. each
> > duplicate will need a VMA and a memslot, plus any PTE overhead.  If the
> > total sum of the holes is >2mb it'll even overflow the number of allowed
> > memslots.
> 
> How about a tiny character device driver /dev/ones?

Yeah, this looks very clean.

Or, I also like Sean's idea of using the slow path - I think the answer could
depend on a better understanding of the problem to solve (PCI scan during
small VM boots), to first justify that the fast path is required at all.
E.g., could we even work around that inefficient reading of 0xff's for our
use case?  After all, what the BIOS really needs is not those 0xff's, but
some other facts.

Thanks!