Message ID: 20200514180540.52407-1-vkuznets@redhat.com (mailing list archive)
Series: KVM: x86: KVM_MEM_ALLONES memory
On Thu, May 14, 2020 at 08:05:35PM +0200, Vitaly Kuznetsov wrote:
> The idea of the patchset was suggested by Michael S. Tsirkin.
>
> PCIe config space can (depending on the configuration) be quite big but
> usually is sparsely populated. Guest may scan it by accessing individual
> device's page which, when device is missing, is supposed to have 'pci
> holes' semantics: reads return '0xff' and writes get discarded. Currently,
> userspace has to allocate real memory for these holes and fill them with
> '0xff'. Moreover, different VMs usually require different memory.
>
> The idea behind the feature introduced by this patch is: let's have a
> single read-only page filled with '0xff' in KVM and map it to all such
> PCI holes in all VMs. This will free userspace of the obligation to allocate
> real memory and also allow us to speed up access to these holes as we
> can aggressively map the whole slot upon first fault.
>
> RFC. I've only tested the feature with the selftest (PATCH5) on Intel/AMD
> with and without EPT/NPT. I haven't tested memslot modifications yet.
>
> Patches are against kvm/next.

Hi, Vitaly,

Could this be done in userspace with existing techniques?

E.g., shm_open() with a handle and fill one 0xff page, then remap it to
anywhere needed in QEMU?

Thanks,
On Thu, May 14, 2020 at 06:05:16PM -0400, Peter Xu wrote:
> On Thu, May 14, 2020 at 08:05:35PM +0200, Vitaly Kuznetsov wrote:
> > The idea of the patchset was suggested by Michael S. Tsirkin.
> >
> > PCIe config space can (depending on the configuration) be quite big but
> > usually is sparsely populated. Guest may scan it by accessing individual
> > device's page which, when device is missing, is supposed to have 'pci
> > holes' semantics: reads return '0xff' and writes get discarded. Currently,
> > userspace has to allocate real memory for these holes and fill them with
> > '0xff'. Moreover, different VMs usually require different memory.
> >
> > The idea behind the feature introduced by this patch is: let's have a
> > single read-only page filled with '0xff' in KVM and map it to all such
> > PCI holes in all VMs. This will free userspace of the obligation to allocate
> > real memory and also allow us to speed up access to these holes as we
> > can aggressively map the whole slot upon first fault.
> >
> > RFC. I've only tested the feature with the selftest (PATCH5) on Intel/AMD
> > with and without EPT/NPT. I haven't tested memslot modifications yet.
> >
> > Patches are against kvm/next.
>
> Hi, Vitaly,
>
> Could this be done in userspace with existing techniques?
>
> E.g., shm_open() with a handle and fill one 0xff page, then remap it to
> anywhere needed in QEMU?

Mapping that 4k page over and over is going to get expensive, e.g. each
duplicate will need a VMA and a memslot, plus any PTE overhead. If the
total sum of the holes is >2MB it'll even overflow the number of allowed
memslots.
On Thu, May 14, 2020 at 03:56:24PM -0700, Sean Christopherson wrote:
> On Thu, May 14, 2020 at 06:05:16PM -0400, Peter Xu wrote:
> > On Thu, May 14, 2020 at 08:05:35PM +0200, Vitaly Kuznetsov wrote:
> > > The idea of the patchset was suggested by Michael S. Tsirkin.
> > >
> > > PCIe config space can (depending on the configuration) be quite big but
> > > usually is sparsely populated. Guest may scan it by accessing individual
> > > device's page which, when device is missing, is supposed to have 'pci
> > > holes' semantics: reads return '0xff' and writes get discarded. Currently,
> > > userspace has to allocate real memory for these holes and fill them with
> > > '0xff'. Moreover, different VMs usually require different memory.
> > >
> > > The idea behind the feature introduced by this patch is: let's have a
> > > single read-only page filled with '0xff' in KVM and map it to all such
> > > PCI holes in all VMs. This will free userspace of the obligation to allocate
> > > real memory and also allow us to speed up access to these holes as we
> > > can aggressively map the whole slot upon first fault.
> > >
> > > RFC. I've only tested the feature with the selftest (PATCH5) on Intel/AMD
> > > with and without EPT/NPT. I haven't tested memslot modifications yet.
> > >
> > > Patches are against kvm/next.
> >
> > Hi, Vitaly,
> >
> > Could this be done in userspace with existing techniques?
> >
> > E.g., shm_open() with a handle and fill one 0xff page, then remap it to
> > anywhere needed in QEMU?
>
> Mapping that 4k page over and over is going to get expensive, e.g. each
> duplicate will need a VMA and a memslot, plus any PTE overhead. If the
> total sum of the holes is >2MB it'll even overflow the number of allowed
> memslots.

What's the PTE overhead you mentioned? We need to fill PTEs one by one on
fault even if the page is allocated in the kernel, am I right?

4K is only an example - we can also use more pages as the template. However I
guess the kvm memslot count could be a limit.. Could I ask what's the normal
size of this 0xff region, and its distribution?

Thanks,
On Thu, May 14, 2020 at 07:22:50PM -0400, Peter Xu wrote:
> On Thu, May 14, 2020 at 03:56:24PM -0700, Sean Christopherson wrote:
> > On Thu, May 14, 2020 at 06:05:16PM -0400, Peter Xu wrote:
> > > E.g., shm_open() with a handle and fill one 0xff page, then remap it to
> > > anywhere needed in QEMU?
> >
> > Mapping that 4k page over and over is going to get expensive, e.g. each
> > duplicate will need a VMA and a memslot, plus any PTE overhead. If the
> > total sum of the holes is >2MB it'll even overflow the number of allowed
> > memslots.
>
> What's the PTE overhead you mentioned? We need to fill PTEs one by one on
> fault even if the page is allocated in the kernel, am I right?

It won't require host PTEs for every page if it's a kernel page. I doubt
PTEs are a significant overhead, especially compared to memslots, but it's
still worth considering.

My thought was to skimp on both host PTEs _and_ KVM SPTEs by always sending
the PCI hole accesses down the slow MMIO path[*].

[*] https://lkml.kernel.org/r/20200514194624.GB15847@linux.intel.com

> 4K is only an example - we can also use more pages as the template. However I
> guess the kvm memslot count could be a limit.. Could I ask what's the normal
> size of this 0xff region, and its distribution?
>
> Thanks,
>
> --
> Peter Xu
On Thu, May 14, 2020 at 3:56 PM Sean Christopherson
<sean.j.christopherson@intel.com> wrote:
>
> On Thu, May 14, 2020 at 06:05:16PM -0400, Peter Xu wrote:
> > On Thu, May 14, 2020 at 08:05:35PM +0200, Vitaly Kuznetsov wrote:
> > > The idea of the patchset was suggested by Michael S. Tsirkin.
> > >
> > > PCIe config space can (depending on the configuration) be quite big but
> > > usually is sparsely populated. Guest may scan it by accessing individual
> > > device's page which, when device is missing, is supposed to have 'pci
> > > holes' semantics: reads return '0xff' and writes get discarded. Currently,
> > > userspace has to allocate real memory for these holes and fill them with
> > > '0xff'. Moreover, different VMs usually require different memory.
> > >
> > > The idea behind the feature introduced by this patch is: let's have a
> > > single read-only page filled with '0xff' in KVM and map it to all such
> > > PCI holes in all VMs. This will free userspace of the obligation to allocate
> > > real memory and also allow us to speed up access to these holes as we
> > > can aggressively map the whole slot upon first fault.
> > >
> > > RFC. I've only tested the feature with the selftest (PATCH5) on Intel/AMD
> > > with and without EPT/NPT. I haven't tested memslot modifications yet.
> > >
> > > Patches are against kvm/next.
> >
> > Hi, Vitaly,
> >
> > Could this be done in userspace with existing techniques?
> >
> > E.g., shm_open() with a handle and fill one 0xff page, then remap it to
> > anywhere needed in QEMU?
>
> Mapping that 4k page over and over is going to get expensive, e.g. each
> duplicate will need a VMA and a memslot, plus any PTE overhead. If the
> total sum of the holes is >2MB it'll even overflow the number of allowed
> memslots.

How about a tiny character device driver /dev/ones?
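[Editor's sketch] No such driver was posted in this thread; an untested kernel-module sketch of what Andy's /dev/ones might look like (all names and details are illustrative): a misc device whose read-only mappings are all backed by one shared page of 0xff, so host memory cost stays at a single page regardless of how many holes are mapped.

```c
#include <linux/init.h>
#include <linux/module.h>
#include <linux/miscdevice.h>
#include <linux/mm.h>
#include <linux/fs.h>

static struct page *ones_page;

/* Every fault hands back the same 0xff-filled page. */
static vm_fault_t ones_fault(struct vm_fault *vmf)
{
	get_page(ones_page);
	vmf->page = ones_page;
	return 0;
}

static const struct vm_operations_struct ones_vm_ops = {
	.fault = ones_fault,
};

static int ones_mmap(struct file *file, struct vm_area_struct *vma)
{
	/* Writes must be refused here; discarding them is the VMM's job
	 * (e.g. via KVM_MEM_READONLY). */
	if (vma->vm_flags & VM_WRITE)
		return -EPERM;
	vma->vm_ops = &ones_vm_ops;
	return 0;
}

static const struct file_operations ones_fops = {
	.owner = THIS_MODULE,
	.mmap = ones_mmap,
};

static struct miscdevice ones_dev = {
	.minor = MISC_DYNAMIC_MINOR,
	.name = "ones",
	.fops = &ones_fops,
};

static int __init ones_init(void)
{
	ones_page = alloc_page(GFP_KERNEL);
	if (!ones_page)
		return -ENOMEM;
	memset(page_address(ones_page), 0xff, PAGE_SIZE);
	return misc_register(&ones_dev);
}

static void __exit ones_exit(void)
{
	misc_deregister(&ones_dev);
	__free_page(ones_page);
}

module_init(ones_init);
module_exit(ones_exit);
MODULE_LICENSE("GPL");
```

This saves the backing memory, but note it does not address Sean's VMA/memslot-count objection: userspace still needs one mapping per hole.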
Sean Christopherson <sean.j.christopherson@intel.com> writes:

> On Thu, May 14, 2020 at 07:22:50PM -0400, Peter Xu wrote:
>> On Thu, May 14, 2020 at 03:56:24PM -0700, Sean Christopherson wrote:
>> > On Thu, May 14, 2020 at 06:05:16PM -0400, Peter Xu wrote:
>> > > E.g., shm_open() with a handle and fill one 0xff page, then remap it to
>> > > anywhere needed in QEMU?
>> >
>> > Mapping that 4k page over and over is going to get expensive, e.g. each
>> > duplicate will need a VMA and a memslot, plus any PTE overhead. If the
>> > total sum of the holes is >2MB it'll even overflow the number of allowed
>> > memslots.
>>
>> What's the PTE overhead you mentioned? We need to fill PTEs one by one on
>> fault even if the page is allocated in the kernel, am I right?
>
> It won't require host PTEs for every page if it's a kernel page. I doubt
> PTEs are a significant overhead, especially compared to memslots, but it's
> still worth considering.
>
> My thought was to skimp on both host PTEs _and_ KVM SPTEs by always sending
> the PCI hole accesses down the slow MMIO path[*].
>
> [*] https://lkml.kernel.org/r/20200514194624.GB15847@linux.intel.com

If we drop the 'aggressive' patch from this patchset we can probably get away
with KVM_MEM_READONLY and userspace VMAs, but this will only help us to save
some memory, it won't speed things up.

>> 4K is only an example - we can also use more pages as the template. However I
>> guess the kvm memslot count could be a limit.. Could I ask what's the normal
>> size of this 0xff region, and its distribution?

Julia/Michael, could you please provide some 'normal' configuration for a Q35
machine and its PCIe config space?
On Thu, May 14, 2020 at 06:03:20PM -0700, Andy Lutomirski wrote:
> On Thu, May 14, 2020 at 3:56 PM Sean Christopherson
> <sean.j.christopherson@intel.com> wrote:
> >
> > On Thu, May 14, 2020 at 06:05:16PM -0400, Peter Xu wrote:
> > > On Thu, May 14, 2020 at 08:05:35PM +0200, Vitaly Kuznetsov wrote:
> > > > The idea of the patchset was suggested by Michael S. Tsirkin.
> > > >
> > > > PCIe config space can (depending on the configuration) be quite big but
> > > > usually is sparsely populated. Guest may scan it by accessing individual
> > > > device's page which, when device is missing, is supposed to have 'pci
> > > > holes' semantics: reads return '0xff' and writes get discarded. Currently,
> > > > userspace has to allocate real memory for these holes and fill them with
> > > > '0xff'. Moreover, different VMs usually require different memory.
> > > >
> > > > The idea behind the feature introduced by this patch is: let's have a
> > > > single read-only page filled with '0xff' in KVM and map it to all such
> > > > PCI holes in all VMs. This will free userspace of the obligation to allocate
> > > > real memory and also allow us to speed up access to these holes as we
> > > > can aggressively map the whole slot upon first fault.
> > > >
> > > > RFC. I've only tested the feature with the selftest (PATCH5) on Intel/AMD
> > > > with and without EPT/NPT. I haven't tested memslot modifications yet.
> > > >
> > > > Patches are against kvm/next.
> > >
> > > Hi, Vitaly,
> > >
> > > Could this be done in userspace with existing techniques?
> > >
> > > E.g., shm_open() with a handle and fill one 0xff page, then remap it to
> > > anywhere needed in QEMU?
> >
> > Mapping that 4k page over and over is going to get expensive, e.g. each
> > duplicate will need a VMA and a memslot, plus any PTE overhead. If the
> > total sum of the holes is >2MB it'll even overflow the number of allowed
> > memslots.
>
> How about a tiny character device driver /dev/ones?

Yeah, this looks very clean.

Or I also like Sean's idea about using the slow path - I think the answer could
depend on a better knowledge of the problem to solve (PCI scan for small VM
boots) to first justify that the fast path is required. E.g., could we even
work around that inefficient reading of 0xff's for our use case? After all,
what the BIOS really needs is not those 0xff's, but some other facts.

Thanks!