Message ID | 20171009025000.39435-1-aik@ozlabs.ru (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
On Mon, 9 Oct 2017 13:50:00 +1100 Alexey Kardashevskiy <aik@ozlabs.ru> wrote: > At the moment the protection in VFIO MMIO mappings is forced to > _PAGE_NON_IDEMPOTENT which means that write combining is not really > available to the userspace even for prefetchable 64bit MMIO BARs. > > This replaces pgprot_noncached() with a platform specific > phys_mem_access_prot() when available and depending on the platform > the vm_page_prot may be set to _PAGE_TOLERANT allowing to exploit > the write combining feature. > > The guest drivers still have to use _wc versions of > the ioremap/pci_ioremap API to get write combininig working. > > Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru> > --- > > This should allow DPDK and radix guests (x86, POWERPC, etc) to > do write combining. > > POWERPC hash guests should not be affected by this change, it should > work even without this. > --- > drivers/vfio/pci/vfio_pci.c | 7 ++++++- > 1 file changed, 6 insertions(+), 1 deletion(-) > > diff --git a/drivers/vfio/pci/vfio_pci.c b/drivers/vfio/pci/vfio_pci.c > index f041b1a6cf66..014192b42724 100644 > --- a/drivers/vfio/pci/vfio_pci.c > +++ b/drivers/vfio/pci/vfio_pci.c > @@ -1156,8 +1156,13 @@ static int vfio_pci_mmap(void *device_data, struct vm_area_struct *vma) > } > > vma->vm_private_data = vdev; > - vma->vm_page_prot = pgprot_noncached(vma->vm_page_prot); > vma->vm_pgoff = (pci_resource_start(pdev, index) >> PAGE_SHIFT) + pgoff; > +#ifdef __HAVE_PHYS_MEM_ACCESS_PROT > + vma->vm_page_prot = phys_mem_access_prot(NULL, vma->vm_pgoff, > + req_len, vma->vm_page_prot); > +#else > + vma->vm_page_prot = pgprot_noncached(vma->vm_page_prot); > +#endif > > return remap_pfn_range(vma, vma->vm_start, vma->vm_pgoff, > req_len, vma->vm_page_prot); Are you testing __HAVE_PHYS_MEM_ACCESS_PROT because the version of phys_mem_access_prot() defined in drivers/char/mem.c can dereference @file and we're hoping that platforms we care about won't both define __HAVE_PHYS_MEM_ACCESS_PROT and look at @file? Sketchy. Thanks, Alex
On 11/10/17 08:55, Alex Williamson wrote: > On Mon, 9 Oct 2017 13:50:00 +1100 > Alexey Kardashevskiy <aik@ozlabs.ru> wrote: > >> At the moment the protection in VFIO MMIO mappings is forced to >> _PAGE_NON_IDEMPOTENT which means that write combining is not really >> available to the userspace even for prefetchable 64bit MMIO BARs. >> >> This replaces pgprot_noncached() with a platform specific >> phys_mem_access_prot() when available and depending on the platform >> the vm_page_prot may be set to _PAGE_TOLERANT allowing to exploit >> the write combining feature. >> >> The guest drivers still have to use _wc versions of >> the ioremap/pci_ioremap API to get write combininig working. >> >> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru> >> --- >> >> This should allow DPDK and radix guests (x86, POWERPC, etc) to >> do write combining. >> >> POWERPC hash guests should not be affected by this change, it should >> work even without this. >> --- >> drivers/vfio/pci/vfio_pci.c | 7 ++++++- >> 1 file changed, 6 insertions(+), 1 deletion(-) >> >> diff --git a/drivers/vfio/pci/vfio_pci.c b/drivers/vfio/pci/vfio_pci.c >> index f041b1a6cf66..014192b42724 100644 >> --- a/drivers/vfio/pci/vfio_pci.c >> +++ b/drivers/vfio/pci/vfio_pci.c >> @@ -1156,8 +1156,13 @@ static int vfio_pci_mmap(void *device_data, struct vm_area_struct *vma) >> } >> >> vma->vm_private_data = vdev; >> - vma->vm_page_prot = pgprot_noncached(vma->vm_page_prot); >> vma->vm_pgoff = (pci_resource_start(pdev, index) >> PAGE_SHIFT) + pgoff; >> +#ifdef __HAVE_PHYS_MEM_ACCESS_PROT >> + vma->vm_page_prot = phys_mem_access_prot(NULL, vma->vm_pgoff, >> + req_len, vma->vm_page_prot); >> +#else >> + vma->vm_page_prot = pgprot_noncached(vma->vm_page_prot); >> +#endif >> >> return remap_pfn_range(vma, vma->vm_start, vma->vm_pgoff, >> req_len, vma->vm_page_prot); > > Are you testing __HAVE_PHYS_MEM_ACCESS_PROT because the version of > phys_mem_access_prot() defined in drivers/char/mem.c can dereference > @file and we're hoping that platforms we care about won't both define > __HAVE_PHYS_MEM_ACCESS_PROT and look at @file? No. That version in mem.c is static and not exported at all and I do not understand why it got this name. Every other instance of phys_mem_access_prot is accompanied by #define __HAVE_PHYS_MEM_ACCESS_PROT But 3 instances (ia64, x86, mips) are not exported (arm, arm64, ppc are) and v2 of this will come with 3 more single line patches, if we decide to proceed. The only version which actually looks at @file is in arch/mips/loongson64/common/mem.c and I do not know what to do about it (can it do VFIO at all?), I could pass a file there but no actual code would use it anyway. > Sketchy. Thanks, > > Alex >
On Wed, 11 Oct 2017 13:05:00 +1100 Alexey Kardashevskiy <aik@ozlabs.ru> wrote: > On 11/10/17 08:55, Alex Williamson wrote: > > On Mon, 9 Oct 2017 13:50:00 +1100 > > Alexey Kardashevskiy <aik@ozlabs.ru> wrote: > > > >> At the moment the protection in VFIO MMIO mappings is forced to > >> _PAGE_NON_IDEMPOTENT which means that write combining is not really > >> available to the userspace even for prefetchable 64bit MMIO BARs. > >> > >> This replaces pgprot_noncached() with a platform specific > >> phys_mem_access_prot() when available and depending on the platform > >> the vm_page_prot may be set to _PAGE_TOLERANT allowing to exploit > >> the write combining feature. > >> > >> The guest drivers still have to use _wc versions of > >> the ioremap/pci_ioremap API to get write combininig working. > >> > >> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru> > >> --- > >> > >> This should allow DPDK and radix guests (x86, POWERPC, etc) to > >> do write combining. > >> > >> POWERPC hash guests should not be affected by this change, it should > >> work even without this. > >> --- > >> drivers/vfio/pci/vfio_pci.c | 7 ++++++- > >> 1 file changed, 6 insertions(+), 1 deletion(-) > >> > >> diff --git a/drivers/vfio/pci/vfio_pci.c b/drivers/vfio/pci/vfio_pci.c > >> index f041b1a6cf66..014192b42724 100644 > >> --- a/drivers/vfio/pci/vfio_pci.c > >> +++ b/drivers/vfio/pci/vfio_pci.c > >> @@ -1156,8 +1156,13 @@ static int vfio_pci_mmap(void *device_data, struct vm_area_struct *vma) > >> } > >> > >> vma->vm_private_data = vdev; > >> - vma->vm_page_prot = pgprot_noncached(vma->vm_page_prot); > >> vma->vm_pgoff = (pci_resource_start(pdev, index) >> PAGE_SHIFT) + pgoff; > >> +#ifdef __HAVE_PHYS_MEM_ACCESS_PROT > >> + vma->vm_page_prot = phys_mem_access_prot(NULL, vma->vm_pgoff, > >> + req_len, vma->vm_page_prot); > >> +#else > >> + vma->vm_page_prot = pgprot_noncached(vma->vm_page_prot); > >> +#endif > >> > >> return remap_pfn_range(vma, vma->vm_start, vma->vm_pgoff, > >> req_len, vma->vm_page_prot); > > > > Are you testing __HAVE_PHYS_MEM_ACCESS_PROT because the version of > > phys_mem_access_prot() defined in drivers/char/mem.c can dereference > > @file and we're hoping that platforms we care about won't both define > > __HAVE_PHYS_MEM_ACCESS_PROT and look at @file? > > No. > > That version in mem.c is static and not exported at all and I do not > understand why it got this name. Every other instance of > phys_mem_access_prot is accompanied by > > #define __HAVE_PHYS_MEM_ACCESS_PROT I did miss that mem.c was static there, but I think the point is still valid, file being NULL doesn't seem to be a universally expected option. > But 3 instances (ia64, x86, mips) are not exported (arm, arm64, ppc are) > and v2 of this will come with 3 more single line patches, if we decide to > proceed. > > The only version which actually looks at @file is in > arch/mips/loongson64/common/mem.c and I do not know what to do about it > (can it do VFIO at all?), I could pass a file there but no actual code > would use it anyway. The question is not whether this particular platform could use vfio, but instead is whether vfio is using the function correctly. That, I really don't know. But also... arch/arm64/mm/mmu.c: pgprot_t phys_mem_access_prot(struct file *file, unsigned long pfn, unsigned long size, pgprot_t vma_prot) { if (!pfn_valid(pfn)) return pgprot_noncached(vma_prot); else if (file->f_flags & O_SYNC) return pgprot_writecombine(vma_prot); return vma_prot; } Why do we get to ignore dereferencing file on an arch that we definitely care about? Thanks, Alex
On 11/10/17 13:42, Alex Williamson wrote: > On Wed, 11 Oct 2017 13:05:00 +1100 > Alexey Kardashevskiy <aik@ozlabs.ru> wrote: > >> On 11/10/17 08:55, Alex Williamson wrote: >>> On Mon, 9 Oct 2017 13:50:00 +1100 >>> Alexey Kardashevskiy <aik@ozlabs.ru> wrote: >>> >>>> At the moment the protection in VFIO MMIO mappings is forced to >>>> _PAGE_NON_IDEMPOTENT which means that write combining is not really >>>> available to the userspace even for prefetchable 64bit MMIO BARs. >>>> >>>> This replaces pgprot_noncached() with a platform specific >>>> phys_mem_access_prot() when available and depending on the platform >>>> the vm_page_prot may be set to _PAGE_TOLERANT allowing to exploit >>>> the write combining feature. >>>> >>>> The guest drivers still have to use _wc versions of >>>> the ioremap/pci_ioremap API to get write combininig working. >>>> >>>> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru> >>>> --- >>>> >>>> This should allow DPDK and radix guests (x86, POWERPC, etc) to >>>> do write combining. >>>> >>>> POWERPC hash guests should not be affected by this change, it should >>>> work even without this. >>>> --- >>>> drivers/vfio/pci/vfio_pci.c | 7 ++++++- >>>> 1 file changed, 6 insertions(+), 1 deletion(-) >>>> >>>> diff --git a/drivers/vfio/pci/vfio_pci.c b/drivers/vfio/pci/vfio_pci.c >>>> index f041b1a6cf66..014192b42724 100644 >>>> --- a/drivers/vfio/pci/vfio_pci.c >>>> +++ b/drivers/vfio/pci/vfio_pci.c >>>> @@ -1156,8 +1156,13 @@ static int vfio_pci_mmap(void *device_data, struct vm_area_struct *vma) >>>> } >>>> >>>> vma->vm_private_data = vdev; >>>> - vma->vm_page_prot = pgprot_noncached(vma->vm_page_prot); >>>> vma->vm_pgoff = (pci_resource_start(pdev, index) >> PAGE_SHIFT) + pgoff; >>>> +#ifdef __HAVE_PHYS_MEM_ACCESS_PROT >>>> + vma->vm_page_prot = phys_mem_access_prot(NULL, vma->vm_pgoff, >>>> + req_len, vma->vm_page_prot); >>>> +#else >>>> + vma->vm_page_prot = pgprot_noncached(vma->vm_page_prot); >>>> +#endif >>>> >>>> return remap_pfn_range(vma, vma->vm_start, vma->vm_pgoff, >>>> req_len, vma->vm_page_prot); >>> >>> Are you testing __HAVE_PHYS_MEM_ACCESS_PROT because the version of >>> phys_mem_access_prot() defined in drivers/char/mem.c can dereference >>> @file and we're hoping that platforms we care about won't both define >>> __HAVE_PHYS_MEM_ACCESS_PROT and look at @file? >> >> No. >> >> That version in mem.c is static and not exported at all and I do not >> understand why it got this name. Every other instance of >> phys_mem_access_prot is accompanied by >> >> #define __HAVE_PHYS_MEM_ACCESS_PROT > > I did miss that mem.c was static there, but I think the point is still > valid, file being NULL doesn't seem to be a universally expected option. > > >> But 3 instances (ia64, x86, mips) are not exported (arm, arm64, ppc are) >> and v2 of this will come with 3 more single line patches, if we decide to >> proceed. >> >> The only version which actually looks at @file is in >> arch/mips/loongson64/common/mem.c and I do not know what to do about it >> (can it do VFIO at all?), I could pass a file there but no actual code >> would use it anyway. > > The question is not whether this particular platform could use vfio, > but instead is whether vfio is using the function correctly. That, I > really don't know. Who does? :) Let's cc: him. > > > But also... > > arch/arm64/mm/mmu.c: > pgprot_t phys_mem_access_prot(struct file *file, unsigned long pfn, > unsigned long size, pgprot_t vma_prot) > { > if (!pfn_valid(pfn)) > return pgprot_noncached(vma_prot); > else if (file->f_flags & O_SYNC) > return pgprot_writecombine(vma_prot); > return vma_prot; > } > > Why do we get to ignore dereferencing file on an arch that we > definitely care about? Thanks, Oopsie. This is because I overlooked it. Others do not use it. So I do need a file. But in the current scheme where all BARs share one fd - it won't work - I simply cannot allow WC on non-prefetchable BARs :-/
On Wed, 2017-10-11 at 13:56 +1100, Alexey Kardashevskiy wrote: > Oopsie. This is because I overlooked it. Others do not use it. So I do need > a file. But in the current scheme where all BARs share one fd - it won't > work - I simply cannot allow WC on non-prefetchable BARs :-/ This is an oversight in the design of VFIO-PCI, it should have a way to specify write combine, either implicitely via such an arch hook, or explicitely via an ioctl prior to mapping the BARs for example. Alex, what do you reckon is the best approach here ? Cheers, Ben.
On 12/10/17 02:35, Benjamin Herrenschmidt wrote: > On Wed, 2017-10-11 at 13:56 +1100, Alexey Kardashevskiy wrote: >> Oopsie. This is because I overlooked it. Others do not use it. So I do need >> a file. But in the current scheme where all BARs share one fd - it won't >> work - I simply cannot allow WC on non-prefetchable BARs :-/ > > This is an oversight in the design of VFIO-PCI, it should have a way to > specify write combine, either implicitely via such an arch hook, or > explicitely via an ioctl prior to mapping the BARs for example. > > Alex, what do you reckon is the best approach here ? /me wonders if it is yet another issue for the dead issues bucket, just like the msix mapping one :)
On Mon, Oct 16, 2017 at 04:54:08PM +1100, Alexey Kardashevskiy wrote: > On 12/10/17 02:35, Benjamin Herrenschmidt wrote: > > On Wed, 2017-10-11 at 13:56 +1100, Alexey Kardashevskiy wrote: > >> Oopsie. This is because I overlooked it. Others do not use it. So I do need > >> a file. But in the current scheme where all BARs share one fd - it won't > >> work - I simply cannot allow WC on non-prefetchable BARs :-/ > > > > This is an oversight in the design of VFIO-PCI, it should have a way to > > specify write combine, either implicitely via such an arch hook, or > > explicitely via an ioctl prior to mapping the BARs for example. > > > > Alex, what do you reckon is the best approach here ? > > /me wonders if it is yet another issue for the dead issues bucket, just > like the msix mapping one :) Maybe. Alexey, maybe you can make up a list of things that we (me, you, BenH) need to discuss with Alex W at KVM Forum?
On 16/10/17 17:00, David Gibson wrote: > On Mon, Oct 16, 2017 at 04:54:08PM +1100, Alexey Kardashevskiy wrote: >> On 12/10/17 02:35, Benjamin Herrenschmidt wrote: >>> On Wed, 2017-10-11 at 13:56 +1100, Alexey Kardashevskiy wrote: >>>> Oopsie. This is because I overlooked it. Others do not use it. So I do need >>>> a file. But in the current scheme where all BARs share one fd - it won't >>>> work - I simply cannot allow WC on non-prefetchable BARs :-/ >>> >>> This is an oversight in the design of VFIO-PCI, it should have a way to >>> specify write combine, either implicitely via such an arch hook, or >>> explicitely via an ioctl prior to mapping the BARs for example. >>> >>> Alex, what do you reckon is the best approach here ? >> >> /me wonders if it is yet another issue for the dead issues bucket, just >> like the msix mapping one :) > > Maybe. Alexey, maybe you can make up a list of things that we (me, > you, BenH) need to discuss with Alex W at KVM Forum? "you" - you meant me? I am not coming over there :( The list is: 1. Allow msix mapping to the userspace (to address non-64k-aligned msix bar) 2. Allow write combining in vfio for the userspace (kvm guest is kinda special and may simply ignore mapping flags in some configs but PPC radix guests still rely on this) 3. what callback and where needs to be added to inform HV/PR KVM about VFIO group, like IOMMUMR::add_vfio_group() proposal or something. Thanks!
On Mon, Oct 16, 2017 at 06:36:29PM +1100, Alexey Kardashevskiy wrote: > On 16/10/17 17:00, David Gibson wrote: > > On Mon, Oct 16, 2017 at 04:54:08PM +1100, Alexey Kardashevskiy wrote: > >> On 12/10/17 02:35, Benjamin Herrenschmidt wrote: > >>> On Wed, 2017-10-11 at 13:56 +1100, Alexey Kardashevskiy wrote: > >>>> Oopsie. This is because I overlooked it. Others do not use it. So I do need > >>>> a file. But in the current scheme where all BARs share one fd - it won't > >>>> work - I simply cannot allow WC on non-prefetchable BARs :-/ > >>> > >>> This is an oversight in the design of VFIO-PCI, it should have a way to > >>> specify write combine, either implicitely via such an arch hook, or > >>> explicitely via an ioctl prior to mapping the BARs for example. > >>> > >>> Alex, what do you reckon is the best approach here ? > >> > >> /me wonders if it is yet another issue for the dead issues bucket, just > >> like the msix mapping one :) > > > > Maybe. Alexey, maybe you can make up a list of things that we (me, > > you, BenH) need to discuss with Alex W at KVM Forum? > > "you" - you meant me? I am not coming over there :( Oh.. I thought you were. > The list is: > > 1. Allow msix mapping to the userspace (to address non-64k-aligned msix bar) > > 2. Allow write combining in vfio for the userspace (kvm guest is kinda > special and may simply ignore mapping flags in some configs but PPC radix > guests still rely on this) > > 3. what callback and where needs to be added to inform HV/PR KVM about VFIO > group, like IOMMUMR::add_vfio_group() proposal or something. Thanks.
On Mon, 2017-10-16 at 18:36 +1100, Alexey Kardashevskiy wrote: > > 1. Allow msix mapping to the userspace (to address non-64k-aligned msix bar) > > 2. Allow write combining in vfio for the userspace (kvm guest is kinda > special and may simply ignore mapping flags in some configs but PPC radix > guests still rely on this) Why ? The "G" bit is entirely under control of the guest afaik. This would only affect qemu itself. It's still useful for things like dpdk using vfio. > 3. what callback and where needs to be added to inform HV/PR KVM about VFIO > group, like IOMMUMR::add_vfio_group() proposal or something. Can you elaborate a bit ? I haven't followed this. Cheers, Ben.
On 16/10/17 19:38, Benjamin Herrenschmidt wrote: > On Mon, 2017-10-16 at 18:36 +1100, Alexey Kardashevskiy wrote: >> >> 1. Allow msix mapping to the userspace (to address non-64k-aligned msix bar) >> >> 2. Allow write combining in vfio for the userspace (kvm guest is kinda >> special and may simply ignore mapping flags in some configs but PPC radix >> guests still rely on this) > > Why ? The "G" bit is entirely under control of the guest afaik. Yes, for hash guests. I am not sure sure about radix, Paul pointed me to the code in KVM which uses the VFIO's mapping VMA. > This > would only affect qemu itself. It's still useful for things like dpdk > using vfio. Correct, this is useful regardless KVM. >> 3. what callback and where needs to be added to inform HV/PR KVM about VFIO >> group, like IOMMUMR::add_vfio_group() proposal or something. > > Can you elaborate a bit ? I haven't followed this. David knows :)
On Mon, 2017-10-16 at 22:11 +1100, Alexey Kardashevskiy wrote: > On 16/10/17 19:38, Benjamin Herrenschmidt wrote: > > On Mon, 2017-10-16 at 18:36 +1100, Alexey Kardashevskiy wrote: > > > > > > 1. Allow msix mapping to the userspace (to address non-64k-aligned msix bar) > > > > > > 2. Allow write combining in vfio for the userspace (kvm guest is kinda > > > special and may simply ignore mapping flags in some configs but PPC radix > > > guests still rely on this) > > > > Why ? The "G" bit is entirely under control of the guest afaik. > > Yes, for hash guests. I am not sure sure about radix, Paul pointed me to > the code in KVM which uses the VFIO's mapping VMA. With radix, the HW will honor the G bit set in the guest page tables. > > > This > > would only affect qemu itself. It's still useful for things like dpdk > > using vfio. > > Correct, this is useful regardless KVM. > > > > 3. what callback and where needs to be added to inform HV/PR KVM about VFIO > > > group, like IOMMUMR::add_vfio_group() proposal or something. > > > > Can you elaborate a bit ? I haven't followed this. > > David knows :) > > >
On 18/10/17 18:33, Benjamin Herrenschmidt wrote: > On Mon, 2017-10-16 at 22:11 +1100, Alexey Kardashevskiy wrote: >> On 16/10/17 19:38, Benjamin Herrenschmidt wrote: >>> On Mon, 2017-10-16 at 18:36 +1100, Alexey Kardashevskiy wrote: >>>> >>>> 1. Allow msix mapping to the userspace (to address non-64k-aligned msix bar) >>>> >>>> 2. Allow write combining in vfio for the userspace (kvm guest is kinda >>>> special and may simply ignore mapping flags in some configs but PPC radix >>>> guests still rely on this) >>> >>> Why ? The "G" bit is entirely under control of the guest afaik. >> >> Yes, for hash guests. I am not sure sure about radix, Paul pointed me to >> the code in KVM which uses the VFIO's mapping VMA. > > With radix, the HW will honor the G bit set in the guest page tables. I am sure that you know better, Paul just pointed me to this: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/arch/powerpc/kvm/book3s_64_mmu_radix.c?h=v4.13#n403 > >> >>> This >>> would only affect qemu itself. It's still useful for things like dpdk >>> using vfio. >> >> Correct, this is useful regardless KVM. >> >>>> 3. what callback and where needs to be added to inform HV/PR KVM about VFIO >>>> group, like IOMMUMR::add_vfio_group() proposal or something. >>> >>> Can you elaborate a bit ? I haven't followed this. >> >> David knows :) >> >> >>
On Wed, 2017-10-18 at 20:00 +1100, Alexey Kardashevskiy wrote: > > > > With radix, the HW will honor the G bit set in the guest page tables. > > > I am sure that you know better, Paul just pointed me to this: > > https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/arch/powerpc/kvm/book3s_64_mmu_radix.c?h=v4.13#n403 It's in the architecture somewhere, if you look at 3.0B in the radix section about nested radix, there's a table for the nesting of attributes. Here you'll see that regardless of the G bit value set by the host, the HW will use the G bit value set by the guest. However, we should still provide a way to exploit WC if anything for things like DPDK. Cheers, Ben.
On 16/10/17 19:01, David Gibson wrote: > On Mon, Oct 16, 2017 at 06:36:29PM +1100, Alexey Kardashevskiy wrote: >> On 16/10/17 17:00, David Gibson wrote: >>> On Mon, Oct 16, 2017 at 04:54:08PM +1100, Alexey Kardashevskiy wrote: >>>> On 12/10/17 02:35, Benjamin Herrenschmidt wrote: >>>>> On Wed, 2017-10-11 at 13:56 +1100, Alexey Kardashevskiy wrote: >>>>>> Oopsie. This is because I overlooked it. Others do not use it. So I do need >>>>>> a file. But in the current scheme where all BARs share one fd - it won't >>>>>> work - I simply cannot allow WC on non-prefetchable BARs :-/ >>>>> >>>>> This is an oversight in the design of VFIO-PCI, it should have a way to >>>>> specify write combine, either implicitely via such an arch hook, or >>>>> explicitely via an ioctl prior to mapping the BARs for example. >>>>> >>>>> Alex, what do you reckon is the best approach here ? >>>> >>>> /me wonders if it is yet another issue for the dead issues bucket, just >>>> like the msix mapping one :) >>> >>> Maybe. Alexey, maybe you can make up a list of things that we (me, >>> you, BenH) need to discuss with Alex W at KVM Forum? >> >> "you" - you meant me? I am not coming over there :( > > Oh.. I thought you were. > >> The list is: >> >> 1. Allow msix mapping to the userspace (to address non-64k-aligned msix bar) >> >> 2. Allow write combining in vfio for the userspace (kvm guest is kinda >> special and may simply ignore mapping flags in some configs but PPC radix >> guests still rely on this) >> >> 3. what callback and where needs to be added to inform HV/PR KVM about VFIO >> group, like IOMMUMR::add_vfio_group() proposal or something. > > Thanks. Any luck with these? Ben, ping.
On Mon, Nov 06, 2017 at 04:44:03PM +1100, Alexey Kardashevskiy wrote: > On 16/10/17 19:01, David Gibson wrote: > > On Mon, Oct 16, 2017 at 06:36:29PM +1100, Alexey Kardashevskiy wrote: > >> On 16/10/17 17:00, David Gibson wrote: > >>> On Mon, Oct 16, 2017 at 04:54:08PM +1100, Alexey Kardashevskiy wrote: > >>>> On 12/10/17 02:35, Benjamin Herrenschmidt wrote: > >>>>> On Wed, 2017-10-11 at 13:56 +1100, Alexey Kardashevskiy wrote: > >>>>>> Oopsie. This is because I overlooked it. Others do not use it. So I do need > >>>>>> a file. But in the current scheme where all BARs share one fd - it won't > >>>>>> work - I simply cannot allow WC on non-prefetchable BARs :-/ > >>>>> > >>>>> This is an oversight in the design of VFIO-PCI, it should have a way to > >>>>> specify write combine, either implicitely via such an arch hook, or > >>>>> explicitely via an ioctl prior to mapping the BARs for example. > >>>>> > >>>>> Alex, what do you reckon is the best approach here ? > >>>> > >>>> /me wonders if it is yet another issue for the dead issues bucket, just > >>>> like the msix mapping one :) > >>> > >>> Maybe. Alexey, maybe you can make up a list of things that we (me, > >>> you, BenH) need to discuss with Alex W at KVM Forum? > >> > >> "you" - you meant me? I am not coming over there :( > > > > Oh.. I thought you were. > > > >> The list is: > >> > >> 1. Allow msix mapping to the userspace (to address non-64k-aligned msix bar) We have a new plan on this - I'll discuss it over IRC. > >> 2. Allow write combining in vfio for the userspace (kvm guest is kinda > >> special and may simply ignore mapping flags in some configs but PPC radix > >> guests still rely on this) AIUI this isn't for radix, but for DPDK things that we need this. Ben talked about it a bit, but I don't know what the outcome was. > >> 3. what callback and where needs to be added to inform HV/PR KVM about VFIO > >> group, like IOMMUMR::add_vfio_group() proposal or something. This was discussed, and I'm still thinking about it. It's kind of curly.
On Tue, 2017-11-14 at 13:23 +1100, David Gibson wrote: > > >> 1. Allow msix mapping to the userspace (to address non-64k-aligned msix bar) > > We have a new plan on this - I'll discuss it over IRC. > > > >> 2. Allow write combining in vfio for the userspace (kvm guest is kinda > > >> special and may simply ignore mapping flags in some configs but PPC radix > > >> guests still rely on this) > > AIUI this isn't for radix, but for DPDK things that we need this. Ben > talked about it a bit, but I don't know what the outcome was. So this is not a powerpc specific issue. Other archs similarily want to be able to do write combine mappings. The way sysfs does it is that for prefetchable BARs, it exposes both a resourceN and a resourceN_wc file. For VFIO it's a bit more tricky, maybe we need to game the offset using some of it as flags but that's very fishy, or maybe we do some kind of ioctl that selects the attributes used for that fd instance for subsequent mappings... I'll let Alex chose what he feels most appropriate here. > > >> 3. what callback and where needs to be added to inform HV/PR KVM about VFIO > > >> group, like IOMMUMR::add_vfio_group() proposal or something. > > This was discussed, and I'm still thinking about it. It's kind of > curly.
On Tue, 14 Nov 2017 13:29:02 +1100 Benjamin Herrenschmidt <benh@kernel.crashing.org> wrote: > On Tue, 2017-11-14 at 13:23 +1100, David Gibson wrote: > > > >> 1. Allow msix mapping to the userspace (to address non-64k-aligned msix bar) > > > > We have a new plan on this - I'll discuss it over IRC. > > > > > >> 2. Allow write combining in vfio for the userspace (kvm guest is kinda > > > >> special and may simply ignore mapping flags in some configs but PPC radix > > > >> guests still rely on this) > > > > AIUI this isn't for radix, but for DPDK things that we need this. Ben > > talked about it a bit, but I don't know what the outcome was. > > So this is not a powerpc specific issue. Other archs similarily want to > be able to do write combine mappings. > > The way sysfs does it is that for prefetchable BARs, it exposes both > a resourceN and a resourceN_wc file. > > For VFIO it's a bit more tricky, maybe we need to game the offset using > some of it as flags but that's very fishy, or maybe we do some kind of > ioctl that selects the attributes used for that fd instance for > subsequent mappings... > > I'll let Alex chose what he feels most appropriate here. My order of preference would be something like: - mmap flags provide some way for the user to specify a wc mapping within existing regions - some other mechanism of using the existing regions - additional regions provided for use exclusively with wc attributes (generalizing PCI BAR wc regions within device specific regions) - additional file descriptors provided for wc access This isn't at the top of my priority list to figure out the solution, so whoever implements it will need to provide justification as they move down the list from more to less preferred solutions. Thanks, Alex
On 15/11/17 03:28, Alex Williamson wrote: > On Tue, 14 Nov 2017 13:29:02 +1100 > Benjamin Herrenschmidt <benh@kernel.crashing.org> wrote: > >> On Tue, 2017-11-14 at 13:23 +1100, David Gibson wrote: >>>>>> 1. Allow msix mapping to the userspace (to address non-64k-aligned msix bar) >>> >>> We have a new plan on this - I'll discuss it over IRC. >>> >>>>>> 2. Allow write combining in vfio for the userspace (kvm guest is kinda >>>>>> special and may simply ignore mapping flags in some configs but PPC radix >>>>>> guests still rely on this) >>> >>> AIUI this isn't for radix, but for DPDK things that we need this. Ben >>> talked about it a bit, but I don't know what the outcome was. >> >> So this is not a powerpc specific issue. Other archs similarily want to >> be able to do write combine mappings. >> >> The way sysfs does it is that for prefetchable BARs, it exposes both >> a resourceN and a resourceN_wc file. >> >> For VFIO it's a bit more tricky, maybe we need to game the offset using >> some of it as flags but that's very fishy, or maybe we do some kind of >> ioctl that selects the attributes used for that fd instance for >> subsequent mappings... >> >> I'll let Alex chose what he feels most appropriate here. > > My order of preference would be something like: > > - mmap flags provide some way for the user to specify a wc mapping > within existing regions There are plenty of flags but none really matches, checked with Paul. > - some other mechanism of using the existing regions I can only think of madvise but it does not have appropriate flags either. > - additional regions provided for use exclusively with wc attributes > (generalizing PCI BAR wc regions within device specific regions) Adding VFIO_PCI_BAR0_WC_REGION_INDEX for VFIO_PCI_BAR0_REGION_INDEX (and so on for other BARs) seems a viable option. However the comment for VFIO_PCI_xxx_REGION_INDEX says: VFIO_PCI_NUM_REGIONS = 9 /* Fixed user ABI, region indexes >=9 use */ /* device specific cap to define content. */ which limits me in where I can add new indexes, I cannot just add new _WC indexes to that enum, can I? I cannot see any existing regions above 9 yet though. > - additional file descriptors provided for wc access It could be a capability + iocti(VFIO_DEVICE_GET_WC_RESOURCE) which would take a BAR index, check if the BAR is prefetchable and if so - return an fd which the userspace then could mmap(). This is won't break that ABI with 9 regions but it is the least favourable in the list... > This isn't at the top of my priority list to figure out the solution, > so whoever implements it will need to provide justification as they > move down the list from more to less preferred solutions. Thanks, I am trying... I was really counting on you guys having this discussed in Prague :(
On Fri, 24 Nov 2017 15:58:09 +1100 Alexey Kardashevskiy <aik@ozlabs.ru> wrote: > On 15/11/17 03:28, Alex Williamson wrote: > > On Tue, 14 Nov 2017 13:29:02 +1100 > > Benjamin Herrenschmidt <benh@kernel.crashing.org> wrote: > > > >> On Tue, 2017-11-14 at 13:23 +1100, David Gibson wrote: > >>>>>> 1. Allow msix mapping to the userspace (to address non-64k-aligned msix bar) > >>> > >>> We have a new plan on this - I'll discuss it over IRC. > >>> > >>>>>> 2. Allow write combining in vfio for the userspace (kvm guest is kinda > >>>>>> special and may simply ignore mapping flags in some configs but PPC radix > >>>>>> guests still rely on this) > >>> > >>> AIUI this isn't for radix, but for DPDK things that we need this. Ben > >>> talked about it a bit, but I don't know what the outcome was. > >> > >> So this is not a powerpc specific issue. Other archs similarily want to > >> be able to do write combine mappings. > >> > >> The way sysfs does it is that for prefetchable BARs, it exposes both > >> a resourceN and a resourceN_wc file. > >> > >> For VFIO it's a bit more tricky, maybe we need to game the offset using > >> some of it as flags but that's very fishy, or maybe we do some kind of > >> ioctl that selects the attributes used for that fd instance for > >> subsequent mappings... > >> > >> I'll let Alex chose what he feels most appropriate here. > > > > My order of preference would be something like: > > > > - mmap flags provide some way for the user to specify a wc mapping > > within existing regions > > There are plenty of flags but none really matches, checked with Paul. Is MAP_NONBLOCK off the table? Why? > > - some other mechanism of using the existing regions > > I can only think of madvise but it does not have appropriate flags either. Is it worth the process to define something that is appropriate? Would either of the above be the obvious architectural/implementation choice if we could define a flag for it? > > - additional regions provided for use exclusively with wc attributes > > (generalizing PCI BAR wc regions within device specific regions) > > > Adding VFIO_PCI_BAR0_WC_REGION_INDEX for VFIO_PCI_BAR0_REGION_INDEX (and so > on for other BARs) seems a viable option. > > However the comment for VFIO_PCI_xxx_REGION_INDEX says: > > VFIO_PCI_NUM_REGIONS = 9 /* Fixed user ABI, region indexes >=9 use */ > /* device specific cap to define content. */ > > > which limits me in where I can add new indexes, I cannot just add new _WC > indexes to that enum, can I? I cannot see any existing regions above 9 yet > though. The comment explains how to do this, you'd add a device specific region with the type identifying it as a PCI MMIO WC region and the sub-type probably defining the BAR index. > > - additional file descriptors provided for wc access > > It could be a capability + iocti(VFIO_DEVICE_GET_WC_RESOURCE) which would > take a BAR index, check if the BAR is prefetchable and if so - return an fd > which the userspace then could mmap(). This is won't break that ABI with 9 > regions but it is the least favourable in the list... Do the kernel mechanics require it to be a separate file descriptor? A separate fd is my last choice as well, but the interfaces your were attempting to use previously seemed to have fd granularity. > > This isn't at the top of my priority list to figure out the solution, > > so whoever implements it will need to provide justification as they > > move down the list from more to less preferred solutions. Thanks, > > I am trying... I was really counting on you guys having this discussed in > Prague :( Should have been there to push your agenda... Thanks, Alex
On Wed, Nov 29, 2017 at 11:47:46AM -0700, Alex Williamson wrote: > On Fri, 24 Nov 2017 15:58:09 +1100 > Alexey Kardashevskiy <aik@ozlabs.ru> wrote: > > > On 15/11/17 03:28, Alex Williamson wrote: > > > On Tue, 14 Nov 2017 13:29:02 +1100 > > > Benjamin Herrenschmidt <benh@kernel.crashing.org> wrote: > > > > > >> On Tue, 2017-11-14 at 13:23 +1100, David Gibson wrote: > > >>>>>> 1. Allow msix mapping to the userspace (to address non-64k-aligned msix bar) > > >>> > > >>> We have a new plan on this - I'll discuss it over IRC. > > >>> > > >>>>>> 2. Allow write combining in vfio for the userspace (kvm guest is kinda > > >>>>>> special and may simply ignore mapping flags in some configs but PPC radix > > >>>>>> guests still rely on this) > > >>> > > >>> AIUI this isn't for radix, but for DPDK things that we need this. Ben > > >>> talked about it a bit, but I don't know what the outcome was. > > >> > > >> So this is not a powerpc specific issue. Other archs similarily want to > > >> be able to do write combine mappings. > > >> > > >> The way sysfs does it is that for prefetchable BARs, it exposes both > > >> a resourceN and a resourceN_wc file. > > >> > > >> For VFIO it's a bit more tricky, maybe we need to game the offset using > > >> some of it as flags but that's very fishy, or maybe we do some kind of > > >> ioctl that selects the attributes used for that fd instance for > > >> subsequent mappings... > > >> > > >> I'll let Alex chose what he feels most appropriate here. > > > > > > My order of preference would be something like: > > > > > > - mmap flags provide some way for the user to specify a wc mapping > > > within existing regions > > > > There are plenty of flags but none really matches, checked with Paul. > > Is MAP_NONBLOCK off the table? Why? > > > > - some other mechanism of using the existing regions > > > > I can only think of madvise but it does not have appropriate flags either. > > Is it worth the process to define something that is appropriate? Would > either of the above be the obvious architectural/implementation choice > if we could define a flag for it? > > > > - additional regions provided for use exclusively with wc attributes > > > (generalizing PCI BAR wc regions within device specific regions) > > > > > > Adding VFIO_PCI_BAR0_WC_REGION_INDEX for VFIO_PCI_BAR0_REGION_INDEX (and so > > on for other BARs) seems a viable option. > > > > However the comment for VFIO_PCI_xxx_REGION_INDEX says: > > > > VFIO_PCI_NUM_REGIONS = 9 /* Fixed user ABI, region indexes >=9 use */ > > /* device specific cap to define content. */ > > > > > > which limits me in where I can add new indexes, I cannot just add new _WC > > indexes to that enum, can I? I cannot see any existing regions above 9 yet > > though. > > The comment explains how to do this, you'd add a device specific region > with the type identifying it as a PCI MMIO WC region and the sub-type > probably defining the BAR index. > > > > - additional file descriptors provided for wc access > > > > It could be a capability + iocti(VFIO_DEVICE_GET_WC_RESOURCE) which would > > take a BAR index, check if the BAR is prefetchable and if so - return an fd > > which the userspace then could mmap(). This is won't break that ABI with 9 > > regions but it is the least favourable in the list... > > Do the kernel mechanics require it to be a separate file descriptor? A > separate fd is my last choice as well, but the interfaces your were > attempting to use previously seemed to have fd granularity. > > > > This isn't at the top of my priority list to figure out the solution, > > > so whoever implements it will need to provide justification as they > > > move down the list from more to less preferred solutions. Thanks, > > > > I am trying... I was really counting on you guys having this discussed in > > Prague :( > > Should have been there to push your agenda... Thanks, We discussed it briefly, BenH seemed to think there wasn't a big difficulty, IIRC, which is why we didn't spend much time on this (compared to the other issues). So, talk to him.
On Thu, 2017-11-30 at 15:20 +1100, David Gibson wrote: > > > > This isn't at the top of my priority list to figure out the solution, > > > > so whoever implements it will need to provide justification as they > > > > move down the list from more to less preferred solutions. Thanks, > > > > > > I am trying... I was really counting on you guys having this discussed in > > > Prague :( > > > > Should have been there to push your agenda... Thanks, > > We discussed it briefly, BenH seemed to think there wasn't a big > difficulty, IIRC, which is why we didn't spend much time on this > (compared to the other issues). So, talk to him. Well, first we established that this wasn't an issue for KVM, so the importance/urgency went down. It's still useful for DPDK. Then we discussed quickly the various options, none of them is particularily *difficult* as in the implementation is rather trivial, the question is to chose which interface to provide userspace. I don't have a strong opinion there. The mm guys might object to hijacking MAP_NONBLOCK, otherwise that seems like the best approach, so we should run that past them. Cheers, Ben.
diff --git a/drivers/vfio/pci/vfio_pci.c b/drivers/vfio/pci/vfio_pci.c index f041b1a6cf66..014192b42724 100644 --- a/drivers/vfio/pci/vfio_pci.c +++ b/drivers/vfio/pci/vfio_pci.c @@ -1156,8 +1156,13 @@ static int vfio_pci_mmap(void *device_data, struct vm_area_struct *vma) } vma->vm_private_data = vdev; - vma->vm_page_prot = pgprot_noncached(vma->vm_page_prot); vma->vm_pgoff = (pci_resource_start(pdev, index) >> PAGE_SHIFT) + pgoff; +#ifdef __HAVE_PHYS_MEM_ACCESS_PROT + vma->vm_page_prot = phys_mem_access_prot(NULL, vma->vm_pgoff, + req_len, vma->vm_page_prot); +#else + vma->vm_page_prot = pgprot_noncached(vma->vm_page_prot); +#endif return remap_pfn_range(vma, vma->vm_start, vma->vm_pgoff, req_len, vma->vm_page_prot);
At the moment the protection in VFIO MMIO mappings is forced to _PAGE_NON_IDEMPOTENT which means that write combining is not really available to the userspace even for prefetchable 64bit MMIO BARs. This replaces pgprot_noncached() with a platform specific phys_mem_access_prot() when available and depending on the platform the vm_page_prot may be set to _PAGE_TOLERANT allowing to exploit the write combining feature. The guest drivers still have to use _wc versions of the ioremap/pci_ioremap API to get write combininig working. Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru> --- This should allow DPDK and radix guests (x86, POWERPC, etc) to do write combining. POWERPC hash guests should not be affected by this change, it should work even without this. --- drivers/vfio/pci/vfio_pci.c | 7 ++++++- 1 file changed, 6 insertions(+), 1 deletion(-)