Message ID: 20210429162906.32742-2-sdonthineni@nvidia.com (mailing list archive)
State: New, archived
Series: Honor PCI prefetchable attributes for a virtual machine on ARM64
On Thu, 29 Apr 2021 11:29:05 -0500 Shanker Donthineni <sdonthineni@nvidia.com> wrote:

> For pass-through device assignment, the ARM64 KVM hypervisor retrieves the memory region properties (physical address, size, and whether the region is backed by struct page) from the VMA. The prefetchable attribute of a BAR region isn't visible to KVM, so it cannot make an optimal decision for stage-2 attributes.
>
> This patch updates vma->vm_page_prot and maps with the write-combine attribute if the associated BAR is prefetchable. On ARM64, pgprot_writecombine() maps to memory type MT_NORMAL_NC, which has no side effects on reads and allows multiple writes to be combined.
>
> Signed-off-by: Shanker Donthineni <sdonthineni@nvidia.com>
> ---
>  drivers/vfio/pci/vfio_pci.c | 6 +++++-
>  1 file changed, 5 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/vfio/pci/vfio_pci.c b/drivers/vfio/pci/vfio_pci.c
> index 5023e23db3bc..1b734fe1dd51 100644
> --- a/drivers/vfio/pci/vfio_pci.c
> +++ b/drivers/vfio/pci/vfio_pci.c
> @@ -1703,7 +1703,11 @@ static int vfio_pci_mmap(void *device_data, struct vm_area_struct *vma)
>  	}
>
>  	vma->vm_private_data = vdev;
> -	vma->vm_page_prot = pgprot_noncached(vma->vm_page_prot);
> +	if (IS_ENABLED(CONFIG_ARM64) &&
> +	    (pci_resource_flags(pdev, index) & IORESOURCE_PREFETCH))
> +		vma->vm_page_prot = pgprot_writecombine(vma->vm_page_prot);
> +	else
> +		vma->vm_page_prot = pgprot_noncached(vma->vm_page_prot);

If this were a valid thing to do, it should be done for all architectures, not just ARM64. However, a prefetchable range only necessarily allows merged writes, which seems like a subset of the semantics implied by a WC attribute, so this doesn't seem universally valid.

I'm also a bit confused by your problem statement, which indicates that without WC you're seeing unaligned accesses. Does this suggest that your driver is actually relying on WC semantics to perform merging to achieve alignment? That seems rather like a driver bug; I'd expect UC vs. WC to be largely a difference in performance, not a means to enforce proper driver access patterns. Per the PCI spec, the bridge itself can merge writes to prefetchable areas, presumably regardless of this processor attribute; perhaps that's the feature your driver is relying on that might be missing here. Thanks,

Alex
Thanks Alex for the quick reply.

On 4/29/21 1:28 PM, Alex Williamson wrote:

> If this were a valid thing to do, it should be done for all architectures, not just ARM64. However, a prefetchable range only necessarily allows merged writes, which seems like a subset of the semantics implied by a WC attribute, so this doesn't seem universally valid.
>
> I'm also a bit confused by your problem statement, which indicates that without WC you're seeing unaligned accesses. Does this suggest that your driver is actually relying on WC semantics to perform merging to achieve alignment? That seems rather like a driver bug; I'd expect UC vs. WC to be largely a difference in performance, not a means to enforce proper driver access patterns. Per the PCI spec, the bridge itself can merge writes to prefetchable areas, presumably regardless of this processor attribute; perhaps that's the feature your driver is relying on that might be missing here. Thanks,

The driver uses WC semantics; it maps PCI prefetchable BARs using ioremap_wc(). We don't see any issue on the x86 architecture; the driver works fine in both the host and guest kernels. The same driver works on the ARM64 host kernel but crashes inside a VM. The GPU driver uses the architecture-agnostic function ioremap_wc() like other drivers do. This limitation applies to all drivers that use WC memory and follow the ARM64 NORMAL-NC access rules.

On ARM64, ioremap_wc() is mapped to the non-cacheable memory type; there are no side effects on reads and unaligned accesses are allowed per the ARM ARM architecture. The driver behavior is different in the host vs. the guest on ARM64. The ARM CPU generates alignment faults before the transaction reaches the PCI RC/switch/endpoint device.

We have two concerns here:
- Performance impact for pass-through devices.
- The definition of ioremap_wc() in the guest doesn't match the host kernel on ARM64.

> Alex
On Thu, 29 Apr 2021 14:14:50 -0500 Shanker R Donthineni <sdonthineni@nvidia.com> wrote:

> Thanks Alex for the quick reply.
>
> On 4/29/21 1:28 PM, Alex Williamson wrote:
> > [snip]
>
> The driver uses WC semantics; it maps PCI prefetchable BARs using ioremap_wc(). We don't see any issue on the x86 architecture; the driver works fine in both the host and guest kernels. The same driver works on the ARM64 host kernel but crashes inside a VM. The GPU driver uses the architecture-agnostic function ioremap_wc() like other drivers do. This limitation applies to all drivers that use WC memory and follow the ARM64 NORMAL-NC access rules.

x86 KVM works for other reasons: KVM will trust the vCPU attributes for the memory range rather than relying only on the host mapping.

> On ARM64, ioremap_wc() is mapped to the non-cacheable memory type; there are no side effects on reads and unaligned accesses are allowed per the ARM ARM architecture. The driver behavior is different in the host vs. the guest on ARM64.

Per the PCI spec, prefetchable memory only necessarily allows the bridge to merge writes. I believe this is only a subset of what WC mappings allow, therefore I expect this is incompatible with drivers that do not use WC mappings.

> The ARM CPU generates alignment faults before the transaction reaches the PCI RC/switch/endpoint device.

If an alignment fault is fixed by configuring a WC mapping, doesn't that suggest that the driver performed an unaligned access itself and is relying on write combining by the processor to correct that error? That's wrong. Fix the driver, or please offer another explanation of how the WC mapping resolves this. I suspect you could enable tracing in QEMU, disable MMIO mmaps on the vfio-pci device, and find the invalid access.

> We have two concerns here:
> - Performance impact for pass-through devices.
> - The definition of ioremap_wc() in the guest doesn't match the host kernel on ARM64.

Performance I can understand, but I think you're also using it to mask a driver bug, which should be resolved first. Thanks,

Alex
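[For reference, a sketch of how that debugging step might look on the QEMU side. The vfio-pci x-no-mmap property and the vfio_region_read/vfio_region_write trace events exist in QEMU, but treat the exact invocation as an illustration rather than a recipe:]

    # Disable MMIO mmaps so every guest access to the BARs is trapped and
    # serviced by QEMU's read/write path, then trace those accesses to
    # spot the unaligned one.
    qemu-system-aarch64 ... \
        -device vfio-pci,host=0000:01:00.0,x-no-mmap=on \
        -trace 'vfio_region_read' -trace 'vfio_region_write'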
Hi Alex,

> From: Alex Williamson <alex.williamson@redhat.com>
> Subject: Re: [RFC 1/2] vfio/pci: keep the prefetchable attribute of a BAR region in VMA
>
> On Thu, 29 Apr 2021 14:14:50 -0500 Shanker R Donthineni <sdonthineni@nvidia.com> wrote:
> > On 4/29/21 1:28 PM, Alex Williamson wrote:
> > > If this were a valid thing to do, it should be done for all architectures, not just ARM64. However, a prefetchable range only necessarily allows merged writes, which seems like a subset of the semantics implied by a WC attribute, so this doesn't seem universally valid.

I didn't get your exact concern. If we removed the check for the ARM arch and simply stored in the VMA that this is a prefetchable region, then each arch's KVM port could decide which PTE mappings are OK for a prefetchable BAR. KVM doesn't want to go through PCIe enumeration, and would rather have the properties stored in the VMA. Beyond that, on arm64 specifically there is no WC memory type, but we use the Normal Non-cacheable mapping for ioremap_wc(), which can be prefetched and write-combined. What semantics break for a device if its prefetchable BAR is marked Normal Non-cacheable on arm64?

We need a way for write combining to work in a KVM-ARM guest, as it is an important use case for GPUs and NICs, and also NVMe CMB IIRC. So *some* way is needed of letting KVM know to map as write-combine (Normal NC) at stage 2. Do you have a better solution in mind?

> [snip]
[+Jason, Ben]

On Thu, Apr 29, 2021 at 11:29:05AM -0500, Shanker Donthineni wrote:

> For pass-through device assignment, the ARM64 KVM hypervisor retrieves the memory region properties (physical address, size, and whether the region is backed by struct page) from the VMA. The prefetchable attribute of a BAR region isn't visible to KVM, so it cannot make an optimal decision for stage-2 attributes.
>
> This patch updates vma->vm_page_prot and maps with the write-combine attribute if the associated BAR is prefetchable. On ARM64, pgprot_writecombine() maps to memory type MT_NORMAL_NC, which has no side effects on reads and allows multiple writes to be combined.
>
> Signed-off-by: Shanker Donthineni <sdonthineni@nvidia.com>
> ---
>  drivers/vfio/pci/vfio_pci.c | 6 +++++-
>  1 file changed, 5 insertions(+), 1 deletion(-)

A bit of background information that may be useful:

https://lore.kernel.org/linux-pci/2b539df4c9ec703458e46da2fc879ee3b310b31c.camel@kernel.crashing.org

Lorenzo

> [snip: patch diff quoted in full]
Hi Alex,

On 4/29/21 2:46 PM, Alex Williamson wrote:

> If an alignment fault is fixed by configuring a WC mapping, doesn't that suggest that the driver performed an unaligned access itself and is relying on write combining by the processor to correct that error? That's wrong. Fix the driver, or please offer another explanation of how the WC mapping resolves this. I suspect you could enable tracing in QEMU, disable MMIO mmaps on the vfio-pci device, and find the invalid access.
>
>> We have two concerns here:
>> - Performance impact for pass-through devices.
>> - The definition of ioremap_wc() in the guest doesn't match the host kernel on ARM64.
>
> Performance I can understand, but I think you're also using it to mask a driver bug, which should be resolved first. Thanks,

We've already instrumented the driver code and found the code path for the unaligned accesses. We'll fix this issue if it's not following WC semantics.

Fixing the performance concern falls under KVM stage-2 page-table control. We're looking for guidance/a solution for updating the stage-2 PTE based on the PCI BAR attribute.
Hi Shanker,

On Fri, 30 Apr 2021 12:25:08 +0100, Shanker R Donthineni <sdonthineni@nvidia.com> wrote:

> Hi Alex,
>
> On 4/29/21 2:46 PM, Alex Williamson wrote:
> > [snip]
>
> We've already instrumented the driver code and found the code path for the unaligned accesses. We'll fix this issue if it's not following WC semantics.
>
> Fixing the performance concern falls under KVM stage-2 page-table control. We're looking for guidance/a solution for updating the stage-2 PTE based on the PCI BAR attribute.

Before we start discussing the *how*, I'd like to clearly understand what *arm64* memory attributes you are relying on. We have already established that the unaligned access was a bug, which was the biggest argument in favour of NORMAL_NC. What are the other requirements?

Thanks,

M.
On Fri, Apr 30, 2021 at 10:54:17AM +0100, Lorenzo Pieralisi wrote:

> [+Jason, Ben]
>
> On Thu, Apr 29, 2021 at 11:29:05AM -0500, Shanker Donthineni wrote:
> > [snip]
>
> A bit of background information that may be useful:
>
> https://lore.kernel.org/linux-pci/2b539df4c9ec703458e46da2fc879ee3b310b31c.camel@kernel.crashing.org

This can't happen automatically. Write-combining or not is a uABI-visible change. Userspace needs to explicitly request, and know, that the mmap it gets back is write-combining.

Jason
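[To make that point concrete, a purely hypothetical sketch of an explicit opt-in at the VFIO uAPI level. VFIO_DEVICE_GET_REGION_INFO and struct vfio_region_info are real uAPI; the VFIO_REGION_INFO_FLAG_WC flag is invented here for illustration:]

    /* Hypothetical flow: the kernel advertises that a BAR may be mapped
     * write-combined, and userspace explicitly opts in, so WC never
     * changes "silently" underneath an existing mmap. */
    struct vfio_region_info info = {
            .argsz = sizeof(info),
            .index = bar_index,            /* BAR region to query */
    };

    ioctl(device_fd, VFIO_DEVICE_GET_REGION_INFO, &info);

    if (info.flags & VFIO_REGION_INFO_FLAG_WC) {   /* invented flag */
            /* Userspace would then request the WC mapping explicitly,
             * e.g. via a dedicated mmap offset or a follow-up ioctl. */
    }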
Hi Marc,

On 4/30/21 6:47 AM, Marc Zyngier wrote:

>>>> We have two concerns here:
>>>> - Performance impact for pass-through devices.
>>>> - The definition of ioremap_wc() in the guest doesn't match the host kernel on ARM64.
>>> Performance I can understand, but I think you're also using it to mask a driver bug, which should be resolved first. Thanks,
>> We've already instrumented the driver code and found the code path for the unaligned accesses. We'll fix this issue if it's not following WC semantics.
>>
>> Fixing the performance concern falls under KVM stage-2 page-table control. We're looking for guidance/a solution for updating the stage-2 PTE based on the PCI BAR attribute.
> Before we start discussing the *how*, I'd like to clearly understand what *arm64* memory attributes you are relying on. We have already established that the unaligned access was a bug, which was the biggest argument in favour of NORMAL_NC. What are the other requirements?

We are relying on the NORMAL_NC mapping for PCI prefetchable BARs, instead of DEVICE_nGnRE, in both bare metal and VM. There is no requirement other than this.

-Shanker
Hi Marc,

On 4/30/21 6:47 AM, Marc Zyngier wrote:

> [snip]
> Before we start discussing the *how*, I'd like to clearly understand what *arm64* memory attributes you are relying on. We have already established that the unaligned access was a bug, which was the biggest argument in favour of NORMAL_NC. What are the other requirements?

Sorry, my earlier response was not complete...

The ARMv8 architecture has two features, Gathering and Reordering of transactions, that are very important from a performance point of view. Small inline packets for NIC cards and accesses to a GPU's frame buffer are CPU-bound operations. We want to take advantage of the GRE features to achieve higher performance.

Both of these features are disabled for prefetchable BARs in a VM because the memory type MT_DEVICE_nGnRE is enforced at stage 2.

> Thanks,
>
> M.
>
> --
> Without deviation from the norm, progress is not possible.
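[For context on where that stage-2 enforcement comes from: KVM/arm64's fault handler marks non-struct-page regions such as vfio-mapped PCI BARs as device memory, and the architecture combines stage-1 and stage-2 attributes by taking the more restrictive one. A simplified sketch, with names approximate to arch/arm64/kvm/mmu.c of roughly the v5.12 era:]

    /* Sketch of KVM/arm64 stage-2 attribute selection on a guest fault.
     * A VM_PFNMAP VMA (e.g. a vfio-mapped BAR) is treated as a device:
     * it gets Device-nGnRE at stage 2, so even if the guest driver maps
     * the BAR Normal-NC at stage 1 via ioremap_wc(), the combined
     * attribute degrades to Device-nGnRE: no gathering, no reordering. */
    if (device)
            prot |= KVM_PGTABLE_PROT_DEVICE;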
On Fri, 30 Apr 2021 14:07:46 +0100, Shanker R Donthineni <sdonthineni@nvidia.com> wrote:

> Hi Marc,
>
> On 4/30/21 6:47 AM, Marc Zyngier wrote:
> > [snip]
> > Before we start discussing the *how*, I'd like to clearly understand what *arm64* memory attributes you are relying on. We have already established that the unaligned access was a bug, which was the biggest argument in favour of NORMAL_NC. What are the other requirements?
>
> We are relying on the NORMAL_NC mapping for PCI prefetchable BARs, instead of DEVICE_nGnRE, in both bare metal and VM. There is no requirement other than this.

That wasn't really my question. What properties do you require for this assigned MMIO window, outside of unaligned accesses? Or, to put it another way, what goes wrong with any of the other device types?

M.
On Fri, 30 Apr 2021 15:58:14 +0100, Shanker R Donthineni <sdonthineni@nvidia.com> wrote:

> Hi Marc,
>
> On 4/30/21 6:47 AM, Marc Zyngier wrote:
> > [snip]
> > Before we start discussing the *how*, I'd like to clearly understand what *arm64* memory attributes you are relying on. What are the other requirements?
>
> Sorry, my earlier response was not complete...
>
> The ARMv8 architecture has two features, Gathering and Reordering of transactions, that are very important from a performance point of view. Small inline packets for NIC cards and accesses to a GPU's frame buffer are CPU-bound operations. We want to take advantage of the GRE features to achieve higher performance.
>
> Both of these features are disabled for prefetchable BARs in a VM because the memory type MT_DEVICE_nGnRE is enforced at stage 2.

Right, so Normal_NC is a red herring, and it is Device_GRE that you are really after, right?

Now, I'm not convinced that we can do that directly from vfio in a device-agnostic manner. It is userspace that places the device in the guest's memory, and I have the ugly feeling that userspace needs to be in control of memory attributes. Otherwise, we change the behaviour for all existing devices that have prefetchable BARs, and I don't think that's an acceptable move (userspace ABI change).

M.
Hi Marc,

> -----Original Message-----
> From: Marc Zyngier <maz@kernel.org>
> Sent: Friday, April 30, 2021 10:31 AM
>
> [snip]
>
> Right, so Normal_NC is a red herring, and it is Device_GRE that you are really after, right?

I think Device GRE has some practical problems:

1. A lot of userspace code that is used to getting write-combined mappings to GPU memory from kernel drivers does memcpy/memset on them, which can insert ldp/stp instructions that crash on a Device memory type. From a quick search I didn't find a memcpy_io or memset_io in glibc. Perhaps there are some other functions available, but a lot of userspace applications that work on x86 and ARM bare metal won't work in ARM VMs without such changes. Changing all of userspace may not always be practical, especially if linking to binaries.

2. Sometimes, even if an application is not using memset/memcpy directly, gcc may insert a builtin memcpy/memset.

3. Recompiling all applications with gcc -mstrict-align has performance issues. In our experiments that resulted in an increase in code size, and also a reliable 3-5% performance decrease. Also, it is not always practical to recompile all of userspace, depending on who owns the code/linked binaries etc.

From the KVM-ARM point of view, what is it about Normal NC at stage 2 for a prefetchable BAR (however KVM gets the hint, whether from userspace or the VMA) that is undesirable vs. Device GRE? I couldn't think of a difference to devices whether the combining or prefetching or reordering happened because of one or the other.

> Now, I'm not convinced that we can do that directly from vfio in a device-agnostic manner. It is userspace that places the device in the guest's memory, and I have the ugly feeling that userspace needs to be in control of memory attributes.
>
> Otherwise, we change the behaviour for all existing devices that have prefetchable BARs, and I don't think that's an acceptable move (userspace ABI change).
>
> M.
Hi Vikram,

On Fri, 30 Apr 2021 17:57:14 +0100, Vikram Sethi <vsethi@nvidia.com> wrote:

> [snip]
>
> I think Device GRE has some practical problems:
>
> 1. A lot of userspace code that is used to getting write-combined mappings to GPU memory from kernel drivers does memcpy/memset on them, which can insert ldp/stp instructions that crash on a Device memory type. From a quick search I didn't find a memcpy_io or memset_io in glibc. Perhaps there are some other functions available, but a lot of userspace applications that work on x86 and ARM bare metal won't work in ARM VMs without such changes. Changing all of userspace may not always be practical, especially if linking to binaries.

This seems to go against what Alex was hinting at earlier, which is that unaligned accesses were not expected on prefetchable regions, with Shanker later confirming that it was an actual bug. Where do we stand here?

> 2. Sometimes, even if an application is not using memset/memcpy directly, gcc may insert a builtin memcpy/memset.
>
> 3. Recompiling all applications with gcc -mstrict-align has performance issues. In our experiments that resulted in an increase in code size, and also a reliable 3-5% performance decrease. Also, it is not always practical to recompile all of userspace, depending on who owns the code/linked binaries etc.
>
> From the KVM-ARM point of view, what is it about Normal NC at stage 2 for a prefetchable BAR (however KVM gets the hint, whether from userspace or the VMA) that is undesirable vs. Device GRE? I couldn't think of a difference to devices whether the combining or prefetching or reordering happened because of one or the other.

The problem I see is that we have VMs and userspace being written in terms of Write-Combine, which is:

- loosely defined even on x86

- subject to interpretations in the way it maps to PCI

- has no direct equivalent in the ARMv8 collection of memory attributes (and Normal_NC comes with speculation capabilities, which strikes me as extremely undesirable on arbitrary devices)

How do we translate this into something consistent? I'd like to see an actual description of what we *really* expect from WC on prefetchable PCI regions, turn that into a documented definition agreed across architectures, and then we can look at implementing it with one memory type or another on arm64.

Because once we expose that memory type at S2 for KVM guests, it becomes ABI and there is no turning back. So I want to get it right once and for all.

Thanks,

M.
Hi Marc,

On 5/1/21 4:30 AM, Marc Zyngier wrote:

>> I think Device GRE has some practical problems:
>>
>> 1. A lot of userspace code that is used to getting write-combined mappings to GPU memory from kernel drivers does memcpy/memset on them, which can insert ldp/stp instructions that crash on a Device memory type. From a quick search I didn't find a memcpy_io or memset_io in glibc. Perhaps there are some other functions available, but a lot of userspace applications that work on x86 and ARM bare metal won't work in ARM VMs without such changes. Changing all of userspace may not always be practical, especially if linking to binaries.
>
> This seems to go against what Alex was hinting at earlier, which is that unaligned accesses were not expected on prefetchable regions, with Shanker later confirming that it was an actual bug. Where do we stand here?

We agreed to call it a driver bug if it's not following the semantics of the Linux write-combining API ioremap_wc(). So far I haven't found it stated explicitly in the Linux documentation whether unaligned accesses are allowed for WC regions. Page faults due to the driver's unaligned accesses in kernel space are under the driver's control; we'll fix those.

The driver uses architecture-agnostic functions that are available in the Linux kernel and expects the same behavior in a VM as on bare metal. We would like to keep the driver implementation as architecture-independent as possible, and VM-unaware. On ARM64, the VM's effective ioremap_wc() definition doesn't match bare metal.

We don't have any control over userspace applications/drivers/libraries, as Vikram is saying. Another example: GCC's memset() can use 'DC ZVA', which triggers an alignment fault if the actual memory type is Device-xxx.
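[As a minimal illustration of the kernel-side half of this, with the BAR address and length below as placeholders: plain memset() on arm64 may be compiled to unaligned stores or DC ZVA, which are only legal if the mapping really is Normal-NC, while the _io accessors stay within what Device memory allows.]

    void __iomem *fb = ioremap_wc(bar_phys, bar_len);  /* placeholders */

    /* May use unaligned stores or DC ZVA: fine on a Normal-NC mapping,
     * but faults if the same address ends up Device-type (e.g. because
     * stage 2 forces Device-nGnRE in a guest). */
    memset((void __force *)fb, 0, bar_len);

    /* memset_io() restricts itself to accesses that are architecturally
     * legal for Device memory, so it is safe either way. */
    memset_io(fb, 0, bar_len);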
Hi Marc,

> From: Marc Zyngier <maz@kernel.org>
>
> Hi Vikram,
>
> The problem I see is that we have VMs and userspace being written in terms of Write-Combine, which is:
>
> - loosely defined even on x86
>
> - subject to interpretations in the way it maps to PCI
>
> - has no direct equivalent in the ARMv8 collection of memory attributes (and Normal_NC comes with speculation capabilities, which strikes me as extremely undesirable on arbitrary devices)

If speculation with Normal NC to prefetchable BARs in devices were a problem, those devices would already be broken on bare metal with ioremap_wc() on arm64, and we would need quirks there to map them not with Normal NC but with Device GRE; and if such a quirk were needed on bare metal, it could be picked up by vfio/KVM as well. But we haven't seen any broken devices doing WC on bare metal on ARM64, have we? I know we have tested NIC write combining on arm64 bare metal, as well as GPU and NVMe CMB, without issues.

Further, I don't see why speculation to non-cacheable memory would be an issue if prefetch without side effects is allowed by the device, which is what a prefetchable BAR is. If it were an issue for a device, I would consider that a bug already needing a quirk in the bare-metal/host kernel. From the PCI spec: "A prefetchable address range may have write side effects, but it may not have read side effects."

> How do we translate this into something consistent? I'd like to see an actual description of what we *really* expect from WC on prefetchable PCI regions, turn that into a documented definition agreed across architectures, and then we can look at implementing it with one memory type or another on arm64.
>
> Because once we expose that memory type at S2 for KVM guests, it becomes ABI and there is no turning back. So I want to get it right once and for all.

I agree that we need a precise definition for the Linux ioremap_wc() API with respect to what drivers (kernel and userspace) can expect: whether memset/memcpy is expected to work or not, and whether aligned accesses are a requirement. To the extent the ABI is set, I would think that the ABI is also already set in the host kernel as arm64 WC = Normal NC, so why should that not also be the ABI for the same driver in VMs?

> Thanks,
>
> M.
On Sat, 01 May 2021 12:36:11 +0100, Shanker R Donthineni <sdonthineni@nvidia.com> wrote:

> Hi Marc,
>
> On 5/1/21 4:30 AM, Marc Zyngier wrote:
> > [snip]
> > This seems to go against what Alex was hinting at earlier, which is that unaligned accesses were not expected on prefetchable regions, with Shanker later confirming that it was an actual bug. Where do we stand here?
>
> We agreed to call it a driver bug if it's not following the semantics of the Linux write-combining API ioremap_wc(). So far I haven't found it stated explicitly in the Linux documentation whether unaligned accesses are allowed for WC regions.

And that's exactly the kind of problem I want clarification on before we add *anything* to KVM: a proper, unambiguous definition of what WC is on the CPU side, and how it maps onto PCI. Without such a definition, we're just driving blind.

> Page faults due to the driver's unaligned accesses in kernel space are under the driver's control; we'll fix those.
>
> The driver uses architecture-agnostic functions that are available in the Linux kernel and expects the same behavior in a VM as on bare metal. We would like to keep the driver implementation as architecture-independent as possible, and VM-unaware. On ARM64, the VM's effective ioremap_wc() definition doesn't match bare metal.

You are mixing two things: what Linux/arm64 gives to kernel drivers, and what KVM, as an implementation of the ARMv8 architecture, gives to virtual machines. There is zero reason for the two to match if there is no definition of what we need to provide.

> We don't have any control over userspace applications/drivers/libraries, as Vikram is saying. Another example: GCC's memset() can use 'DC ZVA', which triggers an alignment fault if the actual memory type is Device-xxx.

Again, you're talking about an application, and I'm talking about how to map a nebulous concept that originated on a foreign architecture onto something that is entirely different. So please drop the "that's how my SW works", and instead give me a good definition of what WC means in architectural terms.

Thanks,

M.
Hi Vikram,

On Sun, 02 May 2021 18:56:31 +0100, Vikram Sethi <vsethi@nvidia.com> wrote:

> Hi Marc,
>
> > [snip]
>
> If speculation with Normal NC to prefetchable BARs in devices were a problem, those devices would already be broken on bare metal with ioremap_wc() on arm64, and we would need quirks there to map them not with Normal NC but with Device GRE; and if such a quirk were needed on bare metal, it could be picked up by vfio/KVM as well. But we haven't seen any broken devices doing WC on bare metal on ARM64, have we?

The lack of evidence does not equate to a proof, and your devices not misbehaving doesn't mean it is the right thing, especially when we have such a wide range of CPU and interconnect implementations. Which is why I really want an answer at the architecture level, not a "it works for me" type of answer.

Furthermore, as I replied to Shanker in a separate email, what Linux/arm64 does is pretty much irrelevant. KVM/arm64 implements the ARMv8 architecture, and it is at that level that we need to solve the problem.

If, by enumerating the properties of Prefetchable, you can show that they are a strict superset of Normal_NC, I'm on board. I haven't seen such an enumeration so far.

> I know we have tested NIC write combining on arm64 bare metal, as well as GPU and NVMe CMB, without issues.
>
> Further, I don't see why speculation to non-cacheable memory would be an issue if prefetch without side effects is allowed by the device, which is what a prefetchable BAR is. If it were an issue for a device, I would consider that a bug already needing a quirk in the bare-metal/host kernel. From the PCI spec: "A prefetchable address range may have write side effects, but it may not have read side effects."

Right, so we have made a small step in the direction of mapping "prefetchable" onto "Normal_NC", thanks for that. What about all the other properties (unaligned accesses, ordering, gathering)?

> [snip]
>
> I agree that we need a precise definition for the Linux ioremap_wc() API with respect to what drivers (kernel and userspace) can expect: whether memset/memcpy is expected to work or not, and whether aligned accesses are a requirement. To the extent the ABI is set, I would think that the ABI is also already set in the host kernel as arm64 WC = Normal NC, so why should that not also be the ABI for the same driver in VMs?

KVM is an implementation of the ARM architecture, and doesn't really care about what WC is. If we come to the conclusion that Normal_NC is the natural match for the Prefetchable attributes, then we're good, and we can have Normal_NC being set by userspace, or even by VFIO. But I don't want to set it only because "it works when bare-metal Linux uses it". Remember that KVM doesn't only run Linux as guests.

M.
Hi Marc,

On 5/3/21 4:50 AM, Marc Zyngier wrote:

> You are mixing two things: what Linux/arm64 gives to kernel drivers, and what KVM, as an implementation of the ARMv8 architecture, gives to virtual machines. There is zero reason for the two to match if there is no definition of what we need to provide.

So the suggestion here is that the memory types PROT_NORMAL_NC and PROT_DEVICE_nGnRE are applicable to bare metal only, and are not to be used for device memory inside a VM.
> Date: Mon, 03 May 2021 11:17:23 +0100
> From: Marc Zyngier <maz@kernel.org>
>
> Hi Vikram,
>
> On Sun, 02 May 2021 18:56:31 +0100, Vikram Sethi <vsethi@nvidia.com> wrote:
> > [snip]
> > If speculation with Normal NC to prefetchable BARs in devices were a problem, those devices would already be broken on bare metal with ioremap_wc() on arm64, and we would need quirks there to map them not with Normal NC but with Device GRE; and if such a quirk were needed on bare metal, it could be picked up by vfio/KVM as well. But we haven't seen any broken devices doing WC on bare metal on ARM64, have we?

I think the SC2A11 SoC used in the Socionext Developerbox counts as "broken":

https://www.96boards.org/documentation/enterprise/developerbox/support/known-issues.html

I'm not sure my understanding of the issue is 100% correct, but I believe the firmware workaround described there uses the stage-2 translation tables to map "Normal NC" onto "Device nGRE" or something even more restricted. Now, this hardware may be classified as simply broken. However...

On hardware based on the NXP LX2160A SoC we're seeing some weird behaviour when using "Normal NC" mappings with an AMD GPU that disappears when using "Device nGnRnE" mappings on OpenBSD. No such issue was observed on hardware based on an Ampere eMAG SoC. I don't fully understand this issue yet, and it may very well be a bug in OpenBSD code, but it does show there are potential pitfalls in using "Normal NC" for mapping prefetchable BARs of PCIe devices.

> The lack of evidence does not equate to a proof, and your devices not misbehaving doesn't mean it is the right thing, especially when we have such a wide range of CPU and interconnect implementations. Which is why I really want an answer at the architecture level, not a "it works for me" type of answer.
>
> [snip]
>
> Right, so we have made a small step in the direction of mapping "prefetchable" onto "Normal_NC", thanks for that. What about all the other properties (unaligned accesses, ordering, gathering)?

On x86, WC:

1. Is not cached (but stores are buffered).

2. Allows unaligned access, just like normal memory.

3. Allows speculative reads.

4. Has weaker ordering than normal memory; [lsm]fence instructions are needed to guarantee a particular ordering of writes with respect to other writes and reads.

5. Stores are buffered. This buffer isn't snooped, so it has to be flushed before changes are globally visible. The [sm]fence instructions flush the store buffer.

6. The store buffer may combine multiple writes into a single write.

Now, whether the fact that unaligned access is allowed is really part of the semantics of WC mappings is debatable, as x86 always allows unaligned access, even for areas mapped with ioremap(). However, this is where userland comes in. The userland graphics stack does assume that graphics memory mapped through a prefetchable PCIe BAR allows unaligned access if the architecture allows unaligned access to cacheable memory. On arm64 this means that such memory needs to be "Normal NC". And since kernel drivers tend to map such memory using ioremap_wc(), that pretty much implies ioremap_wc() should use "Normal NC" as well, doesn't it?

> [snip]
>
> KVM is an implementation of the ARM architecture, and doesn't really care about what WC is. If we come to the conclusion that Normal_NC is the natural match for the Prefetchable attributes, then we're good, and we can have Normal_NC being set by userspace, or even by VFIO. But I don't want to set it only because "it works when bare-metal Linux uses it". Remember that KVM doesn't only run Linux as guests.
>
> M.
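[Properties 4 and 5 above are the ones drivers most visibly program against. A minimal sketch of the usual pattern, where the ring/doorbell layout and all names are hypothetical: descriptors are streamed through the WC mapping with relaxed stores, then a write barrier orders them before the doorbell; on x86 the barrier drains the WC buffer (sfence), while on arm64 it orders the Normal-NC stores.]

    /* Sketch: post a descriptor over a WC/Normal-NC BAR mapping, then
     * ring a doorbell. Layout and names are hypothetical. */
    static void post_descriptor(void __iomem *wc_ring, void __iomem *regs,
                                const u64 *desc, size_t n, u32 db_offset)
    {
            size_t i;

            for (i = 0; i < n; i++)
                    writeq_relaxed(desc[i], wc_ring + i * sizeof(u64));

            wmb();  /* make the WC stores visible before the doorbell */
            writel(1, regs + db_offset);
    }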
> From: Mark Kettenis <mark.kettenis@xs4all.nl>
>
> > From: Marc Zyngier <maz@kernel.org>

snip

> > If, by enumerating the properties of Prefetchable, you can show that they are a strict superset of Normal_NC, I'm on board. I haven't seen such an enumeration so far.

snip

> > Right, so we have made a small step in the direction of mapping "prefetchable" onto "Normal_NC", thanks for that. What about all the other properties (unaligned accesses, ordering, gathering)?

Regarding gathering/write combining, that is also allowed for prefetchable ranges per the PCI spec. From 1.3.2.2 of the 5.0 base spec: "A PCI Express Endpoint requesting memory resources through a BAR must set the BAR's Prefetchable bit unless the range contains locations with read side-effects or locations in which the Function does not tolerate write merging."

Further, 7.5.1.2.1 says: "A Function is permitted to mark a range as prefetchable if there are no side effects on reads, the Function returns all bytes on reads regardless of the byte enables, and host bridges can merge processor writes into this range without causing errors."

The "regardless of the byte enables" suggests to me that unaligned is OK, as only certain byte enables may be set; what do you think?

So to me, prefetchable in the PCIe spec allows for write combining, reads without side effects (prefetched/speculative, as long as uncached), and unaligned accesses. Regarding ordering, I didn't find a statement one way or the other in the PCIe prefetchable definition, but I think that goes beyond what PCIe says or doesn't say anyway, since reordering can also happen in the CPU; and since the driver must be aware of correctness issues in its producer/consumer model, it will need the right barriers where they are required for correctness anyway (they are already required for the driver/userspace to work on the host with ioremap_wc()).

But perhaps the bigger question is this: since WC doesn't exist as a memory type on ARMv8, yet we are trying to fit something onto ioremap_wc() which came from the x86 world, shouldn't the arm64 memory type we use for WC match the semantics that drivers and userspace expect from ioremap_wc() as defined on x86, which, as Mark notes below, include unaligned access? If we agree to that, we can codify it in the documentation of ioremap_wc() and allow Normal NC on arm64 for ioremap_wc() in host or guest. Beyond that, if we don't want to do it automatically based on prefetchable, doing it from an explicit call from userspace is fine too.

> On x86, WC:
>
> 1. Is not cached (but stores are buffered).
>
> 2. Allows unaligned access, just like normal memory.
>
> [snip]
>
> However, this is where userland comes in. The userland graphics stack does assume that graphics memory mapped through a prefetchable PCIe BAR allows unaligned access if the architecture allows unaligned access to cacheable memory. On arm64 this means that such memory needs to be "Normal NC". And since kernel drivers tend to map such memory using ioremap_wc(), that pretty much implies ioremap_wc() should use "Normal NC" as well, doesn't it?
>
> [snip]
On Mon, 3 May 2021 13:59:43 +0000 Vikram Sethi <vsethi@nvidia.com> wrote:

> > From: Mark Kettenis <mark.kettenis@xs4all.nl>
> > > From: Marc Zyngier <maz@kernel.org>
>
> snip
>
> > > Right, so we have made a small step in the direction of mapping "prefetchable" onto "Normal_NC", thanks for that. What about all the other properties (unaligned accesses, ordering, gathering)?
>
> Regarding gathering/write combining, that is also allowed for prefetchable ranges per the PCI spec.

As others have stated, gather/write combining itself is not well defined.

> From 1.3.2.2 of the 5.0 base spec: "A PCI Express Endpoint requesting memory resources through a BAR must set the BAR's Prefetchable bit unless the range contains locations with read side-effects or locations in which the Function does not tolerate write merging."

"Write merging" is a very specific thing. Per PCI 3.0, 3.2.6:

Byte Merging – occurs when a sequence of individual memory writes (bytes or words) are merged into a single DWORD.

The semantics suggest quadword support in addition to dword, but don't require it. Writes to bytes within a dword can be merged, but duplicate writes cannot. It seems like an extremely liberal application to suggest that this one write semantic encompasses full write-combining semantics, which itself is not clearly defined.

> Further, 7.5.1.2.1 says: "A Function is permitted to mark a range as prefetchable if there are no side effects on reads, the Function returns all bytes on reads regardless of the byte enables, and host bridges can merge processor writes into this range without causing errors."
>
> The "regardless of the byte enables" suggests to me that unaligned is OK, as only certain byte enables may be set; what do you think?
>
> So to me, prefetchable in the PCIe spec allows for write combining, reads without

Ironically here, the above PCI spec section defining write merging has separate sections for "combining", "merging", and "collapsing". Only merging is indicated as a requirement for prefetchable resources.

> side effects (prefetched/speculative, as long as uncached), and unaligned accesses. Regarding ordering, I didn't find a statement one way or the other in the PCIe prefetchable definition, but I think that goes beyond what PCIe says or doesn't say anyway, since reordering can also happen in the CPU; and since the driver must be aware of correctness issues in its producer/consumer model, it will need the right barriers where they are required for correctness anyway (they are already required for the driver/userspace to work on the host with ioremap_wc()).

A lot of hand waving here, imo. Thanks,

Alex
Hi Alex,

> From: Alex Williamson <alex.williamson@redhat.com>
> On Mon, 3 May 2021 13:59:43 +0000 Vikram Sethi <vsethi@nvidia.com> wrote:
> > [snip]
>
> "Write merging" is a very specific thing. Per PCI 3.0, 3.2.6:
>
> Byte Merging – occurs when a sequence of individual memory writes (bytes or words) are merged into a single DWORD.
>
> The semantics suggest quadword support in addition to dword, but don't require it. Writes to bytes within a dword can be merged, but duplicate writes cannot. It seems like an extremely liberal application to suggest that this one write semantic encompasses full write-combining semantics, which itself is not clearly defined.

Talking to our PCIe SIG representative: PCIe switches are not allowed to do any of the byte merging/combining etc. as defined in the PCI spec, and a rather poorly worded implementation note in the spec says that no known PCIe host bridges/root ports do it either. So for PCIe we don't believe there is any byte merging that happens in the PCIe fabric; it's really a matter of what happens in the CPU core and interconnect before the transaction gets to the PCIe hierarchy.

Stepping back from this patchset, do you agree that it is desirable to support write combining, as understood by ioremap_wc(), in guests of all ISAs, including ARMv8? You note that x86 virtualization doesn't have this issue, but KVM-ARM does, because KVM maps all device BARs as Device memory type nGnRE, which doesn't allow ioremap_wc() from within the guest to get the actual semantics desired.

Marc and others have suggested that userspace should provide the hints. But the question is: how would QEMU vfio do this either? We would be stuck in the same arguments as here, as to what the correct way is to determine the desired attributes for a given BAR, such that eventually, when a driver in the guest asks for ioremap_wc(), it actually has a chance of working in the guest, on all ISAs.

Do you have any suggestions on how to make progress here? A device-specific list of which BARs are OK to allow ioremap_wc() for seems terrible, and I'm not sure a command-line QEMU option is any better. Is the user of device assignment/sysadmin supposed to know which BAR of which device is OK to allow ioremap_wc() for?

Will/Catalin, perhaps you could explain your thought process on why you chose Normal NC for ioremap_wc() on the armv8 Linux port instead of Device GRE or other Device Gxx.

Thanks,
Vikram
On Mon, May 03, 2021 at 10:03:59PM +0000, Vikram Sethi wrote:
> Will/Catalin, perhaps you could explain your thought process on why you
> chose Normal NC for ioremap_wc on the armv8 linux port instead of Device
> GRE or other Device Gxx.

I think a combination of: compatibility with 32-bit Arm, the need to
support unaligned accesses and the potential for higher performance.
Furthermore, ioremap() already gives you a Device memory type, and we're
tight on MAIR space.

Will
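For reference, this is how arm64 wires those choices up in
arch/arm64/include/asm/pgtable.h (kernels in the v5.12 timeframe,
reflowed here): pgprot_noncached() selects the Device-nGnRnE MAIR index,
while pgprot_writecombine(), which ioremap_wc() ends up using, selects
Normal Non-Cacheable.

  /* arch/arm64/include/asm/pgtable.h (circa v5.12), reflowed:
   * both helpers rewrite the pte's MAIR attribute index; only the
   * selected memory type differs.
   */
  #define pgprot_noncached(prot) \
          __pgprot_modify(prot, PTE_ATTRINDX_MASK, \
                          PTE_ATTRINDX(MT_DEVICE_nGnRnE) | PTE_PXN | PTE_UXN)
  #define pgprot_writecombine(prot) \
          __pgprot_modify(prot, PTE_ATTRINDX_MASK, \
                          PTE_ATTRINDX(MT_NORMAL_NC) | PTE_PXN | PTE_UXN)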
On Mon, 3 May 2021 22:03:59 +0000
Vikram Sethi <vsethi@nvidia.com> wrote:

> Hi Alex,
> > From: Alex Williamson <alex.williamson@redhat.com>
> > On Mon, 3 May 2021 13:59:43 +0000
> > Vikram Sethi <vsethi@nvidia.com> wrote:
> > > > From: Mark Kettenis <mark.kettenis@xs4all.nl>
> > > > > From: Marc Zyngier <maz@kernel.org>
> > > >
> > > > snip
> > > > > If, by enumerating the properties of Prefetchable, you can show
> > > > > that they are a strict superset of Normal_NC, I'm on board. I
> > > > > haven't seen such an enumeration so far.
> > > >
> > > > snip
> > > > > Right, so we have made a small step in the direction of mapping
> > > > > "prefetchable" onto "Normal_NC", thanks for that. What about all
> > > > > the other properties (unaligned accesses, ordering, gathering)?
> > >
> > > Regarding gathering/write combining, that is also allowed for
> > > prefetchable ranges per the PCI spec
> >
> > As others have stated, gather/write combining itself is not well defined.
> >
> > > From 1.3.2.2 of the 5.0 base spec:
> > > A PCI Express Endpoint requesting memory resources through a BAR must
> > > set the BAR's Prefetchable bit unless the range contains locations
> > > with read side-effects or locations in which the Function does not
> > > tolerate write merging.
> >
> > "Write merging" is a very specific thing, per PCI 3.0, 3.2.6:
> >
> >   Byte Merging – occurs when a sequence of individual memory writes
> >   (bytes or words) are merged into a single DWORD.
> >
> > The semantics suggest quadword support in addition to dword, but
> > don't require it. Writes to bytes within a dword can be merged,
> > but duplicate writes cannot.
> >
> > It seems like an extremely liberal application to suggest that
> > this one write semantic encompasses full write combining
> > semantics, which itself is not clearly defined.
> >
> Talking to our PCIe SIG representative, PCIe switches are not allowed
> to do any of the byte merging/combining etc. as defined in the PCI
> spec, and a rather poorly worded implementation note in the spec says
> that no known PCIe host bridges/root ports do it either. So for PCIe
> we don't believe there is any byte merging that happens in the PCIe
> fabric, and it's really a matter of what happens in the CPU core and
> interconnect before the transaction gets to the PCIe hierarchy.

Yes, but merged writes, no matter where they happen, are still the only
type of write combining that a prefetchable BAR on an endpoint is
required to support.

> Stepping back from this patchset, do you agree that it is desirable
> for write combining as understood by ioremap_wc to work in guests on
> all ISAs, including ARMv8?

Yes, a userspace vfio driver should be able to take advantage of the
hardware capabilities. I think where we disagree is whether it's
universally safe to assume write combining based on the PCI
prefetchable capability of a BAR. If that's something that can be
assumed universally for ARMv8, based on the architecture specification's
compatibility with the PCI definition of a prefetchable BAR, then I
would expect a helper somewhere in arch code that returns the right
page protection flags, so that arch maintainers don't need to scour
device drivers for architecture hacks. Otherwise, it needs to be
exposed through the vfio uAPI to allow the userspace device driver
itself to select these semantics.
> You note that x86 virtualization doesn't have this issue, but KVM-ARM
> does because KVM maps all device BARs as Device memory type nGnRE,
> which doesn't allow ioremap_wc from within the guest to get the actual
> semantics desired.
>
> Marc and others have suggested that userspace should provide the
> hints. But the question is how would QEMU vfio do this either? We
> would be stuck in the same arguments as here, as to what is the
> correct way to determine the desired attributes for a given BAR such
> that eventually, when a driver in the guest asks for ioremap_wc, it
> actually has a chance of working in the guest, on all ISAs.
> Do you have any suggestions on how to make progress here?

We do need some way for userspace drivers to also make use of WC
semantics; there were some discussions in the past (I think others have
referenced them as well), but nothing has been proposed for a vfio API.

If we had that API, QEMU deciding to universally enable WC for all vfio
prefetchable BARs seems only marginally better than this approach.
Ultimately the mapping should be based on the guest driver semantics,
and if you don't have any visibility into that on KVM/arm like we have
on KVM/x86, then it seems like there's nothing to trigger a vfio API
here anyway.

If that's the case, I'd probably go back to letting the arch/arm64 folks
declare that WC is compatible with the definition of PCI prefetchable
and export some sort of pgprot_pci_prefetchable() helper, where the
default would be to #define it as pgprot_noncached() if not overridden
by the arch.

> A device-specific list of which BARs are OK to allow ioremap_wc for
> seems terrible, and I'm not sure a command-line QEMU option is any
> better. Is the user of device assignment/sysadmin supposed to know
> which BAR of which device is OK to allow ioremap_wc for?

No, a device-specific userspace driver should know such device
semantics, but QEMU is not such a driver. Burdening the hypervisor
user/admin is not a good solution either. I'd lean on KVM/arm64 folks
to know how the guest driver semantics can be exposed to the
hypervisor. Thanks,

Alex
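A minimal sketch of the kind of arch-overridable helper suggested above
(pgprot_pci_prefetchable() is a hypothetical name floated in this
thread, not an existing kernel API):

  /* Generic fallback, e.g. in a common header; an arch that declares
   * WC compatible with PCI prefetchable overrides the #define.
   * Hypothetical, per the suggestion above.
   */
  #ifndef pgprot_pci_prefetchable
  #define pgprot_pci_prefetchable(prot)   pgprot_noncached(prot)
  #endif

  /* arm64 could then provide in its own headers:
   *
   *   #define pgprot_pci_prefetchable(prot)  pgprot_writecombine(prot)
   *
   * and vfio_pci_mmap() would stay arch-agnostic:
   */
          if (pci_resource_flags(pdev, index) & IORESOURCE_PREFETCH)
                  vma->vm_page_prot =
                          pgprot_pci_prefetchable(vma->vm_page_prot);
          else
                  vma->vm_page_prot = pgprot_noncached(vma->vm_page_prot);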
On Tue, May 04, 2021 at 09:30:05AM +0100, Will Deacon wrote:
> On Mon, May 03, 2021 at 10:03:59PM +0000, Vikram Sethi wrote:
> > Will/Catalin, perhaps you could explain your thought process on why you
> > chose Normal NC for ioremap_wc on the armv8 linux port instead of Device
> > GRE or other Device Gxx.
>
> I think a combination of: compatibility with 32-bit Arm, the need to
> support unaligned accesses and the potential for higher performance.

IIRC the _wc suffix also matches the pgprot_writecombine() used by some
drivers to map a video framebuffer into user space. Accesses to the
framebuffer are not guaranteed to be aligned (memset/memcpy don't ensure
alignment on arm64 and the user doesn't have a memset_io or memcpy_toio).

> Furthermore, ioremap() already gives you a Device memory type, and we're
> tight on MAIR space.

We have MT_DEVICE_GRE currently reserved though no in-kernel user; we
might as well remove it.
On Wed, May 05, 2021 at 07:02:31PM +0100, Catalin Marinas wrote:
> > Furthermore, ioremap() already gives you a Device memory type, and we're
> > tight on MAIR space.
>
> We have MT_DEVICE_GRE currently reserved though no in-kernel user; we
> might as well remove it.

Please do. The more we can cut down on different memory types, the
better.
Hi Marc,

On 5/5/21 1:02 PM, Catalin Marinas wrote:
> >> Will/Catalin, perhaps you could explain your thought process on why you
> >> chose Normal NC for ioremap_wc on the armv8 linux port instead of Device
> >> GRE or other Device Gxx.
> > I think a combination of: compatibility with 32-bit Arm, the need to
> > support unaligned accesses and the potential for higher performance.
> IIRC the _wc suffix also matches the pgprot_writecombine() used by some
> drivers to map a video framebuffer into user space. Accesses to the
> framebuffer are not guaranteed to be aligned (memset/memcpy don't ensure
> alignment on arm64 and the user doesn't have a memset_io or memcpy_toio).
>
> > Furthermore, ioremap() already gives you a Device memory type, and we're
> > tight on MAIR space.
> We have MT_DEVICE_GRE currently reserved though no in-kernel user; we
> might as well remove it.

@Marc, could you provide your thoughts/guidance on the next step? The
proposal of getting hints for prefetchable regions from VFIO/QEMU is not
recommended, so the only option left is to implement ARM64-dependent
logic in KVM.

Option-1: I think we could take advantage of the stage-1/2 combining
rules to allow the NORMAL_NC memory type for device memory in a VM:
always map device memory at stage-2 as NORMAL-NC and trust the VM's
stage-1 memory type.

---------------------------------------------------------------
 Stage-2 MT    Stage-1 MT      Resultant MT (combining-rules/FWB)
---------------------------------------------------------------
 Normal-NC     Normal-WT       Normal-NC
     -         Normal-WB           -
     -         Normal-NC           -
     -         Device-<attr>   Device-<attr>
---------------------------------------------------------------

We've been using this option internally for testing purposes and have
validated it with NVMe/Mellanox/GPU pass-through devices on the Marvell
ThunderX2 platform.

Option-2: Get the resource properties associated with the MMIO region
using lookup_resource() and map it at stage-2 as Normal-NC if
IORESOURCE_PREFETCH is set in the flags.
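To illustrate, Option-2 would amount to something like the following in
KVM's stage-2 fault handling. This is a rough sketch only, loosely
modelled on the shape of user_mem_abort() in arch/arm64/kvm/mmu.c;
kvm_is_prefetchable_mmio() and the KVM_PGTABLE_PROT_NORMAL_NC flag are
hypothetical names, not existing kernel code.

  /* Hypothetical sketch of Option-2; lookup_resource() is real
   * (kernel/resource.c), the rest is illustrative only.
   */
  static bool kvm_is_prefetchable_mmio(phys_addr_t pa)
  {
          /* Walk the iomem_resource tree for the resource that
           * contains this host physical address. */
          struct resource *res = lookup_resource(&iomem_resource, pa);

          return res && (res->flags & IORESOURCE_PREFETCH);
  }

  /* ...in the stage-2 mapping path, when mapping device memory: */
          if (device) {
                  if (kvm_is_prefetchable_mmio(__pfn_to_phys(pfn)))
                          prot |= KVM_PGTABLE_PROT_NORMAL_NC; /* hypothetical */
                  else
                          prot |= KVM_PGTABLE_PROT_DEVICE;
          }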
On Tue, 04 May 2021 19:03:48 +0100,
Alex Williamson <alex.williamson@redhat.com> wrote:
>
> On Mon, 3 May 2021 22:03:59 +0000
> Vikram Sethi <vsethi@nvidia.com> wrote:
>
> > Hi Alex,
> > > From: Alex Williamson <alex.williamson@redhat.com>
> > > On Mon, 3 May 2021 13:59:43 +0000
> > > Vikram Sethi <vsethi@nvidia.com> wrote:
> > > > > From: Mark Kettenis <mark.kettenis@xs4all.nl>
> > > > > > From: Marc Zyngier <maz@kernel.org>
> > > > >
> > > > > snip
> > > > > > If, by enumerating the properties of Prefetchable, you can show
> > > > > > that they are a strict superset of Normal_NC, I'm on board. I
> > > > > > haven't seen such an enumeration so far.
> > > > >
> > > > > snip
> > > > > > Right, so we have made a small step in the direction of mapping
> > > > > > "prefetchable" onto "Normal_NC", thanks for that. What about all
> > > > > > the other properties (unaligned accesses, ordering, gathering)?
> > > >
> > > > Regarding gathering/write combining, that is also allowed for
> > > > prefetchable ranges per the PCI spec
> > >
> > > As others have stated, gather/write combining itself is not well defined.
> > >
> > > > From 1.3.2.2 of the 5.0 base spec:
> > > > A PCI Express Endpoint requesting memory resources through a BAR must
> > > > set the BAR's Prefetchable bit unless the range contains locations
> > > > with read side-effects or locations in which the Function does not
> > > > tolerate write merging.
> > >
> > > "Write merging" is a very specific thing, per PCI 3.0, 3.2.6:
> > >
> > >   Byte Merging – occurs when a sequence of individual memory writes
> > >   (bytes or words) are merged into a single DWORD.
> > >
> > > The semantics suggest quadword support in addition to dword, but
> > > don't require it. Writes to bytes within a dword can be merged,
> > > but duplicate writes cannot.
> > >
> > > It seems like an extremely liberal application to suggest that
> > > this one write semantic encompasses full write combining
> > > semantics, which itself is not clearly defined.
> > >
> > Talking to our PCIe SIG representative, PCIe switches are not
> > allowed to do any of the byte merging/combining etc. as defined in
> > the PCI spec, and a rather poorly worded implementation note in the
> > spec says that no known PCIe host bridges/root ports do it either.
> > So for PCIe we don't believe there is any byte merging that happens
> > in the PCIe fabric, and it's really a matter of what happens in the
> > CPU core and interconnect before the transaction gets to the PCIe
> > hierarchy.
>
> Yes, but merged writes, no matter where they happen, are still the only
> type of write combining that a prefetchable BAR on an endpoint is
> required to support.
>
> > Stepping back from this patchset, do you agree that it is desirable
> > for write combining as understood by ioremap_wc to work in guests
> > on all ISAs, including ARMv8?
>
> Yes, a userspace vfio driver should be able to take advantage of the
> hardware capabilities. I think where we disagree is whether it's
> universally safe to assume write combining based on the PCI
> prefetchable capability of a BAR. If that's something that can be
> assumed universally for ARMv8, based on the architecture specification's
> compatibility with the PCI definition of a prefetchable BAR, then I
> would expect a helper somewhere in arch code that returns the right
> page protection flags, so that arch maintainers don't need to scour
> device drivers for architecture hacks.
> Otherwise, it needs to be exposed through the vfio uAPI to allow the
> userspace device driver itself to select these semantics.
>
> > You note that x86 virtualization doesn't have this issue, but
> > KVM-ARM does because KVM maps all device BARs as Device memory type
> > nGnRE, which doesn't allow ioremap_wc from within the guest to get
> > the actual semantics desired.
> >
> > Marc and others have suggested that userspace should provide the
> > hints. But the question is how would QEMU vfio do this either? We
> > would be stuck in the same arguments as here, as to what is the
> > correct way to determine the desired attributes for a given BAR
> > such that eventually, when a driver in the guest asks for
> > ioremap_wc, it actually has a chance of working in the guest, on
> > all ISAs. Do you have any suggestions on how to make progress here?
>
> We do need some way for userspace drivers to also make use of WC
> semantics; there were some discussions in the past (I think others
> have referenced them as well), but nothing has been proposed for a
> vfio API.
>
> If we had that API, QEMU deciding to universally enable WC for all
> vfio prefetchable BARs seems only marginally better than this
> approach. Ultimately the mapping should be based on the guest driver
> semantics, and if you don't have any visibility into that on KVM/arm
> like we have on KVM/x86, then it seems like there's nothing to
> trigger a vfio API here anyway.

There isn't much KVM/arm64 can do here unless it is being told what to
do. We don't have visibility into the guest's page tables in a reliable
way, and trusting them is not something I want to entertain anyway.

> If that's the case, I'd probably go back to letting the arch/arm64
> folks declare that WC is compatible with the definition of PCI
> prefetchable and export some sort of pgprot_pci_prefetchable() helper,
> where the default would be to #define it as pgprot_noncached() if not
> overridden by the arch.
>
> > A device-specific list of which BARs are OK to allow ioremap_wc for
> > seems terrible, and I'm not sure a command-line QEMU option is any
> > better. Is the user of device assignment/sysadmin supposed to know
> > which BAR of which device is OK to allow ioremap_wc for?
>
> No, a device-specific userspace driver should know such device
> semantics, but QEMU is not such a driver. Burdening the hypervisor
> user/admin is not a good solution either. I'd lean on KVM/arm64 folks
> to know how the guest driver semantics can be exposed to the
> hypervisor. Thanks,

I don't see a good way for that, unless we make it a per-guest buy-in
where all PCI prefetchable mappings get the same treatment. I'm
prepared to bet that this will break when two devices have different
requirements. It would also require userspace to buy into this scheme,
which is crap.

Exposing the guest's preference on a per-device basis seems difficult
(KVM knows nothing about the PCI devices) and would require some PV
interface that will quickly become unmaintainable.

M.
Hi Shanker,

On Sat, 08 May 2021 17:33:11 +0100,
Shanker R Donthineni <sdonthineni@nvidia.com> wrote:
>
> Hi Marc,
>
> On 5/5/21 1:02 PM, Catalin Marinas wrote:
> >>> Will/Catalin, perhaps you could explain your thought process on why
> >>> you chose Normal NC for ioremap_wc on the armv8 linux port instead
> >>> of Device GRE or other Device Gxx.
> >> I think a combination of: compatibility with 32-bit Arm, the need to
> >> support unaligned accesses and the potential for higher performance.
> > IIRC the _wc suffix also matches the pgprot_writecombine() used by some
> > drivers to map a video framebuffer into user space. Accesses to the
> > framebuffer are not guaranteed to be aligned (memset/memcpy don't ensure
> > alignment on arm64 and the user doesn't have a memset_io or memcpy_toio).
> >
> >> Furthermore, ioremap() already gives you a Device memory type, and we're
> >> tight on MAIR space.
> > We have MT_DEVICE_GRE currently reserved though no in-kernel user; we
> > might as well remove it.
> @Marc, could you provide your thoughts/guidance on the next step? The
> proposal of getting hints for prefetchable regions from VFIO/QEMU is
> not recommended, so the only option left is to implement
> ARM64-dependent logic in KVM.
>
> Option-1: I think we could take advantage of the stage-1/2 combining
> rules to allow the NORMAL_NC memory type for device memory in a VM:
> always map device memory at stage-2 as NORMAL-NC and trust the VM's
> stage-1 memory type.
>
> ---------------------------------------------------------------
>  Stage-2 MT    Stage-1 MT      Resultant MT (combining-rules/FWB)
> ---------------------------------------------------------------
>  Normal-NC     Normal-WT       Normal-NC
>      -         Normal-WB           -
>      -         Normal-NC           -
>      -         Device-<attr>   Device-<attr>
> ---------------------------------------------------------------

I think this is unwise. Will recently debugged a pretty horrible
situation when doing exactly that: when S1 is off and S2 is on, the
I-side is allowed to generate speculative accesses (see ARMv8 ARM G.a
D5.2.9 for the details). And yes, implementations definitely do that.
Add side-effect reads to the mix, and you're in for a treat.

> We've been using this option internally for testing purposes and have
> validated it with NVMe/Mellanox/GPU pass-through devices on the
> Marvell ThunderX2 platform.

See above. It *will* break eventually.

> Option-2: Get the resource properties associated with the MMIO region
> using lookup_resource() and map it at stage-2 as Normal-NC if
> IORESOURCE_PREFETCH is set in the flags.

That's a pretty roundabout way of doing exactly the same thing you
initially proposed. And it suffers from the exact same problems, which
is that you change the semantics of the mapping without knowing what
the guest's intent is.

M.
diff --git a/drivers/vfio/pci/vfio_pci.c b/drivers/vfio/pci/vfio_pci.c
index 5023e23db3bc..1b734fe1dd51 100644
--- a/drivers/vfio/pci/vfio_pci.c
+++ b/drivers/vfio/pci/vfio_pci.c
@@ -1703,7 +1703,11 @@ static int vfio_pci_mmap(void *device_data, struct vm_area_struct *vma)
 	}
 
 	vma->vm_private_data = vdev;
-	vma->vm_page_prot = pgprot_noncached(vma->vm_page_prot);
+	if (IS_ENABLED(CONFIG_ARM64) &&
+	    (pci_resource_flags(pdev, index) & IORESOURCE_PREFETCH))
+		vma->vm_page_prot = pgprot_writecombine(vma->vm_page_prot);
+	else
+		vma->vm_page_prot = pgprot_noncached(vma->vm_page_prot);
 	vma->vm_pgoff = (pci_resource_start(pdev, index) >> PAGE_SHIFT) + pgoff;
 
 	/*
For pass-through device assignment, the ARM64 KVM hypervisor retrieves
the memory region properties (physical address, size, and whether a
region is backed with struct page or not) from the VMA. The prefetchable
attribute of a BAR region isn't visible to KVM, so it cannot make an
optimal decision for stage2 attributes.

This patch updates vma->vm_page_prot and maps with the write-combine
attribute if the associated BAR is prefetchable. On ARM64,
pgprot_writecombine() maps to memory type MT_NORMAL_NC, which has no
side effects on reads and allows multiple writes to be combined.

Signed-off-by: Shanker Donthineni <sdonthineni@nvidia.com>
---
 drivers/vfio/pci/vfio_pci.c | 6 +++++-
 1 file changed, 5 insertions(+), 1 deletion(-)