
[qemu,v7,2/4] vfio/pci: Relax DMA map errors for MMIO regions

Message ID 20180209075503.16996-3-aik@ozlabs.ru (mailing list archive)
State New, archived

Commit Message

Alexey Kardashevskiy Feb. 9, 2018, 7:55 a.m. UTC
At the moment if vfio_memory_listener is registered in the system memory
address space, it maps/unmaps every RAM memory region for DMA.
It expects system page size aligned memory sections so vfio_dma_map
would not fail and so far this has been the case. A mapping failure
would be fatal. A side effect of such behavior is that some MMIO pages
would be silently left unmapped.

However we are going to change MSIX BAR handling so we will end up having
non-aligned sections in vfio_memory_listener (more details are in
the next patch) and vfio_dma_map will exit QEMU.

In order to avoid fatal failures on what previously was not a failure and
was just silently ignored, this checks the section alignment against
the smallest supported IOMMU page size and prints an error if not aligned;
it also prints an error if vfio_dma_map failed despite the page size check.
Neither error is fatal; only MMIO RAM regions are checked
(aka "RAM device" regions).

If the number of errors printed is overwhelming, MSIX relocation
could be used to avoid excessive error output.

This is unlikely to cause any behavioral change.

Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
---
 hw/vfio/common.c | 55 +++++++++++++++++++++++++++++++++++++++++++++++++------
 1 file changed, 49 insertions(+), 6 deletions(-)
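
A note on the mechanics of the check: the smallest supported IOMMU page size
is derived from the container's page-size bitmap. A minimal standalone sketch
of that arithmetic (the iova_pgsizes and section values below are made up for
illustration):

#include <inttypes.h>
#include <stdio.h>

int main(void)
{
    /* bitmap of supported IOMMU page sizes, e.g. 4K and 64K */
    uint64_t iova_pgsizes = (1ULL << 12) | (1ULL << 16);
    /* lowest set bit = smallest page size; QEMU uses ctz64() for this */
    uint64_t pgmask = (1ULL << __builtin_ctzll(iova_pgsizes)) - 1;
    /* a section that is not 4K aligned */
    uint64_t iova = 0x80000800, size = 0x800;

    if ((iova & pgmask) || (size & pgmask)) {
        printf("0x%" PRIx64 "..0x%" PRIx64 " is not aligned to 0x%" PRIx64
               " and cannot be mapped for DMA\n",
               iova, iova + size, pgmask + 1);
    }
    return 0;
}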

Comments

David Gibson Feb. 12, 2018, 5:19 a.m. UTC | #1
On Fri, Feb 09, 2018 at 06:55:01PM +1100, Alexey Kardashevskiy wrote:
> At the moment if vfio_memory_listener is registered in the system memory
> address space, it maps/unmaps every RAM memory region for DMA.
> It expects system page size aligned memory sections so vfio_dma_map
> would not fail and so far this has been the case. A mapping failure
> would be fatal. A side effect of such behavior is that some MMIO pages
> would be silently left unmapped.
>
> However we are going to change MSIX BAR handling so we will end up having
> non-aligned sections in vfio_memory_listener (more details are in
> the next patch) and vfio_dma_map will exit QEMU.
>
> In order to avoid fatal failures on what previously was not a failure and
> was just silently ignored, this checks the section alignment against
> the smallest supported IOMMU page size and prints an error if not aligned;
> it also prints an error if vfio_dma_map failed despite the page size check.
> Neither error is fatal; only MMIO RAM regions are checked
> (aka "RAM device" regions).
>
> If the number of errors printed is overwhelming, MSIX relocation
> could be used to avoid excessive error output.
> 
> This is unlikely to cause any behavioral change.
> 
> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>

There are some relatively superficial problems noted below.

But more fundamentally, this feels like it's extending an existing
hack past the point of usefulness.

The explicit check for is_ram_device() here has always bothered me -
it's not like a real bus bridge magically knows whether a target
address maps to RAM or not.

What I think is really going on is that even for systems without an
IOMMU, it's not really true to say that the PCI address space maps
directly onto address_space_memory.  Instead, there's a large, but
much less than 2^64 sized, "upstream window" at address 0 on the PCI
bus, which is identity mapped to the system bus.  Details will vary
with the system, but in practice we expect nothing but RAM to be in
that window.  Addresses not within that window won't be mapped to the
system bus but will just be broadcast on the PCI bus and might be
picked up as a p2p transaction.  Actually on classic PC, I suspect
there may be two windows, below and above the "ISA IO hole".

With PCIe it gets more complicated, of course.  I still suspect
there's some sort of upstream window to the host, but whether things
outside the window get reflected back down the PCIe hierarchy will
depend on those p2p relevant configuration parameters.

Maybe it's time we had a detailed look at what really happens in
physical bridges, rather than faking it with the is_ram_device()
check?
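
For concreteness, such a window could be modelled in QEMU with an alias
memory region. A hypothetical sketch with made-up names and a made-up
1 TiB window size, not existing code:

static void init_pci_upstream_window(Object *owner, MemoryRegion *pci_root)
{
    MemoryRegion *upstream = g_new0(MemoryRegion, 1);

    /* identity-map the first 1 TiB of the system bus at PCI address 0 */
    memory_region_init_alias(upstream, owner, "pci-upstream-window",
                             get_system_memory(), 0, 1ULL << 40);
    memory_region_add_subregion(pci_root, 0, upstream);
}

Anything outside the window would then simply have no route to the system
bus, without any is_ram_device() special-casing.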

Anyway, that said, the patch below might still be a reasonable interim
hack, once the smaller problems are fixed.

> ---
>  hw/vfio/common.c | 55 +++++++++++++++++++++++++++++++++++++++++++++++++------
>  1 file changed, 49 insertions(+), 6 deletions(-)
> 
> diff --git a/hw/vfio/common.c b/hw/vfio/common.c
> index f895e3c..736f271 100644
> --- a/hw/vfio/common.c
> +++ b/hw/vfio/common.c
> @@ -544,18 +544,40 @@ static void vfio_listener_region_add(MemoryListener *listener,
>  
>      llsize = int128_sub(llend, int128_make64(iova));
>  
> +    if (memory_region_is_ram_device(section->mr)) {
> +        hwaddr pgmask = (1ULL << ctz64(hostwin->iova_pgsizes)) - 1;
> +
> +        if ((iova & pgmask) || (llsize & pgmask)) {
> +            error_report("Region 0x%"HWADDR_PRIx"..0x%"HWADDR_PRIx
> +                         " is not aligned to 0x%"HWADDR_PRIx
> +                         " and cannot be mapped for DMA",
> +                         section->offset_within_region,
> +                         int128_getlo(section->size),
> +                         pgmask + 1);
> +            return;
> +        }
> +    }
> +
>      ret = vfio_dma_map(container, iova, int128_get64(llsize),
>                         vaddr, section->readonly);
>      if (ret) {
>          error_report("vfio_dma_map(%p, 0x%"HWADDR_PRIx", "
>                       "0x%"HWADDR_PRIx", %p) = %d (%m)",
>                       container, iova, int128_get64(llsize), vaddr, ret);
> +        if (memory_region_is_ram_device(section->mr)) {
> +            /* Allow unexpected mappings not to be fatal for RAM devices */
> +            return;
> +        }
>          goto fail;
>      }
>  
>      return;
>  
>  fail:
> +    if (memory_region_is_ram_device(section->mr)) {
> +        error_report("failed to vfio_dma_map. pci p2p may not work");

Isn't this logic exactly backwards?  p2p will be affected when a *non
RAM* device (itself probably a PCI MMIO window) can't be mapped in.

> +        return;
> +    }
>      /*
>       * On the initfn path, store the first error in the container so we
>       * can gracefully fail.  Runtime, there's not much we can do other
> @@ -577,6 +599,7 @@ static void vfio_listener_region_del(MemoryListener *listener,
>      hwaddr iova, end;
>      Int128 llend, llsize;
>      int ret;
> +    bool try_unmap = true;
>  
>      if (vfio_listener_skipped_section(section)) {
>          trace_vfio_listener_region_del_skip(
> @@ -629,13 +652,33 @@ static void vfio_listener_region_del(MemoryListener *listener,
>  
>      trace_vfio_listener_region_del(iova, end);
>  
> -    ret = vfio_dma_unmap(container, iova, int128_get64(llsize));
> +    if (memory_region_is_ram_device(section->mr)) {
> +        hwaddr pgmask;
> +        VFIOHostDMAWindow *hostwin;
> +        bool hostwin_found = false;
> +
> +        QLIST_FOREACH(hostwin, &container->hostwin_list, hostwin_next) {
> +            if (hostwin->min_iova <= iova && end <= hostwin->max_iova) {
> +                hostwin_found = true;
> +                break;
> +            }
> +        }
> +        assert(hostwin_found); /* or region_add() would have failed */
> +
> +        pgmask = (1ULL << ctz64(hostwin->iova_pgsizes)) - 1;
> +        try_unmap = !((iova & pgmask) || (llsize & pgmask));

Explicit page mask checks seem bogus here.  We should unmap based on
whether we mapped, not on some other criterion which may or may not
evaluate to the same thing.

> +    }
> +
> +    if (try_unmap) {
> +        ret = vfio_dma_unmap(container, iova, int128_get64(llsize));
> +        if (ret) {
> +            error_report("vfio_dma_unmap(%p, 0x%"HWADDR_PRIx", "
> +                         "0x%"HWADDR_PRIx") = %d (%m)",
> +                         container, iova, int128_get64(llsize), ret);
> +        }
> +    }
> +
>      memory_region_unref(section->mr);
> -    if (ret) {
> -        error_report("vfio_dma_unmap(%p, 0x%"HWADDR_PRIx", "
> -                     "0x%"HWADDR_PRIx") = %d (%m)",
> -                     container, iova, int128_get64(llsize), ret);
> -    }
>  
>      if (container->iommu_type == VFIO_SPAPR_TCE_v2_IOMMU) {
>          vfio_spapr_remove_window(container,
Alexey Kardashevskiy Feb. 12, 2018, 7:05 a.m. UTC | #2
On 12/02/18 16:19, David Gibson wrote:
> On Fri, Feb 09, 2018 at 06:55:01PM +1100, Alexey Kardashevskiy wrote:
>> At the moment if vfio_memory_listener is registered in the system memory
>> address space, it maps/unmaps every RAM memory region for DMA.
>> It expects system page size aligned memory sections so vfio_dma_map
>> would not fail and so far this has been the case. A mapping failure
>> would be fatal. A side effect of such behavior is that some MMIO pages
>> would be silently left unmapped.
>>
>> However we are going to change MSIX BAR handling so we will end up having
>> non-aligned sections in vfio_memory_listener (more details are in
>> the next patch) and vfio_dma_map will exit QEMU.
>>
>> In order to avoid fatal failures on what previously was not a failure and
>> was just silently ignored, this checks the section alignment against
>> the smallest supported IOMMU page size and prints an error if not aligned;
>> it also prints an error if vfio_dma_map failed despite the page size check.
>> Neither error is fatal; only MMIO RAM regions are checked
>> (aka "RAM device" regions).
>>
>> If the number of errors printed is overwhelming, MSIX relocation
>> could be used to avoid excessive error output.
>>
>> This is unlikely to cause any behavioral change.
>>
>> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
> 
> There are some relatively superficial problems noted below.
> 
> But more fundamentally, this feels like it's extending an existing
> hack past the point of usefulness.
> 
> The explicit check for is_ram_device() here has always bothered me -
> it's not like a real bus bridge magically knows whether a target
> address maps to RAM or not.
> 
> What I think is really going on is that even for systems without an
> IOMMU, it's not really true to say that the PCI address space maps
> directly onto address_space_memory.  Instead, there's a large, but
> much less than 2^64 sized, "upstream window" at address 0 on the PCI
> bus, which is identity mapped to the system bus.  Details will vary
> with the system, but in practice we expect nothing but RAM to be in
> that window.  Addresses not within that window won't be mapped to the
> system bus but will just be broadcast on the PCI bus and might be
> picked up as a p2p transaction.

Currently this p2p works only via the IOMMU; direct p2p is not possible as
the guest needs to know physical MMIO addresses to make p2p work and it
does not.


> Actually on classic PC, I suspect
> there may be two windows, below and above the "ISA IO hole".
> 
> With PCIe it gets more complicated, of course.  I still suspect
> there's some sort of upstream window to the host, but whether things
> outside the window get reflected back down the PCIe hierarchy will
> depend on those p2p relevant configuration parameters.
> 
> Maybe it's time we had a detailed look at what really happens in
> physical bridges, rather than faking it with the is_ram_device()
> check?
> 
> Anyway, that said, the patch below might still be a reasonable interim
> hack, once the smaller problems are fixed.
> 
>> ---
>>  hw/vfio/common.c | 55 +++++++++++++++++++++++++++++++++++++++++++++++++------
>>  1 file changed, 49 insertions(+), 6 deletions(-)
>>
>> diff --git a/hw/vfio/common.c b/hw/vfio/common.c
>> index f895e3c..736f271 100644
>> --- a/hw/vfio/common.c
>> +++ b/hw/vfio/common.c
>> @@ -544,18 +544,40 @@ static void vfio_listener_region_add(MemoryListener *listener,
>>  
>>      llsize = int128_sub(llend, int128_make64(iova));
>>  
>> +    if (memory_region_is_ram_device(section->mr)) {
>> +        hwaddr pgmask = (1ULL << ctz64(hostwin->iova_pgsizes)) - 1;
>> +
>> +        if ((iova & pgmask) || (llsize & pgmask)) {
>> +            error_report("Region 0x%"HWADDR_PRIx"..0x%"HWADDR_PRIx
>> +                         " is not aligned to 0x%"HWADDR_PRIx
>> +                         " and cannot be mapped for DMA",
>> +                         section->offset_within_region,
>> +                         int128_getlo(section->size),
>> +                         pgmask + 1);
>> +            return;
>> +        }
>> +    }
>> +
>>      ret = vfio_dma_map(container, iova, int128_get64(llsize),
>>                         vaddr, section->readonly);
>>      if (ret) {
>>          error_report("vfio_dma_map(%p, 0x%"HWADDR_PRIx", "
>>                       "0x%"HWADDR_PRIx", %p) = %d (%m)",
>>                       container, iova, int128_get64(llsize), vaddr, ret);
>> +        if (memory_region_is_ram_device(section->mr)) {
>> +            /* Allow unexpected mappings not to be fatal for RAM devices */
>> +            return;
>> +        }
>>          goto fail;
>>      }
>>  
>>      return;
>>  
>>  fail:
>> +    if (memory_region_is_ram_device(section->mr)) {
>> +        error_report("failed to vfio_dma_map. pci p2p may not work");
> 
> Isn't this logic exactly backwards?  p2p will be affected when a *non
> RAM* device (itself probably a PCI MMIO window) can't be mapped in.


"RAM device MR" == "mapped MMIO MR". "Non RAM device" is just the guest
RAM. Everything else (such as IO MRs) is filtered out by
vfio_listener_skipped_section().
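
For context, the filter at this point in hw/vfio/common.c looks roughly
like this:

static bool vfio_listener_skipped_section(MemoryRegionSection *section)
{
    return (!memory_region_is_ram(section->mr) &&
            !memory_region_is_iommu(section->mr)) ||
           /* sizing an enabled 64-bit BAR can cause spurious mappings */
           section->offset_within_address_space & (1ULL << 63);
}

Note that memory_region_is_ram() also returns true for "RAM device"
regions, which is why MMIO BARs make it past this filter.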



> 
>> +        return;
>> +    }
>>      /*
>>       * On the initfn path, store the first error in the container so we
>>       * can gracefully fail.  Runtime, there's not much we can do other
>> @@ -577,6 +599,7 @@ static void vfio_listener_region_del(MemoryListener *listener,
>>      hwaddr iova, end;
>>      Int128 llend, llsize;
>>      int ret;
>> +    bool try_unmap = true;
>>  
>>      if (vfio_listener_skipped_section(section)) {
>>          trace_vfio_listener_region_del_skip(
>> @@ -629,13 +652,33 @@ static void vfio_listener_region_del(MemoryListener *listener,
>>  
>>      trace_vfio_listener_region_del(iova, end);
>>  
>> -    ret = vfio_dma_unmap(container, iova, int128_get64(llsize));
>> +    if (memory_region_is_ram_device(section->mr)) {
>> +        hwaddr pgmask;
>> +        VFIOHostDMAWindow *hostwin;
>> +        bool hostwin_found = false;
>> +
>> +        QLIST_FOREACH(hostwin, &container->hostwin_list, hostwin_next) {
>> +            if (hostwin->min_iova <= iova && end <= hostwin->max_iova) {
>> +                hostwin_found = true;
>> +                break;
>> +            }
>> +        }
>> +        assert(hostwin_found); /* or region_add() would have failed */
>> +
>> +        pgmask = (1ULL << ctz64(hostwin->iova_pgsizes)) - 1;
>> +        try_unmap = !((iova & pgmask) || (llsize & pgmask));
> 
> Explicit page mask checks seem bogus here.  We should unmap based on
> whether we mapped, not on some other criterion which may or may not
> evaluate to the same thing.


There is no tracking of mapped sections on the QEMU side; adding it for
the purpose of this patchset looks like overkill.
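
To illustrate what such tracking would involve, a hypothetical sketch;
none of these names exist in QEMU:

typedef struct VFIOMappedRange {
    hwaddr iova;
    hwaddr size;
    QLIST_ENTRY(VFIOMappedRange) next;
} VFIOMappedRange;

/* in region_add(), after a successful vfio_dma_map() */
VFIOMappedRange *r = g_new0(VFIOMappedRange, 1);
r->iova = iova;
r->size = int128_get64(llsize);
QLIST_INSERT_HEAD(&container->mapped_ranges, r, next);

region_del() would then look the section up in mapped_ranges and unmap
only on a hit, instead of re-deriving the page mask, at the cost of an
extra list and a lookup on every listener event.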



>> +    }
>> +
>> +    if (try_unmap) {
>> +        ret = vfio_dma_unmap(container, iova, int128_get64(llsize));
>> +        if (ret) {
>> +            error_report("vfio_dma_unmap(%p, 0x%"HWADDR_PRIx", "
>> +                         "0x%"HWADDR_PRIx") = %d (%m)",
>> +                         container, iova, int128_get64(llsize), ret);
>> +        }
>> +    }
>> +
>>      memory_region_unref(section->mr);
>> -    if (ret) {
>> -        error_report("vfio_dma_unmap(%p, 0x%"HWADDR_PRIx", "
>> -                     "0x%"HWADDR_PRIx") = %d (%m)",
>> -                     container, iova, int128_get64(llsize), ret);
>> -    }
>>  
>>      if (container->iommu_type == VFIO_SPAPR_TCE_v2_IOMMU) {
>>          vfio_spapr_remove_window(container,
>
Alex Williamson Feb. 12, 2018, 4:06 p.m. UTC | #3
On Mon, 12 Feb 2018 18:05:54 +1100
Alexey Kardashevskiy <aik@ozlabs.ru> wrote:

> On 12/02/18 16:19, David Gibson wrote:
> > On Fri, Feb 09, 2018 at 06:55:01PM +1100, Alexey Kardashevskiy wrote:  
> >> At the moment if vfio_memory_listener is registered in the system memory
> >> address space, it maps/unmaps every RAM memory region for DMA.
> >> It expects system page size aligned memory sections so vfio_dma_map
> >> would not fail and so far this has been the case. A mapping failure
> >> would be fatal. A side effect of such behavior is that some MMIO pages
> >> would be silently left unmapped.
> >>
> >> However we are going to change MSIX BAR handling so we will end up having
> >> non-aligned sections in vfio_memory_listener (more details are in
> >> the next patch) and vfio_dma_map will exit QEMU.
> >>
> >> In order to avoid fatal failures on what previously was not a failure and
> >> was just silently ignored, this checks the section alignment against
> >> the smallest supported IOMMU page size and prints an error if not aligned;
> >> it also prints an error if vfio_dma_map failed despite the page size check.
> >> Neither error is fatal; only MMIO RAM regions are checked
> >> (aka "RAM device" regions).
> >>
> >> If the number of errors printed is overwhelming, MSIX relocation
> >> could be used to avoid excessive error output.
> >>
> >> This is unlikely to cause any behavioral change.
> >>
> >> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>  
> > 
> > There are some relatively superficial problems noted below.
> > 
> > But more fundamentally, this feels like it's extending an existing
> > hack past the point of usefulness.
> > 
> > The explicit check for is_ram_device() here has always bothered me -
> > it's not like a real bus bridge magically knows whether a target
> > address maps to RAM or not.
> > 
> > What I think is really going on is that even for systems without an
> > IOMMU, it's not really true to say that the PCI address space maps
> > directly onto address_space_memory.  Instead, there's a large, but
> > much less than 2^64 sized, "upstream window" at address 0 on the PCI
> > bus, which is identity mapped to the system bus.  Details will vary
> > with the system, but in practice we expect nothing but RAM to be in
> > that window.  Addresses not within that window won't be mapped to the
> > system bus but will just be broadcast on the PCI bus and might be
> > picked up as a p2p transaction.  
> 
> Currently this p2p works only via the IOMMU; direct p2p is not possible as
> the guest needs to know physical MMIO addresses to make p2p work and it
> does not.

/me points to the Direct Translated P2P section of the ACS spec, though
it's as prone to spoofing by the device as ATS.  In any case, p2p
reflected from the IOMMU is still p2p and offloads the CPU even if
bandwidth suffers vs bare metal depending on if the data doubles back
over any links.  Thanks,

Alex
Alexey Kardashevskiy Feb. 13, 2018, 1:15 a.m. UTC | #4
On 13/02/18 03:06, Alex Williamson wrote:
> On Mon, 12 Feb 2018 18:05:54 +1100
> Alexey Kardashevskiy <aik@ozlabs.ru> wrote:
> 
>> On 12/02/18 16:19, David Gibson wrote:
>>> On Fri, Feb 09, 2018 at 06:55:01PM +1100, Alexey Kardashevskiy wrote:  
>>>> At the moment if vfio_memory_listener is registered in the system memory
>>>> address space, it maps/unmaps every RAM memory region for DMA.
>>>> It expects system page size aligned memory sections so vfio_dma_map
>>>> would not fail and so far this has been the case. A mapping failure
>>>> would be fatal. A side effect of such behavior is that some MMIO pages
>>>> would be silently left unmapped.
>>>>
>>>> However we are going to change MSIX BAR handling so we will end up having
>>>> non-aligned sections in vfio_memory_listener (more details are in
>>>> the next patch) and vfio_dma_map will exit QEMU.
>>>>
>>>> In order to avoid fatal failures on what previously was not a failure and
>>>> was just silently ignored, this checks the section alignment against
>>>> the smallest supported IOMMU page size and prints an error if not aligned;
>>>> it also prints an error if vfio_dma_map failed despite the page size check.
>>>> Neither error is fatal; only MMIO RAM regions are checked
>>>> (aka "RAM device" regions).
>>>>
>>>> If the number of errors printed is overwhelming, MSIX relocation
>>>> could be used to avoid excessive error output.
>>>>
>>>> This is unlikely to cause any behavioral change.
>>>>
>>>> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>  
>>>
>>> There are some relatively superficial problems noted below.
>>>
>>> But more fundamentally, this feels like it's extending an existing
>>> hack past the point of usefulness.
>>>
>>> The explicit check for is_ram_device() here has always bothered me -
>>> it's not like a real bus bridge magically knows whether a target
>>> address maps to RAM or not.
>>>
>>> What I think is really going on is that even for systems without an
>>> IOMMU, it's not really true to say that the PCI address space maps
>>> directly onto address_space_memory.  Instead, there's a large, but
>>> much less than 2^64 sized, "upstream window" at address 0 on the PCI
>>> bus, which is identity mapped to the system bus.  Details will vary
>>> with the system, but in practice we expect nothing but RAM to be in
>>> that window.  Addresses not within that window won't be mapped to the
>>> system bus but will just be broadcast on the PCI bus and might be
>>> picked up as a p2p transaction.  
>>
>> Currently this p2p works only via the IOMMU; direct p2p is not possible as
>> the guest needs to know physical MMIO addresses to make p2p work and it
>> does not.
> 
> /me points to the Direct Translated P2P section of the ACS spec, though
> it's as prone to spoofing by the device as ATS.  In any case, p2p
> reflected from the IOMMU is still p2p and offloads the CPU even if
> bandwidth suffers vs bare metal depending on if the data doubles back
> over any links.  Thanks,

Sure, I was just saying that p2p via IOMMU won't be as simple as broadcast
on the PCI bus; the IOMMU needs to be programmed in advance to make this work,
and currently that broadcast won't work for the passed through devices.
David Gibson Feb. 13, 2018, 5:36 a.m. UTC | #5
On Tue, Feb 13, 2018 at 12:15:52PM +1100, Alexey Kardashevskiy wrote:
> On 13/02/18 03:06, Alex Williamson wrote:
> > On Mon, 12 Feb 2018 18:05:54 +1100
> > Alexey Kardashevskiy <aik@ozlabs.ru> wrote:
> > 
> >> On 12/02/18 16:19, David Gibson wrote:
> >>> On Fri, Feb 09, 2018 at 06:55:01PM +1100, Alexey Kardashevskiy wrote:  
> >>>> At the moment if vfio_memory_listener is registered in the system memory
> >>>> address space, it maps/unmaps every RAM memory region for DMA.
> >>>> It expects system page size aligned memory sections so vfio_dma_map
> >>>> would not fail and so far this has been the case. A mapping failure
> >>>> would be fatal. A side effect of such behavior is that some MMIO pages
> >>>> would be silently left unmapped.
> >>>>
> >>>> However we are going to change MSIX BAR handling so we will end up having
> >>>> non-aligned sections in vfio_memory_listener (more details are in
> >>>> the next patch) and vfio_dma_map will exit QEMU.
> >>>>
> >>>> In order to avoid fatal failures on what previously was not a failure and
> >>>> was just silently ignored, this checks the section alignment against
> >>>> the smallest supported IOMMU page size and prints an error if not aligned;
> >>>> it also prints an error if vfio_dma_map failed despite the page size check.
> >>>> Neither error is fatal; only MMIO RAM regions are checked
> >>>> (aka "RAM device" regions).
> >>>>
> >>>> If the number of errors printed is overwhelming, MSIX relocation
> >>>> could be used to avoid excessive error output.
> >>>>
> >>>> This is unlikely to cause any behavioral change.
> >>>>
> >>>> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>  
> >>>
> >>> There are some relatively superficial problems noted below.
> >>>
> >>> But more fundamentally, this feels like it's extending an existing
> >>> hack past the point of usefulness.
> >>>
> >>> The explicit check for is_ram_device() here has always bothered me -
> >>> it's not like a real bus bridge magically knows whether a target
> >>> address maps to RAM or not.
> >>>
> >>> What I think is really going on is that even for systems without an
> >>> IOMMU, it's not really true to say that the PCI address space maps
> >>> directly onto address_space_memory.  Instead, there's a large, but
> >>> much less than 2^64 sized, "upstream window" at address 0 on the PCI
> >>> bus, which is identity mapped to the system bus.  Details will vary
> >>> with the system, but in practice we expect nothing but RAM to be in
> >>> that window.  Addresses not within that window won't be mapped to the
> >>> system bus but will just be broadcast on the PCI bus and might be
> >>> picked up as a p2p transaction.  
> >>
> >> Currently this p2p works only via the IOMMU; direct p2p is not possible as
> >> the guest needs to know physical MMIO addresses to make p2p work and it
> >> does not.
> > 
> > /me points to the Direct Translated P2P section of the ACS spec, though
> > it's as prone to spoofing by the device as ATS.  In any case, p2p
> > reflected from the IOMMU is still p2p and offloads the CPU even if
> > bandwidth suffers vs bare metal depending on if the data doubles back
> > over any links.  Thanks,
> 
> Sure, I was just saying that p2p via IOMMU won't be as simple as broadcast
> on the PCI bus; the IOMMU needs to be programmed in advance to make this work,
> and currently that broadcast won't work for the passed through devices.

Well, sure, p2p in a guest with passthrough devices clearly needs to
be translated through the IOMMU (and p2p from a passthrough to an
emulated device is essentially impossible).

But.. what does that have to do with this code.  This is the memory
area watcher, looking for memory regions being mapped directly into
the PCI space.  NOT IOMMU regions, since those are handled separately
by wiring up the IOMMU notifier.  This will only trigger if RAM-like,
non-RAM regions are put into PCI space *not* behind an IOMMU.
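
For reference, the separate guest-IOMMU path follows this pattern in
vfio_listener_region_add(), slightly simplified and with field names
approximate:

if (memory_region_is_iommu(section->mr)) {
    VFIOGuestIOMMU *giommu = g_new0(VFIOGuestIOMMU, 1);

    giommu->iommu = section->mr;
    giommu->container = container;
    iommu_notifier_init(&giommu->n, vfio_iommu_map_notify,
                        IOMMU_NOTIFIER_ALL,
                        section->offset_within_region,
                        int128_get64(llend));
    QLIST_INSERT_HEAD(&container->giommu_list, giommu, giommu_next);
    memory_region_register_iommu_notifier(giommu->iommu, &giommu->n);
    return;   /* actual mappings arrive via vfio_iommu_map_notify() */
}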
David Gibson Feb. 13, 2018, 5:41 a.m. UTC | #6
On Tue, Feb 13, 2018 at 04:36:30PM +1100, David Gibson wrote:
> On Tue, Feb 13, 2018 at 12:15:52PM +1100, Alexey Kardashevskiy wrote:
> > On 13/02/18 03:06, Alex Williamson wrote:
> > > On Mon, 12 Feb 2018 18:05:54 +1100
> > > Alexey Kardashevskiy <aik@ozlabs.ru> wrote:
> > > 
> > >> On 12/02/18 16:19, David Gibson wrote:
> > >>> On Fri, Feb 09, 2018 at 06:55:01PM +1100, Alexey Kardashevskiy wrote:  
> > >>>> At the moment if vfio_memory_listener is registered in the system memory
> > >>>> address space, it maps/unmaps every RAM memory region for DMA.
> > >>>> It expects system page size aligned memory sections so vfio_dma_map
> > >>>> would not fail and so far this has been the case. A mapping failure
> > >>>> would be fatal. A side effect of such behavior is that some MMIO pages
> > >>>> would be silently left unmapped.
> > >>>>
> > >>>> However we are going to change MSIX BAR handling so we will end up having
> > >>>> non-aligned sections in vfio_memory_listener (more details are in
> > >>>> the next patch) and vfio_dma_map will exit QEMU.
> > >>>>
> > >>>> In order to avoid fatal failures on what previously was not a failure and
> > >>>> was just silently ignored, this checks the section alignment against
> > >>>> the smallest supported IOMMU page size and prints an error if not aligned;
> > >>>> it also prints an error if vfio_dma_map failed despite the page size check.
> > >>>> Neither error is fatal; only MMIO RAM regions are checked
> > >>>> (aka "RAM device" regions).
> > >>>>
> > >>>> If the number of errors printed is overwhelming, MSIX relocation
> > >>>> could be used to avoid excessive error output.
> > >>>>
> > >>>> This is unlikely to cause any behavioral change.
> > >>>>
> > >>>> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>  
> > >>>
> > >>> There are some relatively superficial problems noted below.
> > >>>
> > >>> But more fundamentally, this feels like it's extending an existing
> > >>> hack past the point of usefulness.
> > >>>
> > >>> The explicit check for is_ram_device() here has always bothered me -
> > >>> it's not like a real bus bridge magically knows whether a target
> > >>> address maps to RAM or not.
> > >>>
> > >>> What I think is really going on is that even for systems without an
> > >>> IOMMU, it's not really true to say that the PCI address space maps
> > >>> directly onto address_space_memory.  Instead, there's a large, but
> > >>> much less than 2^64 sized, "upstream window" at address 0 on the PCI
> > >>> bus, which is identity mapped to the system bus.  Details will vary
> > >>> with the system, but in practice we expect nothing but RAM to be in
> > >>> that window.  Addresses not within that window won't be mapped to the
> > >>> system bus but will just be broadcast on the PCI bus and might be
> > >>> picked up as a p2p transaction.  
> > >>
> > >> Currently this p2p works only via the IOMMU; direct p2p is not possible as
> > >> the guest needs to know physical MMIO addresses to make p2p work and it
> > >> does not.
> > > 
> > > /me points to the Direct Translated P2P section of the ACS spec, though
> > > it's as prone to spoofing by the device as ATS.  In any case, p2p
> > > reflected from the IOMMU is still p2p and offloads the CPU even if
> > > bandwidth suffers vs bare metal depending on if the data doubles back
> > > over any links.  Thanks,
> > 
> > Sure, I was just saying that p2p via IOMMU won't be as simple as broadcast
> > on the PCI bus; the IOMMU needs to be programmed in advance to make this work,
> > and currently that broadcast won't work for the passed through devices.
> 
> Well, sure, p2p in a guest with passthrough devices clearly needs to
> be translated through the IOMMU (and p2p from a passthrough to an
> emulated device is essentially impossible).
> 
> But.. what does that have to do with this code.  This is the memory
> area watcher, looking for memory regions being mapped directly into
> the PCI space.  NOT IOMMU regions, since those are handled separately
> by wiring up the IOMMU notifier.  This will only trigger if RAM-like,
> > non-RAM regions are put into PCI space *not* behind an IOMMU.

Duh, sorry, realised I was mixing up host and guest IOMMU.  I guess
the point here is that this will map RAM-like devices into the host
IOMMU when there is no guest IOMMU, allowing p2p transactions between
passthrough devices (though not from passthrough to emulated devices).

The conditions still seem kind of awkward to me, but I guess it makes
sense.
Alexey Kardashevskiy Feb. 13, 2018, 8:20 a.m. UTC | #7
On 13/02/18 16:41, David Gibson wrote:
> On Tue, Feb 13, 2018 at 04:36:30PM +1100, David Gibson wrote:
>> On Tue, Feb 13, 2018 at 12:15:52PM +1100, Alexey Kardashevskiy wrote:
>>> On 13/02/18 03:06, Alex Williamson wrote:
>>>> On Mon, 12 Feb 2018 18:05:54 +1100
>>>> Alexey Kardashevskiy <aik@ozlabs.ru> wrote:
>>>>
>>>>> On 12/02/18 16:19, David Gibson wrote:
>>>>>> On Fri, Feb 09, 2018 at 06:55:01PM +1100, Alexey Kardashevskiy wrote:  
>>>>>>> At the moment if vfio_memory_listener is registered in the system memory
>>>>>>> address space, it maps/unmaps every RAM memory region for DMA.
>>>>>>> It expects system page size aligned memory sections so vfio_dma_map
>>>>>>> would not fail and so far this has been the case. A mapping failure
>>>>>>> would be fatal. A side effect of such behavior is that some MMIO pages
>>>>>>> would be silently left unmapped.
>>>>>>>
>>>>>>> However we are going to change MSIX BAR handling so we will end up having
>>>>>>> non-aligned sections in vfio_memory_listener (more details are in
>>>>>>> the next patch) and vfio_dma_map will exit QEMU.
>>>>>>>
>>>>>>> In order to avoid fatal failures on what previously was not a failure and
>>>>>>> was just silently ignored, this checks the section alignment against
>>>>>>> the smallest supported IOMMU page size and prints an error if not aligned;
>>>>>>> it also prints an error if vfio_dma_map failed despite the page size check.
>>>>>>> Neither error is fatal; only MMIO RAM regions are checked
>>>>>>> (aka "RAM device" regions).
>>>>>>>
>>>>>>> If the number of errors printed is overwhelming, MSIX relocation
>>>>>>> could be used to avoid excessive error output.
>>>>>>>
>>>>>>> This is unlikely to cause any behavioral change.
>>>>>>>
>>>>>>> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>  
>>>>>>
>>>>>> There are some relatively superficial problems noted below.
>>>>>>
>>>>>> But more fundamentally, this feels like it's extending an existing
>>>>>> hack past the point of usefulness.
>>>>>>
>>>>>> The explicit check for is_ram_device() here has always bothered me -
>>>>>> it's not like a real bus bridge magically knows whether a target
>>>>>> address maps to RAM or not.
>>>>>>
>>>>>> What I think is really going on is that even for systems without an
>>>>>> IOMMU, it's not really true to say that the PCI address space maps
>>>>>> directly onto address_space_memory.  Instead, there's a large, but
>>>>>> much less than 2^64 sized, "upstream window" at address 0 on the PCI
>>>>>> bus, which is identity mapped to the system bus.  Details will vary
>>>>>> with the system, but in practice we expect nothing but RAM to be in
>>>>>> that window.  Addresses not within that window won't be mapped to the
>>>>>> system bus but will just be broadcast on the PCI bus and might be
>>>>>> picked up as a p2p transaction.  
>>>>>
>>>>> Currently this p2p works only via the IOMMU; direct p2p is not possible as
>>>>> the guest needs to know physical MMIO addresses to make p2p work and it
>>>>> does not.
>>>>
>>>> /me points to the Direct Translated P2P section of the ACS spec, though
>>>> it's as prone to spoofing by the device as ATS.  In any case, p2p
>>>> reflected from the IOMMU is still p2p and offloads the CPU even if
>>>> bandwidth suffers vs bare metal depending on if the data doubles back
>>>> over any links.  Thanks,
>>>
>>> Sure, I was just saying that p2p via IOMMU won't be as simple as broadcast
>>> on the PCI bus; the IOMMU needs to be programmed in advance to make this work,
>>> and currently that broadcast won't work for the passed through devices.
>>
>> Well, sure, p2p in a guest with passthrough devices clearly needs to
>> be translated through the IOMMU (and p2p from a passthrough to an
>> emulated device is essentially impossible).
>>
>> But.. what does that have to do with this code.  This is the memory
>> area watcher, looking for memory regions being mapped directly into
>> the PCI space.  NOT IOMMU regions, since those are handled separately
>> by wiring up the IOMMU notifier.  This will only trigger if RAM-like,
>> non-RAM regions are put into PCI space *not* behind an IOMMU.
> 
> Duh, sorry, realised I was mixing up host and guest IOMMU.  I guess
> the point here is that this will map RAM-like devices into the host
> IOMMU when there is no guest IOMMU, allowing p2p transactions between
> passthrough devices (though not from passthrough to emulated devices).

Correct.

> 
> The conditions still seem kind of awkward to me, but I guess it makes
> sense.

Is it time to split this listener into a RAM listener and a PCI bus listener?

On x86 it listens on the "memory" AS, on spapr - on the
"pci@800000020000000" AS, this will just create more confusion over time...
David Gibson Feb. 14, 2018, 1:33 a.m. UTC | #8
On Tue, Feb 13, 2018 at 07:20:56PM +1100, Alexey Kardashevskiy wrote:
> On 13/02/18 16:41, David Gibson wrote:
> > On Tue, Feb 13, 2018 at 04:36:30PM +1100, David Gibson wrote:
> >> On Tue, Feb 13, 2018 at 12:15:52PM +1100, Alexey Kardashevskiy wrote:
> >>> On 13/02/18 03:06, Alex Williamson wrote:
> >>>> On Mon, 12 Feb 2018 18:05:54 +1100
> >>>> Alexey Kardashevskiy <aik@ozlabs.ru> wrote:
> >>>>
> >>>>> On 12/02/18 16:19, David Gibson wrote:
> >>>>>> On Fri, Feb 09, 2018 at 06:55:01PM +1100, Alexey Kardashevskiy wrote:  
> >>>>>>> At the moment if vfio_memory_listener is registered in the system memory
> >>>>>>> address space, it maps/unmaps every RAM memory region for DMA.
> >>>>>>> It expects system page size aligned memory sections so vfio_dma_map
> >>>>>>> would not fail and so far this has been the case. A mapping failure
> >>>>>>> would be fatal. A side effect of such behavior is that some MMIO pages
> >>>>>>> would be silently left unmapped.
> >>>>>>>
> >>>>>>> However we are going to change MSIX BAR handling so we will end up having
> >>>>>>> non-aligned sections in vfio_memory_listener (more details are in
> >>>>>>> the next patch) and vfio_dma_map will exit QEMU.
> >>>>>>>
> >>>>>>> In order to avoid fatal failures on what previously was not a failure and
> >>>>>>> was just silently ignored, this checks the section alignment against
> >>>>>>> the smallest supported IOMMU page size and prints an error if not aligned;
> >>>>>>> it also prints an error if vfio_dma_map failed despite the page size check.
> >>>>>>> Neither error is fatal; only MMIO RAM regions are checked
> >>>>>>> (aka "RAM device" regions).
> >>>>>>>
> >>>>>>> If the number of errors printed is overwhelming, MSIX relocation
> >>>>>>> could be used to avoid excessive error output.
> >>>>>>>
> >>>>>>> This is unlikely to cause any behavioral change.
> >>>>>>>
> >>>>>>> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>  
> >>>>>>
> >>>>>> There are some relatively superficial problems noted below.
> >>>>>>
> >>>>>> But more fundamentally, this feels like it's extending an existing
> >>>>>> hack past the point of usefulness.
> >>>>>>
> >>>>>> The explicit check for is_ram_device() here has always bothered me -
> >>>>>> it's not like a real bus bridge magically knows whether a target
> >>>>>> address maps to RAM or not.
> >>>>>>
> >>>>>> What I think is really going on is that even for systems without an
> >>>>>> IOMMU, it's not really true to say that the PCI address space maps
> >>>>>> directly onto address_space_memory.  Instead, there's a large, but
> >>>>>> much less than 2^64 sized, "upstream window" at address 0 on the PCI
> >>>>>> bus, which is identity mapped to the system bus.  Details will vary
> >>>>>> with the system, but in practice we expect nothing but RAM to be in
> >>>>>> that window.  Addresses not within that window won't be mapped to the
> >>>>>> system bus but will just be broadcast on the PCI bus and might be
> >>>>>> picked up as a p2p transaction.  
> >>>>>
> >>>>> Currently this p2p works only via the IOMMU; direct p2p is not possible as
> >>>>> the guest needs to know physical MMIO addresses to make p2p work and it
> >>>>> does not.
> >>>>
> >>>> /me points to the Direct Translated P2P section of the ACS spec, though
> >>>> it's as prone to spoofing by the device as ATS.  In any case, p2p
> >>>> reflected from the IOMMU is still p2p and offloads the CPU even if
> >>>> bandwidth suffers vs bare metal depending on if the data doubles back
> >>>> over any links.  Thanks,
> >>>
> >>> Sure, I was just saying that p2p via IOMMU won't be as simple as broadcast
> >>> on the PCI bus; the IOMMU needs to be programmed in advance to make this work,
> >>> and currently that broadcast won't work for the passed through devices.
> >>
> >> Well, sure, p2p in a guest with passthrough devices clearly needs to
> >> be translated through the IOMMU (and p2p from a passthrough to an
> >> emulated device is essentially impossible).
> >>
> >> But.. what does that have to do with this code.  This is the memory
> >> area watcher, looking for memory regions being mapped directly into
> >> the PCI space.  NOT IOMMU regions, since those are handled separately
> >> by wiring up the IOMMU notifier.  This will only trigger if RAM-like,
> >> non-RAM regions are put into PCI space *not* behind an IOMMU.
> > 
> > Duh, sorry, realised I was mixing up host and guest IOMMU.  I guess
> > the point here is that this will map RAM-like devices into the host
> > IOMMU when there is no guest IOMMU, allowing p2p transactions between
> > passthrough devices (though not from passthrough to emulated devices).
> 
> Correct.
> 
> > 
> > The conditions still seem kind of awkward to me, but I guess it makes
> > sense.
> 
> Is it time to split this listener into a RAM listener and a PCI bus listener?

I'm not really sure what you mean by that.

> On x86 it listens on the "memory" AS, on spapr - on the
> "pci@800000020000000" AS, this will just create more confusion over time...

Hm, it's still logically the same AS in each case - the PCI address
space.  It's just that in the x86 case that happens to be the same as
the memory AS (at least when there isn't a guest side IOMMU).

I do wonder if that's really right even for x86 without IOMMU.  The
PCI address space is identity mapped to the "main" memory address
space there, but I'm not sure it's mapped th *all* of the main memory
address space, probably just certain ranges of it.  That's what I was
suggesting earlier in the thread.
Alexey Kardashevskiy Feb. 14, 2018, 8:09 a.m. UTC | #9
On 14/02/18 12:33, David Gibson wrote:
> On Tue, Feb 13, 2018 at 07:20:56PM +1100, Alexey Kardashevskiy wrote:
>> On 13/02/18 16:41, David Gibson wrote:
>>> On Tue, Feb 13, 2018 at 04:36:30PM +1100, David Gibson wrote:
>>>> On Tue, Feb 13, 2018 at 12:15:52PM +1100, Alexey Kardashevskiy wrote:
>>>>> On 13/02/18 03:06, Alex Williamson wrote:
>>>>>> On Mon, 12 Feb 2018 18:05:54 +1100
>>>>>> Alexey Kardashevskiy <aik@ozlabs.ru> wrote:
>>>>>>
>>>>>>> On 12/02/18 16:19, David Gibson wrote:
>>>>>>>> On Fri, Feb 09, 2018 at 06:55:01PM +1100, Alexey Kardashevskiy wrote:  
>>>>>>>>> At the moment if vfio_memory_listener is registered in the system memory
>>>>>>>>> address space, it maps/unmaps every RAM memory region for DMA.
>>>>>>>>> It expects system page size aligned memory sections so vfio_dma_map
>>>>>>>>> would not fail and so far this has been the case. A mapping failure
>>>>>>>>> would be fatal. A side effect of such behavior is that some MMIO pages
>>>>>>>>> would be silently left unmapped.
>>>>>>>>>
>>>>>>>>> However we are going to change MSIX BAR handling so we will end up having
>>>>>>>>> non-aligned sections in vfio_memory_listener (more details are in
>>>>>>>>> the next patch) and vfio_dma_map will exit QEMU.
>>>>>>>>>
>>>>>>>>> In order to avoid fatal failures on what previously was not a failure and
>>>>>>>>> was just silently ignored, this checks the section alignment against
>>>>>>>>> the smallest supported IOMMU page size and prints an error if not aligned;
>>>>>>>>> it also prints an error if vfio_dma_map failed despite the page size check.
>>>>>>>>> Neither error is fatal; only MMIO RAM regions are checked
>>>>>>>>> (aka "RAM device" regions).
>>>>>>>>>
>>>>>>>>> If the number of errors printed is overwhelming, MSIX relocation
>>>>>>>>> could be used to avoid excessive error output.
>>>>>>>>>
>>>>>>>>> This is unlikely to cause any behavioral change.
>>>>>>>>>
>>>>>>>>> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>  
>>>>>>>>
>>>>>>>> There are some relatively superficial problems noted below.
>>>>>>>>
>>>>>>>> But more fundamentally, this feels like it's extending an existing
>>>>>>>> hack past the point of usefulness.
>>>>>>>>
>>>>>>>> The explicit check for is_ram_device() here has always bothered me -
>>>>>>>> it's not like a real bus bridge magically knows whether a target
>>>>>>>> address maps to RAM or not.
>>>>>>>>
>>>>>>>> What I think is really going on is that even for systems without an
>>>>>>>> IOMMU, it's not really true to say that the PCI address space maps
>>>>>>>> directly onto address_space_memory.  Instead, there's a large, but
>>>>>>>> much less than 2^64 sized, "upstream window" at address 0 on the PCI
>>>>>>>> bus, which is identity mapped to the system bus.  Details will vary
>>>>>>>> with the system, but in practice we expect nothing but RAM to be in
>>>>>>>> that window.  Addresses not within that window won't be mapped to the
>>>>>>>> system bus but will just be broadcast on the PCI bus and might be
>>>>>>>> picked up as a p2p transaction.  
>>>>>>>
>>>>>>> Currently this p2p works only via the IOMMU; direct p2p is not possible as
>>>>>>> the guest needs to know physical MMIO addresses to make p2p work and it
>>>>>>> does not.
>>>>>>
>>>>>> /me points to the Direct Translated P2P section of the ACS spec, though
>>>>>> it's as prone to spoofing by the device as ATS.  In any case, p2p
>>>>>> reflected from the IOMMU is still p2p and offloads the CPU even if
>>>>>> bandwidth suffers vs bare metal depending on if the data doubles back
>>>>>> over any links.  Thanks,
>>>>>
>>>>> Sure, I was just saying that p2p via IOMMU won't be as simple as broadcast
>>>>> on the PCI bus; the IOMMU needs to be programmed in advance to make this work,
>>>>> and currently that broadcast won't work for the passed through devices.
>>>>
>>>> Well, sure, p2p in a guest with passthrough devices clearly needs to
>>>> be translated through the IOMMU (and p2p from a passthrough to an
>>>> emulated device is essentially impossible).
>>>>
>>>> But.. what does that have to do with this code.  This is the memory
>>>> area watcher, looking for memory regions being mapped directly into
>>>> the PCI space.  NOT IOMMU regions, since those are handled separately
>>>> by wiring up the IOMMU notifier.  This will only trigger if RAM-like,
>>>> non-RAM regions are put into PCI space *not* behind an IOMMU.
>>>
>>> Duh, sorry, realised I was mixing up host and guest IOMMU.  I guess
>>> the point here is that this will map RAM-like devices into the host
>>> IOMMU when there is no guest IOMMU, allowing p2p transactions between
>>> passthrough devices (though not from passthrough to emulated devices).
>>
>> Correct.
>>
>>>
>>> The conditions still seem kind of awkward to me, but I guess it makes
>>> sense.
>>
>> Is it time to split this listener into a RAM listener and a PCI bus listener?
> 
> I'm not really sure what you mean by that.
> 
>> On x86 it listens on the "memory" AS, on spapr - on the
>> "pci@800000020000000" AS, this will just create more confusion over time...
> 
> Hm, it's still logically the same AS in each case - the PCI address
> space.  It's just that in the x86 case that happens to be the same as
> the memory AS (at least when there isn't a guest side IOMMU).

Hm. Ok.

> I do wonder if that's really right even for x86 without IOMMU.  The
> PCI address space is identity mapped to the "main" memory address
> space there, but I'm not sure it's mapped th *all* of the main memory

What should have been in the place of "th" above? :)

> address space, probably just certain ranges of it.  That's what I was
> suggesting earlier in the thread.

afaict vfio is trying to mmap every RAM MR == KVM memory slot, no ranges or
anything like that. I am trying to figure out what is not correct with the
patch. Thanks,
Alex Williamson Feb. 14, 2018, 3:55 p.m. UTC | #10
On Wed, 14 Feb 2018 19:09:16 +1100
Alexey Kardashevskiy <aik@ozlabs.ru> wrote:

> On 14/02/18 12:33, David Gibson wrote:
> > On Tue, Feb 13, 2018 at 07:20:56PM +1100, Alexey Kardashevskiy wrote:  
> >> On 13/02/18 16:41, David Gibson wrote:  
> >>> On Tue, Feb 13, 2018 at 04:36:30PM +1100, David Gibson wrote:  
> >>>> On Tue, Feb 13, 2018 at 12:15:52PM +1100, Alexey Kardashevskiy wrote:  
> >>>>> On 13/02/18 03:06, Alex Williamson wrote:  
> >>>>>> On Mon, 12 Feb 2018 18:05:54 +1100
> >>>>>> Alexey Kardashevskiy <aik@ozlabs.ru> wrote:
> >>>>>>  
> >>>>>>> On 12/02/18 16:19, David Gibson wrote:  
> >>>>>>>> On Fri, Feb 09, 2018 at 06:55:01PM +1100, Alexey Kardashevskiy wrote:    
> >>>>>>>>> At the moment if vfio_memory_listener is registered in the system memory
> >>>>>>>>> address space, it maps/unmaps every RAM memory region for DMA.
> >>>>>>>>> It expects system page size aligned memory sections so vfio_dma_map
> >>>>>>>>> would not fail and so far this has been the case. A mapping failure
> >>>>>>>>> would be fatal. A side effect of such behavior is that some MMIO pages
> >>>>>>>>> would be silently left unmapped.
> >>>>>>>>>
> >>>>>>>>> However we are going to change MSIX BAR handling so we will end up having
> >>>>>>>>> non-aligned sections in vfio_memory_listener (more details are in
> >>>>>>>>> the next patch) and vfio_dma_map will exit QEMU.
> >>>>>>>>>
> >>>>>>>>> In order to avoid fatal failures on what previously was not a failure and
> >>>>>>>>> was just silently ignored, this checks the section alignment against
> >>>>>>>>> the smallest supported IOMMU page size and prints an error if not aligned;
> >>>>>>>>> it also prints an error if vfio_dma_map failed despite the page size check.
> >>>>>>>>> Neither error is fatal; only MMIO RAM regions are checked
> >>>>>>>>> (aka "RAM device" regions).
> >>>>>>>>>
> >>>>>>>>> If the number of errors printed is overwhelming, MSIX relocation
> >>>>>>>>> could be used to avoid excessive error output.
> >>>>>>>>>
> >>>>>>>>> This is unlikely to cause any behavioral change.
> >>>>>>>>>
> >>>>>>>>> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>    
> >>>>>>>>
> >>>>>>>> There are some relatively superficial problems noted below.
> >>>>>>>>
> >>>>>>>> But more fundamentally, this feels like it's extending an existing
> >>>>>>>> hack past the point of usefulness.
> >>>>>>>>
> >>>>>>>> The explicit check for is_ram_device() here has always bothered me -
> >>>>>>>> it's not like a real bus bridge magically knows whether a target
> >>>>>>>> address maps to RAM or not.
> >>>>>>>>
> >>>>>>>> What I think is really going on is that even for systems without an
> >>>>>>>> IOMMU, it's not really true to say that the PCI address space maps
> >>>>>>>> directly onto address_space_memory.  Instead, there's a large, but
> >>>>>>>> much less than 2^64 sized, "upstream window" at address 0 on the PCI
> >>>>>>>> bus, which is identity mapped to the system bus.  Details will vary
> >>>>>>>> with the system, but in practice we expect nothing but RAM to be in
> >>>>>>>> that window.  Addresses not within that window won't be mapped to the
> >>>>>>>> system bus but will just be broadcast on the PCI bus and might be
> >>>>>>>> picked up as a p2p transaction.    
> >>>>>>>
> >>>>>>> Currently this p2p works only via the IOMMU; direct p2p is not possible as
> >>>>>>> the guest needs to know physical MMIO addresses to make p2p work and it
> >>>>>>> does not.  
> >>>>>>
> >>>>>> /me points to the Direct Translated P2P section of the ACS spec, though
> >>>>>> it's as prone to spoofing by the device as ATS.  In any case, p2p
> >>>>>> reflected from the IOMMU is still p2p and offloads the CPU even if
> >>>>>> bandwidth suffers vs bare metal depending on if the data doubles back
> >>>>>> over any links.  Thanks,  
> >>>>>
> >>>>> Sure, I was just saying that p2p via IOMMU won't be as simple as broadcast
> >>>>> on the PCI bus; the IOMMU needs to be programmed in advance to make this work,
> >>>>> and currently that broadcast won't work for the passed through devices.
> >>>>
> >>>> Well, sure, p2p in a guest with passthrough devices clearly needs to
> >>>> be translated through the IOMMU (and p2p from a passthrough to an
> >>>> emulated device is essentially impossible).
> >>>>
> >>>> But.. what does that have to do with this code.  This is the memory
> >>>> area watcher, looking for memory regions being mapped directly into
> >>>> the PCI space.  NOT IOMMU regions, since those are handled separately
> >>>> by wiring up the IOMMU notifier.  This will only trigger if RAM-like,
> >>>> non-RAM regions are put into PCI space *not* behind an IOMMU.
> >>>
> >>> Duh, sorry, realised I was mixing up host and guest IOMMU.  I guess
> >>> the point here is that this will map RAM-like devices into the host
> >>> IOMMU when there is no guest IOMMU, allowing p2p transactions between
> >>> passthrough devices (though not from passthrough to emulated devices).  
> >>
> >> Correct.
> >>  
> >>>
> >>> The conditions still seem kind of awkward to me, but I guess it makes
> >>> sense.  
> >>
> >> Is it time to split this listener into a RAM listener and a PCI bus listener?
> > 
> > I'm not really sure what you mean by that.
> >   
> >> On x86 it listens on the "memory" AS, on spapr - on the
> >> "pci@800000020000000" AS, this will just create more confusion over time...  
> > 
> > Hm, it's still logically the same AS in each case - the PCI address
> > space.  It's just that in the x86 case that happens to be the same as
> > the memory AS (at least when there isn't a guest side IOMMU).  
> 
> Hm. Ok.
> 
> > I do wonder if that's really right even for x86 without IOMMU.  The
> > PCI address space is identity mapped to the "main" memory address
> > space there, but I'm not sure it's mapped th *all* of the main memory  
> 
> What should have been in the place of "th" above? :)
> 
> > address space, probably just certain ranges of it.  That's what I was
> > suggesting earlier in the thread.  
> 
> afaict vfio is trying to mmap every RAM MR == KVM memory slot, no ranges or
> anything like that. I am trying to figure out what is not correct with the
> patch. Thanks,

It is possible for x86 systems to have translation applied to MMIO vs
RAM such that the processor view and device view of MMIO are different,
putting them into separate address spaces, but this is not typical and
not available on the class of chipsets that QEMU emulates for PC.
Therefore I think it's correct that MMIO and RAM all live in one big
happy, flat address space as they do now (assuming the guest IOMMU is
not present or disabled).  Thanks,

Alex
David Gibson Feb. 16, 2018, 5:28 a.m. UTC | #11
On Wed, Feb 14, 2018 at 08:55:41AM -0700, Alex Williamson wrote:
> On Wed, 14 Feb 2018 19:09:16 +1100
> Alexey Kardashevskiy <aik@ozlabs.ru> wrote:
> 
> > On 14/02/18 12:33, David Gibson wrote:
> > > On Tue, Feb 13, 2018 at 07:20:56PM +1100, Alexey Kardashevskiy wrote:  
> > >> On 13/02/18 16:41, David Gibson wrote:  
> > >>> On Tue, Feb 13, 2018 at 04:36:30PM +1100, David Gibson wrote:  
> > >>>> On Tue, Feb 13, 2018 at 12:15:52PM +1100, Alexey Kardashevskiy wrote:  
> > >>>>> On 13/02/18 03:06, Alex Williamson wrote:  
> > >>>>>> On Mon, 12 Feb 2018 18:05:54 +1100
> > >>>>>> Alexey Kardashevskiy <aik@ozlabs.ru> wrote:
> > >>>>>>  
> > >>>>>>> On 12/02/18 16:19, David Gibson wrote:  
> > >>>>>>>> On Fri, Feb 09, 2018 at 06:55:01PM +1100, Alexey Kardashevskiy wrote:    
> > >>>>>>>>> At the moment if vfio_memory_listener is registered in the system memory
> > >>>>>>>>> address space, it maps/unmaps every RAM memory region for DMA.
> > >>>>>>>>> It expects system page size aligned memory sections so vfio_dma_map
> > >>>>>>>>> would not fail and so far this has been the case. A mapping failure
> > >>>>>>>>> would be fatal. A side effect of such behavior is that some MMIO pages
> > >>>>>>>>> would be silently left unmapped.
> > >>>>>>>>>
> > >>>>>>>>> However we are going to change MSIX BAR handling so we will end up having
> > >>>>>>>>> non-aligned sections in vfio_memory_listener (more details are in
> > >>>>>>>>> the next patch) and vfio_dma_map will exit QEMU.
> > >>>>>>>>>
> > >>>>>>>>> In order to avoid fatal failures on what previously was not a failure and
> > >>>>>>>>> was just silently ignored, this checks the section alignment against
> > >>>>>>>>> the smallest supported IOMMU page size and prints an error if not aligned;
> > >>>>>>>>> it also prints an error if vfio_dma_map failed despite the page size check.
> > >>>>>>>>> Neither error is fatal; only MMIO RAM regions are checked
> > >>>>>>>>> (aka "RAM device" regions).
> > >>>>>>>>>
> > >>>>>>>>> If the number of errors printed is overwhelming, MSIX relocation
> > >>>>>>>>> could be used to avoid excessive error output.
> > >>>>>>>>>
> > >>>>>>>>> This is unlikely to cause any behavioral change.
> > >>>>>>>>>
> > >>>>>>>>> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>    
> > >>>>>>>>
> > >>>>>>>> There are some relatively superficial problems noted below.
> > >>>>>>>>
> > >>>>>>>> But more fundamentally, this feels like it's extending an existing
> > >>>>>>>> hack past the point of usefulness.
> > >>>>>>>>
> > >>>>>>>> The explicit check for is_ram_device() here has always bothered me -
> > >>>>>>>> it's not like a real bus bridge magically knows whether a target
> > >>>>>>>> address maps to RAM or not.
> > >>>>>>>>
> > >>>>>>>> What I think is really going on is that even for systems without an
> > >>>>>>>> IOMMU, it's not really true to say that the PCI address space maps
> > >>>>>>>> directly onto address_space_memory.  Instead, there's a large, but
> > >>>>>>>> much less than 2^64 sized, "upstream window" at address 0 on the PCI
> > >>>>>>>> bus, which is identity mapped to the system bus.  Details will vary
> > >>>>>>>> with the system, but in practice we expect nothing but RAM to be in
> > >>>>>>>> that window.  Addresses not within that window won't be mapped to the
> > >>>>>>>> system bus but will just be broadcast on the PCI bus and might be
> > >>>>>>>> picked up as a p2p transaction.    
> > >>>>>>>
> > >>>>>>> Currently this p2p works only via the IOMMU, direct p2p is not possible as
> > >>>>>>> the guest needs to know physical MMIO addresses to make p2p work and it
> > >>>>>>> does not.  
> > >>>>>>
> > >>>>>> /me points to the Direct Translated P2P section of the ACS spec, though
> > >>>>>> it's as prone to spoofing by the device as ATS.  In any case, p2p
> > >>>>>> reflected from the IOMMU is still p2p and offloads the CPU even if
> > >>>>>> bandwidth suffers vs bare metal depending on if the data doubles back
> > >>>>>> over any links.  Thanks,  
> > >>>>>
> > >>>>> Sure, I was just saying that p2p via IOMMU won't be as simple as broadcast
> > >>>>> on the PCI bus, IOMMU needs to be programmed in advance to make this work,
> > >>>>> and current that broadcast won't work for the passed through devices.  
> > >>>>
> > >>>> Well, sure, p2p in a guest with passthrough devices clearly needs to
> > >>>> be translated through the IOMMU (and p2p from a passthrough to an
> > >>>> emulated device is essentially impossible).
> > >>>>
> > >>>> But.. what does that have to do with this code.  This is the memory
> > >>>> area watcher, looking for memory regions being mapped directly into
> > >>>> the PCI space.  NOT IOMMU regions, since those are handled separately
> > >>>> by wiring up the IOMMU notifier.  This will only trigger if RAM-like,
> > >>>> non-RAM regions are put into PCI space *not* behind an IOMMMU.  
> > >>>
> > >>> Duh, sorry, realised I was mixing up host and guest IOMMU.  I guess
> > >>> the point here is that this will map RAM-like devices into the host
> > >>> IOMMU when there is no guest IOMMU, allowing p2p transactions between
> > >>> passthrough devices (though not from passthrough to emulated devices).  
> > >>
> > >> Correct.
> > >>  
> > >>>
> > >>> The conditions still seem kind of awkward to me, but I guess it makes
> > >>> sense.  
> > >>
> > >> Is it the time to split this listener to RAM-listener and PCI bus listener?  
> > > 
> > > I'm not really sure what you mean by that.
> > >   
> > >> On x86 it listens on the "memory" AS, on spapr - on the
> > >> "pci@800000020000000" AS, this will just create more confusion over time...  
> > > 
> > > Hm, it's still logically the same AS in each case - the PCI address
> > > space.  It's just that in the x86 case that happens to be the same as
> > > the memory AS (at least when there isn't a guest side IOMMU).  
> > 
> > Hm. Ok.
> > 
> > > I do wonder if that's really right even for x86 without IOMMU.  The
> > > PCI address space is identity mapped to the "main" memory address
> > > space there, but I'm not sure it's mapped th *all* of the main memory  
> > 
> > What should have been in the place of "th" above? :)
> > 
> > > address space, probably just certain ranges of it.  That's what I was
> > > suggesting earlier in the thread.  
> > 
> > afaict vfio is trying to mmap every RAM MR == KVM memory slot, no ranges or
> > anything like that. I am trying to figure out what is not correct with the
> > patch. Thanks,
> 
> It is possible for x86 systems to have translation applied to MMIO vs
> RAM such that the processor view and device view of MMIO are different,
> putting them into separate address spaces, but this is not typical and
> not available on the class of chipsets that QEMU emulates for PC.
> Therefore I think it's correct that MMIO and RAM all live in one big
> happy, flat address space as they do now (assuming the guest IOMMU is
> not present or disabled).  Thanks,

Ah.. I think the thing I was missing is that on PC (at least with
typical chipsets) *all* MMIO essentially comes from PCI space.  Even
the legacy devices are essentially ISA which is treated as being on a
bridge under the PCI space.  On non-x86 there are often at least a
handful of MMIO devices that aren't within the PCI space - the
PCI host bridge itself, for example.  x86 avoids that by using the
separate IO space, which is also a bit weird in that it's
simultaneously directly attached to the cpu (and so doesn't need bridge
configuration), but also identified with the legacy IO space treated as
being under two bridges (host->PCI, PCI->ISA) for other purposes.

Maybe it's just me, but I find it makes more sense if I think of
it as the two ASes being equal because on PC the system address map
(address_space_memory) is made equal to the PCI address map, rather
than the PCI address map being made equal to the system one.
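
For what it's worth, that equality is visible at the point where QEMU
resolves a PCI device's DMA address space.  A simplified sketch of the
lookup in hw/pci/pci.c (details vary between QEMU versions, so treat
this as illustrative rather than exact):

AddressSpace *pci_device_iommu_address_space(PCIDevice *dev)
{
    PCIBus *bus = pci_get_bus(dev);
    PCIBus *iommu_bus = bus;

    /* Walk up towards the host bridge looking for a vIOMMU hook */
    while (iommu_bus && !iommu_bus->iommu_fn && iommu_bus->parent_dev) {
        iommu_bus = pci_get_bus(iommu_bus->parent_dev);
    }
    if (iommu_bus && iommu_bus->iommu_fn) {
        return iommu_bus->iommu_fn(bus, iommu_bus->iommu_opaque,
                                   dev->devfn);
    }
    /* No vIOMMU: the PCI address map is simply the system memory map,
     * which is why vfio's listener ends up on address_space_memory */
    return &address_space_memory;
}
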
Alexey Kardashevskiy Feb. 19, 2018, 2:46 a.m. UTC | #12
On 16/02/18 16:28, David Gibson wrote:
> [...]

It makes more sense to me too; we just do not exploit/expose this on SPAPR
- the PCI address space only has 2xIOMMU and 1xMSI windows and that's it,
no MMIO which is mapped to the system AS only (adding those to the PCI AS
is a little tricky, as mentioned elsewhere).
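
For reference, a rough sketch of how the sPAPR host bridge composes
that PCI address space (loosely after hw/ppc/spapr_pci.c; the field
names, helpers and offsets below are from memory and approximate):

/* Root of the per-PHB PCI address space, e.g. "pci@800000020000000" */
memory_region_init(&sphb->iommu_root, OBJECT(sphb),
                   "iommu-root", UINT64_MAX);
address_space_init(&sphb->iommu_as, &sphb->iommu_root, sphb->dtbusname);

/* Each TCE (IOMMU) window is a subregion at its bus offset; the
 * default 32-bit window plus an optional 64-bit one are the "2xIOMMU"
 * windows mentioned above */
memory_region_add_subregion(&sphb->iommu_root, sphb->dma_win_addr,
                            spapr_tce_get_iommu(tcet));

/* The MSI window: writes here raise interrupts rather than doing DMA */
memory_region_add_subregion(&sphb->iommu_root, SPAPR_PCI_MSI_WINDOW,
                            &sphb->msiwindow);
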

Anyway, I still wonder - what needs to be addressed in the 2/4 patch in
order to proceed?
Alexey Kardashevskiy Feb. 26, 2018, 8:36 a.m. UTC | #13
On 19/02/18 13:46, Alexey Kardashevskiy wrote:
> [...]
> Anyway, I still wonder - what needs to be addressed in the 2/4 patch in
> order to proceed?

Ping?
Alexey Kardashevskiy March 7, 2018, 2:17 a.m. UTC | #14
On 26/02/18 19:36, Alexey Kardashevskiy wrote:
>> [...]
>> Anyway, I still wonder - what needs to be addressed in the 2/4 patch in
>> order to proceed?
> 
> Ping?
> 


Ping?
Alexey Kardashevskiy March 13, 2018, 4:53 a.m. UTC | #15
On 7/3/18 1:17 pm, Alexey Kardashevskiy wrote:
>>> [...]
>>> Anyway, I still wonder - what needs to be addressed in the 2/4 patch in
>>> order to proceed?
>>
>> Ping?
>>
> 
> 
> Ping?
> 
> 


Could anyone please enlighten me what needs to be done with this patchset
to proceed, or confirm that it is ok but not going anywhere as everybody is
busy with other things? :) Thanks,
Alex Williamson March 13, 2018, 4:56 p.m. UTC | #16
[Cc +Eric]

On Tue, 13 Mar 2018 15:53:19 +1100
Alexey Kardashevskiy <aik@ozlabs.ru> wrote:

> [...]
> 
> 
> Could anyone please enlighten me what needs to be done with this patchset
> to proceed, or confirm that it is ok but not going anywhere as everybody is
> busy with other things? :) Thanks,

Actually making sure it compiles would have been nice:

qemu.git/hw/vfio/common.c: In function ‘vfio_listener_region_add’:
qemu.git/hw/vfio/common.c:550:40: error: invalid operands to binary & (have ‘Int128 {aka struct Int128}’ and ‘hwaddr {aka long long unsigned int}’)
         if ((iova & pgmask) || (llsize & pgmask)) {
                                        ^
qemu.git/hw/vfio/common.c: In function ‘vfio_listener_region_del’:
qemu.git/hw/vfio/common.c:669:50: error: invalid operands to binary & (have ‘Int128 {aka struct Int128}’ and ‘hwaddr {aka long long unsigned int}’)
         try_unmap = !((iova & pgmask) || (llsize & pgmask));
                                                  ^
Clearly llsize needs to be wrapped in int128_get64() here.  Should I
presume that testing of this patch is equally lacking?  Eric, have you
done any testing of this?  I think this was fixing a mapping issue you
reported.  Thanks,

Alex
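
For reference, the fix would be to wrap llsize in int128_get64() at
both sites, i.e. (a sketch against the hw/vfio/common.c lines quoted
in the errors above, surrounding code unchanged):

         if ((iova & pgmask) || (int128_get64(llsize) & pgmask)) {

in vfio_listener_region_add(), and

         try_unmap = !((iova & pgmask) || (int128_get64(llsize) & pgmask));

in vfio_listener_region_del().
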
Eric Auger March 13, 2018, 5:13 p.m. UTC | #17
Hi Alex,

On 13/03/18 17:56, Alex Williamson wrote:
> [Cc +Eric]
> 
> On Tue, 13 Mar 2018 15:53:19 +1100
> Alexey Kardashevskiy <aik@ozlabs.ru> wrote:
> 
>> On 7/3/18 1:17 pm, Alexey Kardashevskiy wrote:
>>> On 26/02/18 19:36, Alexey Kardashevskiy wrote:  
>>>> On 19/02/18 13:46, Alexey Kardashevskiy wrote:  
>>>>> On 16/02/18 16:28, David Gibson wrote:  
>>>>>> On Wed, Feb 14, 2018 at 08:55:41AM -0700, Alex Williamson wrote:  
>>>>>>> On Wed, 14 Feb 2018 19:09:16 +1100
>>>>>>> Alexey Kardashevskiy <aik@ozlabs.ru> wrote:
>>>>>>>  
>>>>>>>> On 14/02/18 12:33, David Gibson wrote:  
>>>>>>>>> On Tue, Feb 13, 2018 at 07:20:56PM +1100, Alexey Kardashevskiy wrote:    
>>>>>>>>>> On 13/02/18 16:41, David Gibson wrote:    
>>>>>>>>>>> On Tue, Feb 13, 2018 at 04:36:30PM +1100, David Gibson wrote:    
>>>>>>>>>>>> On Tue, Feb 13, 2018 at 12:15:52PM +1100, Alexey Kardashevskiy wrote:    
>>>>>>>>>>>>> On 13/02/18 03:06, Alex Williamson wrote:    
>>>>>>>>>>>>>> On Mon, 12 Feb 2018 18:05:54 +1100
>>>>>>>>>>>>>> Alexey Kardashevskiy <aik@ozlabs.ru> wrote:
>>>>>>>>>>>>>>    
>>>>>>>>>>>>>>> On 12/02/18 16:19, David Gibson wrote:    
>>>>>>>>>>>>>>>> On Fri, Feb 09, 2018 at 06:55:01PM +1100, Alexey Kardashevskiy wrote:      
>>>>>>>>>>>>>>>>> At the moment if vfio_memory_listener is registered in the system memory
>>>>>>>>>>>>>>>>> address space, it maps/unmaps every RAM memory region for DMA.
>>>>>>>>>>>>>>>>> It expects system page size aligned memory sections so vfio_dma_map
>>>>>>>>>>>>>>>>> would not fail and so far this has been the case. A mapping failure
>>>>>>>>>>>>>>>>> would be fatal. A side effect of such behavior is that some MMIO pages
>>>>>>>>>>>>>>>>> would not be mapped silently.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> However we are going to change MSIX BAR handling so we will end having
>>>>>>>>>>>>>>>>> non-aligned sections in vfio_memory_listener (more details is in
>>>>>>>>>>>>>>>>> the next patch) and vfio_dma_map will exit QEMU.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> In order to avoid fatal failures on what previously was not a failure and
>>>>>>>>>>>>>>>>> was just silently ignored, this checks the section alignment to
>>>>>>>>>>>>>>>>> the smallest supported IOMMU page size and prints an error if not aligned;
>>>>>>>>>>>>>>>>> it also prints an error if vfio_dma_map failed despite the page size check.
>>>>>>>>>>>>>>>>> Both errors are not fatal; only MMIO RAM regions are checked
>>>>>>>>>>>>>>>>> (aka "RAM device" regions).
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> If the amount of errors printed is overwhelming, the MSIX relocation
>>>>>>>>>>>>>>>>> could be used to avoid excessive error output.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> This is unlikely to cause any behavioral change.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>      
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> There are some relatively superficial problems noted below.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> But more fundamentally, this feels like it's extending an existing
>>>>>>>>>>>>>>>> hack past the point of usefulness.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> The explicit check for is_ram_device() here has always bothered me -
>>>>>>>>>>>>>>>> it's not like a real bus bridge magically knows whether a target
>>>>>>>>>>>>>>>> address maps to RAM or not.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> What I think is really going on is that even for systems without an
>>>>>>>>>>>>>>>> IOMMU, it's not really true to say that the PCI address space maps
>>>>>>>>>>>>>>>> directly onto address_space_memory.  Instead, there's a large, but
>>>>>>>>>>>>>>>> much less than 2^64 sized, "upstream window" at address 0 on the PCI
>>>>>>>>>>>>>>>> bus, which is identity mapped to the system bus.  Details will vary
>>>>>>>>>>>>>>>> with the system, but in practice we expect nothing but RAM to be in
>>>>>>>>>>>>>>>> that window.  Addresses not within that window won't be mapped to the
>>>>>>>>>>>>>>>> system bus but will just be broadcast on the PCI bus and might be
>>>>>>>>>>>>>>>> picked up as a p2p transaction.      
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Currently this p2p works only via the IOMMU, direct p2p is not possible as
>>>>>>>>>>>>>>> the guest needs to know physical MMIO addresses to make p2p work and it
>>>>>>>>>>>>>>> does not.    
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> /me points to the Direct Translated P2P section of the ACS spec, though
>>>>>>>>>>>>>> it's as prone to spoofing by the device as ATS.  In any case, p2p
>>>>>>>>>>>>>> reflected from the IOMMU is still p2p and offloads the CPU even if
>>>>>>>>>>>>>> bandwidth suffers vs bare metal depending on if the data doubles back
>>>>>>>>>>>>>> over any links.  Thanks,    
>>>>>>>>>>>>>
>>>>>>>>>>>>> Sure, I was just saying that p2p via IOMMU won't be as simple as broadcast
>>>>>>>>>>>>> on the PCI bus, IOMMU needs to be programmed in advance to make this work,
>>>>>>>>>>>>> and current that broadcast won't work for the passed through devices.    
>>>>>>>>>>>>
>>>>>>>>>>>> Well, sure, p2p in a guest with passthrough devices clearly needs to
>>>>>>>>>>>> be translated through the IOMMU (and p2p from a passthrough to an
>>>>>>>>>>>> emulated device is essentially impossible).
>>>>>>>>>>>>
>>>>>>>>>>>> But.. what does that have to do with this code.  This is the memory
>>>>>>>>>>>> area watcher, looking for memory regions being mapped directly into
>>>>>>>>>>>> the PCI space.  NOT IOMMU regions, since those are handled separately
>>>>>>>>>>>> by wiring up the IOMMU notifier.  This will only trigger if RAM-like,
>>>>>>>>>>>> non-RAM regions are put into PCI space *not* behind an IOMMMU.    
>>>>>>>>>>>
>>>>>>>>>>> Duh, sorry, realised I was mixing up host and guest IOMMU.  I guess
>>>>>>>>>>> the point here is that this will map RAM-like devices into the host
>>>>>>>>>>> IOMMU when there is no guest IOMMU, allowing p2p transactions between
>>>>>>>>>>> passthrough devices (though not from passthrough to emulated devices).    
>>>>>>>>>>
>>>>>>>>>> Correct.
>>>>>>>>>>    
>>>>>>>>>>>
>>>>>>>>>>> The conditions still seem kind of awkward to me, but I guess it makes
>>>>>>>>>>> sense.    
>>>>>>>>>>
>>>>>>>>>> Is it the time to split this listener to RAM-listener and PCI bus listener?    
>>>>>>>>>
>>>>>>>>> I'm not really sure what you mean by that.
>>>>>>>>>     
>>>>>>>>>> On x86 it listens on the "memory" AS, on spapr - on the
>>>>>>>>>> "pci@800000020000000" AS, this will just create more confusion over time...    
>>>>>>>>>
>>>>>>>>> Hm, it's still logically the same AS in each case - the PCI address
>>>>>>>>> space.  It's just that in the x86 case that happens to be the same as
>>>>>>>>> the memory AS (at least when there isn't a guest side IOMMU).    
>>>>>>>>
>>>>>>>> Hm. Ok.
>>>>>>>>  
>>>>>>>>> I do wonder if that's really right even for x86 without IOMMU.  The
>>>>>>>>> PCI address space is identity mapped to the "main" memory address
>>>>>>>>> space there, but I'm not sure it's mapped th *all* of the main memory    
>>>>>>>>
>>>>>>>> What should have been in the place of "th" above? :)
>>>>>>>>  
>>>>>>>>> address space, probably just certain ranges of it.  That's what I was
>>>>>>>>> suggesting earlier in the thread.    
>>>>>>>>
>>>>>>>> afaict vfio is trying to mmap every RAM MR == KVM memory slot, no ranges or
>>>>>>>> anything like that. I am trying to figure out what is not correct with the
>>>>>>>> patch. Thanks,  
>>>>>>>
>>>>>>> It is possible for x86 systems to have translation applied to MMIO vs
>>>>>>> RAM such that the processor view and device view of MMIO are different,
>>>>>>> putting them into separate address spaces, but this is not typical and
>>>>>>> not available on the class of chipsets that QEMU emulates for PC.
>>>>>>> Therefore I think it's correct that MMIO and RAM all live in one big
>>>>>>> happy, flat address space as they do now (assuming the guest IOMMU is
>>>>>>> not present or disabled).  Thanks,  
>>>>>>
>>>>>> Ah.. I think the thing I was missing is that on PC (at least with
>>>>>> typical chipsets) *all* MMIO essentially comes from PCI space.  Even
>>>>>> the legacy devices are essentially ISA which is treated as being on a
>>>>>> bridge under the PCI space.  On non-x86 there are often at least a
>>>>>> handful of MMIO devices that aren't within the PCI space - say, the
>>>>>> PCI host bridge itself at least.  x86 avoids that by using the
>>>>>> separate IO space, which is also a bit weird in that it's
>>>>>> simultaneously directly attached to the CPU (and so doesn't need bridge
>>>>>> configuration), but also identified with the legacy IO space treated as
>>>>>> being under two bridges (host->PCI, PCI->ISA) for other purposes.
>>>>>>
>>>>>> Maybe it's just me, but I find it makes more sense to me if I think of
>>>>>> it as the two ASes being equal because on PC the system address map
>>>>>> (address_space_memory) is made equal to the PCI address map, rather
>>>>>> than the PCI address map being made equal to the system one.  
>>>>>
>>>>> It makes more sense to me too; we just do not exploit/expose this on SPAPR
>>>>> - the PCI address space only has two IOMMU windows and one MSI window and
>>>>> that's it, with no MMIO which is mapped to the system AS only (adding those
>>>>> to the PCI AS is a little tricky, as mentioned elsewhere).
>>>>>
>>>>> Anyway, I still wonder - what needs to be addressed in the 2/4 patch in
>>>>> order to proceed?  
>>>>
>>>> Ping?
>>>>  
>>>
>>>
>>> Ping?
>>>
>>>   
>>
>>
>> Could anyone please enlighten me as to what needs to be done with this
>> patchset to proceed, or confirm that it is OK but not going anywhere because
>> everybody is busy with other things? :) Thanks,
> 
> Actually making sure it compiles would have been nice:
> 
> qemu.git/hw/vfio/common.c: In function ‘vfio_listener_region_add’:
> qemu.git/hw/vfio/common.c:550:40: error: invalid operands to binary & (have ‘Int128 {aka struct Int128}’ and ‘hwaddr {aka long long unsigned int}’)
>          if ((iova & pgmask) || (llsize & pgmask)) {
>                                         ^
> qemu.git/hw/vfio/common.c: In function ‘vfio_listener_region_del’:
> qemu.git/hw/vfio/common.c:669:50: error: invalid operands to binary & (have ‘Int128 {aka struct Int128}’ and ‘hwaddr {aka long long unsigned int}’)
>          try_unmap = !((iova & pgmask) || (llsize & pgmask));
>                                                   ^
> Clearly llsize needs to be wrapped in int128_get64() here.  Should I
> presume that testing of this patch is equally lacking?  Eric, have you
> done any testing of this?  I think this was fixing a mapping issue you
> reported.  Thanks,
Not that version, unfortunately. I don't have access to the HW on which I
had the issue anymore. Maybe I could by the beginning of next week?

Thanks

Eric
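
For reference, a sketch of the fix Alex describes (masking only the low 64
bits of the Int128 size via int128_get64()), written against the variables in
hw/vfio/common.c; this shows the shape of the change, not the committed patch
itself:

    /* vfio_listener_region_add(): mask the low 64 bits of the size */
    if ((iova & pgmask) || (int128_get64(llsize) & pgmask)) {
        /* report the unaligned section and return, as in the patch */
    }

    /* vfio_listener_region_del() */
    try_unmap = !((iova & pgmask) || (int128_get64(llsize) & pgmask));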
Eric Blake March 13, 2018, 7:17 p.m. UTC | #18
[Making a meta-comment that I've pointed out to others before]

On 03/13/2018 12:13 PM, Auger Eric wrote:
> Hi Alex,
...

>>>>>>>>>>>>>>>>> On Fri, Feb 09, 2018 at 06:55:01PM +1100, Alexey Kardashevskiy wrote:
>>>>>>>>>>>>>>>>>> At the moment if vfio_memory_listener is registered in the system memory

17 levels of '>' before I even began my reply.  And several screenfuls 
of content that I'm snipping...

>> Clearly llsize needs to be wrapped in int128_get64() here.  Should I
>> presume that testing of this patch is equally lacking?  Eric, have you
>> done any testing of this?  I think this was fixing a mapping issue you
>> reported.  Thanks,
> not that version unfortunately. I don't have access to the HW on which I
> had the issue anymore. Maybe I could by the beg of next week?

before getting to the 2-line meat of the message.  It's okay to snip 
portions of the quoted content that are no longer relevant to the 
immediate message, as it helps improve the signal-to-noise ratio (we 
have list archives, for when referring to prior context is needed).
Alexey Kardashevskiy March 14, 2018, 2:40 a.m. UTC | #19
On 14/3/18 3:56 am, Alex Williamson wrote:
> [Cc +Eric]
> 
> On Tue, 13 Mar 2018 15:53:19 +1100
> Alexey Kardashevskiy <aik@ozlabs.ru> wrote:
> 
>> [...]
> 
> Actually making sure it compiles would have been nice:
> 
> qemu.git/hw/vfio/common.c: In function ‘vfio_listener_region_add’:
> qemu.git/hw/vfio/common.c:550:40: error: invalid operands to binary & (have ‘Int128 {aka struct Int128}’ and ‘hwaddr {aka long long unsigned int}’)
>          if ((iova & pgmask) || (llsize & pgmask)) {
>                                         ^
> qemu.git/hw/vfio/common.c: In function ‘vfio_listener_region_del’:
> qemu.git/hw/vfio/common.c:669:50: error: invalid operands to binary & (have ‘Int128 {aka struct Int128}’ and ‘hwaddr {aka long long unsigned int}’)
>          try_unmap = !((iova & pgmask) || (llsize & pgmask));
>                                                   ^
> Clearly llsize needs to be wrapped in int128_get64() here.  Should I
> presume that testing of this patch is equally lacking? 

No.

I do not know how but this definitely compiles for me with these:

gcc (GCC) 4.8.5 20150623 (Red Hat 4.8.5-16)
gcc (GCC) 6.3.1 20161221 (Red Hat 6.3.1-1)

I always thought Int128 was a struct, but it turns out there are __uint128_t
and __int128_t, and everywhere I compile (including my x86_64 laptop) there is
CONFIG_INT128=y in config-host.mak, hence no error.

I can see that you fixed it (thanks for that!), but how do you compile it to
get this error? There is no way to disable gcc's 128-bit types, and even 4.8.5
supports them. Thanks,



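For context on why the same code builds on some hosts and not others: QEMU's
Int128 has two implementations, selected by CONFIG_INT128, which configure sets
when the compiler supports __int128_t. An abridged sketch along the lines of
include/qemu/int128.h (simplified here; details vary by version):

    #include <assert.h>
    #include <stdint.h>

    #ifdef CONFIG_INT128                /* native 128-bit type available */
    typedef __int128_t Int128;

    static inline uint64_t int128_get64(Int128 a)
    {
        uint64_t r = a;
        assert(r == a);                 /* value must fit in 64 bits */
        return r;
    }
    #else                               /* e.g. a 32-bit build */
    typedef struct Int128 {
        uint64_t lo;
        int64_t hi;
    } Int128;

    static inline uint64_t int128_get64(Int128 a)
    {
        assert(!a.hi);
        return a.lo;
    }
    #endif

With CONFIG_INT128 set, "llsize & pgmask" compiles because Int128 is a plain
integer type; without it, Int128 is a struct and the bitwise & is rejected,
which is exactly what a 32-bit build runs into.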
Alex Williamson March 15, 2018, 2:36 a.m. UTC | #20
On Wed, 14 Mar 2018 13:40:21 +1100
Alexey Kardashevskiy <aik@ozlabs.ru> wrote:

> On 14/3/18 3:56 am, Alex Williamson wrote:
> > [...]
> > Clearly llsize needs to be wrapped in int128_get64() here.  Should I
> > presume that testing of this patch is equally lacking?   
> 
> No.
> 
> I do not know how but this definitely compiles for me with these:
> 
> gcc (GCC) 4.8.5 20150623 (Red Hat 4.8.5-16)
> gcc (GCC) 6.3.1 20161221 (Red Hat 6.3.1-1)
> 
> I always thought Int128 was a struct, but it turns out there are __uint128_t
> and __int128_t, and everywhere I compile (including my x86_64 laptop) there is
> CONFIG_INT128=y in config-host.mak, hence no error.
> 
> I can see that you fixed it (thanks for that!), but how do you compile it to
> get this error? There is no way to disable gcc's 128-bit types, and even 4.8.5
> supports them. Thanks,

Hmm, I always try to do a 32-bit build before I send a pull request;
that's where I started this time.  I thought an Int128 was always a
structure, so I figured it always failed.  Good to know there was more
than just my testing on this commit.  Since it's in, we can fix it
during the freeze if it turns out to be broken.  Thanks,

Alex
Eric Auger March 19, 2018, 8:49 p.m. UTC | #21
Hi Alexey,

On 09/02/18 08:55, Alexey Kardashevskiy wrote:
> [...]
> ---
>  hw/vfio/common.c | 55 +++++++++++++++++++++++++++++++++++++++++++++++++------
>  1 file changed, 49 insertions(+), 6 deletions(-)
> 
> diff --git a/hw/vfio/common.c b/hw/vfio/common.c
> index f895e3c..736f271 100644
> --- a/hw/vfio/common.c
> +++ b/hw/vfio/common.c
> @@ -544,18 +544,40 @@ static void vfio_listener_region_add(MemoryListener *listener,
>  
>      llsize = int128_sub(llend, int128_make64(iova));
>  
> +    if (memory_region_is_ram_device(section->mr)) {
> +        hwaddr pgmask = (1ULL << ctz64(hostwin->iova_pgsizes)) - 1;
> +
> +        if ((iova & pgmask) || (llsize & pgmask)) {
> +            error_report("Region 0x%"HWADDR_PRIx"..0x%"HWADDR_PRIx
> +                         " is not aligned to 0x%"HWADDR_PRIx
> +                         " and cannot be mapped for DMA",
> +                         section->offset_within_region,
> +                         int128_getlo(section->size),
Didn't you want to print the section->offset_within_address_space
instead of section->offset_within_region?

For an e1000e card*, with a 64kB page size, it outputs:
"Region 0x50..0xffb0 is not aligned to 0x10000 and cannot be mapped for DMA"

An absolute GPA range would be simpler, I think.


* where 	Region 3: Memory at 100e0000 (32-bit, non-prefetchable)
[size=16K] contains the MSIX table

	Capabilities: [a0] MSI-X: Enable+ Count=5 Masked-
		Vector table: BAR=3 offset=00000000
		PBA: BAR=3 offset=00002000

Thanks

Eric
Alex Williamson March 21, 2018, 3:29 p.m. UTC | #22
On Mon, 19 Mar 2018 21:49:58 +0100
Auger Eric <eric.auger@redhat.com> wrote:

> Hi Alexey,
> 
> On 09/02/18 08:55, Alexey Kardashevskiy wrote:
> > [...]
> > ---
> >  hw/vfio/common.c | 55 +++++++++++++++++++++++++++++++++++++++++++++++++------
> >  1 file changed, 49 insertions(+), 6 deletions(-)
> > 
> > diff --git a/hw/vfio/common.c b/hw/vfio/common.c
> > index f895e3c..736f271 100644
> > --- a/hw/vfio/common.c
> > +++ b/hw/vfio/common.c
> > @@ -544,18 +544,40 @@ static void vfio_listener_region_add(MemoryListener *listener,
> >  
> >      llsize = int128_sub(llend, int128_make64(iova));
> >  
> > +    if (memory_region_is_ram_device(section->mr)) {
> > +        hwaddr pgmask = (1ULL << ctz64(hostwin->iova_pgsizes)) - 1;
> > +
> > +        if ((iova & pgmask) || (llsize & pgmask)) {
> > +            error_report("Region 0x%"HWADDR_PRIx"..0x%"HWADDR_PRIx
> > +                         " is not aligned to 0x%"HWADDR_PRIx
> > +                         " and cannot be mapped for DMA",
> > +                         section->offset_within_region,
> > +                         int128_getlo(section->size),  
> Didn't you want to print the section->offset_within_address_space
> instead of section->offset_within_region?
> 
> For an 1000e card*, with a 64kB page, it outputs:
> "Region 0x50..0xffb0 is not aligned to 0x10000 and cannot be mapped for DMA"
> 
> an absolute gpa range would be simpler I think.

I agree, the offset within the address space seems like it'd be much
easier to map back to the device.  Since we're handling it as
non-fatal, perhaps we can even add "MMIO Region" to give the user a bit
more context where the issue occurs.  Alexey, do you want to submit a
follow-up patch, or Eric, you may have the best resources to tune the
message.  Thanks,

Alex
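
A sketch of what the tuned message could look like, printing the absolute
guest-physical range (iova here is the offset within the address space) plus
the context Alex suggests; the exact wording is a guess, not the actual
follow-up patch:

    error_report("MMIO Region 0x%"HWADDR_PRIx"..0x%"HWADDR_PRIx
                 " is not aligned to 0x%"HWADDR_PRIx
                 " and cannot be mapped for DMA",
                 iova, iova + int128_get64(llsize) - 1, pgmask + 1);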
 
Alexey Kardashevskiy March 22, 2018, 7:45 a.m. UTC | #23
On 22/3/18 2:29 am, Alex Williamson wrote:
> On Mon, 19 Mar 2018 21:49:58 +0100
> Auger Eric <eric.auger@redhat.com> wrote:
> 
>> Hi Alexey,
>>
>> On 09/02/18 08:55, Alexey Kardashevskiy wrote:
>>> [...]
>>> @@ -544,18 +544,40 @@ static void vfio_listener_region_add(MemoryListener *listener,
>>>  
>>>      llsize = int128_sub(llend, int128_make64(iova));
>>>  
>>> +    if (memory_region_is_ram_device(section->mr)) {
>>> +        hwaddr pgmask = (1ULL << ctz64(hostwin->iova_pgsizes)) - 1;
>>> +
>>> +        if ((iova & pgmask) || (llsize & pgmask)) {
>>> +            error_report("Region 0x%"HWADDR_PRIx"..0x%"HWADDR_PRIx
>>> +                         " is not aligned to 0x%"HWADDR_PRIx
>>> +                         " and cannot be mapped for DMA",
>>> +                         section->offset_within_region,
>>> +                         int128_getlo(section->size),  
>> Didn't you want to print the section->offset_within_address_space
>> instead of section->offset_within_region?


I do, my bad - this offset_within_region is a leftover from an earlier
version; iova is essentially offset_within_address_space.


>>
>> For an e1000e card*, with a 64kB page size, it outputs:
>> "Region 0x50..0xffb0 is not aligned to 0x10000 and cannot be mapped for DMA"
>>
>> An absolute GPA range would be simpler, I think.
> 
> I agree, the offset within the address space seems like it'd be much
> easier to map back to the device.  Since we're handling it as
> non-fatal, perhaps we can even add "MMIO Region" to give the user a bit
> more context where the issue occurs.  Alexey, do you want to submit a
> follow-up patch, or Eric, you may have the best resources to tune the
> message.  Thanks,


I can if somebody gives a hint about what needs to be tuned in the message.
I am also fine if Eric does this too :)


Eric Auger March 22, 2018, 7:48 a.m. UTC | #24
Hi Alexey,

On 22/03/18 08:45, Alexey Kardashevskiy wrote:
> On 22/3/18 2:29 am, Alex Williamson wrote:
>> On Mon, 19 Mar 2018 21:49:58 +0100
>> Auger Eric <eric.auger@redhat.com> wrote:
>>
>>> Hi Alexey,
>>>
>>> On 09/02/18 08:55, Alexey Kardashevskiy wrote:
>>>> [...]
>>>> @@ -544,18 +544,40 @@ static void vfio_listener_region_add(MemoryListener *listener,
>>>>  
>>>>      llsize = int128_sub(llend, int128_make64(iova));
>>>>  
>>>> +    if (memory_region_is_ram_device(section->mr)) {
>>>> +        hwaddr pgmask = (1ULL << ctz64(hostwin->iova_pgsizes)) - 1;
>>>> +
>>>> +        if ((iova & pgmask) || (llsize & pgmask)) {
>>>> +            error_report("Region 0x%"HWADDR_PRIx"..0x%"HWADDR_PRIx
>>>> +                         " is not aligned to 0x%"HWADDR_PRIx
>>>> +                         " and cannot be mapped for DMA",
>>>> +                         section->offset_within_region,
>>>> +                         int128_getlo(section->size),  
>>> Didn't you want to print the section->offset_within_address_space
>>> instead of section->offset_within_region?
> 
> 
> I do, my bad - this offset_within_region is a leftover from an earlier
> version; iova is essentially offset_within_address_space.
> 
> 
>>>
>>> For an e1000e card*, with a 64kB page size, it outputs:
>>> "Region 0x50..0xffb0 is not aligned to 0x10000 and cannot be mapped for DMA"
>>>
>>> An absolute GPA range would be simpler, I think.
>>
>> I agree, the offset within the address space seems like it'd be much
>> easier to map back to the device.  Since we're handling it as
>> non-fatal, perhaps we can even add "MMIO Region" to give the user a bit
>> more context where the issue occurs.  Alexey, do you want to submit a
>> follow-up patch, or Eric, you may have the best resources to tune the
>> message.  Thanks,
> 
> 
> I can if somebody gives a hint about what needs to be tuned in the message.
> I am also fine if Eric does this too :)

I propose to send a follow-up patch as I have the proper HW to test it now.

Thanks

Eric

Patch

diff --git a/hw/vfio/common.c b/hw/vfio/common.c
index f895e3c..736f271 100644
--- a/hw/vfio/common.c
+++ b/hw/vfio/common.c
@@ -544,18 +544,40 @@  static void vfio_listener_region_add(MemoryListener *listener,
 
     llsize = int128_sub(llend, int128_make64(iova));
 
+    if (memory_region_is_ram_device(section->mr)) {
+        hwaddr pgmask = (1ULL << ctz64(hostwin->iova_pgsizes)) - 1;
+
+        if ((iova & pgmask) || (llsize & pgmask)) {
+            error_report("Region 0x%"HWADDR_PRIx"..0x%"HWADDR_PRIx
+                         " is not aligned to 0x%"HWADDR_PRIx
+                         " and cannot be mapped for DMA",
+                         section->offset_within_region,
+                         int128_getlo(section->size),
+                         pgmask + 1);
+            return;
+        }
+    }
+
     ret = vfio_dma_map(container, iova, int128_get64(llsize),
                        vaddr, section->readonly);
     if (ret) {
         error_report("vfio_dma_map(%p, 0x%"HWADDR_PRIx", "
                      "0x%"HWADDR_PRIx", %p) = %d (%m)",
                      container, iova, int128_get64(llsize), vaddr, ret);
+        if (memory_region_is_ram_device(section->mr)) {
+            /* Allow unexpected mappings not to be fatal for RAM devices */
+            return;
+        }
         goto fail;
     }
 
     return;
 
 fail:
+    if (memory_region_is_ram_device(section->mr)) {
+        error_report("failed to vfio_dma_map. pci p2p may not work");
+        return;
+    }
     /*
      * On the initfn path, store the first error in the container so we
      * can gracefully fail.  Runtime, there's not much we can do other
@@ -577,6 +599,7 @@  static void vfio_listener_region_del(MemoryListener *listener,
     hwaddr iova, end;
     Int128 llend, llsize;
     int ret;
+    bool try_unmap = true;
 
     if (vfio_listener_skipped_section(section)) {
         trace_vfio_listener_region_del_skip(
@@ -629,13 +652,33 @@  static void vfio_listener_region_del(MemoryListener *listener,
 
     trace_vfio_listener_region_del(iova, end);
 
-    ret = vfio_dma_unmap(container, iova, int128_get64(llsize));
+    if (memory_region_is_ram_device(section->mr)) {
+        hwaddr pgmask;
+        VFIOHostDMAWindow *hostwin;
+        bool hostwin_found = false;
+
+        QLIST_FOREACH(hostwin, &container->hostwin_list, hostwin_next) {
+            if (hostwin->min_iova <= iova && end <= hostwin->max_iova) {
+                hostwin_found = true;
+                break;
+            }
+        }
+        assert(hostwin_found); /* or region_add() would have failed */
+
+        pgmask = (1ULL << ctz64(hostwin->iova_pgsizes)) - 1;
+        try_unmap = !((iova & pgmask) || (llsize & pgmask));
+    }
+
+    if (try_unmap) {
+        ret = vfio_dma_unmap(container, iova, int128_get64(llsize));
+        if (ret) {
+            error_report("vfio_dma_unmap(%p, 0x%"HWADDR_PRIx", "
+                         "0x%"HWADDR_PRIx") = %d (%m)",
+                         container, iova, int128_get64(llsize), ret);
+        }
+    }
+
     memory_region_unref(section->mr);
-    if (ret) {
-        error_report("vfio_dma_unmap(%p, 0x%"HWADDR_PRIx", "
-                     "0x%"HWADDR_PRIx") = %d (%m)",
-                     container, iova, int128_get64(llsize), ret);
-    }
 
     if (container->iommu_type == VFIO_SPAPR_TCE_v2_IOMMU) {
         vfio_spapr_remove_window(container,
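
As a final illustration, a standalone sketch (simplified types, not QEMU code)
of the host DMA window lookup that the region_del() hunk adds: find the window
that fully contains [iova, end] so its page-size bitmap can be consulted. QEMU
walks container->hostwin_list with QLIST_FOREACH rather than an array:

    #include <stdint.h>
    #include <stdio.h>

    typedef struct HostWin {
        uint64_t min_iova;
        uint64_t max_iova;
        uint64_t iova_pgsizes;          /* bitmap of supported page sizes */
    } HostWin;

    static const HostWin *find_hostwin(const HostWin *wins, int n,
                                       uint64_t iova, uint64_t end)
    {
        for (int i = 0; i < n; i++) {
            if (wins[i].min_iova <= iova && end <= wins[i].max_iova) {
                return &wins[i];
            }
        }
        return NULL;    /* region_add() would have failed; QEMU asserts */
    }

    int main(void)
    {
        HostWin wins[] = { { 0x0, 0x7fffffff, 0x10000 } };
        const HostWin *w = find_hostwin(wins, 1, 0x100e0000, 0x100e3fff);

        printf("%s\n", w ? "found" : "not found");
        return 0;
    }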