diff mbox series

[qemu,RFC,1/7] vfio/spapr: Fix indirect levels calculation

Message ID 20181113083104.2692-2-aik@ozlabs.ru (mailing list archive)
State New, archived
Headers show
Series spapr_pci, vfio: NVIDIA V100 + P9 passthrough | expand

Commit Message

Alexey Kardashevskiy Nov. 13, 2018, 8:30 a.m. UTC
The current code assumes that we can address more bits on a PCI bus
for DMA than we really can. Limit to the known tested maximum of 55 bits
and assume 64K IOMMU pages.

Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
---
 hw/vfio/spapr.c      | 3 ++-
 hw/vfio/trace-events | 2 +-
 2 files changed, 3 insertions(+), 2 deletions(-)

Comments

David Gibson Nov. 19, 2018, 1:45 a.m. UTC | #1
On Tue, Nov 13, 2018 at 07:30:58PM +1100, Alexey Kardashevskiy wrote:
> The current code assumes that we can address more bits on a PCI bus
> for DMA than we really can. Limit to the known tested maximum of 55 bits
> and assume 64K IOMMU pages.

This doesn't make much sense to me.  It looks nothing like the
calculation it's replacing, and not just for extreme values.  For
example the new calculation doesn't depend on the size of the window
at all, whereas the previous one did.  You say you're assuming 64k
IOMMU pages, but create.page_shift (the actual IOMMU page size) is in
there.  Nor is it clear to me why the maximum PCI address is relevant
to the number of levels in the first place.

In addition, AFAICT the new calculation gives the answer '2' for all
likely IOMMU page sizes (4k, 64k, 2M)

> 
> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
> ---
>  hw/vfio/spapr.c      | 3 ++-
>  hw/vfio/trace-events | 2 +-
>  2 files changed, 3 insertions(+), 2 deletions(-)
> 
> diff --git a/hw/vfio/spapr.c b/hw/vfio/spapr.c
> index becf71a..f5fdc53 100644
> --- a/hw/vfio/spapr.c
> +++ b/hw/vfio/spapr.c
> @@ -183,7 +183,7 @@ int vfio_spapr_create_window(VFIOContainer *container,
>      entries = create.window_size >> create.page_shift;
>      pages = MAX((entries * sizeof(uint64_t)) / getpagesize(), 1);
>      pages = MAX(pow2ceil(pages), 1); /* Round up */
> -    create.levels = ctz64(pages) / 6 + 1;
> +    create.levels = MAX(1, (55 - create.page_shift) / 16);
>  
>      ret = ioctl(container->fd, VFIO_IOMMU_SPAPR_TCE_CREATE, &create);
>      if (ret) {
> @@ -200,6 +200,7 @@ int vfio_spapr_create_window(VFIOContainer *container,
>          return -EINVAL;
>      }
>      trace_vfio_spapr_create_window(create.page_shift,
> +                                   create.levels,
>                                     create.window_size,
>                                     create.start_addr);
>      *pgsize = pagesize;
> diff --git a/hw/vfio/trace-events b/hw/vfio/trace-events
> index a85e866..db730f3 100644
> --- a/hw/vfio/trace-events
> +++ b/hw/vfio/trace-events
> @@ -128,6 +128,6 @@ vfio_prereg_listener_region_add_skip(uint64_t start, uint64_t end) "0x%"PRIx64"
>  vfio_prereg_listener_region_del_skip(uint64_t start, uint64_t end) "0x%"PRIx64" - 0x%"PRIx64
>  vfio_prereg_register(uint64_t va, uint64_t size, int ret) "va=0x%"PRIx64" size=0x%"PRIx64" ret=%d"
>  vfio_prereg_unregister(uint64_t va, uint64_t size, int ret) "va=0x%"PRIx64" size=0x%"PRIx64" ret=%d"
> -vfio_spapr_create_window(int ps, uint64_t ws, uint64_t off) "pageshift=0x%x winsize=0x%"PRIx64" offset=0x%"PRIx64
> +vfio_spapr_create_window(int ps, unsigned int levels, uint64_t ws, uint64_t off) "pageshift=0x%x levels=%u winsize=0x%"PRIx64" offset=0x%"PRIx64
>  vfio_spapr_remove_window(uint64_t off) "offset=0x%"PRIx64
>  vfio_spapr_group_attach(int groupfd, int tablefd) "Attached groupfd %d to liobn fd %d"
Alexey Kardashevskiy Nov. 19, 2018, 5:51 a.m. UTC | #2
On 19/11/2018 12:45, David Gibson wrote:
> On Tue, Nov 13, 2018 at 07:30:58PM +1100, Alexey Kardashevskiy wrote:
>> The current code assumes that we can address more bits on a PCI bus
>> for DMA than we really can. Limit to the known tested maximum of 55 bits
>> and assume 64K IOMMU pages.
> 
> This doesn't make much sense to me.  It looks nothing like the
> calculation it's replacing, and not just for extreme values.  For
> example the new calculation doesn't depend on the size of the window
> at all, whereas the previous one did.

Uff. Yes, this is confusing. There are a typical 64k IOMMU page size and
a typical 64k system page size...

For the window as big as 0x200.0000.0000 and 64k pages (actual working
example), 33554432 entries are needed, or 256MB of space. The existing
code calculates it as 25/6+1=5 levels. Which the hardware can do.

But:
1. all TCE levels are the same size and
2. for the minimum size of the level I take a system page size which is 64k.

So with 5 levels 16bit each (and this is without counting IOMMU page
size, which is plus extra 16 bits) the table is huuuge.

And I know from tests that 55 bits is the maximum the hardware can
handle, and I have this limit in the powernv code so VFIO ioctl to
create a window will fail. If I do not have a check in powernv, than
things silently do not work (no EEH though).


> You say you're assuming 64k
> IOMMU pages, but create.page_shift (the actual IOMMU page size) is in
> there. 

It is a radix tree basically. Out of 64bits available, I cannot use bits
56..63 (hardware does not work) and 0..12/15/21 (IOMMU page size), so I
have bits 13/16/22..55 which I need to split equally between levels,
each of which is 64k (system page size).


> Nor is it clear to me why the maximum PCI address is relevant
> to the number of levels in the first place.


The window can only as big as the PHB hardware supports, this is why
maximum PCI address.


> 
> In addition, AFAICT the new calculation gives the answer '2' for all
> likely IOMMU page sizes (4k, 64k, 2M)

Hm, I did not realize that :) Instead of using 64k as a default level
size, I could use some idea of what CONFIG_FORCE_MAX_ZONEORDER could be.

Currently it is:

arch/powerpc/Kconfig
config FORCE_MAX_ZONEORDER
        int "Maximum zone order"
        range 8 9 if PPC64 && PPC_64K_PAGES
        default "9" if PPC64 && PPC_64K_PAGES
        range 13 13 if PPC64 && !PPC_64K_PAGES
        default "13" if PPC64 && !PPC_64K_PAGES
        range 9 64 if PPC32 && PPC_16K_PAGES
        default "9" if PPC32 && PPC_16K_PAGES
        range 7 64 if PPC32 && PPC_64K_PAGES
        default "7" if PPC32 && PPC_64K_PAGES
        range 5 64 if PPC32 && PPC_256K_PAGES
        default "5" if PPC32 && PPC_256K_PAGES
        range 11 64
        default "11"

So assuming that "8" is the absolute minimum, I could try levels as big
as 1<<(16+8-1) and if this fails, then increase the number of levels
till a window is created or it is too many levels. Or there is some
obviously better solution to the problem?





> 
>>
>> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
>> ---
>>  hw/vfio/spapr.c      | 3 ++-
>>  hw/vfio/trace-events | 2 +-
>>  2 files changed, 3 insertions(+), 2 deletions(-)
>>
>> diff --git a/hw/vfio/spapr.c b/hw/vfio/spapr.c
>> index becf71a..f5fdc53 100644
>> --- a/hw/vfio/spapr.c
>> +++ b/hw/vfio/spapr.c
>> @@ -183,7 +183,7 @@ int vfio_spapr_create_window(VFIOContainer *container,
>>      entries = create.window_size >> create.page_shift;
>>      pages = MAX((entries * sizeof(uint64_t)) / getpagesize(), 1);
>>      pages = MAX(pow2ceil(pages), 1); /* Round up */
>> -    create.levels = ctz64(pages) / 6 + 1;
>> +    create.levels = MAX(1, (55 - create.page_shift) / 16);
>>  
>>      ret = ioctl(container->fd, VFIO_IOMMU_SPAPR_TCE_CREATE, &create);
>>      if (ret) {
>> @@ -200,6 +200,7 @@ int vfio_spapr_create_window(VFIOContainer *container,
>>          return -EINVAL;
>>      }
>>      trace_vfio_spapr_create_window(create.page_shift,
>> +                                   create.levels,
>>                                     create.window_size,
>>                                     create.start_addr);
>>      *pgsize = pagesize;
>> diff --git a/hw/vfio/trace-events b/hw/vfio/trace-events
>> index a85e866..db730f3 100644
>> --- a/hw/vfio/trace-events
>> +++ b/hw/vfio/trace-events
>> @@ -128,6 +128,6 @@ vfio_prereg_listener_region_add_skip(uint64_t start, uint64_t end) "0x%"PRIx64"
>>  vfio_prereg_listener_region_del_skip(uint64_t start, uint64_t end) "0x%"PRIx64" - 0x%"PRIx64
>>  vfio_prereg_register(uint64_t va, uint64_t size, int ret) "va=0x%"PRIx64" size=0x%"PRIx64" ret=%d"
>>  vfio_prereg_unregister(uint64_t va, uint64_t size, int ret) "va=0x%"PRIx64" size=0x%"PRIx64" ret=%d"
>> -vfio_spapr_create_window(int ps, uint64_t ws, uint64_t off) "pageshift=0x%x winsize=0x%"PRIx64" offset=0x%"PRIx64
>> +vfio_spapr_create_window(int ps, unsigned int levels, uint64_t ws, uint64_t off) "pageshift=0x%x levels=%u winsize=0x%"PRIx64" offset=0x%"PRIx64
>>  vfio_spapr_remove_window(uint64_t off) "offset=0x%"PRIx64
>>  vfio_spapr_group_attach(int groupfd, int tablefd) "Attached groupfd %d to liobn fd %d"
>
diff mbox series

Patch

diff --git a/hw/vfio/spapr.c b/hw/vfio/spapr.c
index becf71a..f5fdc53 100644
--- a/hw/vfio/spapr.c
+++ b/hw/vfio/spapr.c
@@ -183,7 +183,7 @@  int vfio_spapr_create_window(VFIOContainer *container,
     entries = create.window_size >> create.page_shift;
     pages = MAX((entries * sizeof(uint64_t)) / getpagesize(), 1);
     pages = MAX(pow2ceil(pages), 1); /* Round up */
-    create.levels = ctz64(pages) / 6 + 1;
+    create.levels = MAX(1, (55 - create.page_shift) / 16);
 
     ret = ioctl(container->fd, VFIO_IOMMU_SPAPR_TCE_CREATE, &create);
     if (ret) {
@@ -200,6 +200,7 @@  int vfio_spapr_create_window(VFIOContainer *container,
         return -EINVAL;
     }
     trace_vfio_spapr_create_window(create.page_shift,
+                                   create.levels,
                                    create.window_size,
                                    create.start_addr);
     *pgsize = pagesize;
diff --git a/hw/vfio/trace-events b/hw/vfio/trace-events
index a85e866..db730f3 100644
--- a/hw/vfio/trace-events
+++ b/hw/vfio/trace-events
@@ -128,6 +128,6 @@  vfio_prereg_listener_region_add_skip(uint64_t start, uint64_t end) "0x%"PRIx64"
 vfio_prereg_listener_region_del_skip(uint64_t start, uint64_t end) "0x%"PRIx64" - 0x%"PRIx64
 vfio_prereg_register(uint64_t va, uint64_t size, int ret) "va=0x%"PRIx64" size=0x%"PRIx64" ret=%d"
 vfio_prereg_unregister(uint64_t va, uint64_t size, int ret) "va=0x%"PRIx64" size=0x%"PRIx64" ret=%d"
-vfio_spapr_create_window(int ps, uint64_t ws, uint64_t off) "pageshift=0x%x winsize=0x%"PRIx64" offset=0x%"PRIx64
+vfio_spapr_create_window(int ps, unsigned int levels, uint64_t ws, uint64_t off) "pageshift=0x%x levels=%u winsize=0x%"PRIx64" offset=0x%"PRIx64
 vfio_spapr_remove_window(uint64_t off) "offset=0x%"PRIx64
 vfio_spapr_group_attach(int groupfd, int tablefd) "Attached groupfd %d to liobn fd %d"