diff mbox

[qemu,v15,16/17] vfio/spapr: Use VFIO_SPAPR_TCE_v2_IOMMU

Message ID 1459762426-18440-17-git-send-email-aik@ozlabs.ru (mailing list archive)
State New, archived
Headers show

Commit Message

Alexey Kardashevskiy April 4, 2016, 9:33 a.m. UTC
New VFIO_SPAPR_TCE_v2_IOMMU type supports dynamic DMA window management.
This adds ability to VFIO common code to dynamically allocate/remove
DMA windows in the host kernel when new VFIO container is added/removed.

This adds VFIO_IOMMU_SPAPR_TCE_CREATE ioctl to vfio_listener_region_add
and adds just created IOMMU into the host IOMMU list; the opposite
action is taken in vfio_listener_region_del.

When creating a new window, this uses euristic to decide on the TCE table
levels number.

This should cause no guest visible change in behavior.

Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
---
Changes:
v14:
* new to the series

---
TODO:
* export levels to PHB
---
 hw/vfio/common.c | 115 ++++++++++++++++++++++++++++++++++++++++++++++++++-----
 trace-events     |   2 +
 2 files changed, 107 insertions(+), 10 deletions(-)

Comments

David Gibson April 7, 2016, 1:10 a.m. UTC | #1
Subject doesn't seem quite right, since you added at least minimal
support for the SPAPRv2 IOMMU in the prereg patch.

On Mon, Apr 04, 2016 at 07:33:45PM +1000, Alexey Kardashevskiy wrote:
> New VFIO_SPAPR_TCE_v2_IOMMU type supports dynamic DMA window management.
> This adds ability to VFIO common code to dynamically allocate/remove
> DMA windows in the host kernel when new VFIO container is added/removed.
> 
> This adds VFIO_IOMMU_SPAPR_TCE_CREATE ioctl to vfio_listener_region_add
> and adds just created IOMMU into the host IOMMU list; the opposite
> action is taken in vfio_listener_region_del.
> 
> When creating a new window, this uses euristic to decide on the TCE table
> levels number.

"heuristic" has an 'h' (yes, English spelling is stupid[0]).

[0] The historical reasons for that are kind of fascinating, though.

> This should cause no guest visible change in behavior.
> 
> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
> ---
> Changes:
> v14:
> * new to the series
> 
> ---
> TODO:
> * export levels to PHB
> ---
>  hw/vfio/common.c | 115 ++++++++++++++++++++++++++++++++++++++++++++++++++-----
>  trace-events     |   2 +
>  2 files changed, 107 insertions(+), 10 deletions(-)
> 
> diff --git a/hw/vfio/common.c b/hw/vfio/common.c
> index 5e5b77c..57a51df 100644
> --- a/hw/vfio/common.c
> +++ b/hw/vfio/common.c
> @@ -279,6 +279,14 @@ static int vfio_host_win_add(VFIOContainer *container,
>      return 0;
>  }
>  
> +static void vfio_host_win_del(VFIOContainer *container, hwaddr min_iova)
> +{
> +    VFIOHostDMAWindow *hostwin = vfio_host_win_lookup(container, min_iova, 1);
> +
> +    g_assert(hostwin);
> +    QLIST_REMOVE(hostwin, hiommu_next);
> +}
> +
>  static bool vfio_listener_skipped_section(MemoryRegionSection *section)
>  {
>      return (!memory_region_is_ram(section->mr) &&
> @@ -392,6 +400,63 @@ static void vfio_listener_region_add(MemoryListener *listener,
>      }
>      end = int128_get64(llend);
>  
> +    if (container->iommu_type == VFIO_SPAPR_TCE_v2_IOMMU) {

I think the "add region" path could do with being split out into a
different function - vfio_listener_region_add() is getting pretty
huge.

> +        unsigned entries, pages, pagesize = qemu_real_host_page_size;
> +        struct vfio_iommu_spapr_tce_create create = { .argsz = sizeof(create) };
> +
> +        trace_vfio_listener_region_add_iommu(iova, end - 1);
> +        if (section->mr->iommu_ops) {
> +            pagesize = section->mr->iommu_ops->get_page_sizes(section->mr);

Since you're querying the guest IOMMU here, I assume pagesize is
supposed to represent *guest* IOMMU pagesizes, in which case it should
default to TARGET_PAGE_SIZE, instead of qemu_real_host_page_size.
(didn't you already have a function which implemented that fallback?)

> +        }
> +        /*
> +         * FIXME: For VFIO iommu types which have KVM acceleration to
> +         * avoid bouncing all map/unmaps through qemu this way, this
> +         * would be the right place to wire that up (tell the KVM
> +         * device emulation the VFIO iommu handles to use).
> +         */
> +        create.window_size = int128_get64(section->size);
> +        create.page_shift = ctz64(pagesize);
> +        /*
> +         * SPAPR host supports multilevel TCE tables, there is some
> +         * euristic to decide how many levels we want for our table:

s/some euristic/a heuristic/

> +         * 0..64 = 1; 65..4096 = 2; 4097..262144 = 3; 262145.. = 4
> +         */
> +        entries = create.window_size >> create.page_shift;
> +        pages = (entries * sizeof(uint64_t)) / getpagesize();
> +        create.levels = ctz64(pow2ceil(pages) - 1) / 6 + 1;
> +
> +        if (vfio_host_win_lookup(container, create.start_addr,
> +                                 create.start_addr + create.window_size - 1)) {
> +            goto fail;

Hmm.. if you successfully look up a host window, it seems to me you
shouldn't fail, but in fact don't even need to create a new window
(the removal path gets harder though, because you need to check if any
guest window requires that host window).

Requiring that the host windows exactly match the guest windows is
probably ok for a first version - except that in that case any overlap
should cause a failure, not just a complete inclusion.

> +        }
> +
> +        ret = ioctl(container->fd, VFIO_IOMMU_SPAPR_TCE_CREATE, &create);
> +        if (ret) {
> +            error_report("Failed to create a window, ret = %d (%m)", ret);
> +            goto fail;
> +        }
> +
> +        if (create.start_addr != section->offset_within_address_space) {
> +            struct vfio_iommu_spapr_tce_remove remove = {
> +                .argsz = sizeof(remove),
> +                .start_addr = create.start_addr
> +            };
> +            error_report("Host doesn't support DMA window at %"HWADDR_PRIx", must be %"PRIx64,
> +                         section->offset_within_address_space,
> +                         create.start_addr);
> +            ioctl(container->fd, VFIO_IOMMU_SPAPR_TCE_REMOVE, &remove);
> +            ret = -EINVAL;
> +            goto fail;
> +        }
> +        trace_vfio_spapr_create_window(create.page_shift,
> +                                       create.window_size,
> +                                       create.start_addr);
> +
> +        vfio_host_win_add(container, create.start_addr,
> +                          create.start_addr + create.window_size - 1,
> +                          1ULL << create.page_shift);
> +    }
> +
>      if (!vfio_host_win_lookup(container, iova, end - 1)) {
>          error_report("vfio: IOMMU container %p can't map guest IOVA region"
>                       " 0x%"HWADDR_PRIx"..0x%"HWADDR_PRIx,
> @@ -525,6 +590,22 @@ static void vfio_listener_region_del(MemoryListener *listener,
>                       container, iova, end - iova, ret);
>      }
>  
> +    if (container->iommu_type == VFIO_SPAPR_TCE_v2_IOMMU) {
> +        struct vfio_iommu_spapr_tce_remove remove = {
> +            .argsz = sizeof(remove),
> +            .start_addr = section->offset_within_address_space,
> +        };
> +        ret = ioctl(container->fd, VFIO_IOMMU_SPAPR_TCE_REMOVE, &remove);
> +        if (ret) {
> +            error_report("Failed to remove window at %"PRIx64,
> +                         remove.start_addr);
> +        }
> +
> +        vfio_host_win_del(container, section->offset_within_address_space);
> +
> +        trace_vfio_spapr_remove_window(remove.start_addr);
> +    }
> +
>      if (iommu && iommu->iommu_ops && iommu->iommu_ops->vfio_stop) {
>          iommu->iommu_ops->vfio_stop(section->mr);
>      }
> @@ -915,11 +996,6 @@ static int vfio_connect_container(VFIOGroup *group, AddressSpace *as)
>              }
>          }
>  
> -        /*
> -         * This only considers the host IOMMU's 32-bit window.  At
> -         * some point we need to add support for the optional 64-bit
> -         * window and dynamic windows
> -         */
>          info.argsz = sizeof(info);
>          ret = ioctl(fd, VFIO_IOMMU_SPAPR_TCE_GET_INFO, &info);
>          if (ret) {
> @@ -928,11 +1004,30 @@ static int vfio_connect_container(VFIOGroup *group, AddressSpace *as)
>              goto listener_release_exit;
>          }
>  
> -        /* The default table uses 4K pages */
> -        vfio_host_win_add(container, info.dma32_window_start,
> -                          info.dma32_window_start +
> -                          info.dma32_window_size - 1,
> -                          0x1000);
> +        if (v2) {
> +            /*
> +             * There is a default window in just created container.
> +             * To make region_add/del simpler, we better remove this
> +             * window now and let those iommu_listener callbacks
> +             * create/remove them when needed.
> +             */
> +            struct vfio_iommu_spapr_tce_remove remove = {
> +                .argsz = sizeof(remove),
> +                .start_addr = info.dma32_window_start,
> +            };
> +            ret = ioctl(fd, VFIO_IOMMU_SPAPR_TCE_REMOVE, &remove);
> +            if (ret) {
> +                error_report("vfio: VFIO_IOMMU_SPAPR_TCE_REMOVE failed: %m");
> +                ret = -errno;
> +                goto free_container_exit;
> +            }
> +        } else {
> +            /* The default table uses 4K pages */
> +            vfio_host_win_add(container, info.dma32_window_start,
> +                              info.dma32_window_start +
> +                              info.dma32_window_size - 1,
> +                              0x1000);
> +        }
>      } else {
>          error_report("vfio: No available IOMMU models");
>          ret = -EINVAL;
> diff --git a/trace-events b/trace-events
> index 23ca0b9..5c651fa 100644
> --- a/trace-events
> +++ b/trace-events
> @@ -1738,6 +1738,8 @@ vfio_region_finalize(const char *name, int index) "Device %s, region %d"
>  vfio_region_mmaps_set_enabled(const char *name, bool enabled) "Region %s mmaps enabled: %d"
>  vfio_ram_register(uint64_t va, uint64_t size, int ret) "va=%"PRIx64" size=%"PRIx64" ret=%d"
>  vfio_ram_unregister(uint64_t va, uint64_t size, int ret) "va=%"PRIx64" size=%"PRIx64" ret=%d"
> +vfio_spapr_create_window(int ps, uint64_t ws, uint64_t off) "pageshift=0x%x winsize=0x%"PRIx64" offset=0x%"PRIx64
> +vfio_spapr_remove_window(uint64_t off) "offset=%"PRIx64
>  
>  # hw/vfio/platform.c
>  vfio_platform_base_device_init(char *name, int groupid) "%s belongs to group #%d"
Alexey Kardashevskiy April 20, 2016, 9:43 a.m. UTC | #2
On 04/07/2016 11:10 AM, David Gibson wrote:
> Subject doesn't seem quite right, since you added at least minimal
> support for the SPAPRv2 IOMMU in the prereg patch.
>
> On Mon, Apr 04, 2016 at 07:33:45PM +1000, Alexey Kardashevskiy wrote:
>> New VFIO_SPAPR_TCE_v2_IOMMU type supports dynamic DMA window management.
>> This adds ability to VFIO common code to dynamically allocate/remove
>> DMA windows in the host kernel when new VFIO container is added/removed.
>>
>> This adds VFIO_IOMMU_SPAPR_TCE_CREATE ioctl to vfio_listener_region_add
>> and adds just created IOMMU into the host IOMMU list; the opposite
>> action is taken in vfio_listener_region_del.
>>
>> When creating a new window, this uses euristic to decide on the TCE table
>> levels number.
>
> "heuristic" has an 'h' (yes, English spelling is stupid[0]).
>
> [0] The historical reasons for that are kind of fascinating, though.

Tried googling, could not spot quickly the reasoning, any hints what to 
google for? Or just a link with an explanation? :)


>
>> This should cause no guest visible change in behavior.
>>
>> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
>> ---
>> Changes:
>> v14:
>> * new to the series
>>
>> ---
>> TODO:
>> * export levels to PHB
>> ---
>>   hw/vfio/common.c | 115 ++++++++++++++++++++++++++++++++++++++++++++++++++-----
>>   trace-events     |   2 +
>>   2 files changed, 107 insertions(+), 10 deletions(-)
>>
>> diff --git a/hw/vfio/common.c b/hw/vfio/common.c
>> index 5e5b77c..57a51df 100644
>> --- a/hw/vfio/common.c
>> +++ b/hw/vfio/common.c
>> @@ -279,6 +279,14 @@ static int vfio_host_win_add(VFIOContainer *container,
>>       return 0;
>>   }
>>
>> +static void vfio_host_win_del(VFIOContainer *container, hwaddr min_iova)
>> +{
>> +    VFIOHostDMAWindow *hostwin = vfio_host_win_lookup(container, min_iova, 1);
>> +
>> +    g_assert(hostwin);
>> +    QLIST_REMOVE(hostwin, hiommu_next);
>> +}
>> +
>>   static bool vfio_listener_skipped_section(MemoryRegionSection *section)
>>   {
>>       return (!memory_region_is_ram(section->mr) &&
>> @@ -392,6 +400,63 @@ static void vfio_listener_region_add(MemoryListener *listener,
>>       }
>>       end = int128_get64(llend);
>>
>> +    if (container->iommu_type == VFIO_SPAPR_TCE_v2_IOMMU) {
>
> I think the "add region" path could do with being split out into a
> different function - vfio_listener_region_add() is getting pretty
> huge.

It is big but not huge and I am trying to avoid having functions with 
"spapr" in their names in common.c as once they appear, we will start 
having a discussion if they should move to a separate file and if they do, 
then may be some other code should too, etc...


>
>> +        unsigned entries, pages, pagesize = qemu_real_host_page_size;
>> +        struct vfio_iommu_spapr_tce_create create = { .argsz = sizeof(create) };
>> +
>> +        trace_vfio_listener_region_add_iommu(iova, end - 1);
>> +        if (section->mr->iommu_ops) {
>> +            pagesize = section->mr->iommu_ops->get_page_sizes(section->mr);
>
> Since you're querying the guest IOMMU here, I assume pagesize is
> supposed to represent *guest* IOMMU pagesizes, in which case it should
> default to TARGET_PAGE_SIZE, instead of qemu_real_host_page_size.
> (didn't you already have a function which implemented that fallback?)

Yes, will use the helper.


>> +        }
>> +        /*
>> +         * FIXME: For VFIO iommu types which have KVM acceleration to
>> +         * avoid bouncing all map/unmaps through qemu this way, this
>> +         * would be the right place to wire that up (tell the KVM
>> +         * device emulation the VFIO iommu handles to use).
>> +         */
>> +        create.window_size = int128_get64(section->size);
>> +        create.page_shift = ctz64(pagesize);
>> +        /*
>> +         * SPAPR host supports multilevel TCE tables, there is some
>> +         * euristic to decide how many levels we want for our table:
>
> s/some euristic/a heuristic/
>
>> +         * 0..64 = 1; 65..4096 = 2; 4097..262144 = 3; 262145.. = 4
>> +         */
>> +        entries = create.window_size >> create.page_shift;
>> +        pages = (entries * sizeof(uint64_t)) / getpagesize();
>> +        create.levels = ctz64(pow2ceil(pages) - 1) / 6 + 1;
>> +
>> +        if (vfio_host_win_lookup(container, create.start_addr,
>> +                                 create.start_addr + create.window_size - 1)) {
>> +            goto fail;
>
> Hmm.. if you successfully look up a host window, it seems to me you
> shouldn't fail, but in fact don't even need to create a new window
> (the removal path gets harder though, because you need to check if any
> guest window requires that host window).


At the moment if the window is there, it is failure in the environment I am 
testing it in. And, having a host kernel which cannot allocate and map 
windows randomly, it is unlikely that I'll have a setup where this spot 
won't mean that something went wrong. x86 case needs lot more than this anyway.


> Requiring that the host windows exactly match the guest windows is
> probably ok for a first version - except that in that case any overlap
> should cause a failure, not just a complete inclusion.

Yes, for now this should fail too, I'll fix it.


>
>> +        }
>> +
>> +        ret = ioctl(container->fd, VFIO_IOMMU_SPAPR_TCE_CREATE, &create);
>> +        if (ret) {
>> +            error_report("Failed to create a window, ret = %d (%m)", ret);
>> +            goto fail;
>> +        }
>> +
>> +        if (create.start_addr != section->offset_within_address_space) {
>> +            struct vfio_iommu_spapr_tce_remove remove = {
>> +                .argsz = sizeof(remove),
>> +                .start_addr = create.start_addr
>> +            };
>> +            error_report("Host doesn't support DMA window at %"HWADDR_PRIx", must be %"PRIx64,
>> +                         section->offset_within_address_space,
>> +                         create.start_addr);
>> +            ioctl(container->fd, VFIO_IOMMU_SPAPR_TCE_REMOVE, &remove);
>> +            ret = -EINVAL;
>> +            goto fail;
>> +        }
>> +        trace_vfio_spapr_create_window(create.page_shift,
>> +                                       create.window_size,
>> +                                       create.start_addr);
>> +
>> +        vfio_host_win_add(container, create.start_addr,
>> +                          create.start_addr + create.window_size - 1,
>> +                          1ULL << create.page_shift);
>> +    }
>> +
>>       if (!vfio_host_win_lookup(container, iova, end - 1)) {
>>           error_report("vfio: IOMMU container %p can't map guest IOVA region"
>>                        " 0x%"HWADDR_PRIx"..0x%"HWADDR_PRIx,
>> @@ -525,6 +590,22 @@ static void vfio_listener_region_del(MemoryListener *listener,
>>                        container, iova, end - iova, ret);
>>       }
>>
>> +    if (container->iommu_type == VFIO_SPAPR_TCE_v2_IOMMU) {
>> +        struct vfio_iommu_spapr_tce_remove remove = {
>> +            .argsz = sizeof(remove),
>> +            .start_addr = section->offset_within_address_space,
>> +        };
>> +        ret = ioctl(container->fd, VFIO_IOMMU_SPAPR_TCE_REMOVE, &remove);
>> +        if (ret) {
>> +            error_report("Failed to remove window at %"PRIx64,
>> +                         remove.start_addr);
>> +        }
>> +
>> +        vfio_host_win_del(container, section->offset_within_address_space);
>> +
>> +        trace_vfio_spapr_remove_window(remove.start_addr);
>> +    }
>> +
>>       if (iommu && iommu->iommu_ops && iommu->iommu_ops->vfio_stop) {
>>           iommu->iommu_ops->vfio_stop(section->mr);
>>       }
>> @@ -915,11 +996,6 @@ static int vfio_connect_container(VFIOGroup *group, AddressSpace *as)
>>               }
>>           }
>>
>> -        /*
>> -         * This only considers the host IOMMU's 32-bit window.  At
>> -         * some point we need to add support for the optional 64-bit
>> -         * window and dynamic windows
>> -         */
>>           info.argsz = sizeof(info);
>>           ret = ioctl(fd, VFIO_IOMMU_SPAPR_TCE_GET_INFO, &info);
>>           if (ret) {
>> @@ -928,11 +1004,30 @@ static int vfio_connect_container(VFIOGroup *group, AddressSpace *as)
>>               goto listener_release_exit;
>>           }
>>
>> -        /* The default table uses 4K pages */
>> -        vfio_host_win_add(container, info.dma32_window_start,
>> -                          info.dma32_window_start +
>> -                          info.dma32_window_size - 1,
>> -                          0x1000);
>> +        if (v2) {
>> +            /*
>> +             * There is a default window in just created container.
>> +             * To make region_add/del simpler, we better remove this
>> +             * window now and let those iommu_listener callbacks
>> +             * create/remove them when needed.
>> +             */
>> +            struct vfio_iommu_spapr_tce_remove remove = {
>> +                .argsz = sizeof(remove),
>> +                .start_addr = info.dma32_window_start,
>> +            };
>> +            ret = ioctl(fd, VFIO_IOMMU_SPAPR_TCE_REMOVE, &remove);
>> +            if (ret) {
>> +                error_report("vfio: VFIO_IOMMU_SPAPR_TCE_REMOVE failed: %m");
>> +                ret = -errno;
>> +                goto free_container_exit;
>> +            }
>> +        } else {
>> +            /* The default table uses 4K pages */
>> +            vfio_host_win_add(container, info.dma32_window_start,
>> +                              info.dma32_window_start +
>> +                              info.dma32_window_size - 1,
>> +                              0x1000);
>> +        }
>>       } else {
>>           error_report("vfio: No available IOMMU models");
>>           ret = -EINVAL;
>> diff --git a/trace-events b/trace-events
>> index 23ca0b9..5c651fa 100644
>> --- a/trace-events
>> +++ b/trace-events
>> @@ -1738,6 +1738,8 @@ vfio_region_finalize(const char *name, int index) "Device %s, region %d"
>>   vfio_region_mmaps_set_enabled(const char *name, bool enabled) "Region %s mmaps enabled: %d"
>>   vfio_ram_register(uint64_t va, uint64_t size, int ret) "va=%"PRIx64" size=%"PRIx64" ret=%d"
>>   vfio_ram_unregister(uint64_t va, uint64_t size, int ret) "va=%"PRIx64" size=%"PRIx64" ret=%d"
>> +vfio_spapr_create_window(int ps, uint64_t ws, uint64_t off) "pageshift=0x%x winsize=0x%"PRIx64" offset=0x%"PRIx64
>> +vfio_spapr_remove_window(uint64_t off) "offset=%"PRIx64
>>
>>   # hw/vfio/platform.c
>>   vfio_platform_base_device_init(char *name, int groupid) "%s belongs to group #%d"
>
David Gibson April 21, 2016, 4:03 a.m. UTC | #3
On Wed, Apr 20, 2016 at 07:43:41PM +1000, Alexey Kardashevskiy wrote:
> On 04/07/2016 11:10 AM, David Gibson wrote:
> >Subject doesn't seem quite right, since you added at least minimal
> >support for the SPAPRv2 IOMMU in the prereg patch.
> >
> >On Mon, Apr 04, 2016 at 07:33:45PM +1000, Alexey Kardashevskiy wrote:
> >>New VFIO_SPAPR_TCE_v2_IOMMU type supports dynamic DMA window management.
> >>This adds ability to VFIO common code to dynamically allocate/remove
> >>DMA windows in the host kernel when new VFIO container is added/removed.
> >>
> >>This adds VFIO_IOMMU_SPAPR_TCE_CREATE ioctl to vfio_listener_region_add
> >>and adds just created IOMMU into the host IOMMU list; the opposite
> >>action is taken in vfio_listener_region_del.
> >>
> >>When creating a new window, this uses euristic to decide on the TCE table
> >>levels number.
> >
> >"heuristic" has an 'h' (yes, English spelling is stupid[0]).
> >
> >[0] The historical reasons for that are kind of fascinating, though.
> 
> Tried googling, could not spot quickly the reasoning, any hints what to
> google for? Or just a link with an explanation? :)

That wasn't a comment about heuristic specifically, but english
spelling in general.  I got most of the fascinating stuff I've seen
from:

http://www.amazon.com/Spell-Out-Enthralling-Extraordinary-Spelling/dp/1250056128/ref=la_B000AP940C_1_5/189-2461131-2789630?s=books&ie=UTF8&qid=1461211129&sr=1-5

> >>This should cause no guest visible change in behavior.
> >>
> >>Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
> >>---
> >>Changes:
> >>v14:
> >>* new to the series
> >>
> >>---
> >>TODO:
> >>* export levels to PHB
> >>---
> >>  hw/vfio/common.c | 115 ++++++++++++++++++++++++++++++++++++++++++++++++++-----
> >>  trace-events     |   2 +
> >>  2 files changed, 107 insertions(+), 10 deletions(-)
> >>
> >>diff --git a/hw/vfio/common.c b/hw/vfio/common.c
> >>index 5e5b77c..57a51df 100644
> >>--- a/hw/vfio/common.c
> >>+++ b/hw/vfio/common.c
> >>@@ -279,6 +279,14 @@ static int vfio_host_win_add(VFIOContainer *container,
> >>      return 0;
> >>  }
> >>
> >>+static void vfio_host_win_del(VFIOContainer *container, hwaddr min_iova)
> >>+{
> >>+    VFIOHostDMAWindow *hostwin = vfio_host_win_lookup(container, min_iova, 1);
> >>+
> >>+    g_assert(hostwin);
> >>+    QLIST_REMOVE(hostwin, hiommu_next);
> >>+}
> >>+
> >>  static bool vfio_listener_skipped_section(MemoryRegionSection *section)
> >>  {
> >>      return (!memory_region_is_ram(section->mr) &&
> >>@@ -392,6 +400,63 @@ static void vfio_listener_region_add(MemoryListener *listener,
> >>      }
> >>      end = int128_get64(llend);
> >>
> >>+    if (container->iommu_type == VFIO_SPAPR_TCE_v2_IOMMU) {
> >
> >I think the "add region" path could do with being split out into a
> >different function - vfio_listener_region_add() is getting pretty
> >huge.
> 
> It is big but not huge and I am trying to avoid having functions with
> "spapr" in their names in common.c as once they appear, we will start having
> a discussion if they should move to a separate file and if they do, then may
> be some other code should too, etc...

I don't think it needs to be a "spapr" named function.  The spapr
backend is the only one to support it for now, but other iommus could
support dynamic windows in future (for example, if I ever get time to
implement the "type2" converged interface I was thinking about, I'd
look to include that).

> >>+        unsigned entries, pages, pagesize = qemu_real_host_page_size;
> >>+        struct vfio_iommu_spapr_tce_create create = { .argsz = sizeof(create) };
> >>+
> >>+        trace_vfio_listener_region_add_iommu(iova, end - 1);
> >>+        if (section->mr->iommu_ops) {
> >>+            pagesize = section->mr->iommu_ops->get_page_sizes(section->mr);
> >
> >Since you're querying the guest IOMMU here, I assume pagesize is
> >supposed to represent *guest* IOMMU pagesizes, in which case it should
> >default to TARGET_PAGE_SIZE, instead of qemu_real_host_page_size.
> >(didn't you already have a function which implemented that fallback?)
> 
> Yes, will use the helper.
> 
> 
> >>+        }
> >>+        /*
> >>+         * FIXME: For VFIO iommu types which have KVM acceleration to
> >>+         * avoid bouncing all map/unmaps through qemu this way, this
> >>+         * would be the right place to wire that up (tell the KVM
> >>+         * device emulation the VFIO iommu handles to use).
> >>+         */
> >>+        create.window_size = int128_get64(section->size);
> >>+        create.page_shift = ctz64(pagesize);
> >>+        /*
> >>+         * SPAPR host supports multilevel TCE tables, there is some
> >>+         * euristic to decide how many levels we want for our table:
> >
> >s/some euristic/a heuristic/
> >
> >>+         * 0..64 = 1; 65..4096 = 2; 4097..262144 = 3; 262145.. = 4
> >>+         */
> >>+        entries = create.window_size >> create.page_shift;
> >>+        pages = (entries * sizeof(uint64_t)) / getpagesize();
> >>+        create.levels = ctz64(pow2ceil(pages) - 1) / 6 + 1;
> >>+
> >>+        if (vfio_host_win_lookup(container, create.start_addr,
> >>+                                 create.start_addr + create.window_size - 1)) {
> >>+            goto fail;
> >
> >Hmm.. if you successfully look up a host window, it seems to me you
> >shouldn't fail, but in fact don't even need to create a new window
> >(the removal path gets harder though, because you need to check if any
> >guest window requires that host window).
> 
> 
> At the moment if the window is there, it is failure in the environment I am
> testing it in. And, having a host kernel which cannot allocate and map
> windows randomly, it is unlikely that I'll have a setup where this spot
> won't mean that something went wrong. x86 case needs lot more than this
> anyway.

Hm, I suppose so.

> 
> 
> >Requiring that the host windows exactly match the guest windows is
> >probably ok for a first version - except that in that case any overlap
> >should cause a failure, not just a complete inclusion.
> 
> Yes, for now this should fail too, I'll fix it.
> 
> 
> >
> >>+        }
> >>+
> >>+        ret = ioctl(container->fd, VFIO_IOMMU_SPAPR_TCE_CREATE, &create);
> >>+        if (ret) {
> >>+            error_report("Failed to create a window, ret = %d (%m)", ret);
> >>+            goto fail;
> >>+        }
> >>+
> >>+        if (create.start_addr != section->offset_within_address_space) {
> >>+            struct vfio_iommu_spapr_tce_remove remove = {
> >>+                .argsz = sizeof(remove),
> >>+                .start_addr = create.start_addr
> >>+            };
> >>+            error_report("Host doesn't support DMA window at %"HWADDR_PRIx", must be %"PRIx64,
> >>+                         section->offset_within_address_space,
> >>+                         create.start_addr);
> >>+            ioctl(container->fd, VFIO_IOMMU_SPAPR_TCE_REMOVE, &remove);
> >>+            ret = -EINVAL;
> >>+            goto fail;
> >>+        }
> >>+        trace_vfio_spapr_create_window(create.page_shift,
> >>+                                       create.window_size,
> >>+                                       create.start_addr);
> >>+
> >>+        vfio_host_win_add(container, create.start_addr,
> >>+                          create.start_addr + create.window_size - 1,
> >>+                          1ULL << create.page_shift);
> >>+    }
> >>+
> >>      if (!vfio_host_win_lookup(container, iova, end - 1)) {
> >>          error_report("vfio: IOMMU container %p can't map guest IOVA region"
> >>                       " 0x%"HWADDR_PRIx"..0x%"HWADDR_PRIx,
> >>@@ -525,6 +590,22 @@ static void vfio_listener_region_del(MemoryListener *listener,
> >>                       container, iova, end - iova, ret);
> >>      }
> >>
> >>+    if (container->iommu_type == VFIO_SPAPR_TCE_v2_IOMMU) {
> >>+        struct vfio_iommu_spapr_tce_remove remove = {
> >>+            .argsz = sizeof(remove),
> >>+            .start_addr = section->offset_within_address_space,
> >>+        };
> >>+        ret = ioctl(container->fd, VFIO_IOMMU_SPAPR_TCE_REMOVE, &remove);
> >>+        if (ret) {
> >>+            error_report("Failed to remove window at %"PRIx64,
> >>+                         remove.start_addr);
> >>+        }
> >>+
> >>+        vfio_host_win_del(container, section->offset_within_address_space);
> >>+
> >>+        trace_vfio_spapr_remove_window(remove.start_addr);
> >>+    }
> >>+
> >>      if (iommu && iommu->iommu_ops && iommu->iommu_ops->vfio_stop) {
> >>          iommu->iommu_ops->vfio_stop(section->mr);
> >>      }
> >>@@ -915,11 +996,6 @@ static int vfio_connect_container(VFIOGroup *group, AddressSpace *as)
> >>              }
> >>          }
> >>
> >>-        /*
> >>-         * This only considers the host IOMMU's 32-bit window.  At
> >>-         * some point we need to add support for the optional 64-bit
> >>-         * window and dynamic windows
> >>-         */
> >>          info.argsz = sizeof(info);
> >>          ret = ioctl(fd, VFIO_IOMMU_SPAPR_TCE_GET_INFO, &info);
> >>          if (ret) {
> >>@@ -928,11 +1004,30 @@ static int vfio_connect_container(VFIOGroup *group, AddressSpace *as)
> >>              goto listener_release_exit;
> >>          }
> >>
> >>-        /* The default table uses 4K pages */
> >>-        vfio_host_win_add(container, info.dma32_window_start,
> >>-                          info.dma32_window_start +
> >>-                          info.dma32_window_size - 1,
> >>-                          0x1000);
> >>+        if (v2) {
> >>+            /*
> >>+             * There is a default window in just created container.
> >>+             * To make region_add/del simpler, we better remove this
> >>+             * window now and let those iommu_listener callbacks
> >>+             * create/remove them when needed.
> >>+             */
> >>+            struct vfio_iommu_spapr_tce_remove remove = {
> >>+                .argsz = sizeof(remove),
> >>+                .start_addr = info.dma32_window_start,
> >>+            };
> >>+            ret = ioctl(fd, VFIO_IOMMU_SPAPR_TCE_REMOVE, &remove);
> >>+            if (ret) {
> >>+                error_report("vfio: VFIO_IOMMU_SPAPR_TCE_REMOVE failed: %m");
> >>+                ret = -errno;
> >>+                goto free_container_exit;
> >>+            }
> >>+        } else {
> >>+            /* The default table uses 4K pages */
> >>+            vfio_host_win_add(container, info.dma32_window_start,
> >>+                              info.dma32_window_start +
> >>+                              info.dma32_window_size - 1,
> >>+                              0x1000);
> >>+        }
> >>      } else {
> >>          error_report("vfio: No available IOMMU models");
> >>          ret = -EINVAL;
> >>diff --git a/trace-events b/trace-events
> >>index 23ca0b9..5c651fa 100644
> >>--- a/trace-events
> >>+++ b/trace-events
> >>@@ -1738,6 +1738,8 @@ vfio_region_finalize(const char *name, int index) "Device %s, region %d"
> >>  vfio_region_mmaps_set_enabled(const char *name, bool enabled) "Region %s mmaps enabled: %d"
> >>  vfio_ram_register(uint64_t va, uint64_t size, int ret) "va=%"PRIx64" size=%"PRIx64" ret=%d"
> >>  vfio_ram_unregister(uint64_t va, uint64_t size, int ret) "va=%"PRIx64" size=%"PRIx64" ret=%d"
> >>+vfio_spapr_create_window(int ps, uint64_t ws, uint64_t off) "pageshift=0x%x winsize=0x%"PRIx64" offset=0x%"PRIx64
> >>+vfio_spapr_remove_window(uint64_t off) "offset=%"PRIx64
> >>
> >>  # hw/vfio/platform.c
> >>  vfio_platform_base_device_init(char *name, int groupid) "%s belongs to group #%d"
> >
> 
>
diff mbox

Patch

diff --git a/hw/vfio/common.c b/hw/vfio/common.c
index 5e5b77c..57a51df 100644
--- a/hw/vfio/common.c
+++ b/hw/vfio/common.c
@@ -279,6 +279,14 @@  static int vfio_host_win_add(VFIOContainer *container,
     return 0;
 }
 
+static void vfio_host_win_del(VFIOContainer *container, hwaddr min_iova)
+{
+    VFIOHostDMAWindow *hostwin = vfio_host_win_lookup(container, min_iova, 1);
+
+    g_assert(hostwin);
+    QLIST_REMOVE(hostwin, hiommu_next);
+}
+
 static bool vfio_listener_skipped_section(MemoryRegionSection *section)
 {
     return (!memory_region_is_ram(section->mr) &&
@@ -392,6 +400,63 @@  static void vfio_listener_region_add(MemoryListener *listener,
     }
     end = int128_get64(llend);
 
+    if (container->iommu_type == VFIO_SPAPR_TCE_v2_IOMMU) {
+        unsigned entries, pages, pagesize = qemu_real_host_page_size;
+        struct vfio_iommu_spapr_tce_create create = { .argsz = sizeof(create) };
+
+        trace_vfio_listener_region_add_iommu(iova, end - 1);
+        if (section->mr->iommu_ops) {
+            pagesize = section->mr->iommu_ops->get_page_sizes(section->mr);
+        }
+        /*
+         * FIXME: For VFIO iommu types which have KVM acceleration to
+         * avoid bouncing all map/unmaps through qemu this way, this
+         * would be the right place to wire that up (tell the KVM
+         * device emulation the VFIO iommu handles to use).
+         */
+        create.window_size = int128_get64(section->size);
+        create.page_shift = ctz64(pagesize);
+        /*
+         * SPAPR host supports multilevel TCE tables, there is some
+         * euristic to decide how many levels we want for our table:
+         * 0..64 = 1; 65..4096 = 2; 4097..262144 = 3; 262145.. = 4
+         */
+        entries = create.window_size >> create.page_shift;
+        pages = (entries * sizeof(uint64_t)) / getpagesize();
+        create.levels = ctz64(pow2ceil(pages) - 1) / 6 + 1;
+
+        if (vfio_host_win_lookup(container, create.start_addr,
+                                 create.start_addr + create.window_size - 1)) {
+            goto fail;
+        }
+
+        ret = ioctl(container->fd, VFIO_IOMMU_SPAPR_TCE_CREATE, &create);
+        if (ret) {
+            error_report("Failed to create a window, ret = %d (%m)", ret);
+            goto fail;
+        }
+
+        if (create.start_addr != section->offset_within_address_space) {
+            struct vfio_iommu_spapr_tce_remove remove = {
+                .argsz = sizeof(remove),
+                .start_addr = create.start_addr
+            };
+            error_report("Host doesn't support DMA window at %"HWADDR_PRIx", must be %"PRIx64,
+                         section->offset_within_address_space,
+                         create.start_addr);
+            ioctl(container->fd, VFIO_IOMMU_SPAPR_TCE_REMOVE, &remove);
+            ret = -EINVAL;
+            goto fail;
+        }
+        trace_vfio_spapr_create_window(create.page_shift,
+                                       create.window_size,
+                                       create.start_addr);
+
+        vfio_host_win_add(container, create.start_addr,
+                          create.start_addr + create.window_size - 1,
+                          1ULL << create.page_shift);
+    }
+
     if (!vfio_host_win_lookup(container, iova, end - 1)) {
         error_report("vfio: IOMMU container %p can't map guest IOVA region"
                      " 0x%"HWADDR_PRIx"..0x%"HWADDR_PRIx,
@@ -525,6 +590,22 @@  static void vfio_listener_region_del(MemoryListener *listener,
                      container, iova, end - iova, ret);
     }
 
+    if (container->iommu_type == VFIO_SPAPR_TCE_v2_IOMMU) {
+        struct vfio_iommu_spapr_tce_remove remove = {
+            .argsz = sizeof(remove),
+            .start_addr = section->offset_within_address_space,
+        };
+        ret = ioctl(container->fd, VFIO_IOMMU_SPAPR_TCE_REMOVE, &remove);
+        if (ret) {
+            error_report("Failed to remove window at %"PRIx64,
+                         remove.start_addr);
+        }
+
+        vfio_host_win_del(container, section->offset_within_address_space);
+
+        trace_vfio_spapr_remove_window(remove.start_addr);
+    }
+
     if (iommu && iommu->iommu_ops && iommu->iommu_ops->vfio_stop) {
         iommu->iommu_ops->vfio_stop(section->mr);
     }
@@ -915,11 +996,6 @@  static int vfio_connect_container(VFIOGroup *group, AddressSpace *as)
             }
         }
 
-        /*
-         * This only considers the host IOMMU's 32-bit window.  At
-         * some point we need to add support for the optional 64-bit
-         * window and dynamic windows
-         */
         info.argsz = sizeof(info);
         ret = ioctl(fd, VFIO_IOMMU_SPAPR_TCE_GET_INFO, &info);
         if (ret) {
@@ -928,11 +1004,30 @@  static int vfio_connect_container(VFIOGroup *group, AddressSpace *as)
             goto listener_release_exit;
         }
 
-        /* The default table uses 4K pages */
-        vfio_host_win_add(container, info.dma32_window_start,
-                          info.dma32_window_start +
-                          info.dma32_window_size - 1,
-                          0x1000);
+        if (v2) {
+            /*
+             * There is a default window in just created container.
+             * To make region_add/del simpler, we better remove this
+             * window now and let those iommu_listener callbacks
+             * create/remove them when needed.
+             */
+            struct vfio_iommu_spapr_tce_remove remove = {
+                .argsz = sizeof(remove),
+                .start_addr = info.dma32_window_start,
+            };
+            ret = ioctl(fd, VFIO_IOMMU_SPAPR_TCE_REMOVE, &remove);
+            if (ret) {
+                error_report("vfio: VFIO_IOMMU_SPAPR_TCE_REMOVE failed: %m");
+                ret = -errno;
+                goto free_container_exit;
+            }
+        } else {
+            /* The default table uses 4K pages */
+            vfio_host_win_add(container, info.dma32_window_start,
+                              info.dma32_window_start +
+                              info.dma32_window_size - 1,
+                              0x1000);
+        }
     } else {
         error_report("vfio: No available IOMMU models");
         ret = -EINVAL;
diff --git a/trace-events b/trace-events
index 23ca0b9..5c651fa 100644
--- a/trace-events
+++ b/trace-events
@@ -1738,6 +1738,8 @@  vfio_region_finalize(const char *name, int index) "Device %s, region %d"
 vfio_region_mmaps_set_enabled(const char *name, bool enabled) "Region %s mmaps enabled: %d"
 vfio_ram_register(uint64_t va, uint64_t size, int ret) "va=%"PRIx64" size=%"PRIx64" ret=%d"
 vfio_ram_unregister(uint64_t va, uint64_t size, int ret) "va=%"PRIx64" size=%"PRIx64" ret=%d"
+vfio_spapr_create_window(int ps, uint64_t ws, uint64_t off) "pageshift=0x%x winsize=0x%"PRIx64" offset=0x%"PRIx64
+vfio_spapr_remove_window(uint64_t off) "offset=%"PRIx64
 
 # hw/vfio/platform.c
 vfio_platform_base_device_init(char *name, int groupid) "%s belongs to group #%d"