
[RFC,1/2] vhost-vdpa: Decouple the IOVA allocator

Message ID 20240821125548.749143-2-jonah.palmer@oracle.com (mailing list archive)
State New, archived
Series Handling aliased guest memory maps in vhost-vDPA SVQs

Commit Message

Jonah Palmer Aug. 21, 2024, 12:55 p.m. UTC
Decouples the IOVA allocator from the IOVA->HVA tree and instead adds
the allocated IOVA range to an IOVA-only tree (iova_map). This IOVA tree
holds all IOVA ranges that have been allocated (e.g. in the IOVA->HVA
tree); ranges are removed from it when they are deallocated.

A new API function vhost_iova_tree_insert() is also created to add an
IOVA->HVA mapping into the IOVA->HVA tree.

Signed-off-by: Jonah Palmer <jonah.palmer@oracle.com>
---
 hw/virtio/vhost-iova-tree.c | 38 ++++++++++++++++++++++++++++++++-----
 hw/virtio/vhost-iova-tree.h |  1 +
 hw/virtio/vhost-vdpa.c      | 31 ++++++++++++++++++++++++------
 net/vhost-vdpa.c            | 13 +++++++++++--
 4 files changed, 70 insertions(+), 13 deletions(-)

Comments

Eugenio Perez Martin Aug. 29, 2024, 4:53 p.m. UTC | #1
On Wed, Aug 21, 2024 at 2:56 PM Jonah Palmer <jonah.palmer@oracle.com> wrote:
>
> Decouples the IOVA allocator from the IOVA->HVA tree and instead adds
> the allocated IOVA range to an IOVA-only tree (iova_map). This IOVA tree
> will hold all IOVA ranges that have been allocated (e.g. in the
> IOVA->HVA tree) and are removed when any IOVA ranges are deallocated.
>
> A new API function vhost_iova_tree_insert() is also created to add a
> IOVA->HVA mapping into the IOVA->HVA tree.
>

I think this is a good first iteration, but we can take steps to
simplify it. Also, it is great to be able to make points about real
code instead of designs in the air :).

I expected vhost_iova_tree_map_alloc to be split between the current
vhost_iova_tree_map_alloc and a vhost_iova_tree_map_alloc_gpa, or
similar. Likewise, both vhost_iova_tree_remove and a
vhost_iova_tree_remove_gpa would be needed.

The first one is used for regions that don't exist in the guest, like
SVQ vrings or CVQ buffers. The second one is the one used by the
memory listener to map the guest regions into the vdpa device.

Implementation wise, only two trees are actually needed:
* Current iova_taddr_map that contains all IOVA->vaddr translations as
seen by the device, so both allocation functions can work on a single
tree. The function iova_tree_find_iova keeps using this one, so the
user does not need to know if the address is from the guest or only
exists in QEMU by using RAMBlock etc. All insert and remove functions
use this tree.
* A new tree that relates IOVA to GPA, that only
vhost_iova_tree_map_alloc_gpa and vhost_iova_tree_remove_gpa uses.

The ideal case is that the key in this new tree is the GPA and the
value is the IOVA. But IOVATree's DMAMap is named the reverse: iova is
the key and translated_addr is the vaddr. We can create a new tree
struct for that, use GTree directly, or do the reverse translation
linearly. As memory add / remove should not be frequent, I think the
simplest is the last option, but I'd be ok with creating a new tree.

vhost_iova_tree_map_alloc_gpa needs to add the map to this new tree
also. Similarly, vhost_iova_tree_remove_gpa must look for the GPA in
this tree, and only remove the associated DMAMap in iova_taddr_map
that matches the IOVA.

Does it make sense to you?

> Signed-off-by: Jonah Palmer <jonah.palmer@oracle.com>
> ---
>  hw/virtio/vhost-iova-tree.c | 38 ++++++++++++++++++++++++++++++++-----
>  hw/virtio/vhost-iova-tree.h |  1 +
>  hw/virtio/vhost-vdpa.c      | 31 ++++++++++++++++++++++++------
>  net/vhost-vdpa.c            | 13 +++++++++++--
>  4 files changed, 70 insertions(+), 13 deletions(-)
>
> diff --git a/hw/virtio/vhost-iova-tree.c b/hw/virtio/vhost-iova-tree.c
> index 3d03395a77..32c03db2f5 100644
> --- a/hw/virtio/vhost-iova-tree.c
> +++ b/hw/virtio/vhost-iova-tree.c
> @@ -28,12 +28,17 @@ struct VhostIOVATree {
>
>      /* IOVA address to qemu memory maps. */
>      IOVATree *iova_taddr_map;
> +
> +    /* IOVA tree (IOVA allocator) */
> +    IOVATree *iova_map;
>  };
>
>  /**
> - * Create a new IOVA tree
> + * Create a new VhostIOVATree with a new set of IOVATree's:

s/IOVA tree/VhostIOVATree/ is good, but I think the rest is more an
implementation detail.

> + * - IOVA allocator (iova_map)
> + * - IOVA->HVA tree (iova_taddr_map)
>   *
> - * Returns the new IOVA tree
> + * Returns the new VhostIOVATree
>   */
>  VhostIOVATree *vhost_iova_tree_new(hwaddr iova_first, hwaddr iova_last)
>  {
> @@ -44,6 +49,7 @@ VhostIOVATree *vhost_iova_tree_new(hwaddr iova_first, hwaddr iova_last)
>      tree->iova_last = iova_last;
>
>      tree->iova_taddr_map = iova_tree_new();
> +    tree->iova_map = iova_tree_new();
>      return tree;
>  }
>
> @@ -53,6 +59,7 @@ VhostIOVATree *vhost_iova_tree_new(hwaddr iova_first, hwaddr iova_last)
>  void vhost_iova_tree_delete(VhostIOVATree *iova_tree)
>  {
>      iova_tree_destroy(iova_tree->iova_taddr_map);
> +    iova_tree_destroy(iova_tree->iova_map);
>      g_free(iova_tree);
>  }
>
> @@ -88,13 +95,12 @@ int vhost_iova_tree_map_alloc(VhostIOVATree *tree, DMAMap *map)
>      /* Some vhost devices do not like addr 0. Skip first page */
>      hwaddr iova_first = tree->iova_first ?: qemu_real_host_page_size();
>
> -    if (map->translated_addr + map->size < map->translated_addr ||

Why remove this condition? If the request is invalid we still need to
return an error here.

Maybe we should move it to iova_tree_alloc_map though.

> -        map->perm == IOMMU_NONE) {
> +    if (map->perm == IOMMU_NONE) {
>          return IOVA_ERR_INVALID;
>      }
>
>      /* Allocate a node in IOVA address */
> -    return iova_tree_alloc_map(tree->iova_taddr_map, map, iova_first,
> +    return iova_tree_alloc_map(tree->iova_map, map, iova_first,
>                                 tree->iova_last);
>  }
>
> @@ -107,4 +113,26 @@ int vhost_iova_tree_map_alloc(VhostIOVATree *tree, DMAMap *map)
>  void vhost_iova_tree_remove(VhostIOVATree *iova_tree, DMAMap map)
>  {
>      iova_tree_remove(iova_tree->iova_taddr_map, map);
> +    iova_tree_remove(iova_tree->iova_map, map);
> +}
> +
> +/**
> + * Insert a new mapping to the IOVA->HVA tree
> + *
> + * @tree: The VhostIOVATree
> + * @map: The iova map
> + *
> + * Returns:
> + * - IOVA_OK if the map fits in the container
> + * - IOVA_ERR_INVALID if the map does not make sense (like size overflow)
> + * - IOVA_ERR_OVERLAP if the IOVA range overlaps with an existing range
> + */
> +int vhost_iova_tree_insert(VhostIOVATree *iova_tree, DMAMap *map)
> +{
> +    if (map->translated_addr + map->size < map->translated_addr ||
> +        map->perm == IOMMU_NONE) {
> +        return IOVA_ERR_INVALID;
> +    }
> +
> +    return iova_tree_insert(iova_tree->iova_taddr_map, map);
>  }
> diff --git a/hw/virtio/vhost-iova-tree.h b/hw/virtio/vhost-iova-tree.h
> index 4adfd79ff0..8bf7b64786 100644
> --- a/hw/virtio/vhost-iova-tree.h
> +++ b/hw/virtio/vhost-iova-tree.h
> @@ -23,5 +23,6 @@ const DMAMap *vhost_iova_tree_find_iova(const VhostIOVATree *iova_tree,
>                                          const DMAMap *map);
>  int vhost_iova_tree_map_alloc(VhostIOVATree *iova_tree, DMAMap *map);
>  void vhost_iova_tree_remove(VhostIOVATree *iova_tree, DMAMap map);
> +int vhost_iova_tree_insert(VhostIOVATree *iova_tree, DMAMap *map);
>
>  #endif
> diff --git a/hw/virtio/vhost-vdpa.c b/hw/virtio/vhost-vdpa.c
> index 3cdaa12ed5..6702459065 100644
> --- a/hw/virtio/vhost-vdpa.c
> +++ b/hw/virtio/vhost-vdpa.c
> @@ -361,10 +361,10 @@ static void vhost_vdpa_listener_region_add(MemoryListener *listener,
>      if (s->shadow_data) {
>          int r;
>
> -        mem_region.translated_addr = (hwaddr)(uintptr_t)vaddr,
>          mem_region.size = int128_get64(llsize) - 1,
>          mem_region.perm = IOMMU_ACCESS_FLAG(true, section->readonly),
>
> +        /* Allocate IOVA range and add the mapping to the IOVA tree */
>          r = vhost_iova_tree_map_alloc(s->iova_tree, &mem_region);
>          if (unlikely(r != IOVA_OK)) {
>              error_report("Can't allocate a mapping (%d)", r);
> @@ -372,6 +372,14 @@ static void vhost_vdpa_listener_region_add(MemoryListener *listener,
>          }
>
>          iova = mem_region.iova;
> +
> +        /* Add mapping to the IOVA->HVA tree */
> +        mem_region.translated_addr = (hwaddr)(uintptr_t)vaddr;
> +        r = vhost_iova_tree_insert(s->iova_tree, &mem_region);
> +        if (unlikely(r != IOVA_OK)) {
> +            error_report("Can't add listener region mapping (%d)", r);
> +            goto fail_map;
> +        }

I'd say it is not intuitive for the caller code.

>      }
>
>      vhost_vdpa_iotlb_batch_begin_once(s);
> @@ -1142,19 +1150,30 @@ static void vhost_vdpa_svq_unmap_rings(struct vhost_dev *dev,
>   *
>   * @v: Vhost-vdpa device
>   * @needle: The area to search iova
> + * @taddr: The translated address (SVQ HVA)
>   * @errorp: Error pointer
>   */
>  static bool vhost_vdpa_svq_map_ring(struct vhost_vdpa *v, DMAMap *needle,
> -                                    Error **errp)
> +                                    hwaddr taddr, Error **errp)
>  {
>      int r;
>
> +    /* Allocate IOVA range and add the mapping to the IOVA tree */
>      r = vhost_iova_tree_map_alloc(v->shared->iova_tree, needle);
>      if (unlikely(r != IOVA_OK)) {
>          error_setg(errp, "Cannot allocate iova (%d)", r);
>          return false;
>      }
>
> +    /* Add mapping to the IOVA->HVA tree */
> +    needle->translated_addr = taddr;
> +    r = vhost_iova_tree_insert(v->shared->iova_tree, needle);
> +    if (unlikely(r != IOVA_OK)) {
> +        error_setg(errp, "Cannot add SVQ vring mapping (%d)", r);
> +        vhost_iova_tree_remove(v->shared->iova_tree, *needle);
> +        return false;
> +    }
> +
>      r = vhost_vdpa_dma_map(v->shared, v->address_space_id, needle->iova,
>                             needle->size + 1,
>                             (void *)(uintptr_t)needle->translated_addr,
> @@ -1192,11 +1211,11 @@ static bool vhost_vdpa_svq_map_rings(struct vhost_dev *dev,
>      vhost_svq_get_vring_addr(svq, &svq_addr);
>
>      driver_region = (DMAMap) {
> -        .translated_addr = svq_addr.desc_user_addr,
>          .size = driver_size - 1,
>          .perm = IOMMU_RO,
>      };
> -    ok = vhost_vdpa_svq_map_ring(v, &driver_region, errp);
> +    ok = vhost_vdpa_svq_map_ring(v, &driver_region, svq_addr.desc_user_addr,
> +                                 errp);
>      if (unlikely(!ok)) {
>          error_prepend(errp, "Cannot create vq driver region: ");
>          return false;
> @@ -1206,11 +1225,11 @@ static bool vhost_vdpa_svq_map_rings(struct vhost_dev *dev,
>      addr->avail_user_addr = driver_region.iova + avail_offset;
>
>      device_region = (DMAMap) {
> -        .translated_addr = svq_addr.used_user_addr,
>          .size = device_size - 1,
>          .perm = IOMMU_RW,
>      };
> -    ok = vhost_vdpa_svq_map_ring(v, &device_region, errp);
> +    ok = vhost_vdpa_svq_map_ring(v, &device_region, svq_addr.used_user_addr,
> +                                 errp);
>      if (unlikely(!ok)) {
>          error_prepend(errp, "Cannot create vq device region: ");
>          vhost_vdpa_svq_unmap_ring(v, driver_region.translated_addr);
> diff --git a/net/vhost-vdpa.c b/net/vhost-vdpa.c
> index 03457ead66..81da956b92 100644
> --- a/net/vhost-vdpa.c
> +++ b/net/vhost-vdpa.c
> @@ -512,15 +512,24 @@ static int vhost_vdpa_cvq_map_buf(struct vhost_vdpa *v, void *buf, size_t size,
>      DMAMap map = {};
>      int r;
>
> -    map.translated_addr = (hwaddr)(uintptr_t)buf;
>      map.size = size - 1;
>      map.perm = write ? IOMMU_RW : IOMMU_RO,
> +
> +    /* Allocate IOVA range and add the mapping to the IOVA tree */
>      r = vhost_iova_tree_map_alloc(v->shared->iova_tree, &map);
>      if (unlikely(r != IOVA_OK)) {
> -        error_report("Cannot map injected element");
> +        error_report("Cannot allocate IOVA range for injected element");
>          return r;
>      }
>
> +    /* Add mapping to the IOVA->HVA tree */
> +    map.translated_addr = (hwaddr)(uintptr_t)buf;
> +    r = vhost_iova_tree_insert(v->shared->iova_tree, &map);
> +    if (unlikely(r != IOVA_OK)) {
> +        error_report("Cannot map injected element into IOVA->HVA tree");
> +        goto dma_map_err;
> +    }
> +
>      r = vhost_vdpa_dma_map(v->shared, v->address_space_id, map.iova,
>                             vhost_vdpa_net_cvq_cmd_page_len(), buf, !write);
>      if (unlikely(r < 0)) {
> --
> 2.43.5
>
Si-Wei Liu Aug. 30, 2024, 4:20 a.m. UTC | #2
On 8/29/2024 9:53 AM, Eugenio Perez Martin wrote:
> On Wed, Aug 21, 2024 at 2:56 PM Jonah Palmer <jonah.palmer@oracle.com> wrote:
>> Decouples the IOVA allocator from the IOVA->HVA tree and instead adds
>> the allocated IOVA range to an IOVA-only tree (iova_map). This IOVA tree
>> will hold all IOVA ranges that have been allocated (e.g. in the
>> IOVA->HVA tree) and are removed when any IOVA ranges are deallocated.
>>
>> A new API function vhost_iova_tree_insert() is also created to add a
>> IOVA->HVA mapping into the IOVA->HVA tree.
>>
> I think this is a good first iteration but we can take steps to
> simplify it. Also, it is great to be able to make points on real code
> instead of designs on the air :).
>
> I expected a split of vhost_iova_tree_map_alloc between the current
> vhost_iova_tree_map_alloc and vhost_iova_tree_map_alloc_gpa, or
> similar. Similarly, a vhost_iova_tree_remove and
> vhost_iova_tree_remove_gpa would be needed.
>
> The first one is used for regions that don't exist in the guest, like
> SVQ vrings or CVQ buffers. The second one is the one used by the
> memory listener to map the guest regions into the vdpa device.
>
> Implementation wise, only two trees are actually needed:
> * Current iova_taddr_map that contains all IOVA->vaddr translations as
> seen by the device, so both allocation functions can work on a single
> tree. The function iova_tree_find_iova keeps using this one, so the
I thought we had a thorough discussion about this and agreed upon the 
decoupled IOVA allocator solution. But maybe I missed something earlier; 
I am not clear how this iova_tree_find_iova function could still work 
with the full IOVA->HVA tree when it comes to aliased memory or 
overlapped HVAs. Granted, for the memory map removal in the 
.region_del() path, we could rely on the GPA tree to locate the 
corresponding IOVA, but how could the translation path figure out 
which IOVA range to return when the vaddr happens to fall in an 
overlapped HVA range? Do we still assume some overlapping order so we 
always return the first match from the tree? Or do we expect every 
current user of iova_tree_find_iova to pass in a GPA rather than an HVA 
and use the vhost_iova_xxx_gpa API variant to look up the IOVA?

Thanks,
-Siwei

Eugenio Perez Martin Aug. 30, 2024, 8:05 a.m. UTC | #3
On Fri, Aug 30, 2024 at 6:20 AM Si-Wei Liu <si-wei.liu@oracle.com> wrote:
>
>
>
> On 8/29/2024 9:53 AM, Eugenio Perez Martin wrote:
> > On Wed, Aug 21, 2024 at 2:56 PM Jonah Palmer <jonah.palmer@oracle.com> wrote:
> >> Decouples the IOVA allocator from the IOVA->HVA tree and instead adds
> >> the allocated IOVA range to an IOVA-only tree (iova_map). This IOVA tree
> >> will hold all IOVA ranges that have been allocated (e.g. in the
> >> IOVA->HVA tree) and are removed when any IOVA ranges are deallocated.
> >>
> >> A new API function vhost_iova_tree_insert() is also created to add a
> >> IOVA->HVA mapping into the IOVA->HVA tree.
> >>
> > I think this is a good first iteration but we can take steps to
> > simplify it. Also, it is great to be able to make points on real code
> > instead of designs on the air :).
> >
> > I expected a split of vhost_iova_tree_map_alloc between the current
> > vhost_iova_tree_map_alloc and vhost_iova_tree_map_alloc_gpa, or
> > similar. Similarly, a vhost_iova_tree_remove and
> > vhost_iova_tree_remove_gpa would be needed.
> >
> > The first one is used for regions that don't exist in the guest, like
> > SVQ vrings or CVQ buffers. The second one is the one used by the
> > memory listener to map the guest regions into the vdpa device.
> >
> > Implementation wise, only two trees are actually needed:
> > * Current iova_taddr_map that contains all IOVA->vaddr translations as
> > seen by the device, so both allocation functions can work on a single
> > tree. The function iova_tree_find_iova keeps using this one, so the
> I thought we had thorough discussion about this and agreed upon the
> decoupled IOVA allocator solution.

My interpretation of it is to leave the allocator as the current one,
and create a new tree keyed by GPA, which is guaranteed to be unique.
But we can talk it over, of course.

> But maybe I missed something earlier,
> I am not clear how come this iova_tree_find_iova function could still
> work with the full IOVA-> HVA tree when it comes to aliased memory or
> overlapped HVAs? Granted, for the memory map removal in the
> .region_del() path, we could rely on the GPA tree to locate the
> corresponding IOVA, but how come the translation path could figure out
> which IOVA range to return when the vaddr happens to fall in an
> overlapped HVA range?

That is not a problem, as they both translate to the same address at the device.

The most complicated situation is where we have a region contained in
another region, and the requested buffer crosses them. If the IOVA
tree returns the inner region, it will return the buffer chained with
the rest of the content in the outer region. Not optimal, but solved
either way.

The only problem that comes to my mind is the case where the inner
region is RO and it is a write command, but I don't think we have this
case in a sane guest. A malicious guest cannot do any harm this way
anyway.

> Do we still assume some overlapping order so we
> always return the first match from the tree? Or we expect every current
> user of iova_tree_find_iova should pass in GPA rather than HVA and use
> the vhost_iova_xxx_gpa API variant to look up IOVA?
>

No, iova_tree_find_iova should keep taking a vaddr, as the result is
guaranteed to be there. Users of VhostIOVATree only need to modify how
they add or remove regions, knowing whether they come from the guest or
not. As shown by this series, it is easier to do that at add / remove
time than at translation time.

> >>   void vhost_iova_tree_remove(VhostIOVATree *iova_tree, DMAMap map);
> >> +int vhost_iova_tree_insert(VhostIOVATree *iova_tree, DMAMap *map);
> >>
> >>   #endif
> >> diff --git a/hw/virtio/vhost-vdpa.c b/hw/virtio/vhost-vdpa.c
> >> index 3cdaa12ed5..6702459065 100644
> >> --- a/hw/virtio/vhost-vdpa.c
> >> +++ b/hw/virtio/vhost-vdpa.c
> >> @@ -361,10 +361,10 @@ static void vhost_vdpa_listener_region_add(MemoryListener *listener,
> >>       if (s->shadow_data) {
> >>           int r;
> >>
> >> -        mem_region.translated_addr = (hwaddr)(uintptr_t)vaddr,
> >>           mem_region.size = int128_get64(llsize) - 1,
> >>           mem_region.perm = IOMMU_ACCESS_FLAG(true, section->readonly),
> >>
> >> +        /* Allocate IOVA range and add the mapping to the IOVA tree */
> >>           r = vhost_iova_tree_map_alloc(s->iova_tree, &mem_region);
> >>           if (unlikely(r != IOVA_OK)) {
> >>               error_report("Can't allocate a mapping (%d)", r);
> >> @@ -372,6 +372,14 @@ static void vhost_vdpa_listener_region_add(MemoryListener *listener,
> >>           }
> >>
> >>           iova = mem_region.iova;
> >> +
> >> +        /* Add mapping to the IOVA->HVA tree */
> >> +        mem_region.translated_addr = (hwaddr)(uintptr_t)vaddr;
> >> +        r = vhost_iova_tree_insert(s->iova_tree, &mem_region);
> >> +        if (unlikely(r != IOVA_OK)) {
> >> +            error_report("Can't add listener region mapping (%d)", r);
> >> +            goto fail_map;
> >> +        }
> > I'd say it is not intuitive for the caller code.
> >
> >>       }
> >>
> >>       vhost_vdpa_iotlb_batch_begin_once(s);
> >> @@ -1142,19 +1150,30 @@ static void vhost_vdpa_svq_unmap_rings(struct vhost_dev *dev,
> >>    *
> >>    * @v: Vhost-vdpa device
> >>    * @needle: The area to search iova
> >> + * @taddr: The translated address (SVQ HVA)
> >>    * @errorp: Error pointer
> >>    */
> >>   static bool vhost_vdpa_svq_map_ring(struct vhost_vdpa *v, DMAMap *needle,
> >> -                                    Error **errp)
> >> +                                    hwaddr taddr, Error **errp)
> >>   {
> >>       int r;
> >>
> >> +    /* Allocate IOVA range and add the mapping to the IOVA tree */
> >>       r = vhost_iova_tree_map_alloc(v->shared->iova_tree, needle);
> >>       if (unlikely(r != IOVA_OK)) {
> >>           error_setg(errp, "Cannot allocate iova (%d)", r);
> >>           return false;
> >>       }
> >>
> >> +    /* Add mapping to the IOVA->HVA tree */
> >> +    needle->translated_addr = taddr;
> >> +    r = vhost_iova_tree_insert(v->shared->iova_tree, needle);
> >> +    if (unlikely(r != IOVA_OK)) {
> >> +        error_setg(errp, "Cannot add SVQ vring mapping (%d)", r);
> >> +        vhost_iova_tree_remove(v->shared->iova_tree, *needle);
> >> +        return false;
> >> +    }
> >> +
> >>       r = vhost_vdpa_dma_map(v->shared, v->address_space_id, needle->iova,
> >>                              needle->size + 1,
> >>                              (void *)(uintptr_t)needle->translated_addr,
> >> @@ -1192,11 +1211,11 @@ static bool vhost_vdpa_svq_map_rings(struct vhost_dev *dev,
> >>       vhost_svq_get_vring_addr(svq, &svq_addr);
> >>
> >>       driver_region = (DMAMap) {
> >> -        .translated_addr = svq_addr.desc_user_addr,
> >>           .size = driver_size - 1,
> >>           .perm = IOMMU_RO,
> >>       };
> >> -    ok = vhost_vdpa_svq_map_ring(v, &driver_region, errp);
> >> +    ok = vhost_vdpa_svq_map_ring(v, &driver_region, svq_addr.desc_user_addr,
> >> +                                 errp);
> >>       if (unlikely(!ok)) {
> >>           error_prepend(errp, "Cannot create vq driver region: ");
> >>           return false;
> >> @@ -1206,11 +1225,11 @@ static bool vhost_vdpa_svq_map_rings(struct vhost_dev *dev,
> >>       addr->avail_user_addr = driver_region.iova + avail_offset;
> >>
> >>       device_region = (DMAMap) {
> >> -        .translated_addr = svq_addr.used_user_addr,
> >>           .size = device_size - 1,
> >>           .perm = IOMMU_RW,
> >>       };
> >> -    ok = vhost_vdpa_svq_map_ring(v, &device_region, errp);
> >> +    ok = vhost_vdpa_svq_map_ring(v, &device_region, svq_addr.used_user_addr,
> >> +                                 errp);
> >>       if (unlikely(!ok)) {
> >>           error_prepend(errp, "Cannot create vq device region: ");
> >>           vhost_vdpa_svq_unmap_ring(v, driver_region.translated_addr);
> >> diff --git a/net/vhost-vdpa.c b/net/vhost-vdpa.c
> >> index 03457ead66..81da956b92 100644
> >> --- a/net/vhost-vdpa.c
> >> +++ b/net/vhost-vdpa.c
> >> @@ -512,15 +512,24 @@ static int vhost_vdpa_cvq_map_buf(struct vhost_vdpa *v, void *buf, size_t size,
> >>       DMAMap map = {};
> >>       int r;
> >>
> >> -    map.translated_addr = (hwaddr)(uintptr_t)buf;
> >>       map.size = size - 1;
> >>       map.perm = write ? IOMMU_RW : IOMMU_RO,
> >> +
> >> +    /* Allocate IOVA range and add the mapping to the IOVA tree */
> >>       r = vhost_iova_tree_map_alloc(v->shared->iova_tree, &map);
> >>       if (unlikely(r != IOVA_OK)) {
> >> -        error_report("Cannot map injected element");
> >> +        error_report("Cannot allocate IOVA range for injected element");
> >>           return r;
> >>       }
> >>
> >> +    /* Add mapping to the IOVA->HVA tree */
> >> +    map.translated_addr = (hwaddr)(uintptr_t)buf;
> >> +    r = vhost_iova_tree_insert(v->shared->iova_tree, &map);
> >> +    if (unlikely(r != IOVA_OK)) {
> >> +        error_report("Cannot map injected element into IOVA->HVA tree");
> >> +        goto dma_map_err;
> >> +    }
> >> +
> >>       r = vhost_vdpa_dma_map(v->shared, v->address_space_id, map.iova,
> >>                              vhost_vdpa_net_cvq_cmd_page_len(), buf, !write);
> >>       if (unlikely(r < 0)) {
> >> --
> >> 2.43.5
> >>
>
Jonah Palmer Aug. 30, 2024, 1:51 p.m. UTC | #4
On 8/30/24 4:05 AM, Eugenio Perez Martin wrote:
> On Fri, Aug 30, 2024 at 6:20 AM Si-Wei Liu <si-wei.liu@oracle.com> wrote:
>>
>>
>>
>> On 8/29/2024 9:53 AM, Eugenio Perez Martin wrote:
>>> On Wed, Aug 21, 2024 at 2:56 PM Jonah Palmer <jonah.palmer@oracle.com> wrote:
>>>> Decouples the IOVA allocator from the IOVA->HVA tree and instead adds
>>>> the allocated IOVA range to an IOVA-only tree (iova_map). This IOVA tree
>>>> will hold all IOVA ranges that have been allocated (e.g. in the
>>>> IOVA->HVA tree) and are removed when any IOVA ranges are deallocated.
>>>>
>>>> A new API function vhost_iova_tree_insert() is also created to add a
>>>> IOVA->HVA mapping into the IOVA->HVA tree.
>>>>
>>> I think this is a good first iteration but we can take steps to
>>> simplify it. Also, it is great to be able to make points on real code
>>> instead of designs on the air :).
>>>

I can add more comments in the code if this is what you mean, no problem!

>>> I expected a split of vhost_iova_tree_map_alloc between the current
>>> vhost_iova_tree_map_alloc and vhost_iova_tree_map_alloc_gpa, or
>>> similar. Similarly, a vhost_iova_tree_remove and
>>> vhost_iova_tree_remove_gpa would be needed.
>>>
>>> The first one is used for regions that don't exist in the guest, like
>>> SVQ vrings or CVQ buffers. The second one is the one used by the
>>> memory listener to map the guest regions into the vdpa device.
>>>
>>> Implementation wise, only two trees are actually needed:
>>> * Current iova_taddr_map that contains all IOVA->vaddr translations as
>>> seen by the device, so both allocation functions can work on a single
>>> tree. The function iova_tree_find_iova keeps using this one, so the
>> I thought we had thorough discussion about this and agreed upon the
>> decoupled IOVA allocator solution.
> 
> My interpretation of it is to leave the allocator as the current one,
> and create a new tree with GPA which is guaranteed to be unique. But
> we can talk over it of course.
> 

So you mean keep the full IOVA->HVA tree and also have a GPA->IOVA tree 
for guest memory regions, correct?

>> But maybe I missed something earlier,
>> I am not clear how come this iova_tree_find_iova function could still
>> work with the full IOVA-> HVA tree when it comes to aliased memory or
>> overlapped HVAs? Granted, for the memory map removal in the
>> .region_del() path, we could rely on the GPA tree to locate the
>> corresponding IOVA, but how come the translation path could figure out
>> which IOVA range to return when the vaddr happens to fall in an
>> overlapped HVA range?
> 
> That is not a problem, as they both translate to the same address at the device.
> 
> The most complicated situation is where we have a region contained in
> another region, and the requested buffer crosses them. If the IOVA
> tree returns the inner region, it will return the buffer chained with
> the rest of the content in the outer region. Not optimal, but solved
> either way.
> 
> The only problem that comes to my mind is the case where the inner
> region is RO and it is a write command, but I don't think we have this
> case in a sane guest. A malicious guest cannot do any harm this way
> anyway.
> 
>> Do we still assume some overlapping order so we
>> always return the first match from the tree? Or we expect every current
>> user of iova_tree_find_iova should pass in GPA rather than HVA and use
>> the vhost_iova_xxx_gpa API variant to look up IOVA?
>>
> 
> No, iova_tree_find_iova should keep asking for vaddr, as the result is
> guaranteed to be there. Users of VhostIOVATree only need to modify how
> they add or remove regions, knowing if they come from the guest or
> not. As shown by this series, it is easier to do in that place than in
> translation.
> 
>> Thanks,
>> -Siwei
>>
>>> user does not need to know if the address is from the guest or only
>>> exists in QEMU by using RAMBlock etc. All insert and remove functions
>>> use this tree.
>>> * A new tree that relates IOVA to GPA, that only
>>> vhost_iova_tree_map_alloc_gpa and vhost_iova_tree_remove_gpa uses.
>>>
>>> The ideal case is that the key in this new tree is the GPA and the
>>> value is the IOVA. But IOVATree's DMA is named the reverse: iova is
>>> the key and translated_addr is the vaddr. We can create a new tree
>>> struct for that, use GTree directly, or translate the reverse
>>> linearly. As memory add / remove should not be frequent, I think the
>>> simpler is the last one, but I'd be ok with creating a new tree.
>>>

Is the concern here that making the gpa_map (GPA->IOVA tree) of type 
IOVATree can be confusing for users from an API perspective?

In other words, IOVATree users should always use its iova member as the 
key and the translated_addr member as the value (and thus IOVATree 
gpa_map feels out of place since it uses the translated_addr as the key 
and iova as the value)?

Also, could you elaborate a bit more on "translate the reverse 
linearly"? Do you mean to create an IOVA->GPA tree but always search the 
tree using the GPA (e.g. via iova_tree_find_iova)?

>>> vhost_iova_tree_map_alloc_gpa needs to add the map to this new tree
>>> also. Similarly, vhost_iova_tree_remove_gpa must look for the GPA in
>>> this tree, and only remove the associated DMAMap in iova_taddr_map
>>> that matches the IOVA.
>>>
>>> Does it make sense to you?

Would using a name like vhost_iova_tree_map_alloc_gpa seem a bit 
misleading given that we're already allocating the IOVA range in 
vhost_iova_tree_map_alloc? It seems this would be more of an insertion 
than an allocation when adding a map to the GPA->IOVA tree.

Also, are you saying that vhost_iova_tree_remove_gpa only removes the 
DMAMap in the IOVA->HVA tree or should it also remove the corresponding 
mapping in the GPA->IOVA tree?

>>>
>>>> Signed-off-by: Jonah Palmer <jonah.palmer@oracle.com>
>>>> ---
>>>>    hw/virtio/vhost-iova-tree.c | 38 ++++++++++++++++++++++++++++++++-----
>>>>    hw/virtio/vhost-iova-tree.h |  1 +
>>>>    hw/virtio/vhost-vdpa.c      | 31 ++++++++++++++++++++++++------
>>>>    net/vhost-vdpa.c            | 13 +++++++++++--
>>>>    4 files changed, 70 insertions(+), 13 deletions(-)
>>>>
>>>> diff --git a/hw/virtio/vhost-iova-tree.c b/hw/virtio/vhost-iova-tree.c
>>>> index 3d03395a77..32c03db2f5 100644
>>>> --- a/hw/virtio/vhost-iova-tree.c
>>>> +++ b/hw/virtio/vhost-iova-tree.c
>>>> @@ -28,12 +28,17 @@ struct VhostIOVATree {
>>>>
>>>>        /* IOVA address to qemu memory maps. */
>>>>        IOVATree *iova_taddr_map;
>>>> +
>>>> +    /* IOVA tree (IOVA allocator) */
>>>> +    IOVATree *iova_map;
>>>>    };
>>>>
>>>>    /**
>>>> - * Create a new IOVA tree
>>>> + * Create a new VhostIOVATree with a new set of IOVATree's:
>>> s/IOVA tree/VhostIOVATree/ is good, but I think the rest is more an
>>> implementation detail.
>>>

Gotcha. Would you like me to remove the other comments then?

>>>> + * - IOVA allocator (iova_map)
>>>> + * - IOVA->HVA tree (iova_taddr_map)
>>>>     *
>>>> - * Returns the new IOVA tree
>>>> + * Returns the new VhostIOVATree
>>>>     */
>>>>    VhostIOVATree *vhost_iova_tree_new(hwaddr iova_first, hwaddr iova_last)
>>>>    {
>>>> @@ -44,6 +49,7 @@ VhostIOVATree *vhost_iova_tree_new(hwaddr iova_first, hwaddr iova_last)
>>>>        tree->iova_last = iova_last;
>>>>
>>>>        tree->iova_taddr_map = iova_tree_new();
>>>> +    tree->iova_map = iova_tree_new();
>>>>        return tree;
>>>>    }
>>>>
>>>> @@ -53,6 +59,7 @@ VhostIOVATree *vhost_iova_tree_new(hwaddr iova_first, hwaddr iova_last)
>>>>    void vhost_iova_tree_delete(VhostIOVATree *iova_tree)
>>>>    {
>>>>        iova_tree_destroy(iova_tree->iova_taddr_map);
>>>> +    iova_tree_destroy(iova_tree->iova_map);
>>>>        g_free(iova_tree);
>>>>    }
>>>>
>>>> @@ -88,13 +95,12 @@ int vhost_iova_tree_map_alloc(VhostIOVATree *tree, DMAMap *map)
>>>>        /* Some vhost devices do not like addr 0. Skip first page */
>>>>        hwaddr iova_first = tree->iova_first ?: qemu_real_host_page_size();
>>>>
>>>> -    if (map->translated_addr + map->size < map->translated_addr ||
>>> Why remove this condition? If the request is invalid we still need to
>>> return an error here.
>>>
>>> Maybe we should move it to iova_tree_alloc_map though.
>>>

This series decouples the IOVA allocator from also adding a mapping to 
the IOVA->HVA tree; the allocator now only adds IOVA ranges to an 
IOVA-only tree, so those allocations carry no translated_addr value to 
check against.

The check was moved to vhost_iova_tree_insert since that function now 
covers adding the host-only memory mappings to the IOVA->SVQ HVA tree.

>>>> -        map->perm == IOMMU_NONE) {
>>>> +    if (map->perm == IOMMU_NONE) {
>>>>            return IOVA_ERR_INVALID;
>>>>        }
>>>>
>>>>        /* Allocate a node in IOVA address */
>>>> -    return iova_tree_alloc_map(tree->iova_taddr_map, map, iova_first,
>>>> +    return iova_tree_alloc_map(tree->iova_map, map, iova_first,
>>>>                                   tree->iova_last);
>>>>    }
>>>>
>>>> @@ -107,4 +113,26 @@ int vhost_iova_tree_map_alloc(VhostIOVATree *tree, DMAMap *map)
>>>>    void vhost_iova_tree_remove(VhostIOVATree *iova_tree, DMAMap map)
>>>>    {
>>>>        iova_tree_remove(iova_tree->iova_taddr_map, map);
>>>> +    iova_tree_remove(iova_tree->iova_map, map);
>>>> +}
>>>> +
>>>> +/**
>>>> + * Insert a new mapping to the IOVA->HVA tree
>>>> + *
>>>> + * @tree: The VhostIOVATree
>>>> + * @map: The iova map
>>>> + *
>>>> + * Returns:
>>>> + * - IOVA_OK if the map fits in the container
>>>> + * - IOVA_ERR_INVALID if the map does not make sense (like size overflow)
>>>> + * - IOVA_ERR_OVERLAP if the IOVA range overlaps with an existing range
>>>> + */
>>>> +int vhost_iova_tree_insert(VhostIOVATree *iova_tree, DMAMap *map)
>>>> +{
>>>> +    if (map->translated_addr + map->size < map->translated_addr ||
>>>> +        map->perm == IOMMU_NONE) {
>>>> +        return IOVA_ERR_INVALID;
>>>> +    }
>>>> +
>>>> +    return iova_tree_insert(iova_tree->iova_taddr_map, map);
>>>>    }
>>>> diff --git a/hw/virtio/vhost-iova-tree.h b/hw/virtio/vhost-iova-tree.h
>>>> index 4adfd79ff0..8bf7b64786 100644
>>>> --- a/hw/virtio/vhost-iova-tree.h
>>>> +++ b/hw/virtio/vhost-iova-tree.h
>>>> @@ -23,5 +23,6 @@ const DMAMap *vhost_iova_tree_find_iova(const VhostIOVATree *iova_tree,
>>>>                                            const DMAMap *map);
>>>>    int vhost_iova_tree_map_alloc(VhostIOVATree *iova_tree, DMAMap *map);
>>>>    void vhost_iova_tree_remove(VhostIOVATree *iova_tree, DMAMap map);
>>>> +int vhost_iova_tree_insert(VhostIOVATree *iova_tree, DMAMap *map);
>>>>
>>>>    #endif
>>>> diff --git a/hw/virtio/vhost-vdpa.c b/hw/virtio/vhost-vdpa.c
>>>> index 3cdaa12ed5..6702459065 100644
>>>> --- a/hw/virtio/vhost-vdpa.c
>>>> +++ b/hw/virtio/vhost-vdpa.c
>>>> @@ -361,10 +361,10 @@ static void vhost_vdpa_listener_region_add(MemoryListener *listener,
>>>>        if (s->shadow_data) {
>>>>            int r;
>>>>
>>>> -        mem_region.translated_addr = (hwaddr)(uintptr_t)vaddr,
>>>>            mem_region.size = int128_get64(llsize) - 1,
>>>>            mem_region.perm = IOMMU_ACCESS_FLAG(true, section->readonly),
>>>>
>>>> +        /* Allocate IOVA range and add the mapping to the IOVA tree */
>>>>            r = vhost_iova_tree_map_alloc(s->iova_tree, &mem_region);
>>>>            if (unlikely(r != IOVA_OK)) {
>>>>                error_report("Can't allocate a mapping (%d)", r);
>>>> @@ -372,6 +372,14 @@ static void vhost_vdpa_listener_region_add(MemoryListener *listener,
>>>>            }
>>>>
>>>>            iova = mem_region.iova;
>>>> +
>>>> +        /* Add mapping to the IOVA->HVA tree */
>>>> +        mem_region.translated_addr = (hwaddr)(uintptr_t)vaddr;
>>>> +        r = vhost_iova_tree_insert(s->iova_tree, &mem_region);
>>>> +        if (unlikely(r != IOVA_OK)) {
>>>> +            error_report("Can't add listener region mapping (%d)", r);
>>>> +            goto fail_map;
>>>> +        }
>>> I'd say it is not intuitive for the caller code.
>>>

Sorry, I'm not sure what you mean by this here. Would you mind 
elaborating a bit more?

>>>>        }
>>>>
>>>>        vhost_vdpa_iotlb_batch_begin_once(s);
>>>> @@ -1142,19 +1150,30 @@ static void vhost_vdpa_svq_unmap_rings(struct vhost_dev *dev,
>>>>     *
>>>>     * @v: Vhost-vdpa device
>>>>     * @needle: The area to search iova
>>>> + * @taddr: The translated address (SVQ HVA)
>>>>     * @errorp: Error pointer
>>>>     */
>>>>    static bool vhost_vdpa_svq_map_ring(struct vhost_vdpa *v, DMAMap *needle,
>>>> -                                    Error **errp)
>>>> +                                    hwaddr taddr, Error **errp)
>>>>    {
>>>>        int r;
>>>>
>>>> +    /* Allocate IOVA range and add the mapping to the IOVA tree */
>>>>        r = vhost_iova_tree_map_alloc(v->shared->iova_tree, needle);
>>>>        if (unlikely(r != IOVA_OK)) {
>>>>            error_setg(errp, "Cannot allocate iova (%d)", r);
>>>>            return false;
>>>>        }
>>>>
>>>> +    /* Add mapping to the IOVA->HVA tree */
>>>> +    needle->translated_addr = taddr;
>>>> +    r = vhost_iova_tree_insert(v->shared->iova_tree, needle);
>>>> +    if (unlikely(r != IOVA_OK)) {
>>>> +        error_setg(errp, "Cannot add SVQ vring mapping (%d)", r);
>>>> +        vhost_iova_tree_remove(v->shared->iova_tree, *needle);
>>>> +        return false;
>>>> +    }
>>>> +
>>>>        r = vhost_vdpa_dma_map(v->shared, v->address_space_id, needle->iova,
>>>>                               needle->size + 1,
>>>>                               (void *)(uintptr_t)needle->translated_addr,
>>>> @@ -1192,11 +1211,11 @@ static bool vhost_vdpa_svq_map_rings(struct vhost_dev *dev,
>>>>        vhost_svq_get_vring_addr(svq, &svq_addr);
>>>>
>>>>        driver_region = (DMAMap) {
>>>> -        .translated_addr = svq_addr.desc_user_addr,
>>>>            .size = driver_size - 1,
>>>>            .perm = IOMMU_RO,
>>>>        };
>>>> -    ok = vhost_vdpa_svq_map_ring(v, &driver_region, errp);
>>>> +    ok = vhost_vdpa_svq_map_ring(v, &driver_region, svq_addr.desc_user_addr,
>>>> +                                 errp);
>>>>        if (unlikely(!ok)) {
>>>>            error_prepend(errp, "Cannot create vq driver region: ");
>>>>            return false;
>>>> @@ -1206,11 +1225,11 @@ static bool vhost_vdpa_svq_map_rings(struct vhost_dev *dev,
>>>>        addr->avail_user_addr = driver_region.iova + avail_offset;
>>>>
>>>>        device_region = (DMAMap) {
>>>> -        .translated_addr = svq_addr.used_user_addr,
>>>>            .size = device_size - 1,
>>>>            .perm = IOMMU_RW,
>>>>        };
>>>> -    ok = vhost_vdpa_svq_map_ring(v, &device_region, errp);
>>>> +    ok = vhost_vdpa_svq_map_ring(v, &device_region, svq_addr.used_user_addr,
>>>> +                                 errp);
>>>>        if (unlikely(!ok)) {
>>>>            error_prepend(errp, "Cannot create vq device region: ");
>>>>            vhost_vdpa_svq_unmap_ring(v, driver_region.translated_addr);
>>>> diff --git a/net/vhost-vdpa.c b/net/vhost-vdpa.c
>>>> index 03457ead66..81da956b92 100644
>>>> --- a/net/vhost-vdpa.c
>>>> +++ b/net/vhost-vdpa.c
>>>> @@ -512,15 +512,24 @@ static int vhost_vdpa_cvq_map_buf(struct vhost_vdpa *v, void *buf, size_t size,
>>>>        DMAMap map = {};
>>>>        int r;
>>>>
>>>> -    map.translated_addr = (hwaddr)(uintptr_t)buf;
>>>>        map.size = size - 1;
>>>>        map.perm = write ? IOMMU_RW : IOMMU_RO,
>>>> +
>>>> +    /* Allocate IOVA range and add the mapping to the IOVA tree */
>>>>        r = vhost_iova_tree_map_alloc(v->shared->iova_tree, &map);
>>>>        if (unlikely(r != IOVA_OK)) {
>>>> -        error_report("Cannot map injected element");
>>>> +        error_report("Cannot allocate IOVA range for injected element");
>>>>            return r;
>>>>        }
>>>>
>>>> +    /* Add mapping to the IOVA->HVA tree */
>>>> +    map.translated_addr = (hwaddr)(uintptr_t)buf;
>>>> +    r = vhost_iova_tree_insert(v->shared->iova_tree, &map);
>>>> +    if (unlikely(r != IOVA_OK)) {
>>>> +        error_report("Cannot map injected element into IOVA->HVA tree");
>>>> +        goto dma_map_err;
>>>> +    }
>>>> +
>>>>        r = vhost_vdpa_dma_map(v->shared, v->address_space_id, map.iova,
>>>>                               vhost_vdpa_net_cvq_cmd_page_len(), buf, !write);
>>>>        if (unlikely(r < 0)) {
>>>> --
>>>> 2.43.5
>>>>
>>
>
Eugenio Perez Martin Aug. 30, 2024, 2:50 p.m. UTC | #5
On Fri, Aug 30, 2024 at 3:52 PM Jonah Palmer <jonah.palmer@oracle.com> wrote:
>
>
>
> On 8/30/24 4:05 AM, Eugenio Perez Martin wrote:
> > On Fri, Aug 30, 2024 at 6:20 AM Si-Wei Liu <si-wei.liu@oracle.com> wrote:
> >>
> >>
> >>
> >> On 8/29/2024 9:53 AM, Eugenio Perez Martin wrote:
> >>> On Wed, Aug 21, 2024 at 2:56 PM Jonah Palmer <jonah.palmer@oracle.com> wrote:
> >>>> Decouples the IOVA allocator from the IOVA->HVA tree and instead adds
> >>>> the allocated IOVA range to an IOVA-only tree (iova_map). This IOVA tree
> >>>> will hold all IOVA ranges that have been allocated (e.g. in the
> >>>> IOVA->HVA tree) and are removed when any IOVA ranges are deallocated.
> >>>>
> >>>> A new API function vhost_iova_tree_insert() is also created to add a
> >>>> IOVA->HVA mapping into the IOVA->HVA tree.
> >>>>
> >>> I think this is a good first iteration but we can take steps to
> >>> simplify it. Also, it is great to be able to make points on real code
> >>> instead of designs on the air :).
> >>>
>
> I can add more comments in the code if this is what you mean, no problem!
>

No action needed for this feedback :). I just meant that it is easier
to iterate on code than to keep designing by talking at this stage.

> >>> I expected a split of vhost_iova_tree_map_alloc between the current
> >>> vhost_iova_tree_map_alloc and vhost_iova_tree_map_alloc_gpa, or
> >>> similar. Similarly, a vhost_iova_tree_remove and
> >>> vhost_iova_tree_remove_gpa would be needed.
> >>>
> >>> The first one is used for regions that don't exist in the guest, like
> >>> SVQ vrings or CVQ buffers. The second one is the one used by the
> >>> memory listener to map the guest regions into the vdpa device.
> >>>
> >>> Implementation wise, only two trees are actually needed:
> >>> * Current iova_taddr_map that contains all IOVA->vaddr translations as
> >>> seen by the device, so both allocation functions can work on a single
> >>> tree. The function iova_tree_find_iova keeps using this one, so the
> >> I thought we had thorough discussion about this and agreed upon the
> >> decoupled IOVA allocator solution.
> >
> > My interpretation of it is to leave the allocator as the current one,
> > and create a new tree with GPA which is guaranteed to be unique. But
> > we can talk over it of course.
> >
>
> So you mean keep the full IOVA->HVA tree but also have a GPA->IOVA tree
> as well for guest memory regions, correct?
>

Right.

> >> But maybe I missed something earlier,
> >> I am not clear how come this iova_tree_find_iova function could still
> >> work with the full IOVA-> HVA tree when it comes to aliased memory or
> >> overlapped HVAs? Granted, for the memory map removal in the
> >> .region_del() path, we could rely on the GPA tree to locate the
> >> corresponding IOVA, but how come the translation path could figure out
> >> which IOVA range to return when the vaddr happens to fall in an
> >> overlapped HVA range?
> >
> > That is not a problem, as they both translate to the same address at the device.
> >
> > The most complicated situation is where we have a region contained in
> > another region, and the requested buffer crosses them. If the IOVA
> > tree returns the inner region, it will return the buffer chained with
> > the rest of the content in the outer region. Not optimal, but solved
> > either way.
> >
> > The only problem that comes to my mind is the case where the inner
> > region is RO and it is a write command, but I don't think we have this
> > case in a sane guest. A malicious guest cannot do any harm this way
> > anyway.
> >
> >> Do we still assume some overlapping order so we
> >> always return the first match from the tree? Or we expect every current
> >> user of iova_tree_find_iova should pass in GPA rather than HVA and use
> >> the vhost_iova_xxx_gpa API variant to look up IOVA?
> >>
> >
> > No, iova_tree_find_iova should keep asking for vaddr, as the result is
> > guaranteed to be there. Users of VhostIOVATree only need to modify how
> > they add or remove regions, knowing if they come from the guest or
> > not. As shown by this series, it is easier to do in that place than in
> > translation.
> >
> >> Thanks,
> >> -Siwei
> >>
> >>> user does not need to know if the address is from the guest or only
> >>> exists in QEMU by using RAMBlock etc. All insert and remove functions
> >>> use this tree.
> >>> * A new tree that relates IOVA to GPA, that only
> >>> vhost_iova_tree_map_alloc_gpa and vhost_iova_tree_remove_gpa uses.
> >>>
> >>> The ideal case is that the key in this new tree is the GPA and the
> >>> value is the IOVA. But IOVATree's DMA is named the reverse: iova is
> >>> the key and translated_addr is the vaddr. We can create a new tree
> >>> struct for that, use GTree directly, or translate the reverse
> >>> linearly. As memory add / remove should not be frequent, I think the
> >>> simpler is the last one, but I'd be ok with creating a new tree.
> >>>
>
> Is the concern here that making the gpa_map (GPA->IOVA tree) of type
> IOVATree can be confusing for users from an API perspective?
>
> In other words, IOVATree users should always use its iova member as the
> key and the translated_addr member as the value (and thus IOVATree
> gpa_map feels out of place since it uses the translated_addr as the key
> and iova as the value)?
>

Totally right.

> Also, could you elaborate a bit more on "translate the reverse
> linearly"? Do you mean to create an IOVA->GPA tree but always search the
> tree using the GPA (e.g. via iova_tree_find_iova)?
>

Yes, that's the other option. I'm comparing the two options here:
* IOVA->GPA has the advantage of reusing IOVATree, but lookups by GPA are O(N).
* GPA->IOVA is more natural but we cannot reuse IOVATree in a simple way.

> >>> vhost_iova_tree_map_alloc_gpa needs to add the map to this new tree
> >>> also. Similarly, vhost_iova_tree_remove_gpa must look for the GPA in
> >>> this tree, and only remove the associated DMAMap in iova_taddr_map
> >>> that matches the IOVA.
> >>>
> >>> Does it make sense to you?
>
> Would using a name like vhost_iova_tree_map_alloc_gpa seem a bit
> misleading given that we're already allocating the IOVA range in
> vhost_iova_tree_map_alloc?

With vhost_iova_tree_map_alloc_gpa there is no need to call
vhost_iova_tree_map_alloc. It's like merging the two operations, so
the caller does not need to know the internals.

> It seems this would be more of an insertion
> rather than an allocation when adding a map to the GPA->IOVA tree.
>

Let's put it another way: why complicate IOVATree or VhostIOVATree by
integrating the GPA->IOVA or IOVA->GPA mapping if we could simply store
that information in the caller and make each struct simpler?

If we abstract away the two trees under a coherent API, the struct
makes sense. If not, it would be better to let the caller handle the
information.

> Also, are you saying that vhost_iova_tree_remove_gpa only removes the
> DMAMap in the IOVA->HVA tree or should it also remove the corresponding
> mapping in the GPA->IOVA tree?
>

It should remove it in both.

> >>>
> >>>> Signed-off-by: Jonah Palmer <jonah.palmer@oracle.com>
> >>>> ---
> >>>>    hw/virtio/vhost-iova-tree.c | 38 ++++++++++++++++++++++++++++++++-----
> >>>>    hw/virtio/vhost-iova-tree.h |  1 +
> >>>>    hw/virtio/vhost-vdpa.c      | 31 ++++++++++++++++++++++++------
> >>>>    net/vhost-vdpa.c            | 13 +++++++++++--
> >>>>    4 files changed, 70 insertions(+), 13 deletions(-)
> >>>>
> >>>> diff --git a/hw/virtio/vhost-iova-tree.c b/hw/virtio/vhost-iova-tree.c
> >>>> index 3d03395a77..32c03db2f5 100644
> >>>> --- a/hw/virtio/vhost-iova-tree.c
> >>>> +++ b/hw/virtio/vhost-iova-tree.c
> >>>> @@ -28,12 +28,17 @@ struct VhostIOVATree {
> >>>>
> >>>>        /* IOVA address to qemu memory maps. */
> >>>>        IOVATree *iova_taddr_map;
> >>>> +
> >>>> +    /* IOVA tree (IOVA allocator) */
> >>>> +    IOVATree *iova_map;
> >>>>    };
> >>>>
> >>>>    /**
> >>>> - * Create a new IOVA tree
> >>>> + * Create a new VhostIOVATree with a new set of IOVATree's:
> >>> s/IOVA tree/VhostIOVATree/ is good, but I think the rest is more an
> >>> implementation detail.
> >>>
>
> Gotcha. Would you like me to remove the other comments then?
>

Yes, it is better to remove the ones that expose internals. We should
be able to change this kind of detail without the users of
VhostIOVATree noticing it.

> >>>> + * - IOVA allocator (iova_map)
> >>>> + * - IOVA->HVA tree (iova_taddr_map)
> >>>>     *
> >>>> - * Returns the new IOVA tree
> >>>> + * Returns the new VhostIOVATree
> >>>>     */
> >>>>    VhostIOVATree *vhost_iova_tree_new(hwaddr iova_first, hwaddr iova_last)
> >>>>    {
> >>>> @@ -44,6 +49,7 @@ VhostIOVATree *vhost_iova_tree_new(hwaddr iova_first, hwaddr iova_last)
> >>>>        tree->iova_last = iova_last;
> >>>>
> >>>>        tree->iova_taddr_map = iova_tree_new();
> >>>> +    tree->iova_map = iova_tree_new();
> >>>>        return tree;
> >>>>    }
> >>>>
> >>>> @@ -53,6 +59,7 @@ VhostIOVATree *vhost_iova_tree_new(hwaddr iova_first, hwaddr iova_last)
> >>>>    void vhost_iova_tree_delete(VhostIOVATree *iova_tree)
> >>>>    {
> >>>>        iova_tree_destroy(iova_tree->iova_taddr_map);
> >>>> +    iova_tree_destroy(iova_tree->iova_map);
> >>>>        g_free(iova_tree);
> >>>>    }
> >>>>
> >>>> @@ -88,13 +95,12 @@ int vhost_iova_tree_map_alloc(VhostIOVATree *tree, DMAMap *map)
> >>>>        /* Some vhost devices do not like addr 0. Skip first page */
> >>>>        hwaddr iova_first = tree->iova_first ?: qemu_real_host_page_size();
> >>>>
> >>>> -    if (map->translated_addr + map->size < map->translated_addr ||
> >>> Why remove this condition? If the request is invalid we still need to
> >>> return an error here.
> >>>
> >>> Maybe we should move it to iova_tree_alloc_map though.
> >>>
>
> This series decoupled the IOVA allocator from also adding a mapping to
> the IOVA->HVA tree and instead added IOVA ranges only to an IOVA-only
> tree. So no value existed under translated_addr for these mappings
> specifically.
>
> This check was moved to vhost_iova_tree_insert since that function
> covered adding the host-only memory mappings to the IOVA->SVQ HVA tree.
>

Ok, I missed that then. Can you extract that into a separate patch? I
think it makes sense by itself.

> >>>> -        map->perm == IOMMU_NONE) {
> >>>> +    if (map->perm == IOMMU_NONE) {
> >>>>            return IOVA_ERR_INVALID;
> >>>>        }
> >>>>
> >>>>        /* Allocate a node in IOVA address */
> >>>> -    return iova_tree_alloc_map(tree->iova_taddr_map, map, iova_first,
> >>>> +    return iova_tree_alloc_map(tree->iova_map, map, iova_first,
> >>>>                                   tree->iova_last);
> >>>>    }
> >>>>
> >>>> @@ -107,4 +113,26 @@ int vhost_iova_tree_map_alloc(VhostIOVATree *tree, DMAMap *map)
> >>>>    void vhost_iova_tree_remove(VhostIOVATree *iova_tree, DMAMap map)
> >>>>    {
> >>>>        iova_tree_remove(iova_tree->iova_taddr_map, map);
> >>>> +    iova_tree_remove(iova_tree->iova_map, map);
> >>>> +}
> >>>> +
> >>>> +/**
> >>>> + * Insert a new mapping to the IOVA->HVA tree
> >>>> + *
> >>>> + * @tree: The VhostIOVATree
> >>>> + * @map: The iova map
> >>>> + *
> >>>> + * Returns:
> >>>> + * - IOVA_OK if the map fits in the container
> >>>> + * - IOVA_ERR_INVALID if the map does not make sense (like size overflow)
> >>>> + * - IOVA_ERR_OVERLAP if the IOVA range overlaps with an existing range
> >>>> + */
> >>>> +int vhost_iova_tree_insert(VhostIOVATree *iova_tree, DMAMap *map)
> >>>> +{
> >>>> +    if (map->translated_addr + map->size < map->translated_addr ||
> >>>> +        map->perm == IOMMU_NONE) {
> >>>> +        return IOVA_ERR_INVALID;
> >>>> +    }
> >>>> +
> >>>> +    return iova_tree_insert(iova_tree->iova_taddr_map, map);
> >>>>    }
> >>>> diff --git a/hw/virtio/vhost-iova-tree.h b/hw/virtio/vhost-iova-tree.h
> >>>> index 4adfd79ff0..8bf7b64786 100644
> >>>> --- a/hw/virtio/vhost-iova-tree.h
> >>>> +++ b/hw/virtio/vhost-iova-tree.h
> >>>> @@ -23,5 +23,6 @@ const DMAMap *vhost_iova_tree_find_iova(const VhostIOVATree *iova_tree,
> >>>>                                            const DMAMap *map);
> >>>>    int vhost_iova_tree_map_alloc(VhostIOVATree *iova_tree, DMAMap *map);
> >>>>    void vhost_iova_tree_remove(VhostIOVATree *iova_tree, DMAMap map);
> >>>> +int vhost_iova_tree_insert(VhostIOVATree *iova_tree, DMAMap *map);
> >>>>
> >>>>    #endif
> >>>> diff --git a/hw/virtio/vhost-vdpa.c b/hw/virtio/vhost-vdpa.c
> >>>> index 3cdaa12ed5..6702459065 100644
> >>>> --- a/hw/virtio/vhost-vdpa.c
> >>>> +++ b/hw/virtio/vhost-vdpa.c
> >>>> @@ -361,10 +361,10 @@ static void vhost_vdpa_listener_region_add(MemoryListener *listener,
> >>>>        if (s->shadow_data) {
> >>>>            int r;
> >>>>
> >>>> -        mem_region.translated_addr = (hwaddr)(uintptr_t)vaddr,
> >>>>            mem_region.size = int128_get64(llsize) - 1,
> >>>>            mem_region.perm = IOMMU_ACCESS_FLAG(true, section->readonly),
> >>>>
> >>>> +        /* Allocate IOVA range and add the mapping to the IOVA tree */
> >>>>            r = vhost_iova_tree_map_alloc(s->iova_tree, &mem_region);
> >>>>            if (unlikely(r != IOVA_OK)) {
> >>>>                error_report("Can't allocate a mapping (%d)", r);
> >>>> @@ -372,6 +372,14 @@ static void vhost_vdpa_listener_region_add(MemoryListener *listener,
> >>>>            }
> >>>>
> >>>>            iova = mem_region.iova;
> >>>> +
> >>>> +        /* Add mapping to the IOVA->HVA tree */
> >>>> +        mem_region.translated_addr = (hwaddr)(uintptr_t)vaddr;
> >>>> +        r = vhost_iova_tree_insert(s->iova_tree, &mem_region);
> >>>> +        if (unlikely(r != IOVA_OK)) {
> >>>> +            error_report("Can't add listener region mapping (%d)", r);
> >>>> +            goto fail_map;
> >>>> +        }
> >>> I'd say it is not intuitive for the caller code.
> >>>
>
> Sorry, I'm not sure what you mean by this here. Would you mind
> elaborating a bit more?
>

If VhostIOVATree gets reused it is easy to miss the
vhost_iova_tree_insert after the allocation.

If we're going to expose this second GPA tree, I think it would be
better to place it directly in VhostShadowVirtqueue. All the
conditionals fit better that way and we don't make VhostIOVATree /
IOVATree harder to reuse. However, I keep thinking it is easy enough
to hide in VhostIOVATree.

> >>>>        }
> >>>>
> >>>>        vhost_vdpa_iotlb_batch_begin_once(s);
> >>>> @@ -1142,19 +1150,30 @@ static void vhost_vdpa_svq_unmap_rings(struct vhost_dev *dev,
> >>>>     *
> >>>>     * @v: Vhost-vdpa device
> >>>>     * @needle: The area to search iova
> >>>> + * @taddr: The translated address (SVQ HVA)
> >>>>     * @errorp: Error pointer
> >>>>     */
> >>>>    static bool vhost_vdpa_svq_map_ring(struct vhost_vdpa *v, DMAMap *needle,
> >>>> -                                    Error **errp)
> >>>> +                                    hwaddr taddr, Error **errp)
> >>>>    {
> >>>>        int r;
> >>>>
> >>>> +    /* Allocate IOVA range and add the mapping to the IOVA tree */
> >>>>        r = vhost_iova_tree_map_alloc(v->shared->iova_tree, needle);
> >>>>        if (unlikely(r != IOVA_OK)) {
> >>>>            error_setg(errp, "Cannot allocate iova (%d)", r);
> >>>>            return false;
> >>>>        }
> >>>>
> >>>> +    /* Add mapping to the IOVA->HVA tree */
> >>>> +    needle->translated_addr = taddr;
> >>>> +    r = vhost_iova_tree_insert(v->shared->iova_tree, needle);
> >>>> +    if (unlikely(r != IOVA_OK)) {
> >>>> +        error_setg(errp, "Cannot add SVQ vring mapping (%d)", r);
> >>>> +        vhost_iova_tree_remove(v->shared->iova_tree, *needle);
> >>>> +        return false;
> >>>> +    }
> >>>> +
> >>>>        r = vhost_vdpa_dma_map(v->shared, v->address_space_id, needle->iova,
> >>>>                               needle->size + 1,
> >>>>                               (void *)(uintptr_t)needle->translated_addr,
> >>>> @@ -1192,11 +1211,11 @@ static bool vhost_vdpa_svq_map_rings(struct vhost_dev *dev,
> >>>>        vhost_svq_get_vring_addr(svq, &svq_addr);
> >>>>
> >>>>        driver_region = (DMAMap) {
> >>>> -        .translated_addr = svq_addr.desc_user_addr,
> >>>>            .size = driver_size - 1,
> >>>>            .perm = IOMMU_RO,
> >>>>        };
> >>>> -    ok = vhost_vdpa_svq_map_ring(v, &driver_region, errp);
> >>>> +    ok = vhost_vdpa_svq_map_ring(v, &driver_region, svq_addr.desc_user_addr,
> >>>> +                                 errp);
> >>>>        if (unlikely(!ok)) {
> >>>>            error_prepend(errp, "Cannot create vq driver region: ");
> >>>>            return false;
> >>>> @@ -1206,11 +1225,11 @@ static bool vhost_vdpa_svq_map_rings(struct vhost_dev *dev,
> >>>>        addr->avail_user_addr = driver_region.iova + avail_offset;
> >>>>
> >>>>        device_region = (DMAMap) {
> >>>> -        .translated_addr = svq_addr.used_user_addr,
> >>>>            .size = device_size - 1,
> >>>>            .perm = IOMMU_RW,
> >>>>        };
> >>>> -    ok = vhost_vdpa_svq_map_ring(v, &device_region, errp);
> >>>> +    ok = vhost_vdpa_svq_map_ring(v, &device_region, svq_addr.used_user_addr,
> >>>> +                                 errp);
> >>>>        if (unlikely(!ok)) {
> >>>>            error_prepend(errp, "Cannot create vq device region: ");
> >>>>            vhost_vdpa_svq_unmap_ring(v, driver_region.translated_addr);
> >>>> diff --git a/net/vhost-vdpa.c b/net/vhost-vdpa.c
> >>>> index 03457ead66..81da956b92 100644
> >>>> --- a/net/vhost-vdpa.c
> >>>> +++ b/net/vhost-vdpa.c
> >>>> @@ -512,15 +512,24 @@ static int vhost_vdpa_cvq_map_buf(struct vhost_vdpa *v, void *buf, size_t size,
> >>>>        DMAMap map = {};
> >>>>        int r;
> >>>>
> >>>> -    map.translated_addr = (hwaddr)(uintptr_t)buf;
> >>>>        map.size = size - 1;
> >>>>        map.perm = write ? IOMMU_RW : IOMMU_RO,
> >>>> +
> >>>> +    /* Allocate IOVA range and add the mapping to the IOVA tree */
> >>>>        r = vhost_iova_tree_map_alloc(v->shared->iova_tree, &map);
> >>>>        if (unlikely(r != IOVA_OK)) {
> >>>> -        error_report("Cannot map injected element");
> >>>> +        error_report("Cannot allocate IOVA range for injected element");
> >>>>            return r;
> >>>>        }
> >>>>
> >>>> +    /* Add mapping to the IOVA->HVA tree */
> >>>> +    map.translated_addr = (hwaddr)(uintptr_t)buf;
> >>>> +    r = vhost_iova_tree_insert(v->shared->iova_tree, &map);
> >>>> +    if (unlikely(r != IOVA_OK)) {
> >>>> +        error_report("Cannot map injected element into IOVA->HVA tree");
> >>>> +        goto dma_map_err;
> >>>> +    }
> >>>> +
> >>>>        r = vhost_vdpa_dma_map(v->shared, v->address_space_id, map.iova,
> >>>>                               vhost_vdpa_net_cvq_cmd_page_len(), buf, !write);
> >>>>        if (unlikely(r < 0)) {
> >>>> --
> >>>> 2.43.5
> >>>>
> >>
> >
>
Si-Wei Liu Aug. 30, 2024, 9:05 p.m. UTC | #6
On 8/30/2024 1:05 AM, Eugenio Perez Martin wrote:
> On Fri, Aug 30, 2024 at 6:20 AM Si-Wei Liu <si-wei.liu@oracle.com> wrote:
>>
>>
>> On 8/29/2024 9:53 AM, Eugenio Perez Martin wrote:
>>> On Wed, Aug 21, 2024 at 2:56 PM Jonah Palmer <jonah.palmer@oracle.com> wrote:
>>>> Decouples the IOVA allocator from the IOVA->HVA tree and instead adds
>>>> the allocated IOVA range to an IOVA-only tree (iova_map). This IOVA tree
>>>> will hold all IOVA ranges that have been allocated (e.g. in the
>>>> IOVA->HVA tree) and are removed when any IOVA ranges are deallocated.
>>>>
>>>> A new API function vhost_iova_tree_insert() is also created to add a
>>>> IOVA->HVA mapping into the IOVA->HVA tree.
>>>>
>>> I think this is a good first iteration but we can take steps to
>>> simplify it. Also, it is great to be able to make points on real code
>>> instead of designs on the air :).
>>>
>>> I expected a split of vhost_iova_tree_map_alloc between the current
>>> vhost_iova_tree_map_alloc and vhost_iova_tree_map_alloc_gpa, or
>>> similar. Similarly, a vhost_iova_tree_remove and
>>> vhost_iova_tree_remove_gpa would be needed.
>>>
>>> The first one is used for regions that don't exist in the guest, like
>>> SVQ vrings or CVQ buffers. The second one is the one used by the
>>> memory listener to map the guest regions into the vdpa device.
>>>
>>> Implementation wise, only two trees are actually needed:
>>> * Current iova_taddr_map that contains all IOVA->vaddr translations as
>>> seen by the device, so both allocation functions can work on a single
>>> tree. The function iova_tree_find_iova keeps using this one, so the
>> I thought we had thorough discussion about this and agreed upon the
>> decoupled IOVA allocator solution.
> My interpretation of it is to leave the allocator as the current one,
> and create a new tree with GPA which is guaranteed to be unique. But
> we can talk over it of course.
>
>> But maybe I missed something earlier,
>> I am not clear how come this iova_tree_find_iova function could still
>> work with the full IOVA-> HVA tree when it comes to aliased memory or
>> overlapped HVAs? Granted, for the memory map removal in the
>> .region_del() path, we could rely on the GPA tree to locate the
>> corresponding IOVA, but how come the translation path could figure out
>> which IOVA range to return when the vaddr happens to fall in an
>> overlapped HVA range?
> That is not a problem, as they both translate to the same address at the device.
Not sure I followed. It might return a wrong IOVA (range) for which
the host kernel's map may have a conflict or mismatched attributes,
i.e. permission, size, et al.

>
> The most complicated situation is where we have a region contained in
> another region, and the requested buffer crosses them. If the IOVA
> tree returns the inner region, it will return the buffer chained with
> the rest of the content in the outer region. Not optimal, but solved
> either way.
I don't quite understand what that means... So in this overlapping
case, speaking of the expectations of the translation API, you would
like all IOVA ranges that match the overlapped HVA to be returned, and
then rely on the user (caller) to figure out which one is correct?
Wouldn't it be easier for the user (SVQ) to use the memory system API
directly to figure that out?

As we are talking about the API, we may want to build it in a way
generic enough to address all possible needs (in line with what the
memory subsystem is capable of), rather than just look at the current
usage, which has a rather narrow scope. Although the virtio-net device
doesn't work with aliased regions now, some other virtio device may,
or maybe some day virtio-net will need to use aliased regions; then
the API and its users (SVQ) would have to go through another round of
significant refactoring due to the iova-tree internal workings. I feel
it's just too early or too tight to abstract the iova-tree layer and
get the API customized for the current use case with a lot of
limitations on how the user should expect to use it. We need more
flexibility and ease of extensibility if we want to take the chance to
get it rewritten, given it is not a lot of code, as Jonah has shown
here.

> The only problem that comes to my mind is the case where the inner
> region is RO
Yes, this is one of the examples around permission or size I
mentioned above, which may conflict with the memory system's or the
kernel's view.

Thanks,
-Siwei

> and it is a write command, but I don't think we have this
> case in a sane guest. A malicious guest cannot do any harm this way
> anyway.
>
>> Do we still assume some overlapping order so we
>> always return the first match from the tree? Or we expect every current
>> user of iova_tree_find_iova should pass in GPA rather than HVA and use
>> the vhost_iova_xxx_gpa API variant to look up IOVA?
>>
> No, iova_tree_find_iova should keep asking for vaddr, as the result is
> guaranteed to be there. Users of VhostIOVATree only need to modify how
> they add or remove regions, knowing if they come from the guest or
> not. As shown by this series, it is easier to do in that place than in
> translation.
>
>> Thanks,
>> -Siwei
>>
>>> user does not need to know if the address is from the guest or only
>>> exists in QEMU by using RAMBlock etc. All insert and remove functions
>>> use this tree.
>>> * A new tree that relates IOVA to GPA, that only
>>> vhost_iova_tree_map_alloc_gpa and vhost_iova_tree_remove_gpa uses.
>>>
>>> The ideal case is that the key in this new tree is the GPA and the
>>> value is the IOVA. But IOVATree's DMA is named the reverse: iova is
>>> the key and translated_addr is the vaddr. We can create a new tree
>>> struct for that, use GTree directly, or translate the reverse
>>> linearly. As memory add / remove should not be frequent, I think the
>>> simpler is the last one, but I'd be ok with creating a new tree.
>>>
>>> vhost_iova_tree_map_alloc_gpa needs to add the map to this new tree
>>> also. Similarly, vhost_iova_tree_remove_gpa must look for the GPA in
>>> this tree, and only remove the associated DMAMap in iova_taddr_map
>>> that matches the IOVA.
>>>
>>> Does it make sense to you?
Eugenio Perez Martin Sept. 2, 2024, 11:09 a.m. UTC | #7
On Fri, Aug 30, 2024 at 11:05 PM Si-Wei Liu <si-wei.liu@oracle.com> wrote:
>
>
>
> On 8/30/2024 1:05 AM, Eugenio Perez Martin wrote:
> > On Fri, Aug 30, 2024 at 6:20 AM Si-Wei Liu <si-wei.liu@oracle.com> wrote:
> >>
> >>
> >> On 8/29/2024 9:53 AM, Eugenio Perez Martin wrote:
> >>> On Wed, Aug 21, 2024 at 2:56 PM Jonah Palmer <jonah.palmer@oracle.com> wrote:
> >>>> Decouples the IOVA allocator from the IOVA->HVA tree and instead adds
> >>>> the allocated IOVA range to an IOVA-only tree (iova_map). This IOVA tree
> >>>> will hold all IOVA ranges that have been allocated (e.g. in the
> >>>> IOVA->HVA tree) and are removed when any IOVA ranges are deallocated.
> >>>>
> >>>> A new API function vhost_iova_tree_insert() is also created to add a
> >>>> IOVA->HVA mapping into the IOVA->HVA tree.
> >>>>
> >>> I think this is a good first iteration but we can take steps to
> >>> simplify it. Also, it is great to be able to make points on real code
> >>> instead of designs on the air :).
> >>>
> >>> I expected a split of vhost_iova_tree_map_alloc between the current
> >>> vhost_iova_tree_map_alloc and vhost_iova_tree_map_alloc_gpa, or
> >>> similar. Similarly, a vhost_iova_tree_remove and
> >>> vhost_iova_tree_remove_gpa would be needed.
> >>>
> >>> The first one is used for regions that don't exist in the guest, like
> >>> SVQ vrings or CVQ buffers. The second one is the one used by the
> >>> memory listener to map the guest regions into the vdpa device.
> >>>
> >>> Implementation wise, only two trees are actually needed:
> >>> * Current iova_taddr_map that contains all IOVA->vaddr translations as
> >>> seen by the device, so both allocation functions can work on a single
> >>> tree. The function iova_tree_find_iova keeps using this one, so the
> >> I thought we had thorough discussion about this and agreed upon the
> >> decoupled IOVA allocator solution.
> > My interpretation of it is to leave the allocator as the current one,
> > and create a new tree with GPA which is guaranteed to be unique. But
> > we can talk over it of course.
> >
> >> But maybe I missed something earlier,
> >> I am not clear how come this iova_tree_find_iova function could still
> >> work with the full IOVA-> HVA tree when it comes to aliased memory or
> >> overlapped HVAs? Granted, for the memory map removal in the
> >> .region_del() path, we could rely on the GPA tree to locate the
> >> corresponding IOVA, but how come the translation path could figure out
> >> which IOVA range to return when the vaddr happens to fall in an
> >> overlapped HVA range?
> > That is not a problem, as they both translate to the same address at the device.
> Not sure I followed. It might return a wrong IOVA (range) for which
> the host kernel's map may have a conflict or mismatched attributes,
> i.e. permission, size, et al.
>

Let's leave out the permissions at the moment. I'm going to use the
example you found, but I'll reorder (1) and (3) insertions so it picks
the "wrong" IOVA range intentionally:

(1)
HVA: [0x7f7903ea0000, 0x7f7903ec0000)
GPA: [0xfeda0000, 0xfedc0000)
IOVA: [0x1000, 0x21000)

(2)
HVA: [0x7f7983e00000, 0x7f9903e00000)
GPA: [0x100000000, 0x2080000000)
IOVA: [0x80001000, 0x2000001000)

(3)
HVA: [0x7f7903e00000, 0x7f7983e00000)
GPA: [0x0, 0x80000000)
IOVA: [0x2000001000, 0x2080001000)

Let's say that SVQ wants to translate the buffer at GPA
0xfeda0000-0xfedd0000 (HVA 0x7f7903ea0000-0x7f7903ed0000). So it makes
two chained buffers available to the device: one with (addr=0x1000,
len=0x20000) and the other with (addr=0x20000c1000, len=0x10000).

The VirtIO device should be able to translate these two buffers in
isolation and chain them. Not optimal but it helps to keep QEMU source
clean, as the device already must support it. I don't foresee lots of
cases like this anyway :).

About the permissions, maybe we can make them part of the lookup?
Instead of returning them from iova_tree_find_iova, make them match in
iova_tree_find_address_iterator.

> >
> > The most complicated situation is where we have a region contained in
> > another region, and the requested buffer crosses them. If the IOVA
> > tree returns the inner region, it will return the buffer chained with
> > the rest of the content in the outer region. Not optimal, but solved
> > either way.
> Don't quite understand what it means... So in this overlapping case,
> speaking of the expectation of the translation API, you would like to
> have all IOVA ranges that match the overlapped HVA to be returned? And
> then to rely on the user (caller) to figure out which one is correct?
> Wouldn't it be easier for the user (SVQ) to use the memory system API
> directly to figure out?
>

All of them are correct in the translation path. The device should be
able to work with a buffer that spans over different IOTLB too. You
can see how QEMU handles it at hw/virtio/virtio.c:virtqueue_map_desc.
If the region is not big enough to contain the whole buffer, the
device must keep looking for the rest of it.

> As we are talking about API we may want to build it in a way generic
> enough to address all possible needs (which goes with what memory
> subsystem is capable of), rather than just look on the current usage
> which has kind of narrow scope. Although virtio-net device doesn't work
> with aliased region now, some other virtio device may do, or maybe some
> day virtio-net would need to use aliased region than the API and the
> users (SVQ) would have to go with another round of significant
> refactoring due to the iova-tree internal working. I feel it's just too
> early or too tight to abstract the iova-tree layer and get the API
> customized for the current use case with a lot of limitations on how
> user should expect to use it. We need some more flexibility and ease on
> extensibility if we want to take the chance to get it rewritten, given
> it is not a lot of code that Jonah had showed here ..
>

Let me know if your concerns are addressed here. Sorry if I didn't explain it
well, but I'm not trying to limit the aliases or to handle just a subset
of them. I'm trying to delegate the handling of these to the device as
much as possible, as the device already needs to handle them and the
less we complicate the QEMU solution, the better. Of course, the IOVA
tree is a very self-contained area easy to rewrite in theory, but with
potential future users it might get complicated.

> > The only problem that comes to my mind is the case where the inner
> > region is RO
> Yes, this is one of examples around the permission or size I mentioned
> above, which may have a conflict view with the memory system or the kernel.
>
> Thanks,
> -Siwei
>
> > and it is a write command, but I don't think we have this
> > case in a sane guest. A malicious guest cannot do any harm this way
> > anyway.
> >
> >> Do we still assume some overlapping order so we
> >> always return the first match from the tree? Or we expect every current
> >> user of iova_tree_find_iova should pass in GPA rather than HVA and use
> >> the vhost_iova_xxx_gpa API variant to look up IOVA?
> >>
> > No, iova_tree_find_iova should keep asking for vaddr, as the result is
> > guaranteed to be there. Users of VhostIOVATree only need to modify how
> > they add or remove regions, knowing if they come from the guest or
> > not. As shown by this series, it is easier to do in that place than in
> > translation.
> >
> >> Thanks,
> >> -Siwei
> >>
> >>> user does not need to know if the address is from the guest or only
> >>> exists in QEMU by using RAMBlock etc. All insert and remove functions
> >>> use this tree.
> >>> * A new tree that relates IOVA to GPA, that only
> >>> vhost_iova_tree_map_alloc_gpa and vhost_iova_tree_remove_gpa uses.
> >>>
> >>> The ideal case is that the key in this new tree is the GPA and the
> >>> value is the IOVA. But IOVATree's DMA is named the reverse: iova is
> >>> the key and translated_addr is the vaddr. We can create a new tree
> >>> struct for that, use GTree directly, or translate the reverse
> >>> linearly. As memory add / remove should not be frequent, I think the
> >>> simpler is the last one, but I'd be ok with creating a new tree.
> >>>
> >>> vhost_iova_tree_map_alloc_gpa needs to add the map to this new tree
> >>> also. Similarly, vhost_iova_tree_remove_gpa must look for the GPA in
> >>> this tree, and only remove the associated DMAMap in iova_taddr_map
> >>> that matches the IOVA.
> >>>
> >>> Does it make sense to you?
> >>>
> >>>> Signed-off-by: Jonah Palmer <jonah.palmer@oracle.com>
> >>>> ---
> >>>>    hw/virtio/vhost-iova-tree.c | 38 ++++++++++++++++++++++++++++++++-----
> >>>>    hw/virtio/vhost-iova-tree.h |  1 +
> >>>>    hw/virtio/vhost-vdpa.c      | 31 ++++++++++++++++++++++++------
> >>>>    net/vhost-vdpa.c            | 13 +++++++++++--
> >>>>    4 files changed, 70 insertions(+), 13 deletions(-)
> >>>>
> >>>> diff --git a/hw/virtio/vhost-iova-tree.c b/hw/virtio/vhost-iova-tree.c
> >>>> index 3d03395a77..32c03db2f5 100644
> >>>> --- a/hw/virtio/vhost-iova-tree.c
> >>>> +++ b/hw/virtio/vhost-iova-tree.c
> >>>> @@ -28,12 +28,17 @@ struct VhostIOVATree {
> >>>>
> >>>>        /* IOVA address to qemu memory maps. */
> >>>>        IOVATree *iova_taddr_map;
> >>>> +
> >>>> +    /* IOVA tree (IOVA allocator) */
> >>>> +    IOVATree *iova_map;
> >>>>    };
> >>>>
> >>>>    /**
> >>>> - * Create a new IOVA tree
> >>>> + * Create a new VhostIOVATree with a new set of IOVATree's:
> >>> s/IOVA tree/VhostIOVATree/ is good, but I think the rest is more an
> >>> implementation detail.
> >>>
> >>>> + * - IOVA allocator (iova_map)
> >>>> + * - IOVA->HVA tree (iova_taddr_map)
> >>>>     *
> >>>> - * Returns the new IOVA tree
> >>>> + * Returns the new VhostIOVATree
> >>>>     */
> >>>>    VhostIOVATree *vhost_iova_tree_new(hwaddr iova_first, hwaddr iova_last)
> >>>>    {
> >>>> @@ -44,6 +49,7 @@ VhostIOVATree *vhost_iova_tree_new(hwaddr iova_first, hwaddr iova_last)
> >>>>        tree->iova_last = iova_last;
> >>>>
> >>>>        tree->iova_taddr_map = iova_tree_new();
> >>>> +    tree->iova_map = iova_tree_new();
> >>>>        return tree;
> >>>>    }
> >>>>
> >>>> @@ -53,6 +59,7 @@ VhostIOVATree *vhost_iova_tree_new(hwaddr iova_first, hwaddr iova_last)
> >>>>    void vhost_iova_tree_delete(VhostIOVATree *iova_tree)
> >>>>    {
> >>>>        iova_tree_destroy(iova_tree->iova_taddr_map);
> >>>> +    iova_tree_destroy(iova_tree->iova_map);
> >>>>        g_free(iova_tree);
> >>>>    }
> >>>>
> >>>> @@ -88,13 +95,12 @@ int vhost_iova_tree_map_alloc(VhostIOVATree *tree, DMAMap *map)
> >>>>        /* Some vhost devices do not like addr 0. Skip first page */
> >>>>        hwaddr iova_first = tree->iova_first ?: qemu_real_host_page_size();
> >>>>
> >>>> -    if (map->translated_addr + map->size < map->translated_addr ||
> >>> Why remove this condition? If the request is invalid we still need to
> >>> return an error here.
> >>>
> >>> Maybe we should move it to iova_tree_alloc_map though.
> >>>
> >>>> -        map->perm == IOMMU_NONE) {
> >>>> +    if (map->perm == IOMMU_NONE) {
> >>>>            return IOVA_ERR_INVALID;
> >>>>        }
> >>>>
> >>>>        /* Allocate a node in IOVA address */
> >>>> -    return iova_tree_alloc_map(tree->iova_taddr_map, map, iova_first,
> >>>> +    return iova_tree_alloc_map(tree->iova_map, map, iova_first,
> >>>>                                   tree->iova_last);
> >>>>    }
> >>>>
> >>>> @@ -107,4 +113,26 @@ int vhost_iova_tree_map_alloc(VhostIOVATree *tree, DMAMap *map)
> >>>>    void vhost_iova_tree_remove(VhostIOVATree *iova_tree, DMAMap map)
> >>>>    {
> >>>>        iova_tree_remove(iova_tree->iova_taddr_map, map);
> >>>> +    iova_tree_remove(iova_tree->iova_map, map);
> >>>> +}
> >>>> +
> >>>> +/**
> >>>> + * Insert a new mapping to the IOVA->HVA tree
> >>>> + *
> >>>> + * @tree: The VhostIOVATree
> >>>> + * @map: The iova map
> >>>> + *
> >>>> + * Returns:
> >>>> + * - IOVA_OK if the map fits in the container
> >>>> + * - IOVA_ERR_INVALID if the map does not make sense (like size overflow)
> >>>> + * - IOVA_ERR_OVERLAP if the IOVA range overlaps with an existing range
> >>>> + */
> >>>> +int vhost_iova_tree_insert(VhostIOVATree *iova_tree, DMAMap *map)
> >>>> +{
> >>>> +    if (map->translated_addr + map->size < map->translated_addr ||
> >>>> +        map->perm == IOMMU_NONE) {
> >>>> +        return IOVA_ERR_INVALID;
> >>>> +    }
> >>>> +
> >>>> +    return iova_tree_insert(iova_tree->iova_taddr_map, map);
> >>>>    }
> >>>> diff --git a/hw/virtio/vhost-iova-tree.h b/hw/virtio/vhost-iova-tree.h
> >>>> index 4adfd79ff0..8bf7b64786 100644
> >>>> --- a/hw/virtio/vhost-iova-tree.h
> >>>> +++ b/hw/virtio/vhost-iova-tree.h
> >>>> @@ -23,5 +23,6 @@ const DMAMap *vhost_iova_tree_find_iova(const VhostIOVATree *iova_tree,
> >>>>                                            const DMAMap *map);
> >>>>    int vhost_iova_tree_map_alloc(VhostIOVATree *iova_tree, DMAMap *map);
> >>>>    void vhost_iova_tree_remove(VhostIOVATree *iova_tree, DMAMap map);
> >>>> +int vhost_iova_tree_insert(VhostIOVATree *iova_tree, DMAMap *map);
> >>>>
> >>>>    #endif
> >>>> diff --git a/hw/virtio/vhost-vdpa.c b/hw/virtio/vhost-vdpa.c
> >>>> index 3cdaa12ed5..6702459065 100644
> >>>> --- a/hw/virtio/vhost-vdpa.c
> >>>> +++ b/hw/virtio/vhost-vdpa.c
> >>>> @@ -361,10 +361,10 @@ static void vhost_vdpa_listener_region_add(MemoryListener *listener,
> >>>>        if (s->shadow_data) {
> >>>>            int r;
> >>>>
> >>>> -        mem_region.translated_addr = (hwaddr)(uintptr_t)vaddr,
> >>>>            mem_region.size = int128_get64(llsize) - 1,
> >>>>            mem_region.perm = IOMMU_ACCESS_FLAG(true, section->readonly),
> >>>>
> >>>> +        /* Allocate IOVA range and add the mapping to the IOVA tree */
> >>>>            r = vhost_iova_tree_map_alloc(s->iova_tree, &mem_region);
> >>>>            if (unlikely(r != IOVA_OK)) {
> >>>>                error_report("Can't allocate a mapping (%d)", r);
> >>>> @@ -372,6 +372,14 @@ static void vhost_vdpa_listener_region_add(MemoryListener *listener,
> >>>>            }
> >>>>
> >>>>            iova = mem_region.iova;
> >>>> +
> >>>> +        /* Add mapping to the IOVA->HVA tree */
> >>>> +        mem_region.translated_addr = (hwaddr)(uintptr_t)vaddr;
> >>>> +        r = vhost_iova_tree_insert(s->iova_tree, &mem_region);
> >>>> +        if (unlikely(r != IOVA_OK)) {
> >>>> +            error_report("Can't add listener region mapping (%d)", r);
> >>>> +            goto fail_map;
> >>>> +        }
> >>> I'd say it is not intuitive for the caller code.
> >>>
> >>>>        }
> >>>>
> >>>>        vhost_vdpa_iotlb_batch_begin_once(s);
> >>>> @@ -1142,19 +1150,30 @@ static void vhost_vdpa_svq_unmap_rings(struct vhost_dev *dev,
> >>>>     *
> >>>>     * @v: Vhost-vdpa device
> >>>>     * @needle: The area to search iova
> >>>> + * @taddr: The translated address (SVQ HVA)
> >>>>     * @errorp: Error pointer
> >>>>     */
> >>>>    static bool vhost_vdpa_svq_map_ring(struct vhost_vdpa *v, DMAMap *needle,
> >>>> -                                    Error **errp)
> >>>> +                                    hwaddr taddr, Error **errp)
> >>>>    {
> >>>>        int r;
> >>>>
> >>>> +    /* Allocate IOVA range and add the mapping to the IOVA tree */
> >>>>        r = vhost_iova_tree_map_alloc(v->shared->iova_tree, needle);
> >>>>        if (unlikely(r != IOVA_OK)) {
> >>>>            error_setg(errp, "Cannot allocate iova (%d)", r);
> >>>>            return false;
> >>>>        }
> >>>>
> >>>> +    /* Add mapping to the IOVA->HVA tree */
> >>>> +    needle->translated_addr = taddr;
> >>>> +    r = vhost_iova_tree_insert(v->shared->iova_tree, needle);
> >>>> +    if (unlikely(r != IOVA_OK)) {
> >>>> +        error_setg(errp, "Cannot add SVQ vring mapping (%d)", r);
> >>>> +        vhost_iova_tree_remove(v->shared->iova_tree, *needle);
> >>>> +        return false;
> >>>> +    }
> >>>> +
> >>>>        r = vhost_vdpa_dma_map(v->shared, v->address_space_id, needle->iova,
> >>>>                               needle->size + 1,
> >>>>                               (void *)(uintptr_t)needle->translated_addr,
> >>>> @@ -1192,11 +1211,11 @@ static bool vhost_vdpa_svq_map_rings(struct vhost_dev *dev,
> >>>>        vhost_svq_get_vring_addr(svq, &svq_addr);
> >>>>
> >>>>        driver_region = (DMAMap) {
> >>>> -        .translated_addr = svq_addr.desc_user_addr,
> >>>>            .size = driver_size - 1,
> >>>>            .perm = IOMMU_RO,
> >>>>        };
> >>>> -    ok = vhost_vdpa_svq_map_ring(v, &driver_region, errp);
> >>>> +    ok = vhost_vdpa_svq_map_ring(v, &driver_region, svq_addr.desc_user_addr,
> >>>> +                                 errp);
> >>>>        if (unlikely(!ok)) {
> >>>>            error_prepend(errp, "Cannot create vq driver region: ");
> >>>>            return false;
> >>>> @@ -1206,11 +1225,11 @@ static bool vhost_vdpa_svq_map_rings(struct vhost_dev *dev,
> >>>>        addr->avail_user_addr = driver_region.iova + avail_offset;
> >>>>
> >>>>        device_region = (DMAMap) {
> >>>> -        .translated_addr = svq_addr.used_user_addr,
> >>>>            .size = device_size - 1,
> >>>>            .perm = IOMMU_RW,
> >>>>        };
> >>>> -    ok = vhost_vdpa_svq_map_ring(v, &device_region, errp);
> >>>> +    ok = vhost_vdpa_svq_map_ring(v, &device_region, svq_addr.used_user_addr,
> >>>> +                                 errp);
> >>>>        if (unlikely(!ok)) {
> >>>>            error_prepend(errp, "Cannot create vq device region: ");
> >>>>            vhost_vdpa_svq_unmap_ring(v, driver_region.translated_addr);
> >>>> diff --git a/net/vhost-vdpa.c b/net/vhost-vdpa.c
> >>>> index 03457ead66..81da956b92 100644
> >>>> --- a/net/vhost-vdpa.c
> >>>> +++ b/net/vhost-vdpa.c
> >>>> @@ -512,15 +512,24 @@ static int vhost_vdpa_cvq_map_buf(struct vhost_vdpa *v, void *buf, size_t size,
> >>>>        DMAMap map = {};
> >>>>        int r;
> >>>>
> >>>> -    map.translated_addr = (hwaddr)(uintptr_t)buf;
> >>>>        map.size = size - 1;
> >>>>        map.perm = write ? IOMMU_RW : IOMMU_RO,
> >>>> +
> >>>> +    /* Allocate IOVA range and add the mapping to the IOVA tree */
> >>>>        r = vhost_iova_tree_map_alloc(v->shared->iova_tree, &map);
> >>>>        if (unlikely(r != IOVA_OK)) {
> >>>> -        error_report("Cannot map injected element");
> >>>> +        error_report("Cannot allocate IOVA range for injected element");
> >>>>            return r;
> >>>>        }
> >>>>
> >>>> +    /* Add mapping to the IOVA->HVA tree */
> >>>> +    map.translated_addr = (hwaddr)(uintptr_t)buf;
> >>>> +    r = vhost_iova_tree_insert(v->shared->iova_tree, &map);
> >>>> +    if (unlikely(r != IOVA_OK)) {
> >>>> +        error_report("Cannot map injected element into IOVA->HVA tree");
> >>>> +        goto dma_map_err;
> >>>> +    }
> >>>> +
> >>>>        r = vhost_vdpa_dma_map(v->shared, v->address_space_id, map.iova,
> >>>>                               vhost_vdpa_net_cvq_cmd_page_len(), buf, !write);
> >>>>        if (unlikely(r < 0)) {
> >>>> --
> >>>> 2.43.5
> >>>>
>
Si-Wei Liu Sept. 10, 2024, 5:29 a.m. UTC | #8
Sorry for the delayed response; it seems I missed the email reply for 
some reason during the long weekend.

On 9/2/2024 4:09 AM, Eugenio Perez Martin wrote:
> On Fri, Aug 30, 2024 at 11:05 PM Si-Wei Liu <si-wei.liu@oracle.com> wrote:
>>
>>
>> On 8/30/2024 1:05 AM, Eugenio Perez Martin wrote:
>>> On Fri, Aug 30, 2024 at 6:20 AM Si-Wei Liu <si-wei.liu@oracle.com> wrote:
>>>>
>>>> On 8/29/2024 9:53 AM, Eugenio Perez Martin wrote:
>>>>> On Wed, Aug 21, 2024 at 2:56 PM Jonah Palmer <jonah.palmer@oracle.com> wrote:
>>>>>> Decouples the IOVA allocator from the IOVA->HVA tree and instead adds
>>>>>> the allocated IOVA range to an IOVA-only tree (iova_map). This IOVA tree
>>>>>> will hold all IOVA ranges that have been allocated (e.g. in the
>>>>>> IOVA->HVA tree) and are removed when any IOVA ranges are deallocated.
>>>>>>
>>>>>> A new API function vhost_iova_tree_insert() is also created to add a
>>>>>> IOVA->HVA mapping into the IOVA->HVA tree.
>>>>>>
>>>>> I think this is a good first iteration but we can take steps to
>>>>> simplify it. Also, it is great to be able to make points on real code
>>>>> instead of designs on the air :).
>>>>>
>>>>> I expected a split of vhost_iova_tree_map_alloc between the current
>>>>> vhost_iova_tree_map_alloc and vhost_iova_tree_map_alloc_gpa, or
>>>>> similar. Similarly, a vhost_iova_tree_remove and
>>>>> vhost_iova_tree_remove_gpa would be needed.
>>>>>
>>>>> The first one is used for regions that don't exist in the guest, like
>>>>> SVQ vrings or CVQ buffers. The second one is the one used by the
>>>>> memory listener to map the guest regions into the vdpa device.
>>>>>
>>>>> Implementation wise, only two trees are actually needed:
>>>>> * Current iova_taddr_map that contains all IOVA->vaddr translations as
>>>>> seen by the device, so both allocation functions can work on a single
>>>>> tree. The function iova_tree_find_iova keeps using this one, so the
>>>> I thought we had thorough discussion about this and agreed upon the
>>>> decoupled IOVA allocator solution.
>>> My interpretation of it is to leave the allocator as the current one,
>>> and create a new tree with GPA which is guaranteed to be unique. But
>>> we can talk over it of course.
>>>
>>>> But maybe I missed something earlier,
>>>> I am not clear how come this iova_tree_find_iova function could still
>>>> work with the full IOVA-> HVA tree when it comes to aliased memory or
>>>> overlapped HVAs? Granted, for the memory map removal in the
>>>> .region_del() path, we could rely on the GPA tree to locate the
>>>> corresponding IOVA, but how come the translation path could figure out
>>>> which IOVA range to return when the vaddr happens to fall in an
>>>> overlapped HVA range?
>>> That is not a problem, as they both translate to the same address at the device.
>> Not sure I followed, it might return a wrong IOVA (range) which the host
>> kernel may have conflict or unmatched attribute i.e. permission, size et
>> al in the map.
>>
> Let's leave out the permissions at the moment. I'm going to use the
> example you found, but I'll reorder (1) and (3) insertions so it picks
> the "wrong" IOVA range intentionally:
>
> (1)
> HVA: [0x7f7903ea0000, 0x7f7903ec0000)
> GPA: [0xfeda0000, 0xfedc0000)
> IOVA: [0x1000, 0x21000)
>
> (2)
> HVA: [0x7f7983e00000, 0x7f9903e00000)
> GPA: [0x100000000, 0x2080000000)
> IOVA: [0x80001000, 0x2000001000)
>
> (3)
> HVA: [0x7f7903e00000, 0x7f7983e00000)
> GPA: [0x0, 0x80000000)
> IOVA: [0x2000001000, 0x2080000000)
>
> Let's say that SVQ wants to translate the HVA range backing GPA
> 0xfeda0000-0xfedd0000, i.e. HVA 0x7f7903ea0000-0x7f7903ed0000. So it
> makes two chained buffers available to the device: one with addr=0x1000
> len=0x20000 and the other with addr=0x20000c1000 len=0x10000.
>
> The VirtIO device should be able to translate these two buffers in
> isolation and chain them. Not optimal but it helps to keep QEMU source
> clean, as the device already must support it. I don't foresee lots of
> cases like this anyway :).
Hmmm, this scheme will only work if the aliased map doesn't go away 
immediately. If the BQL is not held, or an unmap is done outside the RCU 
critical section, it's pretty dangerous to assume we can always safely 
work with the other overlapped regions. In my opinion, if we have to 
work with the HVA tree, it'd be ideal for the caller(s) to get more help 
in figuring out the right matching IOVA range, rather than having SVQ 
or vhost-iova-tree blindly pick a random one or break up a contiguous 
range into pieces (in an inconsistent and unnecessary way). This would 
require fairly extensive changes to all the callers to pass in more 
information though, like the GPA, or the RAMBlock/MemoryRegionSection 
backing the relevant guest memory, along with the offset.

> About the permissions, maybe we can make the permissions to be part of
> the lookup? Instead of returning them at iova_tree_find_iova, make
> them match at iova_tree_find_address_iterator.
Yes, if there's no easy way out, we have to add this extra info to the 
HVA tree and make the lookup routine even more complex (or duplicative).

>
>>> The most complicated situation is where we have a region contained in
>>> another region, and the requested buffer crosses them. If the IOVA
>>> tree returns the inner region, it will return the buffer chained with
>>> the rest of the content in the outer region. Not optimal, but solved
>>> either way.
>> Don't quite understand what it means... So in this overlapping case,
>> speaking of the expectation of the translation API, you would like to
>> have all IOVA ranges that match the overlapped HVA to be returned? And
>> then to rely on the user (caller) to figure out which one is correct?
>> Wouldn't it be easier for the user (SVQ) to use the memory system API
>> directly to figure out?
>>
> All of them are correct in the translation path. The device should be
> able to work with a buffer that spans over different IOTLB too. You
> can see how QEMU handles it at hw/virtio/virtio.c:virtqueue_map_desc.
> If the region is not big enough to contain the whole buffer, the
> device must keep looking for the rest of it.
Yeah, I see why you prefer working with the HVA tree even with overlapping 
ranges, as the current API virtqueue_map_desc(), which returns the HVA, 
already wraps up the translation internals well, e.g. when a buffer spans 
different IOMMU regions. Are you worried about the vIOMMU case, where the 
GPA is no longer cached in the virtqueue elem? Maybe we can also add that 
information to the elem even for vIOMMU (we can defer doing it until we 
add vIOMMU support to SVQ), so that SVQ can just look up the GPA 
tree directly in the translation path?
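A GPA-keyed lookup would be unambiguous, since unlike the HVAs in the earlier example, the guest physical ranges never alias. A minimal sketch of that translation path, using the example's GPA ranges; the names are invented for illustration, not existing QEMU API:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical GPA->IOVA map entries, from the earlier example. */
typedef struct {
    uint64_t gpa;
    uint64_t iova;
    uint64_t size;
} GpaMap;

static const GpaMap gpa_maps[] = {
    { 0xfeda0000,  0x1000,       0x20000 },      /* (1) */
    { 0x100000000, 0x80001000,   0x1f80000000 }, /* (2) */
    { 0x0,         0x2000001000, 0x80000000 },   /* (3) */
};

/*
 * Resolve a GPA to its IOVA. Since guest physical ranges do not overlap,
 * at most one entry can match, so no disambiguation is needed.
 */
static int gpa_to_iova(uint64_t gpa, uint64_t *iova)
{
    for (size_t i = 0; i < sizeof(gpa_maps) / sizeof(gpa_maps[0]); i++) {
        if (gpa >= gpa_maps[i].gpa &&
            gpa < gpa_maps[i].gpa + gpa_maps[i].size) {
            *iova = gpa_maps[i].iova + (gpa - gpa_maps[i].gpa);
            return 0;
        }
    }
    return -1; /* not guest memory; fall back to the HVA tree */
}
```

For example, GPA 0xfedb0000 resolves to IOVA 0x11000 via region (1), and GPA 0x1000 to IOVA 0x2000002000 via region (3); addresses outside guest memory (SVQ vrings, CVQ buffers) would still go through the HVA tree.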
>
>> As we are talking about API we may want to build it in a way generic
>> enough to address all possible needs (which goes with what memory
>> subsystem is capable of), rather than just look on the current usage
>> which has kind of narrow scope. Although virtio-net device doesn't work
>> with aliased region now, some other virtio device may do, or maybe some
>> day virtio-net would need to use aliased region than the API and the
>> users (SVQ) would have to go with another round of significant
>> refactoring due to the iova-tree internal working. I feel it's just too
>> early or too tight to abstract the iova-tree layer and get the API
>> customized for the current use case with a lot of limitations on how
>> user should expect to use it. We need some more flexibility and ease on
>> extensibility if we want to take the chance to get it rewritten, given
>> it is not a lot of code that Jonah had showed here ..
>>
> Let me know if your concerns are addressed here. Sorry if I didn't explain it
> well, but I'm not trying to limit the aliases or to handle just a subset
> of them. I'm trying to delegate the handling of these to the device as
> much as possible, as the device already needs to handle them and the
> less we complicate the QEMU solution, the better. Of course, the IOVA
> tree is a very self-contained area easy to rewrite in theory, but with
> potential future users it might get complicated.
Sure, understood. I just want to compare the pros and cons of each 
candidate, so that Jonah won't spend a lot of time coming up with 
complicated code, only to soon find out that all or most of it has to be 
thrown away due to a short-sighted design that is unable to cope with 
foreseeable future use cases.

Thanks,
-Siwei

>
>>> The only problem that comes to my mind is the case where the inner
>>> region is RO
>> Yes, this is one of examples around the permission or size I mentioned
>> above, which may have a conflict view with the memory system or the kernel.
>>
>> Thanks,
>> -Siwei
>>
>>> and it is a write command, but I don't think we have this
>>> case in a sane guest. A malicious guest cannot do any harm this way
>>> anyway.
>>>
>>>> Do we still assume some overlapping order so we
>>>> always return the first match from the tree? Or we expect every current
>>>> user of iova_tree_find_iova should pass in GPA rather than HVA and use
>>>> the vhost_iova_xxx_gpa API variant to look up IOVA?
>>>>
>>> No, iova_tree_find_iova should keep asking for vaddr, as the result is
>>> guaranteed to be there. Users of VhostIOVATree only need to modify how
>>> they add or remove regions, knowing if they come from the guest or
>>> not. As shown by this series, it is easier to do in that place than in
>>> translation.
>>>
>>>> Thanks,
>>>> -Siwei
>>>>
>>>>> user does not need to know if the address is from the guest or only
>>>>> exists in QEMU by using RAMBlock etc. All insert and remove functions
>>>>> use this tree.
>>>>> * A new tree that relates IOVA to GPA, that only
>>>>> vhost_iova_tree_map_alloc_gpa and vhost_iova_tree_remove_gpa uses.
>>>>>
>>>>> The ideal case is that the key in this new tree is the GPA and the
>>>>> value is the IOVA. But IOVATree's DMA is named the reverse: iova is
>>>>> the key and translated_addr is the vaddr. We can create a new tree
>>>>> struct for that, use GTree directly, or translate the reverse
>>>>> linearly. As memory add / remove should not be frequent, I think the
>>>>> simpler is the last one, but I'd be ok with creating a new tree.
>>>>>
>>>>> vhost_iova_tree_map_alloc_gpa needs to add the map to this new tree
>>>>> also. Similarly, vhost_iova_tree_remove_gpa must look for the GPA in
>>>>> this tree, and only remove the associated DMAMap in iova_taddr_map
>>>>> that matches the IOVA.
>>>>>
>>>>> Does it make sense to you?
>>>>>
>>>>>> Signed-off-by: Jonah Palmer <jonah.palmer@oracle.com>
>>>>>> ---
>>>>>>     hw/virtio/vhost-iova-tree.c | 38 ++++++++++++++++++++++++++++++++-----
>>>>>>     hw/virtio/vhost-iova-tree.h |  1 +
>>>>>>     hw/virtio/vhost-vdpa.c      | 31 ++++++++++++++++++++++++------
>>>>>>     net/vhost-vdpa.c            | 13 +++++++++++--
>>>>>>     4 files changed, 70 insertions(+), 13 deletions(-)
>>>>>>
>>>>>> diff --git a/hw/virtio/vhost-iova-tree.c b/hw/virtio/vhost-iova-tree.c
>>>>>> index 3d03395a77..32c03db2f5 100644
>>>>>> --- a/hw/virtio/vhost-iova-tree.c
>>>>>> +++ b/hw/virtio/vhost-iova-tree.c
>>>>>> @@ -28,12 +28,17 @@ struct VhostIOVATree {
>>>>>>
>>>>>>         /* IOVA address to qemu memory maps. */
>>>>>>         IOVATree *iova_taddr_map;
>>>>>> +
>>>>>> +    /* IOVA tree (IOVA allocator) */
>>>>>> +    IOVATree *iova_map;
>>>>>>     };
>>>>>>
>>>>>>     /**
>>>>>> - * Create a new IOVA tree
>>>>>> + * Create a new VhostIOVATree with a new set of IOVATree's:
>>>>> s/IOVA tree/VhostIOVATree/ is good, but I think the rest is more an
>>>>> implementation detail.
>>>>>
>>>>>> + * - IOVA allocator (iova_map)
>>>>>> + * - IOVA->HVA tree (iova_taddr_map)
>>>>>>      *
>>>>>> - * Returns the new IOVA tree
>>>>>> + * Returns the new VhostIOVATree
>>>>>>      */
>>>>>>     VhostIOVATree *vhost_iova_tree_new(hwaddr iova_first, hwaddr iova_last)
>>>>>>     {
>>>>>> @@ -44,6 +49,7 @@ VhostIOVATree *vhost_iova_tree_new(hwaddr iova_first, hwaddr iova_last)
>>>>>>         tree->iova_last = iova_last;
>>>>>>
>>>>>>         tree->iova_taddr_map = iova_tree_new();
>>>>>> +    tree->iova_map = iova_tree_new();
>>>>>>         return tree;
>>>>>>     }
>>>>>>
>>>>>> @@ -53,6 +59,7 @@ VhostIOVATree *vhost_iova_tree_new(hwaddr iova_first, hwaddr iova_last)
>>>>>>     void vhost_iova_tree_delete(VhostIOVATree *iova_tree)
>>>>>>     {
>>>>>>         iova_tree_destroy(iova_tree->iova_taddr_map);
>>>>>> +    iova_tree_destroy(iova_tree->iova_map);
>>>>>>         g_free(iova_tree);
>>>>>>     }
>>>>>>
>>>>>> @@ -88,13 +95,12 @@ int vhost_iova_tree_map_alloc(VhostIOVATree *tree, DMAMap *map)
>>>>>>         /* Some vhost devices do not like addr 0. Skip first page */
>>>>>>         hwaddr iova_first = tree->iova_first ?: qemu_real_host_page_size();
>>>>>>
>>>>>> -    if (map->translated_addr + map->size < map->translated_addr ||
>>>>> Why remove this condition? If the request is invalid we still need to
>>>>> return an error here.
>>>>>
>>>>> Maybe we should move it to iova_tree_alloc_map though.
>>>>>
>>>>>> -        map->perm == IOMMU_NONE) {
>>>>>> +    if (map->perm == IOMMU_NONE) {
>>>>>>             return IOVA_ERR_INVALID;
>>>>>>         }
>>>>>>
>>>>>>         /* Allocate a node in IOVA address */
>>>>>> -    return iova_tree_alloc_map(tree->iova_taddr_map, map, iova_first,
>>>>>> +    return iova_tree_alloc_map(tree->iova_map, map, iova_first,
>>>>>>                                    tree->iova_last);
>>>>>>     }
>>>>>>
>>>>>> @@ -107,4 +113,26 @@ int vhost_iova_tree_map_alloc(VhostIOVATree *tree, DMAMap *map)
>>>>>>     void vhost_iova_tree_remove(VhostIOVATree *iova_tree, DMAMap map)
>>>>>>     {
>>>>>>         iova_tree_remove(iova_tree->iova_taddr_map, map);
>>>>>> +    iova_tree_remove(iova_tree->iova_map, map);
>>>>>> +}
>>>>>> +
>>>>>> +/**
>>>>>> + * Insert a new mapping to the IOVA->HVA tree
>>>>>> + *
>>>>>> + * @tree: The VhostIOVATree
>>>>>> + * @map: The iova map
>>>>>> + *
>>>>>> + * Returns:
>>>>>> + * - IOVA_OK if the map fits in the container
>>>>>> + * - IOVA_ERR_INVALID if the map does not make sense (like size overflow)
>>>>>> + * - IOVA_ERR_OVERLAP if the IOVA range overlaps with an existing range
>>>>>> + */
>>>>>> +int vhost_iova_tree_insert(VhostIOVATree *iova_tree, DMAMap *map)
>>>>>> +{
>>>>>> +    if (map->translated_addr + map->size < map->translated_addr ||
>>>>>> +        map->perm == IOMMU_NONE) {
>>>>>> +        return IOVA_ERR_INVALID;
>>>>>> +    }
>>>>>> +
>>>>>> +    return iova_tree_insert(iova_tree->iova_taddr_map, map);
>>>>>>     }
>>>>>> diff --git a/hw/virtio/vhost-iova-tree.h b/hw/virtio/vhost-iova-tree.h
>>>>>> index 4adfd79ff0..8bf7b64786 100644
>>>>>> --- a/hw/virtio/vhost-iova-tree.h
>>>>>> +++ b/hw/virtio/vhost-iova-tree.h
>>>>>> @@ -23,5 +23,6 @@ const DMAMap *vhost_iova_tree_find_iova(const VhostIOVATree *iova_tree,
>>>>>>                                             const DMAMap *map);
>>>>>>     int vhost_iova_tree_map_alloc(VhostIOVATree *iova_tree, DMAMap *map);
>>>>>>     void vhost_iova_tree_remove(VhostIOVATree *iova_tree, DMAMap map);
>>>>>> +int vhost_iova_tree_insert(VhostIOVATree *iova_tree, DMAMap *map);
>>>>>>
>>>>>>     #endif
>>>>>> diff --git a/hw/virtio/vhost-vdpa.c b/hw/virtio/vhost-vdpa.c
>>>>>> index 3cdaa12ed5..6702459065 100644
>>>>>> --- a/hw/virtio/vhost-vdpa.c
>>>>>> +++ b/hw/virtio/vhost-vdpa.c
>>>>>> @@ -361,10 +361,10 @@ static void vhost_vdpa_listener_region_add(MemoryListener *listener,
>>>>>>         if (s->shadow_data) {
>>>>>>             int r;
>>>>>>
>>>>>> -        mem_region.translated_addr = (hwaddr)(uintptr_t)vaddr,
>>>>>>             mem_region.size = int128_get64(llsize) - 1,
>>>>>>             mem_region.perm = IOMMU_ACCESS_FLAG(true, section->readonly),
>>>>>>
>>>>>> +        /* Allocate IOVA range and add the mapping to the IOVA tree */
>>>>>>             r = vhost_iova_tree_map_alloc(s->iova_tree, &mem_region);
>>>>>>             if (unlikely(r != IOVA_OK)) {
>>>>>>                 error_report("Can't allocate a mapping (%d)", r);
>>>>>> @@ -372,6 +372,14 @@ static void vhost_vdpa_listener_region_add(MemoryListener *listener,
>>>>>>             }
>>>>>>
>>>>>>             iova = mem_region.iova;
>>>>>> +
>>>>>> +        /* Add mapping to the IOVA->HVA tree */
>>>>>> +        mem_region.translated_addr = (hwaddr)(uintptr_t)vaddr;
>>>>>> +        r = vhost_iova_tree_insert(s->iova_tree, &mem_region);
>>>>>> +        if (unlikely(r != IOVA_OK)) {
>>>>>> +            error_report("Can't add listener region mapping (%d)", r);
>>>>>> +            goto fail_map;
>>>>>> +        }
>>>>> I'd say it is not intuitive for the caller code.
>>>>>
>>>>>>         }
>>>>>>
>>>>>>         vhost_vdpa_iotlb_batch_begin_once(s);
>>>>>> @@ -1142,19 +1150,30 @@ static void vhost_vdpa_svq_unmap_rings(struct vhost_dev *dev,
>>>>>>      *
>>>>>>      * @v: Vhost-vdpa device
>>>>>>      * @needle: The area to search iova
>>>>>> + * @taddr: The translated address (SVQ HVA)
>>>>>>      * @errorp: Error pointer
>>>>>>      */
>>>>>>     static bool vhost_vdpa_svq_map_ring(struct vhost_vdpa *v, DMAMap *needle,
>>>>>> -                                    Error **errp)
>>>>>> +                                    hwaddr taddr, Error **errp)
>>>>>>     {
>>>>>>         int r;
>>>>>>
>>>>>> +    /* Allocate IOVA range and add the mapping to the IOVA tree */
>>>>>>         r = vhost_iova_tree_map_alloc(v->shared->iova_tree, needle);
>>>>>>         if (unlikely(r != IOVA_OK)) {
>>>>>>             error_setg(errp, "Cannot allocate iova (%d)", r);
>>>>>>             return false;
>>>>>>         }
>>>>>>
>>>>>> +    /* Add mapping to the IOVA->HVA tree */
>>>>>> +    needle->translated_addr = taddr;
>>>>>> +    r = vhost_iova_tree_insert(v->shared->iova_tree, needle);
>>>>>> +    if (unlikely(r != IOVA_OK)) {
>>>>>> +        error_setg(errp, "Cannot add SVQ vring mapping (%d)", r);
>>>>>> +        vhost_iova_tree_remove(v->shared->iova_tree, *needle);
>>>>>> +        return false;
>>>>>> +    }
>>>>>> +
>>>>>>         r = vhost_vdpa_dma_map(v->shared, v->address_space_id, needle->iova,
>>>>>>                                needle->size + 1,
>>>>>>                                (void *)(uintptr_t)needle->translated_addr,
>>>>>> @@ -1192,11 +1211,11 @@ static bool vhost_vdpa_svq_map_rings(struct vhost_dev *dev,
>>>>>>         vhost_svq_get_vring_addr(svq, &svq_addr);
>>>>>>
>>>>>>         driver_region = (DMAMap) {
>>>>>> -        .translated_addr = svq_addr.desc_user_addr,
>>>>>>             .size = driver_size - 1,
>>>>>>             .perm = IOMMU_RO,
>>>>>>         };
>>>>>> -    ok = vhost_vdpa_svq_map_ring(v, &driver_region, errp);
>>>>>> +    ok = vhost_vdpa_svq_map_ring(v, &driver_region, svq_addr.desc_user_addr,
>>>>>> +                                 errp);
>>>>>>         if (unlikely(!ok)) {
>>>>>>             error_prepend(errp, "Cannot create vq driver region: ");
>>>>>>             return false;
>>>>>> @@ -1206,11 +1225,11 @@ static bool vhost_vdpa_svq_map_rings(struct vhost_dev *dev,
>>>>>>         addr->avail_user_addr = driver_region.iova + avail_offset;
>>>>>>
>>>>>>         device_region = (DMAMap) {
>>>>>> -        .translated_addr = svq_addr.used_user_addr,
>>>>>>             .size = device_size - 1,
>>>>>>             .perm = IOMMU_RW,
>>>>>>         };
>>>>>> -    ok = vhost_vdpa_svq_map_ring(v, &device_region, errp);
>>>>>> +    ok = vhost_vdpa_svq_map_ring(v, &device_region, svq_addr.used_user_addr,
>>>>>> +                                 errp);
>>>>>>         if (unlikely(!ok)) {
>>>>>>             error_prepend(errp, "Cannot create vq device region: ");
>>>>>>             vhost_vdpa_svq_unmap_ring(v, driver_region.translated_addr);
>>>>>> diff --git a/net/vhost-vdpa.c b/net/vhost-vdpa.c
>>>>>> index 03457ead66..81da956b92 100644
>>>>>> --- a/net/vhost-vdpa.c
>>>>>> +++ b/net/vhost-vdpa.c
>>>>>> @@ -512,15 +512,24 @@ static int vhost_vdpa_cvq_map_buf(struct vhost_vdpa *v, void *buf, size_t size,
>>>>>>         DMAMap map = {};
>>>>>>         int r;
>>>>>>
>>>>>> -    map.translated_addr = (hwaddr)(uintptr_t)buf;
>>>>>>         map.size = size - 1;
>>>>>>         map.perm = write ? IOMMU_RW : IOMMU_RO,
>>>>>> +
>>>>>> +    /* Allocate IOVA range and add the mapping to the IOVA tree */
>>>>>>         r = vhost_iova_tree_map_alloc(v->shared->iova_tree, &map);
>>>>>>         if (unlikely(r != IOVA_OK)) {
>>>>>> -        error_report("Cannot map injected element");
>>>>>> +        error_report("Cannot allocate IOVA range for injected element");
>>>>>>             return r;
>>>>>>         }
>>>>>>
>>>>>> +    /* Add mapping to the IOVA->HVA tree */
>>>>>> +    map.translated_addr = (hwaddr)(uintptr_t)buf;
>>>>>> +    r = vhost_iova_tree_insert(v->shared->iova_tree, &map);
>>>>>> +    if (unlikely(r != IOVA_OK)) {
>>>>>> +        error_report("Cannot map injected element into IOVA->HVA tree");
>>>>>> +        goto dma_map_err;
>>>>>> +    }
>>>>>> +
>>>>>>         r = vhost_vdpa_dma_map(v->shared, v->address_space_id, map.iova,
>>>>>>                                vhost_vdpa_net_cvq_cmd_page_len(), buf, !write);
>>>>>>         if (unlikely(r < 0)) {
>>>>>> --
>>>>>> 2.43.5
>>>>>>
Eugenio Perez Martin Sept. 10, 2024, 6:22 a.m. UTC | #9
On Tue, Sep 10, 2024 at 7:30 AM Si-Wei Liu <si-wei.liu@oracle.com> wrote:
>
> Sorry for the delayed response, it seems I missed the email reply for
> some reason during the long weekend.
>
> On 9/2/2024 4:09 AM, Eugenio Perez Martin wrote:
> > On Fri, Aug 30, 2024 at 11:05 PM Si-Wei Liu <si-wei.liu@oracle.com> wrote:
> >>
> >>
> >> On 8/30/2024 1:05 AM, Eugenio Perez Martin wrote:
> >>> On Fri, Aug 30, 2024 at 6:20 AM Si-Wei Liu <si-wei.liu@oracle.com> wrote:
> >>>>
> >>>> On 8/29/2024 9:53 AM, Eugenio Perez Martin wrote:
> >>>>> On Wed, Aug 21, 2024 at 2:56 PM Jonah Palmer <jonah.palmer@oracle.com> wrote:
> >>>>>> Decouples the IOVA allocator from the IOVA->HVA tree and instead adds
> >>>>>> the allocated IOVA range to an IOVA-only tree (iova_map). This IOVA tree
> >>>>>> will hold all IOVA ranges that have been allocated (e.g. in the
> >>>>>> IOVA->HVA tree) and are removed when any IOVA ranges are deallocated.
> >>>>>>
> >>>>>> A new API function vhost_iova_tree_insert() is also created to add a
> >>>>>> IOVA->HVA mapping into the IOVA->HVA tree.
> >>>>>>
> >>>>> I think this is a good first iteration but we can take steps to
> >>>>> simplify it. Also, it is great to be able to make points on real code
> >>>>> instead of designs on the air :).
> >>>>>
> >>>>> I expected a split of vhost_iova_tree_map_alloc between the current
> >>>>> vhost_iova_tree_map_alloc and vhost_iova_tree_map_alloc_gpa, or
> >>>>> similar. Similarly, a vhost_iova_tree_remove and
> >>>>> vhost_iova_tree_remove_gpa would be needed.
> >>>>>
> >>>>> The first one is used for regions that don't exist in the guest, like
> >>>>> SVQ vrings or CVQ buffers. The second one is the one used by the
> >>>>> memory listener to map the guest regions into the vdpa device.
> >>>>>
> >>>>> Implementation wise, only two trees are actually needed:
> >>>>> * Current iova_taddr_map that contains all IOVA->vaddr translations as
> >>>>> seen by the device, so both allocation functions can work on a single
> >>>>> tree. The function iova_tree_find_iova keeps using this one, so the
> >>>> I thought we had thorough discussion about this and agreed upon the
> >>>> decoupled IOVA allocator solution.
> >>> My interpretation of it is to leave the allocator as the current one,
> >>> and create a new tree with GPA which is guaranteed to be unique. But
> >>> we can talk over it of course.
> >>>
> >>>> But maybe I missed something earlier,
> >>>> I am not clear how come this iova_tree_find_iova function could still
> >>>> work with the full IOVA-> HVA tree when it comes to aliased memory or
> >>>> overlapped HVAs? Granted, for the memory map removal in the
> >>>> .region_del() path, we could rely on the GPA tree to locate the
> >>>> corresponding IOVA, but how come the translation path could figure out
> >>>> which IOVA range to return when the vaddr happens to fall in an
> >>>> overlapped HVA range?
> >>> That is not a problem, as they both translate to the same address at the device.
> >> Not sure I followed, it might return a wrong IOVA (range) which the host
> >> kernel may have conflict or unmatched attribute i.e. permission, size et
> >> al in the map.
> >>
> > Let's leave out the permissions at the moment. I'm going to use the
> > example you found, but I'll reorder (1) and (3) insertions so it picks
> > the "wrong" IOVA range intentionally:
> >
> > (1)
> > HVA: [0x7f7903ea0000, 0x7f7903ec0000)
> > GPA: [0xfeda0000, 0xfedc0000)
> > IOVA: [0x1000, 0x21000)
> >
> > (2)
> > HVA: [0x7f7983e00000, 0x7f9903e00000)
> > GPA: [0x100000000, 0x2080000000)
> > IOVA: [0x80001000, 0x2000001000)
> >
> > (3)
> > HVA: [0x7f7903e00000, 0x7f7983e00000)
> > GPA: [0x0, 0x80000000)
> > IOVA: [0x2000001000, 0x2080000000)
> >
> > Let's say that SVQ wants to translate the HVA range
> > 0xfeda0000-0xfedd0000. So it makes available for the device two
> > chained buffers: One with addr=0x1000 len=0x20000 and the other one
> > with addr=(0x20000c1000 len=0x10000).
> >
> > The VirtIO device should be able to translate these two buffers in
> > isolation and chain them. Not optimal but it helps to keep QEMU source
> > clean, as the device already must support it. I don't foresee lots of
> > cases like this anyway :).
> Hmmm, this scheme will only work if the aliased map doesn't go away
> immediately. If the BQL is not held or an unmap is to be done out of RCU
> critical section, it's pretty dangerous to assume we can be always fine
> to work with the other overlapped regions.

But all the updates and reads are done in the critical sections, and
will be that way in the future too. That's the reason why the iova
tree does not have any mutex or equivalent.

If we take out SVQ from BQL, we will need to protect the update of it
with something, either a mutex or similar. But that's already part of
the plan, even without my proposal or if we implement this RFC the way
it is.

> In my opinion if we have to
> work with the HVA tree, it'd be ideal for the caller(s) to get more aid
> in helping figure out the right IOVA range in match rather than for SVQ
> or vhost-iova-tree to blindly pick a random one or break up contiguous
> range into pieces (in an inconsistent and unnecessary way).

The point is that SVQ already needs to work like that, as it needs to
support a contiguous GPA range being split between different HVAs.
Moreover, this is not a part of SVQ, but of VirtQueue. Both should be
extremely rare cases. Is it worth it to complicate / penalize the
general case solution to benefit this weird case, which is supported?
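The chained-buffer translation described in the example earlier in this
thread can be sketched in a few lines of C. This is a toy model only,
not QEMU's actual DMAMap/IOVATree API: a flat table of HVA->IOVA ranges
and a translation loop that emits one descriptor per IOVA range the
buffer crosses, which is exactly the chained form a VirtIO device must
already be able to consume.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Toy stand-ins for QEMU's DMAMap/IOVATree; names and layout are
 * illustrative only. */
typedef struct { uint64_t hva, iova, len; } Map;
typedef struct { uint64_t iova, len; } Desc;

/* The three ranges from the example in this thread. */
static const Map maps[] = {
    { 0x7f7903ea0000ULL, 0x0000001000ULL, 0x0000020000ULL }, /* (1) alias */
    { 0x7f7983e00000ULL, 0x0080001000ULL, 0x1f80000000ULL }, /* (2)       */
    { 0x7f7903e00000ULL, 0x2000001000ULL, 0x0080000000ULL }, /* (3)       */
};

/*
 * Translate [hva, hva + size) into chained (iova, len) descriptors,
 * emitting one descriptor per IOVA range crossed.  The first matching
 * range wins, as in the discussion above.  Returns the descriptor count.
 */
static size_t translate_chained(const Map *m, size_t nmaps,
                                uint64_t hva, uint64_t size,
                                Desc *out, size_t max_out)
{
    size_t n = 0;

    while (size && n < max_out) {
        const Map *hit = NULL;

        for (size_t i = 0; i < nmaps; i++) {
            if (hva >= m[i].hva && hva < m[i].hva + m[i].len) {
                hit = &m[i];
                break;
            }
        }
        if (!hit) {
            break; /* untranslated hole: bail out */
        }

        uint64_t off = hva - hit->hva;
        uint64_t chunk = hit->len - off < size ? hit->len - off : size;

        out[n].iova = hit->iova + off;
        out[n].len = chunk;
        n++;
        hva += chunk;
        size -= chunk;
    }
    return n;
}
```

Translating the HVA range backing GPA 0xfeda0000-0xfedd0000 (HVA
0x7f7903ea0000, size 0x30000) yields the two chained descriptors from
the example: (iova=0x1000, len=0x20000) and (iova=0x20000c1000,
len=0x10000).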

> This would
> require a bit extensive changes to all the callers to pass in more
> information though, like the GPA, or the RAMBlock/MemoryRegionSection
> backing the relevant guest memory, along with the offset.
>
> > About the permissions, maybe we can make the permissions to be part of
> > the lookup? Instead of returning them at iova_tree_find_iova, make
> > them match at iova_tree_find_address_iterator.
> Yes, if there's no easy way out we have to add this extra info to the
> HVA tree and make the lookup routine even complex (or duplicative).
>

Right.

> >
> >>> The most complicated situation is where we have a region contained in
> >>> another region, and the requested buffer crosses them. If the IOVA
> >>> tree returns the inner region, it will return the buffer chained with
> >>> the rest of the content in the outer region. Not optimal, but solved
> >>> either way.
> >> Don't quite understand what it means... So in this overlapping case,
> >> speaking of the expectation of the translation API, you would like to
> >> have all IOVA ranges that match the overlapped HVA to be returned? And
> >> then to rely on the user (caller) to figure out which one is correct?
> >> Wouldn't it be easier for the user (SVQ) to use the memory system API
> >> directly to figure out?
> >>
> > All of them are correct in the translation path. The device should be
> > able to work with a buffer that spans over different IOTLB too. You
> > can see how QEMU handles it at hw/virtio/virtio.c:virtqueue_map_desc.
> > If the region is not big enough to contain the whole buffer, the
> > device must keep looking for the rest of it.
> Yeah I see why you prefer working with HVA tree even with overlapping
> ranges, as the current API virtqueue_map_desc() that returns the HVA
> already wraps up the translation internals well for e.g. when span over
> different IOMMU.  Are you worry with the vIOMMU case where the GPA is no
> longer cached in the virtqueue elem? Maybe we can add also that
> information to the elem even for vIOMMU (we can defer doing it until we
> add the vIOMMU support to SVQ), so that SVQ can just look up the GPA
> tree directly in the translation path?

I think that IOVA should just replace GPA in the tree, shouldn't it?
Or am I missing something?

So the user of the IOVA tree (vhost-vdpa.c) should be slightly changed
but there is no change required for SVQ or IOVATree as far as I know.

> >
> >> As we are talking about API we may want to build it in a way generic
> >> enough to address all possible needs (which goes with what memory
> >> subsystem is capable of), rather than just look on the current usage
> >> which has kind of narrow scope. Although virtio-net device doesn't work
> >> with aliased region now, some other virtio device may do, or maybe some
> >> day virtio-net would need to use aliased region than the API and the
> >> users (SVQ) would have to go with another round of significant
> >> refactoring due to the iova-tree internal working. I feel it's just too
> >> early or too tight to abstract the iova-tree layer and get the API
> >> customized for the current use case with a lot of limitations on how
> >> user should expect to use it. We need some more flexibility and ease on
> >> extensibility if we want to take the chance to get it rewritten, given
> >> it is not a lot of code that Jonah had showed here ..
> >>
> > Let me know if they are addressed here. Sorry if I didn't explain it
> > well, but I'm not trying to limit the alias or to handle just a subset
> > of them. I'm trying to delegate the handling of these to the device as
> > much as possible, as the device already needs to handle them and the
> > less we complicate the QEMU solution, the better. Of course, the IOVA
> > tree is a very self-contained area easy to rewrite in theory, but with
> > potential future users it might get complicated.
> Sure, understood. I just want to compare the Pros and Cons for each
> candidate, so that Jonah won't spend quite a lot of time to come up with
> complicated code, then soon find out all or most of them have to be
> thrown away, due to short sighted design which is unable to cope with
> foreseeable future use cases.
>
> Thanks,
> -Siwei
>
> >
> >>> The only problem that comes to my mind is the case where the inner
> >>> region is RO
> >> Yes, this is one of examples around the permission or size I mentioned
> >> above, which may have a conflict view with the memory system or the kernel.
> >>
> >> Thanks,
> >> -Siwei
> >>
> >>> and it is a write command, but I don't think we have this
> >>> case in a sane guest. A malicious guest cannot do any harm this way
> >>> anyway.
> >>>
> >>>> Do we still assume some overlapping order so we
> >>>> always return the first match from the tree? Or we expect every current
> >>>> user of iova_tree_find_iova should pass in GPA rather than HVA and use
> >>>> the vhost_iova_xxx_gpa API variant to look up IOVA?
> >>>>
> >>> No, iova_tree_find_iova should keep asking for vaddr, as the result is
> >>> guaranteed to be there. Users of VhostIOVATree only need to modify how
> >>> they add or remove regions, knowing if they come from the guest or
> >>> not. As shown by this series, it is easier to do in that place than in
> >>> translation.
> >>>
> >>>> Thanks,
> >>>> -Siwei
> >>>>
> >>>>> user does not need to know if the address is from the guest or only
> >>>>> exists in QEMU by using RAMBlock etc. All insert and remove functions
> >>>>> use this tree.
> >>>>> * A new tree that relates IOVA to GPA, that only
> >>>>> vhost_iova_tree_map_alloc_gpa and vhost_iova_tree_remove_gpa uses.
> >>>>>
> >>>>> The ideal case is that the key in this new tree is the GPA and the
> >>>>> value is the IOVA. But IOVATree's DMA is named the reverse: iova is
> >>>>> the key and translated_addr is the vaddr. We can create a new tree
> >>>>> struct for that, use GTree directly, or translate the reverse
> >>>>> linearly. As memory add / remove should not be frequent, I think the
> >>>>> simpler is the last one, but I'd be ok with creating a new tree.
> >>>>>
> >>>>> vhost_iova_tree_map_alloc_gpa needs to add the map to this new tree
> >>>>> also. Similarly, vhost_iova_tree_remove_gpa must look for the GPA in
> >>>>> this tree, and only remove the associated DMAMap in iova_taddr_map
> >>>>> that matches the IOVA.
> >>>>>
> >>>>> Does it make sense to you?
> >>>>>
> >>>>>> Signed-off-by: Jonah Palmer <jonah.palmer@oracle.com>
> >>>>>> ---
> >>>>>>     hw/virtio/vhost-iova-tree.c | 38 ++++++++++++++++++++++++++++++++-----
> >>>>>>     hw/virtio/vhost-iova-tree.h |  1 +
> >>>>>>     hw/virtio/vhost-vdpa.c      | 31 ++++++++++++++++++++++++------
> >>>>>>     net/vhost-vdpa.c            | 13 +++++++++++--
> >>>>>>     4 files changed, 70 insertions(+), 13 deletions(-)
> >>>>>>
> >>>>>> diff --git a/hw/virtio/vhost-iova-tree.c b/hw/virtio/vhost-iova-tree.c
> >>>>>> index 3d03395a77..32c03db2f5 100644
> >>>>>> --- a/hw/virtio/vhost-iova-tree.c
> >>>>>> +++ b/hw/virtio/vhost-iova-tree.c
> >>>>>> @@ -28,12 +28,17 @@ struct VhostIOVATree {
> >>>>>>
> >>>>>>         /* IOVA address to qemu memory maps. */
> >>>>>>         IOVATree *iova_taddr_map;
> >>>>>> +
> >>>>>> +    /* IOVA tree (IOVA allocator) */
> >>>>>> +    IOVATree *iova_map;
> >>>>>>     };
> >>>>>>
> >>>>>>     /**
> >>>>>> - * Create a new IOVA tree
> >>>>>> + * Create a new VhostIOVATree with a new set of IOVATree's:
> >>>>> s/IOVA tree/VhostIOVATree/ is good, but I think the rest is more an
> >>>>> implementation detail.
> >>>>>
> >>>>>> + * - IOVA allocator (iova_map)
> >>>>>> + * - IOVA->HVA tree (iova_taddr_map)
> >>>>>>      *
> >>>>>> - * Returns the new IOVA tree
> >>>>>> + * Returns the new VhostIOVATree
> >>>>>>      */
> >>>>>>     VhostIOVATree *vhost_iova_tree_new(hwaddr iova_first, hwaddr iova_last)
> >>>>>>     {
> >>>>>> @@ -44,6 +49,7 @@ VhostIOVATree *vhost_iova_tree_new(hwaddr iova_first, hwaddr iova_last)
> >>>>>>         tree->iova_last = iova_last;
> >>>>>>
> >>>>>>         tree->iova_taddr_map = iova_tree_new();
> >>>>>> +    tree->iova_map = iova_tree_new();
> >>>>>>         return tree;
> >>>>>>     }
> >>>>>>
> >>>>>> @@ -53,6 +59,7 @@ VhostIOVATree *vhost_iova_tree_new(hwaddr iova_first, hwaddr iova_last)
> >>>>>>     void vhost_iova_tree_delete(VhostIOVATree *iova_tree)
> >>>>>>     {
> >>>>>>         iova_tree_destroy(iova_tree->iova_taddr_map);
> >>>>>> +    iova_tree_destroy(iova_tree->iova_map);
> >>>>>>         g_free(iova_tree);
> >>>>>>     }
> >>>>>>
> >>>>>> @@ -88,13 +95,12 @@ int vhost_iova_tree_map_alloc(VhostIOVATree *tree, DMAMap *map)
> >>>>>>         /* Some vhost devices do not like addr 0. Skip first page */
> >>>>>>         hwaddr iova_first = tree->iova_first ?: qemu_real_host_page_size();
> >>>>>>
> >>>>>> -    if (map->translated_addr + map->size < map->translated_addr ||
> >>>>> Why remove this condition? If the request is invalid we still need to
> >>>>> return an error here.
> >>>>>
> >>>>> Maybe we should move it to iova_tree_alloc_map though.
> >>>>>
> >>>>>> -        map->perm == IOMMU_NONE) {
> >>>>>> +    if (map->perm == IOMMU_NONE) {
> >>>>>>             return IOVA_ERR_INVALID;
> >>>>>>         }
> >>>>>>
> >>>>>>         /* Allocate a node in IOVA address */
> >>>>>> -    return iova_tree_alloc_map(tree->iova_taddr_map, map, iova_first,
> >>>>>> +    return iova_tree_alloc_map(tree->iova_map, map, iova_first,
> >>>>>>                                    tree->iova_last);
> >>>>>>     }
> >>>>>>
> >>>>>> @@ -107,4 +113,26 @@ int vhost_iova_tree_map_alloc(VhostIOVATree *tree, DMAMap *map)
> >>>>>>     void vhost_iova_tree_remove(VhostIOVATree *iova_tree, DMAMap map)
> >>>>>>     {
> >>>>>>         iova_tree_remove(iova_tree->iova_taddr_map, map);
> >>>>>> +    iova_tree_remove(iova_tree->iova_map, map);
> >>>>>> +}
> >>>>>> +
> >>>>>> +/**
> >>>>>> + * Insert a new mapping to the IOVA->HVA tree
> >>>>>> + *
> >>>>>> + * @tree: The VhostIOVATree
> >>>>>> + * @map: The iova map
> >>>>>> + *
> >>>>>> + * Returns:
> >>>>>> + * - IOVA_OK if the map fits in the container
> >>>>>> + * - IOVA_ERR_INVALID if the map does not make sense (like size overflow)
> >>>>>> + * - IOVA_ERR_OVERLAP if the IOVA range overlaps with an existing range
> >>>>>> + */
> >>>>>> +int vhost_iova_tree_insert(VhostIOVATree *iova_tree, DMAMap *map)
> >>>>>> +{
> >>>>>> +    if (map->translated_addr + map->size < map->translated_addr ||
> >>>>>> +        map->perm == IOMMU_NONE) {
> >>>>>> +        return IOVA_ERR_INVALID;
> >>>>>> +    }
> >>>>>> +
> >>>>>> +    return iova_tree_insert(iova_tree->iova_taddr_map, map);
> >>>>>>     }
> >>>>>> diff --git a/hw/virtio/vhost-iova-tree.h b/hw/virtio/vhost-iova-tree.h
> >>>>>> index 4adfd79ff0..8bf7b64786 100644
> >>>>>> --- a/hw/virtio/vhost-iova-tree.h
> >>>>>> +++ b/hw/virtio/vhost-iova-tree.h
> >>>>>> @@ -23,5 +23,6 @@ const DMAMap *vhost_iova_tree_find_iova(const VhostIOVATree *iova_tree,
> >>>>>>                                             const DMAMap *map);
> >>>>>>     int vhost_iova_tree_map_alloc(VhostIOVATree *iova_tree, DMAMap *map);
> >>>>>>     void vhost_iova_tree_remove(VhostIOVATree *iova_tree, DMAMap map);
> >>>>>> +int vhost_iova_tree_insert(VhostIOVATree *iova_tree, DMAMap *map);
> >>>>>>
> >>>>>>     #endif
> >>>>>> diff --git a/hw/virtio/vhost-vdpa.c b/hw/virtio/vhost-vdpa.c
> >>>>>> index 3cdaa12ed5..6702459065 100644
> >>>>>> --- a/hw/virtio/vhost-vdpa.c
> >>>>>> +++ b/hw/virtio/vhost-vdpa.c
> >>>>>> @@ -361,10 +361,10 @@ static void vhost_vdpa_listener_region_add(MemoryListener *listener,
> >>>>>>         if (s->shadow_data) {
> >>>>>>             int r;
> >>>>>>
> >>>>>> -        mem_region.translated_addr = (hwaddr)(uintptr_t)vaddr,
> >>>>>>             mem_region.size = int128_get64(llsize) - 1,
> >>>>>>             mem_region.perm = IOMMU_ACCESS_FLAG(true, section->readonly),
> >>>>>>
> >>>>>> +        /* Allocate IOVA range and add the mapping to the IOVA tree */
> >>>>>>             r = vhost_iova_tree_map_alloc(s->iova_tree, &mem_region);
> >>>>>>             if (unlikely(r != IOVA_OK)) {
> >>>>>>                 error_report("Can't allocate a mapping (%d)", r);
> >>>>>> @@ -372,6 +372,14 @@ static void vhost_vdpa_listener_region_add(MemoryListener *listener,
> >>>>>>             }
> >>>>>>
> >>>>>>             iova = mem_region.iova;
> >>>>>> +
> >>>>>> +        /* Add mapping to the IOVA->HVA tree */
> >>>>>> +        mem_region.translated_addr = (hwaddr)(uintptr_t)vaddr;
> >>>>>> +        r = vhost_iova_tree_insert(s->iova_tree, &mem_region);
> >>>>>> +        if (unlikely(r != IOVA_OK)) {
> >>>>>> +            error_report("Can't add listener region mapping (%d)", r);
> >>>>>> +            goto fail_map;
> >>>>>> +        }
> >>>>> I'd say it is not intuitive for the caller code.
> >>>>>
> >>>>>>         }
> >>>>>>
> >>>>>>         vhost_vdpa_iotlb_batch_begin_once(s);
> >>>>>> @@ -1142,19 +1150,30 @@ static void vhost_vdpa_svq_unmap_rings(struct vhost_dev *dev,
> >>>>>>      *
> >>>>>>      * @v: Vhost-vdpa device
> >>>>>>      * @needle: The area to search iova
> >>>>>> + * @taddr: The translated address (SVQ HVA)
> >>>>>>      * @errorp: Error pointer
> >>>>>>      */
> >>>>>>     static bool vhost_vdpa_svq_map_ring(struct vhost_vdpa *v, DMAMap *needle,
> >>>>>> -                                    Error **errp)
> >>>>>> +                                    hwaddr taddr, Error **errp)
> >>>>>>     {
> >>>>>>         int r;
> >>>>>>
> >>>>>> +    /* Allocate IOVA range and add the mapping to the IOVA tree */
> >>>>>>         r = vhost_iova_tree_map_alloc(v->shared->iova_tree, needle);
> >>>>>>         if (unlikely(r != IOVA_OK)) {
> >>>>>>             error_setg(errp, "Cannot allocate iova (%d)", r);
> >>>>>>             return false;
> >>>>>>         }
> >>>>>>
> >>>>>> +    /* Add mapping to the IOVA->HVA tree */
> >>>>>> +    needle->translated_addr = taddr;
> >>>>>> +    r = vhost_iova_tree_insert(v->shared->iova_tree, needle);
> >>>>>> +    if (unlikely(r != IOVA_OK)) {
> >>>>>> +        error_setg(errp, "Cannot add SVQ vring mapping (%d)", r);
> >>>>>> +        vhost_iova_tree_remove(v->shared->iova_tree, *needle);
> >>>>>> +        return false;
> >>>>>> +    }
> >>>>>> +
> >>>>>>         r = vhost_vdpa_dma_map(v->shared, v->address_space_id, needle->iova,
> >>>>>>                                needle->size + 1,
> >>>>>>                                (void *)(uintptr_t)needle->translated_addr,
> >>>>>> @@ -1192,11 +1211,11 @@ static bool vhost_vdpa_svq_map_rings(struct vhost_dev *dev,
> >>>>>>         vhost_svq_get_vring_addr(svq, &svq_addr);
> >>>>>>
> >>>>>>         driver_region = (DMAMap) {
> >>>>>> -        .translated_addr = svq_addr.desc_user_addr,
> >>>>>>             .size = driver_size - 1,
> >>>>>>             .perm = IOMMU_RO,
> >>>>>>         };
> >>>>>> -    ok = vhost_vdpa_svq_map_ring(v, &driver_region, errp);
> >>>>>> +    ok = vhost_vdpa_svq_map_ring(v, &driver_region, svq_addr.desc_user_addr,
> >>>>>> +                                 errp);
> >>>>>>         if (unlikely(!ok)) {
> >>>>>>             error_prepend(errp, "Cannot create vq driver region: ");
> >>>>>>             return false;
> >>>>>> @@ -1206,11 +1225,11 @@ static bool vhost_vdpa_svq_map_rings(struct vhost_dev *dev,
> >>>>>>         addr->avail_user_addr = driver_region.iova + avail_offset;
> >>>>>>
> >>>>>>         device_region = (DMAMap) {
> >>>>>> -        .translated_addr = svq_addr.used_user_addr,
> >>>>>>             .size = device_size - 1,
> >>>>>>             .perm = IOMMU_RW,
> >>>>>>         };
> >>>>>> -    ok = vhost_vdpa_svq_map_ring(v, &device_region, errp);
> >>>>>> +    ok = vhost_vdpa_svq_map_ring(v, &device_region, svq_addr.used_user_addr,
> >>>>>> +                                 errp);
> >>>>>>         if (unlikely(!ok)) {
> >>>>>>             error_prepend(errp, "Cannot create vq device region: ");
> >>>>>>             vhost_vdpa_svq_unmap_ring(v, driver_region.translated_addr);
> >>>>>> diff --git a/net/vhost-vdpa.c b/net/vhost-vdpa.c
> >>>>>> index 03457ead66..81da956b92 100644
> >>>>>> --- a/net/vhost-vdpa.c
> >>>>>> +++ b/net/vhost-vdpa.c
> >>>>>> @@ -512,15 +512,24 @@ static int vhost_vdpa_cvq_map_buf(struct vhost_vdpa *v, void *buf, size_t size,
> >>>>>>         DMAMap map = {};
> >>>>>>         int r;
> >>>>>>
> >>>>>> -    map.translated_addr = (hwaddr)(uintptr_t)buf;
> >>>>>>         map.size = size - 1;
> >>>>>>         map.perm = write ? IOMMU_RW : IOMMU_RO,
> >>>>>> +
> >>>>>> +    /* Allocate IOVA range and add the mapping to the IOVA tree */
> >>>>>>         r = vhost_iova_tree_map_alloc(v->shared->iova_tree, &map);
> >>>>>>         if (unlikely(r != IOVA_OK)) {
> >>>>>> -        error_report("Cannot map injected element");
> >>>>>> +        error_report("Cannot allocate IOVA range for injected element");
> >>>>>>             return r;
> >>>>>>         }
> >>>>>>
> >>>>>> +    /* Add mapping to the IOVA->HVA tree */
> >>>>>> +    map.translated_addr = (hwaddr)(uintptr_t)buf;
> >>>>>> +    r = vhost_iova_tree_insert(v->shared->iova_tree, &map);
> >>>>>> +    if (unlikely(r != IOVA_OK)) {
> >>>>>> +        error_report("Cannot map injected element into IOVA->HVA tree");
> >>>>>> +        goto dma_map_err;
> >>>>>> +    }
> >>>>>> +
> >>>>>>         r = vhost_vdpa_dma_map(v->shared, v->address_space_id, map.iova,
> >>>>>>                                vhost_vdpa_net_cvq_cmd_page_len(), buf, !write);
> >>>>>>         if (unlikely(r < 0)) {
> >>>>>> --
> >>>>>> 2.43.5
> >>>>>>
>
Si-Wei Liu Sept. 11, 2024, 9:06 a.m. UTC | #10
On 9/9/2024 11:22 PM, Eugenio Perez Martin wrote:
> On Tue, Sep 10, 2024 at 7:30 AM Si-Wei Liu <si-wei.liu@oracle.com> wrote:
>> Sorry for the delayed response, it seems I missed the email reply for
>> some reason during the long weekend.
>>
>> On 9/2/2024 4:09 AM, Eugenio Perez Martin wrote:
>>> On Fri, Aug 30, 2024 at 11:05 PM Si-Wei Liu <si-wei.liu@oracle.com> wrote:
>>>>
>>>> On 8/30/2024 1:05 AM, Eugenio Perez Martin wrote:
>>>>> On Fri, Aug 30, 2024 at 6:20 AM Si-Wei Liu <si-wei.liu@oracle.com> wrote:
>>>>>> On 8/29/2024 9:53 AM, Eugenio Perez Martin wrote:
>>>>>>> On Wed, Aug 21, 2024 at 2:56 PM Jonah Palmer <jonah.palmer@oracle.com> wrote:
>>>>>>>> Decouples the IOVA allocator from the IOVA->HVA tree and instead adds
>>>>>>>> the allocated IOVA range to an IOVA-only tree (iova_map). This IOVA tree
>>>>>>>> will hold all IOVA ranges that have been allocated (e.g. in the
>>>>>>>> IOVA->HVA tree) and are removed when any IOVA ranges are deallocated.
>>>>>>>>
>>>>>>>> A new API function vhost_iova_tree_insert() is also created to add a
>>>>>>>> IOVA->HVA mapping into the IOVA->HVA tree.
>>>>>>>>
>>>>>>> I think this is a good first iteration but we can take steps to
>>>>>>> simplify it. Also, it is great to be able to make points on real code
>>>>>>> instead of designs on the air :).
>>>>>>>
>>>>>>> I expected a split of vhost_iova_tree_map_alloc between the current
>>>>>>> vhost_iova_tree_map_alloc and vhost_iova_tree_map_alloc_gpa, or
>>>>>>> similar. Similarly, a vhost_iova_tree_remove and
>>>>>>> vhost_iova_tree_remove_gpa would be needed.
>>>>>>>
>>>>>>> The first one is used for regions that don't exist in the guest, like
>>>>>>> SVQ vrings or CVQ buffers. The second one is the one used by the
>>>>>>> memory listener to map the guest regions into the vdpa device.
>>>>>>>
>>>>>>> Implementation wise, only two trees are actually needed:
>>>>>>> * Current iova_taddr_map that contains all IOVA->vaddr translations as
>>>>>>> seen by the device, so both allocation functions can work on a single
>>>>>>> tree. The function iova_tree_find_iova keeps using this one, so the
>>>>>> I thought we had thorough discussion about this and agreed upon the
>>>>>> decoupled IOVA allocator solution.
>>>>> My interpretation of it is to leave the allocator as the current one,
>>>>> and create a new tree with GPA which is guaranteed to be unique. But
>>>>> we can talk over it of course.
>>>>>
>>>>>> But maybe I missed something earlier,
>>>>>> I am not clear how come this iova_tree_find_iova function could still
>>>>>> work with the full IOVA-> HVA tree when it comes to aliased memory or
>>>>>> overlapped HVAs? Granted, for the memory map removal in the
>>>>>> .region_del() path, we could rely on the GPA tree to locate the
>>>>>> corresponding IOVA, but how come the translation path could figure out
>>>>>> which IOVA range to return when the vaddr happens to fall in an
>>>>>> overlapped HVA range?
>>>>> That is not a problem, as they both translate to the same address at the device.
>>>> Not sure I followed, it might return a wrong IOVA (range) which the host
>>>> kernel may have conflict or unmatched attribute i.e. permission, size et
>>>> al in the map.
>>>>
>>> Let's leave out the permissions at the moment. I'm going to use the
>>> example you found, but I'll reorder (1) and (3) insertions so it picks
>>> the "wrong" IOVA range intentionally:
>>>
>>> (1)
>>> HVA: [0x7f7903ea0000, 0x7f7903ec0000)
>>> GPA: [0xfeda0000, 0xfedc0000)
>>> IOVA: [0x1000, 0x21000)
>>>
>>> (2)
>>> HVA: [0x7f7983e00000, 0x7f9903e00000)
>>> GPA: [0x100000000, 0x2080000000)
>>> IOVA: [0x80001000, 0x2000001000)
>>>
>>> (3)
>>> HVA: [0x7f7903e00000, 0x7f7983e00000)
>>> GPA: [0x0, 0x80000000)
>>> IOVA: [0x2000001000, 0x2080000000)
>>>
>>> Let's say that SVQ wants to translate the HVA range
>>> 0xfeda0000-0xfedd0000. So it makes available for the device two
>>> chained buffers: One with addr=0x1000 len=0x20000 and the other one
>>> with addr=(0x20000c1000 len=0x10000).
>>>
>>> The VirtIO device should be able to translate these two buffers in
>>> isolation and chain them. Not optimal but it helps to keep QEMU source
>>> clean, as the device already must support it. I don't foresee lots of
>>> cases like this anyway :).
>> Hmmm, this scheme will only work if the aliased map doesn't go away
>> immediately. If the BQL is not held or an unmap is to be done out of RCU
>> critical section, it's pretty dangerous to assume we can be always fine
>> to work with the other overlapped regions.
> But all the updates and reads are done in the critical sections, and
> will be that way in the future too. That's the reason why the iova
> tree does not have any mutex or equivalent.
Right. That's how the SVQ translation API is currently being used for
sure. It's always protected in a critical section under the BQL in a
synchronized context. But if we want to support a future use case, like
caching the translation result somewhere and using it in an async
context, this will be problematic. There's no way to correct the cached
IOVA once the associated aliased map goes away, regardless of
protection by a mutex, the BQL or something similar.
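The staleness hazard being described here can be shown with a tiny
sketch (hypothetical types, not QEMU code): once the aliased map is
torn down, a translation that was cached outside the critical section
no longer resolves to anything.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Toy IOVA table; illustrative only. */
typedef struct { uint64_t iova, len; bool live; } Entry;

static Entry table[] = {
    { 0x1000ULL,       0x20000ULL,    true }, /* aliased inner region */
    { 0x2000001000ULL, 0x80000000ULL, true }, /* outer region */
};

/* Check whether an IOVA still falls inside a live mapping. */
static bool iova_is_mapped(const Entry *t, size_t n, uint64_t iova)
{
    for (size_t i = 0; i < n; i++) {
        if (t[i].live && iova >= t[i].iova && iova < t[i].iova + t[i].len) {
            return true;
        }
    }
    return false;
}
```

Caching an IOVA that resolved into the inner region, then removing that
region (as .region_del() would), leaves the cached value dangling with
no way to correct it after the fact.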
>
> If we take out SVQ from BQL, we will need to protect the update of it
> with something, either a mutex or similar. But that's already part of
> the plan, even without my proposal or if we implement this RFC the way
> it is.
Yep, I know we currently don't have such a use case, and we may never
actually have to support or even trip over those edge cases in the
future. The point I wanted to make is that this full-HVA-tree-based
translation path is tightly coupled with how SVQ is now supposed to
work, while departing too much from the rest of the memory subsystem.
Not saying it's not okay to go this way, though you may already be aware
that with this abstraction there'll be a loss of generality and
consistency with the memory system's view, which would need duplicative
work, like the permission check, to reproduce well-established
functionality already built into the memory system itself. Compared to
the other IOTLB implementations or similar memory translation APIs in
QEMU, limitations apply to where and how the API should be used.

>
>> In my opinion if we have to
>> work with the HVA tree, it'd be ideal for the caller(s) to get more aid
>> in helping figure out the right IOVA range in match rather than for SVQ
>> or vhost-iova-tree to blindly pick a random one or break up contiguous
>> range into pieces (in an inconsistent and unnecessary way).
> The point is that SVQ already needs to work like that, as it needs to
> support that a contiguous GPA range is split between different HVAs.
> Moreover, this is not a part of SVQ, but of VirtQueue. Both should be
> extremely rare cases. Is it worth it to complicate / penalize the
> general case solution to benefit this weird case, which is supported?
Sure, not saying SVQ shouldn't support a GPA range split between
different HVAs. I guess what I meant was that the returned IOVA will
likely not match the memory system's view, which is kind of weird. For
example, the IOVA returned from the translation API can't be used to
infer the GPA via an internal tree lookup; we still have to resort to
another external lookup via the memory system API. From the looks of it,
the abstraction layer appears to be self-contained, but actually there
are quite some odd assumptions here and there that may in turn prohibit
possible future use cases.

Given that this vhost-iova-tree abstraction can only work with the
current assumption, or the current limited usage in the SVQ code, I feel
the abstraction might need a bit more time to evolve to a point where,
with a feature-rich SVQ implementation, we can gain more confidence and
conclude it's the right time to abstract something up. For now I just
mentally equate SVQ with vhost-iova-tree, both in concept and layering,
so I have to admit I favor Jonah's sparse implementation with a
decoupled GPA tree plus a partial HVA tree for the SVQ - it doesn't seem
to lose any generality for future extension. Do you think it's possible
to start from this intuitive implementation and gradually evolve it with
future use cases?

>> This would
>> require a bit extensive changes to all the callers to pass in more
>> information though, like the GPA, or the RAMBlock/MemoryRegionSection
>> backing the relevant guest memory, along with the offset.
>>
>>> About the permissions, maybe we can make the permissions to be part of
>>> the lookup? Instead of returning them at iova_tree_find_iova, make
>>> them match at iova_tree_find_address_iterator.
>> Yes, if there's no easy way out we have to add this extra info to the
>> HVA tree and make the lookup routine even complex (or duplicative).
>>
> Right.
>
>>>>> The most complicated situation is where we have a region contained in
>>>>> another region, and the requested buffer crosses them. If the IOVA
>>>>> tree returns the inner region, it will return the buffer chained with
>>>>> the rest of the content in the outer region. Not optimal, but solved
>>>>> either way.
>>>> Don't quite understand what it means... So in this overlapping case,
>>>> speaking of the expectation of the translation API, you would like to
>>>> have all IOVA ranges that match the overlapped HVA to be returned? And
>>>> then to rely on the user (caller) to figure out which one is correct?
>>>> Wouldn't it be easier for the user (SVQ) to use the memory system API
>>>> directly to figure out?
>>>>
>>> All of them are correct in the translation path. The device should be
>>> able to work with a buffer that spans over different IOTLB too. You
>>> can see how QEMU handles it at hw/virtio/virtio.c:virtqueue_map_desc.
>>> If the region is not big enough to contain the whole buffer, the
>>> device must keep looking for the rest of it.
>> Yeah I see why you prefer working with HVA tree even with overlapping
>> ranges, as the current API virtqueue_map_desc() that returns the HVA
>> already wraps up the translation internals well for e.g. when span over
>> different IOMMU.  Are you worry with the vIOMMU case where the GPA is no
>> longer cached in the virtqueue elem? Maybe we can add also that
>> information to the elem even for vIOMMU (we can defer doing it until we
>> add the vIOMMU support to SVQ), so that SVQ can just look up the GPA
>> tree directly in the translation path?
> I think that IOVA should just replace GPA in the tree, isn't it? Or am
> I missing something?
Yeah, I mean that's the advantage of the full HVA tree solution: given
that the virtio device model's virtqueue API virtqueue_map_desc()
returns an HVA rather than a GPA, when vIOMMU support is added, SVQ
translation code can still work with the returned HVA to translate back
to IOVA as-is. All the vIOMMU translation will be handled transparently
in virtqueue_map_desc() itself.

Thanks,
-Siwei
>
> So the user of the IOVA tree (vhost-vdpa.c) should be slightly changed
> but there is no change required for SVQ or IOVATree as far as I know.
>
>>>> As we are talking about API we may want to build it in a way generic
>>>> enough to address all possible needs (which goes with what memory
>>>> subsystem is capable of), rather than just look on the current usage
>>>> which has kind of narrow scope. Although virtio-net device doesn't work
>>>> with aliased region now, some other virtio device may do, or maybe some
>>>> day virtio-net would need to use aliased region than the API and the
>>>> users (SVQ) would have to go with another round of significant
>>>> refactoring due to the iova-tree internal working. I feel it's just too
>>>> early or too tight to abstract the iova-tree layer and get the API
>>>> customized for the current use case with a lot of limitations on how
>>>> user should expect to use it. We need some more flexibility and ease on
>>>> extensibility if we want to take the chance to get it rewritten, given
>>>> it is not a lot of code that Jonah had showed here ..
>>>>
>>> Let me know if they are addressed here. Sorry if I didn't explain it
>>> well, but I'm not trying to limit the alias or to handle just a subset
>>> of them. I'm trying to delegate the handling of these to the device as
>>> much as possible, as the device already needs to handle them and the
>>> less we complicate the QEMU solution, the better. Of course, the IOVA
>>> tree is a very self-contained area easy to rewrite in theory, but with
>>> potential future users it might get complicated.
>> Sure, understood. I just want to compare the Pros and Cons for each
>> candidate, so that Jonah won't spend quite a lot of time to come up with
>> complicated code, then soon find out all or most of them have to be
>> thrown away, due to short sighted design which is unable to cope with
>> foreseeable future use cases.
>>
>> Thanks,
>> -Siwei
>>
>>>>> The only problem that comes to my mind is the case where the inner
>>>>> region is RO
>>>> Yes, this is one of examples around the permission or size I mentioned
>>>> above, which may have a conflict view with the memory system or the kernel.
>>>>
>>>> Thanks,
>>>> -Siwei
>>>>
>>>>> and it is a write command, but I don't think we have this
>>>>> case in a sane guest. A malicious guest cannot do any harm this way
>>>>> anyway.
>>>>>
>>>>>> Do we still assume some overlapping order so we
>>>>>> always return the first match from the tree? Or we expect every current
>>>>>> user of iova_tree_find_iova should pass in GPA rather than HVA and use
>>>>>> the vhost_iova_xxx_gpa API variant to look up IOVA?
>>>>>>
>>>>> No, iova_tree_find_iova should keep asking for vaddr, as the result is
>>>>> guaranteed to be there. Users of VhostIOVATree only need to modify how
>>>>> they add or remove regions, knowing if they come from the guest or
>>>>> not. As shown by this series, it is easier to do in that place than in
>>>>> translation.
>>>>>
>>>>>> Thanks,
>>>>>> -Siwei
>>>>>>
>>>>>>> user does not need to know if the address is from the guest or only
>>>>>>> exists in QEMU by using RAMBlock etc. All insert and remove functions
>>>>>>> use this tree.
>>>>>>> * A new tree that relates IOVA to GPA, that only
>>>>>>> vhost_iova_tree_map_alloc_gpa and vhost_iova_tree_remove_gpa uses.
>>>>>>>
>>>>>>> The ideal case is that the key in this new tree is the GPA and the
>>>>>>> value is the IOVA. But IOVATree's DMA is named the reverse: iova is
>>>>>>> the key and translated_addr is the vaddr. We can create a new tree
>>>>>>> struct for that, use GTree directly, or translate the reverse
>>>>>>> linearly. As memory add / remove should not be frequent, I think the
>>>>>>> simpler is the last one, but I'd be ok with creating a new tree.
>>>>>>>
>>>>>>> vhost_iova_tree_map_alloc_gpa needs to add the map to this new tree
>>>>>>> also. Similarly, vhost_iova_tree_remove_gpa must look for the GPA in
>>>>>>> this tree, and only remove the associated DMAMap in iova_taddr_map
>>>>>>> that matches the IOVA.
>>>>>>>
>>>>>>> Does it make sense to you?
>>>>>>>
>>>>>>>> Signed-off-by: Jonah Palmer <jonah.palmer@oracle.com>
>>>>>>>> ---
>>>>>>>>      hw/virtio/vhost-iova-tree.c | 38 ++++++++++++++++++++++++++++++++-----
>>>>>>>>      hw/virtio/vhost-iova-tree.h |  1 +
>>>>>>>>      hw/virtio/vhost-vdpa.c      | 31 ++++++++++++++++++++++++------
>>>>>>>>      net/vhost-vdpa.c            | 13 +++++++++++--
>>>>>>>>      4 files changed, 70 insertions(+), 13 deletions(-)
>>>>>>>>
>>>>>>>> diff --git a/hw/virtio/vhost-iova-tree.c b/hw/virtio/vhost-iova-tree.c
>>>>>>>> index 3d03395a77..32c03db2f5 100644
>>>>>>>> --- a/hw/virtio/vhost-iova-tree.c
>>>>>>>> +++ b/hw/virtio/vhost-iova-tree.c
>>>>>>>> @@ -28,12 +28,17 @@ struct VhostIOVATree {
>>>>>>>>
>>>>>>>>          /* IOVA address to qemu memory maps. */
>>>>>>>>          IOVATree *iova_taddr_map;
>>>>>>>> +
>>>>>>>> +    /* IOVA tree (IOVA allocator) */
>>>>>>>> +    IOVATree *iova_map;
>>>>>>>>      };
>>>>>>>>
>>>>>>>>      /**
>>>>>>>> - * Create a new IOVA tree
>>>>>>>> + * Create a new VhostIOVATree with a new set of IOVATree's:
>>>>>>> s/IOVA tree/VhostIOVATree/ is good, but I think the rest is more an
>>>>>>> implementation detail.
>>>>>>>
>>>>>>>> + * - IOVA allocator (iova_map)
>>>>>>>> + * - IOVA->HVA tree (iova_taddr_map)
>>>>>>>>       *
>>>>>>>> - * Returns the new IOVA tree
>>>>>>>> + * Returns the new VhostIOVATree
>>>>>>>>       */
>>>>>>>>      VhostIOVATree *vhost_iova_tree_new(hwaddr iova_first, hwaddr iova_last)
>>>>>>>>      {
>>>>>>>> @@ -44,6 +49,7 @@ VhostIOVATree *vhost_iova_tree_new(hwaddr iova_first, hwaddr iova_last)
>>>>>>>>          tree->iova_last = iova_last;
>>>>>>>>
>>>>>>>>          tree->iova_taddr_map = iova_tree_new();
>>>>>>>> +    tree->iova_map = iova_tree_new();
>>>>>>>>          return tree;
>>>>>>>>      }
>>>>>>>>
>>>>>>>> @@ -53,6 +59,7 @@ VhostIOVATree *vhost_iova_tree_new(hwaddr iova_first, hwaddr iova_last)
>>>>>>>>      void vhost_iova_tree_delete(VhostIOVATree *iova_tree)
>>>>>>>>      {
>>>>>>>>          iova_tree_destroy(iova_tree->iova_taddr_map);
>>>>>>>> +    iova_tree_destroy(iova_tree->iova_map);
>>>>>>>>          g_free(iova_tree);
>>>>>>>>      }
>>>>>>>>
>>>>>>>> @@ -88,13 +95,12 @@ int vhost_iova_tree_map_alloc(VhostIOVATree *tree, DMAMap *map)
>>>>>>>>          /* Some vhost devices do not like addr 0. Skip first page */
>>>>>>>>          hwaddr iova_first = tree->iova_first ?: qemu_real_host_page_size();
>>>>>>>>
>>>>>>>> -    if (map->translated_addr + map->size < map->translated_addr ||
>>>>>>> Why remove this condition? If the request is invalid we still need to
>>>>>>> return an error here.
>>>>>>>
>>>>>>> Maybe we should move it to iova_tree_alloc_map though.
>>>>>>>
>>>>>>>> -        map->perm == IOMMU_NONE) {
>>>>>>>> +    if (map->perm == IOMMU_NONE) {
>>>>>>>>              return IOVA_ERR_INVALID;
>>>>>>>>          }
>>>>>>>>
>>>>>>>>          /* Allocate a node in IOVA address */
>>>>>>>> -    return iova_tree_alloc_map(tree->iova_taddr_map, map, iova_first,
>>>>>>>> +    return iova_tree_alloc_map(tree->iova_map, map, iova_first,
>>>>>>>>                                     tree->iova_last);
>>>>>>>>      }
>>>>>>>>
>>>>>>>> @@ -107,4 +113,26 @@ int vhost_iova_tree_map_alloc(VhostIOVATree *tree, DMAMap *map)
>>>>>>>>      void vhost_iova_tree_remove(VhostIOVATree *iova_tree, DMAMap map)
>>>>>>>>      {
>>>>>>>>          iova_tree_remove(iova_tree->iova_taddr_map, map);
>>>>>>>> +    iova_tree_remove(iova_tree->iova_map, map);
>>>>>>>> +}
>>>>>>>> +
>>>>>>>> +/**
>>>>>>>> + * Insert a new mapping to the IOVA->HVA tree
>>>>>>>> + *
>>>>>>>> + * @tree: The VhostIOVATree
>>>>>>>> + * @map: The iova map
>>>>>>>> + *
>>>>>>>> + * Returns:
>>>>>>>> + * - IOVA_OK if the map fits in the container
>>>>>>>> + * - IOVA_ERR_INVALID if the map does not make sense (like size overflow)
>>>>>>>> + * - IOVA_ERR_OVERLAP if the IOVA range overlaps with an existing range
>>>>>>>> + */
>>>>>>>> +int vhost_iova_tree_insert(VhostIOVATree *iova_tree, DMAMap *map)
>>>>>>>> +{
>>>>>>>> +    if (map->translated_addr + map->size < map->translated_addr ||
>>>>>>>> +        map->perm == IOMMU_NONE) {
>>>>>>>> +        return IOVA_ERR_INVALID;
>>>>>>>> +    }
>>>>>>>> +
>>>>>>>> +    return iova_tree_insert(iova_tree->iova_taddr_map, map);
>>>>>>>>      }
>>>>>>>> diff --git a/hw/virtio/vhost-iova-tree.h b/hw/virtio/vhost-iova-tree.h
>>>>>>>> index 4adfd79ff0..8bf7b64786 100644
>>>>>>>> --- a/hw/virtio/vhost-iova-tree.h
>>>>>>>> +++ b/hw/virtio/vhost-iova-tree.h
>>>>>>>> @@ -23,5 +23,6 @@ const DMAMap *vhost_iova_tree_find_iova(const VhostIOVATree *iova_tree,
>>>>>>>>                                              const DMAMap *map);
>>>>>>>>      int vhost_iova_tree_map_alloc(VhostIOVATree *iova_tree, DMAMap *map);
>>>>>>>>      void vhost_iova_tree_remove(VhostIOVATree *iova_tree, DMAMap map);
>>>>>>>> +int vhost_iova_tree_insert(VhostIOVATree *iova_tree, DMAMap *map);
>>>>>>>>
>>>>>>>>      #endif
>>>>>>>> diff --git a/hw/virtio/vhost-vdpa.c b/hw/virtio/vhost-vdpa.c
>>>>>>>> index 3cdaa12ed5..6702459065 100644
>>>>>>>> --- a/hw/virtio/vhost-vdpa.c
>>>>>>>> +++ b/hw/virtio/vhost-vdpa.c
>>>>>>>> @@ -361,10 +361,10 @@ static void vhost_vdpa_listener_region_add(MemoryListener *listener,
>>>>>>>>          if (s->shadow_data) {
>>>>>>>>              int r;
>>>>>>>>
>>>>>>>> -        mem_region.translated_addr = (hwaddr)(uintptr_t)vaddr,
>>>>>>>>              mem_region.size = int128_get64(llsize) - 1,
>>>>>>>>              mem_region.perm = IOMMU_ACCESS_FLAG(true, section->readonly),
>>>>>>>>
>>>>>>>> +        /* Allocate IOVA range and add the mapping to the IOVA tree */
>>>>>>>>              r = vhost_iova_tree_map_alloc(s->iova_tree, &mem_region);
>>>>>>>>              if (unlikely(r != IOVA_OK)) {
>>>>>>>>                  error_report("Can't allocate a mapping (%d)", r);
>>>>>>>> @@ -372,6 +372,14 @@ static void vhost_vdpa_listener_region_add(MemoryListener *listener,
>>>>>>>>              }
>>>>>>>>
>>>>>>>>              iova = mem_region.iova;
>>>>>>>> +
>>>>>>>> +        /* Add mapping to the IOVA->HVA tree */
>>>>>>>> +        mem_region.translated_addr = (hwaddr)(uintptr_t)vaddr;
>>>>>>>> +        r = vhost_iova_tree_insert(s->iova_tree, &mem_region);
>>>>>>>> +        if (unlikely(r != IOVA_OK)) {
>>>>>>>> +            error_report("Can't add listener region mapping (%d)", r);
>>>>>>>> +            goto fail_map;
>>>>>>>> +        }
>>>>>>> I'd say it is not intuitive for the caller code.
>>>>>>>
>>>>>>>>          }
>>>>>>>>
>>>>>>>>          vhost_vdpa_iotlb_batch_begin_once(s);
>>>>>>>> @@ -1142,19 +1150,30 @@ static void vhost_vdpa_svq_unmap_rings(struct vhost_dev *dev,
>>>>>>>>       *
>>>>>>>>       * @v: Vhost-vdpa device
>>>>>>>>       * @needle: The area to search iova
>>>>>>>> + * @taddr: The translated address (SVQ HVA)
>>>>>>>>       * @errorp: Error pointer
>>>>>>>>       */
>>>>>>>>      static bool vhost_vdpa_svq_map_ring(struct vhost_vdpa *v, DMAMap *needle,
>>>>>>>> -                                    Error **errp)
>>>>>>>> +                                    hwaddr taddr, Error **errp)
>>>>>>>>      {
>>>>>>>>          int r;
>>>>>>>>
>>>>>>>> +    /* Allocate IOVA range and add the mapping to the IOVA tree */
>>>>>>>>          r = vhost_iova_tree_map_alloc(v->shared->iova_tree, needle);
>>>>>>>>          if (unlikely(r != IOVA_OK)) {
>>>>>>>>              error_setg(errp, "Cannot allocate iova (%d)", r);
>>>>>>>>              return false;
>>>>>>>>          }
>>>>>>>>
>>>>>>>> +    /* Add mapping to the IOVA->HVA tree */
>>>>>>>> +    needle->translated_addr = taddr;
>>>>>>>> +    r = vhost_iova_tree_insert(v->shared->iova_tree, needle);
>>>>>>>> +    if (unlikely(r != IOVA_OK)) {
>>>>>>>> +        error_setg(errp, "Cannot add SVQ vring mapping (%d)", r);
>>>>>>>> +        vhost_iova_tree_remove(v->shared->iova_tree, *needle);
>>>>>>>> +        return false;
>>>>>>>> +    }
>>>>>>>> +
>>>>>>>>          r = vhost_vdpa_dma_map(v->shared, v->address_space_id, needle->iova,
>>>>>>>>                                 needle->size + 1,
>>>>>>>>                                 (void *)(uintptr_t)needle->translated_addr,
>>>>>>>> @@ -1192,11 +1211,11 @@ static bool vhost_vdpa_svq_map_rings(struct vhost_dev *dev,
>>>>>>>>          vhost_svq_get_vring_addr(svq, &svq_addr);
>>>>>>>>
>>>>>>>>          driver_region = (DMAMap) {
>>>>>>>> -        .translated_addr = svq_addr.desc_user_addr,
>>>>>>>>              .size = driver_size - 1,
>>>>>>>>              .perm = IOMMU_RO,
>>>>>>>>          };
>>>>>>>> -    ok = vhost_vdpa_svq_map_ring(v, &driver_region, errp);
>>>>>>>> +    ok = vhost_vdpa_svq_map_ring(v, &driver_region, svq_addr.desc_user_addr,
>>>>>>>> +                                 errp);
>>>>>>>>          if (unlikely(!ok)) {
>>>>>>>>              error_prepend(errp, "Cannot create vq driver region: ");
>>>>>>>>              return false;
>>>>>>>> @@ -1206,11 +1225,11 @@ static bool vhost_vdpa_svq_map_rings(struct vhost_dev *dev,
>>>>>>>>          addr->avail_user_addr = driver_region.iova + avail_offset;
>>>>>>>>
>>>>>>>>          device_region = (DMAMap) {
>>>>>>>> -        .translated_addr = svq_addr.used_user_addr,
>>>>>>>>              .size = device_size - 1,
>>>>>>>>              .perm = IOMMU_RW,
>>>>>>>>          };
>>>>>>>> -    ok = vhost_vdpa_svq_map_ring(v, &device_region, errp);
>>>>>>>> +    ok = vhost_vdpa_svq_map_ring(v, &device_region, svq_addr.used_user_addr,
>>>>>>>> +                                 errp);
>>>>>>>>          if (unlikely(!ok)) {
>>>>>>>>              error_prepend(errp, "Cannot create vq device region: ");
>>>>>>>>              vhost_vdpa_svq_unmap_ring(v, driver_region.translated_addr);
>>>>>>>> diff --git a/net/vhost-vdpa.c b/net/vhost-vdpa.c
>>>>>>>> index 03457ead66..81da956b92 100644
>>>>>>>> --- a/net/vhost-vdpa.c
>>>>>>>> +++ b/net/vhost-vdpa.c
>>>>>>>> @@ -512,15 +512,24 @@ static int vhost_vdpa_cvq_map_buf(struct vhost_vdpa *v, void *buf, size_t size,
>>>>>>>>          DMAMap map = {};
>>>>>>>>          int r;
>>>>>>>>
>>>>>>>> -    map.translated_addr = (hwaddr)(uintptr_t)buf;
>>>>>>>>          map.size = size - 1;
>>>>>>>>          map.perm = write ? IOMMU_RW : IOMMU_RO,
>>>>>>>> +
>>>>>>>> +    /* Allocate IOVA range and add the mapping to the IOVA tree */
>>>>>>>>          r = vhost_iova_tree_map_alloc(v->shared->iova_tree, &map);
>>>>>>>>          if (unlikely(r != IOVA_OK)) {
>>>>>>>> -        error_report("Cannot map injected element");
>>>>>>>> +        error_report("Cannot allocate IOVA range for injected element");
>>>>>>>>              return r;
>>>>>>>>          }
>>>>>>>>
>>>>>>>> +    /* Add mapping to the IOVA->HVA tree */
>>>>>>>> +    map.translated_addr = (hwaddr)(uintptr_t)buf;
>>>>>>>> +    r = vhost_iova_tree_insert(v->shared->iova_tree, &map);
>>>>>>>> +    if (unlikely(r != IOVA_OK)) {
>>>>>>>> +        error_report("Cannot map injected element into IOVA->HVA tree");
>>>>>>>> +        goto dma_map_err;
>>>>>>>> +    }
>>>>>>>> +
>>>>>>>>          r = vhost_vdpa_dma_map(v->shared, v->address_space_id, map.iova,
>>>>>>>>                                 vhost_vdpa_net_cvq_cmd_page_len(), buf, !write);
>>>>>>>>          if (unlikely(r < 0)) {
>>>>>>>> --
>>>>>>>> 2.43.5
>>>>>>>>
Eugenio Perez Martin Sept. 11, 2024, 10:45 a.m. UTC | #11
On Wed, Sep 11, 2024 at 11:06 AM Si-Wei Liu <si-wei.liu@oracle.com> wrote:
>
>
>
> On 9/9/2024 11:22 PM, Eugenio Perez Martin wrote:
> > On Tue, Sep 10, 2024 at 7:30 AM Si-Wei Liu <si-wei.liu@oracle.com> wrote:
> >> Sorry for the delayed response, it seems I missed the email reply for
> >> some reason during the long weekend.
> >>
> >> On 9/2/2024 4:09 AM, Eugenio Perez Martin wrote:
> >>> On Fri, Aug 30, 2024 at 11:05 PM Si-Wei Liu <si-wei.liu@oracle.com> wrote:
> >>>>
> >>>> On 8/30/2024 1:05 AM, Eugenio Perez Martin wrote:
> >>>>> On Fri, Aug 30, 2024 at 6:20 AM Si-Wei Liu <si-wei.liu@oracle.com> wrote:
> >>>>>> On 8/29/2024 9:53 AM, Eugenio Perez Martin wrote:
> >>>>>>> On Wed, Aug 21, 2024 at 2:56 PM Jonah Palmer <jonah.palmer@oracle.com> wrote:
> >>>>>>>> Decouples the IOVA allocator from the IOVA->HVA tree and instead adds
> >>>>>>>> the allocated IOVA range to an IOVA-only tree (iova_map). This IOVA tree
> >>>>>>>> will hold all IOVA ranges that have been allocated (e.g. in the
> >>>>>>>> IOVA->HVA tree) and are removed when any IOVA ranges are deallocated.
> >>>>>>>>
> >>>>>>>> A new API function vhost_iova_tree_insert() is also created to add a
> >>>>>>>> IOVA->HVA mapping into the IOVA->HVA tree.
> >>>>>>>>
> >>>>>>> I think this is a good first iteration but we can take steps to
> >>>>>>> simplify it. Also, it is great to be able to make points on real code
> >>>>>>> instead of designs on the air :).
> >>>>>>>
> >>>>>>> I expected a split of vhost_iova_tree_map_alloc between the current
> >>>>>>> vhost_iova_tree_map_alloc and vhost_iova_tree_map_alloc_gpa, or
> >>>>>>> similar. Similarly, a vhost_iova_tree_remove and
> >>>>>>> vhost_iova_tree_remove_gpa would be needed.
> >>>>>>>
> >>>>>>> The first one is used for regions that don't exist in the guest, like
> >>>>>>> SVQ vrings or CVQ buffers. The second one is the one used by the
> >>>>>>> memory listener to map the guest regions into the vdpa device.
> >>>>>>>
> >>>>>>> Implementation wise, only two trees are actually needed:
> >>>>>>> * Current iova_taddr_map that contains all IOVA->vaddr translations as
> >>>>>>> seen by the device, so both allocation functions can work on a single
> >>>>>>> tree. The function iova_tree_find_iova keeps using this one, so the
> >>>>>> I thought we had thorough discussion about this and agreed upon the
> >>>>>> decoupled IOVA allocator solution.
> >>>>> My interpretation of it is to leave the allocator as the current one,
> >>>>> and create a new tree with GPA which is guaranteed to be unique. But
> >>>>> we can talk over it of course.
> >>>>>
> >>>>>> But maybe I missed something earlier,
> >>>>>> I am not clear how come this iova_tree_find_iova function could still
> >>>>>> work with the full IOVA-> HVA tree when it comes to aliased memory or
> >>>>>> overlapped HVAs? Granted, for the memory map removal in the
> >>>>>> .region_del() path, we could rely on the GPA tree to locate the
> >>>>>> corresponding IOVA, but how come the translation path could figure out
> >>>>>> which IOVA range to return when the vaddr happens to fall in an
> >>>>>> overlapped HVA range?
> >>>>> That is not a problem, as they both translate to the same address at the device.
> >>>> Not sure I followed, it might return a wrong IOVA (range) which the host
> >>>> kernel may have conflict or unmatched attribute i.e. permission, size et
> >>>> al in the map.
> >>>>
> >>> Let's leave out the permissions at the moment. I'm going to use the
> >>> example you found, but I'll reorder (1) and (3) insertions so it picks
> >>> the "wrong" IOVA range intentionally:
> >>>
> >>> (1)
> >>> HVA: [0x7f7903ea0000, 0x7f7903ec0000)
> >>> GPA: [0xfeda0000, 0xfedc0000)
> >>> IOVA: [0x1000, 0x21000)
> >>>
> >>> (2)
> >>> HVA: [0x7f7983e00000, 0x7f9903e00000)
> >>> GPA: [0x100000000, 0x2080000000)
> >>> IOVA: [0x80001000, 0x2000001000)
> >>>
> >>> (3)
> >>> HVA: [0x7f7903e00000, 0x7f7983e00000)
> >>> GPA: [0x0, 0x80000000)
> >>> IOVA: [0x2000001000, 0x2080000000)
> >>>
> >>> Let's say that SVQ wants to translate the HVA range
> >>> 0xfeda0000-0xfedd0000. So it makes available for the device two
> >>> chained buffers: One with addr=0x1000 len=0x20000 and the other one
> >>> with addr=(0x20000c1000 len=0x10000).
> >>>
> >>> The VirtIO device should be able to translate these two buffers in
> >>> isolation and chain them. Not optimal but it helps to keep QEMU source
> >>> clean, as the device already must support it. I don't foresee lots of
> >>> cases like this anyway :).
> >> Hmmm, this scheme will only work if the aliased map doesn't go away
> >> immediately. If the BQL is not held or an unmap is to be done out of RCU
> >> critical section, it's pretty dangerous to assume we can be always fine
> >> to work with the other overlapped regions.
> > But all the updates and reads are done in the critical sections, and
> > will be that way in the future too. That's the reason why the iova
> > tree does not have any mutex or equivalent.
> Right. That's the way how the SVQ translation API is currently being
> used for sure. It's always protected in the critical section under BQL
> in the synchronized context. But if we want to support a future use
> caselike cache the translation result somewhere and get it used in an
> async context, this will be problematic. There's no way to correct the
> cached IOVA once the associated aliased map goes away, regardless of
> protection using mutex, BQL or something similar.

Sorry, I still don't follow this point :(. I still fail to see how
this series solves that problem, or how it is different from just
using two trees instead of three. Some questions trying to understand
it:

This series still uses vaddr -> gpa translation through the function
qemu_ram_block_from_host. It uses RCU for synchronization. How does
this series work if it gets updated out of the RCU?

The function qemu_ram_block_from_host could also pick the small region
contained in the big region, as it is the same linear search as we do
in the current case. There is no code to select the big region at the
moment. How do we avoid the consequences of that small region
vanishing in that case?

Is that cache just in the SVQ? How is the coherency handled
differently in the HVA -> IOVA case than IOVA -> HVA?

> >
> > If we take out SVQ from BQL, we will need to protect the update of it
> > with something, either a mutex or similar. But that's already part of
> > the plan, even without my proposal or if we implement this RFC the way
> > it is.
> Yep, I know currently we don't have such use case, or we don't even
> actually have to support or get any chance to trip over those edge use
> cases in the future. The point I wanted to make is that, this full HVA
> tree based translation path is tightly coupled with how SVQ is now
> supposed to work, while departing too much from the rest of memory
> subsystem. Not saying it is not okay to go this way, though you may be
> aware already that with this abstraction, there'll be loss of generality
> and consistency with memory system's view,  which would need duplicative
> work like the permission check to satisfy those well established
> functionalities already built in memory system itself; get it compared
> to the other IOTLB implementation or similar memory translation API in
> QEMU, limitation applies to where and how the API should be used.
>

I don't think we can compare them, as the other IOTLB implementations in
QEMU act as devices, not drivers. As SVQ is the only one acting as a
driver, it makes sense that its code is very different from the rest of
the codebase.

> >
> >> In my opinion if we have to
> >> work with the HVA tree, it'd be ideal for the caller(s) to get more aid
> >> in helping figure out the right IOVA range in match rather than for SVQ
> >> or vhost-iova-tree to blindly pick a random one or break up contiguous
> >> range into pieces (in an inconsistent and unnecessary way).
> > The point is that SVQ already needs to work like that, as it needs to
> > support that a contiguous GPA range is splitted between different HVA.
> > Moreover, this is not a part of SVQ, but of VirtQueue. Both should be
> > extremely rare cases. Is it worth it to complicate / penalize the
> > general case solution to benefit this weird case, which is supported?
> Sure, not saying SVQ shouldn't support split GPA range between
> different HVAs. I guess what I meant was the returned IOVA will likely
> not match the memory system's view, which is kinda weird. For
> example, the IOVA returned from the translation API can't be used to
> infer the GPA via internal tree lookup, we still have to resort to
> another external lookup via the memory system API.

That will never be possible for the whole IOVA range here, as SVQ also
needs to map QEMU-only memory.

But if that is a limitation, the second tree I propose can be the GPA
-> IOVA tree proposed in the patch 2/2 of this series for sure [1].

> From the looks the
> abstraction layer appears to be self-contained, but actually there are
> quite some odd assumptions here or there that may in turn prohibit
> possible future use case.
>
> Given this vhost-iova-tree abstraction can only work with the current
> assumption or the current limited usage in the SVQ code, I feel the
> abstraction might need a bit more time to evolve to a point, where with
> a feature-rich SVQ implementation we can gain more confidence to
> conclude it's the right time to abstract something up. For now I just
> mentally equalize SVQ with vhost-iova-tree both in concept and layering,
> so I have to admit I favored more on Jonah's sparse implementation with
> decoupled GPA tree plus partial HVA tree for the SVQ - it doesn't seem
> lose any generality for future extension. Do you feel if ever possible
> to start from this intuitive implementation and gradually get it evolved
> with future use cases?
>

Sorry if it sounded that way, but what I'm proposing is not too far
away from this RFC :). Let me summarize it here again,

1) Why not remove the IOVA allocator tree (iova_map)? We can just track
all the allocations with the IOVA->HVA tree as we're doing now, so we
save the memory, the code needed to keep both trees synchronized, and
the potential errors from doing so. If we implement the more optimized
HVA -> IOVA tree we can reevaluate its inclusion, of course, but I'd
start simple.

Pending: synchronization issues if we want to remove memory chunks out
of the RCU.

2) This series adds a conditional & a potentially expensive call to
qemu_ram_block_from_host in the translation path, which is a hot spot.
This is done so IOVATree knows if it needs to look in one tree or
another.

Instead of that, I propose to make this distinction at insertion /
removal time. This is both a way colder spot, and they don't need to
call qemu_ram_block_from_host or similar as the caller always knows if
it is being called from QEMU memory (via the listener) or adding SVQ
vrings / net specific code.

The first item is a prerequisite of this.

Does it make more sense?

[1] https://lists.nongnu.org/archive/html/qemu-devel/2024-08/msg04262.html

> >> This would
> >> require a bit extensive changes to all the callers to pass in more
> >> information though, like the GPA, or the RAMBlock/MemoryRegionSection
> >> backing the relevant guest memory, along with the offset.
> >>
> >>> About the permissions, maybe we can make the permissions to be part of
> >>> the lookup? Instead of returning them at iova_tree_find_iova, make
> >>> them match at iova_tree_find_address_iterator.
> >> Yes, if there's no easy way out we have to add this extra info to the
> >> HVA tree and make the lookup routine even complex (or duplicative).
> >>
> > Right.
> >
> >>>>> The most complicated situation is where we have a region contained in
> >>>>> another region, and the requested buffer crosses them. If the IOVA
> >>>>> tree returns the inner region, it will return the buffer chained with
> >>>>> the rest of the content in the outer region. Not optimal, but solved
> >>>>> either way.
> >>>> Don't quite understand what it means... So in this overlapping case,
> >>>> speaking of the expectation of the translation API, you would like to
> >>>> have all IOVA ranges that match the overlapped HVA to be returned? And
> >>>> then to rely on the user (caller) to figure out which one is correct?
> >>>> Wouldn't it be easier for the user (SVQ) to use the memory system API
> >>>> directly to figure out?
> >>>>
> >>> All of them are correct in the translation path. The device should be
> >>> able to work with a buffer that spans over different IOTLB too. You
> >>> can see how QEMU handles it at hw/virtio/virtio.c:virtqueue_map_desc.
> >>> If the region is not big enough to contain the whole buffer, the
> >>> device must keep looking for the rest of it.
> >> Yeah I see why you prefer working with HVA tree even with overlapping
> >> ranges, as the current API virtqueue_map_desc() that returns the HVA
> >> already wraps up the translation internals well for e.g. when span over
> >> different IOMMU.  Are you worried about the vIOMMU case where the GPA is no
> >> longer cached in the virtqueue elem? Maybe we can add also that
> >> information to the elem even for vIOMMU (we can defer doing it until we
> >> add the vIOMMU support to SVQ), so that SVQ can just look up the GPA
> >> tree directly in the translation path?
> > I think that IOVA should just replace GPA in the tree, isn't it? Or am
> > I missing something?
> Yeah, I mean that's the advantage for the full HVA tree solution, given
> the virtio device model that uses the virtqueue API virtqueue_map_desc()
> would return HVA rather than GPA, so when vIOMMU support is going to be
> added, SVQ translation code can still work with the returned HVA to
> translate back to IOVA as is. All the vIOMMU translation will be
> transparently handled in virtqueue_map_desc() itself.
>

[1] https://lists.nongnu.org/archive/html/qemu-devel/2024-08/msg04262.html

> Thanks,
> -Siwei
> >
> > So the user of the IOVA tree (vhost-vdpa.c) should be slightly changed
> > but there is no change required for SVQ or IOVATree as far as I know.
> >
> >>>> As we are talking about API we may want to build it in a way generic
> >>>> enough to address all possible needs (which goes with what memory
> >>>> subsystem is capable of), rather than just look on the current usage
> >>>> which has kind of narrow scope. Although virtio-net device doesn't work
> >>>> with aliased region now, some other virtio device may do, or maybe some
> >>>> day virtio-net would need to use aliased region than the API and the
> >>>> users (SVQ) would have to go with another round of significant
> >>>> refactoring due to the iova-tree internal working. I feel it's just too
> >>>> early or too tight to abstract the iova-tree layer and get the API
> >>>> customized for the current use case with a lot of limitations on how
> >>>> user should expect to use it. We need some more flexibility and ease on
> >>>> extensibility if we want to take the chance to get it rewritten, given
> >>>> it is not a lot of code that Jonah had showed here ..
> >>>>
> >>> Let me know if they are addressed here. Sorry if I didn't explain it
> >>> well, but I'm not trying to limit the alias or to handle just a subset
> >>> of them. I'm trying to delegate the handling of these to the device as
> >>> much as possible, as the device already needs to handle them and the
> >>> less we complicate the QEMU solution, the better. Of course, the IOVA
> >>> tree is a very self-contained area easy to rewrite in theory, but with
> >>> potential future users it might get complicated.
> >> Sure, understood. I just want to compare the Pros and Cons for each
> >> candidate, so that Jonah won't spend quite a lot of time to come up with
> >> complicated code, then soon find out all or most of them have to be
> >> thrown away, due to short sighted design which is unable to cope with
> >> foreseeable future use cases.
> >>
Si-Wei Liu Sept. 25, 2024, 2:54 p.m. UTC | #12
On 9/11/2024 3:45 AM, Eugenio Perez Martin wrote:
> On Wed, Sep 11, 2024 at 11:06 AM Si-Wei Liu <si-wei.liu@oracle.com> wrote:
>>
>>
>> On 9/9/2024 11:22 PM, Eugenio Perez Martin wrote:
>>> On Tue, Sep 10, 2024 at 7:30 AM Si-Wei Liu <si-wei.liu@oracle.com> wrote:
>>>> Sorry for the delayed response, it seems I missed the email reply for
>>>> some reason during the long weekend.
>>>>
>>>> On 9/2/2024 4:09 AM, Eugenio Perez Martin wrote:
>>>>> On Fri, Aug 30, 2024 at 11:05 PM Si-Wei Liu <si-wei.liu@oracle.com> wrote:
>>>>>> On 8/30/2024 1:05 AM, Eugenio Perez Martin wrote:
>>>>>>> On Fri, Aug 30, 2024 at 6:20 AM Si-Wei Liu <si-wei.liu@oracle.com> wrote:
>>>>>>>> On 8/29/2024 9:53 AM, Eugenio Perez Martin wrote:
>>>>>>>>> On Wed, Aug 21, 2024 at 2:56 PM Jonah Palmer <jonah.palmer@oracle.com> wrote:
>>>>>>>>>> Decouples the IOVA allocator from the IOVA->HVA tree and instead adds
>>>>>>>>>> the allocated IOVA range to an IOVA-only tree (iova_map). This IOVA tree
>>>>>>>>>> will hold all IOVA ranges that have been allocated (e.g. in the
>>>>>>>>>> IOVA->HVA tree) and are removed when any IOVA ranges are deallocated.
>>>>>>>>>>
>>>>>>>>>> A new API function vhost_iova_tree_insert() is also created to add a
>>>>>>>>>> IOVA->HVA mapping into the IOVA->HVA tree.
>>>>>>>>>>
>>>>>>>>> I think this is a good first iteration but we can take steps to
>>>>>>>>> simplify it. Also, it is great to be able to make points on real code
>>>>>>>>> instead of designs on the air :).
>>>>>>>>>
>>>>>>>>> I expected a split of vhost_iova_tree_map_alloc between the current
>>>>>>>>> vhost_iova_tree_map_alloc and vhost_iova_tree_map_alloc_gpa, or
>>>>>>>>> similar. Similarly, a vhost_iova_tree_remove and
>>>>>>>>> vhost_iova_tree_remove_gpa would be needed.
>>>>>>>>>
>>>>>>>>> The first one is used for regions that don't exist in the guest, like
>>>>>>>>> SVQ vrings or CVQ buffers. The second one is the one used by the
>>>>>>>>> memory listener to map the guest regions into the vdpa device.
>>>>>>>>>
>>>>>>>>> Implementation wise, only two trees are actually needed:
>>>>>>>>> * Current iova_taddr_map that contains all IOVA->vaddr translations as
>>>>>>>>> seen by the device, so both allocation functions can work on a single
>>>>>>>>> tree. The function iova_tree_find_iova keeps using this one, so the
>>>>>>>> I thought we had thorough discussion about this and agreed upon the
>>>>>>>> decoupled IOVA allocator solution.
>>>>>>> My interpretation of it is to leave the allocator as the current one,
>>>>>>> and create a new tree with GPA which is guaranteed to be unique. But
>>>>>>> we can talk over it of course.
>>>>>>>
>>>>>>>> But maybe I missed something earlier,
>>>>>>>> I am not clear how come this iova_tree_find_iova function could still
>>>>>>>> work with the full IOVA-> HVA tree when it comes to aliased memory or
>>>>>>>> overlapped HVAs? Granted, for the memory map removal in the
>>>>>>>> .region_del() path, we could rely on the GPA tree to locate the
>>>>>>>> corresponding IOVA, but how come the translation path could figure out
>>>>>>>> which IOVA range to return when the vaddr happens to fall in an
>>>>>>>> overlapped HVA range?
>>>>>>> That is not a problem, as they both translate to the same address at the device.
>>>>>> Not sure I followed, it might return a wrong IOVA (range) which the host
>>>>>> kernel may have conflict or unmatched attribute i.e. permission, size et
>>>>>> al in the map.
>>>>>>
>>>>> Let's leave out the permissions at the moment. I'm going to use the
>>>>> example you found, but I'll reorder (1) and (3) insertions so it picks
>>>>> the "wrong" IOVA range intentionally:
>>>>>
>>>>> (1)
>>>>> HVA: [0x7f7903ea0000, 0x7f7903ec0000)
>>>>> GPA: [0xfeda0000, 0xfedc0000)
>>>>> IOVA: [0x1000, 0x21000)
>>>>>
>>>>> (2)
>>>>> HVA: [0x7f7983e00000, 0x7f9903e00000)
>>>>> GPA: [0x100000000, 0x2080000000)
>>>>> IOVA: [0x80001000, 0x2000001000)
>>>>>
>>>>> (3)
>>>>> HVA: [0x7f7903e00000, 0x7f7983e00000)
>>>>> GPA: [0x0, 0x80000000)
>>>>> IOVA: [0x2000001000, 0x2080000000)
>>>>>
>>>>> Let's say that SVQ wants to translate the HVA range
>>>>> 0xfeda0000-0xfedd0000. So it makes available for the device two
>>>>> chained buffers: One with addr=0x1000 len=0x20000 and the other one
>>>>> with addr=(0x20000c1000 len=0x10000).
>>>>>
>>>>> The VirtIO device should be able to translate these two buffers in
>>>>> isolation and chain them. Not optimal but it helps to keep QEMU source
>>>>> clean, as the device already must support it. I don't foresee lots of
>>>>> cases like this anyway :).
>>>> Hmmm, this scheme will only work if the aliased map doesn't go away
>>>> immediately. If the BQL is not held or an unmap is to be done out of RCU
>>>> critical section, it's pretty dangerous to assume we can be always fine
>>>> to work with the other overlapped regions.
>>> But all the updates and reads are done in the critical sections, and
>>> will be that way in the future too. That's the reason why the iova
>>> tree does not have any mutex or equivalent.
>> Right. That's the way how the SVQ translation API is currently being
>> used for sure. It's always protected in the critical section under BQL
>> in the synchronized context. But if we want to support a future use
> >> case, like caching the translation result somewhere and using it in an
>> async context, this will be problematic. There's no way to correct the
>> cached IOVA once the associated aliased map goes away, regardless of
>> protection using mutex, BQL or something similar.
> Sorry, I still don't follow this point :(. I still fail to see how
> this series solves that problem, or how it is different from just
> using two trees instead of three. Some questions trying to understand
> it:
>
> This series still uses vaddr -> gpa translation through the function
> qemu_ram_block_from_host. It uses RCU for synchronization. How does
> this series work if it gets updated out of the RCU?
>
> The function qemu_ram_block_from_host could also pick the small region
> contained in the big region, as it is the same linear search as we do
> in the current case. There is no code to select the big region at the
> moment. How do we avoid the consequences of that small region
> vanishing in that case?
>
> Is that cache just in the SVQ? How is the coherency handled
> differently in the HVA -> IOVA case than IOVA -> HVA?
Sorry, I was busy in the last couple of weeks and didn't get a chance to 
reply. I thought this got answered in the last sync call, so here is a 
recap back on the list. I think the core point is not about supporting 
those various future edge cases, but that the HVA tree itself is 
problematic in identifying the exact memory object: right now it can 
easily track the wrong IOVA address, and we have to maintain a lot of 
duplicative code like permission checks in the SVQ layer. It is also hard 
to optimize in the future (imagine how hard it would be to build a 
reverse tree for exact HVA->IOVA fast translation). And as you know, this 
is just an RFC series, and we knew the qemu_ram_block_from_host call in 
the SVQ translation path can't be performant; the key motivation was to 
drive toward using the decoupled GPA tree instead of the full HVA tree.

There could be further optimizations in the virtio queue upper layer to 
percolate the needed information down to SVQ, such as the GPA, the offset 
in the MemoryRegionSection, or the offset in the RAMBlock, such that SVQ 
works directly *with* the memory system APIs in a natural way, rather 
than building an HVA tree that has to work *around* various limitations 
and has to track or synchronize with the life cycle of different objects 
(changes of guest physical address, memory region, etc.) in the memory 
subsystem. Theoretically this way we could work with the lower-level 
objects, e.g. MemoryRegionSection or RAMBlock, though I don't think the 
need is imminent, so I guess the GPA is sufficient between the virtio 
device layer and SVQ without loss in translation (to HVA, as the 
aliasing/overlapping case shows). With the abstraction and use of the GPA 
in SVQ, we just have to take care of the various memory listener events 
without having to synchronize a lot of other changes in the memory 
system. On the other hand, it won't have to do extra permission checks, 
or sync with the vIOMMU address translation path, as exposed by the HVA 
tree solution.

I think one of the immediate steps Jonah can take is to make the 
virtqueue_map_desc() API return the real GPA in addition to the HVA (the 
current code returns the bus address, which could be a GIOVA if the 
virtio device is behind a vIOMMU), so we can get rid of the 
qemu_ram_block_from_host API in the translation path. This would even 
work once vIOMMU support is added to SVQ: within virtqueue_map_desc() the 
vIOMMU .translate() is naturally done only once in the hot datapath, 
without having to do the translation twice in SVQ again (I don't see how 
the full HVA tree could get the translation done efficiently when working 
with a vIOMMU). Though at an earlier point I thought 
qemu_ram_block_from_host() would be easier for Jonah to pick up, to get 
the key notion of it without going into too many details immediately; 
that's why I had to get your confirmation that it'd be okay for Jonah to 
go with the qemu_ram_block_from_host API in this RFC.


Thanks,
-Siwei


>>> If we take out SVQ from BQL, we will need to protect the update of it
>>> with something, either a mutex or similar. But that's already part of
>>> the plan, even without my proposal or if we implement this RFC the way
>>> it is.
>> Yep, I know currently we don't have such use case, or we don't even
>> actually have to support or get any chance to trip over those edge use
>> cases in the future. The point I wanted to make is that, this full HVA
>> tree based translation path is tightly coupled with how SVQ is now
>> supposed to work, while departing too much from the rest of memory
>> subsystem. Not saying it is not okay to go this way, though you may be
>> aware already that with this abstraction, there'll be loss of generality
>> and consistency with memory system's view,  which would need duplicative
>> work like the permission check to satisfy those well established
>> functionalities already built in memory system itself; get it compared
>> to the other IOTLB implementation or similar memory translation API in
>> QEMU, limitation applies to where and how the API should be used.
>>
> I think we cannot compare, as other IOTLB implementations in QEMU are
> acting as devices, not drivers. As SVQ is the only one acting as
> drivers, it makes sense the code is very different from the rest of
> the code.
>
>>>> In my opinion if we have to
>>>> work with the HVA tree, it'd be ideal for the caller(s) to get more aid
>>>> in helping figure out the right IOVA range in match rather than for SVQ
>>>> or vhost-iova-tree to blindly pick a random one or break up contiguous
>>>> range into pieces (in an inconsistent and unnecessary way).
>>> The point is that SVQ already needs to work like that, as it needs to
>>> support that a contiguous GPA range is splitted between different HVA.
>>> Moreover, this is not a part of SVQ, but of VirtQueue. Both should be
>>> extremely rare cases. Is it worth it to complicate / penalize the
>>> general case solution to benefit this weird case, which is supported?
>> Sure, not saying SVQ shouldn't support split GPA range between
>> difference HVA. I guess what I meant was the returned IOVA will likely
> >> not match the memory system's view, which is kinda weird. For
>> example, the IOVA returned from the translation API can't be used to
>> infer the GPA via internal tree lookup, we still have to resort to
>> another external lookup via the memory system API.
> That will never be possible for the whole IOVA range here as SVQ also
> needs to map QEMU's only memory.
>
> But if that is a limitation, the second tree I propose can be the GPA
> -> IOVA tree proposed in the patch 2/2 of this series for sure [1].
>
>>  From the looks the
>> abstraction layer appears to be self-contained, but actually there are
>> quite some odd assumptions here or there that may in turn prohibit
>> possible future use case.
>>
>> Given this vhost-iova-tree abstraction can only work with the current
>> assumption or the current limited usage in the SVQ code, I feel the
>> abstraction might need a bit more time to evolve to a point, where with
>> a feature-rich SVQ implementation we can gain more confidence to
>> conclude it's the right time to abstract something up. For now I just
>> mentally equalize SVQ with vhost-iova-tree both in concept and layering,
>> so I have to admit I favored more on Jonah's sparse implementation with
>> decoupled GPA tree plus partial HVA tree for the SVQ - it doesn't seem
> >> to lose any generality for future extension. Do you feel it is ever possible
>> to start from this intuitive implementation and gradually get it evolved
>> with future use cases?
>>
> Sorry if it sounded that way, but what I'm proposing is not too far
> away from this RFC :). Let me summarize it here again,
>
> 1) Why not remove the iova allocator tree? (iova_map). We can just
> follow all the allocations with the IOVA->HVA tree as we're doing now,
> so we save the memory, the code needed to keep them both synchronized,
> and the potential errors of these. If we implement the more optimized
> HVA -> IOVA tree, we can reevaluate its inclusion of course, but I'd
> start simple.
>
> Pending: synchronization issues if we want to remove memory chunks out
> of the RCU.
>
> 2) This series adds a conditional & a potentially expensive call to
> qemu_ram_block_from_host in the translation path, which is a hot spot.
> This is done so IOVATree knows if it needs to look in one tree or
> another.
>
> Instead of that, I propose to make this distinction at insertion /
> removal time. This is both a way colder spot, and they don't need to
> call qemu_ram_block_from_host or similar as the caller always knows if
> it is being called from QEMU memory (via the listener) or adding SVQ
> vrings / net specific code.
>
> The first item is a prerequisite of this.
>
> Does it make more sense?
>
> [1] https://lists.nongnu.org/archive/html/qemu-devel/2024-08/msg04262.html
>
>>>> This would
>>>> require a bit extensive changes to all the callers to pass in more
>>>> information though, like the GPA, or the RAMBlock/MemoryRegionSection
>>>> backing the relevant guest memory, along with the offset.
>>>>
>>>>> About the permissions, maybe we can make the permissions to be part of
>>>>> the lookup? Instead of returning them at iova_tree_find_iova, make
>>>>> them match at iova_tree_find_address_iterator.
>>>> Yes, if there's no easy way out we have to add this extra info to the
>>>> HVA tree and make the lookup routine even complex (or duplicative).
>>>>
>>> Right.
>>>
>>>>>>> The most complicated situation is where we have a region contained in
>>>>>>> another region, and the requested buffer crosses them. If the IOVA
>>>>>>> tree returns the inner region, it will return the buffer chained with
>>>>>>> the rest of the content in the outer region. Not optimal, but solved
>>>>>>> either way.
>>>>>> Don't quite understand what it means... So in this overlapping case,
>>>>>> speaking of the expectation of the translation API, you would like to
>>>>>> have all IOVA ranges that match the overlapped HVA to be returned? And
>>>>>> then to rely on the user (caller) to figure out which one is correct?
>>>>>> Wouldn't it be easier for the user (SVQ) to use the memory system API
>>>>>> directly to figure out?
>>>>>>
>>>>> All of them are correct in the translation path. The device should be
>>>>> able to work with a buffer that spans over different IOTLB too. You
>>>>> can see how QEMU handles it at hw/virtio/virtio.c:virtqueue_map_desc.
>>>>> If the region is not big enough to contain the whole buffer, the
>>>>> device must keep looking for the rest of it.
>>>> Yeah I see why you prefer working with HVA tree even with overlapping
>>>> ranges, as the current API virtqueue_map_desc() that returns the HVA
>>>> already wraps up the translation internals well for e.g. when span over
> >>>> different IOMMU.  Are you worried about the vIOMMU case where the GPA is no
>>>> longer cached in the virtqueue elem? Maybe we can add also that
>>>> information to the elem even for vIOMMU (we can defer doing it until we
>>>> add the vIOMMU support to SVQ), so that SVQ can just look up the GPA
>>>> tree directly in the translation path?
>>> I think that IOVA should just replace GPA in the tree, isn't it? Or am
>>> I missing something?
>> Yeah, I mean that's the advantage for the full HVA tree solution, given
>> the virtio device model that uses the virtqueue API virtqueue_map_desc()
>> would return HVA rather than GPA, so when vIOMMU support is going to be
>> added, SVQ translation code can still work with the returned HVA to
>> translate back to IOVA as is. All the vIOMMU translation will be
>> transparently handled in virtqueue_map_desc() itself.
>>
> [1] https://lists.nongnu.org/archive/html/qemu-devel/2024-08/msg04262.html
>
>> Thanks,
>> -Siwei
>>> So the user of the IOVA tree (vhost-vdpa.c) should be slightly changed
>>> but there is no change required for SVQ or IOVATree as far as I know.
>>>
>>>>>> As we are talking about API we may want to build it in a way generic
>>>>>> enough to address all possible needs (which goes with what memory
>>>>>> subsystem is capable of), rather than just look on the current usage
>>>>>> which has kind of narrow scope. Although virtio-net device doesn't work
>>>>>> with aliased region now, some other virtio device may do, or maybe some
>>>>>> day virtio-net would need to use aliased region than the API and the
>>>>>> users (SVQ) would have to go with another round of significant
>>>>>> refactoring due to the iova-tree internal working. I feel it's just too
>>>>>> early or too tight to abstract the iova-tree layer and get the API
>>>>>> customized for the current use case with a lot of limitations on how
>>>>>> user should expect to use it. We need some more flexibility and ease on
>>>>>> extensibility if we want to take the chance to get it rewritten, given
>>>>>> it is not a lot of code that Jonah had showed here ..
>>>>>>
>>>>> Let me know if they are addressed here. Sorry if I didn't explain it
>>>>> well, but I'm not trying to limit the alias or to handle just a subset
>>>>> of them. I'm trying to delegate the handling of these to the device as
>>>>> much as possible, as the device already needs to handle them and the
>>>>> less we complicate the QEMU solution, the better. Of course, the IOVA
>>>>> tree is a very self-contained area easy to rewrite in theory, but with
>>>>> potential future users it might get complicated.
>>>> Sure, understood. I just want to compare the Pros and Cons for each
>>>> candidate, so that Jonah won't spend quite a lot of time to come up with
>>>> complicated code, then soon find out all or most of them have to be
>>>> thrown away, due to short sighted design which is unable to cope with
>>>> foreseeable future use cases.
>>>>
diff mbox series

Patch

diff --git a/hw/virtio/vhost-iova-tree.c b/hw/virtio/vhost-iova-tree.c
index 3d03395a77..32c03db2f5 100644
--- a/hw/virtio/vhost-iova-tree.c
+++ b/hw/virtio/vhost-iova-tree.c
@@ -28,12 +28,17 @@  struct VhostIOVATree {
 
     /* IOVA address to qemu memory maps. */
     IOVATree *iova_taddr_map;
+
+    /* IOVA tree (IOVA allocator) */
+    IOVATree *iova_map;
 };
 
 /**
- * Create a new IOVA tree
+ * Create a new VhostIOVATree with a new set of IOVATree's:
+ * - IOVA allocator (iova_map)
+ * - IOVA->HVA tree (iova_taddr_map)
  *
- * Returns the new IOVA tree
+ * Returns the new VhostIOVATree
  */
 VhostIOVATree *vhost_iova_tree_new(hwaddr iova_first, hwaddr iova_last)
 {
@@ -44,6 +49,7 @@  VhostIOVATree *vhost_iova_tree_new(hwaddr iova_first, hwaddr iova_last)
     tree->iova_last = iova_last;
 
     tree->iova_taddr_map = iova_tree_new();
+    tree->iova_map = iova_tree_new();
     return tree;
 }
 
@@ -53,6 +59,7 @@  VhostIOVATree *vhost_iova_tree_new(hwaddr iova_first, hwaddr iova_last)
 void vhost_iova_tree_delete(VhostIOVATree *iova_tree)
 {
     iova_tree_destroy(iova_tree->iova_taddr_map);
+    iova_tree_destroy(iova_tree->iova_map);
     g_free(iova_tree);
 }
 
@@ -88,13 +95,12 @@  int vhost_iova_tree_map_alloc(VhostIOVATree *tree, DMAMap *map)
     /* Some vhost devices do not like addr 0. Skip first page */
     hwaddr iova_first = tree->iova_first ?: qemu_real_host_page_size();
 
-    if (map->translated_addr + map->size < map->translated_addr ||
-        map->perm == IOMMU_NONE) {
+    if (map->perm == IOMMU_NONE) {
         return IOVA_ERR_INVALID;
     }
 
     /* Allocate a node in IOVA address */
-    return iova_tree_alloc_map(tree->iova_taddr_map, map, iova_first,
+    return iova_tree_alloc_map(tree->iova_map, map, iova_first,
                                tree->iova_last);
 }
 
@@ -107,4 +113,26 @@  int vhost_iova_tree_map_alloc(VhostIOVATree *tree, DMAMap *map)
 void vhost_iova_tree_remove(VhostIOVATree *iova_tree, DMAMap map)
 {
     iova_tree_remove(iova_tree->iova_taddr_map, map);
+    iova_tree_remove(iova_tree->iova_map, map);
+}
+
+/**
+ * Insert a new mapping to the IOVA->HVA tree
+ *
+ * @tree: The VhostIOVATree
+ * @map: The iova map
+ *
+ * Returns:
+ * - IOVA_OK if the map fits in the container
+ * - IOVA_ERR_INVALID if the map does not make sense (like size overflow)
+ * - IOVA_ERR_OVERLAP if the IOVA range overlaps with an existing range
+ */
+int vhost_iova_tree_insert(VhostIOVATree *iova_tree, DMAMap *map)
+{
+    if (map->translated_addr + map->size < map->translated_addr ||
+        map->perm == IOMMU_NONE) {
+        return IOVA_ERR_INVALID;
+    }
+
+    return iova_tree_insert(iova_tree->iova_taddr_map, map);
 }
diff --git a/hw/virtio/vhost-iova-tree.h b/hw/virtio/vhost-iova-tree.h
index 4adfd79ff0..8bf7b64786 100644
--- a/hw/virtio/vhost-iova-tree.h
+++ b/hw/virtio/vhost-iova-tree.h
@@ -23,5 +23,6 @@  const DMAMap *vhost_iova_tree_find_iova(const VhostIOVATree *iova_tree,
                                         const DMAMap *map);
 int vhost_iova_tree_map_alloc(VhostIOVATree *iova_tree, DMAMap *map);
 void vhost_iova_tree_remove(VhostIOVATree *iova_tree, DMAMap map);
+int vhost_iova_tree_insert(VhostIOVATree *iova_tree, DMAMap *map);
 
 #endif
diff --git a/hw/virtio/vhost-vdpa.c b/hw/virtio/vhost-vdpa.c
index 3cdaa12ed5..6702459065 100644
--- a/hw/virtio/vhost-vdpa.c
+++ b/hw/virtio/vhost-vdpa.c
@@ -361,10 +361,10 @@  static void vhost_vdpa_listener_region_add(MemoryListener *listener,
     if (s->shadow_data) {
         int r;
 
-        mem_region.translated_addr = (hwaddr)(uintptr_t)vaddr,
         mem_region.size = int128_get64(llsize) - 1,
         mem_region.perm = IOMMU_ACCESS_FLAG(true, section->readonly),
 
+        /* Allocate IOVA range and add the mapping to the IOVA tree */
         r = vhost_iova_tree_map_alloc(s->iova_tree, &mem_region);
         if (unlikely(r != IOVA_OK)) {
             error_report("Can't allocate a mapping (%d)", r);
@@ -372,6 +372,14 @@  static void vhost_vdpa_listener_region_add(MemoryListener *listener,
         }
 
         iova = mem_region.iova;
+
+        /* Add mapping to the IOVA->HVA tree */
+        mem_region.translated_addr = (hwaddr)(uintptr_t)vaddr;
+        r = vhost_iova_tree_insert(s->iova_tree, &mem_region);
+        if (unlikely(r != IOVA_OK)) {
+            error_report("Can't add listener region mapping (%d)", r);
+            goto fail_map;
+        }
     }
 
     vhost_vdpa_iotlb_batch_begin_once(s);
@@ -1142,19 +1150,30 @@  static void vhost_vdpa_svq_unmap_rings(struct vhost_dev *dev,
  *
  * @v: Vhost-vdpa device
  * @needle: The area to search iova
+ * @taddr: The translated address (SVQ HVA)
  * @errorp: Error pointer
  */
 static bool vhost_vdpa_svq_map_ring(struct vhost_vdpa *v, DMAMap *needle,
-                                    Error **errp)
+                                    hwaddr taddr, Error **errp)
 {
     int r;
 
+    /* Allocate IOVA range and add the mapping to the IOVA tree */
     r = vhost_iova_tree_map_alloc(v->shared->iova_tree, needle);
     if (unlikely(r != IOVA_OK)) {
         error_setg(errp, "Cannot allocate iova (%d)", r);
         return false;
     }
 
+    /* Add mapping to the IOVA->HVA tree */
+    needle->translated_addr = taddr;
+    r = vhost_iova_tree_insert(v->shared->iova_tree, needle);
+    if (unlikely(r != IOVA_OK)) {
+        error_setg(errp, "Cannot add SVQ vring mapping (%d)", r);
+        vhost_iova_tree_remove(v->shared->iova_tree, *needle);
+        return false;
+    }
+
     r = vhost_vdpa_dma_map(v->shared, v->address_space_id, needle->iova,
                            needle->size + 1,
                            (void *)(uintptr_t)needle->translated_addr,
@@ -1192,11 +1211,11 @@  static bool vhost_vdpa_svq_map_rings(struct vhost_dev *dev,
     vhost_svq_get_vring_addr(svq, &svq_addr);
 
     driver_region = (DMAMap) {
-        .translated_addr = svq_addr.desc_user_addr,
         .size = driver_size - 1,
         .perm = IOMMU_RO,
     };
-    ok = vhost_vdpa_svq_map_ring(v, &driver_region, errp);
+    ok = vhost_vdpa_svq_map_ring(v, &driver_region, svq_addr.desc_user_addr,
+                                 errp);
     if (unlikely(!ok)) {
         error_prepend(errp, "Cannot create vq driver region: ");
         return false;
@@ -1206,11 +1225,11 @@  static bool vhost_vdpa_svq_map_rings(struct vhost_dev *dev,
     addr->avail_user_addr = driver_region.iova + avail_offset;
 
     device_region = (DMAMap) {
-        .translated_addr = svq_addr.used_user_addr,
         .size = device_size - 1,
         .perm = IOMMU_RW,
     };
-    ok = vhost_vdpa_svq_map_ring(v, &device_region, errp);
+    ok = vhost_vdpa_svq_map_ring(v, &device_region, svq_addr.used_user_addr,
+                                 errp);
     if (unlikely(!ok)) {
         error_prepend(errp, "Cannot create vq device region: ");
         vhost_vdpa_svq_unmap_ring(v, driver_region.translated_addr);
diff --git a/net/vhost-vdpa.c b/net/vhost-vdpa.c
index 03457ead66..81da956b92 100644
--- a/net/vhost-vdpa.c
+++ b/net/vhost-vdpa.c
@@ -512,15 +512,24 @@  static int vhost_vdpa_cvq_map_buf(struct vhost_vdpa *v, void *buf, size_t size,
     DMAMap map = {};
     int r;
 
-    map.translated_addr = (hwaddr)(uintptr_t)buf;
     map.size = size - 1;
     map.perm = write ? IOMMU_RW : IOMMU_RO,
+
+    /* Allocate IOVA range and add the mapping to the IOVA tree */
     r = vhost_iova_tree_map_alloc(v->shared->iova_tree, &map);
     if (unlikely(r != IOVA_OK)) {
-        error_report("Cannot map injected element");
+        error_report("Cannot allocate IOVA range for injected element");
         return r;
     }
 
+    /* Add mapping to the IOVA->HVA tree */
+    map.translated_addr = (hwaddr)(uintptr_t)buf;
+    r = vhost_iova_tree_insert(v->shared->iova_tree, &map);
+    if (unlikely(r != IOVA_OK)) {
+        error_report("Cannot map injected element into IOVA->HVA tree");
+        goto dma_map_err;
+    }
+
     r = vhost_vdpa_dma_map(v->shared, v->address_space_id, map.iova,
                            vhost_vdpa_net_cvq_cmd_page_len(), buf, !write);
     if (unlikely(r < 0)) {