diff mbox series

[RFC,1/2] iova_tree: add an id member to DMAMap

Message ID 20240410100345.389462-2-eperezma@redhat.com (mailing list archive)
State New
Headers show
Series Identify aliased maps in vdpa SVQ iova_tree | expand

Commit Message

Eugenio Perez Martin April 10, 2024, 10:03 a.m. UTC
IOVA tree is also used to track the mappings of virtio-net shadow
virtqueue.  This mappings may not match with the GPA->HVA ones.

This causes a problem when overlapped regions (different GPA but same
translated HVA) exists in the tree, as looking them by HVA will return
them twice.  To solve this, create an id member so we can assign unique
identifiers (GPA) to the maps.

Signed-off-by: Eugenio Pérez <eperezma@redhat.com>
---
 include/qemu/iova-tree.h | 5 +++--
 util/iova-tree.c         | 3 ++-
 2 files changed, 5 insertions(+), 3 deletions(-)

Comments

Si-Wei Liu April 18, 2024, 8:46 p.m. UTC | #1
On 4/10/2024 3:03 AM, Eugenio Pérez wrote:
> IOVA tree is also used to track the mappings of virtio-net shadow
> virtqueue.  This mappings may not match with the GPA->HVA ones.
>
> This causes a problem when overlapped regions (different GPA but same
> translated HVA) exists in the tree, as looking them by HVA will return
> them twice.  To solve this, create an id member so we can assign unique
> identifiers (GPA) to the maps.
>
> Signed-off-by: Eugenio Pérez <eperezma@redhat.com>
> ---
>   include/qemu/iova-tree.h | 5 +++--
>   util/iova-tree.c         | 3 ++-
>   2 files changed, 5 insertions(+), 3 deletions(-)
>
> diff --git a/include/qemu/iova-tree.h b/include/qemu/iova-tree.h
> index 2a10a7052e..34ee230e7d 100644
> --- a/include/qemu/iova-tree.h
> +++ b/include/qemu/iova-tree.h
> @@ -36,6 +36,7 @@ typedef struct DMAMap {
>       hwaddr iova;
>       hwaddr translated_addr;
>       hwaddr size;                /* Inclusive */
> +    uint64_t id;
>       IOMMUAccessFlags perm;
>   } QEMU_PACKED DMAMap;
>   typedef gboolean (*iova_tree_iterator)(DMAMap *map);
> @@ -100,8 +101,8 @@ const DMAMap *iova_tree_find(const IOVATree *tree, const DMAMap *map);
>    * @map: the mapping to search
>    *
>    * Search for a mapping in the iova tree that translated_addr overlaps with the
> - * mapping range specified.  Only the first found mapping will be
> - * returned.
> + * mapping range specified and map->id is equal.  Only the first found
> + * mapping will be returned.
>    *
>    * Return: DMAMap pointer if found, or NULL if not found.  Note that
>    * the returned DMAMap pointer is maintained internally.  User should
> diff --git a/util/iova-tree.c b/util/iova-tree.c
> index 536789797e..0863e0a3b8 100644
> --- a/util/iova-tree.c
> +++ b/util/iova-tree.c
> @@ -97,7 +97,8 @@ static gboolean iova_tree_find_address_iterator(gpointer key, gpointer value,
>   
>       needle = args->needle;
>       if (map->translated_addr + map->size < needle->translated_addr ||
> -        needle->translated_addr + needle->size < map->translated_addr) {
> +        needle->translated_addr + needle->size < map->translated_addr ||
> +        needle->id != map->id) {

It looks this iterator can also be invoked by SVQ from 
vhost_svq_translate_addr() -> iova_tree_find_iova(), where guest GPA 
space will be searched on without passing in the ID (GPA), and exact 
match for the same GPA range is not actually needed unlike the mapping 
removal case. Could we create an API variant, for the SVQ lookup case 
specifically? Or alternatively, add a special flag, say skip_id_match to 
DMAMap, and the id match check may look like below:

(!needle->skip_id_match && needle->id != map->id)

I think vhost_svq_translate_addr() could just call the API variant or 
pass DMAmap with skip_id_match set to true to svq_iova_tree_find_iova().

Thanks,
-Siwei
>           return false;
>       }
>
Eugenio Perez Martin April 19, 2024, 8:29 a.m. UTC | #2
On Thu, Apr 18, 2024 at 10:46 PM Si-Wei Liu <si-wei.liu@oracle.com> wrote:
>
>
>
> On 4/10/2024 3:03 AM, Eugenio Pérez wrote:
> > IOVA tree is also used to track the mappings of virtio-net shadow
> > virtqueue.  This mappings may not match with the GPA->HVA ones.
> >
> > This causes a problem when overlapped regions (different GPA but same
> > translated HVA) exists in the tree, as looking them by HVA will return
> > them twice.  To solve this, create an id member so we can assign unique
> > identifiers (GPA) to the maps.
> >
> > Signed-off-by: Eugenio Pérez <eperezma@redhat.com>
> > ---
> >   include/qemu/iova-tree.h | 5 +++--
> >   util/iova-tree.c         | 3 ++-
> >   2 files changed, 5 insertions(+), 3 deletions(-)
> >
> > diff --git a/include/qemu/iova-tree.h b/include/qemu/iova-tree.h
> > index 2a10a7052e..34ee230e7d 100644
> > --- a/include/qemu/iova-tree.h
> > +++ b/include/qemu/iova-tree.h
> > @@ -36,6 +36,7 @@ typedef struct DMAMap {
> >       hwaddr iova;
> >       hwaddr translated_addr;
> >       hwaddr size;                /* Inclusive */
> > +    uint64_t id;
> >       IOMMUAccessFlags perm;
> >   } QEMU_PACKED DMAMap;
> >   typedef gboolean (*iova_tree_iterator)(DMAMap *map);
> > @@ -100,8 +101,8 @@ const DMAMap *iova_tree_find(const IOVATree *tree, const DMAMap *map);
> >    * @map: the mapping to search
> >    *
> >    * Search for a mapping in the iova tree that translated_addr overlaps with the
> > - * mapping range specified.  Only the first found mapping will be
> > - * returned.
> > + * mapping range specified and map->id is equal.  Only the first found
> > + * mapping will be returned.
> >    *
> >    * Return: DMAMap pointer if found, or NULL if not found.  Note that
> >    * the returned DMAMap pointer is maintained internally.  User should
> > diff --git a/util/iova-tree.c b/util/iova-tree.c
> > index 536789797e..0863e0a3b8 100644
> > --- a/util/iova-tree.c
> > +++ b/util/iova-tree.c
> > @@ -97,7 +97,8 @@ static gboolean iova_tree_find_address_iterator(gpointer key, gpointer value,
> >
> >       needle = args->needle;
> >       if (map->translated_addr + map->size < needle->translated_addr ||
> > -        needle->translated_addr + needle->size < map->translated_addr) {
> > +        needle->translated_addr + needle->size < map->translated_addr ||
> > +        needle->id != map->id) {
>
> It looks this iterator can also be invoked by SVQ from
> vhost_svq_translate_addr() -> iova_tree_find_iova(), where guest GPA
> space will be searched on without passing in the ID (GPA), and exact
> match for the same GPA range is not actually needed unlike the mapping
> removal case. Could we create an API variant, for the SVQ lookup case
> specifically? Or alternatively, add a special flag, say skip_id_match to
> DMAMap, and the id match check may look like below:
>
> (!needle->skip_id_match && needle->id != map->id)
>
> I think vhost_svq_translate_addr() could just call the API variant or
> pass DMAmap with skip_id_match set to true to svq_iova_tree_find_iova().
>

I think you're totally right. But I'd really like to not complicate
the API of the iova_tree more.

I think we can look for the hwaddr using memory_region_from_host and
then get the hwaddr. It is another lookup though...

> Thanks,
> -Siwei
> >           return false;
> >       }
> >
>
Si-Wei Liu April 19, 2024, 11:49 p.m. UTC | #3
On 4/19/2024 1:29 AM, Eugenio Perez Martin wrote:
> On Thu, Apr 18, 2024 at 10:46 PM Si-Wei Liu <si-wei.liu@oracle.com> wrote:
>>
>>
>> On 4/10/2024 3:03 AM, Eugenio Pérez wrote:
>>> IOVA tree is also used to track the mappings of virtio-net shadow
>>> virtqueue.  This mappings may not match with the GPA->HVA ones.
>>>
>>> This causes a problem when overlapped regions (different GPA but same
>>> translated HVA) exists in the tree, as looking them by HVA will return
>>> them twice.  To solve this, create an id member so we can assign unique
>>> identifiers (GPA) to the maps.
>>>
>>> Signed-off-by: Eugenio Pérez <eperezma@redhat.com>
>>> ---
>>>    include/qemu/iova-tree.h | 5 +++--
>>>    util/iova-tree.c         | 3 ++-
>>>    2 files changed, 5 insertions(+), 3 deletions(-)
>>>
>>> diff --git a/include/qemu/iova-tree.h b/include/qemu/iova-tree.h
>>> index 2a10a7052e..34ee230e7d 100644
>>> --- a/include/qemu/iova-tree.h
>>> +++ b/include/qemu/iova-tree.h
>>> @@ -36,6 +36,7 @@ typedef struct DMAMap {
>>>        hwaddr iova;
>>>        hwaddr translated_addr;
>>>        hwaddr size;                /* Inclusive */
>>> +    uint64_t id;
>>>        IOMMUAccessFlags perm;
>>>    } QEMU_PACKED DMAMap;
>>>    typedef gboolean (*iova_tree_iterator)(DMAMap *map);
>>> @@ -100,8 +101,8 @@ const DMAMap *iova_tree_find(const IOVATree *tree, const DMAMap *map);
>>>     * @map: the mapping to search
>>>     *
>>>     * Search for a mapping in the iova tree that translated_addr overlaps with the
>>> - * mapping range specified.  Only the first found mapping will be
>>> - * returned.
>>> + * mapping range specified and map->id is equal.  Only the first found
>>> + * mapping will be returned.
>>>     *
>>>     * Return: DMAMap pointer if found, or NULL if not found.  Note that
>>>     * the returned DMAMap pointer is maintained internally.  User should
>>> diff --git a/util/iova-tree.c b/util/iova-tree.c
>>> index 536789797e..0863e0a3b8 100644
>>> --- a/util/iova-tree.c
>>> +++ b/util/iova-tree.c
>>> @@ -97,7 +97,8 @@ static gboolean iova_tree_find_address_iterator(gpointer key, gpointer value,
>>>
>>>        needle = args->needle;
>>>        if (map->translated_addr + map->size < needle->translated_addr ||
>>> -        needle->translated_addr + needle->size < map->translated_addr) {
>>> +        needle->translated_addr + needle->size < map->translated_addr ||
>>> +        needle->id != map->id) {
>> It looks this iterator can also be invoked by SVQ from
>> vhost_svq_translate_addr() -> iova_tree_find_iova(), where guest GPA
>> space will be searched on without passing in the ID (GPA), and exact
>> match for the same GPA range is not actually needed unlike the mapping
>> removal case. Could we create an API variant, for the SVQ lookup case
>> specifically? Or alternatively, add a special flag, say skip_id_match to
>> DMAMap, and the id match check may look like below:
>>
>> (!needle->skip_id_match && needle->id != map->id)
>>
>> I think vhost_svq_translate_addr() could just call the API variant or
>> pass DMAmap with skip_id_match set to true to svq_iova_tree_find_iova().
>>
> I think you're totally right. But I'd really like to not complicate
> the API of the iova_tree more.
>
> I think we can look for the hwaddr using memory_region_from_host and
> then get the hwaddr. It is another lookup though...
Yeah, that will be another means of doing translation without having to 
complicate the API around iova_tree. I wonder how the lookup through 
memory_region_from_host() may perform compared to the iova tree one, the 
former looks to be an O(N) linear search on a linked list while the 
latter would be roughly O(log N) on an AVL tree? Of course, 
memory_region_from_host() won't search out of the guest memory space for 
sure. As this could be on the hot data path I have a little bit 
hesitance over the potential cost or performance regression this change 
could bring in, but maybe I'm overthinking it too much...

Thanks,
-Siwei

>
>> Thanks,
>> -Siwei
>>>            return false;
>>>        }
>>>
Eugenio Perez Martin April 22, 2024, 8:49 a.m. UTC | #4
On Sat, Apr 20, 2024 at 1:50 AM Si-Wei Liu <si-wei.liu@oracle.com> wrote:
>
>
>
> On 4/19/2024 1:29 AM, Eugenio Perez Martin wrote:
> > On Thu, Apr 18, 2024 at 10:46 PM Si-Wei Liu <si-wei.liu@oracle.com> wrote:
> >>
> >>
> >> On 4/10/2024 3:03 AM, Eugenio Pérez wrote:
> >>> IOVA tree is also used to track the mappings of virtio-net shadow
> >>> virtqueue.  This mappings may not match with the GPA->HVA ones.
> >>>
> >>> This causes a problem when overlapped regions (different GPA but same
> >>> translated HVA) exists in the tree, as looking them by HVA will return
> >>> them twice.  To solve this, create an id member so we can assign unique
> >>> identifiers (GPA) to the maps.
> >>>
> >>> Signed-off-by: Eugenio Pérez <eperezma@redhat.com>
> >>> ---
> >>>    include/qemu/iova-tree.h | 5 +++--
> >>>    util/iova-tree.c         | 3 ++-
> >>>    2 files changed, 5 insertions(+), 3 deletions(-)
> >>>
> >>> diff --git a/include/qemu/iova-tree.h b/include/qemu/iova-tree.h
> >>> index 2a10a7052e..34ee230e7d 100644
> >>> --- a/include/qemu/iova-tree.h
> >>> +++ b/include/qemu/iova-tree.h
> >>> @@ -36,6 +36,7 @@ typedef struct DMAMap {
> >>>        hwaddr iova;
> >>>        hwaddr translated_addr;
> >>>        hwaddr size;                /* Inclusive */
> >>> +    uint64_t id;
> >>>        IOMMUAccessFlags perm;
> >>>    } QEMU_PACKED DMAMap;
> >>>    typedef gboolean (*iova_tree_iterator)(DMAMap *map);
> >>> @@ -100,8 +101,8 @@ const DMAMap *iova_tree_find(const IOVATree *tree, const DMAMap *map);
> >>>     * @map: the mapping to search
> >>>     *
> >>>     * Search for a mapping in the iova tree that translated_addr overlaps with the
> >>> - * mapping range specified.  Only the first found mapping will be
> >>> - * returned.
> >>> + * mapping range specified and map->id is equal.  Only the first found
> >>> + * mapping will be returned.
> >>>     *
> >>>     * Return: DMAMap pointer if found, or NULL if not found.  Note that
> >>>     * the returned DMAMap pointer is maintained internally.  User should
> >>> diff --git a/util/iova-tree.c b/util/iova-tree.c
> >>> index 536789797e..0863e0a3b8 100644
> >>> --- a/util/iova-tree.c
> >>> +++ b/util/iova-tree.c
> >>> @@ -97,7 +97,8 @@ static gboolean iova_tree_find_address_iterator(gpointer key, gpointer value,
> >>>
> >>>        needle = args->needle;
> >>>        if (map->translated_addr + map->size < needle->translated_addr ||
> >>> -        needle->translated_addr + needle->size < map->translated_addr) {
> >>> +        needle->translated_addr + needle->size < map->translated_addr ||
> >>> +        needle->id != map->id) {
> >> It looks this iterator can also be invoked by SVQ from
> >> vhost_svq_translate_addr() -> iova_tree_find_iova(), where guest GPA
> >> space will be searched on without passing in the ID (GPA), and exact
> >> match for the same GPA range is not actually needed unlike the mapping
> >> removal case. Could we create an API variant, for the SVQ lookup case
> >> specifically? Or alternatively, add a special flag, say skip_id_match to
> >> DMAMap, and the id match check may look like below:
> >>
> >> (!needle->skip_id_match && needle->id != map->id)
> >>
> >> I think vhost_svq_translate_addr() could just call the API variant or
> >> pass DMAmap with skip_id_match set to true to svq_iova_tree_find_iova().
> >>
> > I think you're totally right. But I'd really like to not complicate
> > the API of the iova_tree more.
> >
> > I think we can look for the hwaddr using memory_region_from_host and
> > then get the hwaddr. It is another lookup though...
> Yeah, that will be another means of doing translation without having to
> complicate the API around iova_tree. I wonder how the lookup through
> memory_region_from_host() may perform compared to the iova tree one, the
> former looks to be an O(N) linear search on a linked list while the
> latter would be roughly O(log N) on an AVL tree?

Even worse, as the reverse lookup (from QEMU vaddr to SVQ IOVA) is
linear too. It is not even ordered.

But apart from this detail you're right, I have the same concerns with
this solution too. If we see a hard performance regression we could go
to more complicated solutions, like maintaining a reverse IOVATree in
vhost-iova-tree too. First RFCs of SVQ did that actually.

Thanks!

> Of course,
> memory_region_from_host() won't search out of the guest memory space for
> sure. As this could be on the hot data path I have a little bit
> hesitance over the potential cost or performance regression this change
> could bring in, but maybe I'm overthinking it too much...
>
> Thanks,
> -Siwei
>
> >
> >> Thanks,
> >> -Siwei
> >>>            return false;
> >>>        }
> >>>
>
Si-Wei Liu April 23, 2024, 10:20 p.m. UTC | #5
On 4/22/2024 1:49 AM, Eugenio Perez Martin wrote:
> On Sat, Apr 20, 2024 at 1:50 AM Si-Wei Liu <si-wei.liu@oracle.com> wrote:
>>
>>
>> On 4/19/2024 1:29 AM, Eugenio Perez Martin wrote:
>>> On Thu, Apr 18, 2024 at 10:46 PM Si-Wei Liu <si-wei.liu@oracle.com> wrote:
>>>>
>>>> On 4/10/2024 3:03 AM, Eugenio Pérez wrote:
>>>>> IOVA tree is also used to track the mappings of virtio-net shadow
>>>>> virtqueue.  This mappings may not match with the GPA->HVA ones.
>>>>>
>>>>> This causes a problem when overlapped regions (different GPA but same
>>>>> translated HVA) exists in the tree, as looking them by HVA will return
>>>>> them twice.  To solve this, create an id member so we can assign unique
>>>>> identifiers (GPA) to the maps.
>>>>>
>>>>> Signed-off-by: Eugenio Pérez <eperezma@redhat.com>
>>>>> ---
>>>>>     include/qemu/iova-tree.h | 5 +++--
>>>>>     util/iova-tree.c         | 3 ++-
>>>>>     2 files changed, 5 insertions(+), 3 deletions(-)
>>>>>
>>>>> diff --git a/include/qemu/iova-tree.h b/include/qemu/iova-tree.h
>>>>> index 2a10a7052e..34ee230e7d 100644
>>>>> --- a/include/qemu/iova-tree.h
>>>>> +++ b/include/qemu/iova-tree.h
>>>>> @@ -36,6 +36,7 @@ typedef struct DMAMap {
>>>>>         hwaddr iova;
>>>>>         hwaddr translated_addr;
>>>>>         hwaddr size;                /* Inclusive */
>>>>> +    uint64_t id;
>>>>>         IOMMUAccessFlags perm;
>>>>>     } QEMU_PACKED DMAMap;
>>>>>     typedef gboolean (*iova_tree_iterator)(DMAMap *map);
>>>>> @@ -100,8 +101,8 @@ const DMAMap *iova_tree_find(const IOVATree *tree, const DMAMap *map);
>>>>>      * @map: the mapping to search
>>>>>      *
>>>>>      * Search for a mapping in the iova tree that translated_addr overlaps with the
>>>>> - * mapping range specified.  Only the first found mapping will be
>>>>> - * returned.
>>>>> + * mapping range specified and map->id is equal.  Only the first found
>>>>> + * mapping will be returned.
>>>>>      *
>>>>>      * Return: DMAMap pointer if found, or NULL if not found.  Note that
>>>>>      * the returned DMAMap pointer is maintained internally.  User should
>>>>> diff --git a/util/iova-tree.c b/util/iova-tree.c
>>>>> index 536789797e..0863e0a3b8 100644
>>>>> --- a/util/iova-tree.c
>>>>> +++ b/util/iova-tree.c
>>>>> @@ -97,7 +97,8 @@ static gboolean iova_tree_find_address_iterator(gpointer key, gpointer value,
>>>>>
>>>>>         needle = args->needle;
>>>>>         if (map->translated_addr + map->size < needle->translated_addr ||
>>>>> -        needle->translated_addr + needle->size < map->translated_addr) {
>>>>> +        needle->translated_addr + needle->size < map->translated_addr ||
>>>>> +        needle->id != map->id) {
>>>> It looks this iterator can also be invoked by SVQ from
>>>> vhost_svq_translate_addr() -> iova_tree_find_iova(), where guest GPA
>>>> space will be searched on without passing in the ID (GPA), and exact
>>>> match for the same GPA range is not actually needed unlike the mapping
>>>> removal case. Could we create an API variant, for the SVQ lookup case
>>>> specifically? Or alternatively, add a special flag, say skip_id_match to
>>>> DMAMap, and the id match check may look like below:
>>>>
>>>> (!needle->skip_id_match && needle->id != map->id)
>>>>
>>>> I think vhost_svq_translate_addr() could just call the API variant or
>>>> pass DMAmap with skip_id_match set to true to svq_iova_tree_find_iova().
>>>>
>>> I think you're totally right. But I'd really like to not complicate
>>> the API of the iova_tree more.
>>>
>>> I think we can look for the hwaddr using memory_region_from_host and
>>> then get the hwaddr. It is another lookup though...
>> Yeah, that will be another means of doing translation without having to
>> complicate the API around iova_tree. I wonder how the lookup through
>> memory_region_from_host() may perform compared to the iova tree one, the
>> former looks to be an O(N) linear search on a linked list while the
>> latter would be roughly O(log N) on an AVL tree?
> Even worse, as the reverse lookup (from QEMU vaddr to SVQ IOVA) is
> linear too. It is not even ordered.
Oh Sorry, I misread the code and I should look for g_tree_foreach () 
instead of g_tree_search_node(). So the former is indeed linear 
iteration, but it looks to be ordered?

https://github.com/GNOME/glib/blob/main/glib/gtree.c#L1115
>
> But apart from this detail you're right, I have the same concerns with
> this solution too. If we see a hard performance regression we could go
> to more complicated solutions, like maintaining a reverse IOVATree in
> vhost-iova-tree too. First RFCs of SVQ did that actually.
Agreed, yeap we can use memory_region_from_host for now.  Any reason why 
reverse IOVATree was dropped, lack of users? But now we have one!

Thanks,
-Siwei
>
> Thanks!
>
>> Of course,
>> memory_region_from_host() won't search out of the guest memory space for
>> sure. As this could be on the hot data path I have a little bit
>> hesitance over the potential cost or performance regression this change
>> could bring in, but maybe I'm overthinking it too much...
>>
>> Thanks,
>> -Siwei
>>
>>>> Thanks,
>>>> -Siwei
>>>>>             return false;
>>>>>         }
>>>>>
Eugenio Perez Martin April 24, 2024, 7:33 a.m. UTC | #6
On Wed, Apr 24, 2024 at 12:21 AM Si-Wei Liu <si-wei.liu@oracle.com> wrote:
>
>
>
> On 4/22/2024 1:49 AM, Eugenio Perez Martin wrote:
> > On Sat, Apr 20, 2024 at 1:50 AM Si-Wei Liu <si-wei.liu@oracle.com> wrote:
> >>
> >>
> >> On 4/19/2024 1:29 AM, Eugenio Perez Martin wrote:
> >>> On Thu, Apr 18, 2024 at 10:46 PM Si-Wei Liu <si-wei.liu@oracle.com> wrote:
> >>>>
> >>>> On 4/10/2024 3:03 AM, Eugenio Pérez wrote:
> >>>>> IOVA tree is also used to track the mappings of virtio-net shadow
> >>>>> virtqueue.  This mappings may not match with the GPA->HVA ones.
> >>>>>
> >>>>> This causes a problem when overlapped regions (different GPA but same
> >>>>> translated HVA) exists in the tree, as looking them by HVA will return
> >>>>> them twice.  To solve this, create an id member so we can assign unique
> >>>>> identifiers (GPA) to the maps.
> >>>>>
> >>>>> Signed-off-by: Eugenio Pérez <eperezma@redhat.com>
> >>>>> ---
> >>>>>     include/qemu/iova-tree.h | 5 +++--
> >>>>>     util/iova-tree.c         | 3 ++-
> >>>>>     2 files changed, 5 insertions(+), 3 deletions(-)
> >>>>>
> >>>>> diff --git a/include/qemu/iova-tree.h b/include/qemu/iova-tree.h
> >>>>> index 2a10a7052e..34ee230e7d 100644
> >>>>> --- a/include/qemu/iova-tree.h
> >>>>> +++ b/include/qemu/iova-tree.h
> >>>>> @@ -36,6 +36,7 @@ typedef struct DMAMap {
> >>>>>         hwaddr iova;
> >>>>>         hwaddr translated_addr;
> >>>>>         hwaddr size;                /* Inclusive */
> >>>>> +    uint64_t id;
> >>>>>         IOMMUAccessFlags perm;
> >>>>>     } QEMU_PACKED DMAMap;
> >>>>>     typedef gboolean (*iova_tree_iterator)(DMAMap *map);
> >>>>> @@ -100,8 +101,8 @@ const DMAMap *iova_tree_find(const IOVATree *tree, const DMAMap *map);
> >>>>>      * @map: the mapping to search
> >>>>>      *
> >>>>>      * Search for a mapping in the iova tree that translated_addr overlaps with the
> >>>>> - * mapping range specified.  Only the first found mapping will be
> >>>>> - * returned.
> >>>>> + * mapping range specified and map->id is equal.  Only the first found
> >>>>> + * mapping will be returned.
> >>>>>      *
> >>>>>      * Return: DMAMap pointer if found, or NULL if not found.  Note that
> >>>>>      * the returned DMAMap pointer is maintained internally.  User should
> >>>>> diff --git a/util/iova-tree.c b/util/iova-tree.c
> >>>>> index 536789797e..0863e0a3b8 100644
> >>>>> --- a/util/iova-tree.c
> >>>>> +++ b/util/iova-tree.c
> >>>>> @@ -97,7 +97,8 @@ static gboolean iova_tree_find_address_iterator(gpointer key, gpointer value,
> >>>>>
> >>>>>         needle = args->needle;
> >>>>>         if (map->translated_addr + map->size < needle->translated_addr ||
> >>>>> -        needle->translated_addr + needle->size < map->translated_addr) {
> >>>>> +        needle->translated_addr + needle->size < map->translated_addr ||
> >>>>> +        needle->id != map->id) {
> >>>> It looks this iterator can also be invoked by SVQ from
> >>>> vhost_svq_translate_addr() -> iova_tree_find_iova(), where guest GPA
> >>>> space will be searched on without passing in the ID (GPA), and exact
> >>>> match for the same GPA range is not actually needed unlike the mapping
> >>>> removal case. Could we create an API variant, for the SVQ lookup case
> >>>> specifically? Or alternatively, add a special flag, say skip_id_match to
> >>>> DMAMap, and the id match check may look like below:
> >>>>
> >>>> (!needle->skip_id_match && needle->id != map->id)
> >>>>
> >>>> I think vhost_svq_translate_addr() could just call the API variant or
> >>>> pass DMAmap with skip_id_match set to true to svq_iova_tree_find_iova().
> >>>>
> >>> I think you're totally right. But I'd really like to not complicate
> >>> the API of the iova_tree more.
> >>>
> >>> I think we can look for the hwaddr using memory_region_from_host and
> >>> then get the hwaddr. It is another lookup though...
> >> Yeah, that will be another means of doing translation without having to
> >> complicate the API around iova_tree. I wonder how the lookup through
> >> memory_region_from_host() may perform compared to the iova tree one, the
> >> former looks to be an O(N) linear search on a linked list while the
> >> latter would be roughly O(log N) on an AVL tree?
> > Even worse, as the reverse lookup (from QEMU vaddr to SVQ IOVA) is
> > linear too. It is not even ordered.
> Oh Sorry, I misread the code and I should look for g_tree_foreach ()
> instead of g_tree_search_node(). So the former is indeed linear
> iteration, but it looks to be ordered?
>
> https://github.com/GNOME/glib/blob/main/glib/gtree.c#L1115

The GPA / IOVA are ordered but we're looking by QEMU's vaddr.

If we have these translations:
[0x1000, 0x2000] -> [0x10000, 0x11000]
[0x2000, 0x3000] -> [0x6000, 0x7000]

We will see them in this order, so we cannot stop the search at the first node.

> >
> > But apart from this detail you're right, I have the same concerns with
> > this solution too. If we see a hard performance regression we could go
> > to more complicated solutions, like maintaining a reverse IOVATree in
> > vhost-iova-tree too. First RFCs of SVQ did that actually.
> Agreed, yeap we can use memory_region_from_host for now.  Any reason why
> reverse IOVATree was dropped, lack of users? But now we have one!
>

No, it is just simplicity. We already have an user in the hot patch in
the master branch, vhost_svq_vring_write_descs. But I never profiled
enough to find if it is a bottleneck or not to be honest.

I'll send the new series by today, thank you for finding these issues!

> Thanks,
> -Siwei
> >
> > Thanks!
> >
> >> Of course,
> >> memory_region_from_host() won't search out of the guest memory space for
> >> sure. As this could be on the hot data path I have a little bit
> >> hesitance over the potential cost or performance regression this change
> >> could bring in, but maybe I'm overthinking it too much...
> >>
> >> Thanks,
> >> -Siwei
> >>
> >>>> Thanks,
> >>>> -Siwei
> >>>>>             return false;
> >>>>>         }
> >>>>>
>
Si-Wei Liu April 25, 2024, 5:43 p.m. UTC | #7
On 4/24/2024 12:33 AM, Eugenio Perez Martin wrote:
> On Wed, Apr 24, 2024 at 12:21 AM Si-Wei Liu <si-wei.liu@oracle.com> wrote:
>>
>>
>> On 4/22/2024 1:49 AM, Eugenio Perez Martin wrote:
>>> On Sat, Apr 20, 2024 at 1:50 AM Si-Wei Liu <si-wei.liu@oracle.com> wrote:
>>>>
>>>> On 4/19/2024 1:29 AM, Eugenio Perez Martin wrote:
>>>>> On Thu, Apr 18, 2024 at 10:46 PM Si-Wei Liu <si-wei.liu@oracle.com> wrote:
>>>>>> On 4/10/2024 3:03 AM, Eugenio Pérez wrote:
>>>>>>> IOVA tree is also used to track the mappings of virtio-net shadow
>>>>>>> virtqueue.  This mappings may not match with the GPA->HVA ones.
>>>>>>>
>>>>>>> This causes a problem when overlapped regions (different GPA but same
>>>>>>> translated HVA) exists in the tree, as looking them by HVA will return
>>>>>>> them twice.  To solve this, create an id member so we can assign unique
>>>>>>> identifiers (GPA) to the maps.
>>>>>>>
>>>>>>> Signed-off-by: Eugenio Pérez <eperezma@redhat.com>
>>>>>>> ---
>>>>>>>      include/qemu/iova-tree.h | 5 +++--
>>>>>>>      util/iova-tree.c         | 3 ++-
>>>>>>>      2 files changed, 5 insertions(+), 3 deletions(-)
>>>>>>>
>>>>>>> diff --git a/include/qemu/iova-tree.h b/include/qemu/iova-tree.h
>>>>>>> index 2a10a7052e..34ee230e7d 100644
>>>>>>> --- a/include/qemu/iova-tree.h
>>>>>>> +++ b/include/qemu/iova-tree.h
>>>>>>> @@ -36,6 +36,7 @@ typedef struct DMAMap {
>>>>>>>          hwaddr iova;
>>>>>>>          hwaddr translated_addr;
>>>>>>>          hwaddr size;                /* Inclusive */
>>>>>>> +    uint64_t id;
>>>>>>>          IOMMUAccessFlags perm;
>>>>>>>      } QEMU_PACKED DMAMap;
>>>>>>>      typedef gboolean (*iova_tree_iterator)(DMAMap *map);
>>>>>>> @@ -100,8 +101,8 @@ const DMAMap *iova_tree_find(const IOVATree *tree, const DMAMap *map);
>>>>>>>       * @map: the mapping to search
>>>>>>>       *
>>>>>>>       * Search for a mapping in the iova tree that translated_addr overlaps with the
>>>>>>> - * mapping range specified.  Only the first found mapping will be
>>>>>>> - * returned.
>>>>>>> + * mapping range specified and map->id is equal.  Only the first found
>>>>>>> + * mapping will be returned.
>>>>>>>       *
>>>>>>>       * Return: DMAMap pointer if found, or NULL if not found.  Note that
>>>>>>>       * the returned DMAMap pointer is maintained internally.  User should
>>>>>>> diff --git a/util/iova-tree.c b/util/iova-tree.c
>>>>>>> index 536789797e..0863e0a3b8 100644
>>>>>>> --- a/util/iova-tree.c
>>>>>>> +++ b/util/iova-tree.c
>>>>>>> @@ -97,7 +97,8 @@ static gboolean iova_tree_find_address_iterator(gpointer key, gpointer value,
>>>>>>>
>>>>>>>          needle = args->needle;
>>>>>>>          if (map->translated_addr + map->size < needle->translated_addr ||
>>>>>>> -        needle->translated_addr + needle->size < map->translated_addr) {
>>>>>>> +        needle->translated_addr + needle->size < map->translated_addr ||
>>>>>>> +        needle->id != map->id) {
>>>>>> It looks this iterator can also be invoked by SVQ from
>>>>>> vhost_svq_translate_addr() -> iova_tree_find_iova(), where guest GPA
>>>>>> space will be searched on without passing in the ID (GPA), and exact
>>>>>> match for the same GPA range is not actually needed unlike the mapping
>>>>>> removal case. Could we create an API variant, for the SVQ lookup case
>>>>>> specifically? Or alternatively, add a special flag, say skip_id_match to
>>>>>> DMAMap, and the id match check may look like below:
>>>>>>
>>>>>> (!needle->skip_id_match && needle->id != map->id)
>>>>>>
>>>>>> I think vhost_svq_translate_addr() could just call the API variant or
>>>>>> pass DMAmap with skip_id_match set to true to svq_iova_tree_find_iova().
>>>>>>
>>>>> I think you're totally right. But I'd really like to not complicate
>>>>> the API of the iova_tree more.
>>>>>
>>>>> I think we can look for the hwaddr using memory_region_from_host and
>>>>> then get the hwaddr. It is another lookup though...
>>>> Yeah, that will be another means of doing translation without having to
>>>> complicate the API around iova_tree. I wonder how the lookup through
>>>> memory_region_from_host() may perform compared to the iova tree one, the
>>>> former looks to be an O(N) linear search on a linked list while the
>>>> latter would be roughly O(log N) on an AVL tree?
>>> Even worse, as the reverse lookup (from QEMU vaddr to SVQ IOVA) is
>>> linear too. It is not even ordered.
>> Oh Sorry, I misread the code and I should look for g_tree_foreach ()
>> instead of g_tree_search_node(). So the former is indeed linear
>> iteration, but it looks to be ordered?
>>
>> https://github.com/GNOME/glib/blob/main/glib/gtree.c#L1115
> The GPA / IOVA are ordered but we're looking by QEMU's vaddr.
>
> If we have these translations:
> [0x1000, 0x2000] -> [0x10000, 0x11000]
> [0x2000, 0x3000] -> [0x6000, 0x7000]
>
> We will see them in this order, so we cannot stop the search at the first node.
Yeah, reverse lookup is unordered indeed, anyway.

>
>>> But apart from this detail you're right, I have the same concerns with
>>> this solution too. If we see a hard performance regression we could go
>>> to more complicated solutions, like maintaining a reverse IOVATree in
>>> vhost-iova-tree too. First RFCs of SVQ did that actually.
>> Agreed, yeap we can use memory_region_from_host for now.  Any reason why
>> reverse IOVATree was dropped, lack of users? But now we have one!
>>
> No, it is just simplicity. We already have an user in the hot patch in
> the master branch, vhost_svq_vring_write_descs. But I never profiled
> enough to find if it is a bottleneck or not to be honest.
Right, without vIOMMU or a lot of virtqueues / mappings, it's hard to 
profile and see the difference.
>
> I'll send the new series by today, thank you for finding these issues!
Thanks! In case you don't have bandwidth to add back reverse IOVA tree, 
Jonah (cc'ed) may have interest in looking into it.

-Siwei


>
>> Thanks,
>> -Siwei
>>> Thanks!
>>>
>>>> Of course,
>>>> memory_region_from_host() won't search out of the guest memory space for
>>>> sure. As this could be on the hot data path I have a little bit
>>>> hesitance over the potential cost or performance regression this change
>>>> could bring in, but maybe I'm overthinking it too much...
>>>>
>>>> Thanks,
>>>> -Siwei
>>>>
>>>>>> Thanks,
>>>>>> -Siwei
>>>>>>>              return false;
>>>>>>>          }
>>>>>>>
Eugenio Perez Martin April 29, 2024, 8:14 a.m. UTC | #8
On Thu, Apr 25, 2024 at 7:44 PM Si-Wei Liu <si-wei.liu@oracle.com> wrote:
>
>
>
> On 4/24/2024 12:33 AM, Eugenio Perez Martin wrote:
> > On Wed, Apr 24, 2024 at 12:21 AM Si-Wei Liu <si-wei.liu@oracle.com> wrote:
> >>
> >>
> >> On 4/22/2024 1:49 AM, Eugenio Perez Martin wrote:
> >>> On Sat, Apr 20, 2024 at 1:50 AM Si-Wei Liu <si-wei.liu@oracle.com> wrote:
> >>>>
> >>>> On 4/19/2024 1:29 AM, Eugenio Perez Martin wrote:
> >>>>> On Thu, Apr 18, 2024 at 10:46 PM Si-Wei Liu <si-wei.liu@oracle.com> wrote:
> >>>>>> On 4/10/2024 3:03 AM, Eugenio Pérez wrote:
> >>>>>>> IOVA tree is also used to track the mappings of virtio-net shadow
> >>>>>>> virtqueue.  This mappings may not match with the GPA->HVA ones.
> >>>>>>>
> >>>>>>> This causes a problem when overlapped regions (different GPA but same
> >>>>>>> translated HVA) exists in the tree, as looking them by HVA will return
> >>>>>>> them twice.  To solve this, create an id member so we can assign unique
> >>>>>>> identifiers (GPA) to the maps.
> >>>>>>>
> >>>>>>> Signed-off-by: Eugenio Pérez <eperezma@redhat.com>
> >>>>>>> ---
> >>>>>>>      include/qemu/iova-tree.h | 5 +++--
> >>>>>>>      util/iova-tree.c         | 3 ++-
> >>>>>>>      2 files changed, 5 insertions(+), 3 deletions(-)
> >>>>>>>
> >>>>>>> diff --git a/include/qemu/iova-tree.h b/include/qemu/iova-tree.h
> >>>>>>> index 2a10a7052e..34ee230e7d 100644
> >>>>>>> --- a/include/qemu/iova-tree.h
> >>>>>>> +++ b/include/qemu/iova-tree.h
> >>>>>>> @@ -36,6 +36,7 @@ typedef struct DMAMap {
> >>>>>>>          hwaddr iova;
> >>>>>>>          hwaddr translated_addr;
> >>>>>>>          hwaddr size;                /* Inclusive */
> >>>>>>> +    uint64_t id;
> >>>>>>>          IOMMUAccessFlags perm;
> >>>>>>>      } QEMU_PACKED DMAMap;
> >>>>>>>      typedef gboolean (*iova_tree_iterator)(DMAMap *map);
> >>>>>>> @@ -100,8 +101,8 @@ const DMAMap *iova_tree_find(const IOVATree *tree, const DMAMap *map);
> >>>>>>>       * @map: the mapping to search
> >>>>>>>       *
> >>>>>>>       * Search for a mapping in the iova tree that translated_addr overlaps with the
> >>>>>>> - * mapping range specified.  Only the first found mapping will be
> >>>>>>> - * returned.
> >>>>>>> + * mapping range specified and map->id is equal.  Only the first found
> >>>>>>> + * mapping will be returned.
> >>>>>>>       *
> >>>>>>>       * Return: DMAMap pointer if found, or NULL if not found.  Note that
> >>>>>>>       * the returned DMAMap pointer is maintained internally.  User should
> >>>>>>> diff --git a/util/iova-tree.c b/util/iova-tree.c
> >>>>>>> index 536789797e..0863e0a3b8 100644
> >>>>>>> --- a/util/iova-tree.c
> >>>>>>> +++ b/util/iova-tree.c
> >>>>>>> @@ -97,7 +97,8 @@ static gboolean iova_tree_find_address_iterator(gpointer key, gpointer value,
> >>>>>>>
> >>>>>>>          needle = args->needle;
> >>>>>>>          if (map->translated_addr + map->size < needle->translated_addr ||
> >>>>>>> -        needle->translated_addr + needle->size < map->translated_addr) {
> >>>>>>> +        needle->translated_addr + needle->size < map->translated_addr ||
> >>>>>>> +        needle->id != map->id) {
> >>>>>> It looks this iterator can also be invoked by SVQ from
> >>>>>> vhost_svq_translate_addr() -> iova_tree_find_iova(), where guest GPA
> >>>>>> space will be searched on without passing in the ID (GPA), and exact
> >>>>>> match for the same GPA range is not actually needed unlike the mapping
> >>>>>> removal case. Could we create an API variant, for the SVQ lookup case
> >>>>>> specifically? Or alternatively, add a special flag, say skip_id_match to
> >>>>>> DMAMap, and the id match check may look like below:
> >>>>>>
> >>>>>> (!needle->skip_id_match && needle->id != map->id)
> >>>>>>
> >>>>>> I think vhost_svq_translate_addr() could just call the API variant or
> >>>>>> pass DMAmap with skip_id_match set to true to svq_iova_tree_find_iova().
> >>>>>>
> >>>>> I think you're totally right. But I'd really like to not complicate
> >>>>> the API of the iova_tree more.
> >>>>>
> >>>>> I think we can look for the hwaddr using memory_region_from_host and
> >>>>> then get the hwaddr. It is another lookup though...
> >>>> Yeah, that will be another means of doing translation without having to
> >>>> complicate the API around iova_tree. I wonder how the lookup through
> >>>> memory_region_from_host() may perform compared to the iova tree one, the
> >>>> former looks to be an O(N) linear search on a linked list while the
> >>>> latter would be roughly O(log N) on an AVL tree?
> >>> Even worse, as the reverse lookup (from QEMU vaddr to SVQ IOVA) is
> >>> linear too. It is not even ordered.
> >> Oh Sorry, I misread the code and I should look for g_tree_foreach ()
> >> instead of g_tree_search_node(). So the former is indeed linear
> >> iteration, but it looks to be ordered?
> >>
> >> https://github.com/GNOME/glib/blob/main/glib/gtree.c#L1115
> > The GPA / IOVA are ordered but we're looking by QEMU's vaddr.
> >
> > If we have these translations:
> > [0x1000, 0x2000] -> [0x10000, 0x11000]
> > [0x2000, 0x3000] -> [0x6000, 0x7000]
> >
> > We will see them in this order, so we cannot stop the search at the first node.
> Yeah, reverse lookup is unordered indeed, anyway.
>
> >
> >>> But apart from this detail you're right, I have the same concerns with
> >>> this solution too. If we see a hard performance regression we could go
> >>> to more complicated solutions, like maintaining a reverse IOVATree in
> >>> vhost-iova-tree too. First RFCs of SVQ did that actually.
> >> Agreed, yeap we can use memory_region_from_host for now.  Any reason why
> >> reverse IOVATree was dropped, lack of users? But now we have one!
> >>
> > No, it is just simplicity. We already have an user in the hot patch in
> > the master branch, vhost_svq_vring_write_descs. But I never profiled
> > enough to find if it is a bottleneck or not to be honest.
> Right, without vIOMMU or a lot of virtqueues / mappings, it's hard to
> profile and see the difference.
> >
> > I'll send the new series by today, thank you for finding these issues!
> Thanks! In case you don't have bandwidth to add back reverse IOVA tree,
> Jonah (cc'ed) may have interest in looking into it.
>

Actually, yes. I've tried to solve it using:
memory_region_get_ram_ptr -> It's hard to get this pointer to work
without messing a lot with IOVATree.
memory_region_find -> I'm totally unable to make it return sections
that make sense
flatview_for_each_range -> It does not return the same
MemoryRegionsection as the listener, not sure why.

The only advance I have is that memory_region_from_host is able to
tell if the vaddr is from the guest or not.

So I'm convinced there must be a way to do it with the memory
subsystem, but I think the best way to do it ATM is to store a
parallel tree with GPA-> SVQ IOVA translations. At removal time, if we
find the entry in this new tree, we can directly remove it by GPA. If
not, assume it is a host-only address like SVQ vrings, and remove by
iterating on vaddr as we do now. It is guaranteed the guest does not
translate to that vaddr and that that vaddr is unique in the tree
anyway.

Does it sound reasonable? Jonah, would you be interested in moving this forward?

Thanks!

> -Siwei
>
>
> >
> >> Thanks,
> >> -Siwei
> >>> Thanks!
> >>>
> >>>> Of course,
> >>>> memory_region_from_host() won't search out of the guest memory space for
> >>>> sure. As this could be on the hot data path I have a little bit
> >>>> hesitance over the potential cost or performance regression this change
> >>>> could bring in, but maybe I'm overthinking it too much...
> >>>>
> >>>> Thanks,
> >>>> -Siwei
> >>>>
> >>>>>> Thanks,
> >>>>>> -Siwei
> >>>>>>>              return false;
> >>>>>>>          }
> >>>>>>>
>
Jonah Palmer April 29, 2024, 11:19 a.m. UTC | #9
On 4/29/24 4:14 AM, Eugenio Perez Martin wrote:
> On Thu, Apr 25, 2024 at 7:44 PM Si-Wei Liu <si-wei.liu@oracle.com> wrote:
>>
>>
>>
>> On 4/24/2024 12:33 AM, Eugenio Perez Martin wrote:
>>> On Wed, Apr 24, 2024 at 12:21 AM Si-Wei Liu <si-wei.liu@oracle.com> wrote:
>>>>
>>>>
>>>> On 4/22/2024 1:49 AM, Eugenio Perez Martin wrote:
>>>>> On Sat, Apr 20, 2024 at 1:50 AM Si-Wei Liu <si-wei.liu@oracle.com> wrote:
>>>>>>
>>>>>> On 4/19/2024 1:29 AM, Eugenio Perez Martin wrote:
>>>>>>> On Thu, Apr 18, 2024 at 10:46 PM Si-Wei Liu <si-wei.liu@oracle.com> wrote:
>>>>>>>> On 4/10/2024 3:03 AM, Eugenio Pérez wrote:
>>>>>>>>> IOVA tree is also used to track the mappings of virtio-net shadow
>>>>>>>>> virtqueue.  This mappings may not match with the GPA->HVA ones.
>>>>>>>>>
>>>>>>>>> This causes a problem when overlapped regions (different GPA but same
>>>>>>>>> translated HVA) exists in the tree, as looking them by HVA will return
>>>>>>>>> them twice.  To solve this, create an id member so we can assign unique
>>>>>>>>> identifiers (GPA) to the maps.
>>>>>>>>>
>>>>>>>>> Signed-off-by: Eugenio Pérez <eperezma@redhat.com>
>>>>>>>>> ---
>>>>>>>>>       include/qemu/iova-tree.h | 5 +++--
>>>>>>>>>       util/iova-tree.c         | 3 ++-
>>>>>>>>>       2 files changed, 5 insertions(+), 3 deletions(-)
>>>>>>>>>
>>>>>>>>> diff --git a/include/qemu/iova-tree.h b/include/qemu/iova-tree.h
>>>>>>>>> index 2a10a7052e..34ee230e7d 100644
>>>>>>>>> --- a/include/qemu/iova-tree.h
>>>>>>>>> +++ b/include/qemu/iova-tree.h
>>>>>>>>> @@ -36,6 +36,7 @@ typedef struct DMAMap {
>>>>>>>>>           hwaddr iova;
>>>>>>>>>           hwaddr translated_addr;
>>>>>>>>>           hwaddr size;                /* Inclusive */
>>>>>>>>> +    uint64_t id;
>>>>>>>>>           IOMMUAccessFlags perm;
>>>>>>>>>       } QEMU_PACKED DMAMap;
>>>>>>>>>       typedef gboolean (*iova_tree_iterator)(DMAMap *map);
>>>>>>>>> @@ -100,8 +101,8 @@ const DMAMap *iova_tree_find(const IOVATree *tree, const DMAMap *map);
>>>>>>>>>        * @map: the mapping to search
>>>>>>>>>        *
>>>>>>>>>        * Search for a mapping in the iova tree that translated_addr overlaps with the
>>>>>>>>> - * mapping range specified.  Only the first found mapping will be
>>>>>>>>> - * returned.
>>>>>>>>> + * mapping range specified and map->id is equal.  Only the first found
>>>>>>>>> + * mapping will be returned.
>>>>>>>>>        *
>>>>>>>>>        * Return: DMAMap pointer if found, or NULL if not found.  Note that
>>>>>>>>>        * the returned DMAMap pointer is maintained internally.  User should
>>>>>>>>> diff --git a/util/iova-tree.c b/util/iova-tree.c
>>>>>>>>> index 536789797e..0863e0a3b8 100644
>>>>>>>>> --- a/util/iova-tree.c
>>>>>>>>> +++ b/util/iova-tree.c
>>>>>>>>> @@ -97,7 +97,8 @@ static gboolean iova_tree_find_address_iterator(gpointer key, gpointer value,
>>>>>>>>>
>>>>>>>>>           needle = args->needle;
>>>>>>>>>           if (map->translated_addr + map->size < needle->translated_addr ||
>>>>>>>>> -        needle->translated_addr + needle->size < map->translated_addr) {
>>>>>>>>> +        needle->translated_addr + needle->size < map->translated_addr ||
>>>>>>>>> +        needle->id != map->id) {
>>>>>>>> It looks this iterator can also be invoked by SVQ from
>>>>>>>> vhost_svq_translate_addr() -> iova_tree_find_iova(), where guest GPA
>>>>>>>> space will be searched on without passing in the ID (GPA), and exact
>>>>>>>> match for the same GPA range is not actually needed unlike the mapping
>>>>>>>> removal case. Could we create an API variant, for the SVQ lookup case
>>>>>>>> specifically? Or alternatively, add a special flag, say skip_id_match to
>>>>>>>> DMAMap, and the id match check may look like below:
>>>>>>>>
>>>>>>>> (!needle->skip_id_match && needle->id != map->id)
>>>>>>>>
>>>>>>>> I think vhost_svq_translate_addr() could just call the API variant or
>>>>>>>> pass DMAmap with skip_id_match set to true to svq_iova_tree_find_iova().
>>>>>>>>
>>>>>>> I think you're totally right. But I'd really like to not complicate
>>>>>>> the API of the iova_tree more.
>>>>>>>
>>>>>>> I think we can look for the hwaddr using memory_region_from_host and
>>>>>>> then get the hwaddr. It is another lookup though...
>>>>>> Yeah, that will be another means of doing translation without having to
>>>>>> complicate the API around iova_tree. I wonder how the lookup through
>>>>>> memory_region_from_host() may perform compared to the iova tree one, the
>>>>>> former looks to be an O(N) linear search on a linked list while the
>>>>>> latter would be roughly O(log N) on an AVL tree?
>>>>> Even worse, as the reverse lookup (from QEMU vaddr to SVQ IOVA) is
>>>>> linear too. It is not even ordered.
>>>> Oh Sorry, I misread the code and I should look for g_tree_foreach ()
>>>> instead of g_tree_search_node(). So the former is indeed linear
>>>> iteration, but it looks to be ordered?
>>>>
>>>> https://urldefense.com/v3/__https://github.com/GNOME/glib/blob/main/glib/gtree.c*L1115__;Iw!!ACWV5N9M2RV99hQ!Ng2rLfRd9tLyNTNocW50Mf5AcxSt0uF0wOdv120djff-z_iAdbujYK-jMi5UC1DZLxb1yLUv2vV0j3wJo8o$
>>> The GPA / IOVA are ordered but we're looking by QEMU's vaddr.
>>>
>>> If we have these translations:
>>> [0x1000, 0x2000] -> [0x10000, 0x11000]
>>> [0x2000, 0x3000] -> [0x6000, 0x7000]
>>>
>>> We will see them in this order, so we cannot stop the search at the first node.
>> Yeah, reverse lookup is unordered indeed, anyway.
>>
>>>
>>>>> But apart from this detail you're right, I have the same concerns with
>>>>> this solution too. If we see a hard performance regression we could go
>>>>> to more complicated solutions, like maintaining a reverse IOVATree in
>>>>> vhost-iova-tree too. First RFCs of SVQ did that actually.
>>>> Agreed, yeap we can use memory_region_from_host for now.  Any reason why
>>>> reverse IOVATree was dropped, lack of users? But now we have one!
>>>>
>>> No, it is just simplicity. We already have an user in the hot patch in
>>> the master branch, vhost_svq_vring_write_descs. But I never profiled
>>> enough to find if it is a bottleneck or not to be honest.
>> Right, without vIOMMU or a lot of virtqueues / mappings, it's hard to
>> profile and see the difference.
>>>
>>> I'll send the new series by today, thank you for finding these issues!
>> Thanks! In case you don't have bandwidth to add back reverse IOVA tree,
>> Jonah (cc'ed) may have interest in looking into it.
>>
> 
> Actually, yes. I've tried to solve it using:
> memory_region_get_ram_ptr -> It's hard to get this pointer to work
> without messing a lot with IOVATree.
> memory_region_find -> I'm totally unable to make it return sections
> that make sense
> flatview_for_each_range -> It does not return the same
> MemoryRegionsection as the listener, not sure why.
> 
> The only advance I have is that memory_region_from_host is able to
> tell if the vaddr is from the guest or not.
> 
> So I'm convinced there must be a way to do it with the memory
> subsystem, but I think the best way to do it ATM is to store a
> parallel tree with GPA-> SVQ IOVA translations. At removal time, if we
> find the entry in this new tree, we can directly remove it by GPA. If
> not, assume it is a host-only address like SVQ vrings, and remove by
> iterating on vaddr as we do now. It is guaranteed the guest does not
> translate to that vaddr and that that vaddr is unique in the tree
> anyway.
> 
> Does it sound reasonable? Jonah, would you be interested in moving this forward?
> 
> Thanks!
> 

Sure, I'd be more than happy to work on this stuff! I can probably get 
started on this either today or tomorrow.

Si-Wei mentioned something about these "reverse IOVATree" patches that 
were dropped; is this relevant to what you're asking here? Is it 
something I should base my work off of?

If there's any other relevant information about this issue that you 
think I should know, let me know. I'll start digging into this ASAP and 
will reach out if I need any guidance. :)

Jonah

>> -Siwei
>>
>>
>>>
>>>> Thanks,
>>>> -Siwei
>>>>> Thanks!
>>>>>
>>>>>> Of course,
>>>>>> memory_region_from_host() won't search out of the guest memory space for
>>>>>> sure. As this could be on the hot data path I have a little bit
>>>>>> hesitance over the potential cost or performance regression this change
>>>>>> could bring in, but maybe I'm overthinking it too much...
>>>>>>
>>>>>> Thanks,
>>>>>> -Siwei
>>>>>>
>>>>>>>> Thanks,
>>>>>>>> -Siwei
>>>>>>>>>               return false;
>>>>>>>>>           }
>>>>>>>>>
>>
>
Si-Wei Liu April 30, 2024, 5:54 a.m. UTC | #10
On 4/29/2024 1:14 AM, Eugenio Perez Martin wrote:
> On Thu, Apr 25, 2024 at 7:44 PM Si-Wei Liu <si-wei.liu@oracle.com> wrote:
>>
>>
>> On 4/24/2024 12:33 AM, Eugenio Perez Martin wrote:
>>> On Wed, Apr 24, 2024 at 12:21 AM Si-Wei Liu <si-wei.liu@oracle.com> wrote:
>>>>
>>>> On 4/22/2024 1:49 AM, Eugenio Perez Martin wrote:
>>>>> On Sat, Apr 20, 2024 at 1:50 AM Si-Wei Liu <si-wei.liu@oracle.com> wrote:
>>>>>> On 4/19/2024 1:29 AM, Eugenio Perez Martin wrote:
>>>>>>> On Thu, Apr 18, 2024 at 10:46 PM Si-Wei Liu <si-wei.liu@oracle.com> wrote:
>>>>>>>> On 4/10/2024 3:03 AM, Eugenio Pérez wrote:
>>>>>>>>> IOVA tree is also used to track the mappings of virtio-net shadow
>>>>>>>>> virtqueue.  This mappings may not match with the GPA->HVA ones.
>>>>>>>>>
>>>>>>>>> This causes a problem when overlapped regions (different GPA but same
>>>>>>>>> translated HVA) exists in the tree, as looking them by HVA will return
>>>>>>>>> them twice.  To solve this, create an id member so we can assign unique
>>>>>>>>> identifiers (GPA) to the maps.
>>>>>>>>>
>>>>>>>>> Signed-off-by: Eugenio Pérez <eperezma@redhat.com>
>>>>>>>>> ---
>>>>>>>>>       include/qemu/iova-tree.h | 5 +++--
>>>>>>>>>       util/iova-tree.c         | 3 ++-
>>>>>>>>>       2 files changed, 5 insertions(+), 3 deletions(-)
>>>>>>>>>
>>>>>>>>> diff --git a/include/qemu/iova-tree.h b/include/qemu/iova-tree.h
>>>>>>>>> index 2a10a7052e..34ee230e7d 100644
>>>>>>>>> --- a/include/qemu/iova-tree.h
>>>>>>>>> +++ b/include/qemu/iova-tree.h
>>>>>>>>> @@ -36,6 +36,7 @@ typedef struct DMAMap {
>>>>>>>>>           hwaddr iova;
>>>>>>>>>           hwaddr translated_addr;
>>>>>>>>>           hwaddr size;                /* Inclusive */
>>>>>>>>> +    uint64_t id;
>>>>>>>>>           IOMMUAccessFlags perm;
>>>>>>>>>       } QEMU_PACKED DMAMap;
>>>>>>>>>       typedef gboolean (*iova_tree_iterator)(DMAMap *map);
>>>>>>>>> @@ -100,8 +101,8 @@ const DMAMap *iova_tree_find(const IOVATree *tree, const DMAMap *map);
>>>>>>>>>        * @map: the mapping to search
>>>>>>>>>        *
>>>>>>>>>        * Search for a mapping in the iova tree that translated_addr overlaps with the
>>>>>>>>> - * mapping range specified.  Only the first found mapping will be
>>>>>>>>> - * returned.
>>>>>>>>> + * mapping range specified and map->id is equal.  Only the first found
>>>>>>>>> + * mapping will be returned.
>>>>>>>>>        *
>>>>>>>>>        * Return: DMAMap pointer if found, or NULL if not found.  Note that
>>>>>>>>>        * the returned DMAMap pointer is maintained internally.  User should
>>>>>>>>> diff --git a/util/iova-tree.c b/util/iova-tree.c
>>>>>>>>> index 536789797e..0863e0a3b8 100644
>>>>>>>>> --- a/util/iova-tree.c
>>>>>>>>> +++ b/util/iova-tree.c
>>>>>>>>> @@ -97,7 +97,8 @@ static gboolean iova_tree_find_address_iterator(gpointer key, gpointer value,
>>>>>>>>>
>>>>>>>>>           needle = args->needle;
>>>>>>>>>           if (map->translated_addr + map->size < needle->translated_addr ||
>>>>>>>>> -        needle->translated_addr + needle->size < map->translated_addr) {
>>>>>>>>> +        needle->translated_addr + needle->size < map->translated_addr ||
>>>>>>>>> +        needle->id != map->id) {
>>>>>>>> It looks this iterator can also be invoked by SVQ from
>>>>>>>> vhost_svq_translate_addr() -> iova_tree_find_iova(), where guest GPA
>>>>>>>> space will be searched on without passing in the ID (GPA), and exact
>>>>>>>> match for the same GPA range is not actually needed unlike the mapping
>>>>>>>> removal case. Could we create an API variant, for the SVQ lookup case
>>>>>>>> specifically? Or alternatively, add a special flag, say skip_id_match to
>>>>>>>> DMAMap, and the id match check may look like below:
>>>>>>>>
>>>>>>>> (!needle->skip_id_match && needle->id != map->id)
>>>>>>>>
>>>>>>>> I think vhost_svq_translate_addr() could just call the API variant or
>>>>>>>> pass DMAmap with skip_id_match set to true to svq_iova_tree_find_iova().
>>>>>>>>
>>>>>>> I think you're totally right. But I'd really like to not complicate
>>>>>>> the API of the iova_tree more.
>>>>>>>
>>>>>>> I think we can look for the hwaddr using memory_region_from_host and
>>>>>>> then get the hwaddr. It is another lookup though...
>>>>>> Yeah, that will be another means of doing translation without having to
>>>>>> complicate the API around iova_tree. I wonder how the lookup through
>>>>>> memory_region_from_host() may perform compared to the iova tree one, the
>>>>>> former looks to be an O(N) linear search on a linked list while the
>>>>>> latter would be roughly O(log N) on an AVL tree?
>>>>> Even worse, as the reverse lookup (from QEMU vaddr to SVQ IOVA) is
>>>>> linear too. It is not even ordered.
>>>> Oh Sorry, I misread the code and I should look for g_tree_foreach ()
>>>> instead of g_tree_search_node(). So the former is indeed linear
>>>> iteration, but it looks to be ordered?
>>>>
>>>> https://github.com/GNOME/glib/blob/main/glib/gtree.c#L1115
>>> The GPA / IOVA are ordered but we're looking by QEMU's vaddr.
>>>
>>> If we have these translations:
>>> [0x1000, 0x2000] -> [0x10000, 0x11000]
>>> [0x2000, 0x3000] -> [0x6000, 0x7000]
>>>
>>> We will see them in this order, so we cannot stop the search at the first node.
>> Yeah, reverse lookup is unordered indeed, anyway.
>>
>>>>> But apart from this detail you're right, I have the same concerns with
>>>>> this solution too. If we see a hard performance regression we could go
>>>>> to more complicated solutions, like maintaining a reverse IOVATree in
>>>>> vhost-iova-tree too. First RFCs of SVQ did that actually.
>>>> Agreed, yeap we can use memory_region_from_host for now.  Any reason why
>>>> reverse IOVATree was dropped, lack of users? But now we have one!
>>>>
>>> No, it is just simplicity. We already have an user in the hot patch in
>>> the master branch, vhost_svq_vring_write_descs. But I never profiled
>>> enough to find if it is a bottleneck or not to be honest.
>> Right, without vIOMMU or a lot of virtqueues / mappings, it's hard to
>> profile and see the difference.
>>> I'll send the new series by today, thank you for finding these issues!
>> Thanks! In case you don't have bandwidth to add back reverse IOVA tree,
>> Jonah (cc'ed) may have interest in looking into it.
>>
> Actually, yes. I've tried to solve it using:
> memory_region_get_ram_ptr -> It's hard to get this pointer to work
> without messing a lot with IOVATree.
> memory_region_find -> I'm totally unable to make it return sections
> that make sense
> flatview_for_each_range -> It does not return the same
> MemoryRegionsection as the listener, not sure why.
Ouch, thank you for the summary of attempts that were done earlier.
> The only advance I have is that memory_region_from_host is able to
> tell if the vaddr is from the guest or not.
Hmmm, then it won't be too useful without a direct means to identifying 
the exact memory region associated with the iova that is being mapped. 
And, this additional indirection seems introduce a tiny bit of more 
latency in the reverse lookup routine (should not be a scalability issue 
though if it's a linear search)?

> So I'm convinced there must be a way to do it with the memory
> subsystem, but I think the best way to do it ATM is to store a
> parallel tree with GPA-> SVQ IOVA translations. At removal time, if we
> find the entry in this new tree, we can directly remove it by GPA. If
> not, assume it is a host-only address like SVQ vrings, and remove by
> iterating on vaddr as we do now.
Yeah, this could work I think. On the other hand, given that we are now 
trying to improve it, I wonder if possible to come up with a fast 
version for the SVQ (host-only address) case without having to look up 
twice? SVQ callers should be able to tell apart from the guest case 
where GPA -> IOVA translation doesn't exist? Or just maintain a parallel 
tree with HVA -> IOVA translations for SVQ reverse lookup only? I feel 
SVQ mappings may be worth a separate fast lookup path - unlike guest 
mappings, the insertion, lookup and removal for SVQ mappings seem 
unavoidable during the migration downtime path.

>   It is guaranteed the guest does not
> translate to that vaddr and that that vaddr is unique in the tree
> anyway.
>
> Does it sound reasonable? Jonah, would you be interested in moving this forward?
My thought would be that the reverse IOVA tree stuff can be added as a 
follow-up optimization right after for extended scalability, but for now 
as the interim, we may still need some form of simple fix, so as to 
quickly unblock the other dependent work built on top of this one and 
the early pinning series [1]. With it said, I'm completely fine if 
performing the reverse lookup through linear tree walk e.g. 
g_tree_foreach(), that should suffice small VM configs with just a 
couple of queues and limited number of memory regions. Going forward, to 
address the scalability bottleneck, Jonah could just replace the 
corresponding API call with the one built on top of reverse IOVA tree (I 
presume the use of these iova tree APIs is kind of internal that only 
limits to SVQ and vhost-vdpa subsystems) once he gets there, and then 
eliminate the other API variants that will no longer be in use. What do 
you think about this idea / plan?

Thanks,
-Siwei

[1] https://lists.nongnu.org/archive/html/qemu-devel/2024-04/msg00079.html

>
> Thanks!
>
>> -Siwei
>>
>>
>>>> Thanks,
>>>> -Siwei
>>>>> Thanks!
>>>>>
>>>>>> Of course,
>>>>>> memory_region_from_host() won't search out of the guest memory space for
>>>>>> sure. As this could be on the hot data path I have a little bit
>>>>>> hesitance over the potential cost or performance regression this change
>>>>>> could bring in, but maybe I'm overthinking it too much...
>>>>>>
>>>>>> Thanks,
>>>>>> -Siwei
>>>>>>
>>>>>>>> Thanks,
>>>>>>>> -Siwei
>>>>>>>>>               return false;
>>>>>>>>>           }
>>>>>>>>>
Eugenio Perez Martin April 30, 2024, 5:19 p.m. UTC | #11
On Tue, Apr 30, 2024 at 7:55 AM Si-Wei Liu <si-wei.liu@oracle.com> wrote:
>
>
>
> On 4/29/2024 1:14 AM, Eugenio Perez Martin wrote:
> > On Thu, Apr 25, 2024 at 7:44 PM Si-Wei Liu <si-wei.liu@oracle.com> wrote:
> >>
> >>
> >> On 4/24/2024 12:33 AM, Eugenio Perez Martin wrote:
> >>> On Wed, Apr 24, 2024 at 12:21 AM Si-Wei Liu <si-wei.liu@oracle.com> wrote:
> >>>>
> >>>> On 4/22/2024 1:49 AM, Eugenio Perez Martin wrote:
> >>>>> On Sat, Apr 20, 2024 at 1:50 AM Si-Wei Liu <si-wei.liu@oracle.com> wrote:
> >>>>>> On 4/19/2024 1:29 AM, Eugenio Perez Martin wrote:
> >>>>>>> On Thu, Apr 18, 2024 at 10:46 PM Si-Wei Liu <si-wei.liu@oracle.com> wrote:
> >>>>>>>> On 4/10/2024 3:03 AM, Eugenio Pérez wrote:
> >>>>>>>>> IOVA tree is also used to track the mappings of virtio-net shadow
> >>>>>>>>> virtqueue.  This mappings may not match with the GPA->HVA ones.
> >>>>>>>>>
> >>>>>>>>> This causes a problem when overlapped regions (different GPA but same
> >>>>>>>>> translated HVA) exists in the tree, as looking them by HVA will return
> >>>>>>>>> them twice.  To solve this, create an id member so we can assign unique
> >>>>>>>>> identifiers (GPA) to the maps.
> >>>>>>>>>
> >>>>>>>>> Signed-off-by: Eugenio Pérez <eperezma@redhat.com>
> >>>>>>>>> ---
> >>>>>>>>>       include/qemu/iova-tree.h | 5 +++--
> >>>>>>>>>       util/iova-tree.c         | 3 ++-
> >>>>>>>>>       2 files changed, 5 insertions(+), 3 deletions(-)
> >>>>>>>>>
> >>>>>>>>> diff --git a/include/qemu/iova-tree.h b/include/qemu/iova-tree.h
> >>>>>>>>> index 2a10a7052e..34ee230e7d 100644
> >>>>>>>>> --- a/include/qemu/iova-tree.h
> >>>>>>>>> +++ b/include/qemu/iova-tree.h
> >>>>>>>>> @@ -36,6 +36,7 @@ typedef struct DMAMap {
> >>>>>>>>>           hwaddr iova;
> >>>>>>>>>           hwaddr translated_addr;
> >>>>>>>>>           hwaddr size;                /* Inclusive */
> >>>>>>>>> +    uint64_t id;
> >>>>>>>>>           IOMMUAccessFlags perm;
> >>>>>>>>>       } QEMU_PACKED DMAMap;
> >>>>>>>>>       typedef gboolean (*iova_tree_iterator)(DMAMap *map);
> >>>>>>>>> @@ -100,8 +101,8 @@ const DMAMap *iova_tree_find(const IOVATree *tree, const DMAMap *map);
> >>>>>>>>>        * @map: the mapping to search
> >>>>>>>>>        *
> >>>>>>>>>        * Search for a mapping in the iova tree that translated_addr overlaps with the
> >>>>>>>>> - * mapping range specified.  Only the first found mapping will be
> >>>>>>>>> - * returned.
> >>>>>>>>> + * mapping range specified and map->id is equal.  Only the first found
> >>>>>>>>> + * mapping will be returned.
> >>>>>>>>>        *
> >>>>>>>>>        * Return: DMAMap pointer if found, or NULL if not found.  Note that
> >>>>>>>>>        * the returned DMAMap pointer is maintained internally.  User should
> >>>>>>>>> diff --git a/util/iova-tree.c b/util/iova-tree.c
> >>>>>>>>> index 536789797e..0863e0a3b8 100644
> >>>>>>>>> --- a/util/iova-tree.c
> >>>>>>>>> +++ b/util/iova-tree.c
> >>>>>>>>> @@ -97,7 +97,8 @@ static gboolean iova_tree_find_address_iterator(gpointer key, gpointer value,
> >>>>>>>>>
> >>>>>>>>>           needle = args->needle;
> >>>>>>>>>           if (map->translated_addr + map->size < needle->translated_addr ||
> >>>>>>>>> -        needle->translated_addr + needle->size < map->translated_addr) {
> >>>>>>>>> +        needle->translated_addr + needle->size < map->translated_addr ||
> >>>>>>>>> +        needle->id != map->id) {
> >>>>>>>> It looks this iterator can also be invoked by SVQ from
> >>>>>>>> vhost_svq_translate_addr() -> iova_tree_find_iova(), where guest GPA
> >>>>>>>> space will be searched on without passing in the ID (GPA), and exact
> >>>>>>>> match for the same GPA range is not actually needed unlike the mapping
> >>>>>>>> removal case. Could we create an API variant, for the SVQ lookup case
> >>>>>>>> specifically? Or alternatively, add a special flag, say skip_id_match to
> >>>>>>>> DMAMap, and the id match check may look like below:
> >>>>>>>>
> >>>>>>>> (!needle->skip_id_match && needle->id != map->id)
> >>>>>>>>
> >>>>>>>> I think vhost_svq_translate_addr() could just call the API variant or
> >>>>>>>> pass DMAmap with skip_id_match set to true to svq_iova_tree_find_iova().
> >>>>>>>>
> >>>>>>> I think you're totally right. But I'd really like to not complicate
> >>>>>>> the API of the iova_tree more.
> >>>>>>>
> >>>>>>> I think we can look for the hwaddr using memory_region_from_host and
> >>>>>>> then get the hwaddr. It is another lookup though...
> >>>>>> Yeah, that will be another means of doing translation without having to
> >>>>>> complicate the API around iova_tree. I wonder how the lookup through
> >>>>>> memory_region_from_host() may perform compared to the iova tree one, the
> >>>>>> former looks to be an O(N) linear search on a linked list while the
> >>>>>> latter would be roughly O(log N) on an AVL tree?
> >>>>> Even worse, as the reverse lookup (from QEMU vaddr to SVQ IOVA) is
> >>>>> linear too. It is not even ordered.
> >>>> Oh Sorry, I misread the code and I should look for g_tree_foreach ()
> >>>> instead of g_tree_search_node(). So the former is indeed linear
> >>>> iteration, but it looks to be ordered?
> >>>>
> >>>> https://github.com/GNOME/glib/blob/main/glib/gtree.c#L1115
> >>> The GPA / IOVA are ordered but we're looking by QEMU's vaddr.
> >>>
> >>> If we have these translations:
> >>> [0x1000, 0x2000] -> [0x10000, 0x11000]
> >>> [0x2000, 0x3000] -> [0x6000, 0x7000]
> >>>
> >>> We will see them in this order, so we cannot stop the search at the first node.
> >> Yeah, reverse lookup is unordered indeed, anyway.
> >>
> >>>>> But apart from this detail you're right, I have the same concerns with
> >>>>> this solution too. If we see a hard performance regression we could go
> >>>>> to more complicated solutions, like maintaining a reverse IOVATree in
> >>>>> vhost-iova-tree too. First RFCs of SVQ did that actually.
> >>>> Agreed, yeap we can use memory_region_from_host for now.  Any reason why
> >>>> reverse IOVATree was dropped, lack of users? But now we have one!
> >>>>
> >>> No, it is just simplicity. We already have an user in the hot patch in
> >>> the master branch, vhost_svq_vring_write_descs. But I never profiled
> >>> enough to find if it is a bottleneck or not to be honest.
> >> Right, without vIOMMU or a lot of virtqueues / mappings, it's hard to
> >> profile and see the difference.
> >>> I'll send the new series by today, thank you for finding these issues!
> >> Thanks! In case you don't have bandwidth to add back reverse IOVA tree,
> >> Jonah (cc'ed) may have interest in looking into it.
> >>
> > Actually, yes. I've tried to solve it using:
> > memory_region_get_ram_ptr -> It's hard to get this pointer to work
> > without messing a lot with IOVATree.
> > memory_region_find -> I'm totally unable to make it return sections
> > that make sense
> > flatview_for_each_range -> It does not return the same
> > MemoryRegionsection as the listener, not sure why.
> Ouch, thank you for the summary of attempts that were done earlier.
> > The only advance I have is that memory_region_from_host is able to
> > tell if the vaddr is from the guest or not.
> Hmmm, then it won't be too useful without a direct means to identifying
> the exact memory region associated with the iova that is being mapped.
> And, this additional indirection seems introduce a tiny bit of more
> latency in the reverse lookup routine (should not be a scalability issue
> though if it's a linear search)?
>

I didn't measure, but I guess yes it might. OTOH these structs may be
cached because virtqueue_pop just looked for them.

> > So I'm convinced there must be a way to do it with the memory
> > subsystem, but I think the best way to do it ATM is to store a
> > parallel tree with GPA-> SVQ IOVA translations. At removal time, if we
> > find the entry in this new tree, we can directly remove it by GPA. If
> > not, assume it is a host-only address like SVQ vrings, and remove by
> > iterating on vaddr as we do now.
> Yeah, this could work I think. On the other hand, given that we are now
> trying to improve it, I wonder if possible to come up with a fast
> version for the SVQ (host-only address) case without having to look up
> twice? SVQ callers should be able to tell apart from the guest case
> where GPA -> IOVA translation doesn't exist? Or just maintain a parallel
> tree with HVA -> IOVA translations for SVQ reverse lookup only? I feel
> SVQ mappings may be worth a separate fast lookup path - unlike guest
> mappings, the insertion, lookup and removal for SVQ mappings seem
> unavoidable during the migration downtime path.
>

I think the ideal order is the opposite actually. So:
1) Try for the NIC to support _F_VRING_ASID, no translation needed by QEMU
2) Try reverse lookup from HVA to GPA. Since dataplane should fit
this, we should test this first
3) Look in SVQ host-only entries (SVQ, current shadow CVQ). It is the
control VQ, speed is not so important.

Overlapping regions may return the wrong SVQ IOVA though. We should
take extra care to make sure these are correctly handled. I mean,
there are valid translations in the tree unless the driver is buggy,
just may need to span many translations.

> >   It is guaranteed the guest does not
> > translate to that vaddr and that that vaddr is unique in the tree
> > anyway.
> >
> > Does it sound reasonable? Jonah, would you be interested in moving this forward?
> My thought would be that the reverse IOVA tree stuff can be added as a
> follow-up optimization right after for extended scalability, but for now
> as the interim, we may still need some form of simple fix, so as to
> quickly unblock the other dependent work built on top of this one and
> the early pinning series [1]. With it said, I'm completely fine if
> performing the reverse lookup through linear tree walk e.g.
> g_tree_foreach(), that should suffice small VM configs with just a
> couple of queues and limited number of memory regions. Going forward, to
> address the scalability bottleneck, Jonah could just replace the
> corresponding API call with the one built on top of reverse IOVA tree (I
> presume the use of these iova tree APIs is kind of internal that only
> limits to SVQ and vhost-vdpa subsystems) once he gets there, and then
> eliminate the other API variants that will no longer be in use. What do
> you think about this idea / plan?
>

Yeah it makes sense to me. Hopefully we can even get rid of the id member.

> Thanks,
> -Siwei
>
> [1] https://lists.nongnu.org/archive/html/qemu-devel/2024-04/msg00079.html
>
> >
> > Thanks!
> >
> >> -Siwei
> >>
> >>
> >>>> Thanks,
> >>>> -Siwei
> >>>>> Thanks!
> >>>>>
> >>>>>> Of course,
> >>>>>> memory_region_from_host() won't search out of the guest memory space for
> >>>>>> sure. As this could be on the hot data path I have a little bit
> >>>>>> hesitance over the potential cost or performance regression this change
> >>>>>> could bring in, but maybe I'm overthinking it too much...
> >>>>>>
> >>>>>> Thanks,
> >>>>>> -Siwei
> >>>>>>
> >>>>>>>> Thanks,
> >>>>>>>> -Siwei
> >>>>>>>>>               return false;
> >>>>>>>>>           }
> >>>>>>>>>
>
Eugenio Perez Martin April 30, 2024, 6:11 p.m. UTC | #12
On Mon, Apr 29, 2024 at 1:19 PM Jonah Palmer <jonah.palmer@oracle.com> wrote:
>
>
>
> On 4/29/24 4:14 AM, Eugenio Perez Martin wrote:
> > On Thu, Apr 25, 2024 at 7:44 PM Si-Wei Liu <si-wei.liu@oracle.com> wrote:
> >>
> >>
> >>
> >> On 4/24/2024 12:33 AM, Eugenio Perez Martin wrote:
> >>> On Wed, Apr 24, 2024 at 12:21 AM Si-Wei Liu <si-wei.liu@oracle.com> wrote:
> >>>>
> >>>>
> >>>> On 4/22/2024 1:49 AM, Eugenio Perez Martin wrote:
> >>>>> On Sat, Apr 20, 2024 at 1:50 AM Si-Wei Liu <si-wei.liu@oracle.com> wrote:
> >>>>>>
> >>>>>> On 4/19/2024 1:29 AM, Eugenio Perez Martin wrote:
> >>>>>>> On Thu, Apr 18, 2024 at 10:46 PM Si-Wei Liu <si-wei.liu@oracle.com> wrote:
> >>>>>>>> On 4/10/2024 3:03 AM, Eugenio Pérez wrote:
> >>>>>>>>> IOVA tree is also used to track the mappings of virtio-net shadow
> >>>>>>>>> virtqueue.  This mappings may not match with the GPA->HVA ones.
> >>>>>>>>>
> >>>>>>>>> This causes a problem when overlapped regions (different GPA but same
> >>>>>>>>> translated HVA) exists in the tree, as looking them by HVA will return
> >>>>>>>>> them twice.  To solve this, create an id member so we can assign unique
> >>>>>>>>> identifiers (GPA) to the maps.
> >>>>>>>>>
> >>>>>>>>> Signed-off-by: Eugenio Pérez <eperezma@redhat.com>
> >>>>>>>>> ---
> >>>>>>>>>       include/qemu/iova-tree.h | 5 +++--
> >>>>>>>>>       util/iova-tree.c         | 3 ++-
> >>>>>>>>>       2 files changed, 5 insertions(+), 3 deletions(-)
> >>>>>>>>>
> >>>>>>>>> diff --git a/include/qemu/iova-tree.h b/include/qemu/iova-tree.h
> >>>>>>>>> index 2a10a7052e..34ee230e7d 100644
> >>>>>>>>> --- a/include/qemu/iova-tree.h
> >>>>>>>>> +++ b/include/qemu/iova-tree.h
> >>>>>>>>> @@ -36,6 +36,7 @@ typedef struct DMAMap {
> >>>>>>>>>           hwaddr iova;
> >>>>>>>>>           hwaddr translated_addr;
> >>>>>>>>>           hwaddr size;                /* Inclusive */
> >>>>>>>>> +    uint64_t id;
> >>>>>>>>>           IOMMUAccessFlags perm;
> >>>>>>>>>       } QEMU_PACKED DMAMap;
> >>>>>>>>>       typedef gboolean (*iova_tree_iterator)(DMAMap *map);
> >>>>>>>>> @@ -100,8 +101,8 @@ const DMAMap *iova_tree_find(const IOVATree *tree, const DMAMap *map);
> >>>>>>>>>        * @map: the mapping to search
> >>>>>>>>>        *
> >>>>>>>>>        * Search for a mapping in the iova tree that translated_addr overlaps with the
> >>>>>>>>> - * mapping range specified.  Only the first found mapping will be
> >>>>>>>>> - * returned.
> >>>>>>>>> + * mapping range specified and map->id is equal.  Only the first found
> >>>>>>>>> + * mapping will be returned.
> >>>>>>>>>        *
> >>>>>>>>>        * Return: DMAMap pointer if found, or NULL if not found.  Note that
> >>>>>>>>>        * the returned DMAMap pointer is maintained internally.  User should
> >>>>>>>>> diff --git a/util/iova-tree.c b/util/iova-tree.c
> >>>>>>>>> index 536789797e..0863e0a3b8 100644
> >>>>>>>>> --- a/util/iova-tree.c
> >>>>>>>>> +++ b/util/iova-tree.c
> >>>>>>>>> @@ -97,7 +97,8 @@ static gboolean iova_tree_find_address_iterator(gpointer key, gpointer value,
> >>>>>>>>>
> >>>>>>>>>           needle = args->needle;
> >>>>>>>>>           if (map->translated_addr + map->size < needle->translated_addr ||
> >>>>>>>>> -        needle->translated_addr + needle->size < map->translated_addr) {
> >>>>>>>>> +        needle->translated_addr + needle->size < map->translated_addr ||
> >>>>>>>>> +        needle->id != map->id) {
> >>>>>>>> It looks this iterator can also be invoked by SVQ from
> >>>>>>>> vhost_svq_translate_addr() -> iova_tree_find_iova(), where guest GPA
> >>>>>>>> space will be searched on without passing in the ID (GPA), and exact
> >>>>>>>> match for the same GPA range is not actually needed unlike the mapping
> >>>>>>>> removal case. Could we create an API variant, for the SVQ lookup case
> >>>>>>>> specifically? Or alternatively, add a special flag, say skip_id_match to
> >>>>>>>> DMAMap, and the id match check may look like below:
> >>>>>>>>
> >>>>>>>> (!needle->skip_id_match && needle->id != map->id)
> >>>>>>>>
> >>>>>>>> I think vhost_svq_translate_addr() could just call the API variant or
> >>>>>>>> pass DMAmap with skip_id_match set to true to svq_iova_tree_find_iova().
> >>>>>>>>
> >>>>>>> I think you're totally right. But I'd really like to not complicate
> >>>>>>> the API of the iova_tree more.
> >>>>>>>
> >>>>>>> I think we can look for the hwaddr using memory_region_from_host and
> >>>>>>> then get the hwaddr. It is another lookup though...
> >>>>>> Yeah, that will be another means of doing translation without having to
> >>>>>> complicate the API around iova_tree. I wonder how the lookup through
> >>>>>> memory_region_from_host() may perform compared to the iova tree one, the
> >>>>>> former looks to be an O(N) linear search on a linked list while the
> >>>>>> latter would be roughly O(log N) on an AVL tree?
> >>>>> Even worse, as the reverse lookup (from QEMU vaddr to SVQ IOVA) is
> >>>>> linear too. It is not even ordered.
> >>>> Oh Sorry, I misread the code and I should look for g_tree_foreach ()
> >>>> instead of g_tree_search_node(). So the former is indeed linear
> >>>> iteration, but it looks to be ordered?
> >>>>
> >>>> https://urldefense.com/v3/__https://github.com/GNOME/glib/blob/main/glib/gtree.c*L1115__;Iw!!ACWV5N9M2RV99hQ!Ng2rLfRd9tLyNTNocW50Mf5AcxSt0uF0wOdv120djff-z_iAdbujYK-jMi5UC1DZLxb1yLUv2vV0j3wJo8o$
> >>> The GPA / IOVA are ordered but we're looking by QEMU's vaddr.
> >>>
> >>> If we have these translations:
> >>> [0x1000, 0x2000] -> [0x10000, 0x11000]
> >>> [0x2000, 0x3000] -> [0x6000, 0x7000]
> >>>
> >>> We will see them in this order, so we cannot stop the search at the first node.
> >> Yeah, reverse lookup is unordered indeed, anyway.
> >>
> >>>
> >>>>> But apart from this detail you're right, I have the same concerns with
> >>>>> this solution too. If we see a hard performance regression we could go
> >>>>> to more complicated solutions, like maintaining a reverse IOVATree in
> >>>>> vhost-iova-tree too. First RFCs of SVQ did that actually.
> >>>> Agreed, yeap we can use memory_region_from_host for now.  Any reason why
> >>>> reverse IOVATree was dropped, lack of users? But now we have one!
> >>>>
> >>> No, it is just simplicity. We already have an user in the hot patch in
> >>> the master branch, vhost_svq_vring_write_descs. But I never profiled
> >>> enough to find if it is a bottleneck or not to be honest.
> >> Right, without vIOMMU or a lot of virtqueues / mappings, it's hard to
> >> profile and see the difference.
> >>>
> >>> I'll send the new series by today, thank you for finding these issues!
> >> Thanks! In case you don't have bandwidth to add back reverse IOVA tree,
> >> Jonah (cc'ed) may have interest in looking into it.
> >>
> >
> > Actually, yes. I've tried to solve it using:
> > memory_region_get_ram_ptr -> It's hard to get this pointer to work
> > without messing a lot with IOVATree.
> > memory_region_find -> I'm totally unable to make it return sections
> > that make sense
> > flatview_for_each_range -> It does not return the same
> > MemoryRegionsection as the listener, not sure why.
> >
> > The only advance I have is that memory_region_from_host is able to
> > tell if the vaddr is from the guest or not.
> >
> > So I'm convinced there must be a way to do it with the memory
> > subsystem, but I think the best way to do it ATM is to store a
> > parallel tree with GPA-> SVQ IOVA translations. At removal time, if we
> > find the entry in this new tree, we can directly remove it by GPA. If
> > not, assume it is a host-only address like SVQ vrings, and remove by
> > iterating on vaddr as we do now. It is guaranteed the guest does not
> > translate to that vaddr and that that vaddr is unique in the tree
> > anyway.
> >
> > Does it sound reasonable? Jonah, would you be interested in moving this forward?
> >
> > Thanks!
> >
>
> Sure, I'd be more than happy to work on this stuff! I can probably get
> started on this either today or tomorrow.
>
> Si-Wei mentioned something about these "reverse IOVATree" patches that
> were dropped;

The patches implementing the reverse IOVA tree were never created /
posted, just in case you try to look for them.


> is this relevant to what you're asking here? Is it
> something I should base my work off of?
>

So these patches work ok for adding and removing maps. We assign ids,
which is the gpa of the memory region that the listener receives. The
bad news is that SVQ also needs this id to look for the right
translation at vhost_svq_translate_addr, so this series is not
complete. You can find the
vhost_iova_tree_find_iova()->iova_tree_find_iova() call there.

The easiest solution is the reverse IOVA tree of HVA -> SVQ IOVA. It
is also the less elegant and (potentially) the less performant, as it
includes duplicating information that QEMU already has, and a
potentially linear search.

David Hildenbrand (CCed) proposed to try iterating through RAMBlocks.
I guess qemu_ram_block_from_host() could return a block where
block->offset is the id of the map?

It would be great to try this approach. If you don't have the bandwith
for this, going directly for the reverse iova tree is also a valid
solution.

Thanks!

> If there's any other relevant information about this issue that you
> think I should know, let me know. I'll start digging into this ASAP and
> will reach out if I need any guidance. :)
>
> Jonah
>
> >> -Siwei
> >>
> >>
> >>>
> >>>> Thanks,
> >>>> -Siwei
> >>>>> Thanks!
> >>>>>
> >>>>>> Of course,
> >>>>>> memory_region_from_host() won't search out of the guest memory space for
> >>>>>> sure. As this could be on the hot data path I have a little bit
> >>>>>> hesitance over the potential cost or performance regression this change
> >>>>>> could bring in, but maybe I'm overthinking it too much...
> >>>>>>
> >>>>>> Thanks,
> >>>>>> -Siwei
> >>>>>>
> >>>>>>>> Thanks,
> >>>>>>>> -Siwei
> >>>>>>>>>               return false;
> >>>>>>>>>           }
> >>>>>>>>>
> >>
> >
>
diff mbox series

Patch

diff --git a/include/qemu/iova-tree.h b/include/qemu/iova-tree.h
index 2a10a7052e..34ee230e7d 100644
--- a/include/qemu/iova-tree.h
+++ b/include/qemu/iova-tree.h
@@ -36,6 +36,7 @@  typedef struct DMAMap {
     hwaddr iova;
     hwaddr translated_addr;
     hwaddr size;                /* Inclusive */
+    uint64_t id;
     IOMMUAccessFlags perm;
 } QEMU_PACKED DMAMap;
 typedef gboolean (*iova_tree_iterator)(DMAMap *map);
@@ -100,8 +101,8 @@  const DMAMap *iova_tree_find(const IOVATree *tree, const DMAMap *map);
  * @map: the mapping to search
  *
  * Search for a mapping in the iova tree that translated_addr overlaps with the
- * mapping range specified.  Only the first found mapping will be
- * returned.
+ * mapping range specified and map->id is equal.  Only the first found
+ * mapping will be returned.
  *
  * Return: DMAMap pointer if found, or NULL if not found.  Note that
  * the returned DMAMap pointer is maintained internally.  User should
diff --git a/util/iova-tree.c b/util/iova-tree.c
index 536789797e..0863e0a3b8 100644
--- a/util/iova-tree.c
+++ b/util/iova-tree.c
@@ -97,7 +97,8 @@  static gboolean iova_tree_find_address_iterator(gpointer key, gpointer value,
 
     needle = args->needle;
     if (map->translated_addr + map->size < needle->translated_addr ||
-        needle->translated_addr + needle->size < map->translated_addr) {
+        needle->translated_addr + needle->size < map->translated_addr ||
+        needle->id != map->id) {
         return false;
     }