diff mbox series

[v2] physmem: avoid bounce buffer too small

Message ID 20240228125939.56925-1-heinrich.schuchardt@canonical.com (mailing list archive)
State New, archived
Headers show
Series [v2] physmem: avoid bounce buffer too small | expand

Commit Message

Heinrich Schuchardt Feb. 28, 2024, 12:59 p.m. UTC
virtqueue_map_desc() is called with values of sz exceeding that may exceed
TARGET_PAGE_SIZE. sz = 0x2800 has been observed.

We only support a single bounce buffer. We have to avoid
virtqueue_map_desc() calling address_space_map() multiple times. Otherwise
we see an error

    qemu: virtio: bogus descriptor or out of resources

Increase the minimum size of the bounce buffer to 0x10000 which matches
the largest value of TARGET_PAGE_SIZE for all architectures.

Signed-off-by: Heinrich Schuchardt <heinrich.schuchardt@canonical.com>
---
v2:
	remove unrelated change
---
 system/physmem.c | 8 ++++++--
 1 file changed, 6 insertions(+), 2 deletions(-)

Comments

Philippe Mathieu-Daudé Feb. 28, 2024, 3:06 p.m. UTC | #1
Hi Heinrich,

On 28/2/24 13:59, Heinrich Schuchardt wrote:
> virtqueue_map_desc() is called with values of sz exceeding that may exceed
> TARGET_PAGE_SIZE. sz = 0x2800 has been observed.
> 
> We only support a single bounce buffer. We have to avoid
> virtqueue_map_desc() calling address_space_map() multiple times. Otherwise
> we see an error
> 
>      qemu: virtio: bogus descriptor or out of resources
> 
> Increase the minimum size of the bounce buffer to 0x10000 which matches
> the largest value of TARGET_PAGE_SIZE for all architectures.
> 
> Signed-off-by: Heinrich Schuchardt <heinrich.schuchardt@canonical.com>
> ---
> v2:
> 	remove unrelated change
> ---
>   system/physmem.c | 8 ++++++--
>   1 file changed, 6 insertions(+), 2 deletions(-)
> 
> diff --git a/system/physmem.c b/system/physmem.c
> index e3ebc19eef..3c82da1c86 100644
> --- a/system/physmem.c
> +++ b/system/physmem.c
> @@ -3151,8 +3151,12 @@ void *address_space_map(AddressSpace *as,
>               *plen = 0;
>               return NULL;
>           }
> -        /* Avoid unbounded allocations */
> -        l = MIN(l, TARGET_PAGE_SIZE);
> +        /*
> +         * There is only one bounce buffer. The largest occuring value of
> +         * parameter sz of virtqueue_map_desc() must fit into the bounce
> +         * buffer.
> +         */
> +        l = MIN(l, 0x10000);

Please define this magic value. Maybe ANY_TARGET_PAGE_SIZE or
TARGETS_BIGGEST_PAGE_SIZE?

Then along:
   QEMU_BUILD_BUG_ON(TARGET_PAGE_SIZE <= TARGETS_BIGGEST_PAGE_SIZE);

>           bounce.buffer = qemu_memalign(TARGET_PAGE_SIZE, l);
>           bounce.addr = addr;
>           bounce.len = l;
Heinrich Schuchardt Feb. 28, 2024, 6:27 p.m. UTC | #2
On 28.02.24 16:06, Philippe Mathieu-Daudé wrote:
> Hi Heinrich,
> 
> On 28/2/24 13:59, Heinrich Schuchardt wrote:
>> virtqueue_map_desc() is called with values of sz exceeding that may 
>> exceed
>> TARGET_PAGE_SIZE. sz = 0x2800 has been observed.
>>
>> We only support a single bounce buffer. We have to avoid
>> virtqueue_map_desc() calling address_space_map() multiple times. 
>> Otherwise
>> we see an error
>>
>>      qemu: virtio: bogus descriptor or out of resources
>>
>> Increase the minimum size of the bounce buffer to 0x10000 which matches
>> the largest value of TARGET_PAGE_SIZE for all architectures.
>>
>> Signed-off-by: Heinrich Schuchardt <heinrich.schuchardt@canonical.com>
>> ---
>> v2:
>>     remove unrelated change
>> ---
>>   system/physmem.c | 8 ++++++--
>>   1 file changed, 6 insertions(+), 2 deletions(-)
>>
>> diff --git a/system/physmem.c b/system/physmem.c
>> index e3ebc19eef..3c82da1c86 100644
>> --- a/system/physmem.c
>> +++ b/system/physmem.c
>> @@ -3151,8 +3151,12 @@ void *address_space_map(AddressSpace *as,
>>               *plen = 0;
>>               return NULL;
>>           }
>> -        /* Avoid unbounded allocations */
>> -        l = MIN(l, TARGET_PAGE_SIZE);
>> +        /*
>> +         * There is only one bounce buffer. The largest occuring 
>> value of
>> +         * parameter sz of virtqueue_map_desc() must fit into the bounce
>> +         * buffer.
>> +         */
>> +        l = MIN(l, 0x10000);
> 
> Please define this magic value. Maybe ANY_TARGET_PAGE_SIZE or
> TARGETS_BIGGEST_PAGE_SIZE?
> 
> Then along:
>    QEMU_BUILD_BUG_ON(TARGET_PAGE_SIZE <= TARGETS_BIGGEST_PAGE_SIZE);

Thank you Philippe for reviewing.

TARGETS_BIGGEST_PAGE_SIZE does not fit as the value is not driven by the 
page size.
How about MIN_BOUNCE_BUFFER_SIZE?
Is include/exec/memory.h the right include for the constant?

I don't think that TARGET_PAGE_SIZE has any relevance for setting the 
bounce buffer size. I only mentioned it to say that we are not 
decreasing the value on any existing architecture.

I don't know why TARGET_PAGE_SIZE ever got into this piece of code. 
e3127ae0cdcd ("exec: reorganize address_space_map") does not provide a 
reason for this choice. Maybe Paolo remembers.

Best regards

Heinrich

> 
>>           bounce.buffer = qemu_memalign(TARGET_PAGE_SIZE, l);
>>           bounce.addr = addr;
>>           bounce.len = l;
>
Peter Maydell Feb. 28, 2024, 6:39 p.m. UTC | #3
On Wed, 28 Feb 2024 at 18:28, Heinrich Schuchardt
<heinrich.schuchardt@canonical.com> wrote:
>
> On 28.02.24 16:06, Philippe Mathieu-Daudé wrote:
> > Hi Heinrich,
> >
> > On 28/2/24 13:59, Heinrich Schuchardt wrote:
> >> virtqueue_map_desc() is called with values of sz exceeding that may
> >> exceed
> >> TARGET_PAGE_SIZE. sz = 0x2800 has been observed.
> >>
> >> We only support a single bounce buffer. We have to avoid
> >> virtqueue_map_desc() calling address_space_map() multiple times.
> >> Otherwise
> >> we see an error
> >>
> >>      qemu: virtio: bogus descriptor or out of resources
> >>
> >> Increase the minimum size of the bounce buffer to 0x10000 which matches
> >> the largest value of TARGET_PAGE_SIZE for all architectures.
> >>
> >> Signed-off-by: Heinrich Schuchardt <heinrich.schuchardt@canonical.com>
> >> ---
> >> v2:
> >>     remove unrelated change
> >> ---
> >>   system/physmem.c | 8 ++++++--
> >>   1 file changed, 6 insertions(+), 2 deletions(-)
> >>
> >> diff --git a/system/physmem.c b/system/physmem.c
> >> index e3ebc19eef..3c82da1c86 100644
> >> --- a/system/physmem.c
> >> +++ b/system/physmem.c
> >> @@ -3151,8 +3151,12 @@ void *address_space_map(AddressSpace *as,
> >>               *plen = 0;
> >>               return NULL;
> >>           }
> >> -        /* Avoid unbounded allocations */
> >> -        l = MIN(l, TARGET_PAGE_SIZE);
> >> +        /*
> >> +         * There is only one bounce buffer. The largest occuring
> >> value of
> >> +         * parameter sz of virtqueue_map_desc() must fit into the bounce
> >> +         * buffer.
> >> +         */
> >> +        l = MIN(l, 0x10000);
> >
> > Please define this magic value. Maybe ANY_TARGET_PAGE_SIZE or
> > TARGETS_BIGGEST_PAGE_SIZE?
> >
> > Then along:
> >    QEMU_BUILD_BUG_ON(TARGET_PAGE_SIZE <= TARGETS_BIGGEST_PAGE_SIZE);
>
> Thank you Philippe for reviewing.
>
> TARGETS_BIGGEST_PAGE_SIZE does not fit as the value is not driven by the
> page size.
> How about MIN_BOUNCE_BUFFER_SIZE?
> Is include/exec/memory.h the right include for the constant?
>
> I don't think that TARGET_PAGE_SIZE has any relevance for setting the
> bounce buffer size. I only mentioned it to say that we are not
> decreasing the value on any existing architecture.
>
> I don't know why TARGET_PAGE_SIZE ever got into this piece of code.
> e3127ae0cdcd ("exec: reorganize address_space_map") does not provide a
> reason for this choice. Maybe Paolo remembers.

The limitation to a page dates back to commit 6d16c2f88f2a in 2009,
which was the first implementation of this function. I don't think
there's a particular reason for that value beyond that it was
probably a convenient value that was assumed to be likely "big enough".

I think the idea with this bounce-buffer has always been that this
isn't really a code path we expected to end up in very often --
it's supposed to be for when devices are doing DMA, which they
will typically be doing to memory (backed by host RAM), not
devices (backed by MMIO and needing a bounce buffer). So the
whole mechanism is a bit "last fallback to stop things breaking
entirely".

The address_space_map() API says that it's allowed to return
a subset of the range you ask for, so if the virtio code doesn't
cope with the minimum being set to TARGET_PAGE_SIZE then either
we need to fix that virtio code or we need to change the API
of this function. (But I think you will also get a reduced
range if you try to use it across a boundary between normal
host-memory-backed RAM and a device MemoryRegion.)

thanks
-- PMM
Heinrich Schuchardt Feb. 28, 2024, 7:07 p.m. UTC | #4
On 28.02.24 19:39, Peter Maydell wrote:
> On Wed, 28 Feb 2024 at 18:28, Heinrich Schuchardt
> <heinrich.schuchardt@canonical.com> wrote:
>>
>> On 28.02.24 16:06, Philippe Mathieu-Daudé wrote:
>>> Hi Heinrich,
>>>
>>> On 28/2/24 13:59, Heinrich Schuchardt wrote:
>>>> virtqueue_map_desc() is called with values of sz exceeding that may
>>>> exceed
>>>> TARGET_PAGE_SIZE. sz = 0x2800 has been observed.
>>>>
>>>> We only support a single bounce buffer. We have to avoid
>>>> virtqueue_map_desc() calling address_space_map() multiple times.
>>>> Otherwise
>>>> we see an error
>>>>
>>>>       qemu: virtio: bogus descriptor or out of resources
>>>>
>>>> Increase the minimum size of the bounce buffer to 0x10000 which matches
>>>> the largest value of TARGET_PAGE_SIZE for all architectures.
>>>>
>>>> Signed-off-by: Heinrich Schuchardt <heinrich.schuchardt@canonical.com>
>>>> ---
>>>> v2:
>>>>      remove unrelated change
>>>> ---
>>>>    system/physmem.c | 8 ++++++--
>>>>    1 file changed, 6 insertions(+), 2 deletions(-)
>>>>
>>>> diff --git a/system/physmem.c b/system/physmem.c
>>>> index e3ebc19eef..3c82da1c86 100644
>>>> --- a/system/physmem.c
>>>> +++ b/system/physmem.c
>>>> @@ -3151,8 +3151,12 @@ void *address_space_map(AddressSpace *as,
>>>>                *plen = 0;
>>>>                return NULL;
>>>>            }
>>>> -        /* Avoid unbounded allocations */
>>>> -        l = MIN(l, TARGET_PAGE_SIZE);
>>>> +        /*
>>>> +         * There is only one bounce buffer. The largest occuring
>>>> value of
>>>> +         * parameter sz of virtqueue_map_desc() must fit into the bounce
>>>> +         * buffer.
>>>> +         */
>>>> +        l = MIN(l, 0x10000);
>>>
>>> Please define this magic value. Maybe ANY_TARGET_PAGE_SIZE or
>>> TARGETS_BIGGEST_PAGE_SIZE?
>>>
>>> Then along:
>>>     QEMU_BUILD_BUG_ON(TARGET_PAGE_SIZE <= TARGETS_BIGGEST_PAGE_SIZE);
>>
>> Thank you Philippe for reviewing.
>>
>> TARGETS_BIGGEST_PAGE_SIZE does not fit as the value is not driven by the
>> page size.
>> How about MIN_BOUNCE_BUFFER_SIZE?
>> Is include/exec/memory.h the right include for the constant?
>>
>> I don't think that TARGET_PAGE_SIZE has any relevance for setting the
>> bounce buffer size. I only mentioned it to say that we are not
>> decreasing the value on any existing architecture.
>>
>> I don't know why TARGET_PAGE_SIZE ever got into this piece of code.
>> e3127ae0cdcd ("exec: reorganize address_space_map") does not provide a
>> reason for this choice. Maybe Paolo remembers.
> 
> The limitation to a page dates back to commit 6d16c2f88f2a in 2009,
> which was the first implementation of this function. I don't think
> there's a particular reason for that value beyond that it was
> probably a convenient value that was assumed to be likely "big enough".
> 
> I think the idea with this bounce-buffer has always been that this
> isn't really a code path we expected to end up in very often --
> it's supposed to be for when devices are doing DMA, which they
> will typically be doing to memory (backed by host RAM), not
> devices (backed by MMIO and needing a bounce buffer). So the
> whole mechanism is a bit "last fallback to stop things breaking
> entirely".
> 
> The address_space_map() API says that it's allowed to return
> a subset of the range you ask for, so if the virtio code doesn't
> cope with the minimum being set to TARGET_PAGE_SIZE then either
> we need to fix that virtio code or we need to change the API
> of this function. (But I think you will also get a reduced
> range if you try to use it across a boundary between normal
> host-memory-backed RAM and a device MemoryRegion.)

If we allow a bounce buffer only to be used once (via the in_use flag), 
why do we allow only a single bounce buffer?

Could address_space_map() allocate a new bounce buffer on every call and 
address_space_unmap() deallocate it?

Isn't the design with a single bounce buffer bound to fail with a 
multi-threaded client as collision can be expected?

Best regards

Heinrich
Peter Xu Feb. 29, 2024, 1:11 a.m. UTC | #5
On Wed, Feb 28, 2024 at 08:07:47PM +0100, Heinrich Schuchardt wrote:
> On 28.02.24 19:39, Peter Maydell wrote:
> > On Wed, 28 Feb 2024 at 18:28, Heinrich Schuchardt
> > <heinrich.schuchardt@canonical.com> wrote:
> > > 
> > > On 28.02.24 16:06, Philippe Mathieu-Daudé wrote:
> > > > Hi Heinrich,
> > > > 
> > > > On 28/2/24 13:59, Heinrich Schuchardt wrote:
> > > > > virtqueue_map_desc() is called with values of sz exceeding that may
> > > > > exceed
> > > > > TARGET_PAGE_SIZE. sz = 0x2800 has been observed.

Pure (and can also be stupid) question: why virtqueue_map_desc() would map
to !direct mem?  Shouldn't those buffers normally allocated from guest RAM?

> > > > > 
> > > > > We only support a single bounce buffer. We have to avoid
> > > > > virtqueue_map_desc() calling address_space_map() multiple times.
> > > > > Otherwise
> > > > > we see an error
> > > > > 
> > > > >       qemu: virtio: bogus descriptor or out of resources
> > > > > 
> > > > > Increase the minimum size of the bounce buffer to 0x10000 which matches
> > > > > the largest value of TARGET_PAGE_SIZE for all architectures.
> > > > > 
> > > > > Signed-off-by: Heinrich Schuchardt <heinrich.schuchardt@canonical.com>
> > > > > ---
> > > > > v2:
> > > > >      remove unrelated change
> > > > > ---
> > > > >    system/physmem.c | 8 ++++++--
> > > > >    1 file changed, 6 insertions(+), 2 deletions(-)
> > > > > 
> > > > > diff --git a/system/physmem.c b/system/physmem.c
> > > > > index e3ebc19eef..3c82da1c86 100644
> > > > > --- a/system/physmem.c
> > > > > +++ b/system/physmem.c
> > > > > @@ -3151,8 +3151,12 @@ void *address_space_map(AddressSpace *as,
> > > > >                *plen = 0;
> > > > >                return NULL;
> > > > >            }
> > > > > -        /* Avoid unbounded allocations */
> > > > > -        l = MIN(l, TARGET_PAGE_SIZE);
> > > > > +        /*
> > > > > +         * There is only one bounce buffer. The largest occuring
> > > > > value of
> > > > > +         * parameter sz of virtqueue_map_desc() must fit into the bounce
> > > > > +         * buffer.
> > > > > +         */
> > > > > +        l = MIN(l, 0x10000);
> > > > 
> > > > Please define this magic value. Maybe ANY_TARGET_PAGE_SIZE or
> > > > TARGETS_BIGGEST_PAGE_SIZE?
> > > > 
> > > > Then along:
> > > >     QEMU_BUILD_BUG_ON(TARGET_PAGE_SIZE <= TARGETS_BIGGEST_PAGE_SIZE);
> > > 
> > > Thank you Philippe for reviewing.
> > > 
> > > TARGETS_BIGGEST_PAGE_SIZE does not fit as the value is not driven by the
> > > page size.
> > > How about MIN_BOUNCE_BUFFER_SIZE?
> > > Is include/exec/memory.h the right include for the constant?
> > > 
> > > I don't think that TARGET_PAGE_SIZE has any relevance for setting the
> > > bounce buffer size. I only mentioned it to say that we are not
> > > decreasing the value on any existing architecture.
> > > 
> > > I don't know why TARGET_PAGE_SIZE ever got into this piece of code.
> > > e3127ae0cdcd ("exec: reorganize address_space_map") does not provide a
> > > reason for this choice. Maybe Paolo remembers.
> > 
> > The limitation to a page dates back to commit 6d16c2f88f2a in 2009,
> > which was the first implementation of this function. I don't think
> > there's a particular reason for that value beyond that it was
> > probably a convenient value that was assumed to be likely "big enough".
> > 
> > I think the idea with this bounce-buffer has always been that this
> > isn't really a code path we expected to end up in very often --
> > it's supposed to be for when devices are doing DMA, which they
> > will typically be doing to memory (backed by host RAM), not
> > devices (backed by MMIO and needing a bounce buffer). So the
> > whole mechanism is a bit "last fallback to stop things breaking
> > entirely".
> > 
> > The address_space_map() API says that it's allowed to return
> > a subset of the range you ask for, so if the virtio code doesn't
> > cope with the minimum being set to TARGET_PAGE_SIZE then either
> > we need to fix that virtio code or we need to change the API
> > of this function. (But I think you will also get a reduced
> > range if you try to use it across a boundary between normal
> > host-memory-backed RAM and a device MemoryRegion.)
> 
> If we allow a bounce buffer only to be used once (via the in_use flag), why
> do we allow only a single bounce buffer?
> 
> Could address_space_map() allocate a new bounce buffer on every call and
> address_space_unmap() deallocate it?
> 
> Isn't the design with a single bounce buffer bound to fail with a
> multi-threaded client as collision can be expected?

See:

https://lore.kernel.org/r/20240212080617.2559498-1-mnissler@rivosinc.com

For some reason that series didn't land, but it seems to be helpful in this
case too if e.g. there can be multiple of such devices.

Thanks,
Peter Maydell Feb. 29, 2024, 9:38 a.m. UTC | #6
On Wed, 28 Feb 2024 at 19:07, Heinrich Schuchardt
<heinrich.schuchardt@canonical.com> wrote:
>
> On 28.02.24 19:39, Peter Maydell wrote:
> > The limitation to a page dates back to commit 6d16c2f88f2a in 2009,
> > which was the first implementation of this function. I don't think
> > there's a particular reason for that value beyond that it was
> > probably a convenient value that was assumed to be likely "big enough".
> >
> > I think the idea with this bounce-buffer has always been that this
> > isn't really a code path we expected to end up in very often --
> > it's supposed to be for when devices are doing DMA, which they
> > will typically be doing to memory (backed by host RAM), not
> > devices (backed by MMIO and needing a bounce buffer). So the
> > whole mechanism is a bit "last fallback to stop things breaking
> > entirely".
> >
> > The address_space_map() API says that it's allowed to return
> > a subset of the range you ask for, so if the virtio code doesn't
> > cope with the minimum being set to TARGET_PAGE_SIZE then either
> > we need to fix that virtio code or we need to change the API
> > of this function. (But I think you will also get a reduced
> > range if you try to use it across a boundary between normal
> > host-memory-backed RAM and a device MemoryRegion.)
>
> If we allow a bounce buffer only to be used once (via the in_use flag),
> why do we allow only a single bounce buffer?
>
> Could address_space_map() allocate a new bounce buffer on every call and
> address_space_unmap() deallocate it?
>
> Isn't the design with a single bounce buffer bound to fail with a
> multi-threaded client as collision can be expected?

Yeah, I don't suppose multi-threaded was particularly expected.
Again, this is really a "handle the case where the guest does
something silly" setup, which is why only one bounce buffer.

Why is your guest ending up in the bounce-buffer path?

-- PMM
Heinrich Schuchardt Feb. 29, 2024, 10:22 a.m. UTC | #7
On 29.02.24 02:11, Peter Xu wrote:
> On Wed, Feb 28, 2024 at 08:07:47PM +0100, Heinrich Schuchardt wrote:
>> On 28.02.24 19:39, Peter Maydell wrote:
>>> On Wed, 28 Feb 2024 at 18:28, Heinrich Schuchardt
>>> <heinrich.schuchardt@canonical.com> wrote:
>>>>
>>>> On 28.02.24 16:06, Philippe Mathieu-Daudé wrote:
>>>>> Hi Heinrich,
>>>>>
>>>>> On 28/2/24 13:59, Heinrich Schuchardt wrote:
>>>>>> virtqueue_map_desc() is called with values of sz exceeding that may
>>>>>> exceed
>>>>>> TARGET_PAGE_SIZE. sz = 0x2800 has been observed.
> 
> Pure (and can also be stupid) question: why virtqueue_map_desc() would map
> to !direct mem?  Shouldn't those buffers normally allocated from guest RAM?
> 
>>>>>>
>>>>>> We only support a single bounce buffer. We have to avoid
>>>>>> virtqueue_map_desc() calling address_space_map() multiple times.
>>>>>> Otherwise
>>>>>> we see an error
>>>>>>
>>>>>>        qemu: virtio: bogus descriptor or out of resources
>>>>>>
>>>>>> Increase the minimum size of the bounce buffer to 0x10000 which matches
>>>>>> the largest value of TARGET_PAGE_SIZE for all architectures.
>>>>>>
>>>>>> Signed-off-by: Heinrich Schuchardt <heinrich.schuchardt@canonical.com>
>>>>>> ---
>>>>>> v2:
>>>>>>       remove unrelated change
>>>>>> ---
>>>>>>     system/physmem.c | 8 ++++++--
>>>>>>     1 file changed, 6 insertions(+), 2 deletions(-)
>>>>>>
>>>>>> diff --git a/system/physmem.c b/system/physmem.c
>>>>>> index e3ebc19eef..3c82da1c86 100644
>>>>>> --- a/system/physmem.c
>>>>>> +++ b/system/physmem.c
>>>>>> @@ -3151,8 +3151,12 @@ void *address_space_map(AddressSpace *as,
>>>>>>                 *plen = 0;
>>>>>>                 return NULL;
>>>>>>             }
>>>>>> -        /* Avoid unbounded allocations */
>>>>>> -        l = MIN(l, TARGET_PAGE_SIZE);
>>>>>> +        /*
>>>>>> +         * There is only one bounce buffer. The largest occuring
>>>>>> value of
>>>>>> +         * parameter sz of virtqueue_map_desc() must fit into the bounce
>>>>>> +         * buffer.
>>>>>> +         */
>>>>>> +        l = MIN(l, 0x10000);
>>>>>
>>>>> Please define this magic value. Maybe ANY_TARGET_PAGE_SIZE or
>>>>> TARGETS_BIGGEST_PAGE_SIZE?
>>>>>
>>>>> Then along:
>>>>>      QEMU_BUILD_BUG_ON(TARGET_PAGE_SIZE <= TARGETS_BIGGEST_PAGE_SIZE);
>>>>
>>>> Thank you Philippe for reviewing.
>>>>
>>>> TARGETS_BIGGEST_PAGE_SIZE does not fit as the value is not driven by the
>>>> page size.
>>>> How about MIN_BOUNCE_BUFFER_SIZE?
>>>> Is include/exec/memory.h the right include for the constant?
>>>>
>>>> I don't think that TARGET_PAGE_SIZE has any relevance for setting the
>>>> bounce buffer size. I only mentioned it to say that we are not
>>>> decreasing the value on any existing architecture.
>>>>
>>>> I don't know why TARGET_PAGE_SIZE ever got into this piece of code.
>>>> e3127ae0cdcd ("exec: reorganize address_space_map") does not provide a
>>>> reason for this choice. Maybe Paolo remembers.
>>>
>>> The limitation to a page dates back to commit 6d16c2f88f2a in 2009,
>>> which was the first implementation of this function. I don't think
>>> there's a particular reason for that value beyond that it was
>>> probably a convenient value that was assumed to be likely "big enough".
>>>
>>> I think the idea with this bounce-buffer has always been that this
>>> isn't really a code path we expected to end up in very often --
>>> it's supposed to be for when devices are doing DMA, which they
>>> will typically be doing to memory (backed by host RAM), not
>>> devices (backed by MMIO and needing a bounce buffer). So the
>>> whole mechanism is a bit "last fallback to stop things breaking
>>> entirely".
>>>
>>> The address_space_map() API says that it's allowed to return
>>> a subset of the range you ask for, so if the virtio code doesn't
>>> cope with the minimum being set to TARGET_PAGE_SIZE then either
>>> we need to fix that virtio code or we need to change the API
>>> of this function. (But I think you will also get a reduced
>>> range if you try to use it across a boundary between normal
>>> host-memory-backed RAM and a device MemoryRegion.)
>>
>> If we allow a bounce buffer only to be used once (via the in_use flag), why
>> do we allow only a single bounce buffer?
>>
>> Could address_space_map() allocate a new bounce buffer on every call and
>> address_space_unmap() deallocate it?
>>
>> Isn't the design with a single bounce buffer bound to fail with a
>> multi-threaded client as collision can be expected?
> 
> See:
> 
> https://lore.kernel.org/r/20240212080617.2559498-1-mnissler@rivosinc.com
> 
> For some reason that series didn't land, but it seems to be helpful in this
> case too if e.g. there can be multiple of such devices.
> 
> Thanks,
> 

Hello Peter Xu,

thanks for pointing to your series. What I like about it is that it 
removes the limit of a single bounce buffer per AddressSpace.

Unfortunately it does not solve my problem. You limit the sum of all of 
the allocations for a single AddressSpcace to 
DEFAULT_MAX_BOUNCE_BUFFER_SIZE = 4096 which is too small for my use case.

Why do we need a limit?
Why is it so tiny?

Best regards

Heinrich
Mattias Nissler Feb. 29, 2024, 10:36 a.m. UTC | #8
On Thu, Feb 29, 2024 at 11:22 AM Heinrich Schuchardt
<heinrich.schuchardt@canonical.com> wrote:
>
> On 29.02.24 02:11, Peter Xu wrote:
> > On Wed, Feb 28, 2024 at 08:07:47PM +0100, Heinrich Schuchardt wrote:
> >> On 28.02.24 19:39, Peter Maydell wrote:
> >>> On Wed, 28 Feb 2024 at 18:28, Heinrich Schuchardt
> >>> <heinrich.schuchardt@canonical.com> wrote:
> >>>>
> >>>> On 28.02.24 16:06, Philippe Mathieu-Daudé wrote:
> >>>>> Hi Heinrich,
> >>>>>
> >>>>> On 28/2/24 13:59, Heinrich Schuchardt wrote:
> >>>>>> virtqueue_map_desc() is called with values of sz exceeding that may
> >>>>>> exceed
> >>>>>> TARGET_PAGE_SIZE. sz = 0x2800 has been observed.
> >
> > Pure (and can also be stupid) question: why virtqueue_map_desc() would map
> > to !direct mem?  Shouldn't those buffers normally allocated from guest RAM?
> >
> >>>>>>
> >>>>>> We only support a single bounce buffer. We have to avoid
> >>>>>> virtqueue_map_desc() calling address_space_map() multiple times.
> >>>>>> Otherwise
> >>>>>> we see an error
> >>>>>>
> >>>>>>        qemu: virtio: bogus descriptor or out of resources
> >>>>>>
> >>>>>> Increase the minimum size of the bounce buffer to 0x10000 which matches
> >>>>>> the largest value of TARGET_PAGE_SIZE for all architectures.
> >>>>>>
> >>>>>> Signed-off-by: Heinrich Schuchardt <heinrich.schuchardt@canonical.com>
> >>>>>> ---
> >>>>>> v2:
> >>>>>>       remove unrelated change
> >>>>>> ---
> >>>>>>     system/physmem.c | 8 ++++++--
> >>>>>>     1 file changed, 6 insertions(+), 2 deletions(-)
> >>>>>>
> >>>>>> diff --git a/system/physmem.c b/system/physmem.c
> >>>>>> index e3ebc19eef..3c82da1c86 100644
> >>>>>> --- a/system/physmem.c
> >>>>>> +++ b/system/physmem.c
> >>>>>> @@ -3151,8 +3151,12 @@ void *address_space_map(AddressSpace *as,
> >>>>>>                 *plen = 0;
> >>>>>>                 return NULL;
> >>>>>>             }
> >>>>>> -        /* Avoid unbounded allocations */
> >>>>>> -        l = MIN(l, TARGET_PAGE_SIZE);
> >>>>>> +        /*
> >>>>>> +         * There is only one bounce buffer. The largest occuring
> >>>>>> value of
> >>>>>> +         * parameter sz of virtqueue_map_desc() must fit into the bounce
> >>>>>> +         * buffer.
> >>>>>> +         */
> >>>>>> +        l = MIN(l, 0x10000);
> >>>>>
> >>>>> Please define this magic value. Maybe ANY_TARGET_PAGE_SIZE or
> >>>>> TARGETS_BIGGEST_PAGE_SIZE?
> >>>>>
> >>>>> Then along:
> >>>>>      QEMU_BUILD_BUG_ON(TARGET_PAGE_SIZE <= TARGETS_BIGGEST_PAGE_SIZE);
> >>>>
> >>>> Thank you Philippe for reviewing.
> >>>>
> >>>> TARGETS_BIGGEST_PAGE_SIZE does not fit as the value is not driven by the
> >>>> page size.
> >>>> How about MIN_BOUNCE_BUFFER_SIZE?
> >>>> Is include/exec/memory.h the right include for the constant?
> >>>>
> >>>> I don't think that TARGET_PAGE_SIZE has any relevance for setting the
> >>>> bounce buffer size. I only mentioned it to say that we are not
> >>>> decreasing the value on any existing architecture.
> >>>>
> >>>> I don't know why TARGET_PAGE_SIZE ever got into this piece of code.
> >>>> e3127ae0cdcd ("exec: reorganize address_space_map") does not provide a
> >>>> reason for this choice. Maybe Paolo remembers.
> >>>
> >>> The limitation to a page dates back to commit 6d16c2f88f2a in 2009,
> >>> which was the first implementation of this function. I don't think
> >>> there's a particular reason for that value beyond that it was
> >>> probably a convenient value that was assumed to be likely "big enough".
> >>>
> >>> I think the idea with this bounce-buffer has always been that this
> >>> isn't really a code path we expected to end up in very often --
> >>> it's supposed to be for when devices are doing DMA, which they
> >>> will typically be doing to memory (backed by host RAM), not
> >>> devices (backed by MMIO and needing a bounce buffer). So the
> >>> whole mechanism is a bit "last fallback to stop things breaking
> >>> entirely".
> >>>
> >>> The address_space_map() API says that it's allowed to return
> >>> a subset of the range you ask for, so if the virtio code doesn't
> >>> cope with the minimum being set to TARGET_PAGE_SIZE then either
> >>> we need to fix that virtio code or we need to change the API
> >>> of this function. (But I think you will also get a reduced
> >>> range if you try to use it across a boundary between normal
> >>> host-memory-backed RAM and a device MemoryRegion.)
> >>
> >> If we allow a bounce buffer only to be used once (via the in_use flag), why
> >> do we allow only a single bounce buffer?
> >>
> >> Could address_space_map() allocate a new bounce buffer on every call and
> >> address_space_unmap() deallocate it?
> >>
> >> Isn't the design with a single bounce buffer bound to fail with a
> >> multi-threaded client as collision can be expected?
> >
> > See:
> >
> > https://lore.kernel.org/r/20240212080617.2559498-1-mnissler@rivosinc.com
> >
> > For some reason that series didn't land, but it seems to be helpful in this
> > case too if e.g. there can be multiple of such devices.
> >
> > Thanks,
> >
>
> Hello Peter Xu,
>
> thanks for pointing to your series. What I like about it is that it
> removes the limit of a single bounce buffer per AddressSpace.
>
> Unfortunately it does not solve my problem. You limit the sum of all of
> the allocations for a single AddressSpcace to
> DEFAULT_MAX_BOUNCE_BUFFER_SIZE = 4096 which is too small for my use case.

Note that the limit is configured for address spaces attached to PCI
devices with a parameter.

>
> Why do we need a limit?

The rationale is to prevent a guest from allocating unlimited amounts
of host memory.

> Why is it so tiny?

Nobody has come up with a good way to determine a "sufficient" amount
that works with all use cases, while at the same time addressing the
concern due to malicious guest behavior mentioned above.

(Note that I'm merely reciting previous conversations as the author of
the series Peter pointed you at)

>
> Best regards
>
> Heinrich
>
>
>
Jonathan Cameron Feb. 29, 2024, 10:46 a.m. UTC | #9
On Thu, 29 Feb 2024 11:22:24 +0100
Heinrich Schuchardt <heinrich.schuchardt@canonical.com> wrote:

> On 29.02.24 02:11, Peter Xu wrote:
> > On Wed, Feb 28, 2024 at 08:07:47PM +0100, Heinrich Schuchardt wrote:  
> >> On 28.02.24 19:39, Peter Maydell wrote:  
> >>> On Wed, 28 Feb 2024 at 18:28, Heinrich Schuchardt
> >>> <heinrich.schuchardt@canonical.com> wrote:  
> >>>>
> >>>> On 28.02.24 16:06, Philippe Mathieu-Daudé wrote:  
> >>>>> Hi Heinrich,
> >>>>>
> >>>>> On 28/2/24 13:59, Heinrich Schuchardt wrote:  
> >>>>>> virtqueue_map_desc() is called with values of sz exceeding that may
> >>>>>> exceed
> >>>>>> TARGET_PAGE_SIZE. sz = 0x2800 has been observed.  
> > 
> > Pure (and can also be stupid) question: why virtqueue_map_desc() would map
> > to !direct mem?  Shouldn't those buffers normally allocated from guest RAM?
> >   
> >>>>>>
> >>>>>> We only support a single bounce buffer. We have to avoid
> >>>>>> virtqueue_map_desc() calling address_space_map() multiple times.
> >>>>>> Otherwise
> >>>>>> we see an error
> >>>>>>
> >>>>>>        qemu: virtio: bogus descriptor or out of resources
> >>>>>>
> >>>>>> Increase the minimum size of the bounce buffer to 0x10000 which matches
> >>>>>> the largest value of TARGET_PAGE_SIZE for all architectures.
> >>>>>>
> >>>>>> Signed-off-by: Heinrich Schuchardt <heinrich.schuchardt@canonical.com>
> >>>>>> ---
> >>>>>> v2:
> >>>>>>       remove unrelated change
> >>>>>> ---
> >>>>>>     system/physmem.c | 8 ++++++--
> >>>>>>     1 file changed, 6 insertions(+), 2 deletions(-)
> >>>>>>
> >>>>>> diff --git a/system/physmem.c b/system/physmem.c
> >>>>>> index e3ebc19eef..3c82da1c86 100644
> >>>>>> --- a/system/physmem.c
> >>>>>> +++ b/system/physmem.c
> >>>>>> @@ -3151,8 +3151,12 @@ void *address_space_map(AddressSpace *as,
> >>>>>>                 *plen = 0;
> >>>>>>                 return NULL;
> >>>>>>             }
> >>>>>> -        /* Avoid unbounded allocations */
> >>>>>> -        l = MIN(l, TARGET_PAGE_SIZE);
> >>>>>> +        /*
> >>>>>> +         * There is only one bounce buffer. The largest occuring
> >>>>>> value of
> >>>>>> +         * parameter sz of virtqueue_map_desc() must fit into the bounce
> >>>>>> +         * buffer.
> >>>>>> +         */
> >>>>>> +        l = MIN(l, 0x10000);  
> >>>>>
> >>>>> Please define this magic value. Maybe ANY_TARGET_PAGE_SIZE or
> >>>>> TARGETS_BIGGEST_PAGE_SIZE?
> >>>>>
> >>>>> Then along:
> >>>>>      QEMU_BUILD_BUG_ON(TARGET_PAGE_SIZE <= TARGETS_BIGGEST_PAGE_SIZE);  
> >>>>
> >>>> Thank you Philippe for reviewing.
> >>>>
> >>>> TARGETS_BIGGEST_PAGE_SIZE does not fit as the value is not driven by the
> >>>> page size.
> >>>> How about MIN_BOUNCE_BUFFER_SIZE?
> >>>> Is include/exec/memory.h the right include for the constant?
> >>>>
> >>>> I don't think that TARGET_PAGE_SIZE has any relevance for setting the
> >>>> bounce buffer size. I only mentioned it to say that we are not
> >>>> decreasing the value on any existing architecture.
> >>>>
> >>>> I don't know why TARGET_PAGE_SIZE ever got into this piece of code.
> >>>> e3127ae0cdcd ("exec: reorganize address_space_map") does not provide a
> >>>> reason for this choice. Maybe Paolo remembers.  
> >>>
> >>> The limitation to a page dates back to commit 6d16c2f88f2a in 2009,
> >>> which was the first implementation of this function. I don't think
> >>> there's a particular reason for that value beyond that it was
> >>> probably a convenient value that was assumed to be likely "big enough".
> >>>
> >>> I think the idea with this bounce-buffer has always been that this
> >>> isn't really a code path we expected to end up in very often --
> >>> it's supposed to be for when devices are doing DMA, which they
> >>> will typically be doing to memory (backed by host RAM), not
> >>> devices (backed by MMIO and needing a bounce buffer). So the
> >>> whole mechanism is a bit "last fallback to stop things breaking
> >>> entirely".
> >>>
> >>> The address_space_map() API says that it's allowed to return
> >>> a subset of the range you ask for, so if the virtio code doesn't
> >>> cope with the minimum being set to TARGET_PAGE_SIZE then either
> >>> we need to fix that virtio code or we need to change the API
> >>> of this function. (But I think you will also get a reduced
> >>> range if you try to use it across a boundary between normal
> >>> host-memory-backed RAM and a device MemoryRegion.)  
> >>
> >> If we allow a bounce buffer only to be used once (via the in_use flag), why
> >> do we allow only a single bounce buffer?
> >>
> >> Could address_space_map() allocate a new bounce buffer on every call and
> >> address_space_unmap() deallocate it?
> >>
> >> Isn't the design with a single bounce buffer bound to fail with a
> >> multi-threaded client as collision can be expected?  
> > 
> > See:
> > 
> > https://lore.kernel.org/r/20240212080617.2559498-1-mnissler@rivosinc.com
> > 
> > For some reason that series didn't land, but it seems to be helpful in this
> > case too if e.g. there can be multiple of such devices.
> > 
> > Thanks,
> >   
> 
> Hello Peter Xu,
> 
> thanks for pointing to your series. What I like about it is that it 
> removes the limit of a single bounce buffer per AddressSpace.
> 
> Unfortunately it does not solve my problem. You limit the sum of all of 
> the allocations for a single AddressSpcace to 
> DEFAULT_MAX_BOUNCE_BUFFER_SIZE = 4096 which is too small for my use case.

Agreed. For CXL memory (which indeed hits this with virtio-blk-pci for example)
I'm carrying a far from subtle patch [1] on top of the series linked above
(which I would very much like to land asap!)  Note I also need:

https://lore.kernel.org/qemu-devel/20240215142817.1904-1-Jonathan.Cameron@huawei.com/#t
[PATCH 0/3] physmem: Fix MemoryRegion for second access to cached MMIO Address Space

otherwise the second 8 byte read comes from the wrong address space.
That used to accidentally trigger very large reads hence the rather excessive
size set below.  I haven't checked yet if a smaller size is fine as obviously
something as hacky as the below isn't going anywhere! Perhaps we need a rather more
global version of x-max-bounce-buffers from Mattias' series?

[1]
diff --git a/system/physmem.c b/system/physmem.c
index 0a838f1b33..10e9d8da86 100644
--- a/system/physmem.c
+++ b/system/physmem.c
@@ -2554,6 +2554,7 @@ static void memory_map_init(void)
     memory_region_init(system_memory, NULL, "system", UINT64_MAX);
     address_space_init(&address_space_memory, system_memory, "memory");

+    address_space_memory.max_bounce_buffer_size = 1024 * 1024 * 1024;
     system_io = g_malloc(sizeof(*system_io));
     memory_region_init_io(system_io, NULL, &unassigned_io_ops, NULL, "io",
                           65536);


> 
> Why do we need a limit?
> Why is it so tiny?
> 
> Best regards
> 
> Heinrich
> 
> 
> 
>
Jonathan Cameron Feb. 29, 2024, 10:59 a.m. UTC | #10
On Thu, 29 Feb 2024 09:38:29 +0000
Peter Maydell <peter.maydell@linaro.org> wrote:

> On Wed, 28 Feb 2024 at 19:07, Heinrich Schuchardt
> <heinrich.schuchardt@canonical.com> wrote:
> >
> > On 28.02.24 19:39, Peter Maydell wrote:  
> > > The limitation to a page dates back to commit 6d16c2f88f2a in 2009,
> > > which was the first implementation of this function. I don't think
> > > there's a particular reason for that value beyond that it was
> > > probably a convenient value that was assumed to be likely "big enough".
> > >
> > > I think the idea with this bounce-buffer has always been that this
> > > isn't really a code path we expected to end up in very often --
> > > it's supposed to be for when devices are doing DMA, which they
> > > will typically be doing to memory (backed by host RAM), not
> > > devices (backed by MMIO and needing a bounce buffer). So the
> > > whole mechanism is a bit "last fallback to stop things breaking
> > > entirely".
> > >
> > > The address_space_map() API says that it's allowed to return
> > > a subset of the range you ask for, so if the virtio code doesn't
> > > cope with the minimum being set to TARGET_PAGE_SIZE then either
> > > we need to fix that virtio code or we need to change the API
> > > of this function. (But I think you will also get a reduced
> > > range if you try to use it across a boundary between normal
> > > host-memory-backed RAM and a device MemoryRegion.)  
> >
> > If we allow a bounce buffer only to be used once (via the in_use flag),
> > why do we allow only a single bounce buffer?
> >
> > Could address_space_map() allocate a new bounce buffer on every call and
> > address_space_unmap() deallocate it?
> >
> > Isn't the design with a single bounce buffer bound to fail with a
> > multi-threaded client as collision can be expected?  
> 
> Yeah, I don't suppose multi-threaded was particularly expected.
> Again, this is really a "handle the case where the guest does
> something silly" setup, which is why only one bounce buffer.
> 
> Why is your guest ending up in the bounce-buffer path?

Happens for me with emulated CXL memory. I think the case I saw
was split descriptors in virtio via address space caches
https://elixir.bootlin.com/qemu/latest/source/hw/virtio/virtio.c#L4043

One bounce buffer is in use for the outer loop and another for the descriptors
it is pointing to.

Matthias' series makes this work fine.  I need to circle back and check
how big a cache this needs.  I'm carrying a silly size because of the
side effect of the address space bug here
https://lore.kernel.org/qemu-devel/20240215142817.1904-1-Jonathan.Cameron@huawei.com/#t
and can probably set it to much less than my currently 1GiB.

Jonathan

> 
> -- PMM
>
Peter Maydell Feb. 29, 2024, 11:11 a.m. UTC | #11
On Thu, 29 Feb 2024 at 10:59, Jonathan Cameron
<Jonathan.Cameron@huawei.com> wrote:
>
> On Thu, 29 Feb 2024 09:38:29 +0000
> Peter Maydell <peter.maydell@linaro.org> wrote:
>
> > On Wed, 28 Feb 2024 at 19:07, Heinrich Schuchardt
> > <heinrich.schuchardt@canonical.com> wrote:
> > >
> > > On 28.02.24 19:39, Peter Maydell wrote:
> > > > The limitation to a page dates back to commit 6d16c2f88f2a in 2009,
> > > > which was the first implementation of this function. I don't think
> > > > there's a particular reason for that value beyond that it was
> > > > probably a convenient value that was assumed to be likely "big enough".
> > > >
> > > > I think the idea with this bounce-buffer has always been that this
> > > > isn't really a code path we expected to end up in very often --
> > > > it's supposed to be for when devices are doing DMA, which they
> > > > will typically be doing to memory (backed by host RAM), not
> > > > devices (backed by MMIO and needing a bounce buffer). So the
> > > > whole mechanism is a bit "last fallback to stop things breaking
> > > > entirely".
> > > >
> > > > The address_space_map() API says that it's allowed to return
> > > > a subset of the range you ask for, so if the virtio code doesn't
> > > > cope with the minimum being set to TARGET_PAGE_SIZE then either
> > > > we need to fix that virtio code or we need to change the API
> > > > of this function. (But I think you will also get a reduced
> > > > range if you try to use it across a boundary between normal
> > > > host-memory-backed RAM and a device MemoryRegion.)
> > >
> > > If we allow a bounce buffer only to be used once (via the in_use flag),
> > > why do we allow only a single bounce buffer?
> > >
> > > Could address_space_map() allocate a new bounce buffer on every call and
> > > address_space_unmap() deallocate it?
> > >
> > > Isn't the design with a single bounce buffer bound to fail with a
> > > multi-threaded client as collision can be expected?
> >
> > Yeah, I don't suppose multi-threaded was particularly expected.
> > Again, this is really a "handle the case where the guest does
> > something silly" setup, which is why only one bounce buffer.
> >
> > Why is your guest ending up in the bounce-buffer path?
>
> Happens for me with emulated CXL memory.

Can we put that in the "something silly" bucket? :-)
But yes, I'm not surprised that CXL runs into this. Heinrich,
are you doing CXL testing, or is this some other workload?

> I think the case I saw
> was split descriptors in virtio via address space caches
> https://elixir.bootlin.com/qemu/latest/source/hw/virtio/virtio.c#L4043
>
> One bounce buffer is in use for the outer loop and another for the descriptors
> it is pointing to.

Mmm. The other assumption made in the design of the address_space_map()
API I think was that it was unlikely that a device would be trying
to do two DMA operations simultaneously. This is clearly not
true in practice. We definitely need to fix one end or other of
this API.

(I'm not sure why the bounce-buffer limit ought to be per-AddressSpace:
is that just done in Matthias' series so that we can attach an
x-thingy property to the individual PCI device?)

-- PMM
Heinrich Schuchardt Feb. 29, 2024, 11:17 a.m. UTC | #12
On 29.02.24 12:11, Peter Maydell wrote:
> On Thu, 29 Feb 2024 at 10:59, Jonathan Cameron
> <Jonathan.Cameron@huawei.com> wrote:
>>
>> On Thu, 29 Feb 2024 09:38:29 +0000
>> Peter Maydell <peter.maydell@linaro.org> wrote:
>>
>>> On Wed, 28 Feb 2024 at 19:07, Heinrich Schuchardt
>>> <heinrich.schuchardt@canonical.com> wrote:
>>>>
>>>> On 28.02.24 19:39, Peter Maydell wrote:
>>>>> The limitation to a page dates back to commit 6d16c2f88f2a in 2009,
>>>>> which was the first implementation of this function. I don't think
>>>>> there's a particular reason for that value beyond that it was
>>>>> probably a convenient value that was assumed to be likely "big enough".
>>>>>
>>>>> I think the idea with this bounce-buffer has always been that this
>>>>> isn't really a code path we expected to end up in very often --
>>>>> it's supposed to be for when devices are doing DMA, which they
>>>>> will typically be doing to memory (backed by host RAM), not
>>>>> devices (backed by MMIO and needing a bounce buffer). So the
>>>>> whole mechanism is a bit "last fallback to stop things breaking
>>>>> entirely".
>>>>>
>>>>> The address_space_map() API says that it's allowed to return
>>>>> a subset of the range you ask for, so if the virtio code doesn't
>>>>> cope with the minimum being set to TARGET_PAGE_SIZE then either
>>>>> we need to fix that virtio code or we need to change the API
>>>>> of this function. (But I think you will also get a reduced
>>>>> range if you try to use it across a boundary between normal
>>>>> host-memory-backed RAM and a device MemoryRegion.)
>>>>
>>>> If we allow a bounce buffer only to be used once (via the in_use flag),
>>>> why do we allow only a single bounce buffer?
>>>>
>>>> Could address_space_map() allocate a new bounce buffer on every call and
>>>> address_space_unmap() deallocate it?
>>>>
>>>> Isn't the design with a single bounce buffer bound to fail with a
>>>> multi-threaded client as collision can be expected?
>>>
>>> Yeah, I don't suppose multi-threaded was particularly expected.
>>> Again, this is really a "handle the case where the guest does
>>> something silly" setup, which is why only one bounce buffer.
>>>
>>> Why is your guest ending up in the bounce-buffer path?
>>
>> Happens for me with emulated CXL memory.
> 
> Can we put that in the "something silly" bucket? :-)
> But yes, I'm not surprised that CXL runs into this. Heinrich,
> are you doing CXL testing, or is this some other workload?

I am running the UEFI Self-Certification Tests (SCT) on EDK 2 using:

qemu-system-riscv64 \
       -M virt,acpi=off -accel tcg -m 4096 \
       -serial mon:stdio \
       -device virtio-gpu-pci \
       -device qemu-xhci \
       -device usb-kbd \
       -drive 
if=pflash,format=raw,unit=0,file=RISCV_VIRT_CODE.fd,readonly=on \
       -drive if=pflash,format=raw,unit=1,file=RISCV_VIRT_VARS.fd \
       -drive file=sct.img,format=raw,if=virtio \
       -device virtio-net-device,netdev=net0 \
       -netdev user,id=net0

This does not invoke any CXL related stuff.

Best regards

Heinrich

> 
>> I think the case I saw
>> was split descriptors in virtio via address space caches
>> https://elixir.bootlin.com/qemu/latest/source/hw/virtio/virtio.c#L4043
>>
>> One bounce buffer is in use for the outer loop and another for the descriptors
>> it is pointing to.
> 
> Mmm. The other assumption made in the design of the address_space_map()
> API I think was that it was unlikely that a device would be trying
> to do two DMA operations simultaneously. This is clearly not
> true in practice. We definitely need to fix one end or other of
> this API.
> 
> (I'm not sure why the bounce-buffer limit ought to be per-AddressSpace:
> is that just done in Matthias' series so that we can attach an
> x-thingy property to the individual PCI device?)
> 
> -- PMM
Mattias Nissler Feb. 29, 2024, 11:18 a.m. UTC | #13
On Thu, Feb 29, 2024 at 12:12 PM Peter Maydell <peter.maydell@linaro.org> wrote:
>
> On Thu, 29 Feb 2024 at 10:59, Jonathan Cameron
> <Jonathan.Cameron@huawei.com> wrote:
> >
> > On Thu, 29 Feb 2024 09:38:29 +0000
> > Peter Maydell <peter.maydell@linaro.org> wrote:
> >
> > > On Wed, 28 Feb 2024 at 19:07, Heinrich Schuchardt
> > > <heinrich.schuchardt@canonical.com> wrote:
> > > >
> > > > On 28.02.24 19:39, Peter Maydell wrote:
> > > > > The limitation to a page dates back to commit 6d16c2f88f2a in 2009,
> > > > > which was the first implementation of this function. I don't think
> > > > > there's a particular reason for that value beyond that it was
> > > > > probably a convenient value that was assumed to be likely "big enough".
> > > > >
> > > > > I think the idea with this bounce-buffer has always been that this
> > > > > isn't really a code path we expected to end up in very often --
> > > > > it's supposed to be for when devices are doing DMA, which they
> > > > > will typically be doing to memory (backed by host RAM), not
> > > > > devices (backed by MMIO and needing a bounce buffer). So the
> > > > > whole mechanism is a bit "last fallback to stop things breaking
> > > > > entirely".
> > > > >
> > > > > The address_space_map() API says that it's allowed to return
> > > > > a subset of the range you ask for, so if the virtio code doesn't
> > > > > cope with the minimum being set to TARGET_PAGE_SIZE then either
> > > > > we need to fix that virtio code or we need to change the API
> > > > > of this function. (But I think you will also get a reduced
> > > > > range if you try to use it across a boundary between normal
> > > > > host-memory-backed RAM and a device MemoryRegion.)
> > > >
> > > > If we allow a bounce buffer only to be used once (via the in_use flag),
> > > > why do we allow only a single bounce buffer?
> > > >
> > > > Could address_space_map() allocate a new bounce buffer on every call and
> > > > address_space_unmap() deallocate it?
> > > >
> > > > Isn't the design with a single bounce buffer bound to fail with a
> > > > multi-threaded client as collision can be expected?
> > >
> > > Yeah, I don't suppose multi-threaded was particularly expected.
> > > Again, this is really a "handle the case where the guest does
> > > something silly" setup, which is why only one bounce buffer.
> > >
> > > Why is your guest ending up in the bounce-buffer path?
> >
> > Happens for me with emulated CXL memory.
>
> Can we put that in the "something silly" bucket? :-)
> But yes, I'm not surprised that CXL runs into this. Heinrich,
> are you doing CXL testing, or is this some other workload?
>
> > I think the case I saw
> > was split descriptors in virtio via address space caches
> > https://elixir.bootlin.com/qemu/latest/source/hw/virtio/virtio.c#L4043
> >
> > One bounce buffer is in use for the outer loop and another for the descriptors
> > it is pointing to.
>
> Mmm. The other assumption made in the design of the address_space_map()
> API I think was that it was unlikely that a device would be trying
> to do two DMA operations simultaneously. This is clearly not
> true in practice. We definitely need to fix one end or other of
> this API.
>
> (I'm not sure why the bounce-buffer limit ought to be per-AddressSpace:
> is that just done in Matthias' series so that we can attach an
> x-thingy property to the individual PCI device?)

Yes, that's the result of review feedback to the early iterations of
my series. Specifically, (1) a limit is needed to prevent rogue guests
from hogging unlimited amounts of memory and (2) global parameters are
frowned upon. Setting a suitable limit is much more practical when
targeted at a given device/driver combination.
Peter Maydell Feb. 29, 2024, 12:34 p.m. UTC | #14
On Thu, 29 Feb 2024 at 11:17, Heinrich Schuchardt
<heinrich.schuchardt@canonical.com> wrote:
> > But yes, I'm not surprised that CXL runs into this. Heinrich,
> > are you doing CXL testing, or is this some other workload?
>
> I am running the UEFI Self-Certification Tests (SCT) on EDK 2 using:
>
> qemu-system-riscv64 \
>        -M virt,acpi=off -accel tcg -m 4096 \
>        -serial mon:stdio \
>        -device virtio-gpu-pci \
>        -device qemu-xhci \
>        -device usb-kbd \
>        -drive
> if=pflash,format=raw,unit=0,file=RISCV_VIRT_CODE.fd,readonly=on \
>        -drive if=pflash,format=raw,unit=1,file=RISCV_VIRT_VARS.fd \
>        -drive file=sct.img,format=raw,if=virtio \
>        -device virtio-net-device,netdev=net0 \
>        -netdev user,id=net0
>
> This does not invoke any CXL related stuff.

Hmm, that doesn't seem like it ought to be running into this.
What underlying memory region is the guest trying to do
the virtio queue access to?

-- PMM
Mattias Nissler Feb. 29, 2024, 12:52 p.m. UTC | #15
On Thu, Feb 29, 2024 at 1:35 PM Peter Maydell <peter.maydell@linaro.org> wrote:
>
> On Thu, 29 Feb 2024 at 11:17, Heinrich Schuchardt
> <heinrich.schuchardt@canonical.com> wrote:
> > > But yes, I'm not surprised that CXL runs into this. Heinrich,
> > > are you doing CXL testing, or is this some other workload?
> >
> > I am running the UEFI Self-Certification Tests (SCT) on EDK 2 using:
> >
> > qemu-system-riscv64 \
> >        -M virt,acpi=off -accel tcg -m 4096 \
> >        -serial mon:stdio \
> >        -device virtio-gpu-pci \
> >        -device qemu-xhci \
> >        -device usb-kbd \
> >        -drive
> > if=pflash,format=raw,unit=0,file=RISCV_VIRT_CODE.fd,readonly=on \
> >        -drive if=pflash,format=raw,unit=1,file=RISCV_VIRT_VARS.fd \
> >        -drive file=sct.img,format=raw,if=virtio \
> >        -device virtio-net-device,netdev=net0 \
> >        -netdev user,id=net0
> >
> > This does not invoke any CXL related stuff.
>
> Hmm, that doesn't seem like it ought to be running into this.
> What underlying memory region is the guest trying to do
> the virtio queue access to?

FWIW, I have seen multiple bounce buffer usage with the generic net TX
path as well as the XHCI controller, so it might be either of these.
Bounce buffering should only take place when the memory region can't
be accessed directly though - I don't see why that's the case for the
given command line.
Peter Maydell Feb. 29, 2024, 1:19 p.m. UTC | #16
On Thu, 29 Feb 2024 at 12:52, Mattias Nissler <mnissler@rivosinc.com> wrote:
>
> On Thu, Feb 29, 2024 at 1:35 PM Peter Maydell <peter.maydell@linaro.org> wrote:
> >
> > On Thu, 29 Feb 2024 at 11:17, Heinrich Schuchardt
> > <heinrich.schuchardt@canonical.com> wrote:
> > > > But yes, I'm not surprised that CXL runs into this. Heinrich,
> > > > are you doing CXL testing, or is this some other workload?
> > >
> > > I am running the UEFI Self-Certification Tests (SCT) on EDK 2 using:
> > >
> > > qemu-system-riscv64 \
> > >        -M virt,acpi=off -accel tcg -m 4096 \
> > >        -serial mon:stdio \
> > >        -device virtio-gpu-pci \
> > >        -device qemu-xhci \
> > >        -device usb-kbd \
> > >        -drive
> > > if=pflash,format=raw,unit=0,file=RISCV_VIRT_CODE.fd,readonly=on \
> > >        -drive if=pflash,format=raw,unit=1,file=RISCV_VIRT_VARS.fd \
> > >        -drive file=sct.img,format=raw,if=virtio \
> > >        -device virtio-net-device,netdev=net0 \
> > >        -netdev user,id=net0
> > >
> > > This does not invoke any CXL related stuff.
> >
> > Hmm, that doesn't seem like it ought to be running into this.
> > What underlying memory region is the guest trying to do
> > the virtio queue access to?
>
> FWIW, I have seen multiple bounce buffer usage with the generic net TX
> path as well as the XHCI controller, so it might be either of these.
> Bounce buffering should only take place when the memory region can't
> be accessed directly though - I don't see why that's the case for the
> given command line.

Yeah, exactly -- the only thing I can see there where a reasonable
guest program ought to be trying to do DMA to/from non-host-RAM
would be if (a) it's reading data from directly out of the pflash image
and (b) the flash device has got itself into programming mode
somehow. But that doesn't happen for the other architectures that
have pflash for boot firmware as far as I know. The bounce buffering
stuff does need to be fixed but it's also worth looking at why we're
hitting the code path at all here, because even if it works the
performance is terrible (doubly so if you try to execute code out of
that MR).

-- PMM
Heinrich Schuchardt Feb. 29, 2024, 2:17 p.m. UTC | #17
On 29.02.24 13:34, Peter Maydell wrote:
> On Thu, 29 Feb 2024 at 11:17, Heinrich Schuchardt
> <heinrich.schuchardt@canonical.com> wrote:
>>> But yes, I'm not surprised that CXL runs into this. Heinrich,
>>> are you doing CXL testing, or is this some other workload?
>>
>> I am running the UEFI Self-Certification Tests (SCT) on EDK 2 using:
>>
>> qemu-system-riscv64 \
>>         -M virt,acpi=off -accel tcg -m 4096 \
>>         -serial mon:stdio \
>>         -device virtio-gpu-pci \
>>         -device qemu-xhci \
>>         -device usb-kbd \
>>         -drive
>> if=pflash,format=raw,unit=0,file=RISCV_VIRT_CODE.fd,readonly=on \
>>         -drive if=pflash,format=raw,unit=1,file=RISCV_VIRT_VARS.fd \
>>         -drive file=sct.img,format=raw,if=virtio \
>>         -device virtio-net-device,netdev=net0 \
>>         -netdev user,id=net0
>>
>> This does not invoke any CXL related stuff.
> 
> Hmm, that doesn't seem like it ought to be running into this.
> What underlying memory region is the guest trying to do
> the virtio queue access to?
> 
> -- PMM

The error occurs while the SCT is executing function 
BBTestReadBlocksConformanceAutoTest 
(https://github.com/tianocore/edk2-test/blob/cabb98d44be94e7547605435a0be7c4946d10f8b/uefi-sct/SctPkg/TestCase/UEFI/EFI/Protocol/BlockIo/BlackBoxTest/BlockIoBBTestConformance.c#L45)

This code is accessing the drive defined as
-drive file=sct.img,format=raw,if=virtio .

In the conformance test correct error handling for invalid parameters of 
the UEFI block IO protocol is tested. This includes calling the UEFI API to

* read with incorrectly aligned buffers
* read with invalid LBA parameter
* read with buffer size not being a multiple of the sector size

In all these cases the UEFI API implemented by EDK II is expected to 
return an error.

Best regards

Heinrich
Peter Maydell Feb. 29, 2024, 2:52 p.m. UTC | #18
On Thu, 29 Feb 2024 at 14:17, Heinrich Schuchardt
<heinrich.schuchardt@canonical.com> wrote:
>
> On 29.02.24 13:34, Peter Maydell wrote:
> > On Thu, 29 Feb 2024 at 11:17, Heinrich Schuchardt
> > <heinrich.schuchardt@canonical.com> wrote:
> >>> But yes, I'm not surprised that CXL runs into this. Heinrich,
> >>> are you doing CXL testing, or is this some other workload?
> >>
> >> I am running the UEFI Self-Certification Tests (SCT) on EDK 2 using:
> >>
> >> qemu-system-riscv64 \
> >>         -M virt,acpi=off -accel tcg -m 4096 \
> >>         -serial mon:stdio \
> >>         -device virtio-gpu-pci \
> >>         -device qemu-xhci \
> >>         -device usb-kbd \
> >>         -drive
> >> if=pflash,format=raw,unit=0,file=RISCV_VIRT_CODE.fd,readonly=on \
> >>         -drive if=pflash,format=raw,unit=1,file=RISCV_VIRT_VARS.fd \
> >>         -drive file=sct.img,format=raw,if=virtio \
> >>         -device virtio-net-device,netdev=net0 \
> >>         -netdev user,id=net0
> >>
> >> This does not invoke any CXL related stuff.
> >
> > Hmm, that doesn't seem like it ought to be running into this.
> > What underlying memory region is the guest trying to do
> > the virtio queue access to?

> The error occurs while the SCT is executing function
> BBTestReadBlocksConformanceAutoTest
> (https://github.com/tianocore/edk2-test/blob/cabb98d44be94e7547605435a0be7c4946d10f8b/uefi-sct/SctPkg/TestCase/UEFI/EFI/Protocol/BlockIo/BlackBoxTest/BlockIoBBTestConformance.c#L45)
>
> This code is accessing the drive defined as
> -drive file=sct.img,format=raw,if=virtio .
>
> In the conformance test correct error handling for invalid parameters of
> the UEFI block IO protocol is tested. This includes calling the UEFI API to
>
> * read with incorrectly aligned buffers
> * read with invalid LBA parameter
> * read with buffer size not being a multiple of the sector size
>
> In all these cases the UEFI API implemented by EDK II is expected to
> return an error.

OK, but where is it actually reading/writing to ? Even if the
buffer isn't aligned right it ought not to be in a non-host-ram-backed
memory region, I would have thought.

-- PMM
diff mbox series

Patch

diff --git a/system/physmem.c b/system/physmem.c
index e3ebc19eef..3c82da1c86 100644
--- a/system/physmem.c
+++ b/system/physmem.c
@@ -3151,8 +3151,12 @@  void *address_space_map(AddressSpace *as,
             *plen = 0;
             return NULL;
         }
-        /* Avoid unbounded allocations */
-        l = MIN(l, TARGET_PAGE_SIZE);
+        /*
+         * There is only one bounce buffer. The largest occuring value of
+         * parameter sz of virtqueue_map_desc() must fit into the bounce
+         * buffer.
+         */
+        l = MIN(l, 0x10000);
         bounce.buffer = qemu_memalign(TARGET_PAGE_SIZE, l);
         bounce.addr = addr;
         bounce.len = l;