
[V3,15/23] xen/arm: Stick around in leave_hypervisor_to_guest until I/O has completed

Message ID 1606732298-22107-16-git-send-email-olekstysh@gmail.com (mailing list archive)
State New, archived
Series [V3,01/23] x86/ioreq: Prepare IOREQ feature for making it common

Commit Message

Oleksandr Tyshchenko Nov. 30, 2020, 10:31 a.m. UTC
From: Oleksandr Tyshchenko <oleksandr_tyshchenko@epam.com>

This patch adds proper handling of the return value of
vcpu_ioreq_handle_completion(), which involves using a loop
in leave_hypervisor_to_guest().

The reason to use an unbounded loop here is the fact that a vCPU
shouldn't continue until its I/O has completed. In Xen's case, if an I/O
never completes then it most likely means that something went horribly
wrong with the Device Emulator, and it is most likely not safe to
continue. So letting the vCPU spin forever if the I/O never completes
is safer than letting it continue with the guest in an unclear state,
and it is the best we can do for now.

This wouldn't be an issue for Xen as do_softirq() would be called at
every loop iteration. In case of failure, the guest will crash and the
vCPU will be unscheduled.

Signed-off-by: Oleksandr Tyshchenko <oleksandr_tyshchenko@epam.com>
CC: Julien Grall <julien.grall@arm.com>

---
Please note, this is a split/cleanup/hardening of Julien's PoC:
"Add support for Guest IO forwarding to a device emulator"

Changes V1 -> V2:
   - new patch, changes were derived from (+ new explanation):
     arm/ioreq: Introduce arch specific bits for IOREQ/DM features

Changes V2 -> V3:
   - update patch description
---
---
 xen/arch/arm/traps.c | 31 ++++++++++++++++++++++++++-----
 1 file changed, 26 insertions(+), 5 deletions(-)

Comments

Volodymyr Babchuk Nov. 30, 2020, 8:51 p.m. UTC | #1
Hello Oleksandr,

Oleksandr Tyshchenko writes:

> From: Oleksandr Tyshchenko <oleksandr_tyshchenko@epam.com>
>
> This patch adds proper handling of return value of
> vcpu_ioreq_handle_completion() which involves using a loop
> in leave_hypervisor_to_guest().
>
> The reason to use an unbounded loop here is the fact that vCPU
> shouldn't continue until an I/O has completed. In Xen case, if an I/O
> never completes then it most likely means that something went horribly
> wrong with the Device Emulator. And it is most likely not safe to
> continue. So letting the vCPU to spin forever if I/O never completes
> is a safer action than letting it continue and leaving the guest in
> unclear state and is the best what we can do for now.
>
> This wouldn't be an issue for Xen as do_softirq() would be called at
> every loop. In case of failure, the guest will crash and the vCPU
> will be unscheduled.
>

Why don't you block the vCPU there and unblock it when the response is
ready? If I got it right, the "client" vCPU will spin in the loop, eating
its own scheduling budget with no useful work done. In the worst case, it
will prevent the "server" vCPU from running.

> Signed-off-by: Oleksandr Tyshchenko <oleksandr_tyshchenko@epam.com>
> CC: Julien Grall <julien.grall@arm.com>
>
> ---
> Please note, this is a split/cleanup/hardening of Julien's PoC:
> "Add support for Guest IO forwarding to a device emulator"
>
> Changes V1 -> V2:
>    - new patch, changes were derived from (+ new explanation):
>      arm/ioreq: Introduce arch specific bits for IOREQ/DM features
>
> Changes V2 -> V3:
>    - update patch description
> ---
> ---
>  xen/arch/arm/traps.c | 31 ++++++++++++++++++++++++++-----
>  1 file changed, 26 insertions(+), 5 deletions(-)
>
> diff --git a/xen/arch/arm/traps.c b/xen/arch/arm/traps.c
> index 036b13f..4cef43e 100644
> --- a/xen/arch/arm/traps.c
> +++ b/xen/arch/arm/traps.c
> @@ -2257,18 +2257,23 @@ static void check_for_pcpu_work(void)
>   * Process pending work for the vCPU. Any call should be fast or
>   * implement preemption.
>   */
> -static void check_for_vcpu_work(void)
> +static bool check_for_vcpu_work(void)
>  {
>      struct vcpu *v = current;
>  
>  #ifdef CONFIG_IOREQ_SERVER
> +    bool handled;
> +
>      local_irq_enable();
> -    vcpu_ioreq_handle_completion(v);
> +    handled = vcpu_ioreq_handle_completion(v);
>      local_irq_disable();
> +
> +    if ( !handled )
> +        return true;
>  #endif
>  
>      if ( likely(!v->arch.need_flush_to_ram) )
> -        return;
> +        return false;
>  
>      /*
>       * Give a chance for the pCPU to process work before handling the vCPU
> @@ -2279,6 +2284,8 @@ static void check_for_vcpu_work(void)
>      local_irq_enable();
>      p2m_flush_vm(v);
>      local_irq_disable();
> +
> +    return false;
>  }
>  
>  /*
> @@ -2291,8 +2298,22 @@ void leave_hypervisor_to_guest(void)
>  {
>      local_irq_disable();
>  
> -    check_for_vcpu_work();
> -    check_for_pcpu_work();
> +    /*
> +     * The reason to use an unbounded loop here is the fact that vCPU
> +     * shouldn't continue until an I/O has completed. In Xen case, if an I/O
> +     * never completes then it most likely means that something went horribly
> +     * wrong with the Device Emulator. And it is most likely not safe to
> +     * continue. So letting the vCPU to spin forever if I/O never completes
> +     * is a safer action than letting it continue and leaving the guest in
> +     * unclear state and is the best what we can do for now.
> +     *
> +     * This wouldn't be an issue for Xen as do_softirq() would be called at
> +     * every loop. In case of failure, the guest will crash and the vCPU
> +     * will be unscheduled.
> +     */
> +    do {
> +        check_for_pcpu_work();
> +    } while ( check_for_vcpu_work() );
>  
>      vgic_sync_to_lrs();
Julien Grall Dec. 1, 2020, 12:46 p.m. UTC | #2
Hi Volodymyr,

On 30/11/2020 20:51, Volodymyr Babchuk wrote:
> Oleksandr Tyshchenko writes:
> 
>> From: Oleksandr Tyshchenko <oleksandr_tyshchenko@epam.com>
>>
>> This patch adds proper handling of return value of
>> vcpu_ioreq_handle_completion() which involves using a loop
>> in leave_hypervisor_to_guest().
>>
>> The reason to use an unbounded loop here is the fact that vCPU
>> shouldn't continue until an I/O has completed. In Xen case, if an I/O
>> never completes then it most likely means that something went horribly
>> wrong with the Device Emulator. And it is most likely not safe to
>> continue. So letting the vCPU to spin forever if I/O never completes
>> is a safer action than letting it continue and leaving the guest in
>> unclear state and is the best what we can do for now.
>>
>> This wouldn't be an issue for Xen as do_softirq() would be called at
>> every loop. In case of failure, the guest will crash and the vCPU
>> will be unscheduled.
>>
> 
> Why you don't block vcpu there and unblock it when response is ready?

The vCPU will already block while waiting for the event channel. See the
call to wait_for_event_channel() in the ioreq code. However, you can still
receive a spurious unblock (e.g. the event channel notification is
received before the I/O has been handled). So you have to loop around and
check if there is more work to do.
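
To make the wait-and-recheck behaviour concrete, here is a simplified
sketch loosely modelled on the pattern described above. It is illustrative
only, not the actual wait_for_io() implementation (the real code also deals
with memory barriers and further state transitions), and the helper name
wait_for_ioreq_response() is made up; it assumes a Xen-internal context
(e.g. xen/common/ioreq.c) where ioreq_t, the STATE_* constants and
wait_on_xen_event_channel() are in scope:

static bool wait_for_ioreq_response(evtchn_port_t port, ioreq_t *p)
{
    for ( ;; )
    {
        switch ( p->state )
        {
        case STATE_IORESP_READY:
            /* The device emulator has produced a response: we are done. */
            return true;

        case STATE_IOREQ_READY:
        case STATE_IOREQ_INPROCESS:
            /*
             * Not done yet: block until an event channel notification is
             * received. The notification may arrive before the I/O has
             * actually been handled, so loop around and re-check the state.
             */
            wait_on_xen_event_channel(port, p->state == STATE_IORESP_READY);
            break;

        default:
            /* Unexpected state: treat the I/O as failed. */
            return false;
        }
    }
}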

> If
> I got it right, "client" vcpu will spin in the loop, eating own
> scheduling budget with no useful work done.

You can't really do much about that if you have a rogue/buggy device model.

> In the worst case, it will
> prevent "server" vcpu from running.

How so? Xen will raise the schedule softirq if the I/O is handled. You 
would have a pretty buggy system if your "client" vCPU is considered to 
be a much higher priority than your "server" vCPU.

Cheers,
Stefano Stabellini Dec. 9, 2020, 11:18 p.m. UTC | #3
On Mon, 30 Nov 2020, Oleksandr Tyshchenko wrote:
> From: Oleksandr Tyshchenko <oleksandr_tyshchenko@epam.com>
> 
> This patch adds proper handling of return value of
> vcpu_ioreq_handle_completion() which involves using a loop
> in leave_hypervisor_to_guest().
> 
> The reason to use an unbounded loop here is the fact that vCPU
> shouldn't continue until an I/O has completed. In Xen case, if an I/O
> never completes then it most likely means that something went horribly
> wrong with the Device Emulator. And it is most likely not safe to
> continue. So letting the vCPU to spin forever if I/O never completes
> is a safer action than letting it continue and leaving the guest in
> unclear state and is the best what we can do for now.
> 
> This wouldn't be an issue for Xen as do_softirq() would be called at
> every loop. In case of failure, the guest will crash and the vCPU
> will be unscheduled.

Imagine that we have two guests: one that requires an ioreq server and
one that doesn't. If I am not mistaken this loop could potentially spin
forever on a pcpu, thus preventing any other guest being scheduled, even
if the other guest doesn't need any ioreq servers.


My other concern is that we are busy-looping. Could we call something
like wfi() or do_idle() instead? The ioreq server event notification of
completion should wake us up?

Following this line of thinking, I am wondering if instead of the
busy-loop we should call vcpu_block_unless_event_pending(current) in
try_handle_mmio if IO_RETRY. Then when the emulation is done, QEMU (or
equivalent) calls xenevtchn_notify which ends up waking up the domU
vcpu. Would that work?
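
For concreteness, below is a purely hypothetical sketch of what that
alternative might look like in the data abort path. The surrounding
function signature and variable names are assumed for illustration only,
and, as Julien explains further down, blocking at this point turns out
not to be the right fit:

/*
 * Hypothetical sketch of the suggestion above; NOT actual Xen code.
 * It assumes the context of xen/arch/arm/traps.c, where try_handle_mmio(),
 * advance_pc() and vcpu_block_unless_event_pending() are in scope.
 */
static void handle_mmio_abort(struct cpu_user_regs *regs,
                              const union hsr hsr, paddr_t gpa)
{
    enum io_state state = try_handle_mmio(regs, hsr, gpa);

    switch ( state )
    {
    case IO_RETRY:
        /*
         * The access has been forwarded to an ioreq server (e.g. QEMU).
         * Block the vCPU until an event, such as the emulator's completion
         * notification sent via xenevtchn_notify, is pending.
         */
        vcpu_block_unless_event_pending(current);
        break;

    case IO_HANDLED:
        /* The access was emulated within Xen: advance the PC and continue. */
        advance_pc(regs, hsr);
        break;

    default:
        /* IO_ABORT / IO_UNHANDLED: let the caller inject a fault. */
        break;
    }
}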



> Signed-off-by: Oleksandr Tyshchenko <oleksandr_tyshchenko@epam.com>
> CC: Julien Grall <julien.grall@arm.com>
> 
> ---
> Please note, this is a split/cleanup/hardening of Julien's PoC:
> "Add support for Guest IO forwarding to a device emulator"
> 
> Changes V1 -> V2:
>    - new patch, changes were derived from (+ new explanation):
>      arm/ioreq: Introduce arch specific bits for IOREQ/DM features
> 
> Changes V2 -> V3:
>    - update patch description
> ---
> ---
>  xen/arch/arm/traps.c | 31 ++++++++++++++++++++++++++-----
>  1 file changed, 26 insertions(+), 5 deletions(-)
> 
> diff --git a/xen/arch/arm/traps.c b/xen/arch/arm/traps.c
> index 036b13f..4cef43e 100644
> --- a/xen/arch/arm/traps.c
> +++ b/xen/arch/arm/traps.c
> @@ -2257,18 +2257,23 @@ static void check_for_pcpu_work(void)
>   * Process pending work for the vCPU. Any call should be fast or
>   * implement preemption.
>   */
> -static void check_for_vcpu_work(void)
> +static bool check_for_vcpu_work(void)
>  {
>      struct vcpu *v = current;
>  
>  #ifdef CONFIG_IOREQ_SERVER
> +    bool handled;
> +
>      local_irq_enable();
> -    vcpu_ioreq_handle_completion(v);
> +    handled = vcpu_ioreq_handle_completion(v);
>      local_irq_disable();
> +
> +    if ( !handled )
> +        return true;
>  #endif
>  
>      if ( likely(!v->arch.need_flush_to_ram) )
> -        return;
> +        return false;
>  
>      /*
>       * Give a chance for the pCPU to process work before handling the vCPU
> @@ -2279,6 +2284,8 @@ static void check_for_vcpu_work(void)
>      local_irq_enable();
>      p2m_flush_vm(v);
>      local_irq_disable();
> +
> +    return false;
>  }
>  
>  /*
> @@ -2291,8 +2298,22 @@ void leave_hypervisor_to_guest(void)
>  {
>      local_irq_disable();
>  
> -    check_for_vcpu_work();
> -    check_for_pcpu_work();
> +    /*
> +     * The reason to use an unbounded loop here is the fact that vCPU
> +     * shouldn't continue until an I/O has completed. In Xen case, if an I/O
> +     * never completes then it most likely means that something went horribly
> +     * wrong with the Device Emulator. And it is most likely not safe to
> +     * continue. So letting the vCPU to spin forever if I/O never completes
> +     * is a safer action than letting it continue and leaving the guest in
> +     * unclear state and is the best what we can do for now.
> +     *
> +     * This wouldn't be an issue for Xen as do_softirq() would be called at
> +     * every loop. In case of failure, the guest will crash and the vCPU
> +     * will be unscheduled.
> +     */
> +    do {
> +        check_for_pcpu_work();
> +    } while ( check_for_vcpu_work() );
>  
>      vgic_sync_to_lrs();
>  
> -- 
> 2.7.4
>
Stefano Stabellini Dec. 9, 2020, 11:35 p.m. UTC | #4
On Wed, 9 Dec 2020, Stefano Stabellini wrote:
> On Mon, 30 Nov 2020, Oleksandr Tyshchenko wrote:
> > From: Oleksandr Tyshchenko <oleksandr_tyshchenko@epam.com>
> > 
> > This patch adds proper handling of return value of
> > vcpu_ioreq_handle_completion() which involves using a loop
> > in leave_hypervisor_to_guest().
> > 
> > The reason to use an unbounded loop here is the fact that vCPU
> > shouldn't continue until an I/O has completed. In Xen case, if an I/O
> > never completes then it most likely means that something went horribly
> > wrong with the Device Emulator. And it is most likely not safe to
> > continue. So letting the vCPU to spin forever if I/O never completes
> > is a safer action than letting it continue and leaving the guest in
> > unclear state and is the best what we can do for now.
> > 
> > This wouldn't be an issue for Xen as do_softirq() would be called at
> > every loop. In case of failure, the guest will crash and the vCPU
> > will be unscheduled.
> 
> Imagine that we have two guests: one that requires an ioreq server and
> one that doesn't. If I am not mistaken this loop could potentially spin
> forever on a pcpu, thus preventing any other guest being scheduled, even
> if the other guest doesn't need any ioreq servers.
> 
> 
> My other concern is that we are busy-looping. Could we call something
> like wfi() or do_idle() instead? The ioreq server event notification of
> completion should wake us up?
> 
> Following this line of thinking, I am wondering if instead of the
> busy-loop we should call vcpu_block_unless_event_pending(current) in
> try_handle_mmio if IO_RETRY. Then when the emulation is done, QEMU (or
> equivalent) calls xenevtchn_notify which ends up waking up the domU
> vcpu. Would that work?

I have now read Julien's reply: we are already doing something similar to
what I suggested, with the following call chain:

check_for_vcpu_work -> vcpu_ioreq_handle_completion -> wait_for_io -> wait_on_xen_event_channel

So the busy-loop here is only a safety-belt in case of a spurious
wake-up, in which case we are going to call check_for_vcpu_work again,
potentially causing a guest reschedule.

Then, this is fine and addresses both my concerns. Maybe let's add a note
in the commit message about it.


I am also wondering if there is any benefit in calling wait_for_io()
earlier, maybe from try_handle_mmio if IO_RETRY?
leave_hypervisor_to_guest is very late for that. In any case, it is not
an important optimization (if it is even an optimization at all) so it
is fine to leave it as is.
Julien Grall Dec. 9, 2020, 11:38 p.m. UTC | #5
Hi Stefano,

On 09/12/2020 23:18, Stefano Stabellini wrote:
> On Mon, 30 Nov 2020, Oleksandr Tyshchenko wrote:
>> From: Oleksandr Tyshchenko <oleksandr_tyshchenko@epam.com>
>>
>> This patch adds proper handling of return value of
>> vcpu_ioreq_handle_completion() which involves using a loop
>> in leave_hypervisor_to_guest().
>>
>> The reason to use an unbounded loop here is the fact that vCPU
>> shouldn't continue until an I/O has completed. In Xen case, if an I/O
>> never completes then it most likely means that something went horribly
>> wrong with the Device Emulator. And it is most likely not safe to
>> continue. So letting the vCPU to spin forever if I/O never completes
>> is a safer action than letting it continue and leaving the guest in
>> unclear state and is the best what we can do for now.
>>
>> This wouldn't be an issue for Xen as do_softirq() would be called at
>> every loop. In case of failure, the guest will crash and the vCPU
>> will be unscheduled.
>
> Imagine that we have two guests: one that requires an ioreq server and
> one that doesn't. If I am not mistaken this loop could potentially spin
> forever on a pcpu, thus preventing any other guest being scheduled, even
> if the other guest doesn't need any ioreq servers.

That's not correct. At every loop iteration we will call
check_for_pcpu_work(), which will process pending softirqs. If
rescheduling is necessary (it might be requested by a timer or by a caller
in check_for_vcpu_work()), then the vCPU will be descheduled to make room
for someone else.
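
For reference, here is the loop added by the patch, annotated with where
descheduling and blocking actually happen. The do/while itself is taken
from the patch below; the comments are only illustrative and not part of
it:

    do {
        /*
         * Processes pending softirqs. If SCHEDULE_SOFTIRQ has been raised,
         * e.g. by a timer or by the completion path, the vCPU is
         * descheduled here, so other guests are not starved.
         */
        check_for_pcpu_work();

        /*
         * Returns true only while the I/O has not completed yet, e.g. after
         * a spurious event channel wake-up. The actual blocking happens
         * deeper in the call chain:
         * vcpu_ioreq_handle_completion() -> wait_for_io()
         *                                -> wait_on_xen_event_channel()
         */
    } while ( check_for_vcpu_work() );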

> 
> 
> My other concern is that we are busy-looping. Could we call something
> like wfi() or do_idle() instead? The ioreq server event notification of
> completion should wake us up?

There is no busy loop here. If the IOREQ server has not yet handled the
I/O, we will block the vCPU until an event notification is received (see
the call to wait_on_xen_event_channel()).

This loop makes sure that all the vCPU work is done before we return to
the guest.

The worst that can happen here is that the vCPU will never run again if
the IOREQ server is being naughty.

> 
> Following this line of thinking, I am wondering if instead of the
> busy-loop we should call vcpu_block_unless_event_pending(current) in
> try_handle_mmio if IO_RETRY. Then when the emulation is done, QEMU (or
> equivalent) calls xenevtchn_notify which ends up waking up the domU
> vcpu. Would that work?

vcpu_block_unless_event_pending() will not block if there are interrupts 
pending. However, here we really want to block until the I/O has been 
completed. So vcpu_block_unless_event_pending() is not the right approach.

The IOREQ code is using wait_on_xen_event_channel(). Yet, this can still 
"exit" early if an event has been received. But this doesn't mean the 
I/O has completed. So we need to check if the I/O has completed and wait 
again if it hasn't.

I seem to keep having to explain how the code works. So maybe we want to 
update the commit message with more details?

Cheers,
Julien Grall Dec. 9, 2020, 11:47 p.m. UTC | #6
On 09/12/2020 23:35, Stefano Stabellini wrote:
> On Wed, 9 Dec 2020, Stefano Stabellini wrote:
>> On Mon, 30 Nov 2020, Oleksandr Tyshchenko wrote:
>>> From: Oleksandr Tyshchenko <oleksandr_tyshchenko@epam.com>
>>>
>>> This patch adds proper handling of return value of
>>> vcpu_ioreq_handle_completion() which involves using a loop
>>> in leave_hypervisor_to_guest().
>>>
>>> The reason to use an unbounded loop here is the fact that vCPU
>>> shouldn't continue until an I/O has completed. In Xen case, if an I/O
>>> never completes then it most likely means that something went horribly
>>> wrong with the Device Emulator. And it is most likely not safe to
>>> continue. So letting the vCPU to spin forever if I/O never completes
>>> is a safer action than letting it continue and leaving the guest in
>>> unclear state and is the best what we can do for now.
>>>
>>> This wouldn't be an issue for Xen as do_softirq() would be called at
>>> every loop. In case of failure, the guest will crash and the vCPU
>>> will be unscheduled.
>>
>> Imagine that we have two guests: one that requires an ioreq server and
>> one that doesn't. If I am not mistaken this loop could potentially spin
>> forever on a pcpu, thus preventing any other guest being scheduled, even
>> if the other guest doesn't need any ioreq servers.
>>
>>
>> My other concern is that we are busy-looping. Could we call something
>> like wfi() or do_idle() instead? The ioreq server event notification of
>> completion should wake us up?
>>
>> Following this line of thinking, I am wondering if instead of the
>> busy-loop we should call vcpu_block_unless_event_pending(current) in
>> try_handle_mmio if IO_RETRY. Then when the emulation is done, QEMU (or
>> equivalent) calls xenevtchn_notify which ends up waking up the domU
>> vcpu. Would that work?
> 
> I read now Julien's reply: we are already doing something similar to
> what I suggested with the following call chain:
> 
> check_for_vcpu_work -> vcpu_ioreq_handle_completion -> wait_for_io -> wait_on_xen_event_channel
> 
> So the busy-loop here is only a safety-belt in cause of a spurious
> wake-up, in which case we are going to call again check_for_vcpu_work,
> potentially causing a guest reschedule.
> 
> Then, this is fine and addresses both my concerns. Maybe let's add a note
> in the commit message about it.

Damn, I hit the "send" button just a second before seeing your reply. :/
Oh well. I suggested the same because I have seen the same question
multiple times.

> 
> 
> I am also wondering if there is any benefit in calling wait_for_io()
> earlier, maybe from try_handle_mmio if IO_RETRY?

wait_for_io() may end up descheduling the vCPU. I would like to avoid
this happening in the middle of the I/O emulation because it needs to
happen without any lock held at all.

I don't think there are locks involved today, but the deeper in the call
stack the scheduling happens, the more chance we have to screw up in the
future.

However...

> leave_hypervisor_to_guest is very late for that.

... I am not sure what the problem is with that. The IOREQ server will be
notified of the pending I/O as soon as try_handle_mmio() puts the I/O in
the shared page.

If the IOREQ server is running on a different pCPU, then it is possible
that the I/O has already completed before we reach
leave_hypervisor_to_guest(). In this case, we would not have to wait for
the I/O.

Cheers,
Stefano Stabellini Dec. 10, 2020, 2:30 a.m. UTC | #7
On Wed, 9 Dec 2020, Julien Grall wrote:
> On 09/12/2020 23:35, Stefano Stabellini wrote:
> > On Wed, 9 Dec 2020, Stefano Stabellini wrote:
> > > On Mon, 30 Nov 2020, Oleksandr Tyshchenko wrote:
> > > > From: Oleksandr Tyshchenko <oleksandr_tyshchenko@epam.com>
> > > > 
> > > > This patch adds proper handling of return value of
> > > > vcpu_ioreq_handle_completion() which involves using a loop
> > > > in leave_hypervisor_to_guest().
> > > > 
> > > > The reason to use an unbounded loop here is the fact that vCPU
> > > > shouldn't continue until an I/O has completed. In Xen case, if an I/O
> > > > never completes then it most likely means that something went horribly
> > > > wrong with the Device Emulator. And it is most likely not safe to
> > > > continue. So letting the vCPU to spin forever if I/O never completes
> > > > is a safer action than letting it continue and leaving the guest in
> > > > unclear state and is the best what we can do for now.
> > > > 
> > > > This wouldn't be an issue for Xen as do_softirq() would be called at
> > > > every loop. In case of failure, the guest will crash and the vCPU
> > > > will be unscheduled.
> > > 
> > > Imagine that we have two guests: one that requires an ioreq server and
> > > one that doesn't. If I am not mistaken this loop could potentially spin
> > > forever on a pcpu, thus preventing any other guest being scheduled, even
> > > if the other guest doesn't need any ioreq servers.
> > > 
> > > 
> > > My other concern is that we are busy-looping. Could we call something
> > > like wfi() or do_idle() instead? The ioreq server event notification of
> > > completion should wake us up?
> > > 
> > > Following this line of thinking, I am wondering if instead of the
> > > busy-loop we should call vcpu_block_unless_event_pending(current) in
> > > try_handle_mmio if IO_RETRY. Then when the emulation is done, QEMU (or
> > > equivalent) calls xenevtchn_notify which ends up waking up the domU
> > > vcpu. Would that work?
> > 
> > I read now Julien's reply: we are already doing something similar to
> > what I suggested with the following call chain:
> > 
> > check_for_vcpu_work -> vcpu_ioreq_handle_completion -> wait_for_io ->
> > wait_on_xen_event_channel
> > 
> > So the busy-loop here is only a safety-belt in cause of a spurious
> > wake-up, in which case we are going to call again check_for_vcpu_work,
> > potentially causing a guest reschedule.
> > 
> > Then, this is fine and addresses both my concerns. Maybe let's add a note
> > in the commit message about it.
> 
> Damm, I hit the "sent" button just a second before seen your reply. :/ Oh
> well. I suggested the same because I have seen the same question multiple
> time.

:-)

 
> > I am also wondering if there is any benefit in calling wait_for_io()
> > earlier, maybe from try_handle_mmio if IO_RETRY?
> 
> wait_for_io() may end up to deschedule the vCPU. I would like to avoid this to
> happen in the middle of the I/O emulation because we need to happen it without
> lock held at all.
> 
> I don't think there are locks involved today, but the deeper in the call stack
> the scheduling happens, the more chance we may screw up in the future.
> 
> However...
> 
> > leave_hypervisor_to_guest is very late for that.
> 
> ... I am not sure what's the problem with that. The IOREQ will be notified of
> the pending I/O as soon as try_handle_mmio() put the I/O in the shared page.
> 
> If the IOREQ server is running on a different pCPU, then it might be possible
> that the I/O has completed before reached leave_hypervisor_to_guest(). In this
> case, we would not have to wait for the I/O.

Yeah, I was thinking about that too. Actually it could be faster this
way if we end up being lucky.

The reason for moving it earlier would be that by the time
leave_hypervisor_to_guest is called "Xen" has already decided to
continue running this particular vcpu. If we called wait_for_io()
earlier, we would give important information to the scheduler before any
decision is made. This is more "philosophical" than practical though.
Let's leave it as is.
Julien Grall Dec. 10, 2020, 1:17 p.m. UTC | #8
Hi Stefano,

On 10/12/2020 02:30, Stefano Stabellini wrote:
>>> I am also wondering if there is any benefit in calling wait_for_io()
>>> earlier, maybe from try_handle_mmio if IO_RETRY?
>>
>> wait_for_io() may end up to deschedule the vCPU. I would like to avoid this to
>> happen in the middle of the I/O emulation because we need to happen it without
>> lock held at all.
>>
>> I don't think there are locks involved today, but the deeper in the call stack
>> the scheduling happens, the more chance we may screw up in the future.
>>
>> However...
>>
>>> leave_hypervisor_to_guest is very late for that.
>>
>> ... I am not sure what's the problem with that. The IOREQ will be notified of
>> the pending I/O as soon as try_handle_mmio() put the I/O in the shared page.
>>
>> If the IOREQ server is running on a different pCPU, then it might be possible
>> that the I/O has completed before reached leave_hypervisor_to_guest(). In this
>> case, we would not have to wait for the I/O.
> 
> Yeah, I was thinking about that too. Actually it could be faster
> this way we end up being lucky.
> 
> The reason for moving it earlier would be that by the time
> leave_hypervisor_to_guest is called "Xen" has already decided to
> continue running this particular vcpu. If we called wait_for_io()
> earlier, we would give important information to the scheduler before any
> decision is made.

I don't understand this. Xen preemption is voluntary, so the scheduler 
is not going to run unless requested.

wait_for_io() is a preemption point. So if you call it, then the vCPU may
get descheduled at that point.

Why would we want to do this? What is the benefit here?

Cheers,
Oleksandr Tyshchenko Dec. 10, 2020, 1:21 p.m. UTC | #9
On 10.12.20 04:30, Stefano Stabellini wrote:

Hi Julien, Stefano

> On Wed, 9 Dec 2020, Julien Grall wrote:
>> On 09/12/2020 23:35, Stefano Stabellini wrote:
>>> On Wed, 9 Dec 2020, Stefano Stabellini wrote:
>>>> On Mon, 30 Nov 2020, Oleksandr Tyshchenko wrote:
>>>>> From: Oleksandr Tyshchenko <oleksandr_tyshchenko@epam.com>
>>>>>
>>>>> This patch adds proper handling of return value of
>>>>> vcpu_ioreq_handle_completion() which involves using a loop
>>>>> in leave_hypervisor_to_guest().
>>>>>
>>>>> The reason to use an unbounded loop here is the fact that vCPU
>>>>> shouldn't continue until an I/O has completed. In Xen case, if an I/O
>>>>> never completes then it most likely means that something went horribly
>>>>> wrong with the Device Emulator. And it is most likely not safe to
>>>>> continue. So letting the vCPU to spin forever if I/O never completes
>>>>> is a safer action than letting it continue and leaving the guest in
>>>>> unclear state and is the best what we can do for now.
>>>>>
>>>>> This wouldn't be an issue for Xen as do_softirq() would be called at
>>>>> every loop. In case of failure, the guest will crash and the vCPU
>>>>> will be unscheduled.
>>>> Imagine that we have two guests: one that requires an ioreq server and
>>>> one that doesn't. If I am not mistaken this loop could potentially spin
>>>> forever on a pcpu, thus preventing any other guest being scheduled, even
>>>> if the other guest doesn't need any ioreq servers.
>>>>
>>>>
>>>> My other concern is that we are busy-looping. Could we call something
>>>> like wfi() or do_idle() instead? The ioreq server event notification of
>>>> completion should wake us up?
>>>>
>>>> Following this line of thinking, I am wondering if instead of the
>>>> busy-loop we should call vcpu_block_unless_event_pending(current) in
>>>> try_handle_mmio if IO_RETRY. Then when the emulation is done, QEMU (or
>>>> equivalent) calls xenevtchn_notify which ends up waking up the domU
>>>> vcpu. Would that work?
>>> I read now Julien's reply: we are already doing something similar to
>>> what I suggested with the following call chain:
>>>
>>> check_for_vcpu_work -> vcpu_ioreq_handle_completion -> wait_for_io ->
>>> wait_on_xen_event_channel
>>>
>>> So the busy-loop here is only a safety-belt in cause of a spurious
>>> wake-up, in which case we are going to call again check_for_vcpu_work,
>>> potentially causing a guest reschedule.
>>>
>>> Then, this is fine and addresses both my concerns. Maybe let's add a note
>>> in the commit message about it.
>> Damm, I hit the "sent" button just a second before seen your reply. :/ Oh
>> well. I suggested the same because I have seen the same question multiple
>> time.


I will update the commit description and probably the comment in the code.

Patch

diff --git a/xen/arch/arm/traps.c b/xen/arch/arm/traps.c
index 036b13f..4cef43e 100644
--- a/xen/arch/arm/traps.c
+++ b/xen/arch/arm/traps.c
@@ -2257,18 +2257,23 @@  static void check_for_pcpu_work(void)
  * Process pending work for the vCPU. Any call should be fast or
  * implement preemption.
  */
-static void check_for_vcpu_work(void)
+static bool check_for_vcpu_work(void)
 {
     struct vcpu *v = current;
 
 #ifdef CONFIG_IOREQ_SERVER
+    bool handled;
+
     local_irq_enable();
-    vcpu_ioreq_handle_completion(v);
+    handled = vcpu_ioreq_handle_completion(v);
     local_irq_disable();
+
+    if ( !handled )
+        return true;
 #endif
 
     if ( likely(!v->arch.need_flush_to_ram) )
-        return;
+        return false;
 
     /*
      * Give a chance for the pCPU to process work before handling the vCPU
@@ -2279,6 +2284,8 @@  static void check_for_vcpu_work(void)
     local_irq_enable();
     p2m_flush_vm(v);
     local_irq_disable();
+
+    return false;
 }
 
 /*
@@ -2291,8 +2298,22 @@  void leave_hypervisor_to_guest(void)
 {
     local_irq_disable();
 
-    check_for_vcpu_work();
-    check_for_pcpu_work();
+    /*
+     * The reason to use an unbounded loop here is the fact that vCPU
+     * shouldn't continue until an I/O has completed. In Xen case, if an I/O
+     * never completes then it most likely means that something went horribly
+     * wrong with the Device Emulator. And it is most likely not safe to
+     * continue. So letting the vCPU to spin forever if I/O never completes
+     * is a safer action than letting it continue and leaving the guest in
+     * unclear state and is the best what we can do for now.
+     *
+     * This wouldn't be an issue for Xen as do_softirq() would be called at
+     * every loop. In case of failure, the guest will crash and the vCPU
+     * will be unscheduled.
+     */
+    do {
+        check_for_pcpu_work();
+    } while ( check_for_vcpu_work() );
 
     vgic_sync_to_lrs();