diff mbox

[v5,11/12] vfio: device may stuck in D3 when doing aer recovery

Message ID 1458727927-15082-12-git-send-email-caoj.fnst@cn.fujitsu.com (mailing list archive)
State New, archived
Headers show

Commit Message

Cao jin March 23, 2016, 10:12 a.m. UTC
From: Chen Fan <chen.fan.fnst@cn.fujitsu.com>

when a physical device aer occurred, the device state probably
is not in D0 in a short time, if we recover the device quickly.
we may stuck in D3 state when force to change device state to D0.
we may need to wait for a short time to inject the error to guest.

Signed-off-by: Chen Fan <chen.fan.fnst@cn.fujitsu.com>
---
 hw/vfio/pci.c | 3 +++
 1 file changed, 3 insertions(+)

Comments

Alex Williamson March 24, 2016, 10:54 p.m. UTC | #1
On Wed, 23 Mar 2016 18:12:06 +0800
Cao jin <caoj.fnst@cn.fujitsu.com> wrote:

> From: Chen Fan <chen.fan.fnst@cn.fujitsu.com>
> 
> when a physical device aer occurred, the device state probably
> is not in D0 in a short time, if we recover the device quickly.
> we may stuck in D3 state when force to change device state to D0.
> we may need to wait for a short time to inject the error to guest.
> 
> Signed-off-by: Chen Fan <chen.fan.fnst@cn.fujitsu.com>
> ---
>  hw/vfio/pci.c | 3 +++
>  1 file changed, 3 insertions(+)
> 
> diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
> index 25fc095..5216e7f 100644
> --- a/hw/vfio/pci.c
> +++ b/hw/vfio/pci.c
> @@ -2658,6 +2658,9 @@ static void vfio_err_notifier_handler(void *opaque)
>          msg.severity = isfatal ? PCI_ERR_ROOT_CMD_FATAL_EN :
>                                   PCI_ERR_ROOT_CMD_NONFATAL_EN;
>  
> +        /* wait a bit to ensure aer device is ready */
> +        usleep(2 * 1000);

Where does this number come from?  Why would the device be in D3?  I
don't understand this at all.

> +
>          pcie_aer_msg(dev, &msg);
>          return;
>      }
Chen Fan March 25, 2016, 1:38 a.m. UTC | #2
On 03/25/2016 06:54 AM, Alex Williamson wrote:
> On Wed, 23 Mar 2016 18:12:06 +0800
> Cao jin <caoj.fnst@cn.fujitsu.com> wrote:
>
>> From: Chen Fan <chen.fan.fnst@cn.fujitsu.com>
>>
>> when a physical device aer occurred, the device state probably
>> is not in D0 in a short time, if we recover the device quickly.
>> we may stuck in D3 state when force to change device state to D0.
>> we may need to wait for a short time to inject the error to guest.
>>
>> Signed-off-by: Chen Fan <chen.fan.fnst@cn.fujitsu.com>
>> ---
>>   hw/vfio/pci.c | 3 +++
>>   1 file changed, 3 insertions(+)
>>
>> diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
>> index 25fc095..5216e7f 100644
>> --- a/hw/vfio/pci.c
>> +++ b/hw/vfio/pci.c
>> @@ -2658,6 +2658,9 @@ static void vfio_err_notifier_handler(void *opaque)
>>           msg.severity = isfatal ? PCI_ERR_ROOT_CMD_FATAL_EN :
>>                                    PCI_ERR_ROOT_CMD_NONFATAL_EN;
>>   
>> +        /* wait a bit to ensure aer device is ready */
>> +        usleep(2 * 1000);
> Where does this number come from?  Why would the device be in D3?  I
> don't understand this at all.
Hi Alex,

     when I tested the code in my environment, I found that when I used
the aer-inject module to inject a fake aer error to device on host, the qemu
would throw out the message "vfio: Unable to power on device, stuck in D3"
on and off. if I use "gdb" to debug the vfio_pci_pre_reset, the phenomenon
would not appearance, I just thought it should be some timing race issue,
so I use a sleep() to wait 2ms (double the reset time of 1ms) to ensure the
device state is ready. maybe the root reason still need to be 
investigated deeply.

Thanks,
Chen


>
>> +
>>           pcie_aer_msg(dev, &msg);
>>           return;
>>       }
>
>
> .
>
Alex Williamson March 25, 2016, 2:22 a.m. UTC | #3
On Fri, 25 Mar 2016 09:38:09 +0800
Chen Fan <chen.fan.fnst@cn.fujitsu.com> wrote:

> On 03/25/2016 06:54 AM, Alex Williamson wrote:
> > On Wed, 23 Mar 2016 18:12:06 +0800
> > Cao jin <caoj.fnst@cn.fujitsu.com> wrote:
> >  
> >> From: Chen Fan <chen.fan.fnst@cn.fujitsu.com>
> >>
> >> when a physical device aer occurred, the device state probably
> >> is not in D0 in a short time, if we recover the device quickly.
> >> we may stuck in D3 state when force to change device state to D0.
> >> we may need to wait for a short time to inject the error to guest.
> >>
> >> Signed-off-by: Chen Fan <chen.fan.fnst@cn.fujitsu.com>
> >> ---
> >>   hw/vfio/pci.c | 3 +++
> >>   1 file changed, 3 insertions(+)
> >>
> >> diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
> >> index 25fc095..5216e7f 100644
> >> --- a/hw/vfio/pci.c
> >> +++ b/hw/vfio/pci.c
> >> @@ -2658,6 +2658,9 @@ static void vfio_err_notifier_handler(void *opaque)
> >>           msg.severity = isfatal ? PCI_ERR_ROOT_CMD_FATAL_EN :
> >>                                    PCI_ERR_ROOT_CMD_NONFATAL_EN;
> >>   
> >> +        /* wait a bit to ensure aer device is ready */
> >> +        usleep(2 * 1000);  
> > Where does this number come from?  Why would the device be in D3?  I
> > don't understand this at all.  
> Hi Alex,
> 
>      when I tested the code in my environment, I found that when I used
> the aer-inject module to inject a fake aer error to device on host, the qemu
> would throw out the message "vfio: Unable to power on device, stuck in D3"
> on and off. if I use "gdb" to debug the vfio_pci_pre_reset, the phenomenon
> would not appearance, I just thought it should be some timing race issue,
> so I use a sleep() to wait 2ms (double the reset time of 1ms) to ensure the
> device state is ready. maybe the root reason still need to be 
> investigated deeply.

Yes, it sounds like you need to investigate this further, the delay is
arbitrary and perhaps suggests a race that needs to be fixed
correctly.  Thanks,

Alex
Chen Fan March 31, 2016, 6:55 a.m. UTC | #4
On 03/25/2016 10:22 AM, Alex Williamson wrote:
> On Fri, 25 Mar 2016 09:38:09 +0800
> Chen Fan <chen.fan.fnst@cn.fujitsu.com> wrote:
>
>> On 03/25/2016 06:54 AM, Alex Williamson wrote:
>>> On Wed, 23 Mar 2016 18:12:06 +0800
>>> Cao jin <caoj.fnst@cn.fujitsu.com> wrote:
>>>   
>>>> From: Chen Fan <chen.fan.fnst@cn.fujitsu.com>
>>>>
>>>> when a physical device aer occurred, the device state probably
>>>> is not in D0 in a short time, if we recover the device quickly.
>>>> we may stuck in D3 state when force to change device state to D0.
>>>> we may need to wait for a short time to inject the error to guest.
>>>>
>>>> Signed-off-by: Chen Fan <chen.fan.fnst@cn.fujitsu.com>
>>>> ---
>>>>    hw/vfio/pci.c | 3 +++
>>>>    1 file changed, 3 insertions(+)
>>>>
>>>> diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
>>>> index 25fc095..5216e7f 100644
>>>> --- a/hw/vfio/pci.c
>>>> +++ b/hw/vfio/pci.c
>>>> @@ -2658,6 +2658,9 @@ static void vfio_err_notifier_handler(void *opaque)
>>>>            msg.severity = isfatal ? PCI_ERR_ROOT_CMD_FATAL_EN :
>>>>                                     PCI_ERR_ROOT_CMD_NONFATAL_EN;
>>>>    
>>>> +        /* wait a bit to ensure aer device is ready */
>>>> +        usleep(2 * 1000);
>>> Where does this number come from?  Why would the device be in D3?  I
>>> don't understand this at all.
>> Hi Alex,
>>
>>       when I tested the code in my environment, I found that when I used
>> the aer-inject module to inject a fake aer error to device on host, the qemu
>> would throw out the message "vfio: Unable to power on device, stuck in D3"
>> on and off. if I use "gdb" to debug the vfio_pci_pre_reset, the phenomenon
>> would not appearance, I just thought it should be some timing race issue,
>> so I use a sleep() to wait 2ms (double the reset time of 1ms) to ensure the
>> device state is ready. maybe the root reason still need to be
>> investigated deeply.
> Yes, it sounds like you need to investigate this further, the delay is
> arbitrary and perhaps suggests a race that needs to be fixed
> correctly.  Thanks,
Hi Alex,

     after done some investigation of the problem, I found that only 
when the injected
  error is fatal, the problem will appear. because in aer do_recovery, 
host will call reset_link
on the root port, which would invoke pci_reset_bridge_secondary_bus in 
aer_root_reset,
that would reset the bridge and all the device under that. so when qemu 
receive the aer
notification, then propagate the error to guest, guest does the same way 
to perform the
recovery, if the guest `reset_link` that will call the vfio_pre_reset 
done at the stage of host
bridge reset, the device status would probable stick in D3.

so I think after qemu receive the aer notification, we should wait for 
enough time to
ensure the bridge has been reset completely. I just use sleep <=10ms to 
test the code,
seems still appear the message "vfio: Unable to power on device, stuck 
in D3". so I think
we should sleep 100ms to ensure the delay sufficient. I have tested that 
code 100+ times
by inject aer error. the issue no longer appears.

Thanks,
Chen

>
> Alex
>
>
> .
>
Alex Williamson March 31, 2016, 3:44 p.m. UTC | #5
On Thu, 31 Mar 2016 14:55:07 +0800
Chen Fan <chen.fan.fnst@cn.fujitsu.com> wrote:

> On 03/25/2016 10:22 AM, Alex Williamson wrote:
> > On Fri, 25 Mar 2016 09:38:09 +0800
> > Chen Fan <chen.fan.fnst@cn.fujitsu.com> wrote:
> >  
> >> On 03/25/2016 06:54 AM, Alex Williamson wrote:  
> >>> On Wed, 23 Mar 2016 18:12:06 +0800
> >>> Cao jin <caoj.fnst@cn.fujitsu.com> wrote:
> >>>     
> >>>> From: Chen Fan <chen.fan.fnst@cn.fujitsu.com>
> >>>>
> >>>> when a physical device aer occurred, the device state probably
> >>>> is not in D0 in a short time, if we recover the device quickly.
> >>>> we may stuck in D3 state when force to change device state to D0.
> >>>> we may need to wait for a short time to inject the error to guest.
> >>>>
> >>>> Signed-off-by: Chen Fan <chen.fan.fnst@cn.fujitsu.com>
> >>>> ---
> >>>>    hw/vfio/pci.c | 3 +++
> >>>>    1 file changed, 3 insertions(+)
> >>>>
> >>>> diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
> >>>> index 25fc095..5216e7f 100644
> >>>> --- a/hw/vfio/pci.c
> >>>> +++ b/hw/vfio/pci.c
> >>>> @@ -2658,6 +2658,9 @@ static void vfio_err_notifier_handler(void *opaque)
> >>>>            msg.severity = isfatal ? PCI_ERR_ROOT_CMD_FATAL_EN :
> >>>>                                     PCI_ERR_ROOT_CMD_NONFATAL_EN;
> >>>>    
> >>>> +        /* wait a bit to ensure aer device is ready */
> >>>> +        usleep(2 * 1000);  
> >>> Where does this number come from?  Why would the device be in D3?  I
> >>> don't understand this at all.  
> >> Hi Alex,
> >>
> >>       when I tested the code in my environment, I found that when I used
> >> the aer-inject module to inject a fake aer error to device on host, the qemu
> >> would throw out the message "vfio: Unable to power on device, stuck in D3"
> >> on and off. if I use "gdb" to debug the vfio_pci_pre_reset, the phenomenon
> >> would not appearance, I just thought it should be some timing race issue,
> >> so I use a sleep() to wait 2ms (double the reset time of 1ms) to ensure the
> >> device state is ready. maybe the root reason still need to be
> >> investigated deeply.  
> > Yes, it sounds like you need to investigate this further, the delay is
> > arbitrary and perhaps suggests a race that needs to be fixed
> > correctly.  Thanks,  
> Hi Alex,
> 
>      after done some investigation of the problem, I found that only 
> when the injected
>   error is fatal, the problem will appear. because in aer do_recovery, 
> host will call reset_link
> on the root port, which would invoke pci_reset_bridge_secondary_bus in 
> aer_root_reset,
> that would reset the bridge and all the device under that. so when qemu 
> receive the aer
> notification, then propagate the error to guest, guest does the same way 
> to perform the
> recovery, if the guest `reset_link` that will call the vfio_pre_reset 
> done at the stage of host
> bridge reset, the device status would probable stick in D3.
> 
> so I think after qemu receive the aer notification, we should wait for 
> enough time to
> ensure the bridge has been reset completely. I just use sleep <=10ms to 
> test the code,
> seems still appear the message "vfio: Unable to power on device, stuck 
> in D3". so I think
> we should sleep 100ms to ensure the delay sufficient. I have tested that 
> code 100+ times
> by inject aer error. the issue no longer appears.

I'm not satisfied with this.  pci_reset_bridge_secondary_bus() is
invoked by both the host AER code and the guest AER code, the latter
via the vfio PCI hot reset interface.  The
pci_reset_bridge_secondary_bus() function includes the spec defined
delay by which point all the devices should be operational again.  The
spec also defines that devices are in D0 after reset, which implies
that the only reason we would ever be seeing a device in D3 is if we're
reading the device while it is still in reset or before it has
recovered from reset.  That implies that either
pci_reset_bridge_secondary_bus() is not waiting long enough or QEMU is
allowing one device to call vfio_pre_reset while another device is
still in reset.  I suspect QEMU serializes reset such that the latter
case is not possible, which means that you might have a device that
takes longer to reset than the spec defines.  Such a quirk should be
handled in the host kernel reset, not by adding arbitrary delays in
userspace code.  Thanks,

Alex
Chen Fan April 1, 2016, 1:40 a.m. UTC | #6
On 03/31/2016 11:44 PM, Alex Williamson wrote:
> On Thu, 31 Mar 2016 14:55:07 +0800
> Chen Fan <chen.fan.fnst@cn.fujitsu.com> wrote:
>
>> On 03/25/2016 10:22 AM, Alex Williamson wrote:
>>> On Fri, 25 Mar 2016 09:38:09 +0800
>>> Chen Fan <chen.fan.fnst@cn.fujitsu.com> wrote:
>>>   
>>>> On 03/25/2016 06:54 AM, Alex Williamson wrote:
>>>>> On Wed, 23 Mar 2016 18:12:06 +0800
>>>>> Cao jin <caoj.fnst@cn.fujitsu.com> wrote:
>>>>>      
>>>>>> From: Chen Fan <chen.fan.fnst@cn.fujitsu.com>
>>>>>>
>>>>>> when a physical device aer occurred, the device state probably
>>>>>> is not in D0 in a short time, if we recover the device quickly.
>>>>>> we may stuck in D3 state when force to change device state to D0.
>>>>>> we may need to wait for a short time to inject the error to guest.
>>>>>>
>>>>>> Signed-off-by: Chen Fan <chen.fan.fnst@cn.fujitsu.com>
>>>>>> ---
>>>>>>     hw/vfio/pci.c | 3 +++
>>>>>>     1 file changed, 3 insertions(+)
>>>>>>
>>>>>> diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
>>>>>> index 25fc095..5216e7f 100644
>>>>>> --- a/hw/vfio/pci.c
>>>>>> +++ b/hw/vfio/pci.c
>>>>>> @@ -2658,6 +2658,9 @@ static void vfio_err_notifier_handler(void *opaque)
>>>>>>             msg.severity = isfatal ? PCI_ERR_ROOT_CMD_FATAL_EN :
>>>>>>                                      PCI_ERR_ROOT_CMD_NONFATAL_EN;
>>>>>>     
>>>>>> +        /* wait a bit to ensure aer device is ready */
>>>>>> +        usleep(2 * 1000);
>>>>> Where does this number come from?  Why would the device be in D3?  I
>>>>> don't understand this at all.
>>>> Hi Alex,
>>>>
>>>>        when I tested the code in my environment, I found that when I used
>>>> the aer-inject module to inject a fake aer error to device on host, the qemu
>>>> would throw out the message "vfio: Unable to power on device, stuck in D3"
>>>> on and off. if I use "gdb" to debug the vfio_pci_pre_reset, the phenomenon
>>>> would not appearance, I just thought it should be some timing race issue,
>>>> so I use a sleep() to wait 2ms (double the reset time of 1ms) to ensure the
>>>> device state is ready. maybe the root reason still need to be
>>>> investigated deeply.
>>> Yes, it sounds like you need to investigate this further, the delay is
>>> arbitrary and perhaps suggests a race that needs to be fixed
>>> correctly.  Thanks,
>> Hi Alex,
>>
>>       after done some investigation of the problem, I found that only
>> when the injected
>>    error is fatal, the problem will appear. because in aer do_recovery,
>> host will call reset_link
>> on the root port, which would invoke pci_reset_bridge_secondary_bus in
>> aer_root_reset,
>> that would reset the bridge and all the device under that. so when qemu
>> receive the aer
>> notification, then propagate the error to guest, guest does the same way
>> to perform the
>> recovery, if the guest `reset_link` that will call the vfio_pre_reset
>> done at the stage of host
>> bridge reset, the device status would probable stick in D3.
>>
>> so I think after qemu receive the aer notification, we should wait for
>> enough time to
>> ensure the bridge has been reset completely. I just use sleep <=10ms to
>> test the code,
>> seems still appear the message "vfio: Unable to power on device, stuck
>> in D3". so I think
>> we should sleep 100ms to ensure the delay sufficient. I have tested that
>> code 100+ times
>> by inject aer error. the issue no longer appears.
> I'm not satisfied with this.  pci_reset_bridge_secondary_bus() is
> invoked by both the host AER code and the guest AER code, the latter
> via the vfio PCI hot reset interface.  The
> pci_reset_bridge_secondary_bus() function includes the spec defined
> delay by which point all the devices should be operational again.  The
> spec also defines that devices are in D0 after reset, which implies
> that the only reason we would ever be seeing a device in D3 is if we're
> reading the device while it is still in reset or before it has
> recovered from reset.  That implies that either
> pci_reset_bridge_secondary_bus() is not waiting long enough or QEMU is
> allowing one device to call vfio_pre_reset while another device is
> still in reset.  I suspect QEMU serializes reset such that the latter
> case is not possible, which means that you might have a device that
> takes longer to reset than the spec defines.  Such a quirk should be
> handled in the host kernel reset, not by adding arbitrary delays in
> userspace code.  Thanks,
maybe it is not the delay issue actually. let me analyze the aer process
in do_recovery:

                host                                      qemu          
      guest

         broadcast: error_detected
         +--> error_detected (VFIO)
           +--> eventfd_signal   ----->    inject_aer ---------> 
broadcast: error_detected
*reset_link* +--> error_detected (real driver)
*reset_link*
         broadcast: mmio/slot_reset                                     
broadcast: mmio/slot_reset
         broadcast: 
resume                                                  broadcast: resume

we can see that the reset_link in host and in guest are not called in 
order. the reset_link
in guest may execute during the host broadcast "error_detected" or 
"reset_link" as long
as there are many devices need to walk on the bus. so in the case qemu 
has the opportunity
to call vfio_pre_reset while host is still in reset.

so can we consider moving the eventfd_signal in error_detected to 
resume/slot_reset callback
in vfio driver, then we can assure that the host reset_link finish 
before guest do reset_link.
or adding a new resume/slot_reset event between vfio driver and qemu to 
delay the guest reset_link.

Thanks,
Chen











>
> Alex
>
>
> .
>
Chen Fan April 1, 2016, 1:53 a.m. UTC | #7
On 04/01/2016 09:40 AM, Chen Fan wrote:
>
> On 03/31/2016 11:44 PM, Alex Williamson wrote:
>> On Thu, 31 Mar 2016 14:55:07 +0800
>> Chen Fan <chen.fan.fnst@cn.fujitsu.com> wrote:
>>
>>> On 03/25/2016 10:22 AM, Alex Williamson wrote:
>>>> On Fri, 25 Mar 2016 09:38:09 +0800
>>>> Chen Fan <chen.fan.fnst@cn.fujitsu.com> wrote:
>>>>> On 03/25/2016 06:54 AM, Alex Williamson wrote:
>>>>>> On Wed, 23 Mar 2016 18:12:06 +0800
>>>>>> Cao jin <caoj.fnst@cn.fujitsu.com> wrote:
>>>>>>> From: Chen Fan <chen.fan.fnst@cn.fujitsu.com>
>>>>>>>
>>>>>>> when a physical device aer occurred, the device state probably
>>>>>>> is not in D0 in a short time, if we recover the device quickly.
>>>>>>> we may stuck in D3 state when force to change device state to D0.
>>>>>>> we may need to wait for a short time to inject the error to guest.
>>>>>>>
>>>>>>> Signed-off-by: Chen Fan <chen.fan.fnst@cn.fujitsu.com>
>>>>>>> ---
>>>>>>>     hw/vfio/pci.c | 3 +++
>>>>>>>     1 file changed, 3 insertions(+)
>>>>>>>
>>>>>>> diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
>>>>>>> index 25fc095..5216e7f 100644
>>>>>>> --- a/hw/vfio/pci.c
>>>>>>> +++ b/hw/vfio/pci.c
>>>>>>> @@ -2658,6 +2658,9 @@ static void vfio_err_notifier_handler(void 
>>>>>>> *opaque)
>>>>>>>             msg.severity = isfatal ? PCI_ERR_ROOT_CMD_FATAL_EN :
>>>>>>> PCI_ERR_ROOT_CMD_NONFATAL_EN;
>>>>>>>     +        /* wait a bit to ensure aer device is ready */
>>>>>>> +        usleep(2 * 1000);
>>>>>> Where does this number come from?  Why would the device be in D3?  I
>>>>>> don't understand this at all.
>>>>> Hi Alex,
>>>>>
>>>>>        when I tested the code in my environment, I found that when 
>>>>> I used
>>>>> the aer-inject module to inject a fake aer error to device on 
>>>>> host, the qemu
>>>>> would throw out the message "vfio: Unable to power on device, 
>>>>> stuck in D3"
>>>>> on and off. if I use "gdb" to debug the vfio_pci_pre_reset, the 
>>>>> phenomenon
>>>>> would not appearance, I just thought it should be some timing race 
>>>>> issue,
>>>>> so I use a sleep() to wait 2ms (double the reset time of 1ms) to 
>>>>> ensure the
>>>>> device state is ready. maybe the root reason still need to be
>>>>> investigated deeply.
>>>> Yes, it sounds like you need to investigate this further, the delay is
>>>> arbitrary and perhaps suggests a race that needs to be fixed
>>>> correctly.  Thanks,
>>> Hi Alex,
>>>
>>>       after done some investigation of the problem, I found that only
>>> when the injected
>>>    error is fatal, the problem will appear. because in aer do_recovery,
>>> host will call reset_link
>>> on the root port, which would invoke pci_reset_bridge_secondary_bus in
>>> aer_root_reset,
>>> that would reset the bridge and all the device under that. so when qemu
>>> receive the aer
>>> notification, then propagate the error to guest, guest does the same 
>>> way
>>> to perform the
>>> recovery, if the guest `reset_link` that will call the vfio_pre_reset
>>> done at the stage of host
>>> bridge reset, the device status would probable stick in D3.
>>>
>>> so I think after qemu receive the aer notification, we should wait for
>>> enough time to
>>> ensure the bridge has been reset completely. I just use sleep <=10ms to
>>> test the code,
>>> seems still appear the message "vfio: Unable to power on device, stuck
>>> in D3". so I think
>>> we should sleep 100ms to ensure the delay sufficient. I have tested 
>>> that
>>> code 100+ times
>>> by inject aer error. the issue no longer appears.
>> I'm not satisfied with this.  pci_reset_bridge_secondary_bus() is
>> invoked by both the host AER code and the guest AER code, the latter
>> via the vfio PCI hot reset interface.  The
>> pci_reset_bridge_secondary_bus() function includes the spec defined
>> delay by which point all the devices should be operational again.  The
>> spec also defines that devices are in D0 after reset, which implies
>> that the only reason we would ever be seeing a device in D3 is if we're
>> reading the device while it is still in reset or before it has
>> recovered from reset.  That implies that either
>> pci_reset_bridge_secondary_bus() is not waiting long enough or QEMU is
>> allowing one device to call vfio_pre_reset while another device is
>> still in reset.  I suspect QEMU serializes reset such that the latter
>> case is not possible, which means that you might have a device that
>> takes longer to reset than the spec defines.  Such a quirk should be
>> handled in the host kernel reset, not by adding arbitrary delays in
>> userspace code.  Thanks,
> maybe it is not the delay issue actually. let me analyze the aer process
> in do_recovery:
>
>                host qemu               guest
>
>         broadcast: error_detected
>         +--> error_detected (VFIO)
>           +--> eventfd_signal   ----->    inject_aer ---------> 
> broadcast: error_detected
> *reset_link* +--> error_detected (real driver)
> *reset_link*
>         broadcast: mmio/slot_reset                                     
> broadcast: mmio/slot_reset
>         broadcast: 
> resume                                                  broadcast: resume
apologies to all, the above diagram was chaotic. I redraw it.

                host                             qemu        guest

broadcast: error_detected (host)
  +--> error_detected (VFIO)
    +--> eventfd_signal   ----->  inject_aer  -----> broadcast: 
error_detected (gust)
*reset_link* (host) +--> error_detected (real driver) (gust)
*reset_link* (gust)
broadcast: mmio/slot_reset (host)                      broadcast: 
mmio/slot_reset (gust)
broadcast: resume (host) broadcast: resume (gust)

Thanks,
Chen

>
> we can see that the reset_link in host and in guest are not called in 
> order. the reset_link
> in guest may execute during the host broadcast "error_detected" or 
> "reset_link" as long
> as there are many devices need to walk on the bus. so in the case qemu 
> has the opportunity
> to call vfio_pre_reset while host is still in reset.
>
> so can we consider moving the eventfd_signal in error_detected to 
> resume/slot_reset callback
> in vfio driver, then we can assure that the host reset_link finish 
> before guest do reset_link.
> or adding a new resume/slot_reset event between vfio driver and qemu 
> to delay the guest reset_link.
>
> Thanks,
> Chen
>
>
>
>
>
>
>
>
>
>
>
>>
>> Alex
>>
>>
>> .
>>
>
>
>
>
> .
>
diff mbox

Patch

diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
index 25fc095..5216e7f 100644
--- a/hw/vfio/pci.c
+++ b/hw/vfio/pci.c
@@ -2658,6 +2658,9 @@  static void vfio_err_notifier_handler(void *opaque)
         msg.severity = isfatal ? PCI_ERR_ROOT_CMD_FATAL_EN :
                                  PCI_ERR_ROOT_CMD_NONFATAL_EN;
 
+        /* wait a bit to ensure aer device is ready */
+        usleep(2 * 1000);
+
         pcie_aer_msg(dev, &msg);
         return;
     }