Message ID | 20190121134249.16615-3-david@redhat.com (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
Series | s390x/pci: hotplug handler fixes and reworks | expand |
On Mon, 21 Jan 2019 14:42:49 +0100 David Hildenbrand <david@redhat.com> wrote: > When resetting the guest we should unplug and remove all devices that > are still pending. Otherwise the fresh guest will see devices that will > suddenly vanish. > > Can be triggered e.g. via > (hmp) device_add virtio-mouse-pci,id=test > (hmp) stop > (hmp) device_del test > (hmp) system_reset > (hmp) c > > The device will vanish after roughly 5 minutes. With this patch, the > device will vanish on reboot (S390_RESET_EXTERNAL and S390_RESET_REIPL, > which reset the pcihost bridge via qemu_devices_reset()). If we want > these devices to vanish directly on any reset (S390_RESET_MODIFIED_CLEAR > and S390_RESET_LOAD_NORMAL), we have to modify s390_machine_reset(). But > I have the feeling that this should not be done for all reset types. > > This approach is similar to what's done for acpi PCI hotplug in > acpi_pcihp_reset() -> acpi_pcihp_update() -> > acpi_pcihp_update_hotplug_bus() -> acpi_pcihp_eject_slot(). > > s390_pci_generate_plug_event()'s will still be generated, I guess this > is not an issue (same thing could happen right now if the timer expires > just after reset). I'm wondering what the architecture says regarding those events -- can someone with access to the documentation comment? > > Signed-off-by: David Hildenbrand <david@redhat.com> > --- > hw/s390x/s390-pci-bus.c | 8 ++++++++ > 1 file changed, 8 insertions(+) > > diff --git a/hw/s390x/s390-pci-bus.c b/hw/s390x/s390-pci-bus.c > index bc17a8cf65..b70ae25533 100644 > --- a/hw/s390x/s390-pci-bus.c > +++ b/hw/s390x/s390-pci-bus.c > @@ -1102,6 +1102,14 @@ static void s390_pcihost_reset(DeviceState *dev) > { > S390pciState *s = S390_PCI_HOST_BRIDGE(dev); > PCIBus *bus = s->parent_obj.bus; > + S390PCIBusDevice *pbdev, *next; > + > + /* Unplug all pending devices that were requested to be released */ > + QTAILQ_FOREACH_SAFE(pbdev, &s->zpci_devs, link, next) { > + if (pbdev->release_timer) { > + s390_pcihost_timer_cb(pbdev); > + } > + } > > s->bus_no = 0; > pci_for_each_device(bus, pci_bus_num(bus), s390_pci_enumerate_bridge, s);
On Wed, 23 Jan 2019 12:05:39 +0100 Cornelia Huck <cohuck@redhat.com> wrote: > On Mon, 21 Jan 2019 14:42:49 +0100 > David Hildenbrand <david@redhat.com> wrote: > > > When resetting the guest we should unplug and remove all devices that > > are still pending. Otherwise the fresh guest will see devices that will > > suddenly vanish. > > > > Can be triggered e.g. via > > (hmp) device_add virtio-mouse-pci,id=test > > (hmp) stop > > (hmp) device_del test > > (hmp) system_reset > > (hmp) c > > > > The device will vanish after roughly 5 minutes. With this patch, the > > device will vanish on reboot (S390_RESET_EXTERNAL and S390_RESET_REIPL, > > which reset the pcihost bridge via qemu_devices_reset()). If we want > > these devices to vanish directly on any reset (S390_RESET_MODIFIED_CLEAR > > and S390_RESET_LOAD_NORMAL), we have to modify s390_machine_reset(). But > > I have the feeling that this should not be done for all reset types. > > > > This approach is similar to what's done for acpi PCI hotplug in > > acpi_pcihp_reset() -> acpi_pcihp_update() -> > > acpi_pcihp_update_hotplug_bus() -> acpi_pcihp_eject_slot(). > > > > s390_pci_generate_plug_event()'s will still be generated, I guess this > > is not an issue (same thing could happen right now if the timer expires > > just after reset). > > I'm wondering what the architecture says regarding those events -- can > someone with access to the documentation comment? Ping. Any comments from the IBM folks? > > > > > Signed-off-by: David Hildenbrand <david@redhat.com> > > --- > > hw/s390x/s390-pci-bus.c | 8 ++++++++ > > 1 file changed, 8 insertions(+) > > > > diff --git a/hw/s390x/s390-pci-bus.c b/hw/s390x/s390-pci-bus.c > > index bc17a8cf65..b70ae25533 100644 > > --- a/hw/s390x/s390-pci-bus.c > > +++ b/hw/s390x/s390-pci-bus.c > > @@ -1102,6 +1102,14 @@ static void s390_pcihost_reset(DeviceState *dev) > > { > > S390pciState *s = S390_PCI_HOST_BRIDGE(dev); > > PCIBus *bus = s->parent_obj.bus; > > + S390PCIBusDevice *pbdev, *next; > > + > > + /* Unplug all pending devices that were requested to be released */ > > + QTAILQ_FOREACH_SAFE(pbdev, &s->zpci_devs, link, next) { > > + if (pbdev->release_timer) { > > + s390_pcihost_timer_cb(pbdev); > > + } > > + } > > > > s->bus_no = 0; > > pci_for_each_device(bus, pci_bus_num(bus), s390_pci_enumerate_bridge, s); >
On 1/28/19 6:28 AM, Cornelia Huck wrote: > On Wed, 23 Jan 2019 12:05:39 +0100 > Cornelia Huck <cohuck@redhat.com> wrote: > >> On Mon, 21 Jan 2019 14:42:49 +0100 >> David Hildenbrand <david@redhat.com> wrote: >> >>> When resetting the guest we should unplug and remove all devices that >>> are still pending. Otherwise the fresh guest will see devices that will >>> suddenly vanish. >>> >>> Can be triggered e.g. via >>> (hmp) device_add virtio-mouse-pci,id=test >>> (hmp) stop >>> (hmp) device_del test >>> (hmp) system_reset >>> (hmp) c >>> >>> The device will vanish after roughly 5 minutes. With this patch, the >>> device will vanish on reboot (S390_RESET_EXTERNAL and S390_RESET_REIPL, >>> which reset the pcihost bridge via qemu_devices_reset()). If we want >>> these devices to vanish directly on any reset (S390_RESET_MODIFIED_CLEAR >>> and S390_RESET_LOAD_NORMAL), we have to modify s390_machine_reset(). But >>> I have the feeling that this should not be done for all reset types. >>> >>> This approach is similar to what's done for acpi PCI hotplug in >>> acpi_pcihp_reset() -> acpi_pcihp_update() -> >>> acpi_pcihp_update_hotplug_bus() -> acpi_pcihp_eject_slot(). >>> >>> s390_pci_generate_plug_event()'s will still be generated, I guess this >>> is not an issue (same thing could happen right now if the timer expires >>> just after reset). >> >> I'm wondering what the architecture says regarding those events -- can >> someone with access to the documentation comment? > > Ping. Any comments from the IBM folks? > So the idea here is that if we have a PCI device that is the process of being deconfigured and we are also in the middle of a reset, then let's accelerate deconfiguring of the PCI device during the reset. Makes sense. Note: The callback function will deconfigure the the device and put it into standby mode. However, a PCI device should only go into standby from the *disabled state* (which it could already be in due to the unplug sequence), or from a *permanent error state* (something we should hopefully never see -- this means something went seriously wrong with the device). Two things I'm concerned about: 1) What I would suggest is adding a check for the pbdev->state for ZPCI_FS_DISABLED || ZPCI_FS_PERMANENT_ERROR. If it is in either of these states, then we're safe to deconfigure and put into standby. If the device is still in another state (such as enabled or blocked, etc) then we should allow the timer to resume and give the device some more time before forcing an unplug. It's also probably not a good idea to try and deconfigure a device that might already be deconfigured (e.g. if it's already in standby or reserved state). That might not happen though, but it's good to cover our bases. A side thought: In addition to checking the states, what would happen if you forced the timer to 0? Would the callback get called? Would that just accelerate the already-in-progress unplug sequence? and 2) I worry that the sclp might try to deconfigure a PCI device at the same time we force the callback in this patch. I noticed that the sclp_deconfigure function also checks on the release timer before unplugging. I think we should make sure the timer is properly stopped or canceled before the sclp tries to deconfigure and before this patch forces the callback, that way we hopefully won't try to do both at the same time. something like if (release_timer) { stop timer unplug } Your adjustments to the pcihost_unplug function in patch #1 would of course handle freeing the release timer later on. >> >>> >>> Signed-off-by: David Hildenbrand <david@redhat.com> >>> --- >>> hw/s390x/s390-pci-bus.c | 8 ++++++++ >>> 1 file changed, 8 insertions(+) >>> >>> diff --git a/hw/s390x/s390-pci-bus.c b/hw/s390x/s390-pci-bus.c >>> index bc17a8cf65..b70ae25533 100644 >>> --- a/hw/s390x/s390-pci-bus.c >>> +++ b/hw/s390x/s390-pci-bus.c >>> @@ -1102,6 +1102,14 @@ static void s390_pcihost_reset(DeviceState *dev) >>> { >>> S390pciState *s = S390_PCI_HOST_BRIDGE(dev); >>> PCIBus *bus = s->parent_obj.bus; >>> + S390PCIBusDevice *pbdev, *next; >>> + >>> + /* Unplug all pending devices that were requested to be released */ >>> + QTAILQ_FOREACH_SAFE(pbdev, &s->zpci_devs, link, next) { >>> + if (pbdev->release_timer) { >>> + s390_pcihost_timer_cb(pbdev); >>> + } >>> + } >>> >>> s->bus_no = 0; >>> pci_for_each_device(bus, pci_bus_num(bus), s390_pci_enumerate_bridge, s); >> >
>>> I'm wondering what the architecture says regarding those events -- can >>> someone with access to the documentation comment? >> >> Ping. Any comments from the IBM folks? >> > > > So the idea here is that if we have a PCI device that is the process of > being deconfigured and we are also in the middle of a reset, then let's > accelerate deconfiguring of the PCI device during the reset. Makes sense. > > Note: > > The callback function will deconfigure the the device and put it into > standby mode. However, a PCI device should only go into standby from the > *disabled state* (which it could already be in due to the unplug > sequence), or from a *permanent error state* (something we should > hopefully never see -- this means something went seriously wrong with > the device). Right, this should already have been checked before setting up the timer. > > Two things I'm concerned about: > > 1) > > What I would suggest is adding a check for the pbdev->state for > ZPCI_FS_DISABLED || ZPCI_FS_PERMANENT_ERROR. If it is in either of these > states, then we're safe to deconfigure and put into standby. If the We setup a timer if !ZPCI_FS_STANDBY and !ZPCI_FS_RESERVED. So for - ZPCI_FS_DISABLED - ZPCI_FS_ENABLED - ZPCI_FS_BLOCKED - ZPCI_FS_ERROR - ZPCI_FS_PERMANENT_ERROR We setup a timer and simply go ahead and unplug the device when the timer expires ("forced unplug"). Changing that behavior might be more invasive. Simply not unplugging in s390_pcihost_timer_cb() on some of these states would mean that somebody issued a request and that requests is simply lost/ignored. Not sure if that is what we want. I think we need separate patches to change something like that. Especially 1. What happens if the device was in ZPCI_FS_DISABLED, the guest ignores the unplug request and moves the device to ZPCI_FS_ENABLED before the timer expires? These are corner cases to consider. 2. Do we need a timer at all? Now that Patch #1 introduces unplug_requests, we are free to ignore requests from the user if the guest is not reacting. I would really favor getting rid of the timer completely. Was there a special reason why this was introduced? No other architecture (e.g. ACPI) uses such a timer. They use a simple flag to remember if a request is pending. I would really favor going into that direction. @Christian do you think we need this "force remove after 30 seconds" timer thingy? > device is still in another state (such as enabled or blocked, etc) then > we should allow the timer to resume and give the device some more time > before forcing an unplug. It's also probably not a good idea to try and > deconfigure a device that might already be deconfigured (e.g. if it's > already in standby or reserved state). That might not happen though, but > it's good to cover our bases. > > A side thought: In addition to checking the states, what would happen if > you forced the timer to 0? Would the callback get called? Would that > just accelerate the already-in-progress unplug sequence? I guess so, but then action will be called asynchronously from the main loop when timers are run. Calling it synchronously during the reset makes in my point of view more sense. > > and 2) > > I worry that the sclp might try to deconfigure a PCI device at the same > time we force the callback in this patch. I noticed that the > sclp_deconfigure function also checks on the release timer before > unplugging. I think we should make sure the timer is properly stopped or > canceled before the sclp tries to deconfigure and before this patch > forces the callback, that way we hopefully won't try to do both at the > same time. Not necessary. Timers are ran from the main thread, just as we (system reset) are. We don't have any race conditions here. This would only be possible when being triggered by a VCPU without holding the iothread mutex. But this is never the case. All VCPUs and even timers (due to the virtual clock being stopped) are stopped during system reset.
On 29/01/2019 11:24, David Hildenbrand wrote: >>>> I'm wondering what the architecture says regarding those events -- can >>>> someone with access to the documentation comment? >>> >>> Ping. Any comments from the IBM folks? Hi, Sorry to have wait so long. At least Collin was faster. >>> >> >> >> So the idea here is that if we have a PCI device that is the process of >> being deconfigured and we are also in the middle of a reset, then let's >> accelerate deconfiguring of the PCI device during the reset. Makes sense. to me too. However, how do we ensure that the guest got time to respond to the first deconfigure request? >> >> Note: >> >> The callback function will deconfigure the the device and put it into >> standby mode. However, a PCI device should only go into standby from the >> *disabled state* (which it could already be in due to the unplug >> sequence), or from a *permanent error state* (something we should >> hopefully never see -- this means something went seriously wrong with >> the device). Not completely exact, the CHSC event 0x304, on the initiative of the host, force the "deconfigure state" from "configure state" generally, whatever sub-state it has (enabled/disabled/error...). > > Right, this should already have been checked before setting up the timer. Apropos timer, do we need a timer or wouldn't it be better to use a delay / a timer + condition? AFAIU we get out of the unplug without waiting for any answer from the guest and we surely get the timer triggering after the reset has been done. That seems bad. > >> >> Two things I'm concerned about: >> >> 1) >> >> What I would suggest is adding a check for the pbdev->state for >> ZPCI_FS_DISABLED || ZPCI_FS_PERMANENT_ERROR. If it is in either of these >> states, then we're safe to deconfigure and put into standby. If the > > We setup a timer if !ZPCI_FS_STANDBY and !ZPCI_FS_RESERVED. > > So for > - ZPCI_FS_DISABLED > - ZPCI_FS_ENABLED > - ZPCI_FS_BLOCKED > - ZPCI_FS_ERROR > - ZPCI_FS_PERMANENT_ERROR > > We setup a timer and simply go ahead and unplug the device when the > timer expires ("forced unplug"). I agree only for ZPCI_FS_ENABLED why do we need to be smooth for other states? ZPCI_FS_DISABLED may be a candidate even I do not see the interrest but other states of the device should issue a HP_EVENT_DECONFIGURE_REQUEST and we do not need a timer (or any delay) > > Changing that behavior might be more invasive. Simply not unplugging in > s390_pcihost_timer_cb() on some of these states would mean that somebody > issued a request and that requests is simply lost/ignored. Not sure if > that is what we want. I think we need separate patches to change > something like that. Especially > > 1. What happens if the device was in ZPCI_FS_DISABLED, the guest ignores > the unplug request and moves the device to ZPCI_FS_ENABLED before the > timer expires? These are corner cases to consider. +1, we must ensure to do the work inside the unplug CB. > > 2. Do we need a timer at all? Now that Patch #1 introduces > unplug_requests, we are free to ignore requests from the user if the > guest is not reacting. I would really favor getting rid of the timer > completely. Was there a special reason why this was introduced? Yes, to let a chance to the guest to smoothly relinquish the device. (for example sync/clean the disk) However I do not think it is right implemented. > > No other architecture (e.g. ACPI) uses such a timer. They use a simple > flag to remember if a request is pending. I would really favor going > into that direction. I am not sure that the Intel architecture is a good example. :) AFAIU they do not wait for the guest to have relinquish the device. Or do they? How long do they wait? > > @Christian do you think we need this "force remove after 30 seconds" > timer thingy? > >> device is still in another state (such as enabled or blocked, etc) then >> we should allow the timer to resume and give the device some more time >> before forcing an unplug. It's also probably not a good idea to try and >> deconfigure a device that might already be deconfigured (e.g. if it's >> already in standby or reserved state). That might not happen though, but >> it's good to cover our bases. >> >> A side thought: In addition to checking the states, what would happen if >> you forced the timer to 0? Would the callback get called? Would that >> just accelerate the already-in-progress unplug sequence? > > I guess so, but then action will be called asynchronously from the main > loop when timers are run. Calling it synchronously during the reset > makes in my point of view more sense. To me too, (and during hot unplug too) but it has a sense only if we wait some time between the demand to relinquish the device and the rip off of the device. Regards, Pierre
On 29.01.19 14:50, Pierre Morel wrote: > On 29/01/2019 11:24, David Hildenbrand wrote: >>>>> I'm wondering what the architecture says regarding those events -- can >>>>> someone with access to the documentation comment? >>>> >>>> Ping. Any comments from the IBM folks? > > Hi, > > Sorry to have wait so long. > At least Collin was faster. > >>>> >>> >>> >>> So the idea here is that if we have a PCI device that is the process of >>> being deconfigured and we are also in the middle of a reset, then let's >>> accelerate deconfiguring of the PCI device during the reset. Makes sense. > > to me too. > However, how do we ensure that the guest got time to respond to the > first deconfigure request? 30 seconds, then the reboot. On a reboot, I don't see why we should give a guest more time. "It's dead", rip out the card as the guest refused to hand it back. (maybe it crashed! but after a reboot the guest state is reset and baught back to life) > >>> >>> Note: >>> >>> The callback function will deconfigure the the device and put it into >>> standby mode. However, a PCI device should only go into standby from the >>> *disabled state* (which it could already be in due to the unplug >>> sequence), or from a *permanent error state* (something we should >>> hopefully never see -- this means something went seriously wrong with >>> the device). > > Not completely exact, the CHSC event 0x304, on the initiative of the > host, force the "deconfigure state" from "configure state" generally, > whatever sub-state it has (enabled/disabled/error...). > >> >> Right, this should already have been checked before setting up the timer. > > Apropos timer, do we need a timer or wouldn't it be better to use a > delay / a timer + condition? I don't think we need a timer at all. > > AFAIU we get out of the unplug without waiting for any answer from the > guest and we surely get the timer triggering after the reset has been done. > That seems bad. This is the case right now, correct. > > >> >>> >>> Two things I'm concerned about: >>> >>> 1) >>> >>> What I would suggest is adding a check for the pbdev->state for >>> ZPCI_FS_DISABLED || ZPCI_FS_PERMANENT_ERROR. If it is in either of these >>> states, then we're safe to deconfigure and put into standby. If the >> >> We setup a timer if !ZPCI_FS_STANDBY and !ZPCI_FS_RESERVED. >> >> So for >> - ZPCI_FS_DISABLED >> - ZPCI_FS_ENABLED >> - ZPCI_FS_BLOCKED >> - ZPCI_FS_ERROR >> - ZPCI_FS_PERMANENT_ERROR >> >> We setup a timer and simply go ahead and unplug the device when the >> timer expires ("forced unplug"). > > I agree only for ZPCI_FS_ENABLED why do we need to be smooth for other > states? > ZPCI_FS_DISABLED may be a candidate even I do not see the interrest but > other states of the device should issue a HP_EVENT_DECONFIGURE_REQUEST > and we do not need a timer (or any delay) You can always expect that your guest driver is dead. > >> >> Changing that behavior might be more invasive. Simply not unplugging in >> s390_pcihost_timer_cb() on some of these states would mean that somebody >> issued a request and that requests is simply lost/ignored. Not sure if >> that is what we want. I think we need separate patches to change >> something like that. Especially >> >> 1. What happens if the device was in ZPCI_FS_DISABLED, the guest ignores >> the unplug request and moves the device to ZPCI_FS_ENABLED before the >> timer expires? These are corner cases to consider. > > +1, we must ensure to do the work inside the unplug CB. > >> >> 2. Do we need a timer at all? Now that Patch #1 introduces >> unplug_requests, we are free to ignore requests from the user if the >> guest is not reacting. I would really favor getting rid of the timer >> completely. Was there a special reason why this was introduced? > > Yes, to let a chance to the guest to smoothly relinquish the device. > (for example sync/clean the disk) > However I do not think it is right implemented. > >> >> No other architecture (e.g. ACPI) uses such a timer. They use a simple >> flag to remember if a request is pending. I would really favor going >> into that direction. > > I am not sure that the Intel architecture is a good example. :) Right, we all learned that zPCI did it better. (sorry ;) ) > > AFAIU they do not wait for the guest to have relinquish the device. > Or do they? > How long do they wait? They wait for ever. And I guess we should do the same thing. If the guest driver is broken (and this is really a rare scenario!) we would not get the device back. Which is perfectly fine in my point of view. In all other scenarios I guess the guest will simply respond in a timely manner. And ripping out stuff from the guest always feels wrong. (yes the channel subsystem works like that, but here we actually have a choice) If we reboot, we can unplug the device. Otherwise, let's keep it simply and don't use a timer. Thanks!
On 29/01/2019 16:11, David Hildenbrand wrote: > On 29.01.19 14:50, Pierre Morel wrote: >> On 29/01/2019 11:24, David Hildenbrand wrote: >>>>>> I'm wondering what the architecture says regarding those events -- can >>>>>> someone with access to the documentation comment? >>>>> >>>>> Ping. Any comments from the IBM folks? >> >> Hi, >> >> Sorry to have wait so long. >> At least Collin was faster. >> >>>>> >>>> >>>> >>>> So the idea here is that if we have a PCI device that is the process of >>>> being deconfigured and we are also in the middle of a reset, then let's >>>> accelerate deconfiguring of the PCI device during the reset. Makes sense. >> >> to me too. >> However, how do we ensure that the guest got time to respond to the >> first deconfigure request? > > 30 seconds, then the reboot. On a reboot, I don't see why we should give > a guest more time. "It's dead", rip out the card as the guest refused to > hand it back. (maybe it crashed! but after a reboot the guest state is > reset and baught back to life) I agree that 30 seconds is way enough, also agree that in most cases the guest is dead. > >> >>>> >>>> Note: >>>> >>>> The callback function will deconfigure the the device and put it into >>>> standby mode. However, a PCI device should only go into standby from the >>>> *disabled state* (which it could already be in due to the unplug >>>> sequence), or from a *permanent error state* (something we should >>>> hopefully never see -- this means something went seriously wrong with >>>> the device). >> >> Not completely exact, the CHSC event 0x304, on the initiative of the >> host, force the "deconfigure state" from "configure state" generally, >> whatever sub-state it has (enabled/disabled/error...). >> >>> >>> Right, this should already have been checked before setting up the timer. >> >> Apropos timer, do we need a timer or wouldn't it be better to use a >> delay / a timer + condition? > > I don't think we need a timer at all. Yes, if it is possible to wait synchronously 30s it seems good to me. > >> >> AFAIU we get out of the unplug without waiting for any answer from the >> guest and we surely get the timer triggering after the reset has been done. >> That seems bad. > > This is the case right now, correct. > >> >> >>> >>>> >>>> Two things I'm concerned about: >>>> >>>> 1) >>>> >>>> What I would suggest is adding a check for the pbdev->state for >>>> ZPCI_FS_DISABLED || ZPCI_FS_PERMANENT_ERROR. If it is in either of these >>>> states, then we're safe to deconfigure and put into standby. If the >>> >>> We setup a timer if !ZPCI_FS_STANDBY and !ZPCI_FS_RESERVED. >>> >>> So for >>> - ZPCI_FS_DISABLED >>> - ZPCI_FS_ENABLED >>> - ZPCI_FS_BLOCKED >>> - ZPCI_FS_ERROR >>> - ZPCI_FS_PERMANENT_ERROR >>> >>> We setup a timer and simply go ahead and unplug the device when the >>> timer expires ("forced unplug"). >> >> I agree only for ZPCI_FS_ENABLED why do we need to be smooth for other >> states? >> ZPCI_FS_DISABLED may be a candidate even I do not see the interrest but >> other states of the device should issue a HP_EVENT_DECONFIGURE_REQUEST >> and we do not need a timer (or any delay) > > You can always expect that your guest driver is dead. hum, the device is dead. May be the guest got hit too if the driver is not right written. > >> >>> >>> Changing that behavior might be more invasive. Simply not unplugging in >>> s390_pcihost_timer_cb() on some of these states would mean that somebody >>> issued a request and that requests is simply lost/ignored. Not sure if >>> that is what we want. I think we need separate patches to change >>> something like that. Especially >>> >>> 1. What happens if the device was in ZPCI_FS_DISABLED, the guest ignores >>> the unplug request and moves the device to ZPCI_FS_ENABLED before the >>> timer expires? These are corner cases to consider. >> >> +1, we must ensure to do the work inside the unplug CB. >> >>> >>> 2. Do we need a timer at all? Now that Patch #1 introduces >>> unplug_requests, we are free to ignore requests from the user if the >>> guest is not reacting. I would really favor getting rid of the timer >>> completely. Was there a special reason why this was introduced? >> >> Yes, to let a chance to the guest to smoothly relinquish the device. >> (for example sync/clean the disk) >> However I do not think it is right implemented. >> >>> >>> No other architecture (e.g. ACPI) uses such a timer. They use a simple >>> flag to remember if a request is pending. I would really favor going >>> into that direction. >> >> I am not sure that the Intel architecture is a good example. :) > > Right, we all learned that zPCI did it better. (sorry ;) ) Well I really think so. It is designed with several guest in parallel and shared devices. In such an architecture, ripping of a device from a guest may have interest. One good thing would be that the software of the guest handle it :) > >> >> AFAIU they do not wait for the guest to have relinquish the device. >> Or do they? >> How long do they wait? > > They wait for ever. And I guess we should do the same thing. If the > guest driver is broken (and this is really a rare scenario!) we would > not get the device back. Which is perfectly fine in my point of view. In > all other scenarios I guess the guest will simply respond in a timely > manner. And ripping out stuff from the guest always feels wrong. (yes > the channel subsystem works like that, but here we actually have a choice) > > If we reboot, we can unplug the device. Otherwise, let's keep it simply > and don't use a timer. > > Thanks! > AFAIK We have no plan to operate on pools of PCI devices so for me I have no objection to keep it simple: Regards, Pierre
On 29.01.19 17:50, Pierre Morel wrote: > On 29/01/2019 16:11, David Hildenbrand wrote: >> On 29.01.19 14:50, Pierre Morel wrote: >>> On 29/01/2019 11:24, David Hildenbrand wrote: >>>>>>> I'm wondering what the architecture says regarding those events -- can >>>>>>> someone with access to the documentation comment? >>>>>> >>>>>> Ping. Any comments from the IBM folks? >>> >>> Hi, >>> >>> Sorry to have wait so long. >>> At least Collin was faster. >>> >>>>>> >>>>> >>>>> >>>>> So the idea here is that if we have a PCI device that is the process of >>>>> being deconfigured and we are also in the middle of a reset, then let's >>>>> accelerate deconfiguring of the PCI device during the reset. Makes sense. >>> >>> to me too. >>> However, how do we ensure that the guest got time to respond to the >>> first deconfigure request? >> >> 30 seconds, then the reboot. On a reboot, I don't see why we should give >> a guest more time. "It's dead", rip out the card as the guest refused to >> hand it back. (maybe it crashed! but after a reboot the guest state is >> reset and baught back to life) > > I agree that 30 seconds is way enough, > also agree that in most cases the guest is dead. > > >> >>> >>>>> >>>>> Note: >>>>> >>>>> The callback function will deconfigure the the device and put it into >>>>> standby mode. However, a PCI device should only go into standby from the >>>>> *disabled state* (which it could already be in due to the unplug >>>>> sequence), or from a *permanent error state* (something we should >>>>> hopefully never see -- this means something went seriously wrong with >>>>> the device). >>> >>> Not completely exact, the CHSC event 0x304, on the initiative of the >>> host, force the "deconfigure state" from "configure state" generally, >>> whatever sub-state it has (enabled/disabled/error...). >>> >>>> >>>> Right, this should already have been checked before setting up the timer. >>> >>> Apropos timer, do we need a timer or wouldn't it be better to use a >>> delay / a timer + condition? >> >> I don't think we need a timer at all. > > Yes, if it is possible to wait synchronously 30s it seems good to me. I mean, we don't have to wait 30 seconds at all. 1. We send a request to the guest 2. It responds (after some seconds), letting go of the zPCI device 3. We unplug the device 1. We send a request to the guest 2. Guest does not respond, request keeps pending forever 3. On reboot, unplug the device This is how x86/ACPI handles it. > >> >>> >>> AFAIU we get out of the unplug without waiting for any answer from the >>> guest and we surely get the timer triggering after the reset has been done. >>> That seems bad. >> >> This is the case right now, correct. >> >>> >>> >>>> >>>>> >>>>> Two things I'm concerned about: >>>>> >>>>> 1) >>>>> >>>>> What I would suggest is adding a check for the pbdev->state for >>>>> ZPCI_FS_DISABLED || ZPCI_FS_PERMANENT_ERROR. If it is in either of these >>>>> states, then we're safe to deconfigure and put into standby. If the >>>> >>>> We setup a timer if !ZPCI_FS_STANDBY and !ZPCI_FS_RESERVED. >>>> >>>> So for >>>> - ZPCI_FS_DISABLED >>>> - ZPCI_FS_ENABLED >>>> - ZPCI_FS_BLOCKED >>>> - ZPCI_FS_ERROR >>>> - ZPCI_FS_PERMANENT_ERROR >>>> >>>> We setup a timer and simply go ahead and unplug the device when the >>>> timer expires ("forced unplug"). >>> >>> I agree only for ZPCI_FS_ENABLED why do we need to be smooth for other >>> states? >>> ZPCI_FS_DISABLED may be a candidate even I do not see the interrest but >>> other states of the device should issue a HP_EVENT_DECONFIGURE_REQUEST >>> and we do not need a timer (or any delay) >> >> You can always expect that your guest driver is dead. > > hum, the device is dead. > May be the guest got hit too if the driver is not right written. > >> >>> >>>> >>>> Changing that behavior might be more invasive. Simply not unplugging in >>>> s390_pcihost_timer_cb() on some of these states would mean that somebody >>>> issued a request and that requests is simply lost/ignored. Not sure if >>>> that is what we want. I think we need separate patches to change >>>> something like that. Especially >>>> >>>> 1. What happens if the device was in ZPCI_FS_DISABLED, the guest ignores >>>> the unplug request and moves the device to ZPCI_FS_ENABLED before the >>>> timer expires? These are corner cases to consider. >>> >>> +1, we must ensure to do the work inside the unplug CB. >>> >>>> >>>> 2. Do we need a timer at all? Now that Patch #1 introduces >>>> unplug_requests, we are free to ignore requests from the user if the >>>> guest is not reacting. I would really favor getting rid of the timer >>>> completely. Was there a special reason why this was introduced? >>> >>> Yes, to let a chance to the guest to smoothly relinquish the device. >>> (for example sync/clean the disk) >>> However I do not think it is right implemented. >>> >>>> >>>> No other architecture (e.g. ACPI) uses such a timer. They use a simple >>>> flag to remember if a request is pending. I would really favor going >>>> into that direction. >>> >>> I am not sure that the Intel architecture is a good example. :) >> >> Right, we all learned that zPCI did it better. (sorry ;) ) > > Well I really think so. > It is designed with several guest in parallel and shared devices. > > In such an architecture, ripping of a device from a guest may have interest. > One good thing would be that the software of the guest handle it :) That is indeed true, but I think such a forced removal also works on x86. Theoretically. ("physically rip out the card"). See below. > >> >>> >>> AFAIU they do not wait for the guest to have relinquish the device. >>> Or do they? >>> How long do they wait? >> >> They wait for ever. And I guess we should do the same thing. If the >> guest driver is broken (and this is really a rare scenario!) we would >> not get the device back. Which is perfectly fine in my point of view. In >> all other scenarios I guess the guest will simply respond in a timely >> manner. And ripping out stuff from the guest always feels wrong. (yes >> the channel subsystem works like that, but here we actually have a choice) >> >> If we reboot, we can unplug the device. Otherwise, let's keep it simply >> and don't use a timer. >> >> Thanks! >> > > AFAIK We have no plan to operate on pools of PCI devices so for me I > have no objection to keep it simple: Especially one note: There seems to be demand for a so called "forced PCI removal" also on other architectures. However, this would than rather most probably be modeled on top of what we have right now. E.g. instead of "device_del XXX" which would request the guest to let go of the device, there could be something like "device_del XXX,forced=true". E.g. ask the guest. If it does not respond after some time, force remove it. This is basically the timer, but managed by a different level, of software. And you can than actually decide if you want to do eventually harm to the guest OS. Are there any other objects of getting rid of the timer? Conny could pick up patch #1 once you get an ACK. I would send more patches to drop the timer and rework this patch. Thanks Pierre! > > Regards, > Pierre > > >
On Tue, 29 Jan 2019 19:20:55 +0100 David Hildenbrand <david@redhat.com> wrote: > Conny could pick up patch #1 once you get an ACK. I would send more > patches to drop the timer and rework this patch. Sure, will do.
On 1/29/19 1:20 PM, David Hildenbrand wrote: > On 29.01.19 17:50, Pierre Morel wrote: >> On 29/01/2019 16:11, David Hildenbrand wrote: >>> On 29.01.19 14:50, Pierre Morel wrote: >>>> On 29/01/2019 11:24, David Hildenbrand wrote: >>>>>>>> I'm wondering what the architecture says regarding those events -- can >>>>>>>> someone with access to the documentation comment? >>>>>>> >>>>>>> Ping. Any comments from the IBM folks? >>>> >>>> Hi, >>>> >>>> Sorry to have wait so long. >>>> At least Collin was faster. >>>> >>>>>>> >>>>>> >>>>>> >>>>>> So the idea here is that if we have a PCI device that is the process of >>>>>> being deconfigured and we are also in the middle of a reset, then let's >>>>>> accelerate deconfiguring of the PCI device during the reset. Makes sense. >>>> >>>> to me too. >>>> However, how do we ensure that the guest got time to respond to the >>>> first deconfigure request? >>> >>> 30 seconds, then the reboot. On a reboot, I don't see why we should give >>> a guest more time. "It's dead", rip out the card as the guest refused to >>> hand it back. (maybe it crashed! but after a reboot the guest state is >>> reset and baught back to life) >> >> I agree that 30 seconds is way enough, >> also agree that in most cases the guest is dead. >> >> >>> >>>> >>>>>> >>>>>> Note: >>>>>> >>>>>> The callback function will deconfigure the the device and put it into >>>>>> standby mode. However, a PCI device should only go into standby from the >>>>>> *disabled state* (which it could already be in due to the unplug >>>>>> sequence), or from a *permanent error state* (something we should >>>>>> hopefully never see -- this means something went seriously wrong with >>>>>> the device). >>>> >>>> Not completely exact, the CHSC event 0x304, on the initiative of the >>>> host, force the "deconfigure state" from "configure state" generally, >>>> whatever sub-state it has (enabled/disabled/error...). >>>> >>>>> >>>>> Right, this should already have been checked before setting up the timer. >>>> >>>> Apropos timer, do we need a timer or wouldn't it be better to use a >>>> delay / a timer + condition? >>> >>> I don't think we need a timer at all. >> >> Yes, if it is possible to wait synchronously 30s it seems good to me. > > I mean, we don't have to wait 30 seconds at all. > > 1. We send a request to the guest > 2. It responds (after some seconds), letting go of the zPCI device > 3. We unplug the device > > 1. We send a request to the guest > 2. Guest does not respond, request keeps pending forever > 3. On reboot, unplug the device > > This is how x86/ACPI handles it. > >> >>> >>>> >>>> AFAIU we get out of the unplug without waiting for any answer from the >>>> guest and we surely get the timer triggering after the reset has been done. >>>> That seems bad. >>> >>> This is the case right now, correct. >>> >>>> >>>> >>>>> >>>>>> >>>>>> Two things I'm concerned about: >>>>>> >>>>>> 1) >>>>>> >>>>>> What I would suggest is adding a check for the pbdev->state for >>>>>> ZPCI_FS_DISABLED || ZPCI_FS_PERMANENT_ERROR. If it is in either of these >>>>>> states, then we're safe to deconfigure and put into standby. If the >>>>> >>>>> We setup a timer if !ZPCI_FS_STANDBY and !ZPCI_FS_RESERVED. >>>>> >>>>> So for >>>>> - ZPCI_FS_DISABLED >>>>> - ZPCI_FS_ENABLED >>>>> - ZPCI_FS_BLOCKED >>>>> - ZPCI_FS_ERROR >>>>> - ZPCI_FS_PERMANENT_ERROR >>>>> >>>>> We setup a timer and simply go ahead and unplug the device when the >>>>> timer expires ("forced unplug"). >>>> >>>> I agree only for ZPCI_FS_ENABLED why do we need to be smooth for other >>>> states? >>>> ZPCI_FS_DISABLED may be a candidate even I do not see the interrest but >>>> other states of the device should issue a HP_EVENT_DECONFIGURE_REQUEST >>>> and we do not need a timer (or any delay) >>> >>> You can always expect that your guest driver is dead. >> >> hum, the device is dead. >> May be the guest got hit too if the driver is not right written. >> >>> >>>> >>>>> >>>>> Changing that behavior might be more invasive. Simply not unplugging in >>>>> s390_pcihost_timer_cb() on some of these states would mean that somebody >>>>> issued a request and that requests is simply lost/ignored. Not sure if >>>>> that is what we want. I think we need separate patches to change >>>>> something like that. Especially >>>>> >>>>> 1. What happens if the device was in ZPCI_FS_DISABLED, the guest ignores >>>>> the unplug request and moves the device to ZPCI_FS_ENABLED before the >>>>> timer expires? These are corner cases to consider. >>>> >>>> +1, we must ensure to do the work inside the unplug CB. >>>> >>>>> >>>>> 2. Do we need a timer at all? Now that Patch #1 introduces >>>>> unplug_requests, we are free to ignore requests from the user if the >>>>> guest is not reacting. I would really favor getting rid of the timer >>>>> completely. Was there a special reason why this was introduced? >>>> >>>> Yes, to let a chance to the guest to smoothly relinquish the device. >>>> (for example sync/clean the disk) >>>> However I do not think it is right implemented. >>>> >>>>> >>>>> No other architecture (e.g. ACPI) uses such a timer. They use a simple >>>>> flag to remember if a request is pending. I would really favor going >>>>> into that direction. >>>> >>>> I am not sure that the Intel architecture is a good example. :) >>> >>> Right, we all learned that zPCI did it better. (sorry ;) ) >> >> Well I really think so. >> It is designed with several guest in parallel and shared devices. >> >> In such an architecture, ripping of a device from a guest may have interest. >> One good thing would be that the software of the guest handle it :) > > That is indeed true, but I think such a forced removal also works on > x86. Theoretically. ("physically rip out the card"). See below. > >> >>> >>>> >>>> AFAIU they do not wait for the guest to have relinquish the device. >>>> Or do they? >>>> How long do they wait? >>> >>> They wait for ever. And I guess we should do the same thing. If the >>> guest driver is broken (and this is really a rare scenario!) we would >>> not get the device back. Which is perfectly fine in my point of view. In >>> all other scenarios I guess the guest will simply respond in a timely >>> manner. And ripping out stuff from the guest always feels wrong. (yes >>> the channel subsystem works like that, but here we actually have a choice) >>> >>> If we reboot, we can unplug the device. Otherwise, let's keep it simply >>> and don't use a timer. >>> >>> Thanks! >>> This makes more sense to me. If something goes wrong with unplugging a device, and we use a timeout for forcefully unplug the device... it will never be apparent to the guest or user that something went wrong in the first place! >> >> AFAIK We have no plan to operate on pools of PCI devices so for me I >> have no objection to keep it simple: > > Especially one note: > > There seems to be demand for a so called "forced PCI removal" also on > other architectures. However, this would than rather most probably be > modeled on top of what we have right now. > > E.g. instead of "device_del XXX" which would request the guest to let go > of the device, there could be something like "device_del XXX,forced=true". > > E.g. ask the guest. If it does not respond after some time, force remove > it. This is basically the timer, but managed by a different level, of > software. And you can than actually decide if you want to do eventually > harm to the guest OS. > > Are there any other objects of getting rid of the timer? > > Conny could pick up patch #1 once you get an ACK. I would send more > patches to drop the timer and rework this patch. > > Thanks Pierre! > >> >> Regards, >> Pierre >> >> >> > > David, Pierre: good discussion. I'm in favor of dropping the timer, especially with the notes and examples above. I think the PCI code will look a lot cleaner / easier-to-maintain without it as well.
diff --git a/hw/s390x/s390-pci-bus.c b/hw/s390x/s390-pci-bus.c index bc17a8cf65..b70ae25533 100644 --- a/hw/s390x/s390-pci-bus.c +++ b/hw/s390x/s390-pci-bus.c @@ -1102,6 +1102,14 @@ static void s390_pcihost_reset(DeviceState *dev) { S390pciState *s = S390_PCI_HOST_BRIDGE(dev); PCIBus *bus = s->parent_obj.bus; + S390PCIBusDevice *pbdev, *next; + + /* Unplug all pending devices that were requested to be released */ + QTAILQ_FOREACH_SAFE(pbdev, &s->zpci_devs, link, next) { + if (pbdev->release_timer) { + s390_pcihost_timer_cb(pbdev); + } + } s->bus_no = 0; pci_for_each_device(bus, pci_bus_num(bus), s390_pci_enumerate_bridge, s);
When resetting the guest we should unplug and remove all devices that are still pending. Otherwise the fresh guest will see devices that will suddenly vanish. Can be triggered e.g. via (hmp) device_add virtio-mouse-pci,id=test (hmp) stop (hmp) device_del test (hmp) system_reset (hmp) c The device will vanish after roughly 5 minutes. With this patch, the device will vanish on reboot (S390_RESET_EXTERNAL and S390_RESET_REIPL, which reset the pcihost bridge via qemu_devices_reset()). If we want these devices to vanish directly on any reset (S390_RESET_MODIFIED_CLEAR and S390_RESET_LOAD_NORMAL), we have to modify s390_machine_reset(). But I have the feeling that this should not be done for all reset types. This approach is similar to what's done for acpi PCI hotplug in acpi_pcihp_reset() -> acpi_pcihp_update() -> acpi_pcihp_update_hotplug_bus() -> acpi_pcihp_eject_slot(). s390_pci_generate_plug_event()'s will still be generated, I guess this is not an issue (same thing could happen right now if the timer expires just after reset). Signed-off-by: David Hildenbrand <david@redhat.com> --- hw/s390x/s390-pci-bus.c | 8 ++++++++ 1 file changed, 8 insertions(+)