Message ID | 20230512021518.336460-1-clementwei90@163.com (mailing list archive) |
---|---|
State | Accepted |
Commit | e8afd0d9fccc27c8ad263db5cf5952cfcf72d6fe |
Delegated to: | Bjorn Helgaas |
Headers | show |
Series | [v4] PCI: pciehp: Fix the slot in BLINKINGON_STATE when Presence Detect Changed event occurred | expand |
On Fri, May 12, 2023 at 10:15:18AM +0800, Rongguang Wei wrote: > pciehp's behavior is incorrect if the Attention Button is pressed > on an unoccupied slot. [...] > V2: Update to simple code and avoid gratuitous message. > V3: Add Suggested-by. > V4: Add state change conditional and message. > > Fixes: d331710ea78f ("PCI: pciehp: Become resilient to missed events") > Link: https://lore.kernel.org/linux-pci/20230403054619.19163-1-clementwei90@163.com/ > Link: https://lore.kernel.org/linux-pci/20230421025641.655991-1-clementwei90@163.com/ The changes from previous versions of the patch as well as the links to those previous versions are usually placed below the three dashes. That way they don't become part of the git history as they're generally not interesting for future readers. (The exception of that rule is the gpu subsystem which habitually puts the changelog in the commit message.) However I don't think you need to respin the patch because of that as Bjorn can fix up the commit message when applying. Just so you know for future submissions. > Suggested-by: Lukas Wunner <lukas@wunner.de> > Signed-off-by: Rongguang Wei <weirongguang@kylinos.cn> Reviewed-by: Lukas Wunner <lukas@wunner.de> Cc: stable@vger.kernel.org # v4.19+ Thanks! Lukas > --- > drivers/pci/hotplug/pciehp_ctrl.c | 8 ++++++++ > 1 file changed, 8 insertions(+) > > diff --git a/drivers/pci/hotplug/pciehp_ctrl.c b/drivers/pci/hotplug/pciehp_ctrl.c > index 529c34808440..32baba1b7f13 100644 > --- a/drivers/pci/hotplug/pciehp_ctrl.c > +++ b/drivers/pci/hotplug/pciehp_ctrl.c > @@ -256,6 +256,14 @@ void pciehp_handle_presence_or_link_change(struct controller *ctrl, u32 events) > present = pciehp_card_present(ctrl); > link_active = pciehp_check_link_active(ctrl); > if (present <= 0 && link_active <= 0) { > + if (ctrl->state == BLINKINGON_STATE) { > + ctrl->state = OFF_STATE; > + cancel_delayed_work(&ctrl->button_work); > + pciehp_set_indicators(ctrl, PCI_EXP_SLTCTL_PWR_IND_OFF, > + INDICATOR_NOOP); > + ctrl_info(ctrl, "Slot(%s): Card not present\n", > + slot_name(ctrl)); > + } > mutex_unlock(&ctrl->state_lock); > return; > } > -- > 2.25.1
On Fri, May 12, 2023 at 10:15:18AM +0800, Rongguang Wei wrote: > From: Rongguang Wei <weirongguang@kylinos.cn> > > pciehp's behavior is incorrect if the Attention Button is pressed > on an unoccupied slot. > > When a Presence Detect Changed event has occurred, the slot status > in either BLINKINGOFF_STATE or OFF_STATE, turn it off unconditionally. > But if the slot status is in BLINKINGON_STATE and the slot is currently > empty, the slot status was staying in BLINKINGON_STATE. Thanks for the patch! I don't quite follow the events here. I think the current behavior is this (tell me if I'm going wrong): - Slot is empty (OFF_STATE). - User presses Attention Button. pciehp_handle_button_press() sets state to BLINKINGON_STATE, sets power indicator to blinking, schedules pciehp_queue_pushbutton_work() to turn on power after 5 seconds. - When pciehp_queue_pushbutton_work() runs 5 seconds later, it synthesizes a PCI_EXP_SLTSTA_PDC event and wakes the IRQ thread. - The IRQ thread (pciehp_ist()) calls pciehp_handle_presence_or_link_change(), which does nothing since the slot is in BLINKINGON_STATE, the slot is empty, and the link is not active. - Slot incorrectly remains in BLINKINGON_STATE and power indicator remains blinking. And this patch changes pciehp_handle_presence_or_link_change() so that if the slot is empty, the link is not acive, and the slot is in BLINKINGON_STATE, we put it in OFF_STATE, cancel the delayed work, and turn off the power indicator. After this patch, the user experience is this: - Slot is empty (OFF_STATE). - User presses Attention Button. - Power indicator blinks for 5 seconds. - Power indicator turns off. which definitely seems better. I'm curious why we want the 5 seconds of blinking power indicator at all. We can't really do anything in response to an Attention Button on an empty slot, so could we just ignore it completely in pciehp_handle_button_press()? IIUC, this patch leads to messages like these, which are slightly confusing because we say we're powering up the slot, then later decide "oops, there's nothing here, never mind" (or, I guess the user could push the button, *then* insert the card, and we would power it up, which seems a little sketchy): [ 0.000] pcieport 0000:00:01.5: pciehp: Slot(0-5): Attention button pressed [ 0.001] pcieport 0000:00:01.5: pciehp: Slot(0-5): Powering on due to button press [ 5.001] pcieport 0000:00:01.5: pciehp: Slot(0-5): Card not present Is there a spec that covers the user experience of this case? The closest I could find are SHPC r1.0, sec 2.5, and PCIe r6.0, sec 6.7.1.5. Both mention the 5-second abort interval with the power indicator blinking, but they implicitly assume the slot is occupied. Neither mentions the empty slot case. > The message print like this: > pcieport 0000:00:01.5: pciehp: Slot(0-5): Attention button pressed > pcieport 0000:00:01.5: pciehp: Slot(0-5) Powering on due to button press > pcieport 0000:00:01.5: pciehp: Slot(0-5): Attention button pressed > pcieport 0000:00:01.5: pciehp: Slot(0-5): Button cancel > pcieport 0000:00:01.5: pciehp: Slot(0-5): Action canceled due to button press > > It cause the next Attention Button Pressed event become Button cancel > and missing the Presence Detect Changed event with this button press > though this button presses event is occurred after 5s. It seems like the problem ("empty slot staying in BLINKINGON_STATE forever after one Attention Button event") only requires one button press. If so, why do we talk about the *next* button press here? > According to the Commit d331710ea78f ("PCI: pciehp: Become resilient > to missed events"), if the slot is currently occupied, turn it on and > if the slot is empty, it need to set in OFF_STATE rather than stay in > current status when pciehp_handle_presence_or_link_change() bails out. > > Fixes: d331710ea78f ("PCI: pciehp: Become resilient to missed events") > Link: https://lore.kernel.org/linux-pci/20230403054619.19163-1-clementwei90@163.com/ > Link: https://lore.kernel.org/linux-pci/20230421025641.655991-1-clementwei90@163.com/ > Suggested-by: Lukas Wunner <lukas@wunner.de> > Signed-off-by: Rongguang Wei <weirongguang@kylinos.cn> > --- > drivers/pci/hotplug/pciehp_ctrl.c | 8 ++++++++ > 1 file changed, 8 insertions(+) > > diff --git a/drivers/pci/hotplug/pciehp_ctrl.c b/drivers/pci/hotplug/pciehp_ctrl.c > index 529c34808440..32baba1b7f13 100644 > --- a/drivers/pci/hotplug/pciehp_ctrl.c > +++ b/drivers/pci/hotplug/pciehp_ctrl.c > @@ -256,6 +256,14 @@ void pciehp_handle_presence_or_link_change(struct controller *ctrl, u32 events) > present = pciehp_card_present(ctrl); > link_active = pciehp_check_link_active(ctrl); > if (present <= 0 && link_active <= 0) { > + if (ctrl->state == BLINKINGON_STATE) { > + ctrl->state = OFF_STATE; > + cancel_delayed_work(&ctrl->button_work); > + pciehp_set_indicators(ctrl, PCI_EXP_SLTCTL_PWR_IND_OFF, > + INDICATOR_NOOP); > + ctrl_info(ctrl, "Slot(%s): Card not present\n", > + slot_name(ctrl)); > + } > mutex_unlock(&ctrl->state_lock); > return; > }
On Wed, May 17, 2023 at 04:02:01PM -0500, Bjorn Helgaas wrote: > I'm curious why we want the 5 seconds of blinking power indicator at > all. We can't really do anything in response to an Attention Button > on an empty slot, so could we just ignore it completely in > pciehp_handle_button_press()? That wouldn't cover the case where the slot is occupied when the button is pressed, but the card is yanked out during the 5 second blinking interval. We'd still need the present commit to handle that case. Thanks, Lukas
On Thu, May 18, 2023 at 08:25:57AM +0200, Lukas Wunner wrote: > On Wed, May 17, 2023 at 04:02:01PM -0500, Bjorn Helgaas wrote: > > I'm curious why we want the 5 seconds of blinking power indicator at > > all. We can't really do anything in response to an Attention Button > > on an empty slot, so could we just ignore it completely in > > pciehp_handle_button_press()? > > That wouldn't cover the case where the slot is occupied when the > button is pressed, but the card is yanked out during the 5 second > blinking interval. Obviously we can't ignore a button press when the slot is occupied, because that's part of the "insert card, press button to power it up" and "press button to power down card, remove card" flows. I'm asking about ignoring it when the slot is empty, which would mean adding a check for card presence in pciehp_handle_button_press(). But maybe there's a reason why we can't do that there? Bjorn
Hi On 5/18/23 7:09 PM, Bjorn Helgaas wrote: > On Thu, May 18, 2023 at 08:25:57AM +0200, Lukas Wunner wrote: >> On Wed, May 17, 2023 at 04:02:01PM -0500, Bjorn Helgaas wrote: >>> I'm curious why we want the 5 seconds of blinking power indicator at >>> all. We can't really do anything in response to an Attention Button >>> on an empty slot, so could we just ignore it completely in >>> pciehp_handle_button_press()? >> >> That wouldn't cover the case where the slot is occupied when the >> button is pressed, but the card is yanked out during the 5 second >> blinking interval. > > Obviously we can't ignore a button press when the slot is occupied, > because that's part of the "insert card, press button to power it up" > and "press button to power down card, remove card" flows. > > I'm asking about ignoring it when the slot is empty, which would mean > adding a check for card presence in pciehp_handle_button_press(). But > maybe there's a reason why we can't do that there? > > Bjorn > I think we can't add a check in pciehp_handle_button_press() because this function is handle the ABP event and the slot is occupied or empty is PDC event. Those two events are better not control in one function. Thanks.
On Thu, May 18, 2023 at 06:09:41AM -0500, Bjorn Helgaas wrote: > On Thu, May 18, 2023 at 08:25:57AM +0200, Lukas Wunner wrote: > > On Wed, May 17, 2023 at 04:02:01PM -0500, Bjorn Helgaas wrote: > > > I'm curious why we want the 5 seconds of blinking power indicator at > > > all. We can't really do anything in response to an Attention Button > > > on an empty slot, so could we just ignore it completely in > > > pciehp_handle_button_press()? > > > > That wouldn't cover the case where the slot is occupied when the > > button is pressed, but the card is yanked out during the 5 second > > blinking interval. > > Obviously we can't ignore a button press when the slot is occupied, > because that's part of the "insert card, press button to power it up" > and "press button to power down card, remove card" flows. > > I'm asking about ignoring it when the slot is empty, which would mean > adding a check for card presence in pciehp_handle_button_press(). But > maybe there's a reason why we can't do that there? It would of course be possible to copy/paste the pciehp_card_present() + pciehp_check_link_active() check from pciehp_handle_presence_or_link_change(). The only downside is that the symmetry between the ON_STATE / OFF_STATE cases in pciehp_handle_button_press() could no longer be preserved. (Because the additional check only applies to OFF_STATE.) So it could be argued that readability becomes a little worse and the logic of the state machine slightly more difficult to follow. Ultimately any engineering discipline boils down to balancing various competing traits (such as simplicity of code versus usability) and personally I would decide to continue allowing the "press button first, insert card afterwards" usage model because it keeps the code lean. Unfortunately back in the day the PCISIG decided to saddle PCIe hotplug with numerous optional features which now complicate implementations. Form factors implementing the Attention Button seem pretty rare, as evidenced by the fact this glitch was discovered by Rongguang Wei almost 5 years after the issue was introduced. I think most modern form factors just use surprise-removal instead (NVMe drives in data centers specifically, and Thunderbolt). The present commit is necessary anyway to deal with the "card is in slot when button is pressed, but yanked within 5 seconds" case. The additional check in pciehp_handle_button_press() you're contemplating to avoid bringup attempts if the card is not in the slot upon button press is optional. Your call. :) Thanks, Lukas
On Wed, May 17, 2023 at 04:02:01PM -0500, Bjorn Helgaas wrote: > On Fri, May 12, 2023 at 10:15:18AM +0800, Rongguang Wei wrote: > > From: Rongguang Wei <weirongguang@kylinos.cn> > > > > pciehp's behavior is incorrect if the Attention Button is pressed > > on an unoccupied slot. > > > > When a Presence Detect Changed event has occurred, the slot status > > in either BLINKINGOFF_STATE or OFF_STATE, turn it off unconditionally. Was this supposed to say "BLINKINGOFF_STATE or ON_STATE" (not "OFF_STATE")? BLINKINGOFF_STATE and ON_STATE are the only cases where pciehp_handle_presence_or_link_change() calls pciehp_disable_slot() to turn off slot power. > > The message print like this: > > pcieport 0000:00:01.5: pciehp: Slot(0-5): Attention button pressed > > pcieport 0000:00:01.5: pciehp: Slot(0-5) Powering on due to button press > > pcieport 0000:00:01.5: pciehp: Slot(0-5): Attention button pressed > > pcieport 0000:00:01.5: pciehp: Slot(0-5): Button cancel > > pcieport 0000:00:01.5: pciehp: Slot(0-5): Action canceled due to button press I think the above messages are from *two* button presses, which complicates the description unnecessarily, since IIUC we only need one press to see the problem. I propose the following commit log: If a PCIe hotplug slot has an Attention Button, the normal hot-add flow is: - Slot is empty and slot power is off - User inserts card in slot and presses Attention Button - OS blinks Power Indicator for 5 seconds - After 5 seconds, OS turns on Power Indicator, turns on slot power, and enumerates the device Previously, if a user pressed the Attention Button on an *empty* slot, pciehp logged the following messages and blinked the Power Indicator indefinitely: [0.000] pcieport 0000:00:01.5: pciehp: Slot(0-5): Attention button pressed [0.001] pcieport 0000:00:01.5: pciehp: Slot(0-5): Powering on due to button press An empty slot is in OFF_STATE. When the Attention Button is pressed, pciehp_handle_button_press() puts the slot in BLINKINGON_STATE, sets the Power Indicator blinking, and schedules pciehp_queue_pushbutton_work() to run 5 seconds later. pciehp_queue_pushbutton_work() synthesizes a Presence Detect Changed event, and pciehp_handle_presence_or_link_change() exits when it finds the slot empty, leaving the slot in BLINKINGON_STATE with the Power Indicator blinking. To fix the indefinitely blinking Power Indicator, change pciehp_handle_presence_or_link_change() to put the empty slot back in OFF_STATE and turn off the Power Indicator before exiting. The Power Indicator will blink for 5 seconds before stopping, and messages like this will be logged: [0.000] pcieport 0000:00:01.5: pciehp: Slot(0-5): Attention button pressed [0.001] pcieport 0000:00:01.5: pciehp: Slot(0-5): Powering on due to button press [5.001] pcieport 0000:00:01.5: pciehp: Slot(0-5): Card not present Let me know if I got anything wrong above. Bjorn > > Fixes: d331710ea78f ("PCI: pciehp: Become resilient to missed events") > > Link: https://lore.kernel.org/linux-pci/20230403054619.19163-1-clementwei90@163.com/ > > Link: https://lore.kernel.org/linux-pci/20230421025641.655991-1-clementwei90@163.com/ > > Suggested-by: Lukas Wunner <lukas@wunner.de> > > Signed-off-by: Rongguang Wei <weirongguang@kylinos.cn> > > --- > > drivers/pci/hotplug/pciehp_ctrl.c | 8 ++++++++ > > 1 file changed, 8 insertions(+) > > > > diff --git a/drivers/pci/hotplug/pciehp_ctrl.c b/drivers/pci/hotplug/pciehp_ctrl.c > > index 529c34808440..32baba1b7f13 100644 > > --- a/drivers/pci/hotplug/pciehp_ctrl.c > > +++ b/drivers/pci/hotplug/pciehp_ctrl.c > > @@ -256,6 +256,14 @@ void pciehp_handle_presence_or_link_change(struct controller *ctrl, u32 events) > > present = pciehp_card_present(ctrl); > > link_active = pciehp_check_link_active(ctrl); > > if (present <= 0 && link_active <= 0) { > > + if (ctrl->state == BLINKINGON_STATE) { > > + ctrl->state = OFF_STATE; > > + cancel_delayed_work(&ctrl->button_work); > > + pciehp_set_indicators(ctrl, PCI_EXP_SLTCTL_PWR_IND_OFF, > > + INDICATOR_NOOP); > > + ctrl_info(ctrl, "Slot(%s): Card not present\n", > > + slot_name(ctrl)); > > + } > > mutex_unlock(&ctrl->state_lock); > > return; > > }
On Fri, May 19, 2023 at 03:55:35PM -0500, Bjorn Helgaas wrote: > On Wed, May 17, 2023 at 04:02:01PM -0500, Bjorn Helgaas wrote: > > On Fri, May 12, 2023 at 10:15:18AM +0800, Rongguang Wei wrote: > > > From: Rongguang Wei <weirongguang@kylinos.cn> > > > > > > pciehp's behavior is incorrect if the Attention Button is pressed > > > on an unoccupied slot. > > > > > > When a Presence Detect Changed event has occurred, the slot status > > > in either BLINKINGOFF_STATE or OFF_STATE, turn it off unconditionally. > > Was this supposed to say "BLINKINGOFF_STATE or ON_STATE" > (not "OFF_STATE")? Yes I think you're right. > I propose the following commit log: [...] > pciehp_queue_pushbutton_work() synthesizes a Presence Detect Changed > event, and pciehp_handle_presence_or_link_change() exits when it > finds the slot empty, leaving the slot in BLINKINGON_STATE with the > Power Indicator blinking. > > To fix the indefinitely blinking Power Indicator, change > pciehp_handle_presence_or_link_change() to put the empty slot back > in OFF_STATE and turn off the Power Indicator before exiting. The indefinitely blinking Power Indicator is only one half of the problem. The other half is that the next button press doesn't result in slot bringup, even if the slot is occupied and the 5 second timeout has elapsed. Suggested wording, feel free to rephrase as you see fit: Because the slot was previously left in BLINKINGON_STATE, the next button press was interpreted as a "button cancel" event, even if the slot was occupied upon that next button press: pciehp stopped blinking and did not perform another slot bringup attempt. By putting the slot in OFF_STATE, such user-unfriendly behavior is avoided: Instead, the next button press will result in the slot starting to blink again and another bringup attempt after 5 seconds. Thanks, Lukas
On Sat, May 20, 2023 at 10:31:18AM +0200, Lukas Wunner wrote: > On Fri, May 19, 2023 at 03:55:35PM -0500, Bjorn Helgaas wrote: > > On Wed, May 17, 2023 at 04:02:01PM -0500, Bjorn Helgaas wrote: > > > On Fri, May 12, 2023 at 10:15:18AM +0800, Rongguang Wei wrote: > > > > From: Rongguang Wei <weirongguang@kylinos.cn> > > > > > > > > pciehp's behavior is incorrect if the Attention Button is pressed > > > > on an unoccupied slot. > > > > > > > > When a Presence Detect Changed event has occurred, the slot status > > > > in either BLINKINGOFF_STATE or OFF_STATE, turn it off unconditionally. > > > > Was this supposed to say "BLINKINGOFF_STATE or ON_STATE" > > (not "OFF_STATE")? > > Yes I think you're right. > > > I propose the following commit log: > [...] > > pciehp_queue_pushbutton_work() synthesizes a Presence Detect Changed > > event, and pciehp_handle_presence_or_link_change() exits when it > > finds the slot empty, leaving the slot in BLINKINGON_STATE with the > > Power Indicator blinking. > > > > To fix the indefinitely blinking Power Indicator, change > > pciehp_handle_presence_or_link_change() to put the empty slot back > > in OFF_STATE and turn off the Power Indicator before exiting. > > The indefinitely blinking Power Indicator is only one half of the problem. > The other half is that the next button press doesn't result in slot > bringup, even if the slot is occupied and the 5 second timeout has > elapsed. Thanks for your patience, I think I understand that. Here's another try: Previously, if a user pressed the Attention Button on an *empty* slot, pciehp logged the following messages and blinked the Power Indicator until a second button press: [0.000] pciehp: Attention button pressed [0.001] pciehp: Powering on due to button press [0.002] # Power Indicator starts blinking [5.002] # 5 second timeout should abort power-on sequence, but doesn't [8.000] # Power Indicator should be off, but is still blinking [9.000] # possible card insertion [9.000] pciehp: Attention button pressed [9.001] pciehp: Button cancel [9.002] pciehp: Action canceled due to button press The first button press incorrectly left the slot in BLINKINGON_STATE, so the second was interpreted as a "button cancel" event regardless of whether a card was present. If the slot is empty, turn off the Power Indicator and return from BLINKINGON_STATE to OFF_STATE after 5 seconds. Putting the slot in OFF_STATE also means the second button press will correctly start a bringup attempt if the slot is occupied. Maybe the above is enough for a commit log. The notes below are my attempt to work through in more detail: IIUC, if the button is pressed twice on an empty slot, we end up back in the "Empty slot, OFF" state (although the indicator blinks until the second press, when it should stop after 5 seconds), and inserting a card and pressing the button works as expected. The problem is when the card is inserted between first and second button presses, where the second press cancels the BLINKINGON when it should *start* BLINKINGON. A third press would power on the slot, when it should go to BLINKINGOFF to power it off: Slot v6.4 Expected -------- ----------- ----------- Slot empty Empty OFF OFF Button press 1 Empty BLINKINGON BLINKINGON "Powering on" "Powering on" sched-work sched-work +5s synth PDC Empty BLINKINGON OFF (a) "Card not present" Insert card Occupied BLINKINGON OFF Button press 2 Occupied OFF BLINKINGON "Button cancel" "Powering on" sched-work +5s synth PDC Occupied (b, N/A) POWERON Power control Occupied (b, N/A) ON Button press 3 Occupied BLINKINGON BLINKINGOFF "Powering on" "Powering off" sched-work sched-work +5s synth PDC Occupied POWERON POWEROFF Power control Occupied ON OFF At (a), v6.4-rc1 will blink until another button press. At (b), the button press generates a "Button cancel" message and does not schedule button_work. And (b) is the situation you refer to where the second button press doesn't bring the slot up when it should. Right? > Suggested wording, feel free to rephrase as you see fit: > > Because the slot was previously left in BLINKINGON_STATE, the next > button press was interpreted as a "button cancel" event, even if the > slot was occupied upon that next button press: pciehp stopped blinking > and did not perform another slot bringup attempt. > > By putting the slot in OFF_STATE, such user-unfriendly behavior is > avoided: Instead, the next button press will result in the slot > starting to blink again and another bringup attempt after 5 seconds. > > Thanks, > > Lukas
On Mon, May 22, 2023 at 04:10:36PM -0500, Bjorn Helgaas wrote: > Thanks for your patience, I think I understand that. Here's another > try: The wording LGTM. > And (b) is the situation you refer to where the second button press > doesn't bring the slot up when it should. Right? Yes, exactly. Thanks, Lukas
diff --git a/drivers/pci/hotplug/pciehp_ctrl.c b/drivers/pci/hotplug/pciehp_ctrl.c index 529c34808440..32baba1b7f13 100644 --- a/drivers/pci/hotplug/pciehp_ctrl.c +++ b/drivers/pci/hotplug/pciehp_ctrl.c @@ -256,6 +256,14 @@ void pciehp_handle_presence_or_link_change(struct controller *ctrl, u32 events) present = pciehp_card_present(ctrl); link_active = pciehp_check_link_active(ctrl); if (present <= 0 && link_active <= 0) { + if (ctrl->state == BLINKINGON_STATE) { + ctrl->state = OFF_STATE; + cancel_delayed_work(&ctrl->button_work); + pciehp_set_indicators(ctrl, PCI_EXP_SLTCTL_PWR_IND_OFF, + INDICATOR_NOOP); + ctrl_info(ctrl, "Slot(%s): Card not present\n", + slot_name(ctrl)); + } mutex_unlock(&ctrl->state_lock); return; }