diff mbox series

PCI: pciehp: clear cmd_busy bit when Command Completed in polling mode

Message ID 20211111054258.7309-1-zhangliguang@linux.alibaba.com (mailing list archive)
State Accepted
Delegated to: Bjorn Helgaas
Headers show
Series PCI: pciehp: clear cmd_busy bit when Command Completed in polling mode | expand

Commit Message

luanshi Nov. 11, 2021, 5:42 a.m. UTC
This patch fixes this problem that on driver probe from system startup,
pciehp checks the Presence Detect State bit in the Slot Status register
to bring up an occupied slot or bring down an unoccupied slot. If empty
slot's power status is on, turn power off. The Hot-Plug interrupt isn't
requested yet, so avoid triggering a notification by calling
pcie_disable_notification().

Both the CCIE and HPIE bits are masked in pcie_disable_notification(),
when we issue a hotplug command, pcie_wait_cmd() will polling the
Command Completed bit instead of waiting for an interrupt. But cmd_busy
bit was not cleared when Command Completed which results in timeouts
like this in pciehp_power_off_slot() and pcie_init_notification():

  pcieport 0000:00:03.0: pciehp: Timeout on hotplug command 0x01c0
(issued 2264 msec ago)
  pcieport 0000:00:03.0: pciehp: Timeout on hotplug command 0x05c0
(issued 2288 msec ago)

Signed-off-by: Liguang Zhang <zhangliguang@linux.alibaba.com>
---
 drivers/pci/hotplug/pciehp_hpc.c | 2 ++
 1 file changed, 2 insertions(+)

Comments

luanshi Nov. 18, 2021, 11 a.m. UTC | #1
Hi Bjorn & Lukas & Kuppuswamy & Amey,

Gentle ping! Any comments on this patch?


在 2021/11/11 13:42, Liguang Zhang 写道:
> This patch fixes this problem that on driver probe from system startup,
> pciehp checks the Presence Detect State bit in the Slot Status register
> to bring up an occupied slot or bring down an unoccupied slot. If empty
> slot's power status is on, turn power off. The Hot-Plug interrupt isn't
> requested yet, so avoid triggering a notification by calling
> pcie_disable_notification().
>
> Both the CCIE and HPIE bits are masked in pcie_disable_notification(),
> when we issue a hotplug command, pcie_wait_cmd() will polling the
> Command Completed bit instead of waiting for an interrupt. But cmd_busy
> bit was not cleared when Command Completed which results in timeouts
> like this in pciehp_power_off_slot() and pcie_init_notification():
>
>    pcieport 0000:00:03.0: pciehp: Timeout on hotplug command 0x01c0
> (issued 2264 msec ago)
>    pcieport 0000:00:03.0: pciehp: Timeout on hotplug command 0x05c0
> (issued 2288 msec ago)
>
> Signed-off-by: Liguang Zhang <zhangliguang@linux.alibaba.com>
> ---
>   drivers/pci/hotplug/pciehp_hpc.c | 2 ++
>   1 file changed, 2 insertions(+)
>
> diff --git a/drivers/pci/hotplug/pciehp_hpc.c b/drivers/pci/hotplug/pciehp_hpc.c
> index 83a0fa119cae..8698aefc6041 100644
> --- a/drivers/pci/hotplug/pciehp_hpc.c
> +++ b/drivers/pci/hotplug/pciehp_hpc.c
> @@ -98,6 +98,8 @@ static int pcie_poll_cmd(struct controller *ctrl, int timeout)
>   		if (slot_status & PCI_EXP_SLTSTA_CC) {
>   			pcie_capability_write_word(pdev, PCI_EXP_SLTSTA,
>   						   PCI_EXP_SLTSTA_CC);
> +			ctrl->cmd_busy = 0;
> +			smp_mb();
>   			return 1;
>   		}
>   		msleep(10);
Lukas Wunner Nov. 19, 2021, noon UTC | #2
On Thu, Nov 11, 2021 at 01:42:58PM +0800, Liguang Zhang wrote:
> Both the CCIE and HPIE bits are masked in pcie_disable_notification(),
> when we issue a hotplug command, pcie_wait_cmd() will polling the
> Command Completed bit instead of waiting for an interrupt. But cmd_busy
> bit was not cleared when Command Completed which results in timeouts
> like this in pciehp_power_off_slot() and pcie_init_notification():
> 
>   pcieport 0000:00:03.0: pciehp: Timeout on hotplug command 0x01c0
> (issued 2264 msec ago)
>   pcieport 0000:00:03.0: pciehp: Timeout on hotplug command 0x05c0
> (issued 2288 msec ago)

The first timeout occurs with the following bits set in ctrl->slot_ctrl:
  PCI_EXP_SLTCTL_PWR_IND_ON | PCI_EXP_SLTCTL_ATTN_IND_OFF

Those bits are set by:
  board_added()
    pciehp_set_indicators()


The second timeout occurs with:
  PCI_EXP_SLTCTL_PWR_IND_ON | PCI_EXP_SLTCTL_ATTN_IND_OFF |
  PCI_EXP_SLTCTL_PWR_OFF

This might be triggered by:
  remove_board()
    pciehp_power_off_slot()


So it seems Command Completed is not signaled when changing the
Power Indicator, Attention Indicator and Power Controller Control
bits in the Slot Control register.  But apparently it works for
the other bits.

We know there are hotplug controllers out there which suffer from
broken Command Completed support.  They support it for the bits
mentioned above but not the others.  So the inverse behavior from
your hotplug controller.  See this code comment in pcie_do_write_cmd():

	/*
	 * Controllers with the Intel CF118 and similar errata advertise
	 * Command Completed support, but they only set Command Completed
	 * if we change the "Control" bits for power, power indicator,
	 * attention indicator, or interlock.  If we only change the
	 * "Enable" bits, they never set the Command Completed bit.
	 */


> --- a/drivers/pci/hotplug/pciehp_hpc.c
> +++ b/drivers/pci/hotplug/pciehp_hpc.c
> @@ -98,6 +98,8 @@ static int pcie_poll_cmd(struct controller *ctrl, int timeout)
>  		if (slot_status & PCI_EXP_SLTSTA_CC) {
>  			pcie_capability_write_word(pdev, PCI_EXP_SLTSTA,
>  						   PCI_EXP_SLTSTA_CC);
> +			ctrl->cmd_busy = 0;
> +			smp_mb();
>  			return 1;
>  		}
>  		msleep(10);

I suspect that this patch merely papers over the problem and that
the real solution would be to either apply quirk_cmd_compl or a
similar quirk to your hotplug controller.

Please open a bug on bugzilla.kernel.org and attach full output
of lspci -vv and dmesg.  Be sure to add the following to the
command line:
  pciehp.pciehp_debug=1 dyndbg="file pciehp* +p"

Once you've done that, please report the bugzilla link here
so that we can analyze the issue properly.

Thanks,

Lukas
luanshi Nov. 21, 2021, 1:50 a.m. UTC | #3
Hi,thanks for your comments.

在 2021/11/19 20:00, Lukas Wunner 写道:
> On Thu, Nov 11, 2021 at 01:42:58PM +0800, Liguang Zhang wrote:
>> Both the CCIE and HPIE bits are masked in pcie_disable_notification(),
>> when we issue a hotplug command, pcie_wait_cmd() will polling the
>> Command Completed bit instead of waiting for an interrupt. But cmd_busy
>> bit was not cleared when Command Completed which results in timeouts
>> like this in pciehp_power_off_slot() and pcie_init_notification():
>>
>>    pcieport 0000:00:03.0: pciehp: Timeout on hotplug command 0x01c0
>> (issued 2264 msec ago)
>>    pcieport 0000:00:03.0: pciehp: Timeout on hotplug command 0x05c0
>> (issued 2288 msec ago)
> The first timeout occurs with the following bits set in ctrl->slot_ctrl:
>    PCI_EXP_SLTCTL_PWR_IND_ON | PCI_EXP_SLTCTL_ATTN_IND_OFF

After some debug, the first timeout occurs:

     pciehp_power_off_slot

         pcie_write_cmd(ctrl, PCI_EXP_SLTCTL_PWR_OFF, PCI_EXP_SLTCTL_PCC)

             pcie_do_write_cmd

                     /*
                      * Always wait for any previous command that might 
still be in progress
                      */
                     pcie_wait_cmd(ctrl);  // at the beginning

                         if (!ctrl->cmd_busy)  // cmd_busy was not zero
                             return;


Why cmd_busy was not flase, because in function 
pcie_disable_notification cmd_busy was not setting correctly:

     pcie_disable_notification  //  Both the CCIE and HPIE bits are masked

         pcie_write_cmd

             pcie_do_write_cmd

                 ctrl->cmd_busy = 1  // cmd_busy was setting to 1

                 pcie_wait_cmd

                     pcie_poll_cmd // pcie_wait_cmd() will polling the 
Command Completed bit instead of waiting for an interrupt

                          After debug we found Command Completed can be 
signaled without cmd_busy was setting to 0.


If we use interrupt mode pciehp_isr:

         pciehp_isr

             if (events & PCI_EXP_SLTSTA_CC) {
                 ctrl->cmd_busy = 0;  //  cmd_busy was setting to zero, 
this was left in pcie_poll_cmd.


Thanks,

Liguang

>
> Those bits are set by:
>    board_added()
>      pciehp_set_indicators()
>
>
> The second timeout occurs with:
>    PCI_EXP_SLTCTL_PWR_IND_ON | PCI_EXP_SLTCTL_ATTN_IND_OFF |
>    PCI_EXP_SLTCTL_PWR_OFF
>
> This might be triggered by:
>    remove_board()
>      pciehp_power_off_slot()
>
>
> So it seems Command Completed is not signaled when changing the
> Power Indicator, Attention Indicator and Power Controller Control
> bits in the Slot Control register.  But apparently it works for
> the other bits.
>
> We know there are hotplug controllers out there which suffer from
> broken Command Completed support.  They support it for the bits
> mentioned above but not the others.  So the inverse behavior from
> your hotplug controller.  See this code comment in pcie_do_write_cmd():
>
> 	/*
> 	 * Controllers with the Intel CF118 and similar errata advertise
> 	 * Command Completed support, but they only set Command Completed
> 	 * if we change the "Control" bits for power, power indicator,
> 	 * attention indicator, or interlock.  If we only change the
> 	 * "Enable" bits, they never set the Command Completed bit.
> 	 */
>
>
>> --- a/drivers/pci/hotplug/pciehp_hpc.c
>> +++ b/drivers/pci/hotplug/pciehp_hpc.c
>> @@ -98,6 +98,8 @@ static int pcie_poll_cmd(struct controller *ctrl, int timeout)
>>   		if (slot_status & PCI_EXP_SLTSTA_CC) {
>>   			pcie_capability_write_word(pdev, PCI_EXP_SLTSTA,
>>   						   PCI_EXP_SLTSTA_CC);
>> +			ctrl->cmd_busy = 0;
>> +			smp_mb();
>>   			return 1;
>>   		}
>>   		msleep(10);
> I suspect that this patch merely papers over the problem and that
> the real solution would be to either apply quirk_cmd_compl or a
> similar quirk to your hotplug controller.
>
> Please open a bug on bugzilla.kernel.org and attach full output
> of lspci -vv and dmesg.  Be sure to add the following to the
> command line:
>    pciehp.pciehp_debug=1 dyndbg="file pciehp* +p"
>
> Once you've done that, please report the bugzilla link here
> so that we can analyze the issue properly.
>
> Thanks,
>
> Lukas
Lukas Wunner Nov. 21, 2021, 6:35 a.m. UTC | #4
On Sun, Nov 21, 2021 at 09:50:38AM +0800, luanshi wrote:
> 2021/11/19 20:00, Lukas Wunner:
> > Please open a bug on bugzilla.kernel.org and attach full output
> > of lspci -vv and dmesg.  Be sure to add the following to the
> > command line:
> >    pciehp.pciehp_debug=1 dyndbg="file pciehp* +p"
> > 
> > Once you've done that, please report the bugzilla link here
> > so that we can analyze the issue properly.

I really need you to perform the above steps in order to analyze
what's going on here.

Again, if you get such timeout messages, it's usually not caused
by a bug in the driver but by an erratum in the hardware,
i.e. the hardware neglected to signal Command Completed in response
to a Slot Control register write.

Thanks,

Lukas
luanshi Nov. 26, 2021, 7:47 a.m. UTC | #5
Sorry for the late reply, the machine is occupied by someone else.

For detailed information:

https://bugzilla.kernel.org/show_bug.cgi?id=215143


Thanks,

Liguang

在 2021/11/21 14:35, Lukas Wunner 写道:
> On Sun, Nov 21, 2021 at 09:50:38AM +0800, luanshi wrote:
>> 2021/11/19 20:00, Lukas Wunner:
>>> Please open a bug on bugzilla.kernel.org and attach full output
>>> of lspci -vv and dmesg.  Be sure to add the following to the
>>> command line:
>>>     pciehp.pciehp_debug=1 dyndbg="file pciehp* +p"
>>>
>>> Once you've done that, please report the bugzilla link here
>>> so that we can analyze the issue properly.
> I really need you to perform the above steps in order to analyze
> what's going on here.
>
> Again, if you get such timeout messages, it's usually not caused
> by a bug in the driver but by an erratum in the hardware,
> i.e. the hardware neglected to signal Command Completed in response
> to a Slot Control register write.
>
> Thanks,
>
> Lukas
Lukas Wunner Nov. 26, 2021, 5:33 p.m. UTC | #6
On Thu, Nov 11, 2021 at 01:42:58PM +0800, Liguang Zhang wrote:
> This patch fixes this problem that on driver probe from system startup,
> pciehp checks the Presence Detect State bit in the Slot Status register
> to bring up an occupied slot or bring down an unoccupied slot. If empty
> slot's power status is on, turn power off. The Hot-Plug interrupt isn't
> requested yet, so avoid triggering a notification by calling
> pcie_disable_notification().
> 
> Both the CCIE and HPIE bits are masked in pcie_disable_notification(),
> when we issue a hotplug command, pcie_wait_cmd() will polling the
> Command Completed bit instead of waiting for an interrupt. But cmd_busy
> bit was not cleared when Command Completed which results in timeouts
> like this in pciehp_power_off_slot() and pcie_init_notification():
> 
>   pcieport 0000:00:03.0: pciehp: Timeout on hotplug command 0x01c0
> (issued 2264 msec ago)
>   pcieport 0000:00:03.0: pciehp: Timeout on hotplug command 0x05c0
> (issued 2288 msec ago)
> 
> Signed-off-by: Liguang Zhang <zhangliguang@linux.alibaba.com>

Fixes: a5dd4b4b0570 ("PCI: pciehp: Wait for hotplug command completion where necessary")
Link: https://bugzilla.kernel.org/show_bug.cgi?id=215143
Reviewed-by: Lukas Wunner <lukas@wunner.de>
Cc: stable@vger.kernel.org # v4.19+

Thanks a lot, that's a really good catch.

It's a somewhat intricate bug, so I'll try to explain in my own words:

If notification is disabled (HPIE or CCIE not set in the Slot Status
register), we rely on pcie_poll_cmd() to poll for Command Completed.
But once it's signaled, we neglect to clear ctrl->cmd_busy.
(Normally it is cleared by the hardirq handler pciehp_isr() if
notification is enabled.)

The result is that starting with the second Slot Control write,
pciehp will gratuitously wait for a command to finish which has
already finished and it will incorrectly report a timeout.

The bug was originally introduced in 2015 by commit a5dd4b4b0570
("PCI: pciehp: Wait for hotplug command completion where necessary"),
but didn't manifest itself because the first Slot Control Write already
enabled notification and from that point on the hardirq handler would
clear ctrl->cmd_busy.  However I think the bug may have manifested
itself with pciehp_poll_mode=1.

It wasn't until commit 4e6a13356f1c ("PCI: pciehp: Deduplicate presence
check on probe & resume") that multiple consecutive Slot Control writes
were performed on ->probe with notification disabled, so that's the
commit which first exposed the bug with pciehp_poll_mode=0.

Thanks,

Lukas

> ---
>  drivers/pci/hotplug/pciehp_hpc.c | 2 ++
>  1 file changed, 2 insertions(+)
> 
> diff --git a/drivers/pci/hotplug/pciehp_hpc.c b/drivers/pci/hotplug/pciehp_hpc.c
> index 83a0fa119cae..8698aefc6041 100644
> --- a/drivers/pci/hotplug/pciehp_hpc.c
> +++ b/drivers/pci/hotplug/pciehp_hpc.c
> @@ -98,6 +98,8 @@ static int pcie_poll_cmd(struct controller *ctrl, int timeout)
>  		if (slot_status & PCI_EXP_SLTSTA_CC) {
>  			pcie_capability_write_word(pdev, PCI_EXP_SLTSTA,
>  						   PCI_EXP_SLTSTA_CC);
> +			ctrl->cmd_busy = 0;
> +			smp_mb();
>  			return 1;
>  		}
>  		msleep(10);
> -- 
> 2.19.1.6.gb485710b
Lukas Wunner Nov. 26, 2021, 5:35 p.m. UTC | #7
On Fri, Nov 26, 2021 at 06:33:09PM +0100, Lukas Wunner wrote:
> Cc: stable@vger.kernel.org # v4.19+

Sorry, I meant v4.2+.
luanshi Jan. 6, 2022, 8:55 a.m. UTC | #8
Hi Bjorn & Lukas & Kuppuswamy & Amey,

Gentle ping! Any comments on this patch,should be merged?


Thanks,

Liguang

在 2021/11/27 1:33, Lukas Wunner 写道:
> On Thu, Nov 11, 2021 at 01:42:58PM +0800, Liguang Zhang wrote:
>> This patch fixes this problem that on driver probe from system startup,
>> pciehp checks the Presence Detect State bit in the Slot Status register
>> to bring up an occupied slot or bring down an unoccupied slot. If empty
>> slot's power status is on, turn power off. The Hot-Plug interrupt isn't
>> requested yet, so avoid triggering a notification by calling
>> pcie_disable_notification().
>>
>> Both the CCIE and HPIE bits are masked in pcie_disable_notification(),
>> when we issue a hotplug command, pcie_wait_cmd() will polling the
>> Command Completed bit instead of waiting for an interrupt. But cmd_busy
>> bit was not cleared when Command Completed which results in timeouts
>> like this in pciehp_power_off_slot() and pcie_init_notification():
>>
>>    pcieport 0000:00:03.0: pciehp: Timeout on hotplug command 0x01c0
>> (issued 2264 msec ago)
>>    pcieport 0000:00:03.0: pciehp: Timeout on hotplug command 0x05c0
>> (issued 2288 msec ago)
>>
>> Signed-off-by: Liguang Zhang <zhangliguang@linux.alibaba.com>
> Fixes: a5dd4b4b0570 ("PCI: pciehp: Wait for hotplug command completion where necessary")
> Link: https://bugzilla.kernel.org/show_bug.cgi?id=215143
> Reviewed-by: Lukas Wunner <lukas@wunner.de>
> Cc: stable@vger.kernel.org # v4.19+
>
> Thanks a lot, that's a really good catch.
>
> It's a somewhat intricate bug, so I'll try to explain in my own words:
>
> If notification is disabled (HPIE or CCIE not set in the Slot Status
> register), we rely on pcie_poll_cmd() to poll for Command Completed.
> But once it's signaled, we neglect to clear ctrl->cmd_busy.
> (Normally it is cleared by the hardirq handler pciehp_isr() if
> notification is enabled.)
>
> The result is that starting with the second Slot Control write,
> pciehp will gratuitously wait for a command to finish which has
> already finished and it will incorrectly report a timeout.
>
> The bug was originally introduced in 2015 by commit a5dd4b4b0570
> ("PCI: pciehp: Wait for hotplug command completion where necessary"),
> but didn't manifest itself because the first Slot Control Write already
> enabled notification and from that point on the hardirq handler would
> clear ctrl->cmd_busy.  However I think the bug may have manifested
> itself with pciehp_poll_mode=1.
>
> It wasn't until commit 4e6a13356f1c ("PCI: pciehp: Deduplicate presence
> check on probe & resume") that multiple consecutive Slot Control writes
> were performed on ->probe with notification disabled, so that's the
> commit which first exposed the bug with pciehp_poll_mode=0.
>
> Thanks,
>
> Lukas
>
>> ---
>>   drivers/pci/hotplug/pciehp_hpc.c | 2 ++
>>   1 file changed, 2 insertions(+)
>>
>> diff --git a/drivers/pci/hotplug/pciehp_hpc.c b/drivers/pci/hotplug/pciehp_hpc.c
>> index 83a0fa119cae..8698aefc6041 100644
>> --- a/drivers/pci/hotplug/pciehp_hpc.c
>> +++ b/drivers/pci/hotplug/pciehp_hpc.c
>> @@ -98,6 +98,8 @@ static int pcie_poll_cmd(struct controller *ctrl, int timeout)
>>   		if (slot_status & PCI_EXP_SLTSTA_CC) {
>>   			pcie_capability_write_word(pdev, PCI_EXP_SLTSTA,
>>   						   PCI_EXP_SLTSTA_CC);
>> +			ctrl->cmd_busy = 0;
>> +			smp_mb();
>>   			return 1;
>>   		}
>>   		msleep(10);
>> -- 
>> 2.19.1.6.gb485710b
Lukas Wunner Feb. 3, 2022, 7:07 p.m. UTC | #9
Hi Bjorn,

the below patch is marked "Changes Requested" in patchwork:

https://patchwork.kernel.org/project/linux-pci/patch/20211111054258.7309-1-zhangliguang@linux.alibaba.com/

I think that might be erroneous because the patch is correct,
I've provided a Reviewed-by and no change requests are recorded
in patchwork or the mailing list archive.

If you've got a few minutes to spare, could you double-check the
state in patchwork and provide Liguang Zhang with the changes you'd
want (if any)?

Thanks!

Lukas

On Fri, Nov 26, 2021 at 06:33:09PM +0100, Lukas Wunner wrote:
> On Thu, Nov 11, 2021 at 01:42:58PM +0800, Liguang Zhang wrote:
> > This patch fixes this problem that on driver probe from system startup,
> > pciehp checks the Presence Detect State bit in the Slot Status register
> > to bring up an occupied slot or bring down an unoccupied slot. If empty
> > slot's power status is on, turn power off. The Hot-Plug interrupt isn't
> > requested yet, so avoid triggering a notification by calling
> > pcie_disable_notification().
> > 
> > Both the CCIE and HPIE bits are masked in pcie_disable_notification(),
> > when we issue a hotplug command, pcie_wait_cmd() will polling the
> > Command Completed bit instead of waiting for an interrupt. But cmd_busy
> > bit was not cleared when Command Completed which results in timeouts
> > like this in pciehp_power_off_slot() and pcie_init_notification():
> > 
> >   pcieport 0000:00:03.0: pciehp: Timeout on hotplug command 0x01c0
> > (issued 2264 msec ago)
> >   pcieport 0000:00:03.0: pciehp: Timeout on hotplug command 0x05c0
> > (issued 2288 msec ago)
> > 
> > Signed-off-by: Liguang Zhang <zhangliguang@linux.alibaba.com>
> 
> Fixes: a5dd4b4b0570 ("PCI: pciehp: Wait for hotplug command completion where necessary")
> Link: https://bugzilla.kernel.org/show_bug.cgi?id=215143
> Reviewed-by: Lukas Wunner <lukas@wunner.de>
> Cc: stable@vger.kernel.org # v4.2+
> 
> Thanks a lot, that's a really good catch.
> 
> It's a somewhat intricate bug, so I'll try to explain in my own words:
> 
> If notification is disabled (HPIE or CCIE not set in the Slot Status
> register), we rely on pcie_poll_cmd() to poll for Command Completed.
> But once it's signaled, we neglect to clear ctrl->cmd_busy.
> (Normally it is cleared by the hardirq handler pciehp_isr() if
> notification is enabled.)
> 
> The result is that starting with the second Slot Control write,
> pciehp will gratuitously wait for a command to finish which has
> already finished and it will incorrectly report a timeout.
> 
> The bug was originally introduced in 2015 by commit a5dd4b4b0570
> ("PCI: pciehp: Wait for hotplug command completion where necessary"),
> but didn't manifest itself because the first Slot Control Write already
> enabled notification and from that point on the hardirq handler would
> clear ctrl->cmd_busy.  However I think the bug may have manifested
> itself with pciehp_poll_mode=1.
> 
> It wasn't until commit 4e6a13356f1c ("PCI: pciehp: Deduplicate presence
> check on probe & resume") that multiple consecutive Slot Control writes
> were performed on ->probe with notification disabled, so that's the
> commit which first exposed the bug with pciehp_poll_mode=0.
> 
> Thanks,
> 
> Lukas
> 
> > ---
> >  drivers/pci/hotplug/pciehp_hpc.c | 2 ++
> >  1 file changed, 2 insertions(+)
> > 
> > diff --git a/drivers/pci/hotplug/pciehp_hpc.c b/drivers/pci/hotplug/pciehp_hpc.c
> > index 83a0fa119cae..8698aefc6041 100644
> > --- a/drivers/pci/hotplug/pciehp_hpc.c
> > +++ b/drivers/pci/hotplug/pciehp_hpc.c
> > @@ -98,6 +98,8 @@ static int pcie_poll_cmd(struct controller *ctrl, int timeout)
> >  		if (slot_status & PCI_EXP_SLTSTA_CC) {
> >  			pcie_capability_write_word(pdev, PCI_EXP_SLTSTA,
> >  						   PCI_EXP_SLTSTA_CC);
> > +			ctrl->cmd_busy = 0;
> > +			smp_mb();
> >  			return 1;
> >  		}
> >  		msleep(10);
> > -- 
> > 2.19.1.6.gb485710b
Bjorn Helgaas Feb. 3, 2022, 8:38 p.m. UTC | #10
On Thu, Feb 03, 2022 at 08:07:31PM +0100, Lukas Wunner wrote:
> Hi Bjorn,
> 
> the below patch is marked "Changes Requested" in patchwork:
> 
> https://patchwork.kernel.org/project/linux-pci/patch/20211111054258.7309-1-zhangliguang@linux.alibaba.com/
> 
> I think that might be erroneous because the patch is correct,
> I've provided a Reviewed-by and no change requests are recorded
> in patchwork or the mailing list archive.
> 
> If you've got a few minutes to spare, could you double-check the
> state in patchwork and provide Liguang Zhang with the changes you'd
> want (if any)?

Thanks for reminding me about this.  I don't remember why I marked it
"changes requested" but I often do that if there's been significant
discussion.  In this case you also provided additional information
(Fixes tag, bugzilla link, commit log elaboration) that should be
included.  Given that, I would typically wait for the author to repost
it and incorporate the additional information.

But I applied it to pci/hotplug with the commit log below.

commit 92912b175178 ("PCI: pciehp: Clear cmd_busy bit in polling mode")
Author: Liguang Zhang <zhangliguang@linux.alibaba.com>
Date:   Thu Nov 11 13:42:58 2021 +0800

    PCI: pciehp: Clear cmd_busy bit in polling mode
    
    Writes to a Downstream Port's Slot Control register are PCIe hotplug
    "commands."  If the Port supports Command Completed events, software must
    wait for a command to complete before writing to Slot Control again.
    
    pcie_do_write_cmd() sets ctrl->cmd_busy when it writes to Slot Control.  If
    software notification is enabled, i.e., PCI_EXP_SLTCTL_HPIE and
    PCI_EXP_SLTCTL_CCIE are set, ctrl->cmd_busy is cleared by pciehp_isr().
    
    But when software notification is disabled, as it is when pcie_init()
    powers off an empty slot, pcie_wait_cmd() uses pcie_poll_cmd() to poll for
    command completion, and it neglects to clear ctrl->cmd_busy, which leads to
    spurious timeouts:
    
      pcieport 0000:00:03.0: pciehp: Timeout on hotplug command 0x01c0 (issued 2264 msec ago)
      pcieport 0000:00:03.0: pciehp: Timeout on hotplug command 0x05c0 (issued 2288 msec ago)
    
    Clear ctrl->cmd_busy in pcie_poll_cmd() when it detects a Command Completed
    event (PCI_EXP_SLTSTA_CC).
    
    [bhelgaas: commit log]
    Fixes: a5dd4b4b0570 ("PCI: pciehp: Wait for hotplug command completion where necessary")
    Link: https://lore.kernel.org/r/20211111054258.7309-1-zhangliguang@linux.alibaba.com
    Link: https://bugzilla.kernel.org/show_bug.cgi?id=215143
    Link: https://lore.kernel.org/r/20211126173309.GA12255@wunner.de
    Signed-off-by: Liguang Zhang <zhangliguang@linux.alibaba.com>
    Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
    Reviewed-by: Lukas Wunner <lukas@wunner.de>
    Cc: stable@vger.kernel.org	# v4.19+
diff mbox series

Patch

diff --git a/drivers/pci/hotplug/pciehp_hpc.c b/drivers/pci/hotplug/pciehp_hpc.c
index 83a0fa119cae..8698aefc6041 100644
--- a/drivers/pci/hotplug/pciehp_hpc.c
+++ b/drivers/pci/hotplug/pciehp_hpc.c
@@ -98,6 +98,8 @@  static int pcie_poll_cmd(struct controller *ctrl, int timeout)
 		if (slot_status & PCI_EXP_SLTSTA_CC) {
 			pcie_capability_write_word(pdev, PCI_EXP_SLTSTA,
 						   PCI_EXP_SLTSTA_CC);
+			ctrl->cmd_busy = 0;
+			smp_mb();
 			return 1;
 		}
 		msleep(10);