[v3] net/tg3: fix race condition in tg3_reset_task()

Message ID	20231116151822.281-1-thinhtr@linux.vnet.ibm.com (mailing list archive)
State	Changes Requested
Delegated to:	Netdev Maintainers
Headers	Authentication-Results: smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=ibm.com header.i=@ibm.com header.b="tj85W4tW"
	From: Thinh Tran <thinhtr@linux.vnet.ibm.com>
	To: mchan@broadcom.com
	Cc: pavan.chebbi@broadcom.com, netdev@vger.kernel.org, prashant@broadcom.com, siva.kallam@broadcom.com, drc@linux.vnet.ibm.com, venkata.sai.duggi@ibm.com, thinhtr@linux.vnet.ibm.com, edumazet@google.com, kuba@kernel.org, pabeni@redhat.com, davem@davemloft.net
	Subject: [PATCH v3] net/tg3: fix race condition in tg3_reset_task()
	Date: Thu, 16 Nov 2023 09:18:22 -0600
	Message-Id: <20231116151822.281-1-thinhtr@linux.vnet.ibm.com>
	In-Reply-To: <20231102161219.220-1-thinhtr@linux.vnet.ibm.com>
	References: <20231102161219.220-1-thinhtr@linux.vnet.ibm.com>
	Precedence: bulk
	MIME-Version: 1.0
	Content-Transfer-Encoding: 8bit
Series	[v3] net/tg3: fix race condition in tg3_reset_task()

Context	Check	Description
netdev/series_format	warning	Single patches do not need cover letters; Target tree name not specified in the subject
netdev/tree_selection	success	Guessed tree name to be net-next
netdev/fixes_present	success	Fixes tag not required for -next series
netdev/header_inline	success	No static functions without inline keyword in header files
netdev/build_32bit	success	Errors and warnings before: 1134 this patch: 1134
netdev/cc_maintainers	success	CCed 6 of 6 maintainers
netdev/build_clang	success	Errors and warnings before: 1154 this patch: 1154
netdev/verify_signedoff	success	Signed-off-by tag matches author and committer
netdev/deprecated_api	success	None detected
netdev/check_selftest	success	No net selftest shell script
netdev/verify_fixes	success	No Fixes tag
netdev/build_allmodconfig_warn	success	Errors and warnings before: 1165 this patch: 1165
netdev/checkpatch	warning	WARNING: line length of 84 exceeds 80 columns
netdev/build_clang_rust	success	No Rust files in patch. Skipping build
netdev/kdoc	success	Errors and warnings before: 0 this patch: 0
netdev/source_inline	success	Was 0 now: 0

Thinh Tran Nov. 16, 2023, 3:18 p.m. UTC

When an EEH error is encountered by a PCI adapter, the EEH driver
modifies the PCI channel's state as shown below:

   enum {
      /* I/O channel is in normal state */
      pci_channel_io_normal = (__force pci_channel_state_t) 1,

      /* I/O to channel is blocked */
      pci_channel_io_frozen = (__force pci_channel_state_t) 2,

      /* PCI card is dead */
      pci_channel_io_perm_failure = (__force pci_channel_state_t) 3,
   };

If the same EEH error then causes the tg3 driver's transmit timeout
logic to execute, the tg3_tx_timeout() function schedules a reset
task via tg3_reset_task_schedule(), which may cause a race condition
between the tg3 and EEH drivers as both attempt to recover the HW via
a reset action.

EEH driver gets error event
--> eeh_set_channel_state()
    sets the device to one of
    the error states above        scheduler: tg3_reset_task() gets an
                                  error returned from tg3_init_hw()
                              --> dev_close() shuts down the interface

tg3_io_slot_reset() and
tg3_io_resume() fail to
reset/resume the device


To resolve this issue, we avoid the race condition by checking the PCI
channel state in the tg3_tx_timeout() function and skipping the tg3
driver initiated reset when the PCI channel is not in the normal state.
(The driver has no access to tg3 device registers at this point and
cannot even complete the reset task successfully without external
assistance.) We leave the reset procedure to be managed by the EEH
driver, which calls tg3_io_error_detected(), tg3_io_slot_reset() and
tg3_io_resume() as appropriate.



Signed-off-by: Thinh Tran <thinhtr@linux.vnet.ibm.com>
Tested-by: Venkata Sai Duggi <venkata.sai.duggi@ibm.com>
Reviewed-by: David Christensen <drc@linux.vnet.ibm.com>

  v3: re-post the patch.
  v2: checking PCI errors in tg3_tx_timeout()

---
 drivers/net/ethernet/broadcom/tg3.c | 20 ++++++++++++++++++++
 1 file changed, 20 insertions(+)
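
The check described in the commit message above can be sketched in user space as follows. This is only an illustration, not the patch itself: every identifier prefixed with mock_ or MOCK_ is a hypothetical stand-in for the kernel types (struct pci_dev, struct tg3) and the function returns a flag where the real tg3_tx_timeout() would call tg3_reset_task_schedule().

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical user-space mirror of the three channel states quoted above. */
enum mock_channel_state {
	MOCK_IO_NORMAL = 1,        /* mirrors pci_channel_io_normal */
	MOCK_IO_FROZEN = 2,        /* mirrors pci_channel_io_frozen */
	MOCK_IO_PERM_FAILURE = 3,  /* mirrors pci_channel_io_perm_failure */
};

/* Hypothetical stand-in for struct pci_dev; only the field used here. */
struct mock_pci_dev {
	enum mock_channel_state error_state;
};

/* Hypothetical stand-in for struct tg3. */
struct mock_tg3 {
	struct mock_pci_dev *pdev;
	int resets_scheduled;  /* counts resets the driver would schedule */
};

/*
 * Sketch of the tx-timeout path: skip the driver-initiated reset whenever
 * the channel is frozen or permanently failed, and leave recovery to the
 * error-handling layer. Returns true if a reset was scheduled.
 */
static bool mock_tx_timeout(struct mock_tg3 *tp)
{
	assert(tp != NULL && tp->pdev != NULL);

	switch (tp->pdev->error_state) {
	case MOCK_IO_FROZEN:
		fprintf(stderr, "%s: I/O to channel is blocked\n", __func__);
		return false;
	case MOCK_IO_PERM_FAILURE:
		fprintf(stderr, "%s: adapter has failed permanently\n", __func__);
		return false;
	default:
		break;
	}

	tp->resets_scheduled++;
	return true;
}
```

With error_state set to MOCK_IO_NORMAL the call schedules a reset; with either error state it logs and returns without touching the counter.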

Michael Chan Nov. 16, 2023, 9:34 p.m. UTC | #1

On Thu, Nov 16, 2023 at 7:19 AM Thinh Tran <thinhtr@linux.vnet.ibm.com> wrote:
> diff --git a/drivers/net/ethernet/broadcom/tg3.c b/drivers/net/ethernet/broadcom/tg3.c
> index 14b311196b8f..1c72ef05ab1b 100644
> --- a/drivers/net/ethernet/broadcom/tg3.c
> +++ b/drivers/net/ethernet/broadcom/tg3.c
> @@ -7630,6 +7630,26 @@ static void tg3_tx_timeout(struct net_device *dev, unsigned int txqueue)
>  {
>         struct tg3 *tp = netdev_priv(dev);
>
> +       /* checking the PCI channel state for hard errors
> +        * for pci_channel_io_frozen case
> +        *   - I/O to channel is blocked.
> +        *     The EEH layer and I/O error detections will
> +        *     handle the reset procedure
> +        * for pci_channel_io_perm_failure  case
> +        *   - the PCI card is dead.
> +        *     The reset will not help
> +        * report the error for both cases and return.
> +        */
> +       if (tp->pdev->error_state == pci_channel_io_frozen) {
> +               netdev_err(dev, " %s, I/O to channel is blocked\n", __func__);
> +               return;
> +       }
> +
> +       if (tp->pdev->error_state == pci_channel_io_perm_failure) {
> +               netdev_err(dev, " %s, adapter has failed permanently!\n", __func__);
> +               return;
> +       }
> +

I think it will be better to add these 2 checks in tg3_reset_task().
We are already doing a similar check there for tp->pcierr_recovery so
it is better to centralize all the related checks in the same place.
I don't think tg3_dump_state() below will cause any problems.  We'll
probably read 0xffffffff for all registers and it will actually
confirm that the registers are not readable.

>         if (netif_msg_tx_err(tp)) {
>                 netdev_err(dev, "transmit timed out, resetting\n");
>                 tg3_dump_state(tp);
> --
> 2.25.1
>

Thinh Tran Nov. 17, 2023, 4:19 p.m. UTC | #2

On 11/16/2023 3:34 PM, Michael Chan wrote:
> 
> I think it will be better to add these 2 checks in tg3_reset_task().
> We are already doing a similar check there for tp->pcierr_recovery so
> it is better to centralize all the related checks in the same place.
> I don't think tg3_dump_state() below will cause any problems.  We'll
> probably read 0xffffffff for all registers and it will actually
> confirm that the registers are not readable.
> 

I'm concerned that race conditions could still occur during the handling
of Partitionable Endpoint (PE) reset by the EEH driver. The issue lies
in the dependency on the lower-level FW reset procedure. When the driver
executes tg3_dump_state(), and then follows it with tg3_reset_task(),
the EEH driver possibly changes the PCI devices' state, but the
MMIO and DMA reset procedures may not have completed yet, leading to a
crash in tg3_reset_task().  This patch tries to prevent that scenario.

While tg3_dump_state() is helpful, it also poses issues as it takes a 
considerable amount of time, approximately 1 or 2 seconds per device. 
Given our 4-port adapter, this could extend to more than 10 seconds to 
write to the dmesg buffer. With the default size, the dmesg buffer may 
be over-written and not retain all useful information.

Thanks,
Thinh Tran

Michael Chan Nov. 17, 2023, 6:31 p.m. UTC | #3

On Fri, Nov 17, 2023 at 8:19 AM Thinh Tran <thinhtr@linux.vnet.ibm.com> wrote:
>
>
> On 11/16/2023 3:34 PM, Michael Chan wrote:
> >
> > I think it will be better to add these 2 checks in tg3_reset_task().
> > We are already doing a similar check there for tp->pcierr_recovery so
> > it is better to centralize all the related checks in the same place.
> > I don't think tg3_dump_state() below will cause any problems.  We'll
> > probably read 0xffffffff for all registers and it will actually
> > confirm that the registers are not readable.
> >
>
> I'm concerned that race conditions could still occur during the handling
> of Partitionable Endpoint (PE) reset by the EEH driver. The issue lies
> in the dependency on the lower-level FW reset procedure. When the driver
> executes tg3_dump_state(), and then follows it with tg3_reset_task(),
> the EEH driver possibly changes the PCI devices' state, but the
> MMIO and DMA reset procedures may not have completed yet, leading to a
> crash in tg3_reset_task().  This patch tries to prevent that scenario.

It seems fragile if you are relying on such timing.  TG3_TX_TIMEOUT is
5 seconds but the actual time tg3_tx_timeout() is called varies
depending on when the TX queue is stopped.  So tg3_tx_timeout() will
be called 5 seconds or more after EEH if there are uncompleted TX
packets but we don't know precisely when.

>
> While tg3_dump_state() is helpful, it also poses issues as it takes a
> considerable amount of time, approximately 1 or 2 seconds per device.
> Given our 4-port adapter, this could extend to more than 10 seconds to
> write to the dmesg buffer. With the default size, the dmesg buffer may
> be over-written and not retain all useful information.
>

If tg3_dump_state() is not useful and fills the dmesg log with useless
data, we can add the same check in tg3_dump_state() and skip it.
tg3_dump_state() is also called by tg3_process_error() so we can avoid
dumping useless data if we hit tg3_process_error() during EEH or AER.

Thanks.
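
The idea above of guarding the state dump with the same channel check can be sketched in user space as below. This is a hypothetical illustration: struct mock_adapter, MOCK_NUM_REGS and the mock_ functions are not from the driver, and register reads from an isolated device are modeled as all-ones only because the thread expects 0xffffffff in that case.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

#define MOCK_NUM_REGS 4  /* arbitrary size for the illustration */

/* Hypothetical mirror of the PCI channel states. */
enum mock_channel_state {
	MOCK_IO_NORMAL = 1,
	MOCK_IO_FROZEN = 2,
	MOCK_IO_PERM_FAILURE = 3,
};

/* Hypothetical stand-in for the adapter and its register file. */
struct mock_adapter {
	enum mock_channel_state error_state;
	uint32_t regs[MOCK_NUM_REGS];
};

/* A read through a blocked channel is modeled as returning all-ones. */
static uint32_t mock_read_reg(const struct mock_adapter *ad, unsigned int idx)
{
	assert(ad != NULL && idx < MOCK_NUM_REGS);
	if (ad->error_state != MOCK_IO_NORMAL)
		return UINT32_MAX;
	return ad->regs[idx];
}

/*
 * Sketch of a guarded dump: print one short line and return when the
 * channel is not normal, instead of filling the log with 0xffffffff.
 * Returns the number of registers printed.
 */
static unsigned int mock_dump_state(const struct mock_adapter *ad)
{
	unsigned int i;

	assert(ad != NULL);
	if (ad->error_state != MOCK_IO_NORMAL) {
		fprintf(stderr, "PCI channel not in normal state, skipping dump\n");
		return 0;
	}

	for (i = 0; i < MOCK_NUM_REGS; i++)
		fprintf(stderr, "reg[%u] = 0x%08x\n", i,
			(unsigned int)mock_read_reg(ad, i));
	return MOCK_NUM_REGS;
}
```

Because the guard sits inside the dump routine, every caller benefits, including an error-processing path that dumps state outside the tx-timeout handler.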

Thinh Tran Nov. 30, 2023, 10:29 p.m. UTC | #4

On 11/17/2023 12:31 PM, Michael Chan wrote:
> On Fri, Nov 17, 2023 at 8:19 AM Thinh Tran <thinhtr@linux.vnet.ibm.com> wrote:
>>
>>
>> On 11/16/2023 3:34 PM, Michael Chan wrote:
>>>
>>> I think it will be better to add these 2 checks in tg3_reset_task().
>>> We are already doing a similar check there for tp->pcierr_recovery so
>>> it is better to centralize all the related checks in the same place.
>>> I don't think tg3_dump_state() below will cause any problems.  We'll
>>> probably read 0xffffffff for all registers and it will actually
>>> confirm that the registers are not readable.
>>>
>>
>> I'm concerned that race conditions could still occur during the handling
>> of Partitionable Endpoint (PE) reset by the EEH driver. The issue lies
>> in the dependency on the lower-level FW reset procedure. When the driver
>> executes tg3_dump_state(), and then follows it with tg3_reset_task(),
>> the EEH driver possibly changes the PCI devices' state, but the
>> MMIO and DMA reset procedures may not have completed yet, leading to a
>> crash in tg3_reset_task().  This patch tries to prevent that scenario.
> 
> It seems fragile if you are relying on such timing.  TG3_TX_TIMEOUT is
> 5 seconds but the actual time tg3_tx_timeout() is called varies
> depending on when the TX queue is stopped.  So tg3_tx_timeout() will
> be called 5 seconds or more after EEH if there are uncompleted TX
> packets but we don't know precisely when.
> 
>>
>> While tg3_dump_state() is helpful, it also poses issues as it takes a
>> considerable amount of time, approximately 1 or 2 seconds per device.
>> Given our 4-port adapter, this could extend to more than 10 seconds to
>> write to the dmesg buffer. With the default size, the dmesg buffer may
>> be over-written and not retain all useful information.
>>
> 
> If tg3_dump_state() is not useful and fills the dmesg log with useless
> data, we can add the same check in tg3_dump_state() and skip it.
> tg3_dump_state() is also called by tg3_process_error() so we can avoid
> dumping useless data if we hit tg3_process_error() during EEH or AER.
> 
> Thanks.

I implemented the fix as you suggested, and it passed the tests.
I will send the next version of the patch shortly.

Thanks.
