From patchwork Fri Dec 1 00:19:11 2023
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: Thinh Tran
X-Patchwork-Id: 13475216
X-Patchwork-Delegate: kuba@kernel.org
From: Thinh Tran
To: michael.chan@broadcom.com
Cc: davem@davemloft.net, drc@linux.vnet.ibm.com, edumazet@google.com,
 kuba@kernel.org, mchan@broadcom.com, netdev@vger.kernel.org,
 pabeni@redhat.com, pavan.chebbi@broadcom.com, prashant@broadcom.com,
 siva.kallam@broadcom.com, thinhtr@linux.vnet.ibm.com, Venkata Sai Duggi
Subject: [PATCH v4] net/tg3: fix race condition in tg3_reset_task()
Date: Thu, 30 Nov 2023 18:19:11 -0600
Message-Id: <20231201001911.656-1-thinhtr@linux.vnet.ibm.com>
X-Mailer: git-send-email 2.25.1
In-Reply-To: <20231116151822.281-1-thinhtr@linux.vnet.ibm.com>
References: <20231116151822.281-1-thinhtr@linux.vnet.ibm.com>
Precedence: bulk
X-Mailing-List: netdev@vger.kernel.org

When an EEH error is encountered by a PCI adapter, the EEH driver
modifies the PCI channel's state as shown below:

   enum {
      /* I/O channel is in normal state */
      pci_channel_io_normal = (__force pci_channel_state_t) 1,

      /* I/O to channel is blocked */
      pci_channel_io_frozen = (__force pci_channel_state_t) 2,

      /* PCI card is dead */
      pci_channel_io_perm_failure = (__force pci_channel_state_t) 3,
   };

If the same EEH error then causes the tg3 driver's transmit timeout
logic to execute, the tg3_tx_timeout() function schedules a reset
task via tg3_reset_task_schedule(), which may cause a race condition
between the tg3 and EEH driver as both attempt to recover the HW via
a reset action.

EEH driver gets error event
--> eeh_set_channel_state()
    and set device to one of
    error state above           scheduler: tg3_reset_task() get
                                returned error from tg3_init_hw()
                             --> dev_close() shuts down the interface
tg3_io_slot_reset() and
tg3_io_resume() fail to
reset/resume the device

To resolve this issue, we avoid the race condition by checking the PCI
channel state in the tg3_reset_task() function and skip the tg3 driver
initiated reset when the PCI channel is not in the normal state. (The
driver has no access to tg3 device registers at this point and cannot
even complete the reset task successfully without external assistance.)
We'll leave the reset procedure to be managed by the EEH driver which
calls the tg3_io_error_detected(), tg3_io_slot_reset() and
tg3_io_resume() functions as appropriate.

Add the same check to tg3_dump_state() to avoid dumping all
device registers when the PCI channel is not in the normal state.

Signed-off-by: Thinh Tran
Tested-by: Venkata Sai Duggi
Reviewed-by: David Christensen

v4: moving the PCI error checking to tg3_reset_task() and
    tg3_dump_state()
v3: re-post the patch.
v2: checking PCI errors in tg3_tx_timeout()

Reviewed-by: Michael Chan
---
 drivers/net/ethernet/broadcom/tg3.c | 11 ++++++++++-
 1 file changed, 10 insertions(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/broadcom/tg3.c b/drivers/net/ethernet/broadcom/tg3.c
index 48b6191efa56..49f299c868a1 100644
--- a/drivers/net/ethernet/broadcom/tg3.c
+++ b/drivers/net/ethernet/broadcom/tg3.c
@@ -6474,6 +6474,14 @@ static void tg3_dump_state(struct tg3 *tp)
 	int i;
 	u32 *regs;
 
+	/* If it is a PCI error, all registers will be 0xffff,
+	 * we don't dump them out, just report the error and return
+	 */
+	if (tp->pdev->error_state != pci_channel_io_normal) {
+		netdev_err(tp->dev, "PCI channel ERROR!\n");
+		return;
+	}
+
 	regs = kzalloc(TG3_REG_BLK_SIZE, GFP_ATOMIC);
 	if (!regs)
 		return;
@@ -11259,7 +11267,8 @@ static void tg3_reset_task(struct work_struct *work)
 	rtnl_lock();
 	tg3_full_lock(tp, 0);
 
-	if (tp->pcierr_recovery || !netif_running(tp->dev)) {
+	if (tp->pcierr_recovery || !netif_running(tp->dev) ||
+	    tp->pdev->error_state != pci_channel_io_normal) {
 		tg3_flag_clear(tp, RESET_TASK_PENDING);
 		tg3_full_unlock(tp);
 		rtnl_unlock();
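
For readers who want to reason about the new guard outside the kernel
tree, the condition added to tg3_reset_task() can be modelled in plain
userspace C. This is only an illustrative sketch, not driver code: the
enum channel_state, struct nic_model and reset_task_should_skip() names
are hypothetical stand-ins for pci_channel_state_t, the relevant fields
of struct tg3 / struct pci_dev, and the inline condition in the hunk
above.

```c
#include <stdbool.h>

/* Userspace stand-ins (assumed names) for the pci_channel_io_* values. */
enum channel_state {
	CHANNEL_IO_NORMAL = 1,		/* I/O channel is in normal state */
	CHANNEL_IO_FROZEN = 2,		/* I/O to channel is blocked */
	CHANNEL_IO_PERM_FAILURE = 3,	/* PCI card is dead */
};

/* Hypothetical, trimmed-down view of the state the guard reads. */
struct nic_model {
	bool pcierr_recovery;		/* models tp->pcierr_recovery */
	bool running;			/* models netif_running(tp->dev) */
	enum channel_state error_state;	/* models tp->pdev->error_state */
};

/*
 * Mirrors the patched condition: skip the driver-initiated reset when
 * PCI error recovery is already in progress, the interface is down, or
 * the PCI channel is in any state other than normal.
 */
static bool reset_task_should_skip(const struct nic_model *nic)
{
	return nic->pcierr_recovery || !nic->running ||
	       nic->error_state != CHANNEL_IO_NORMAL;
}
```

With this model, a running interface on a frozen channel is skipped
even though pcierr_recovery has not been set yet, which is the window
the commit message describes.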