From patchwork Mon Mar 25 09:05:26 2019 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Zheng Bin X-Patchwork-Id: 10868377 Return-Path: Received: from mail.wl.linuxfoundation.org (pdx-wl-mail.web.codeaurora.org [172.30.200.125]) by pdx-korg-patchwork-2.web.codeaurora.org (Postfix) with ESMTP id 1EAD1139A for ; Mon, 25 Mar 2019 09:01:13 +0000 (UTC) Received: from mail.wl.linuxfoundation.org (localhost [127.0.0.1]) by mail.wl.linuxfoundation.org (Postfix) with ESMTP id 0A0EA292C5 for ; Mon, 25 Mar 2019 09:01:13 +0000 (UTC) Received: by mail.wl.linuxfoundation.org (Postfix, from userid 486) id EBC6128D9A; Mon, 25 Mar 2019 09:01:12 +0000 (UTC) X-Spam-Checker-Version: SpamAssassin 3.3.1 (2010-03-16) on pdx-wl-mail.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-7.9 required=2.0 tests=BAYES_00,MAILING_LIST_MULTI, RCVD_IN_DNSWL_HI autolearn=ham version=3.3.1 Received: from vger.kernel.org (vger.kernel.org [209.132.180.67]) by mail.wl.linuxfoundation.org (Postfix) with ESMTP id 45638292BF for ; Mon, 25 Mar 2019 09:01:12 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1730006AbfCYJBL (ORCPT ); Mon, 25 Mar 2019 05:01:11 -0400 Received: from szxga05-in.huawei.com ([45.249.212.191]:5747 "EHLO huawei.com" rhost-flags-OK-OK-OK-FAIL) by vger.kernel.org with ESMTP id S1729987AbfCYJBL (ORCPT ); Mon, 25 Mar 2019 05:01:11 -0400 Received: from DGGEMS406-HUB.china.huawei.com (unknown [172.30.72.59]) by Forcepoint Email with ESMTP id 5E2E8C65B9C64F62AB30; Mon, 25 Mar 2019 17:01:09 +0800 (CST) Received: from huawei.com (10.90.53.225) by DGGEMS406-HUB.china.huawei.com (10.3.19.206) with Microsoft SMTP Server id 14.3.408.0; Mon, 25 Mar 2019 17:01:02 +0800 From: zhengbin To: , , , , CC: , Subject: [PATCH] scsi: fix ata_port_wait_eh() hung caused by missing to wake up eh thread Date: Mon, 25 Mar 2019 17:05:26 +0800 Message-ID: <1553504726-40837-1-git-send-email-zhengbin13@huawei.com> X-Mailer: git-send-email 2.7.4 MIME-Version: 1.0 X-Originating-IP: [10.90.53.225] X-CFilter-Loop: Reflected Sender: linux-scsi-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: linux-scsi@vger.kernel.org X-Virus-Scanned: ClamAV using ClamSMTP When I use fio test kernel in the following steps: 1.The sas controller mixes SAS/SATA disks 2.Use fio test all disks 3.Simultaneous enable/disable/link_reset/hard_reset PHY it will hung in ata_port_wait_eh Call trace: __switch_to+0xb4/0x1b8 __schedule+0x1e8/0x718 schedule+0x38/0x90 ata_port_wait_eh+0x70/0xf8 sas_ata_wait_eh+0x24/0x30 [libsas] transport_sas_phy_reset.isra.3+0x128/0x160 [libsas] phy_reset_work+0x20/0x30 [libsas] process_one_work+0x1e4/0x460 worker_thread+0x40/0x450 kthread+0x12c/0x130 ret_from_fork+0x10/0x18 The key code process is like this: scsi_dec_host_busy atomic_dec(&shost->host_busy); if (unlikely(scsi_host_in_recovery(shost))) { spin_lock_irqsave(shost->host_lock, flags); ... scsi_eh_wakeup(shost) ... } scsi_schedule_eh spin_lock_irqsave(shost->host_lock, flags); if (scsi_host_set_state(shost, SHOST_RECOVERY) == 0 || scsi_host_set_state(shost, SHOST_CANCEL_RECOVERY) == 0) { ... scsi_eh_wakeup(shost); } scsi_eh_wakeup if (scsi_host_busy(shost) == shost->host_failed) wake_up_process(shost->ehandler); In scsi_dec_host_busy, host_busy & shost_state not in spinlock. Neither function wakes up the SCSI error handler in the following timing: CPU 0(call scsi_dec_host_busy) CPU 1(call scsi_schedule_eh) LOAD shost_state(!=recovery) scsi_host_set_state(SHOST_RECOVERY) scsi_eh_wakeup(host_busy != host_failed) atomic_dec(&shost->host_busy); if (scsi_host_in_recovery(shost)) Add a smp_mb between host_busy and shost_state. Signed-off-by: zhengbin --- drivers/scsi/scsi_error.c | 7 +++++++ drivers/scsi/scsi_lib.c | 5 +++++ 2 files changed, 12 insertions(+) -- 2.7.4 diff --git a/drivers/scsi/scsi_error.c b/drivers/scsi/scsi_error.c index 1b8378f..c605a71 100644 --- a/drivers/scsi/scsi_error.c +++ b/drivers/scsi/scsi_error.c @@ -88,6 +88,13 @@ void scsi_schedule_eh(struct Scsi_Host *shost) if (scsi_host_set_state(shost, SHOST_RECOVERY) == 0 || scsi_host_set_state(shost, SHOST_CANCEL_RECOVERY) == 0) { + /* + * We have to order shost_state store above and test of + * the host_busy(scsi_eh_wakeup will test it), because + * scsi_dec_host_busy accesses these variables without + * host_lock. + */ + smp_mb(); shost->host_eh_scheduled++; scsi_eh_wakeup(shost); } diff --git a/drivers/scsi/scsi_lib.c b/drivers/scsi/scsi_lib.c index 601b9f1..9094d20 100644 --- a/drivers/scsi/scsi_lib.c +++ b/drivers/scsi/scsi_lib.c @@ -337,6 +337,11 @@ static void scsi_dec_host_busy(struct Scsi_Host *shost) rcu_read_lock(); atomic_dec(&shost->host_busy); + /* + * We have to order host_busy dec above and test of the shost_state + * below outside the host_lock. + */ + smp_mb(); if (unlikely(scsi_host_in_recovery(shost))) { spin_lock_irqsave(shost->host_lock, flags); if (shost->host_failed || shost->host_eh_scheduled)