From patchwork Mon Nov 20 19:11:50 2017 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: stuart hayes X-Patchwork-Id: 10067061 Return-Path: Received: from mail.wl.linuxfoundation.org (pdx-wl-mail.web.codeaurora.org [172.30.200.125]) by pdx-korg-patchwork.web.codeaurora.org (Postfix) with ESMTP id 975A46038F for ; Mon, 20 Nov 2017 19:12:08 +0000 (UTC) Received: from mail.wl.linuxfoundation.org (localhost [127.0.0.1]) by mail.wl.linuxfoundation.org (Postfix) with ESMTP id 89DCE29334 for ; Mon, 20 Nov 2017 19:12:08 +0000 (UTC) Received: by mail.wl.linuxfoundation.org (Postfix, from userid 486) id 7E84929343; Mon, 20 Nov 2017 19:12:08 +0000 (UTC) X-Spam-Checker-Version: SpamAssassin 3.3.1 (2010-03-16) on pdx-wl-mail.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-7.0 required=2.0 tests=BAYES_00,DKIM_SIGNED, DKIM_VALID, DKIM_VALID_AU, FREEMAIL_FROM, RCVD_IN_DNSWL_HI autolearn=unavailable version=3.3.1 Received: from vger.kernel.org (vger.kernel.org [209.132.180.67]) by mail.wl.linuxfoundation.org (Postfix) with ESMTP id 3834F29334 for ; Mon, 20 Nov 2017 19:12:08 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1752024AbdKTTLz (ORCPT ); Mon, 20 Nov 2017 14:11:55 -0500 Received: from mail-ot0-f193.google.com ([74.125.82.193]:34844 "EHLO mail-ot0-f193.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1751769AbdKTTLy (ORCPT ); Mon, 20 Nov 2017 14:11:54 -0500 Received: by mail-ot0-f193.google.com with SMTP id b17so8527830oth.2; Mon, 20 Nov 2017 11:11:54 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20161025; h=to:cc:from:subject:message-id:date:user-agent:mime-version :content-language:content-transfer-encoding; bh=7UGFb38H3hEFHNr4icRu2PwdLggz+6tHGuQq/f/fn7o=; b=Oq3SJkBrxihimIKwJ0ElzDSLDHvZYZfFAaQ3QVE6ZRGk+JLPzjff6SG7315E+Hevyn jX1v4lyFaLPYtpiYHikRf8X/wdBgls686IbvSYvjgnG1WECplG5wNuwxwXGK84WiiCI/ GSLhKuCsx8vofi8uEdqgxghwBp/2ZgSwdcta1gXWWd62UfT82Biv8EM3eZ9FXqGNXUxG 37BpBmDmqIhmpoCggpAHN1gWkryxb9U5+Hlj3KZwSjNk2NRo5qjlnd8V/ESlXALm7NnY 8Gw5yOdbLud+lb6xsLPPVay20qCE4vGTk//xbgrmgqILV0XtNgqXslAyIk56bjF9Te3b PdeQ== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20161025; h=x-gm-message-state:to:cc:from:subject:message-id:date:user-agent :mime-version:content-language:content-transfer-encoding; bh=7UGFb38H3hEFHNr4icRu2PwdLggz+6tHGuQq/f/fn7o=; b=iSf2ZXXbZSiCbsISCW6e5J7+idgEQvm2w8RCVbjdOFUfz56jjX2muWvShNoIAOgqo7 NZzRD+GDn4bY8FX+nGbml4LY5ZRn2Us46uJn4ecP0Sw7nRekdw/IRkZe5eBB5ql3A+GM fkTzDILjrzf/an00AD1S+Qnz1Xg+fIwtBL67Y9QPmnDLCVq1fiD+C0qLIpUjNEmXHN1b ZnP98Q2Ju8SVZb2WymJ7WOjny6gjJagjw5NbU00eUWDpiOj5hwCuPfXhdWZ0/s3CbE5+ rBLOmzDbok70emVlGlzStAM0NycVauoI6aT1Wq37wAZaYtKeD1InzY5RmoyLEoWfb0zO nHdQ== X-Gm-Message-State: AJaThX54orm4IM4PJR0vZoEXBvqtkLPpqxBWX8oXsXKzlv4OImoGp7xO 7Ou+qUbdoxIpCdjFvXOLaVQ= X-Google-Smtp-Source: AGs4zMZ4huzLmqxZSqQeXEYTTY0C+SEYF63lK8l58kIadwlU3dqzogD4Gcr+nRc1RLw/x2T+vzSSQg== X-Received: by 10.157.87.101 with SMTP id x34mr8439322oti.204.1511205113357; Mon, 20 Nov 2017 11:11:53 -0800 (PST) Received: from [192.168.0.2] (cpe-24-27-59-32.austin.res.rr.com. [24.27.59.32]) by smtp.gmail.com with ESMTPSA id x127sm4668793oix.8.2017.11.20.11.11.51 (version=TLS1_2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128/128); Mon, 20 Nov 2017 11:11:52 -0800 (PST) To: "James E.J. Bottomley" , "Martin K. Petersen" Cc: linux-scsi@vger.kernel.org, linux-kernel@vger.kernel.org, Pavel Tikhomirov , Joshua_Giles@Dell.com From: Stuart Hayes Subject: [PATCH] scsi_error: ensure EH wakes up on error to prevent host getting stuck Message-ID: Date: Mon, 20 Nov 2017 13:11:50 -0600 User-Agent: Mozilla/5.0 (Windows NT 10.0; WOW64; rv:52.0) Gecko/20100101 Thunderbird/52.4.0 MIME-Version: 1.0 Content-Language: en-US X-Antivirus: Avast (VPS 171120-2, 11/20/2017), Outbound message X-Antivirus-Status: Clean Sender: linux-scsi-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: linux-scsi@vger.kernel.org X-Virus-Scanned: ClamAV using ClamSMTP When a command is added to the host's error handler command queue, there is a chance that the error handler will not be woken up. This can happen when one CPU is running scsi_eh_scmd_add() at the same time as another CPU is running scsi_device_unbusy() for a different command on the same host. Each function changes one value, and then looks at the value of a variable that the other function has just changed, but if they both see stale data, neither will actually wake up the error handler. In scsi_eh_scmd_add, host_failed is incremented, then scsi_eh_wakeup() is called, which sees that host_busy is still 2, so it doesn't actually wake up the handler. Meanwhile, in scsi_device_unbusy(), host_busy is decremented, and then it sees that host_failed is 0, so it doesn't even call scsi_eh_wakeup(). Signed-off-by: Stuart Hyaes --- --- This email has been checked for viruses by Avast antivirus software. https://www.avast.com/antivirus diff -pur linux-4.14/drivers/scsi/scsi_error.c linux-4.14-stu/drivers/scsi/scsi_error.c --- linux-4.14/drivers/scsi/scsi_error.c 2017-11-12 12:46:13.000000000 -0600 +++ linux-4.14-stu/drivers/scsi/scsi_error.c 2017-11-17 14:22:19.230867923 -0600 @@ -243,6 +243,10 @@ void scsi_eh_scmd_add(struct scsi_cmnd * scsi_eh_reset(scmd); list_add_tail(&scmd->eh_entry, &shost->eh_cmd_q); shost->host_failed++; + /* + * See scsi_device_unbusy() for explanation of smp_mb(). + */ + smp_mb(); scsi_eh_wakeup(shost); spin_unlock_irqrestore(shost->host_lock, flags); } diff -pur linux-4.14/drivers/scsi/scsi_lib.c linux-4.14-stu/drivers/scsi/scsi_lib.c --- linux-4.14/drivers/scsi/scsi_lib.c 2017-11-12 12:46:13.000000000 -0600 +++ linux-4.14-stu/drivers/scsi/scsi_lib.c 2017-11-17 14:22:15.814867833 -0600 @@ -325,6 +325,15 @@ void scsi_device_unbusy(struct scsi_devi unsigned long flags; atomic_dec(&shost->host_busy); + + /* This function changes host_busy and looks at host_failed, while + * scsi_eh_scmd_add() updates host_failed and looks at host_busy (in + * scsi_eh_wakeup())... if these happen simultaneously without the smp + * memory barrier, each can see the old value, such that neither will + * wake up the error handler, which can cause the host controller to + * be hung forever. + */ + smp_mb(); if (starget->can_queue > 0) atomic_dec(&starget->target_busy);