From patchwork Tue May 10 15:08:18 2016 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: James Bottomley X-Patchwork-Id: 9058741 Return-Path: X-Original-To: patchwork-linux-scsi@patchwork.kernel.org Delivered-To: patchwork-parsemail@patchwork2.web.kernel.org Received: from mail.kernel.org (mail.kernel.org [198.145.29.136]) by patchwork2.web.kernel.org (Postfix) with ESMTP id F2B99BF440 for ; Tue, 10 May 2016 15:08:35 +0000 (UTC) Received: from mail.kernel.org (localhost [127.0.0.1]) by mail.kernel.org (Postfix) with ESMTP id D24B2200E8 for ; Tue, 10 May 2016 15:08:34 +0000 (UTC) Received: from vger.kernel.org (vger.kernel.org [209.132.180.67]) by mail.kernel.org (Postfix) with ESMTP id 23CDE20145 for ; Tue, 10 May 2016 15:08:33 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1751616AbcEJPIc (ORCPT ); Tue, 10 May 2016 11:08:32 -0400 Received: from e37.co.us.ibm.com ([32.97.110.158]:49550 "EHLO e37.co.us.ibm.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1750954AbcEJPIb (ORCPT ); Tue, 10 May 2016 11:08:31 -0400 Received: from localhost by e37.co.us.ibm.com with IBM ESMTP SMTP Gateway: Authorized Use Only! Violators will be prosecuted for from ; Tue, 10 May 2016 09:08:28 -0600 Received: from d03dlp01.boulder.ibm.com (9.17.202.177) by e37.co.us.ibm.com (192.168.1.137) with IBM ESMTP SMTP Gateway: Authorized Use Only! Violators will be prosecuted; Tue, 10 May 2016 09:08:24 -0600 X-IBM-Helo: d03dlp01.boulder.ibm.com X-IBM-MailFrom: jejb@linux.vnet.ibm.com X-IBM-RcptTo: s.parschauer@gmx.de; hch@infradead.org; martin.petersen@oracle.com; jinpu.wang@profitbricks.com; bart.vanassche@sandisk.com; hare@suse.de; linux-scsi@vger.kernel.org Received: from b03cxnp08026.gho.boulder.ibm.com (b03cxnp08026.gho.boulder.ibm.com [9.17.130.18]) by d03dlp01.boulder.ibm.com (Postfix) with ESMTP id 02B711FF001E; Tue, 10 May 2016 09:08:09 -0600 (MDT) Received: from b03ledav004.gho.boulder.ibm.com (b03ledav004.gho.boulder.ibm.com [9.17.130.235]) by b03cxnp08026.gho.boulder.ibm.com (8.14.9/8.14.9/NCO v10.0) with ESMTP id u4AF8N5n37028084; Tue, 10 May 2016 08:08:23 -0700 Received: from b03ledav004.gho.boulder.ibm.com (unknown [127.0.0.1]) by IMSVA (Postfix) with ESMTP id CD7A378056; Tue, 10 May 2016 09:08:23 -0600 (MDT) Received: from [153.66.254.194] (unknown [9.80.196.60]) by b03ledav004.gho.boulder.ibm.com (Postfix) with ESMTPS id 9A71C78041; Tue, 10 May 2016 09:08:20 -0600 (MDT) Message-ID: <1462892898.2320.18.camel@linux.vnet.ibm.com> Subject: Re: [PATCHv2]sd: Don't treat succeeded SYNC as error From: James Bottomley To: Jinpu Wang Cc: Hannes Reinecke , "Martin K. Petersen" , Christoph Hellwig , linux-scsi@vger.kernel.org, Sebastian Parschauer , Bart Van Assche Date: Tue, 10 May 2016 08:08:18 -0700 In-Reply-To: References: <5727266E.7040905@suse.de> <1462196672.2665.5.camel@linux.vnet.ibm.com> X-Mailer: Evolution 3.16.5 Mime-Version: 1.0 X-TM-AS-GCONF: 00 X-Content-Scanned: Fidelis XPS MAILER x-cbid: 16051015-0025-0000-0000-00003D4B23AA X-IBM-AV-DETECTION: SAVI=unused REMOTE=unused XFE=unused Sender: linux-scsi-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: linux-scsi@vger.kernel.org X-Spam-Status: No, score=-9.0 required=5.0 tests=BAYES_00, RCVD_IN_DNSWL_HI, RP_MATCHES_RCVD, UNPARSEABLE_RELAY autolearn=unavailable version=3.3.1 X-Spam-Checker-Version: SpamAssassin 3.3.1 (2010-03-16) on mail.kernel.org X-Virus-Scanned: ClamAV using ClamSMTP On Tue, 2016-05-10 at 16:48 +0200, Jinpu Wang wrote: > On Mon, May 9, 2016 at 6:41 PM, Jinpu Wang < > jinpu.wang@profitbricks.com> wrote: > > On Wed, May 4, 2016 at 7:02 PM, Jinpu Wang < > > jinpu.wang@profitbricks.com> wrote: > > > On Mon, May 2, 2016 at 3:44 PM, James Bottomley < > > > jejb@linux.vnet.ibm.com> wrote: > > > > On Mon, 2016-05-02 at 12:05 +0200, Hannes Reinecke wrote: > > > > > On 04/29/2016 02:49 PM, Jinpu Wang wrote: > > > > > > Hi, all > > > > > > > > > > > > We hit IO error on fsync, it turns out was because sd treat > > > > > > succeeded > > > > > > SYNC as error. From what I checked in SBC spec there is no > > > > > > indication > > > > > > we should fail IO in this case, so we create this patch. > > > > > > > > > > > > > > > > > > Best Regards, > > > > > > > > > > > > Jack Wang > > > > > > > > > > > > v2: > > > > > > No change on patch itself, only resend in body as suggested > > > > > > by > > > > > > Bart, > > > > > > still keep the attachment in case mail client break the > > > > > > format. > > > > > > > > > > > > From 5d1f72d9643ce61cd9f3d312377378c43f171d0c Mon Sep 17 > > > > > > 00:00:00 > > > > > > 2001 > > > > > > From: Jack Wang > > > > > > Date: Mon, 25 Apr 2016 12:05:22 +0200 > > > > > > Subject: [PATCH] sd: Don't treat succeeded SYNC as error > > > > > > > > > > > > We hit IO error in our production on multipath devices > > > > > > during > > > > > > resize > > > > > > device on target side, the problem turns out sd driver > > > > > > passes up as > > > > > > IO > > > > > > error when sense data is UNIT_ATTENTION and ASC && ASCQ > > > > > > indicate > > > > > > Capacity data has changed, even storage side sync the data > > > > > > properly. > > > > > > > > > > > > In order to fix this check in sd_done, report success if > > > > > > condition > > > > > > matches. > > > > > > > > > > > > Sebastian Parschauer report/analyze the bug here: > > > > > > https://sourceforge.net/p/scst/mailman/message/34953416/ > > > > > > > > > > > > Signed-off-by: Sebastian Parschauer > > > > > > Signed-off-by: Jack Wang > > > > > > --- > > > > > > drivers/scsi/sd.c | 13 +++++++++++++ > > > > > > 1 file changed, 13 insertions(+) > > > > > > > > > > > Well. > > > > > Is there anything which guarantees us that 'capacity data has > > > > > changed' will be the only sense code which we'll be seeing as > > > > > a > > > > > response to SYNCHRONIZE CACHE? > > > > > I sincerely doubt so. > > > > > So why don't you fall back to the default action (ie retry > > > > > the > > > > > command) whenever you hit an UNIT ATTENTION? > > > > > This way we would cove any resulting sense code, _and_ would > > > > > get rid > > > > > of the rather ugly special case here. > > > > > > > > Actually, why are we getting here at all? should we be eating > > > > this > > > > unit attention once we've reported it in scsi_check_sense()? > > > > > > > > I also don't quite understand why the normal retry mechanism in > > > > scsi_io_completion() (called after drv->done()) isn't handling > > > > this. > > > > We set retries on a flush command and we give sd_sync_cache > > > > three > > > > goes. Any one of those should also cause the CC/UA to be > > > > ignored. > > > > > > > > James > > > > > > > > > > > > > > Sorry for delay, I agree safer to retry this command. > > > I checked the code path, in scsi_io_completion, we call > > > __scsi_error_from_host_byte for FLUSH request, > > > and we set error to EIO by default, somehow the code report error > > > directly to user space without retry. > > > [ 647.206270] sd 1:0:0:0: [sdb] tag#0 Send: scmd > > > 0xffff8800b6558480 > > > [ 647.206422] sd 1:0:0:0: [sdb] tag#0 CDB: Synchronize Cache(10) > > > 35 > > > 00 00 00 00 00 00 00 00 00 > > > [ 647.209748] sd 1:0:0:0: Capacity data has changed > > > [ 647.209896] sd 1:0:0:0: [sdb] tag#0 Done: SUCCESS Result: > > > hostbyte=DID_OK driverbyte=DRIVER_OK > > > [ 647.210157] sd 1:0:0:0: [sdb] tag#0 CDB: Synchronize Cache(10) > > > 35 > > > 00 00 00 00 00 00 00 00 00 > > > [ 647.210419] sd 1:0:0:0: [sdb] tag#0 Sense Key : Unit Attention > > > [current] > > > [ 647.210567] sd 1:0:0:0: [sdb] tag#0 Add. Sense: Capacity data > > > has changed > > > [ 647.210741] sd 1:0:0:0: [sdb] tag#0 scsi host busy 1 failed 0 > > > [ 647.210888] sd 1:0:0:0: Notifying upper driver of completion > > > (result 8000002) > > > [ 647.211035] sd 1:0:0:0: [sdb] tag#0 sd_done: completed 0 of 0 > > > bytes > > > [ 647.211182] sd 1:0:0:0: [sdb] tag#0 0 sectors total, 0 bytes > > > done, error -5 > > > [ 647.211330] blk_update_request: I/O error, dev sdb, sector 0 > > > > > > Will figure out why retry doesn't work. > > > > > > Thanks James and Hannes for all your input. > > > > > > Regards, > > > Jack > > > > Hi James, Hannes and all, > > > > I find out it's code below which report error directly back to user > > space without any retry. > > 913 /* > > 914 * If we finished all bytes in the request we are done > > now. > > 915 */ > > 916 if (!scsi_end_request(req, error, good_bytes, 0)) > > 917 return; > > > > But not sure, what's the best way to fix the behavior to let it > > retry, > > maybe add condition with sense key && asc && ascq direct go to > > requeue before line 913? > > > > Thanks > > Jack > > > Hi James , Hannes and all, > > I created a patch below, I did basic test on my test machines. Please > share your comments! > > Thanks, > Jack > From 72ab860811e14e37db81fb409abf0fa7e7fe32cb Mon Sep 17 00:00:00 > 2001 > From: Jack Wang > Date: Tue, 10 May 2016 10:10:59 +0200 > Subject: [PATCH] scsi: requeue command on capacity data has changed > > We hit IO error in our production on multipath devices during resize > device on target side, the problem turns out scsi driver passes up as > IO > error when sense data is UNIT_ATTENTION and ASC && ASCQ indicate > Capacity data has changed, even storage side sync the data properly. > > To fix this, when condition met, we simply requeue the command. > > Reported-by: Sebastian Parschauer > Signed-off-by: Jack Wang > --- > drivers/scsi/scsi_lib.c | 5 +++++ > 1 file changed, 5 insertions(+) > > diff --git a/drivers/scsi/scsi_lib.c b/drivers/scsi/scsi_lib.c > index 8106515..b00310f 100644 > --- a/drivers/scsi/scsi_lib.c > +++ b/drivers/scsi/scsi_lib.c > @@ -910,6 +910,11 @@ void scsi_io_completion(struct scsi_cmnd *cmd, > unsigned int good_bytes) > error = 0; > } > > + if (sense_valid && (sshdr.sense_key == UNIT_ATTENTION)) { > + if ((sshdr.asc == 0x2a && sshdr.ascq == 0x09)) > + goto requeue; > + } > + Actually, I think this is symptomatic of a much bigger problem. Now that the FS can send zero length non BLOCK_PC request, we're not treating failure correctly. blk_update_request() will always finish them becuase they have no bytes outstanding (not having any in the first case). So I think we need a special exception for all zero length commands which complete with a failure to allow them to retry (if required). James --- -- To unsubscribe from this list: send the line "unsubscribe linux-scsi" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html diff --git a/drivers/scsi/scsi_lib.c b/drivers/scsi/scsi_lib.c index 8106515..5a97866 100644 --- a/drivers/scsi/scsi_lib.c +++ b/drivers/scsi/scsi_lib.c @@ -911,9 +911,12 @@ void scsi_io_completion(struct scsi_cmnd *cmd, unsigned int good_bytes) } /* - * If we finished all bytes in the request we are done now. + * special case: failed zero length commands always need to + * drop down into the retry code. Otherwise, if we finished + * all bytes in the request we are done now. */ - if (!scsi_end_request(req, error, good_bytes, 0)) + if (!(good_bytes == 0 && blk_rq_bytes(req) == 0 && result != 0) && + !scsi_end_request(req, error, good_bytes, 0)) return; /*