From patchwork Tue Aug 9 17:12:47 2016
X-Patchwork-Submitter: Laurence Oberman
X-Patchwork-Id: 9271907
Date: Tue, 9 Aug 2016 13:12:47 -0400 (EDT)
From: Laurence Oberman
To: Bart Van Assche
Cc: dm-devel@redhat.com, Mike Snitzer, linux-scsi@vger.kernel.org, Johannes Thumshirn
Message-ID: <1494059467.386778.1470762767417.JavaMail.zimbra@redhat.com>
In-Reply-To: <2aaad6b7-bfa4-b965-53bf-4420fe01d3e5@sandisk.com>
References: <20160801175948.GA6685@redhat.com>
 <1616390775.11191.1470494853559.JavaMail.zimbra@redhat.com>
 <551419047.135340.1470669997660.JavaMail.zimbra@redhat.com>
 <077d2708-3360-d8d7-fb3c-d3a73a1e03ee@sandisk.com>
 <1345038259.188657.1470696767844.JavaMail.zimbra@redhat.com>
 <1771573384.192110.1470701350622.JavaMail.zimbra@redhat.com>
 <2aaad6b7-bfa4-b965-53bf-4420fe01d3e5@sandisk.com>
Subject: Re: [dm-devel] dm-mq and end_clone_request()
X-Mailing-List: linux-scsi@vger.kernel.org

----- Original Message -----
> From: "Bart Van Assche"
> To: "Laurence Oberman"
> Cc: dm-devel@redhat.com, "Mike Snitzer", linux-scsi@vger.kernel.org, "Johannes Thumshirn"
> Sent: Tuesday, August 9, 2016 11:51:00 AM
> Subject: Re: [dm-devel] dm-mq and end_clone_request()
>
> On 08/08/2016 05:09 PM, Laurence Oberman wrote:
> > So now, back on a 10-LUN, dual-path (ramdisk-backed), two-server
> > configuration, I am unable to reproduce the dm issue.
> > Recovery is very fast with the servers connected back to back.
> > This is using your kernel and this multipath.conf:
> >
> > [ ... ]
> >
> > Mike's patches have definitely stabilized this issue for me on this
> > configuration.
> >
> > I will see if I can move to a larger target server that has more
> > memory and allocate more mpath devices. I feel this issue in large
> > configurations is now rooted in multipath sometimes not bringing
> > back maps even when the actual paths are back via srp_daemon.
> > I am still tracking that down.
> >
> > If you recall, last week I caused some of our own issues by
> > forgetting I had a no_path_retry 12 hiding in my multipath.conf.
> > Since removing that and spending most of the weekend testing on
> > the DDN array (I had to give that back today), most of my issues
> > were either the sporadic host-delete race or multipath not
> > re-instantiating paths.
> >
> > I don't know if this helps, but since applying your latest patch I
> > have not seen the host-delete race.
>
> Hello Laurence,
>
> My latest SCSI core patch adds additional instrumentation to the SCSI
> core but does not change the behavior of the SCSI core, so it cannot
> fix the scsi_forget_host() crash you had reported.
>
> On my setup, with the kernel code from the srp-initiator-for-next
> branch and with CONFIG_DM_MQ_DEFAULT=n, I still see that when I run
> the srp-test software, fio reports I/O errors every now and then.
> What I see in syslog seems to indicate that these I/O errors are
> generated by dm-mpath:
>
> Aug  9 08:45:39 ion-dev-ib-ini kernel: mpath 254:1: queue_if_no_path 1 -> 0
> Aug  9 08:45:39 ion-dev-ib-ini kernel: must_push_back: 107 callbacks suppressed
> Aug  9 08:45:39 ion-dev-ib-ini kernel: device-mapper: multipath: must_push_back: queue_if_no_path=0 suspend_active=1 suspending=0
> Aug  9 08:45:39 ion-dev-ib-ini kernel: __multipath_map(): (a) returning -5
> Aug  9 08:45:39 ion-dev-ib-ini kernel: map_request(): clone_and_map_rq() returned -5
> Aug  9 08:45:39 ion-dev-ib-ini kernel: dm_complete_request: error = -5
> Aug  9 08:45:39 ion-dev-ib-ini kernel: dm_softirq_done: dm-1 tio->error = -5
>
> Bart.
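[Editorial note: the log above shows dm-mpath failing I/O with -5 (-EIO) once queue_if_no_path drops to 0. The following is a minimal userspace sketch of that decision, not the kernel's actual code; the function and macro names here are illustrative stand-ins for the real __multipath_map() / DM_MAPIO_REQUEUE logic.]

```c
#include <errno.h>
#include <stdbool.h>

/* Illustrative stand-in for DM_MAPIO_REQUEUE */
#define MAP_REQUEUE 1

/*
 * Sketch of the no-usable-path decision visible in the syslog trace:
 * if queue_if_no_path is still set (and the device is not suspending),
 * the request is pushed back to be retried later; otherwise the map
 * fails immediately with -EIO, which is the "-5" propagated through
 * map_request() and dm_softirq_done() in the log.
 */
static int map_no_path(bool queue_if_no_path, bool suspending)
{
	if (queue_if_no_path && !suspending)
		return MAP_REQUEUE;	/* hold the I/O until a path returns */
	return -EIO;			/* -5: fail the clone request */
}
```

This matches the trace: queue_if_no_path had just been flipped 1 -> 0, so the map path returned -5 instead of requeueing, and fio saw an I/O error.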
Hello Bart,

I was talking about this patch:

--- a/drivers/scsi/scsi_scan.c
+++ b/drivers/scsi/scsi_scan.c
@@ -1890,10 +1890,11 @@ void scsi_forget_host(struct Scsi_Host *shost)
 restart:
 	spin_lock_irqsave(shost->host_lock, flags);
 	list_for_each_entry(sdev, &shost->__devices, siblings) {
-		if (sdev->sdev_state == SDEV_DEL)
+		if (sdev->sdev_state == SDEV_DEL || scsi_device_get(sdev) < 0)
 			continue;
 		spin_unlock_irqrestore(shost->host_lock, flags);
 		__scsi_remove_device(sdev);
+		scsi_device_put(sdev);
 		goto restart;
 	}
 	spin_unlock_irqrestore(shost->host_lock, flags);
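[Editorial note: the patch above follows a common kernel pattern: before dropping the host lock to call a function that may sleep (__scsi_remove_device()), take a reference on the list element so it cannot be freed underneath the scan, and drop that reference afterwards. Below is a minimal userspace model of that pattern; the struct and helper names are assumptions for illustration, not the kernel's scsi_device API.]

```c
/*
 * Userspace sketch of the get-before-unlock pattern used in the patch.
 * dev_get() mirrors scsi_device_get(): it fails if the device is
 * already being torn down (SDEV_DEL), otherwise it pins the device so
 * the caller may safely drop the lock protecting the device list.
 */
struct dev {
	int refcount;	/* pinned while > 0 */
	int state;	/* 0 = alive, 1 = DEL (removal in progress) */
};

static int dev_get(struct dev *d)
{
	if (d->state == 1)
		return -1;	/* too late: device is going away */
	d->refcount++;
	return 0;
}

static void dev_put(struct dev *d)
{
	d->refcount--;	/* release the pin taken by dev_get() */
}
```

In the patched loop, a device in SDEV_DEL state, or one whose scsi_device_get() fails, is simply skipped; every other device is pinned across the unlocked __scsi_remove_device() call and released with scsi_device_put() before the scan restarts.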