
[V6,11/11] nvme: pci: support nested EH

Message ID 20180517022030.GB21959@localhost.localdomain (mailing list archive)
State New, archived

Commit Message

Keith Busch May 17, 2018, 2:20 a.m. UTC
Hi Ming,

I'm developing the answers in code to the issues you raised. It will just
take a moment to finish fleshing those out. In the meantime I just want
to point out why I think block/011 isn't a real test.

On Thu, May 17, 2018 at 07:10:59AM +0800, Ming Lei wrote:
> All simulation in block/011 may happen in reality.

If this test actually simulates reality, then the following one line
patch (plus explanation for why) would be a real "fix" as this is very
successful in passing block/011. :)

---

Comments

Christoph Hellwig May 17, 2018, 8:41 a.m. UTC | #1
On Wed, May 16, 2018 at 08:20:31PM -0600, Keith Busch wrote:
> On Thu, May 17, 2018 at 07:10:59AM +0800, Ming Lei wrote:
> > All simulation in block/011 may happen in reality.
> 
> If this test actually simulates reality, then the following one line
> patch (plus explanation for why) would be a real "fix" as this is very
> successful in passing block/011. :)
> 
> ---
> diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
> index 1faa32cd07da..dcc5746304c4 100644
> --- a/drivers/nvme/host/pci.c
> +++ b/drivers/nvme/host/pci.c
> @@ -2118,6 +2118,12 @@ static int nvme_pci_enable(struct nvme_dev *dev)
>  
>  	if (pci_enable_device_mem(pdev))
>  		return result;
> +	/*
> +	 * blktests block/011 disables the device without the driver knowing.
> +	 * We'll just enable the device twice to get the enable_cnt > 1
> +	 * so that the test's disabling does absolutely nothing.
> +	 */
> +	pci_enable_device_mem(pdev);

Heh.  But yes, this test and the PCI "enable" interface in sysfs look
horribly wrong. pci_disable_device/pci_enable_device aren't something we
can just do underneath the driver.  Even if injecting the low-level
config space manipulations done by it might be useful and a good test,
the Linux state ones are just killing the driver.

The enable attribute appears to have been added by Arjan for the
Xorg driver.  I think if we have a driver bound to the device we
should not allow it.
Keith Busch May 17, 2018, 2:20 p.m. UTC | #2
On Thu, May 17, 2018 at 10:41:29AM +0200, Christoph Hellwig wrote:
> On Wed, May 16, 2018 at 08:20:31PM -0600, Keith Busch wrote:
> > On Thu, May 17, 2018 at 07:10:59AM +0800, Ming Lei wrote:
> > > All simulation in block/011 may happen in reality.
> > 
> > If this test actually simulates reality, then the following one line
> > patch (plus explanation for why) would be a real "fix" as this is very
> > successful in passing block/011. :)
> > 
> > ---
> > diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
> > index 1faa32cd07da..dcc5746304c4 100644
> > --- a/drivers/nvme/host/pci.c
> > +++ b/drivers/nvme/host/pci.c
> > @@ -2118,6 +2118,12 @@ static int nvme_pci_enable(struct nvme_dev *dev)
> >  
> >  	if (pci_enable_device_mem(pdev))
> >  		return result;
> > +	/*
> > +	 * blktests block/011 disables the device without the driver knowing.
> > +	 * We'll just enable the device twice to get the enable_cnt > 1
> > +	 * so that the test's disabling does absolutely nothing.
> > +	 */
> > +	pci_enable_device_mem(pdev);
> 
> Heh.  But yes, this test and the PCI "enable" interface in sysfs look
> horribly wrong. pci_disable_device/pci_enable_device aren't something we
> can just do underneath the driver.  Even if injecting the low-level
> config space manipulations done by it might be useful and a good test,
> the Linux state ones are just killing the driver.

Yes, I'm totally fine with injecting errors into the config space, but
for goodness sakes, don't fuck with the internal kernel structures out
from under drivers using them.

My suggestion is to use 'setpci' to disable the device if you want to
inject this scenario. That way you get the broken-device scenario you
want to test, but struct pci_dev hasn't been altered.
 
> The enable attribute appears to have been added by Arjan for the
> Xorg driver.  I think if we have a driver bound to the device we
> should not allow it.

Agreed. Alternatively, possibly call the driver's reset_prepare/done
callbacks.
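
For reference, those callbacks live in struct pci_error_handlers. A rough
sketch of bracketing a test-induced disruption with them, the way the PCI
core does around a reset (illustrative only: the helper name is invented,
and locking and error handling are omitted):

#include <linux/pci.h>

/*
 * Illustrative sketch, not actual kernel code: let the bound driver
 * quiesce and restore the device around an injected disruption via its
 * reset_prepare()/reset_done() callbacks.
 */
static void sketch_disrupt_with_callbacks(struct pci_dev *pdev)
{
	const struct pci_error_handlers *err =
		pdev->driver ? pdev->driver->err_handler : NULL;

	if (err && err->reset_prepare)
		err->reset_prepare(pdev);	/* driver quiesces outstanding I/O */

	/* ... inject the low-level disruption here ... */

	if (err && err->reset_done)
		err->reset_done(pdev);		/* driver brings the device back up */
}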
Johannes Thumshirn May 17, 2018, 2:23 p.m. UTC | #3
On Thu, May 17, 2018 at 08:20:51AM -0600, Keith Busch wrote:
> > Heh.  But yes, this test and the PCI "enable" interface in sysfs look
> > horribly wrong. pci_disable_device/pci_enable_device aren't something we
> > can just do underneath the driver.  Even if injecting the low-level
> > config space manipulations done by it might be useful and a good test,
> > the Linux state ones are just killing the driver.
> 
> Yes, I'm totally fine with injecting errors into the config space, but
> for goodness sakes, don't fuck with the internal kernel structures out
> from under drivers using them.
> 
> My suggestion is to use 'setpci' to disable the device if you want to
> inject this scenario. That way you get the broken-device scenario you
> want to test, but struct pci_dev hasn't been altered.
>  
> > The enable attribute appears to have been added by Arjan for the
> > Xorg driver.  I think if we have a driver bound to the device we
> > should not allow it.
> 
> Agreed. Alternatively, possibly call the driver's reset_prepare/done
> callbacks.

Exactly, but as long as we can issue the reset via sysfs the test-case
is still valid.
Ming Lei May 18, 2018, 12:20 a.m. UTC | #4
On Wed, May 16, 2018 at 08:20:31PM -0600, Keith Busch wrote:
> Hi Ming,
> 
> I'm developing the answers in code to the issues you raised. It will just
> take a moment to finish fleshing those out. In the meantime I just want
> to point out why I think block/011 isn't a real test.
> 
> On Thu, May 17, 2018 at 07:10:59AM +0800, Ming Lei wrote:
> > All simulation in block/011 may happen in reality.
> 
> If this test actually simulates reality, then the following one line
> patch (plus explanation for why) would be a real "fix" as this is very
> successful in passing block/011. :)
> 
> ---
> diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
> index 1faa32cd07da..dcc5746304c4 100644
> --- a/drivers/nvme/host/pci.c
> +++ b/drivers/nvme/host/pci.c
> @@ -2118,6 +2118,12 @@ static int nvme_pci_enable(struct nvme_dev *dev)
>  
>  	if (pci_enable_device_mem(pdev))
>  		return result;
> +	/*
> +	 * blktests block/011 disables the device without the driver knowing.
> +	 * We'll just enable the device twice to get the enable_cnt > 1
> +	 * so that the test's disabling does absolutely nothing.
> +	 */
> +	pci_enable_device_mem(pdev);

What I think is helpful about block/011 is that it can trigger an IO timeout
during reset, which can be triggered in reality too.

That is one big problem of the NVMe driver, IMO.

And this patch does fix this issue, together with other timeout-related
races.


Thanks,
Ming
Ming Lei May 18, 2018, 1:01 a.m. UTC | #5
On Fri, May 18, 2018 at 08:19:58AM +0800, Ming Lei wrote:
> On Wed, May 16, 2018 at 08:20:31PM -0600, Keith Busch wrote:
> > Hi Ming,
> > 
> > I'm developing the answers in code to the issues you raised. It will just
> > take a moment to finish fleshing those out. In the meantime I just want
> > to point out why I think block/011 isn't a real test.
> > 
> > On Thu, May 17, 2018 at 07:10:59AM +0800, Ming Lei wrote:
> > > All simulation in block/011 may happen in reality.
> > 
> > If this test actually simulates reality, then the following one line
> > patch (plus explanation for why) would be a real "fix" as this is very
> > successful in passing block/011. :)
> > 
> > ---
> > diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
> > index 1faa32cd07da..dcc5746304c4 100644
> > --- a/drivers/nvme/host/pci.c
> > +++ b/drivers/nvme/host/pci.c
> > @@ -2118,6 +2118,12 @@ static int nvme_pci_enable(struct nvme_dev *dev)
> >  
> >  	if (pci_enable_device_mem(pdev))
> >  		return result;
> > +	/*
> > +	 * blktests block/011 disables the device without the driver knowing.
> > +	 * We'll just enable the device twice to get the enable_cnt > 1
> > +	 * so that the test's disabling does absolutely nothing.
> > +	 */
> > +	pci_enable_device_mem(pdev);
> 
> What I think is helpful about block/011 is that it can trigger an IO timeout
> during reset, which can be triggered in reality too.
> 
> That is one big problem of the NVMe driver, IMO.
> 
> And this patch does fix this issue, together with other timeout-related
> races.

BTW, the above patch may not be enough to 'fix' this NVMe issue; I guess
you may have to figure out another one to remove the fault injection, or at
least disable io-timeout-fail on NVMe.

Thanks,
Ming
Keith Busch May 18, 2018, 1:57 p.m. UTC | #6
On Fri, May 18, 2018 at 08:20:05AM +0800, Ming Lei wrote:
> What I think is helpful about block/011 is that it can trigger an IO timeout
> during reset, which can be triggered in reality too.

As I mentioned earlier, there is nothing wrong with the spirit of
the test. What's wrong with it is the misguided implementation.

Do you understand why it ever passes? The success happens when the
enabling part of the loop happens to coincide with the driver's enabling,
creating pci_dev->enable_cnt > 1 and making subsequent disable parts
of the loop do absolutely nothing; exactly the same as the one-liner
(non-serious) patch I sent to defeat the test.

A better way to induce the timeout is:

  # setpci -s <B:D.f> 4.w=0:6

This will halt the device without messing with the kernel structures,
just like how a real device failure would occur.
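
To decode that command: offset 4 is the PCI Command register, and the
value:mask pair 0:6 clears only bits 1 and 2 (Memory Space Enable and Bus
Master Enable), so MMIO and DMA stop while struct pci_dev is left alone.
The enable_cnt behaviour described above can be modelled roughly like this
(an illustrative sketch with invented names, not the real drivers/pci/pci.c
code):

#include <stdbool.h>

struct model_pci_dev {
	int enable_cnt;		/* stands in for pci_dev.enable_cnt */
	bool hw_enabled;	/* whether the hardware is actually enabled */
};

static int model_pci_enable(struct model_pci_dev *d)
{
	if (d->enable_cnt++ > 0)
		return 0;	/* already enabled: only the count goes up */
	d->hw_enabled = true;	/* first enable really touches the device */
	return 0;
}

static void model_pci_disable(struct model_pci_dev *d)
{
	if (--d->enable_cnt > 0)
		return;		/* another user still holds an enable */
	d->hw_enabled = false;	/* last disable really touches the device */
}

/*
 * If the test's enable coincides with the driver's own enable, enable_cnt
 * reaches 2 and the test's later single disable becomes a no-op; that is
 * the same condition the one-line (non-serious) patch creates
 * unconditionally.
 */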
Keith Busch May 18, 2018, 4:28 p.m. UTC | #7
On Thu, May 17, 2018 at 04:23:45PM +0200, Johannes Thumshirn wrote:
> > Agreed. Alternatively, possibly call the driver's reset_prepare/done
> > callbacks.
> 
> Exactly, but as long as we can issue the reset via sysfs the test-case
> is still valid.

I disagree the test case is valid. The test writes '0' to the
pci-sysfs 'enable', but the driver also disables the pci device as part
of resetting, which is a perfectly reasonable thing for a driver to do.

If the timing of the test's loop happens to write '0' right after the
driver disabled the device that it owns, a 'write error' on that sysfs
write occurs, and blktests then incorrectly claims the test failed.
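
A minimal sketch of that failure mode (a simplified model, not the actual
drivers/pci/pci-sysfs.c code): once the driver has disabled the device for
its own reset, the test's write of '0' finds nothing left to disable and
the store errors out, which blktests then counts as a failure.

#include <errno.h>
#include <stdbool.h>

/*
 * Simplified model only; 'enable_cnt' stands in for pci_dev.enable_cnt and
 * the error value is illustrative.
 */
static int model_enable_store(int *enable_cnt, bool enable)
{
	if (enable) {
		(*enable_cnt)++;	/* like pci_enable_device() */
		return 0;
	}

	if (*enable_cnt == 0)
		return -EIO;		/* already disabled: the 'write error' the test trips over */

	(*enable_cnt)--;		/* like pci_disable_device() */
	return 0;
}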
Jens Axboe May 18, 2018, 4:58 p.m. UTC | #8
On 5/18/18 7:57 AM, Keith Busch wrote:
> On Fri, May 18, 2018 at 08:20:05AM +0800, Ming Lei wrote:
>> What I think is helpful about block/011 is that it can trigger an IO timeout
>> during reset, which can be triggered in reality too.
> 
> As I mentioned earlier, there is nothing wrong with the spirit of
> the test. What's wrong with it is the misguided implementation.
> 
> Do you understand why it ever passes? The success happens when the
> enabling part of the loop happens to coincide with the driver's enabling,
> creating pci_dev->enable_cnt > 1 and making subsequent disable parts
> of the loop do absolutely nothing; exactly the same as the one-liner
> (non-serious) patch I sent to defeat the test.
> 
> A better way to induce the timeout is:
> 
>   # setpci -s <B:D.f> 4.w=0:6
> 
> This will halt the device without messing with the kernel structures,
> just like how a real device failure would occur.

Let's just improve/fix the test case. Sounds like the 'enable' sysfs
attribute should never have been exported, and hence the test should
never have used it. blktests is not the source of truth, necessarily,
and it would be silly to work around cases in the kernel if it's a
clear case of "doctor it hurts when I shoot myself in the foot".
Ming Lei May 18, 2018, 10:26 p.m. UTC | #9
On Fri, May 18, 2018 at 07:57:51AM -0600, Keith Busch wrote:
> On Fri, May 18, 2018 at 08:20:05AM +0800, Ming Lei wrote:
> > What I think is helpful about block/011 is that it can trigger an IO timeout
> > during reset, which can be triggered in reality too.
> 
> As I mentioned earlier, there is nothing wrong with the spirit of
> the test. What's wrong with it is the misguided implementation.
> 
> Do you understand why it ever passes? The success happens when the
> enabling part of the loop happens to coincide with the driver's enabling,
> creating pci_dev->enable_cnt > 1 and making subsequent disable parts
> of the loop do absolutely nothing; exactly the same as the one-liner
> (non-serious) patch I sent to defeat the test.
> 
> A better way to induce the timeout is:
> 
>   # setpci -s <B:D.f> 4.w=0:6
> 
> This will halt the device without messing with the kernel structures,
> just like how a real device failure would occur.

Frankly speaking, I don't care how the test-case is implemented at all.

The big problem is that the NVMe driver can't handle an IO timeout in the
reset context, and in the end either the controller becomes DEAD or the
reset context hangs forever, and nothing can move on.

The issue can be reproduced more easily via io-timeout-fail fault injection.

So could we please face the real issue instead of working around the
test case?

Thanks,
Ming
Keith Busch May 18, 2018, 11:45 p.m. UTC | #10
On Sat, May 19, 2018 at 06:26:28AM +0800, Ming Lei wrote:
> So could we please face the real issue instead of working around the
> test case?

Yes, that's why I want you to stop referencing the broken test.
Ming Lei May 18, 2018, 11:51 p.m. UTC | #11
On Fri, May 18, 2018 at 05:45:00PM -0600, Keith Busch wrote:
> On Sat, May 19, 2018 at 06:26:28AM +0800, Ming Lei wrote:
> > So could we please face the real issue instead of working around the
> > test case?
> 
> Yes, that's why I want you to stop referencing the broken test.

Unfortunately I don't care whether it is broken or not; I think
the test case is great because it can expose the real problem.
That is its value!

The reason I referenced this test is that everyone can reproduce
this real issue easily; that is all.

Thanks,
Ming
Johannes Thumshirn May 22, 2018, 7:35 a.m. UTC | #12
On Fri, May 18, 2018 at 10:28:04AM -0600, Keith Busch wrote:
> On Thu, May 17, 2018 at 04:23:45PM +0200, Johannes Thumshirn wrote:
> > > Agreed. Alternatively, possibly call the driver's reset_prepare/done
> > > callbacks.
> > 
> > Exactly, but as long as we can issue the reset via sysfs the test-case
> > is still valid.
> 
> I disagree the test case is valid. The test writes '0' to the
> pci-sysfs 'enable', but the driver also disables the pci device as part
> of resetting, which is a perfectly reasonable thing for a driver to do.
> 
> If the timing of the test's loop happens to write '0' right after the
> driver disabled the device that it owns, a 'write error' on that sysfs
> write occurs, and blktests then incorrectly claims the test failed.

Hmm, OK, that's a valid point. But seeing that you have sent a patch for
blktests anyway, I think it's gone now.

Patch

diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index 1faa32cd07da..dcc5746304c4 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -2118,6 +2118,12 @@ static int nvme_pci_enable(struct nvme_dev *dev)
 
 	if (pci_enable_device_mem(pdev))
 		return result;
+	/*
+	 * blktests block/011 disables the device without the driver knowing.
+	 * We'll just enable the device twice to get the enable_cnt > 1
+	 * so that the test's disabling does absolutely nothing.
+	 */
+	pci_enable_device_mem(pdev);
 
 	pci_set_master(pdev);
--