virtio_blk: Fix device surprise removal

Message ID	20240217180848.241068-1-parav@nvidia.com (mailing list archive)
State	New, archived
Headers	From: Parav Pandit <parav@nvidia.com>
	To: <mst@redhat.com>, <jasowang@redhat.com>, <xuanzhuo@linux.alibaba.com>, <pbonzini@redhat.com>, <stefanha@redhat.com>, <axboe@kernel.dk>, <virtualization@lists.linux.dev>, <linux-block@vger.kernel.org>
	CC: Parav Pandit <parav@nvidia.com>, <stable@vger.kernel.org>, <lirongqing@baidu.com>, Chaitanya Kulkarni <kch@nvidia.com>
	Subject: [PATCH] virtio_blk: Fix device surprise removal
	Date: Sat, 17 Feb 2024 20:08:48 +0200
Series	virtio_blk: Fix device surprise removal

Parav Pandit Feb. 17, 2024, 6:08 p.m. UTC

When the PCI device is surprise removed, requests won't complete from
the device. These IOs are never completed and disk deletion hangs
indefinitely.

Fix it by aborting the IOs which the device will never complete
when the VQ is broken.

With this fix, fio now completes promptly.
An IO timeout was considered as an alternative. However, when the driver
already knows that the block device is unresponsive, aborting the pending
IOs right away lets users and upper layers react quickly.

Verified over multiple device unplug cycles, with some IOs pending in the
virtio used ring and some still pending in the device.

In the future, a more elegant method than checking for a broken VQ can be
used. For now the patch is kept minimal, given the urgency of fixing
broken kernels.

Fixes: 43bb40c5b926 ("virtio_pci: Support surprise removal of virtio pci device")
Cc: stable@vger.kernel.org
Reported-by: lirongqing@baidu.com
Closes: https://lore.kernel.org/virtualization/c45dd68698cd47238c55fb73ca9b4741@baidu.com/
Co-developed-by: Chaitanya Kulkarni <kch@nvidia.com>
Signed-off-by: Chaitanya Kulkarni <kch@nvidia.com>
Signed-off-by: Parav Pandit <parav@nvidia.com>
---
 drivers/block/virtio_blk.c | 54 ++++++++++++++++++++++++++++++++++++++
 1 file changed, 54 insertions(+)

Ming Lei Feb. 18, 2024, 1:27 p.m. UTC | #1

On Sat, Feb 17, 2024 at 08:08:48PM +0200, Parav Pandit wrote:
> When the PCI device is surprise removed, requests won't complete from
> the device. These IOs are never completed and disk deletion hangs
> indefinitely.
> 
> Fix it by aborting the IOs which the device will never complete
> when the VQ is broken.
> 
> With this fix now fio completes swiftly.
> An alternative of IO timeout has been considered, however
> when the driver knows about unresponsive block device, swiftly clearing
> them enables users and upper layers to react quickly.
> 
> Verified with multiple device unplug cycles with pending IOs in virtio
> used ring and some pending with device.
> 
> In future instead of VQ broken, a more elegant method can be used. At the
> moment the patch is kept to its minimal changes given its urgency to fix
> broken kernels.
> 
> Fixes: 43bb40c5b926 ("virtio_pci: Support surprise removal of virtio pci device")
> Cc: stable@vger.kernel.org
> Reported-by: lirongqing@baidu.com
> Closes: https://lore.kernel.org/virtualization/c45dd68698cd47238c55fb73ca9b4741@baidu.com/
> Co-developed-by: Chaitanya Kulkarni <kch@nvidia.com>
> Signed-off-by: Chaitanya Kulkarni <kch@nvidia.com>
> Signed-off-by: Parav Pandit <parav@nvidia.com>
> ---
>  drivers/block/virtio_blk.c | 54 ++++++++++++++++++++++++++++++++++++++
>  1 file changed, 54 insertions(+)
> 
> diff --git a/drivers/block/virtio_blk.c b/drivers/block/virtio_blk.c
> index 2bf14a0e2815..59b49899b229 100644
> --- a/drivers/block/virtio_blk.c
> +++ b/drivers/block/virtio_blk.c
> @@ -1562,10 +1562,64 @@ static int virtblk_probe(struct virtio_device *vdev)
>  	return err;
>  }
>  
> +static bool virtblk_cancel_request(struct request *rq, void *data)
> +{
> +	struct virtblk_req *vbr = blk_mq_rq_to_pdu(rq);
> +
> +	vbr->in_hdr.status = VIRTIO_BLK_S_IOERR;
> +	if (blk_mq_request_started(rq) && !blk_mq_request_completed(rq))
> +		blk_mq_complete_request(rq);
> +
> +	return true;
> +}
> +
> +static void virtblk_cleanup_reqs(struct virtio_blk *vblk)
> +{
> +	struct virtio_blk_vq *blk_vq;
> +	struct request_queue *q;
> +	struct virtqueue *vq;
> +	unsigned long flags;
> +	int i;
> +
> +	vq = vblk->vqs[0].vq;
> +	if (!virtqueue_is_broken(vq))
> +		return;
> +

What if the surprise removal happens after the above check?


Thanks,
Ming
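
The question above describes a check-then-act window. The following is a minimal userspace sketch of that window, not kernel code: `sim_dev`, `sim_surprise_remove()` and `sim_cleanup()` are hypothetical names, and the `remove_after_check` flag is an assumption added only to make the race deterministic.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of the device state; not the virtio API. */
struct sim_dev {
	bool vq_broken;	/* stands in for virtqueue_is_broken() */
	int inflight;	/* requests the device still owns */
};

/* Surprise removal: the VQ is marked broken, in-flight requests stay put. */
void sim_surprise_remove(struct sim_dev *d)
{
	d->vq_broken = true;
}

/*
 * Same shape as the early return in the posted patch: do nothing when
 * the VQ is not broken, otherwise abort every in-flight request.
 * remove_after_check injects a removal right after the check.
 * Returns the number of requests this function left un-aborted.
 */
int sim_cleanup(struct sim_dev *d, bool remove_after_check)
{
	if (!d->vq_broken) {
		if (remove_after_check)
			sim_surprise_remove(d);
		return d->inflight;	/* nothing was aborted */
	}

	d->inflight = 0;	/* abort everything with an error status */
	return 0;
}
```

With the device already broken before the call, every request is aborted. With the removal injected after the check, the function returns early and the in-flight requests remain owned by a device that will never complete them.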

Parav Pandit Feb. 19, 2024, 3:14 a.m. UTC | #2

Hi Ming,

> From: Ming Lei <ming.lei@redhat.com>
> Sent: Sunday, February 18, 2024 6:57 PM
> 
> On Sat, Feb 17, 2024 at 08:08:48PM +0200, Parav Pandit wrote:
> > When the PCI device is surprise removed, requests won't complete from
> > the device. These IOs are never completed and disk deletion hangs
> > indefinitely.
> >
> > Fix it by aborting the IOs which the device will never complete when
> > the VQ is broken.
> >
> > With this fix now fio completes swiftly.
> > An alternative of IO timeout has been considered, however when the
> > driver knows about unresponsive block device, swiftly clearing them
> > enables users and upper layers to react quickly.
> >
> > Verified with multiple device unplug cycles with pending IOs in virtio
> > used ring and some pending with device.
> >
> > In future instead of VQ broken, a more elegant method can be used. At
> > the moment the patch is kept to its minimal changes given its urgency
> > to fix broken kernels.
> >
> > Fixes: 43bb40c5b926 ("virtio_pci: Support surprise removal of virtio
> > pci device")
> > Cc: stable@vger.kernel.org
> > Reported-by: lirongqing@baidu.com
> > Closes:
> > https://lore.kernel.org/virtualization/c45dd68698cd47238c55fb73ca9b474
> > 1@baidu.com/
> > Co-developed-by: Chaitanya Kulkarni <kch@nvidia.com>
> > Signed-off-by: Chaitanya Kulkarni <kch@nvidia.com>
> > Signed-off-by: Parav Pandit <parav@nvidia.com>
> > ---
> >  drivers/block/virtio_blk.c | 54
> > ++++++++++++++++++++++++++++++++++++++
> >  1 file changed, 54 insertions(+)
> >
> > diff --git a/drivers/block/virtio_blk.c b/drivers/block/virtio_blk.c
> > index 2bf14a0e2815..59b49899b229 100644
> > --- a/drivers/block/virtio_blk.c
> > +++ b/drivers/block/virtio_blk.c
> > @@ -1562,10 +1562,64 @@ static int virtblk_probe(struct virtio_device
> *vdev)
> >  	return err;
> >  }
> >
> > +static bool virtblk_cancel_request(struct request *rq, void *data) {
> > +	struct virtblk_req *vbr = blk_mq_rq_to_pdu(rq);
> > +
> > +	vbr->in_hdr.status = VIRTIO_BLK_S_IOERR;
> > +	if (blk_mq_request_started(rq) && !blk_mq_request_completed(rq))
> > +		blk_mq_complete_request(rq);
> > +
> > +	return true;
> > +}
> > +
> > +static void virtblk_cleanup_reqs(struct virtio_blk *vblk) {
> > +	struct virtio_blk_vq *blk_vq;
> > +	struct request_queue *q;
> > +	struct virtqueue *vq;
> > +	unsigned long flags;
> > +	int i;
> > +
> > +	vq = vblk->vqs[0].vq;
> > +	if (!virtqueue_is_broken(vq))
> > +		return;
> > +
> 
> What if the surprise happens after the above check?
> 
> 
In that small timing window, the race still exists.

I think blk_mq_quiesce_queue(q) should move up before virtblk_cleanup_reqs(), regardless of the surprise case, along with the other changes below.

Additionally, for the non-surprise case, it is better to have a graceful timeout to complete already queued requests.
In the absence of a timeout scheme for this regression, shall we only complete the requests which the device has already completed (instead of waiting for the grace time)?
There was past work from Chaitanya on the graceful timeout.

The sequence for the fix I have in mind is:
1. quiesce the queue
2. complete all requests that the device has already completed, with their status
3. stop the transport (queues)
4. complete the remaining pending requests with error status

This should work regardless of the surprise case.
An additional/optional graceful timeout in the non-surprise case can be helpful for #2.

WDYT?
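
The four-step sequence above can be sketched as a small userspace model. This is only an illustration of the ordering, not the proposed driver change: `sim_req`, `sim_queue` and `sim_teardown()` are hypothetical names, and the boolean flags merely stand in for quiescing the block queue and stopping the transport.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical userspace model; none of these names exist in the kernel. */
enum sim_status { SIM_PENDING, SIM_OK, SIM_IOERR };

struct sim_req {
	bool device_done;		/* device already put it in the used ring */
	enum sim_status device_status;	/* status written by the device */
	enum sim_status status;		/* status reported to the block layer */
};

struct sim_queue {
	bool quiesced;			/* no new submissions are dispatched */
	bool transport_stopped;		/* device can no longer complete anything */
	struct sim_req *reqs;
	size_t nr;
};

/* Runs the four steps in order; returns how many requests were failed. */
size_t sim_teardown(struct sim_queue *q)
{
	size_t i, failed = 0;

	/* 1. quiesce the queue */
	q->quiesced = true;

	/* 2. complete what the device already finished, with its status */
	for (i = 0; i < q->nr; i++)
		if (q->reqs[i].device_done && q->reqs[i].status == SIM_PENDING)
			q->reqs[i].status = q->reqs[i].device_status;

	/* 3. stop the transport (queues) */
	q->transport_stopped = true;

	/* 4. complete whatever is still pending with an error status */
	for (i = 0; i < q->nr; i++) {
		if (q->reqs[i].status == SIM_PENDING) {
			q->reqs[i].status = SIM_IOERR;
			failed++;
		}
	}

	return failed;
}
```

Because step 4 runs unconditionally after the transport is stopped, the model does not depend on whether the removal was a surprise or not.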

> Thanks,
> Ming

Michael S. Tsirkin Feb. 19, 2024, 8:15 a.m. UTC | #3

On Mon, Feb 19, 2024 at 03:14:54AM +0000, Parav Pandit wrote:
> Hi Ming,
> 
> > From: Ming Lei <ming.lei@redhat.com>
> > Sent: Sunday, February 18, 2024 6:57 PM
> > 
> > On Sat, Feb 17, 2024 at 08:08:48PM +0200, Parav Pandit wrote:
> > > When the PCI device is surprise removed, requests won't complete from
> > > the device. These IOs are never completed and disk deletion hangs
> > > indefinitely.
> > >
> > > Fix it by aborting the IOs which the device will never complete when
> > > the VQ is broken.
> > >
> > > With this fix now fio completes swiftly.
> > > An alternative of IO timeout has been considered, however when the
> > > driver knows about unresponsive block device, swiftly clearing them
> > > enables users and upper layers to react quickly.
> > >
> > > Verified with multiple device unplug cycles with pending IOs in virtio
> > > used ring and some pending with device.
> > >
> > > In future instead of VQ broken, a more elegant method can be used. At
> > > the moment the patch is kept to its minimal changes given its urgency
> > > to fix broken kernels.
> > >
> > > Fixes: 43bb40c5b926 ("virtio_pci: Support surprise removal of virtio
> > > pci device")
> > > Cc: stable@vger.kernel.org
> > > Reported-by: lirongqing@baidu.com
> > > Closes:
> > > https://lore.kernel.org/virtualization/c45dd68698cd47238c55fb73ca9b474
> > > 1@baidu.com/
> > > Co-developed-by: Chaitanya Kulkarni <kch@nvidia.com>
> > > Signed-off-by: Chaitanya Kulkarni <kch@nvidia.com>
> > > Signed-off-by: Parav Pandit <parav@nvidia.com>
> > > ---
> > >  drivers/block/virtio_blk.c | 54
> > > ++++++++++++++++++++++++++++++++++++++
> > >  1 file changed, 54 insertions(+)
> > >
> > > diff --git a/drivers/block/virtio_blk.c b/drivers/block/virtio_blk.c
> > > index 2bf14a0e2815..59b49899b229 100644
> > > --- a/drivers/block/virtio_blk.c
> > > +++ b/drivers/block/virtio_blk.c
> > > @@ -1562,10 +1562,64 @@ static int virtblk_probe(struct virtio_device
> > *vdev)
> > >  	return err;
> > >  }
> > >
> > > +static bool virtblk_cancel_request(struct request *rq, void *data) {
> > > +	struct virtblk_req *vbr = blk_mq_rq_to_pdu(rq);
> > > +
> > > +	vbr->in_hdr.status = VIRTIO_BLK_S_IOERR;
> > > +	if (blk_mq_request_started(rq) && !blk_mq_request_completed(rq))
> > > +		blk_mq_complete_request(rq);
> > > +
> > > +	return true;
> > > +}
> > > +
> > > +static void virtblk_cleanup_reqs(struct virtio_blk *vblk) {
> > > +	struct virtio_blk_vq *blk_vq;
> > > +	struct request_queue *q;
> > > +	struct virtqueue *vq;
> > > +	unsigned long flags;
> > > +	int i;
> > > +
> > > +	vq = vblk->vqs[0].vq;
> > > +	if (!virtqueue_is_broken(vq))
> > > +		return;
> > > +
> > 
> > What if the surprise happens after the above check?
> > 
> > 
> In that small timing window, the race still exists.
> 
> I think, blk_mq_quiesce_queue(q); should move up before cleanup_reqs() regardless of surprise case along with other below changes.
> 
> Additionally, for non-surprise case, better to have a graceful timeout to complete already queued requests.
> In absence of timeout scheme for this regression, shall we only complete the requests which the device has already completed (instead of waiting for the grace time)?
> There was past work from Chaitanaya, for the graceful timeout.
> 
> The sequence for the fix I have in mind is:
> 1. quiesce the queue
> 2. complete all requests which has completed, with its status
> 3. stop the transport (queues)
> 4. complete remaining pending requests with error status
> 
> This should work regardless of surprise case.
> An additional/optional graceful timeout on non-surprise case can be helpful for #2.
> 
> WDYT?

All this is unnecessarily hard for drivers... I am thinking
maybe after we set broken we should go ahead and invoke all
callbacks. The issue is really that the interrupt handling core
is not making it easy for us: we must disable real
interrupts if we do, and in the past we failed to do it.
See e.g.


commit eb4cecb453a19b34d5454b49532e09e9cb0c1529
Author: Jason Wang <jasowang@redhat.com>
Date:   Wed Mar 23 11:15:24 2022 +0800

    Revert "virtio_pci: harden MSI-X interrupts"
    
    This reverts commit 9e35276a5344f74d4a3600fc4100b3dd251d5c56. Issue
    were reported for the drivers that are using affinity managed IRQ
    where manually toggling IRQ status is not expected. And we forget to
    enable the interrupts in the restore path as well.
    
    In the future, we will rework on the interrupt hardening.
    
    Fixes: 9e35276a5344 ("virtio_pci: harden MSI-X interrupts")



If someone can figure out a way to make toggling interrupt state
play nice with affinity managed interrupts, that would solve
a host of issues I feel.



> > Thanks,
> > Ming

Parav Pandit Feb. 19, 2024, 10:39 a.m. UTC | #4

> From: Michael S. Tsirkin <mst@redhat.com>
> Sent: Monday, February 19, 2024 1:45 PM
> 
> On Mon, Feb 19, 2024 at 03:14:54AM +0000, Parav Pandit wrote:
> > Hi Ming,
> >
> > > From: Ming Lei <ming.lei@redhat.com>
> > > Sent: Sunday, February 18, 2024 6:57 PM
> > >
> > > On Sat, Feb 17, 2024 at 08:08:48PM +0200, Parav Pandit wrote:
> > > > When the PCI device is surprise removed, requests won't complete
> > > > from the device. These IOs are never completed and disk deletion
> > > > hangs indefinitely.
> > > >
> > > > Fix it by aborting the IOs which the device will never complete
> > > > when the VQ is broken.
> > > >
> > > > With this fix now fio completes swiftly.
> > > > An alternative of IO timeout has been considered, however when the
> > > > driver knows about unresponsive block device, swiftly clearing
> > > > them enables users and upper layers to react quickly.
> > > >
> > > > Verified with multiple device unplug cycles with pending IOs in
> > > > virtio used ring and some pending with device.
> > > >
> > > > In future instead of VQ broken, a more elegant method can be used.
> > > > At the moment the patch is kept to its minimal changes given its
> > > > urgency to fix broken kernels.
> > > >
> > > > Fixes: 43bb40c5b926 ("virtio_pci: Support surprise removal of
> > > > virtio pci device")
> > > > Cc: stable@vger.kernel.org
> > > > Reported-by: lirongqing@baidu.com
> > > > Closes:
> > > > https://lore.kernel.org/virtualization/c45dd68698cd47238c55fb73ca9
> > > > b474
> > > > 1@baidu.com/
> > > > Co-developed-by: Chaitanya Kulkarni <kch@nvidia.com>
> > > > Signed-off-by: Chaitanya Kulkarni <kch@nvidia.com>
> > > > Signed-off-by: Parav Pandit <parav@nvidia.com>
> > > > ---
> > > >  drivers/block/virtio_blk.c | 54
> > > > ++++++++++++++++++++++++++++++++++++++
> > > >  1 file changed, 54 insertions(+)
> > > >
> > > > diff --git a/drivers/block/virtio_blk.c
> > > > b/drivers/block/virtio_blk.c index 2bf14a0e2815..59b49899b229
> > > > 100644
> > > > --- a/drivers/block/virtio_blk.c
> > > > +++ b/drivers/block/virtio_blk.c
> > > > @@ -1562,10 +1562,64 @@ static int virtblk_probe(struct
> > > > virtio_device
> > > *vdev)
> > > >  	return err;
> > > >  }
> > > >
> > > > +static bool virtblk_cancel_request(struct request *rq, void *data) {
> > > > +	struct virtblk_req *vbr = blk_mq_rq_to_pdu(rq);
> > > > +
> > > > +	vbr->in_hdr.status = VIRTIO_BLK_S_IOERR;
> > > > +	if (blk_mq_request_started(rq) && !blk_mq_request_completed(rq))
> > > > +		blk_mq_complete_request(rq);
> > > > +
> > > > +	return true;
> > > > +}
> > > > +
> > > > +static void virtblk_cleanup_reqs(struct virtio_blk *vblk) {
> > > > +	struct virtio_blk_vq *blk_vq;
> > > > +	struct request_queue *q;
> > > > +	struct virtqueue *vq;
> > > > +	unsigned long flags;
> > > > +	int i;
> > > > +
> > > > +	vq = vblk->vqs[0].vq;
> > > > +	if (!virtqueue_is_broken(vq))
> > > > +		return;
> > > > +
> > >
> > > What if the surprise happens after the above check?
> > >
> > >
> > In that small timing window, the race still exists.
> >
> > I think, blk_mq_quiesce_queue(q); should move up before cleanup_reqs()
> regardless of surprise case along with other below changes.
> >
> > Additionally, for non-surprise case, better to have a graceful timeout to
> complete already queued requests.
> > In absence of timeout scheme for this regression, shall we only complete the
> requests which the device has already completed (instead of waiting for the
> grace time)?
> > There was past work from Chaitanaya, for the graceful timeout.
> >
> > The sequence for the fix I have in mind is:
> > > 1. quiesce the queue
> > > 2. complete all requests which has completed, with its status
> > > 3. stop the transport (queues)
> > > 4. complete remaining pending requests with error status
> >
> > This should work regardless of surprise case.
> > An additional/optional graceful timeout on non-surprise case can be helpful
> for #2.
> >
> > WDYT?
> 
> All this is unnecessarily hard for drivers... I am thinking maybe after we set
> broken we should go ahead and invoke all callbacks. 

Yes, #2 is about invoking the callbacks.

The issue is not with setting the broken flag. As Ming pointed out, the issue is that we may miss the broken flag being set.

Without a graceful timeout the code is straightforward: it is just a rearrangement of the APIs in this patch with the existing code.

The question is whether we really care about that grace period when the device or driver is already on its exit path and the VQ is not broken.
If we don't wait for the requests in progress, is it OK?


> interrupt handling core is not making it easy for us - we must disable real
> interrupts if we do, and in the past we failed to do it.
> See e.g.
> 
> 
> commit eb4cecb453a19b34d5454b49532e09e9cb0c1529
> Author: Jason Wang <jasowang@redhat.com>
> Date:   Wed Mar 23 11:15:24 2022 +0800
> 
>     Revert "virtio_pci: harden MSI-X interrupts"
> 
>     This reverts commit 9e35276a5344f74d4a3600fc4100b3dd251d5c56.
> Issue
>     were reported for the drivers that are using affinity managed IRQ
>     where manually toggling IRQ status is not expected. And we forget to
>     enable the interrupts in the restore path as well.
> 
>     In the future, we will rework on the interrupt hardening.
> 
>     Fixes: 9e35276a5344 ("virtio_pci: harden MSI-X interrupts")
> 
> 
> 
> If someone can figure out a way to make toggling interrupt state play nice with
> affinity managed interrupts, that would solve a host of issues I feel.
> 
> 
> 
> > > Thanks,
> > > Ming

Michael S. Tsirkin Feb. 19, 2024, 10:47 a.m. UTC | #5

On Mon, Feb 19, 2024 at 10:39:36AM +0000, Parav Pandit wrote:
> > From: Michael S. Tsirkin <mst@redhat.com>
> > Sent: Monday, February 19, 2024 1:45 PM
> > 
> > On Mon, Feb 19, 2024 at 03:14:54AM +0000, Parav Pandit wrote:
> > > Hi Ming,
> > >
> > > > From: Ming Lei <ming.lei@redhat.com>
> > > > Sent: Sunday, February 18, 2024 6:57 PM
> > > >
> > > > On Sat, Feb 17, 2024 at 08:08:48PM +0200, Parav Pandit wrote:
> > > > > When the PCI device is surprise removed, requests won't complete
> > > > > from the device. These IOs are never completed and disk deletion
> > > > > hangs indefinitely.
> > > > >
> > > > > Fix it by aborting the IOs which the device will never complete
> > > > > when the VQ is broken.
> > > > >
> > > > > With this fix now fio completes swiftly.
> > > > > An alternative of IO timeout has been considered, however when the
> > > > > driver knows about unresponsive block device, swiftly clearing
> > > > > them enables users and upper layers to react quickly.
> > > > >
> > > > > Verified with multiple device unplug cycles with pending IOs in
> > > > > virtio used ring and some pending with device.
> > > > >
> > > > > In future instead of VQ broken, a more elegant method can be used.
> > > > > At the moment the patch is kept to its minimal changes given its
> > > > > urgency to fix broken kernels.
> > > > >
> > > > > Fixes: 43bb40c5b926 ("virtio_pci: Support surprise removal of
> > > > > virtio pci device")
> > > > > Cc: stable@vger.kernel.org
> > > > > Reported-by: lirongqing@baidu.com
> > > > > Closes:
> > > > > https://lore.kernel.org/virtualization/c45dd68698cd47238c55fb73ca9
> > > > > b474
> > > > > 1@baidu.com/
> > > > > Co-developed-by: Chaitanya Kulkarni <kch@nvidia.com>
> > > > > Signed-off-by: Chaitanya Kulkarni <kch@nvidia.com>
> > > > > Signed-off-by: Parav Pandit <parav@nvidia.com>
> > > > > ---
> > > > >  drivers/block/virtio_blk.c | 54
> > > > > ++++++++++++++++++++++++++++++++++++++
> > > > >  1 file changed, 54 insertions(+)
> > > > >
> > > > > diff --git a/drivers/block/virtio_blk.c
> > > > > b/drivers/block/virtio_blk.c index 2bf14a0e2815..59b49899b229
> > > > > 100644
> > > > > --- a/drivers/block/virtio_blk.c
> > > > > +++ b/drivers/block/virtio_blk.c
> > > > > @@ -1562,10 +1562,64 @@ static int virtblk_probe(struct
> > > > > virtio_device
> > > > *vdev)
> > > > >  	return err;
> > > > >  }
> > > > >
> > > > > +static bool virtblk_cancel_request(struct request *rq, void *data) {
> > > > > +	struct virtblk_req *vbr = blk_mq_rq_to_pdu(rq);
> > > > > +
> > > > > +	vbr->in_hdr.status = VIRTIO_BLK_S_IOERR;
> > > > > +	if (blk_mq_request_started(rq) && !blk_mq_request_completed(rq))
> > > > > +		blk_mq_complete_request(rq);
> > > > > +
> > > > > +	return true;
> > > > > +}
> > > > > +
> > > > > +static void virtblk_cleanup_reqs(struct virtio_blk *vblk) {
> > > > > +	struct virtio_blk_vq *blk_vq;
> > > > > +	struct request_queue *q;
> > > > > +	struct virtqueue *vq;
> > > > > +	unsigned long flags;
> > > > > +	int i;
> > > > > +
> > > > > +	vq = vblk->vqs[0].vq;
> > > > > +	if (!virtqueue_is_broken(vq))
> > > > > +		return;
> > > > > +
> > > >
> > > > What if the surprise happens after the above check?
> > > >
> > > >
> > > In that small timing window, the race still exists.
> > >
> > > I think, blk_mq_quiesce_queue(q); should move up before cleanup_reqs()
> > regardless of surprise case along with other below changes.
> > >
> > > Additionally, for non-surprise case, better to have a graceful timeout to
> > complete already queued requests.
> > > In absence of timeout scheme for this regression, shall we only complete the
> > requests which the device has already completed (instead of waiting for the
> > grace time)?
> > > There was past work from Chaitanaya, for the graceful timeout.
> > >
> > > The sequence for the fix I have in mind is:
> > > > 1. quiesce the queue
> > > > 2. complete all requests which has completed, with its status
> > > > 3. stop the transport (queues)
> > > > 4. complete remaining pending requests with error status
> > >
> > > This should work regardless of surprise case.
> > > An additional/optional graceful timeout on non-surprise case can be helpful
> > for #2.
> > >
> > > WDYT?
> > 
> > All this is unnecessarily hard for drivers... I am thinking maybe after we set
> > broken we should go ahead and invoke all callbacks. 
> 
> Yes, #2 is about invoking the callbacks.
> 
> The issue is not with setting the flag broken. As Ming pointed, the issue is : we may miss setting the broken.


So if we did get callbacks, we'd be able to test the broken flag in the
callback.
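
A rough userspace sketch of that idea, under stated assumptions: the core marks every VQ broken and then invokes each callback once, and the driver callback tests the flag. `sim_vq`, `sim_blk_done()` and `sim_break_device()` are hypothetical names, and the sketch deliberately ignores the hard part mentioned earlier in the thread, namely keeping real interrupts from running the same callbacks concurrently.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct sim_vq;
typedef void (*sim_cb)(struct sim_vq *vq);

/* Hypothetical model of a virtqueue; not the kernel's struct virtqueue. */
struct sim_vq {
	bool broken;
	sim_cb callback;
	int inflight;	/* requests the device never completed */
	int failed;	/* requests completed with an error by the driver */
};

/* Driver callback: when the VQ is broken, fail everything outstanding. */
void sim_blk_done(struct sim_vq *vq)
{
	if (vq->broken) {
		vq->failed += vq->inflight;
		vq->inflight = 0;
	}
	/* Normal used-ring processing would happen here otherwise. */
}

/* Core side: mark every VQ broken first, then invoke each callback once. */
void sim_break_device(struct sim_vq *vqs, size_t n)
{
	size_t i;

	for (i = 0; i < n; i++)
		vqs[i].broken = true;

	for (i = 0; i < n; i++)
		if (vqs[i].callback)
			vqs[i].callback(&vqs[i]);
}
```

Setting the flag on all VQs before the first callback runs means no callback in this model can observe a mix of broken and healthy queues.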

> Without graceful time out it is straight forward code, just rearrangement of APIs in this patch with existing code.
> 
> The question is : it is really if we really care for that grace period when the device or driver is already on its exit path and VQ is not broken.
> If we don't wait for the request in progress, is it ok?
> 

If we are talking about physical hardware, it seems quite possible that
removal is triggered and then the user gets impatient and yanks the card out.


> > interrupt handling core is not making it easy for us - we must disable real
> > interrupts if we do, and in the past we failed to do it.
> > See e.g.
> > 
> > 
> > commit eb4cecb453a19b34d5454b49532e09e9cb0c1529
> > Author: Jason Wang <jasowang@redhat.com>
> > Date:   Wed Mar 23 11:15:24 2022 +0800
> > 
> >     Revert "virtio_pci: harden MSI-X interrupts"
> > 
> >     This reverts commit 9e35276a5344f74d4a3600fc4100b3dd251d5c56.
> > Issue
> >     were reported for the drivers that are using affinity managed IRQ
> >     where manually toggling IRQ status is not expected. And we forget to
> >     enable the interrupts in the restore path as well.
> > 
> >     In the future, we will rework on the interrupt hardening.
> > 
> >     Fixes: 9e35276a5344 ("virtio_pci: harden MSI-X interrupts")
> > 
> > 
> > 
> > If someone can figure out a way to make toggling interrupt state play nice with
> > affinity managed interrupts, that would solve a host of issues I feel.
> > 
> > 
> > 
> > > > Thanks,
> > > > Ming

Parav Pandit Feb. 20, 2024, 12:03 p.m. UTC | #6

> From: Michael S. Tsirkin <mst@redhat.com>
> Sent: Monday, February 19, 2024 4:17 PM
> 
> On Mon, Feb 19, 2024 at 10:39:36AM +0000, Parav Pandit wrote:
> > > From: Michael S. Tsirkin <mst@redhat.com>
> > > Sent: Monday, February 19, 2024 1:45 PM
> > >
> > > On Mon, Feb 19, 2024 at 03:14:54AM +0000, Parav Pandit wrote:
> > > > Hi Ming,
> > > >
> > > > > From: Ming Lei <ming.lei@redhat.com>
> > > > > Sent: Sunday, February 18, 2024 6:57 PM
> > > > >
> > > > > On Sat, Feb 17, 2024 at 08:08:48PM +0200, Parav Pandit wrote:
> > > > > > When the PCI device is surprise removed, requests won't
> > > > > > complete from the device. These IOs are never completed and
> > > > > > disk deletion hangs indefinitely.
> > > > > >
> > > > > > Fix it by aborting the IOs which the device will never
> > > > > > complete when the VQ is broken.
> > > > > >
> > > > > > With this fix now fio completes swiftly.
> > > > > > An alternative of IO timeout has been considered, however when
> > > > > > the driver knows about unresponsive block device, swiftly
> > > > > > clearing them enables users and upper layers to react quickly.
> > > > > >
> > > > > > Verified with multiple device unplug cycles with pending IOs
> > > > > > in virtio used ring and some pending with device.
> > > > > >
> > > > > > In future instead of VQ broken, a more elegant method can be used.
> > > > > > At the moment the patch is kept to its minimal changes given
> > > > > > its urgency to fix broken kernels.
> > > > > >
> > > > > > Fixes: 43bb40c5b926 ("virtio_pci: Support surprise removal of
> > > > > > virtio pci device")
> > > > > > Cc: stable@vger.kernel.org
> > > > > > Reported-by: lirongqing@baidu.com
> > > > > > Closes:
> > > > > > https://lore.kernel.org/virtualization/c45dd68698cd47238c55fb7
> > > > > > 3ca9
> > > > > > b474
> > > > > > 1@baidu.com/
> > > > > > Co-developed-by: Chaitanya Kulkarni <kch@nvidia.com>
> > > > > > Signed-off-by: Chaitanya Kulkarni <kch@nvidia.com>
> > > > > > Signed-off-by: Parav Pandit <parav@nvidia.com>
> > > > > > ---
> > > > > >  drivers/block/virtio_blk.c | 54
> > > > > > ++++++++++++++++++++++++++++++++++++++
> > > > > >  1 file changed, 54 insertions(+)
> > > > > >
> > > > > > diff --git a/drivers/block/virtio_blk.c
> > > > > > b/drivers/block/virtio_blk.c index 2bf14a0e2815..59b49899b229
> > > > > > 100644
> > > > > > --- a/drivers/block/virtio_blk.c
> > > > > > +++ b/drivers/block/virtio_blk.c
> > > > > > @@ -1562,10 +1562,64 @@ static int virtblk_probe(struct
> > > > > > virtio_device
> > > > > *vdev)
> > > > > >  	return err;
> > > > > >  }
> > > > > >
> > > > > > +static bool virtblk_cancel_request(struct request *rq, void *data) {
> > > > > > +	struct virtblk_req *vbr = blk_mq_rq_to_pdu(rq);
> > > > > > +
> > > > > > +	vbr->in_hdr.status = VIRTIO_BLK_S_IOERR;
> > > > > > +	if (blk_mq_request_started(rq) &&
> !blk_mq_request_completed(rq))
> > > > > > +		blk_mq_complete_request(rq);
> > > > > > +
> > > > > > +	return true;
> > > > > > +}
> > > > > > +
> > > > > > +static void virtblk_cleanup_reqs(struct virtio_blk *vblk) {
> > > > > > +	struct virtio_blk_vq *blk_vq;
> > > > > > +	struct request_queue *q;
> > > > > > +	struct virtqueue *vq;
> > > > > > +	unsigned long flags;
> > > > > > +	int i;
> > > > > > +
> > > > > > +	vq = vblk->vqs[0].vq;
> > > > > > +	if (!virtqueue_is_broken(vq))
> > > > > > +		return;
> > > > > > +
> > > > >
> > > > > What if the surprise happens after the above check?
> > > > >
> > > > >
> > > > In that small timing window, the race still exists.
> > > >
> > > > I think, blk_mq_quiesce_queue(q); should move up before
> > > > cleanup_reqs()
> > > regardless of surprise case along with other below changes.
> > > >
> > > > Additionally, for non-surprise case, better to have a graceful
> > > > timeout to
> > > complete already queued requests.
> > > > In absence of timeout scheme for this regression, shall we only
> > > > complete the
> > > requests which the device has already completed (instead of waiting
> > > for the grace time)?
> > > > There was past work from Chaitanaya, for the graceful timeout.
> > > >
> > > > The sequence for the fix I have in mind is:
> > > > > 1. quiesce the queue
> > > > > 2. complete all requests which has completed, with its status
> > > > > 3. stop the transport (queues)
> > > > > 4. complete remaining pending requests with error status
> > > >
> > > > This should work regardless of surprise case.
> > > > An additional/optional graceful timeout on non-surprise case can
> > > > be helpful
> > > for #2.
> > > >
> > > > WDYT?
> > >
> > > All this is unnecessarily hard for drivers... I am thinking maybe
> > > after we set broken we should go ahead and invoke all callbacks.
> >
> > Yes, #2 is about invoking the callbacks.
> >
> > The issue is not with setting the flag broken. As Ming pointed, the issue is :
> we may miss setting the broken.
> 
> 
> So if we did get callbacks, we'd be able to test broken flag in the callback.
> 
Yes, getting callbacks is fine.
But when the device is surprise removed, we won't get the callbacks, and completions are missed.

> > Without graceful time out it is straight forward code, just rearrangement of
> APIs in this patch with existing code.
> >
> > The question is : it is really if we really care for that grace period when the
> device or driver is already on its exit path and VQ is not broken.
> > If we don't wait for the request in progress, is it ok?
> >
> 
> If we are talking about physical hardware, it seems quite possible that removal
> triggers then user gets impatient and yanks the card out.
> 
Yes, regardless of surprise or not, completing the remaining IOs is good enough.
The device is on its exit path anyway, so completing 10 commands vs 12 does not make much difference and is not worth the extra complexity of a timeout.

So it is better not to complicate the driver, at least not in a patch that carries a Fixes tag.

> 
> > > interrupt handling core is not making it easy for us - we must
> > > disable real interrupts if we do, and in the past we failed to do it.
> > > See e.g.
> > >
> > >
> > > commit eb4cecb453a19b34d5454b49532e09e9cb0c1529
> > > Author: Jason Wang <jasowang@redhat.com>
> > > Date:   Wed Mar 23 11:15:24 2022 +0800
> > >
> > >     Revert "virtio_pci: harden MSI-X interrupts"
> > >
> > >     This reverts commit 9e35276a5344f74d4a3600fc4100b3dd251d5c56.
> > > Issue
> > >     were reported for the drivers that are using affinity managed IRQ
> > >     where manually toggling IRQ status is not expected. And we forget to
> > >     enable the interrupts in the restore path as well.
> > >
> > >     In the future, we will rework on the interrupt hardening.
> > >
> > >     Fixes: 9e35276a5344 ("virtio_pci: harden MSI-X interrupts")
> > >
> > >
> > >
> > > If someone can figure out a way to make toggling interrupt state
> > > play nice with affinity managed interrupts, that would solve a host of
> issues I feel.
> > >
> > >
> > >
> > > > > Thanks,
> > > > > Ming

Michael S. Tsirkin Feb. 20, 2024, 12:16 p.m. UTC | #7

On Tue, Feb 20, 2024 at 12:03:15PM +0000, Parav Pandit wrote:
> 
> > From: Michael S. Tsirkin <mst@redhat.com>
> > Sent: Monday, February 19, 2024 4:17 PM
> > 
> > On Mon, Feb 19, 2024 at 10:39:36AM +0000, Parav Pandit wrote:
> > > > From: Michael S. Tsirkin <mst@redhat.com>
> > > > Sent: Monday, February 19, 2024 1:45 PM
> > > >
> > > > On Mon, Feb 19, 2024 at 03:14:54AM +0000, Parav Pandit wrote:
> > > > > Hi Ming,
> > > > >
> > > > > > From: Ming Lei <ming.lei@redhat.com>
> > > > > > Sent: Sunday, February 18, 2024 6:57 PM
> > > > > >
> > > > > > On Sat, Feb 17, 2024 at 08:08:48PM +0200, Parav Pandit wrote:
> > > > > > > When the PCI device is surprise removed, requests won't
> > > > > > > complete from the device. These IOs are never completed and
> > > > > > > disk deletion hangs indefinitely.
> > > > > > >
> > > > > > > Fix it by aborting the IOs which the device will never
> > > > > > > complete when the VQ is broken.
> > > > > > >
> > > > > > > With this fix now fio completes swiftly.
> > > > > > > An alternative of IO timeout has been considered, however when
> > > > > > > the driver knows about unresponsive block device, swiftly
> > > > > > > clearing them enables users and upper layers to react quickly.
> > > > > > >
> > > > > > > Verified with multiple device unplug cycles with pending IOs
> > > > > > > in virtio used ring and some pending with device.
> > > > > > >
> > > > > > > In future instead of VQ broken, a more elegant method can be used.
> > > > > > > At the moment the patch is kept to its minimal changes given
> > > > > > > its urgency to fix broken kernels.
> > > > > > >
> > > > > > > Fixes: 43bb40c5b926 ("virtio_pci: Support surprise removal of
> > > > > > > virtio pci device")
> > > > > > > Cc: stable@vger.kernel.org
> > > > > > > Reported-by: lirongqing@baidu.com
> > > > > > > Closes:
> > > > > > > https://lore.kernel.org/virtualization/c45dd68698cd47238c55fb7
> > > > > > > 3ca9
> > > > > > > b474
> > > > > > > 1@baidu.com/
> > > > > > > Co-developed-by: Chaitanya Kulkarni <kch@nvidia.com>
> > > > > > > Signed-off-by: Chaitanya Kulkarni <kch@nvidia.com>
> > > > > > > Signed-off-by: Parav Pandit <parav@nvidia.com>
> > > > > > > ---
> > > > > > >  drivers/block/virtio_blk.c | 54
> > > > > > > ++++++++++++++++++++++++++++++++++++++
> > > > > > >  1 file changed, 54 insertions(+)
> > > > > > >
> > > > > > > diff --git a/drivers/block/virtio_blk.c
> > > > > > > b/drivers/block/virtio_blk.c index 2bf14a0e2815..59b49899b229
> > > > > > > 100644
> > > > > > > --- a/drivers/block/virtio_blk.c
> > > > > > > +++ b/drivers/block/virtio_blk.c
> > > > > > > @@ -1562,10 +1562,64 @@ static int virtblk_probe(struct
> > > > > > > virtio_device
> > > > > > *vdev)
> > > > > > >  	return err;
> > > > > > >  }
> > > > > > >
> > > > > > > +static bool virtblk_cancel_request(struct request *rq, void *data) {
> > > > > > > +	struct virtblk_req *vbr = blk_mq_rq_to_pdu(rq);
> > > > > > > +
> > > > > > > +	vbr->in_hdr.status = VIRTIO_BLK_S_IOERR;
> > > > > > > +	if (blk_mq_request_started(rq) &&
> > !blk_mq_request_completed(rq))
> > > > > > > +		blk_mq_complete_request(rq);
> > > > > > > +
> > > > > > > +	return true;
> > > > > > > +}
> > > > > > > +
> > > > > > > +static void virtblk_cleanup_reqs(struct virtio_blk *vblk) {
> > > > > > > +	struct virtio_blk_vq *blk_vq;
> > > > > > > +	struct request_queue *q;
> > > > > > > +	struct virtqueue *vq;
> > > > > > > +	unsigned long flags;
> > > > > > > +	int i;
> > > > > > > +
> > > > > > > +	vq = vblk->vqs[0].vq;
> > > > > > > +	if (!virtqueue_is_broken(vq))
> > > > > > > +		return;
> > > > > > > +
> > > > > >
> > > > > > What if the surprise happens after the above check?
> > > > > >
> > > > > >
> > > > > In that small timing window, the race still exists.
> > > > >
> > > > > I think, blk_mq_quiesce_queue(q); should move up before
> > > > > cleanup_reqs()
> > > > regardless of surprise case along with other below changes.
> > > > >
> > > > > Additionally, for non-surprise case, better to have a graceful
> > > > > timeout to
> > > > complete already queued requests.
> > > > > In absence of timeout scheme for this regression, shall we only
> > > > > complete the
> > > > requests which the device has already completed (instead of waiting
> > > > for the grace time)?
> > > > > There was past work from Chaitanaya, for the graceful timeout.
> > > > >
> > > > > The sequence for the fix I have in mind is:
> > > > > 1. quiesce the queue
> > > > > 2. complete all requests which has completed, with its status 3.
> > > > > stop the transport (queues) 4. complete remaining pending requests
> > > > > with error status
> > > > >
> > > > > This should work regardless of surprise case.
> > > > > An additional/optional graceful timeout on non-surprise case can
> > > > > be helpful
> > > > for #2.
> > > > >
> > > > > WDYT?
> > > >
> > > > All this is unnecessarily hard for drivers... I am thinking maybe
> > > > after we set broken we should go ahead and invoke all callbacks.
> > >
> > > Yes, #2 is about invoking the callbacks.
> > >
> > > The issue is not with setting the flag broken. As Ming pointed, the issue is :
> > we may miss setting the broken.
> > 
> > 
> > So if we did get callbacks, we'd be able to test broken flag in the callback.
> > 
> Yes, getting callbacks is fine.
> But when the device is surprise removed, we wont get the callbacks and completions are missed.

Exactly, and then we should trigger them ourselves.

> > > Without graceful time out it is straight forward code, just rearrangement of
> > APIs in this patch with existing code.
> > >
> > > The question is : it is really if we really care for that grace period when the
> > device or driver is already on its exit path and VQ is not broken.
> > > If we don't wait for the request in progress, is it ok?
> > >
> > 
> > If we are talking about physical hardware, it seems quite possible that removal
> > triggers then user gets impatient and yanks the card out.
> > 
> Yes, regardless of surprise or not, completing the remaining IOs is just good enough.
> Device is anyway on its exit path, so completing 10 commands vs 12, does not make a lot of difference with extra complexity of timeout.
> 
> So better to not complicate the driver, at least not when adding Fixes tag patch.
> 
> > 
> > > > interrupt handling core is not making it easy for us - we must
> > > > disable real interrupts if we do, and in the past we failed to do it.
> > > > See e.g.
> > > >
> > > >
> > > > commit eb4cecb453a19b34d5454b49532e09e9cb0c1529
> > > > Author: Jason Wang <jasowang@redhat.com>
> > > > Date:   Wed Mar 23 11:15:24 2022 +0800
> > > >
> > > >     Revert "virtio_pci: harden MSI-X interrupts"
> > > >
> > > >     This reverts commit 9e35276a5344f74d4a3600fc4100b3dd251d5c56.
> > > > Issue
> > > >     were reported for the drivers that are using affinity managed IRQ
> > > >     where manually toggling IRQ status is not expected. And we forget to
> > > >     enable the interrupts in the restore path as well.
> > > >
> > > >     In the future, we will rework on the interrupt hardening.
> > > >
> > > >     Fixes: 9e35276a5344 ("virtio_pci: harden MSI-X interrupts")
> > > >
> > > >
> > > >
> > > > If someone can figure out a way to make toggling interrupt state
> > > > play nice with affinity managed interrupts, that would solve a host of
> > issues I feel.
> > > >
> > > >
> > > >
> > > > > > Thanks,
> > > > > > Ming

Stefan Hajnoczi Feb. 20, 2024, 10:05 p.m. UTC | #8

On Sat, Feb 17, 2024 at 08:08:48PM +0200, Parav Pandit wrote:
> When the PCI device is surprise removed, requests won't complete from
> the device. These IOs are never completed and disk deletion hangs
> indefinitely.
> 
> Fix it by aborting the IOs which the device will never complete
> when the VQ is broken.
> 
> With this fix now fio completes swiftly.
> An alternative of IO timeout has been considered, however
> when the driver knows about unresponsive block device, swiftly clearing
> them enables users and upper layers to react quickly.
> 
> Verified with multiple device unplug cycles with pending IOs in virtio
> used ring and some pending with device.
> 
> In future instead of VQ broken, a more elegant method can be used. At the
> moment the patch is kept to its minimal changes given its urgency to fix
> broken kernels.
> 
> Fixes: 43bb40c5b926 ("virtio_pci: Support surprise removal of virtio pci device")
> Cc: stable@vger.kernel.org
> Reported-by: lirongqing@baidu.com
> Closes: https://lore.kernel.org/virtualization/c45dd68698cd47238c55fb73ca9b4741@baidu.com/
> Co-developed-by: Chaitanya Kulkarni <kch@nvidia.com>
> Signed-off-by: Chaitanya Kulkarni <kch@nvidia.com>
> Signed-off-by: Parav Pandit <parav@nvidia.com>
> ---
>  drivers/block/virtio_blk.c | 54 ++++++++++++++++++++++++++++++++++++++
>  1 file changed, 54 insertions(+)
> 
> diff --git a/drivers/block/virtio_blk.c b/drivers/block/virtio_blk.c
> index 2bf14a0e2815..59b49899b229 100644
> --- a/drivers/block/virtio_blk.c
> +++ b/drivers/block/virtio_blk.c
> @@ -1562,10 +1562,64 @@ static int virtblk_probe(struct virtio_device *vdev)
>  	return err;
>  }
>  
> +static bool virtblk_cancel_request(struct request *rq, void *data)
> +{
> +	struct virtblk_req *vbr = blk_mq_rq_to_pdu(rq);
> +
> +	vbr->in_hdr.status = VIRTIO_BLK_S_IOERR;
> +	if (blk_mq_request_started(rq) && !blk_mq_request_completed(rq))
> +		blk_mq_complete_request(rq);
> +
> +	return true;
> +}
> +
> +static void virtblk_cleanup_reqs(struct virtio_blk *vblk)
> +{
> +	struct virtio_blk_vq *blk_vq;
> +	struct request_queue *q;
> +	struct virtqueue *vq;
> +	unsigned long flags;
> +	int i;
> +
> +	vq = vblk->vqs[0].vq;
> +	if (!virtqueue_is_broken(vq))
> +		return;
> +
> +	q = vblk->disk->queue;
> +	/* Block upper layer to not get any new requests */
> +	blk_mq_quiesce_queue(q);
> +
> +	for (i = 0; i < vblk->num_vqs; i++) {
> +		blk_vq = &vblk->vqs[i];
> +
> +		/* Synchronize with any ongoing virtblk_poll() which may be
> +		 * completing the requests to uppper layer which has already
> +		 * crossed the broken vq check.
> +		 */
> +		spin_lock_irqsave(&blk_vq->lock, flags);
> +		spin_unlock_irqrestore(&blk_vq->lock, flags);
> +	}
> +
> +	blk_sync_queue(q);
> +
> +	/* Complete remaining pending requests with error */
> +	blk_mq_tagset_busy_iter(&vblk->tag_set, virtblk_cancel_request, vblk);

Interrupts can still occur here. What prevents the race between
virtblk_cancel_request() and virtblk_request_done()?

> +	blk_mq_tagset_wait_completed_request(&vblk->tag_set);
> +
> +	/*
> +	 * Unblock any pending dispatch I/Os before we destroy device. From
> +	 * del_gendisk() -> __blk_mark_disk_dead(disk) will set GD_DEAD flag,
> +	 * that will make sure any new I/O from bio_queue_enter() to fail.
> +	 */
> +	blk_mq_unquiesce_queue(q);
> +}
> +
>  static void virtblk_remove(struct virtio_device *vdev)
>  {
>  	struct virtio_blk *vblk = vdev->priv;
>  
> +	virtblk_cleanup_reqs(vblk);
> +
>  	/* Make sure no work handler is accessing the device. */
>  	flush_work(&vblk->config_work);
>  
> -- 
> 2.34.1
>

Parav Pandit Feb. 22, 2024, 4:46 a.m. UTC | #9

> From: Stefan Hajnoczi <stefanha@redhat.com>
> Sent: Wednesday, February 21, 2024 3:35 AM
> To: Parav Pandit <parav@nvidia.com>
> 
> On Sat, Feb 17, 2024 at 08:08:48PM +0200, Parav Pandit wrote:
> > When the PCI device is surprise removed, requests won't complete from
> > the device. These IOs are never completed and disk deletion hangs
> > indefinitely.
> >
> > Fix it by aborting the IOs which the device will never complete when
> > the VQ is broken.
> >
> > With this fix now fio completes swiftly.
> > An alternative of IO timeout has been considered, however when the
> > driver knows about unresponsive block device, swiftly clearing them
> > enables users and upper layers to react quickly.
> >
> > Verified with multiple device unplug cycles with pending IOs in virtio
> > used ring and some pending with device.
> >
> > In future instead of VQ broken, a more elegant method can be used. At
> > the moment the patch is kept to its minimal changes given its urgency
> > to fix broken kernels.
> >
> > Fixes: 43bb40c5b926 ("virtio_pci: Support surprise removal of virtio
> > pci device")
> > Cc: stable@vger.kernel.org
> > Reported-by: lirongqing@baidu.com
> > Closes:
> > https://lore.kernel.org/virtualization/c45dd68698cd47238c55fb73ca9b474
> > 1@baidu.com/
> > Co-developed-by: Chaitanya Kulkarni <kch@nvidia.com>
> > Signed-off-by: Chaitanya Kulkarni <kch@nvidia.com>
> > Signed-off-by: Parav Pandit <parav@nvidia.com>
> > ---
> >  drivers/block/virtio_blk.c | 54
> > ++++++++++++++++++++++++++++++++++++++
> >  1 file changed, 54 insertions(+)
> >
> > diff --git a/drivers/block/virtio_blk.c b/drivers/block/virtio_blk.c
> > index 2bf14a0e2815..59b49899b229 100644
> > --- a/drivers/block/virtio_blk.c
> > +++ b/drivers/block/virtio_blk.c
> > @@ -1562,10 +1562,64 @@ static int virtblk_probe(struct virtio_device
> *vdev)
> >  	return err;
> >  }
> >
> > +static bool virtblk_cancel_request(struct request *rq, void *data) {
> > +	struct virtblk_req *vbr = blk_mq_rq_to_pdu(rq);
> > +
> > +	vbr->in_hdr.status = VIRTIO_BLK_S_IOERR;
> > +	if (blk_mq_request_started(rq) && !blk_mq_request_completed(rq))
> > +		blk_mq_complete_request(rq);
> > +
> > +	return true;
> > +}
> > +
> > +static void virtblk_cleanup_reqs(struct virtio_blk *vblk) {
> > +	struct virtio_blk_vq *blk_vq;
> > +	struct request_queue *q;
> > +	struct virtqueue *vq;
> > +	unsigned long flags;
> > +	int i;
> > +
> > +	vq = vblk->vqs[0].vq;
> > +	if (!virtqueue_is_broken(vq))
> > +		return;
> > +
> > +	q = vblk->disk->queue;
> > +	/* Block upper layer to not get any new requests */
> > +	blk_mq_quiesce_queue(q);
> > +
> > +	for (i = 0; i < vblk->num_vqs; i++) {
> > +		blk_vq = &vblk->vqs[i];
> > +
> > +		/* Synchronize with any ongoing virtblk_poll() which may be
> > +		 * completing the requests to uppper layer which has already
> > +		 * crossed the broken vq check.
> > +		 */
> > +		spin_lock_irqsave(&blk_vq->lock, flags);
> > +		spin_unlock_irqrestore(&blk_vq->lock, flags);
> > +	}
> > +
> > +	blk_sync_queue(q);
> > +
> > +	/* Complete remaining pending requests with error */
> > +	blk_mq_tagset_busy_iter(&vblk->tag_set, virtblk_cancel_request,
> > +vblk);
> 
> Interrupts can still occur here. What prevents the race between
> virtblk_cancel_request() and virtblk_request_done()?
>
The PCI device which generates the interrupt is already removed, so an interrupt shouldn't arrive while cancel_request is executing.
(This ignores the race that Ming pointed out. I am preparing the v1 that eliminates that condition.)

Any ongoing virtblk_request_done() is synchronized by the for loop above.

 
> > +	blk_mq_tagset_wait_completed_request(&vblk->tag_set);
> > +
> > +	/*
> > +	 * Unblock any pending dispatch I/Os before we destroy device. From
> > +	 * del_gendisk() -> __blk_mark_disk_dead(disk) will set GD_DEAD
> flag,
> > +	 * that will make sure any new I/O from bio_queue_enter() to fail.
> > +	 */
> > +	blk_mq_unquiesce_queue(q);
> > +}
> > +
> >  static void virtblk_remove(struct virtio_device *vdev)  {
> >  	struct virtio_blk *vblk = vdev->priv;
> >
> > +	virtblk_cleanup_reqs(vblk);
> > +
> >  	/* Make sure no work handler is accessing the device. */
> >  	flush_work(&vblk->config_work);
> >
> > --
> > 2.34.1
> >

Stefan Hajnoczi Feb. 22, 2024, 3:23 p.m. UTC | #10

On Thu, Feb 22, 2024 at 04:46:38AM +0000, Parav Pandit wrote:
> 
> 
> > From: Stefan Hajnoczi <stefanha@redhat.com>
> > Sent: Wednesday, February 21, 2024 3:35 AM
> > To: Parav Pandit <parav@nvidia.com>
> > 
> > On Sat, Feb 17, 2024 at 08:08:48PM +0200, Parav Pandit wrote:
> > > When the PCI device is surprise removed, requests won't complete from
> > > the device. These IOs are never completed and disk deletion hangs
> > > indefinitely.
> > >
> > > Fix it by aborting the IOs which the device will never complete when
> > > the VQ is broken.
> > >
> > > With this fix now fio completes swiftly.
> > > An alternative of IO timeout has been considered, however when the
> > > driver knows about unresponsive block device, swiftly clearing them
> > > enables users and upper layers to react quickly.
> > >
> > > Verified with multiple device unplug cycles with pending IOs in virtio
> > > used ring and some pending with device.
> > >
> > > In future instead of VQ broken, a more elegant method can be used. At
> > > the moment the patch is kept to its minimal changes given its urgency
> > > to fix broken kernels.
> > >
> > > Fixes: 43bb40c5b926 ("virtio_pci: Support surprise removal of virtio
> > > pci device")
> > > Cc: stable@vger.kernel.org
> > > Reported-by: lirongqing@baidu.com
> > > Closes:
> > > https://lore.kernel.org/virtualization/c45dd68698cd47238c55fb73ca9b474
> > > 1@baidu.com/
> > > Co-developed-by: Chaitanya Kulkarni <kch@nvidia.com>
> > > Signed-off-by: Chaitanya Kulkarni <kch@nvidia.com>
> > > Signed-off-by: Parav Pandit <parav@nvidia.com>
> > > ---
> > >  drivers/block/virtio_blk.c | 54
> > > ++++++++++++++++++++++++++++++++++++++
> > >  1 file changed, 54 insertions(+)
> > >
> > > diff --git a/drivers/block/virtio_blk.c b/drivers/block/virtio_blk.c
> > > index 2bf14a0e2815..59b49899b229 100644
> > > --- a/drivers/block/virtio_blk.c
> > > +++ b/drivers/block/virtio_blk.c
> > > @@ -1562,10 +1562,64 @@ static int virtblk_probe(struct virtio_device
> > *vdev)
> > >  	return err;
> > >  }
> > >
> > > +static bool virtblk_cancel_request(struct request *rq, void *data) {
> > > +	struct virtblk_req *vbr = blk_mq_rq_to_pdu(rq);
> > > +
> > > +	vbr->in_hdr.status = VIRTIO_BLK_S_IOERR;
> > > +	if (blk_mq_request_started(rq) && !blk_mq_request_completed(rq))
> > > +		blk_mq_complete_request(rq);
> > > +
> > > +	return true;
> > > +}
> > > +
> > > +static void virtblk_cleanup_reqs(struct virtio_blk *vblk) {
> > > +	struct virtio_blk_vq *blk_vq;
> > > +	struct request_queue *q;
> > > +	struct virtqueue *vq;
> > > +	unsigned long flags;
> > > +	int i;
> > > +
> > > +	vq = vblk->vqs[0].vq;
> > > +	if (!virtqueue_is_broken(vq))
> > > +		return;
> > > +
> > > +	q = vblk->disk->queue;
> > > +	/* Block upper layer to not get any new requests */
> > > +	blk_mq_quiesce_queue(q);
> > > +
> > > +	for (i = 0; i < vblk->num_vqs; i++) {
> > > +		blk_vq = &vblk->vqs[i];
> > > +
> > > +		/* Synchronize with any ongoing virtblk_poll() which may be
> > > +		 * completing the requests to uppper layer which has already
> > > +		 * crossed the broken vq check.
> > > +		 */
> > > +		spin_lock_irqsave(&blk_vq->lock, flags);
> > > +		spin_unlock_irqrestore(&blk_vq->lock, flags);
> > > +	}
> > > +
> > > +	blk_sync_queue(q);
> > > +
> > > +	/* Complete remaining pending requests with error */
> > > +	blk_mq_tagset_busy_iter(&vblk->tag_set, virtblk_cancel_request,
> > > +vblk);
> > 
> > Interrupts can still occur here. What prevents the race between
> > virtblk_cancel_request() and virtblk_request_done()?
> >
> The PCI device which generates the interrupt is already removed so interrupt shouldn't arrive when executing cancel_request.
> (This is ignoring the race that Ming pointed out. I am preparing the v1 that eliminates such condition.)
> 
> If there was ongoing virtblk_request_done() is synchronized by the for loop above.

Ah, I see now that:

+if (!virtqueue_is_broken(vq))
+    return;

relates to:

static void virtio_pci_remove(struct pci_dev *pci_dev)
{
	struct virtio_pci_device *vp_dev = pci_get_drvdata(pci_dev);
	struct device *dev = get_device(&vp_dev->vdev.dev);

	/*
	 * Device is marked broken on surprise removal so that virtio upper
	 * layers can abort any ongoing operation.
	 */
	if (!pci_device_is_present(pci_dev))
		virtio_break_device(&vp_dev->vdev);

Could you please rename virtblk_cleanup_reqs() to virtblk_cleanup_broken_device()
or similar, so it's clear that this function only applies when the device
is broken? For example, it won't handle ACPI hot unplug requests because
the device will still be present.
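
To make the gating explicit, here is a user-space sketch of the two removal paths. The struct and function names are hypothetical; only the relationship between device presence, the broken flag and the early return mirrors the code quoted above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct toy_dev {
	bool present;	/* models the result of pci_device_is_present() */
	bool broken;	/* models the flag tested by virtqueue_is_broken() */
	int pending;	/* requests the device has not completed */
	int failed;	/* requests completed by the driver with an error */
};

/* Models the transport remove path: mark broken only on surprise removal. */
static void toy_transport_remove(struct toy_dev *d)
{
	assert(d != NULL);
	if (!d->present)
		d->broken = true;
}

/* Models the cleanup helper: a no-op unless the device is broken. */
static void toy_cleanup_broken_device(struct toy_dev *d)
{
	assert(d != NULL);
	if (!d->broken)
		return;
	d->failed += d->pending;
	d->pending = 0;
}
```

With an orderly hot unplug the device is still present, so the helper returns early and the pending requests are left to the normal removal path.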

Thanks,
Stefan

> 
>  
> > > +	blk_mq_tagset_wait_completed_request(&vblk->tag_set);
> > > +
> > > +	/*
> > > +	 * Unblock any pending dispatch I/Os before we destroy device. From
> > > +	 * del_gendisk() -> __blk_mark_disk_dead(disk) will set GD_DEAD
> > flag,
> > > +	 * that will make sure any new I/O from bio_queue_enter() to fail.
> > > +	 */
> > > +	blk_mq_unquiesce_queue(q);
> > > +}
> > > +
> > >  static void virtblk_remove(struct virtio_device *vdev)  {
> > >  	struct virtio_blk *vblk = vdev->priv;
> > >
> > > +	virtblk_cleanup_reqs(vblk);
> > > +
> > >  	/* Make sure no work handler is accessing the device. */
> > >  	flush_work(&vblk->config_work);
> > >
> > > --
> > > 2.34.1
> > >
>

Michael S. Tsirkin Feb. 22, 2024, 3:31 p.m. UTC | #11

On Thu, Feb 22, 2024 at 10:23:28AM -0500, Stefan Hajnoczi wrote:
> On Thu, Feb 22, 2024 at 04:46:38AM +0000, Parav Pandit wrote:
> > 
> > 
> > > From: Stefan Hajnoczi <stefanha@redhat.com>
> > > Sent: Wednesday, February 21, 2024 3:35 AM
> > > To: Parav Pandit <parav@nvidia.com>
> > > 
> > > On Sat, Feb 17, 2024 at 08:08:48PM +0200, Parav Pandit wrote:
> > > > When the PCI device is surprise removed, requests won't complete from
> > > > the device. These IOs are never completed and disk deletion hangs
> > > > indefinitely.
> > > >
> > > > Fix it by aborting the IOs which the device will never complete when
> > > > the VQ is broken.
> > > >
> > > > With this fix now fio completes swiftly.
> > > > An alternative of IO timeout has been considered, however when the
> > > > driver knows about unresponsive block device, swiftly clearing them
> > > > enables users and upper layers to react quickly.
> > > >
> > > > Verified with multiple device unplug cycles with pending IOs in virtio
> > > > used ring and some pending with device.
> > > >
> > > > In future instead of VQ broken, a more elegant method can be used. At
> > > > the moment the patch is kept to its minimal changes given its urgency
> > > > to fix broken kernels.
> > > >
> > > > Fixes: 43bb40c5b926 ("virtio_pci: Support surprise removal of virtio
> > > > pci device")
> > > > Cc: stable@vger.kernel.org
> > > > Reported-by: lirongqing@baidu.com
> > > > Closes:
> > > > https://lore.kernel.org/virtualization/c45dd68698cd47238c55fb73ca9b474
> > > > 1@baidu.com/
> > > > Co-developed-by: Chaitanya Kulkarni <kch@nvidia.com>
> > > > Signed-off-by: Chaitanya Kulkarni <kch@nvidia.com>
> > > > Signed-off-by: Parav Pandit <parav@nvidia.com>
> > > > ---
> > > >  drivers/block/virtio_blk.c | 54
> > > > ++++++++++++++++++++++++++++++++++++++
> > > >  1 file changed, 54 insertions(+)
> > > >
> > > > diff --git a/drivers/block/virtio_blk.c b/drivers/block/virtio_blk.c
> > > > index 2bf14a0e2815..59b49899b229 100644
> > > > --- a/drivers/block/virtio_blk.c
> > > > +++ b/drivers/block/virtio_blk.c
> > > > @@ -1562,10 +1562,64 @@ static int virtblk_probe(struct virtio_device
> > > *vdev)
> > > >  	return err;
> > > >  }
> > > >
> > > > +static bool virtblk_cancel_request(struct request *rq, void *data) {
> > > > +	struct virtblk_req *vbr = blk_mq_rq_to_pdu(rq);
> > > > +
> > > > +	vbr->in_hdr.status = VIRTIO_BLK_S_IOERR;
> > > > +	if (blk_mq_request_started(rq) && !blk_mq_request_completed(rq))
> > > > +		blk_mq_complete_request(rq);
> > > > +
> > > > +	return true;
> > > > +}
> > > > +
> > > > +static void virtblk_cleanup_reqs(struct virtio_blk *vblk) {
> > > > +	struct virtio_blk_vq *blk_vq;
> > > > +	struct request_queue *q;
> > > > +	struct virtqueue *vq;
> > > > +	unsigned long flags;
> > > > +	int i;
> > > > +
> > > > +	vq = vblk->vqs[0].vq;
> > > > +	if (!virtqueue_is_broken(vq))
> > > > +		return;
> > > > +
> > > > +	q = vblk->disk->queue;
> > > > +	/* Block upper layer to not get any new requests */
> > > > +	blk_mq_quiesce_queue(q);
> > > > +
> > > > +	for (i = 0; i < vblk->num_vqs; i++) {
> > > > +		blk_vq = &vblk->vqs[i];
> > > > +
> > > > +		/* Synchronize with any ongoing virtblk_poll() which may be
> > > > +		 * completing the requests to uppper layer which has already
> > > > +		 * crossed the broken vq check.
> > > > +		 */
> > > > +		spin_lock_irqsave(&blk_vq->lock, flags);
> > > > +		spin_unlock_irqrestore(&blk_vq->lock, flags);
> > > > +	}
> > > > +
> > > > +	blk_sync_queue(q);
> > > > +
> > > > +	/* Complete remaining pending requests with error */
> > > > +	blk_mq_tagset_busy_iter(&vblk->tag_set, virtblk_cancel_request,
> > > > +vblk);
> > > 
> > > Interrupts can still occur here. What prevents the race between
> > > virtblk_cancel_request() and virtblk_request_done()?
> > >
> > The PCI device which generates the interrupt is already removed so interrupt shouldn't arrive when executing cancel_request.
> > (This is ignoring the race that Ming pointed out. I am preparing the v1 that eliminates such condition.)
> > 
> > If there was ongoing virtblk_request_done() is synchronized by the for loop above.
> 
> Ah, I see now that:
> 
> +if (!virtqueue_is_broken(vq))
> +    return;
> 
> relates to:
> 
> static void virtio_pci_remove(struct pci_dev *pci_dev)
> {
> 	struct virtio_pci_device *vp_dev = pci_get_drvdata(pci_dev);
> 	struct device *dev = get_device(&vp_dev->vdev.dev);
> 
> 	/*
> 	 * Device is marked broken on surprise removal so that virtio upper
> 	 * layers can abort any ongoing operation.
> 	 */
> 	if (!pci_device_is_present(pci_dev))
> 		virtio_break_device(&vp_dev->vdev);

It's not 100% reliable though. We did it opportunistically but if you
suddenly want to rely on it then you need to also synchronize
callbacks.

> Please rename virtblk_cleanup_reqs() to virtblk_cleanup_broken_device()
> or similar so it's clear that this function only applies when the device
> is broken? For example, it won't handle ACPI hot unplug requests because
> the device will still be present.
> 
> Thanks,
> Stefan
> 
> > 
> >  
> > > > +	blk_mq_tagset_wait_completed_request(&vblk->tag_set);
> > > > +
> > > > +	/*
> > > > +	 * Unblock any pending dispatch I/Os before we destroy device. From
> > > > +	 * del_gendisk() -> __blk_mark_disk_dead(disk) will set GD_DEAD
> > > flag,
> > > > +	 * that will make sure any new I/O from bio_queue_enter() to fail.
> > > > +	 */
> > > > +	blk_mq_unquiesce_queue(q);
> > > > +}
> > > > +
> > > >  static void virtblk_remove(struct virtio_device *vdev)  {
> > > >  	struct virtio_blk *vblk = vdev->priv;
> > > >
> > > > +	virtblk_cleanup_reqs(vblk);
> > > > +
> > > >  	/* Make sure no work handler is accessing the device. */
> > > >  	flush_work(&vblk->config_work);
> > > >
> > > > --
> > > > 2.34.1
> > > >
> >

Michael S. Tsirkin Feb. 22, 2024, 3:38 p.m. UTC | #12

On Thu, Feb 22, 2024 at 04:46:38AM +0000, Parav Pandit wrote:
> 
> 
> > From: Stefan Hajnoczi <stefanha@redhat.com>
> > Sent: Wednesday, February 21, 2024 3:35 AM
> > To: Parav Pandit <parav@nvidia.com>
> > 
> > On Sat, Feb 17, 2024 at 08:08:48PM +0200, Parav Pandit wrote:
> > > When the PCI device is surprise removed, requests won't complete from
> > > the device. These IOs are never completed and disk deletion hangs
> > > indefinitely.
> > >
> > > Fix it by aborting the IOs which the device will never complete when
> > > the VQ is broken.
> > >
> > > With this fix now fio completes swiftly.
> > > An alternative of IO timeout has been considered, however when the
> > > driver knows about unresponsive block device, swiftly clearing them
> > > enables users and upper layers to react quickly.
> > >
> > > Verified with multiple device unplug cycles with pending IOs in virtio
> > > used ring and some pending with device.
> > >
> > > In future instead of VQ broken, a more elegant method can be used. At
> > > the moment the patch is kept to its minimal changes given its urgency
> > > to fix broken kernels.
> > >
> > > Fixes: 43bb40c5b926 ("virtio_pci: Support surprise removal of virtio
> > > pci device")
> > > Cc: stable@vger.kernel.org
> > > Reported-by: lirongqing@baidu.com
> > > Closes:
> > > https://lore.kernel.org/virtualization/c45dd68698cd47238c55fb73ca9b474
> > > 1@baidu.com/
> > > Co-developed-by: Chaitanya Kulkarni <kch@nvidia.com>
> > > Signed-off-by: Chaitanya Kulkarni <kch@nvidia.com>
> > > Signed-off-by: Parav Pandit <parav@nvidia.com>
> > > ---
> > >  drivers/block/virtio_blk.c | 54
> > > ++++++++++++++++++++++++++++++++++++++
> > >  1 file changed, 54 insertions(+)
> > >
> > > diff --git a/drivers/block/virtio_blk.c b/drivers/block/virtio_blk.c
> > > index 2bf14a0e2815..59b49899b229 100644
> > > --- a/drivers/block/virtio_blk.c
> > > +++ b/drivers/block/virtio_blk.c
> > > @@ -1562,10 +1562,64 @@ static int virtblk_probe(struct virtio_device
> > *vdev)
> > >  	return err;
> > >  }
> > >
> > > +static bool virtblk_cancel_request(struct request *rq, void *data) {
> > > +	struct virtblk_req *vbr = blk_mq_rq_to_pdu(rq);
> > > +
> > > +	vbr->in_hdr.status = VIRTIO_BLK_S_IOERR;
> > > +	if (blk_mq_request_started(rq) && !blk_mq_request_completed(rq))
> > > +		blk_mq_complete_request(rq);
> > > +
> > > +	return true;
> > > +}
> > > +
> > > +static void virtblk_cleanup_reqs(struct virtio_blk *vblk) {
> > > +	struct virtio_blk_vq *blk_vq;
> > > +	struct request_queue *q;
> > > +	struct virtqueue *vq;
> > > +	unsigned long flags;
> > > +	int i;
> > > +
> > > +	vq = vblk->vqs[0].vq;
> > > +	if (!virtqueue_is_broken(vq))
> > > +		return;
> > > +
> > > +	q = vblk->disk->queue;
> > > +	/* Block upper layer to not get any new requests */
> > > +	blk_mq_quiesce_queue(q);
> > > +
> > > +	for (i = 0; i < vblk->num_vqs; i++) {
> > > +		blk_vq = &vblk->vqs[i];
> > > +
> > > +		/* Synchronize with any ongoing virtblk_poll() which may be
> > > +		 * completing the requests to uppper layer which has already
> > > +		 * crossed the broken vq check.
> > > +		 */
> > > +		spin_lock_irqsave(&blk_vq->lock, flags);
> > > +		spin_unlock_irqrestore(&blk_vq->lock, flags);
> > > +	}
> > > +
> > > +	blk_sync_queue(q);
> > > +
> > > +	/* Complete remaining pending requests with error */
> > > +	blk_mq_tagset_busy_iter(&vblk->tag_set, virtblk_cancel_request,
> > > +vblk);
> > 
> > Interrupts can still occur here. What prevents the race between
> > virtblk_cancel_request() and virtblk_request_done()?
> >
> The PCI device which generates the interrupt is already removed so interrupt shouldn't arrive when executing cancel_request.
> (This is ignoring the race that Ming pointed out. I am preparing the v1 that eliminates such condition.)
> 
> If there was ongoing virtblk_request_done() is synchronized by the for loop above.
> 

Yes, this works, but I feel this is very subtle. This is why I am asking
whether we should instead call virtio_synchronize_cbs() and then just
invoke all the callbacks one last time from virtio core.
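
As a rough illustration of that idea, the user-space sketch below marks a toy device broken and then has the core call every queue callback once, so each driver callback can see the broken flag and fail whatever is still outstanding. All names are hypothetical; this is not an existing virtio core helper, and the step standing in for virtio_synchronize_cbs() is only a comment in this single-threaded model.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define CORE_MAX_VQS 4

struct core_vq {
	bool broken;
	int pending;	/* requests not yet completed */
	int errored;	/* requests failed by the callback */
	void (*callback)(struct core_vq *vq);
};

struct core_dev {
	struct core_vq vqs[CORE_MAX_VQS];
	size_t num_vqs;
};

/* Driver-side callback: on a broken queue, fail everything outstanding. */
static void drv_vq_done(struct core_vq *vq)
{
	if (!vq->broken)
		return;	/* normal used-ring processing would go here */
	vq->errored += vq->pending;
	vq->pending = 0;
}

/* Core-side helper (hypothetical): break, synchronize, then fire callbacks. */
static void core_break_and_notify(struct core_dev *dev)
{
	size_t i;

	assert(dev != NULL && dev->num_vqs <= CORE_MAX_VQS);

	for (i = 0; i < dev->num_vqs; i++)
		dev->vqs[i].broken = true;

	/* a real implementation would wait for in-flight callbacks here */

	for (i = 0; i < dev->num_vqs; i++)
		if (dev->vqs[i].callback)
			dev->vqs[i].callback(&dev->vqs[i]);
}
```

The point of the design is that the driver keeps a single completion path: the same callback handles both device-generated notifications and the final core-generated one.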


> > > +	blk_mq_tagset_wait_completed_request(&vblk->tag_set);
> > > +
> > > +	/*
> > > +	 * Unblock any pending dispatch I/Os before we destroy device. From
> > > +	 * del_gendisk() -> __blk_mark_disk_dead(disk) will set GD_DEAD
> > flag,
> > > +	 * that will make sure any new I/O from bio_queue_enter() to fail.
> > > +	 */
> > > +	blk_mq_unquiesce_queue(q);
> > > +}
> > > +
> > >  static void virtblk_remove(struct virtio_device *vdev)  {
> > >  	struct virtio_blk *vblk = vdev->priv;
> > >
> > > +	virtblk_cleanup_reqs(vblk);
> > > +
> > >  	/* Make sure no work handler is accessing the device. */
> > >  	flush_work(&vblk->config_work);
> > >
> > > --
> > > 2.34.1
> > >
