diff mbox series

dm: dm_blk_ioctl(): implement failover for SG_IO on dm-multipath

Message ID 20210422202130.30906-1-mwilck@suse.com (mailing list archive)
State New, archived
Headers show
Series dm: dm_blk_ioctl(): implement failover for SG_IO on dm-multipath | expand

Commit Message

Martin Wilck April 22, 2021, 8:21 p.m. UTC
From: Martin Wilck <mwilck@suse.com>

In virtual deployments, SCSI passthrough over dm-multipath devices is a
common setup. The qemu "pr-helper" was specifically invented for it. I
believe that this is the most important real-world scenario for sending
SG_IO ioctls to device-mapper devices.

In this configuration, guests send SCSI IO to the hypervisor in the form of
SG_IO ioctls issued by qemu. But on the device-mapper level, these SCSI
ioctls aren't treated like regular IO. Until commit 2361ae595352 ("dm mpath:
switch paths in dm_blk_ioctl() code path"), no path switching was done at
all. Worse though, if an SG_IO call fails because of a path error,
dm-multipath doesn't retry the IO on a another path; rather, the failure is
passed back to the guest, and paths are not marked as faulty.  This is in
stark contrast with regular block IO of guests on dm-multipath devices, and
certainly comes as a surprise to users who switch to SCSI passthrough in
qemu. In general, users of dm-multipath devices would probably expect failover
to work at least in a basic way.

This patch fixes this by taking a special code path for SG_IO on request-
based device mapper targets. Rather then just choosing a single path,
sending the IO to it, and failing to the caller if the IO on the path
failed, it retries the same IO on another path for certain error codes,
using the same logic as blk_path_error() to determine if a retry would make
sense for the given error code. Moreover, it sends a message to the
multipath target to mark the path as failed.

I am aware that this looks sort of hack-ish. I'm submitting it here as an
RFC, asking for guidance how to reach a clean implementation that would be
acceptable in mainline. I believe that it fixes an important problem.
Actually, it fixes the qemu SCSI-passthrough use case on dm-multipath,
which is non-functional without it.

One problem remains open: if all paths in a multipath map are failed,
normal multipath IO may switch to queueing mode (depending on
configuration). This isn't possible for SG_IO, as SG_IO requests can't
easily be queued like regular block I/O. Thus in the "no path" case, the
guest will still see an error. If anybody can conceive of a solution for
that, I'd be interested.

Signed-off-by: Martin Wilck <mwilck@suse.com>
---
 block/scsi_ioctl.c     |   5 +-
 drivers/md/dm.c        | 134 ++++++++++++++++++++++++++++++++++++++++-
 include/linux/blkdev.h |   2 +
 3 files changed, 137 insertions(+), 4 deletions(-)

Comments

Martin Wilck April 22, 2021, 8:24 p.m. UTC | #1
I forgot the "RFC" subject prefix. Adding it this way, sorry.

Martin
Mike Snitzer April 28, 2021, 7:54 p.m. UTC | #2
On Thu, Apr 22 2021 at  4:21P -0400,
mwilck@suse.com <mwilck@suse.com> wrote:

> From: Martin Wilck <mwilck@suse.com>
> 
> In virtual deployments, SCSI passthrough over dm-multipath devices is a
> common setup. The qemu "pr-helper" was specifically invented for it. I
> believe that this is the most important real-world scenario for sending
> SG_IO ioctls to device-mapper devices.
> 
> In this configuration, guests send SCSI IO to the hypervisor in the form of
> SG_IO ioctls issued by qemu. But on the device-mapper level, these SCSI
> ioctls aren't treated like regular IO. Until commit 2361ae595352 ("dm mpath:
> switch paths in dm_blk_ioctl() code path"), no path switching was done at
> all. Worse though, if an SG_IO call fails because of a path error,
> dm-multipath doesn't retry the IO on a another path; rather, the failure is
> passed back to the guest, and paths are not marked as faulty.  This is in
> stark contrast with regular block IO of guests on dm-multipath devices, and
> certainly comes as a surprise to users who switch to SCSI passthrough in
> qemu. In general, users of dm-multipath devices would probably expect failover
> to work at least in a basic way.
> 
> This patch fixes this by taking a special code path for SG_IO on request-
> based device mapper targets. Rather then just choosing a single path,
> sending the IO to it, and failing to the caller if the IO on the path
> failed, it retries the same IO on another path for certain error codes,
> using the same logic as blk_path_error() to determine if a retry would make
> sense for the given error code. Moreover, it sends a message to the
> multipath target to mark the path as failed.
> 
> I am aware that this looks sort of hack-ish. I'm submitting it here as an
> RFC, asking for guidance how to reach a clean implementation that would be
> acceptable in mainline. I believe that it fixes an important problem.
> Actually, it fixes the qemu SCSI-passthrough use case on dm-multipath,
> which is non-functional without it.
> 
> One problem remains open: if all paths in a multipath map are failed,
> normal multipath IO may switch to queueing mode (depending on
> configuration). This isn't possible for SG_IO, as SG_IO requests can't
> easily be queued like regular block I/O. Thus in the "no path" case, the
> guest will still see an error. If anybody can conceive of a solution for
> that, I'd be interested.
> 
> Signed-off-by: Martin Wilck <mwilck@suse.com>
> ---
>  block/scsi_ioctl.c     |   5 +-
>  drivers/md/dm.c        | 134 ++++++++++++++++++++++++++++++++++++++++-
>  include/linux/blkdev.h |   2 +
>  3 files changed, 137 insertions(+), 4 deletions(-)
> 
> diff --git a/block/scsi_ioctl.c b/block/scsi_ioctl.c
> index 6599bac0a78c..bcc60552f7b1 100644
> --- a/block/scsi_ioctl.c
> +++ b/block/scsi_ioctl.c
> @@ -279,8 +279,8 @@ static int blk_complete_sghdr_rq(struct request *rq, struct sg_io_hdr *hdr,
>  	return ret;
>  }
>  
> -static int sg_io(struct request_queue *q, struct gendisk *bd_disk,
> -		struct sg_io_hdr *hdr, fmode_t mode)
> +int sg_io(struct request_queue *q, struct gendisk *bd_disk,
> +	  struct sg_io_hdr *hdr, fmode_t mode)
>  {
>  	unsigned long start_time;
>  	ssize_t ret = 0;
> @@ -369,6 +369,7 @@ static int sg_io(struct request_queue *q, struct gendisk *bd_disk,
>  	blk_put_request(rq);
>  	return ret;
>  }
> +EXPORT_SYMBOL_GPL(sg_io);
>  
>  /**
>   * sg_scsi_ioctl  --  handle deprecated SCSI_IOCTL_SEND_COMMAND ioctl
> diff --git a/drivers/md/dm.c b/drivers/md/dm.c
> index 50b693d776d6..443ac5e5406c 100644
> --- a/drivers/md/dm.c
> +++ b/drivers/md/dm.c
> @@ -29,6 +29,8 @@
>  #include <linux/part_stat.h>
>  #include <linux/blk-crypto.h>
>  #include <linux/keyslot-manager.h>
> +#include <scsi/sg.h>
> +#include <scsi/scsi.h>
>  
>  #define DM_MSG_PREFIX "core"
>  

Ngh... not loving this at all.  But here is a patch (that I didn't
compile test) that should be folded in, still have to think about how
this could be made cleaner in general though.

Also, dm_sg_io_ioctl should probably be in dm-rq.c (maybe even dm-mpath.c!?)

From: Mike Snitzer <snitzer@redhat.com>
Date: Wed, 28 Apr 2021 15:03:20 -0400
Subject: [PATCH] dm: use scsi_result_to_blk_status rather than open-coding it

Other small cleanups/nits.

BUT the fact that dm.c now #includes <scsi/scsi.h> and <scsi/sg.h>
leaves a very bitter taste.

Also, hardcoding the issuing of a "fail_path" message (assumes tgt is
dm-mpath) but still having checks like !tgt->type->message.. its all
very contrived and awkward!

Not-Signed-off-by: Mike Snitzer <snitzer@redhat.com>
---
 drivers/md/dm.c          | 50 ++++++++++++++++--------------------------------
 drivers/scsi/scsi_lib.c  | 21 ++++++++++++--------
 include/scsi/scsi_cmnd.h |  2 ++
 3 files changed, 32 insertions(+), 41 deletions(-)

diff --git a/drivers/md/dm.c b/drivers/md/dm.c
index b79adf021ef1..51e187309ebd 100644
--- a/drivers/md/dm.c
+++ b/drivers/md/dm.c
@@ -524,9 +524,9 @@ static int dm_blk_report_zones(struct gendisk *disk, sector_t sector,
 #define dm_blk_report_zones		NULL
 #endif /* CONFIG_BLK_DEV_ZONED */
 
-static int _dm_prepare_ioctl(struct mapped_device *md, int *srcu_idx,
-			     struct block_device **bdev,
-			     struct dm_target **target)
+static int __dm_prepare_ioctl(struct mapped_device *md, int *srcu_idx,
+			      struct block_device **bdev,
+			      struct dm_target **target)
 {
 	struct dm_target *tgt;
 	struct dm_table *map;
@@ -565,7 +565,7 @@ static int _dm_prepare_ioctl(struct mapped_device *md, int *srcu_idx,
 static int dm_prepare_ioctl(struct mapped_device *md, int *srcu_idx,
 			    struct block_device **bdev)
 {
-	return _dm_prepare_ioctl(md, srcu_idx, bdev, NULL);
+	return __dm_prepare_ioctl(md, srcu_idx, bdev, NULL);
 }
 
 static void dm_unprepare_ioctl(struct mapped_device *md, int srcu_idx)
@@ -594,9 +594,9 @@ static int dm_sg_io_ioctl(struct block_device *bdev, fmode_t mode,
 		struct dm_target *tgt;
 		struct sg_io_hdr rhdr;
 
-		rc = _dm_prepare_ioctl(md, &srcu_idx, &bdev, &tgt);
+		rc = __dm_prepare_ioctl(md, &srcu_idx, &bdev, &tgt);
 		if (rc < 0) {
-			DMERR("%s: failed to get path: %d\n",
+			DMERR("%s: failed to get path: %d",
 			      __func__, rc);
 			goto out;
 		}
@@ -605,7 +605,7 @@ static int dm_sg_io_ioctl(struct block_device *bdev, fmode_t mode,
 
 		rc = sg_io(bdev->bd_disk->queue, bdev->bd_disk, &rhdr, mode);
 
-		DMDEBUG("SG_IO via %s: rc = %d D%02xH%02xM%02xS%02x\n",
+		DMDEBUG("SG_IO via %s: rc = %d D%02xH%02xM%02xS%02x",
 			bdevname(bdev, path_name), rc,
 			rhdr.driver_status, rhdr.host_status,
 			rhdr.msg_status, rhdr.status);
@@ -626,32 +626,16 @@ static int dm_sg_io_ioctl(struct block_device *bdev, fmode_t mode,
 		}
 
 		if (rhdr.info & SG_INFO_CHECK) {
-			/*
-			 * See if this is a target or path error.
-			 * Compare blk_path_error(), scsi_result_to_blk_status(),
-			 * blk_errors[].
-			 */
-			switch (rhdr.host_status) {
-			case DID_OK:
-				if (scsi_status_is_good(rhdr.status))
-					rc = 0;
-				break;
-			case DID_TARGET_FAILURE:
-				rc = -EREMOTEIO;
-				goto out;
-			case DID_NEXUS_FAILURE:
-				rc = -EBADE;
-				goto out;
-			case DID_ALLOC_FAILURE:
-				rc = -ENOSPC;
-				goto out;
-			case DID_MEDIUM_ERROR:
-				rc = -ENODATA;
-				goto out;
-			default:
-				/* Everything else is a path error */
+			blk_status_t sts = scsi_result_to_blk_status(rhdr.host_status, NULL);
+
+			/* See if this is a target or path error. */
+			if (sts == BLK_STS_OK)
+				rc = 0;
+			else if (blk_path_error(sts))
 				rc = -EIO;
-				break;
+			else {
+				rc = blk_status_to_errno(sts);
+				goto out;
 			}
 		}
 
@@ -674,7 +658,7 @@ static int dm_sg_io_ioctl(struct block_device *bdev, fmode_t mode,
 			scnprintf(bdbuf, sizeof(bdbuf), "%u:%u",
 				  MAJOR(bdev->bd_dev), MINOR(bdev->bd_dev));
 
-			DMDEBUG("sending \"%s %s\" to target\n", argv[0], argv[1]);
+			DMDEBUG("sending \"%s %s\" to target", argv[0], argv[1]);
 			rc = tgt->type->message(tgt, 2, argv, NULL, 0);
 			if (rc < 0)
 				goto out;
diff --git a/drivers/scsi/scsi_lib.c b/drivers/scsi/scsi_lib.c
index 7d52a11e1b61..ecaaba66983c 100644
--- a/drivers/scsi/scsi_lib.c
+++ b/drivers/scsi/scsi_lib.c
@@ -617,7 +617,7 @@ static bool scsi_end_request(struct request *req, blk_status_t error,
  * Translate a SCSI result code into a blk_status_t value. May reset the host
  * byte of @cmd->result.
  */
-static blk_status_t scsi_result_to_blk_status(struct scsi_cmnd *cmd, int result)
+blk_status_t scsi_result_to_blk_status(int result, struct scsi_cmnd *cmd)
 {
 	switch (host_byte(result)) {
 	case DID_OK:
@@ -633,21 +633,26 @@ static blk_status_t scsi_result_to_blk_status(struct scsi_cmnd *cmd, int result)
 	case DID_TRANSPORT_MARGINAL:
 		return BLK_STS_TRANSPORT;
 	case DID_TARGET_FAILURE:
-		set_host_byte(cmd, DID_OK);
+		if (cmd)
+			set_host_byte(cmd, DID_OK);
 		return BLK_STS_TARGET;
 	case DID_NEXUS_FAILURE:
-		set_host_byte(cmd, DID_OK);
+		if (cmd)
+			set_host_byte(cmd, DID_OK);
 		return BLK_STS_NEXUS;
 	case DID_ALLOC_FAILURE:
-		set_host_byte(cmd, DID_OK);
+		if (cmd)
+			set_host_byte(cmd, DID_OK);
 		return BLK_STS_NOSPC;
 	case DID_MEDIUM_ERROR:
-		set_host_byte(cmd, DID_OK);
+		if (cmd)
+			set_host_byte(cmd, DID_OK);
 		return BLK_STS_MEDIUM;
 	default:
 		return BLK_STS_IOERR;
 	}
 }
+EXPORT_SYMBOL(scsi_result_to_blk_status);
 
 /* Helper for scsi_io_completion() when "reprep" action required. */
 static void scsi_io_completion_reprep(struct scsi_cmnd *cmd,
@@ -691,7 +696,7 @@ static void scsi_io_completion_action(struct scsi_cmnd *cmd, int result)
 	if (sense_valid)
 		sense_current = !scsi_sense_is_deferred(&sshdr);
 
-	blk_stat = scsi_result_to_blk_status(cmd, result);
+	blk_stat = scsi_result_to_blk_status(result, cmd);
 
 	if (host_byte(result) == DID_RESET) {
 		/* Third party bus reset or reset for error recovery
@@ -869,14 +874,14 @@ static int scsi_io_completion_nz_result(struct scsi_cmnd *cmd, int result,
 				    SCSI_SENSE_BUFFERSIZE);
 		}
 		if (sense_current)
-			*blk_statp = scsi_result_to_blk_status(cmd, result);
+			*blk_statp = scsi_result_to_blk_status(result, cmd);
 	} else if (blk_rq_bytes(req) == 0 && sense_current) {
 		/*
 		 * Flush commands do not transfers any data, and thus cannot use
 		 * good_bytes != blk_rq_bytes(req) as the signal for an error.
 		 * This sets *blk_statp explicitly for the problem case.
 		 */
-		*blk_statp = scsi_result_to_blk_status(cmd, result);
+		*blk_statp = scsi_result_to_blk_status(result, cmd);
 	}
 	/*
 	 * Recovered errors need reporting, but they're always treated as
diff --git a/include/scsi/scsi_cmnd.h b/include/scsi/scsi_cmnd.h
index ace15b5dc956..19e76f8db1ea 100644
--- a/include/scsi/scsi_cmnd.h
+++ b/include/scsi/scsi_cmnd.h
@@ -161,6 +161,8 @@ static inline struct scsi_driver *scsi_cmd_to_driver(struct scsi_cmnd *cmd)
 
 extern void scsi_finish_command(struct scsi_cmnd *cmd);
 
+extern blk_status_t scsi_result_to_blk_status(int result, struct scsi_cmnd *cmd);
+
 extern void *scsi_kmap_atomic_sg(struct scatterlist *sg, int sg_count,
 				 size_t *offset, size_t *len);
 extern void scsi_kunmap_atomic_sg(void *virt);
Martin Wilck April 28, 2021, 8:54 p.m. UTC | #3
On Wed, 2021-04-28 at 15:54 -0400, Mike Snitzer wrote:
> 
> Ngh... not loving this at all.  

I'm not surprised :-) I infer from your reply that you acknowledge this
is something that should be done one way or the other, and I'm glad
about that.

> But here is a patch (that I didn't
> compile test) that should be folded in, still have to think about how
> this could be made cleaner in general though.

Thanks, much appreciated.

> Also, dm_sg_io_ioctl should probably be in dm-rq.c (maybe even dm-
> mpath.c!?)
> 
> From: Mike Snitzer <snitzer@redhat.com>
> Date: Wed, 28 Apr 2021 15:03:20 -0400
> Subject: [PATCH] dm: use scsi_result_to_blk_status rather than open-
> coding it
> 
> Other small cleanups/nits.
> 
> BUT the fact that dm.c now #includes <scsi/scsi.h> and <scsi/sg.h>
> leaves a very bitter taste.
> 
> Also, hardcoding the issuing of a "fail_path" message (assumes tgt is
> dm-mpath) but still having checks like !tgt->type->message.. its all
> very contrived and awkward!

This I can explain. I need to check t->type->message because not all
targets define it. If I didn't, yet used message(), the kernel might
crash. I would have avoided the hard-coded "fail_path", too, if I had
found a way to figure out which messages a given target supports, or
whether the target in question is the multipath target in the first
place. But I couldn't figure it out. If it's possible, please tell me
how. 

Wrt "fail_path", the assumption is that other rq-based targets would
either not support "fail_path" and thus return an error, or implement
it with the same semantics as multipath. In both cases, my patch would
do the "right thing" (falling back to standard behavior in the error
case).

Thanks,
Martin
Hannes Reinecke April 29, 2021, 6:02 a.m. UTC | #4
On 4/28/21 9:54 PM, Mike Snitzer wrote:
> On Thu, Apr 22 2021 at  4:21P -0400,
> mwilck@suse.com <mwilck@suse.com> wrote:
> 
>> From: Martin Wilck <mwilck@suse.com>
>>
>> In virtual deployments, SCSI passthrough over dm-multipath devices is a
>> common setup. The qemu "pr-helper" was specifically invented for it. I
>> believe that this is the most important real-world scenario for sending
>> SG_IO ioctls to device-mapper devices.
>>
>> In this configuration, guests send SCSI IO to the hypervisor in the form of
>> SG_IO ioctls issued by qemu. But on the device-mapper level, these SCSI
>> ioctls aren't treated like regular IO. Until commit 2361ae595352 ("dm mpath:
>> switch paths in dm_blk_ioctl() code path"), no path switching was done at
>> all. Worse though, if an SG_IO call fails because of a path error,
>> dm-multipath doesn't retry the IO on a another path; rather, the failure is
>> passed back to the guest, and paths are not marked as faulty.  This is in
>> stark contrast with regular block IO of guests on dm-multipath devices, and
>> certainly comes as a surprise to users who switch to SCSI passthrough in
>> qemu. In general, users of dm-multipath devices would probably expect failover
>> to work at least in a basic way.
>>
>> This patch fixes this by taking a special code path for SG_IO on request-
>> based device mapper targets. Rather then just choosing a single path,
>> sending the IO to it, and failing to the caller if the IO on the path
>> failed, it retries the same IO on another path for certain error codes,
>> using the same logic as blk_path_error() to determine if a retry would make
>> sense for the given error code. Moreover, it sends a message to the
>> multipath target to mark the path as failed.
>>
>> I am aware that this looks sort of hack-ish. I'm submitting it here as an
>> RFC, asking for guidance how to reach a clean implementation that would be
>> acceptable in mainline. I believe that it fixes an important problem.
>> Actually, it fixes the qemu SCSI-passthrough use case on dm-multipath,
>> which is non-functional without it.
>>
>> One problem remains open: if all paths in a multipath map are failed,
>> normal multipath IO may switch to queueing mode (depending on
>> configuration). This isn't possible for SG_IO, as SG_IO requests can't
>> easily be queued like regular block I/O. Thus in the "no path" case, the
>> guest will still see an error. If anybody can conceive of a solution for
>> that, I'd be interested.
>>
>> Signed-off-by: Martin Wilck <mwilck@suse.com>
>> ---
>>   block/scsi_ioctl.c     |   5 +-
>>   drivers/md/dm.c        | 134 ++++++++++++++++++++++++++++++++++++++++-
>>   include/linux/blkdev.h |   2 +
>>   3 files changed, 137 insertions(+), 4 deletions(-)
>>
>> diff --git a/block/scsi_ioctl.c b/block/scsi_ioctl.c
>> index 6599bac0a78c..bcc60552f7b1 100644
>> --- a/block/scsi_ioctl.c
>> +++ b/block/scsi_ioctl.c
>> @@ -279,8 +279,8 @@ static int blk_complete_sghdr_rq(struct request *rq, struct sg_io_hdr *hdr,
>>   	return ret;
>>   }
>>   
>> -static int sg_io(struct request_queue *q, struct gendisk *bd_disk,
>> -		struct sg_io_hdr *hdr, fmode_t mode)
>> +int sg_io(struct request_queue *q, struct gendisk *bd_disk,
>> +	  struct sg_io_hdr *hdr, fmode_t mode)
>>   {
>>   	unsigned long start_time;
>>   	ssize_t ret = 0;
>> @@ -369,6 +369,7 @@ static int sg_io(struct request_queue *q, struct gendisk *bd_disk,
>>   	blk_put_request(rq);
>>   	return ret;
>>   }
>> +EXPORT_SYMBOL_GPL(sg_io);
>>   
>>   /**
>>    * sg_scsi_ioctl  --  handle deprecated SCSI_IOCTL_SEND_COMMAND ioctl
>> diff --git a/drivers/md/dm.c b/drivers/md/dm.c
>> index 50b693d776d6..443ac5e5406c 100644
>> --- a/drivers/md/dm.c
>> +++ b/drivers/md/dm.c
>> @@ -29,6 +29,8 @@
>>   #include <linux/part_stat.h>
>>   #include <linux/blk-crypto.h>
>>   #include <linux/keyslot-manager.h>
>> +#include <scsi/sg.h>
>> +#include <scsi/scsi.h>
>>   
>>   #define DM_MSG_PREFIX "core"
>>   
> 
> Ngh... not loving this at all.  But here is a patch (that I didn't
> compile test) that should be folded in, still have to think about how
> this could be made cleaner in general though.
> 
> Also, dm_sg_io_ioctl should probably be in dm-rq.c (maybe even dm-mpath.c!?)
> 
I fully share your sentiments; this just feels so awkward, having to 
redo the logic in scsi_cmd_ioctl().

My original idea to that was to _use_ scsi_cmd_ioctl() directly from 
dm_blk_ioctl:

diff --git a/drivers/md/dm.c b/drivers/md/dm.c
index 50b693d776d6..007ff981f888 100644
--- a/drivers/md/dm.c
+++ b/drivers/md/dm.c
@@ -567,6 +567,9 @@ static int dm_blk_ioctl(struct block_device *bdev, 
fmode_t mode,
         struct mapped_device *md = bdev->bd_disk->private_data;
         int r, srcu_idx;

+       if (cmd == SG_IO)
+               return scsi_cmd_blk_ioctl(bdev, mode, cmd, arg);
+
         r = dm_prepare_ioctl(md, &srcu_idx, &bdev);
         if (r < 0)
                 goto out;


But that crashes horribly as sg_io() expects the request pdu to be of 
type 'scsi_request', which it definitely isn't here.

We _could_ prepend struct dm_rq_target_io with a struct scsi_request, 
seeing that basically _all_ dm-rq installations are on SCSI devices:

diff --git a/drivers/md/dm-rq.c b/drivers/md/dm-rq.c
index 13b4385f4d5a..aa7856621f83 100644
--- a/drivers/md/dm-rq.c
+++ b/drivers/md/dm-rq.c
@@ -16,6 +16,7 @@
   * One of these is allocated per request.
   */
  struct dm_rq_target_io {
+       struct scsi_request scsi_req;
         struct mapped_device *md;
         struct dm_target *ti;
         struct request *orig, *clone;

Then the above should work, but we would increase the size of 
dm_rq_target_io by quite a lot. Although, given the current size of it, 
question is whether it matters...

Hmm?

Cheers,

Hannes
Martin Wilck April 29, 2021, 8:33 a.m. UTC | #5
On Wed, 2021-04-28 at 15:54 -0400, Mike Snitzer wrote:
> 
> @@ -626,32 +626,16 @@ static int dm_sg_io_ioctl(struct block_device
> *bdev, fmode_t mode,
>                 }
>  
>                 if (rhdr.info & SG_INFO_CHECK) {
> -                       /*
> -                        * See if this is a target or path error.
> -                        * Compare blk_path_error(),
> scsi_result_to_blk_status(),
> -                        * blk_errors[].
> -                        */
> -                       switch (rhdr.host_status) {
> -                       case DID_OK:
> -                               if (scsi_status_is_good(rhdr.status))
> -                                       rc = 0;
> -                               break;
> -                       case DID_TARGET_FAILURE:
> -                               rc = -EREMOTEIO;
> -                               goto out;
> -                       case DID_NEXUS_FAILURE:
> -                               rc = -EBADE;
> -                               goto out;
> -                       case DID_ALLOC_FAILURE:
> -                               rc = -ENOSPC;
> -                               goto out;
> -                       case DID_MEDIUM_ERROR:
> -                               rc = -ENODATA;
> -                               goto out;
> -                       default:
> -                               /* Everything else is a path error */
> +                       blk_status_t sts =
> scsi_result_to_blk_status(rhdr.host_status, NULL);

This change makes dm_mod depend on scsi_mod. 
Would you seriously prefer that over a re-implementation of the logic?

Regards
Martin
Paolo Bonzini April 29, 2021, 8:39 a.m. UTC | #6
On 29/04/21 10:33, Martin Wilck wrote:
> On Wed, 2021-04-28 at 15:54 -0400, Mike Snitzer wrote:
>>
>> @@ -626,32 +626,16 @@ static int dm_sg_io_ioctl(struct block_device
>> *bdev, fmode_t mode,
>>                  }
>>   
>>                  if (rhdr.info & SG_INFO_CHECK) {
>> -                       /*
>> -                        * See if this is a target or path error.
>> -                        * Compare blk_path_error(),
>> scsi_result_to_blk_status(),
>> -                        * blk_errors[].
>> -                        */
>> -                       switch (rhdr.host_status) {
>> -                       case DID_OK:
>> -                               if (scsi_status_is_good(rhdr.status))
>> -                                       rc = 0;
>> -                               break;
>> -                       case DID_TARGET_FAILURE:
>> -                               rc = -EREMOTEIO;
>> -                               goto out;
>> -                       case DID_NEXUS_FAILURE:
>> -                               rc = -EBADE;
>> -                               goto out;
>> -                       case DID_ALLOC_FAILURE:
>> -                               rc = -ENOSPC;
>> -                               goto out;
>> -                       case DID_MEDIUM_ERROR:
>> -                               rc = -ENODATA;
>> -                               goto out;
>> -                       default:
>> -                               /* Everything else is a path error */
>> +                       blk_status_t sts =
>> scsi_result_to_blk_status(rhdr.host_status, NULL);
> 
> This change makes dm_mod depend on scsi_mod.
> Would you seriously prefer that over a re-implementation of the logic?

You could just make it an inline function too.

Paolo
Mike Snitzer April 30, 2021, 8:15 p.m. UTC | #7
On Thu, Apr 29 2021 at  2:02am -0400,
Hannes Reinecke <hare@suse.de> wrote:

> On 4/28/21 9:54 PM, Mike Snitzer wrote:
> >On Thu, Apr 22 2021 at  4:21P -0400,
> >mwilck@suse.com <mwilck@suse.com> wrote:
> >
> >>From: Martin Wilck <mwilck@suse.com>
> >>
> >>In virtual deployments, SCSI passthrough over dm-multipath devices is a
> >>common setup. The qemu "pr-helper" was specifically invented for it. I
> >>believe that this is the most important real-world scenario for sending
> >>SG_IO ioctls to device-mapper devices.
> >>
> >>In this configuration, guests send SCSI IO to the hypervisor in the form of
> >>SG_IO ioctls issued by qemu. But on the device-mapper level, these SCSI
> >>ioctls aren't treated like regular IO. Until commit 2361ae595352 ("dm mpath:
> >>switch paths in dm_blk_ioctl() code path"), no path switching was done at
> >>all. Worse though, if an SG_IO call fails because of a path error,
> >>dm-multipath doesn't retry the IO on a another path; rather, the failure is
> >>passed back to the guest, and paths are not marked as faulty.  This is in
> >>stark contrast with regular block IO of guests on dm-multipath devices, and
> >>certainly comes as a surprise to users who switch to SCSI passthrough in
> >>qemu. In general, users of dm-multipath devices would probably expect failover
> >>to work at least in a basic way.
> >>
> >>This patch fixes this by taking a special code path for SG_IO on request-
> >>based device mapper targets. Rather then just choosing a single path,
> >>sending the IO to it, and failing to the caller if the IO on the path
> >>failed, it retries the same IO on another path for certain error codes,
> >>using the same logic as blk_path_error() to determine if a retry would make
> >>sense for the given error code. Moreover, it sends a message to the
> >>multipath target to mark the path as failed.
> >>
> >>I am aware that this looks sort of hack-ish. I'm submitting it here as an
> >>RFC, asking for guidance how to reach a clean implementation that would be
> >>acceptable in mainline. I believe that it fixes an important problem.
> >>Actually, it fixes the qemu SCSI-passthrough use case on dm-multipath,
> >>which is non-functional without it.
> >>
> >>One problem remains open: if all paths in a multipath map are failed,
> >>normal multipath IO may switch to queueing mode (depending on
> >>configuration). This isn't possible for SG_IO, as SG_IO requests can't
> >>easily be queued like regular block I/O. Thus in the "no path" case, the
> >>guest will still see an error. If anybody can conceive of a solution for
> >>that, I'd be interested.
> >>
> >>Signed-off-by: Martin Wilck <mwilck@suse.com>
> >>---
> >>  block/scsi_ioctl.c     |   5 +-
> >>  drivers/md/dm.c        | 134 ++++++++++++++++++++++++++++++++++++++++-
> >>  include/linux/blkdev.h |   2 +
> >>  3 files changed, 137 insertions(+), 4 deletions(-)
> >>
> >>diff --git a/block/scsi_ioctl.c b/block/scsi_ioctl.c
> >>index 6599bac0a78c..bcc60552f7b1 100644
> >>--- a/block/scsi_ioctl.c
> >>+++ b/block/scsi_ioctl.c
> >>@@ -279,8 +279,8 @@ static int blk_complete_sghdr_rq(struct request *rq, struct sg_io_hdr *hdr,
> >>  	return ret;
> >>  }
> >>-static int sg_io(struct request_queue *q, struct gendisk *bd_disk,
> >>-		struct sg_io_hdr *hdr, fmode_t mode)
> >>+int sg_io(struct request_queue *q, struct gendisk *bd_disk,
> >>+	  struct sg_io_hdr *hdr, fmode_t mode)
> >>  {
> >>  	unsigned long start_time;
> >>  	ssize_t ret = 0;
> >>@@ -369,6 +369,7 @@ static int sg_io(struct request_queue *q, struct gendisk *bd_disk,
> >>  	blk_put_request(rq);
> >>  	return ret;
> >>  }
> >>+EXPORT_SYMBOL_GPL(sg_io);
> >>  /**
> >>   * sg_scsi_ioctl  --  handle deprecated SCSI_IOCTL_SEND_COMMAND ioctl
> >>diff --git a/drivers/md/dm.c b/drivers/md/dm.c
> >>index 50b693d776d6..443ac5e5406c 100644
> >>--- a/drivers/md/dm.c
> >>+++ b/drivers/md/dm.c
> >>@@ -29,6 +29,8 @@
> >>  #include <linux/part_stat.h>
> >>  #include <linux/blk-crypto.h>
> >>  #include <linux/keyslot-manager.h>
> >>+#include <scsi/sg.h>
> >>+#include <scsi/scsi.h>
> >>  #define DM_MSG_PREFIX "core"
> >
> >Ngh... not loving this at all.  But here is a patch (that I didn't
> >compile test) that should be folded in, still have to think about how
> >this could be made cleaner in general though.
> >
> >Also, dm_sg_io_ioctl should probably be in dm-rq.c (maybe even dm-mpath.c!?)
> >
> I fully share your sentiments; this just feels so awkward, having to
> redo the logic in scsi_cmd_ioctl().
> 
> My original idea to that was to _use_ scsi_cmd_ioctl() directly from
> dm_blk_ioctl:
> 
> diff --git a/drivers/md/dm.c b/drivers/md/dm.c
> index 50b693d776d6..007ff981f888 100644
> --- a/drivers/md/dm.c
> +++ b/drivers/md/dm.c
> @@ -567,6 +567,9 @@ static int dm_blk_ioctl(struct block_device
> *bdev, fmode_t mode,
>         struct mapped_device *md = bdev->bd_disk->private_data;
>         int r, srcu_idx;
> 
> +       if (cmd == SG_IO)
> +               return scsi_cmd_blk_ioctl(bdev, mode, cmd, arg);
> +
>         r = dm_prepare_ioctl(md, &srcu_idx, &bdev);
>         if (r < 0)
>                 goto out;
> 
> 
> But that crashes horribly as sg_io() expects the request pdu to be
> of type 'scsi_request', which it definitely isn't here.
> 
> We _could_ prepend struct dm_rq_target_io with a struct
> scsi_request, seeing that basically _all_ dm-rq installations are on
> SCSI devices:

If calling sg_io() is an issue then how does Martin's latest patchset,
that exports and calls sg_io direct from DM, work!?

And _no_ I have no interest in embedding struct scsi_request in
dm_rq_target_io.

Mike

> 
> diff --git a/drivers/md/dm-rq.c b/drivers/md/dm-rq.c
> index 13b4385f4d5a..aa7856621f83 100644
> --- a/drivers/md/dm-rq.c
> +++ b/drivers/md/dm-rq.c
> @@ -16,6 +16,7 @@
>   * One of these is allocated per request.
>   */
>  struct dm_rq_target_io {
> +       struct scsi_request scsi_req;
>         struct mapped_device *md;
>         struct dm_target *ti;
>         struct request *orig, *clone;
> 
> Then the above should work, but we would increase the size of
> dm_rq_target_io by quite a lot. Although, given the current size of
> it, question is whether it matters...
> 
> Hmm?
> 
> Cheers,
> 
> Hannes
> -- 
> Dr. Hannes Reinecke                Kernel Storage Architect
> hare@suse.de                              +49 911 74053 688
> SUSE Software Solutions GmbH, Maxfeldstr. 5, 90409 Nürnberg
> HRB 36809 (AG Nürnberg), Geschäftsführer: Felix Imendörffer
>
Martin Wilck May 3, 2021, 7:27 a.m. UTC | #8
On Fri, 2021-04-30 at 16:15 -0400, Mike Snitzer wrote:
> 
> If calling sg_io() is an issue then how does Martin's latest
> patchset,
> that exports and calls sg_io direct from DM, work!?

It works by doing sg_io on the _path_ devices. There's no difference
wrt to the current code in this respect. My code just retries on a
different path when one fails.

Hannes' original idea had passed the _dm_ device directly to
scsi_cmd_blk_ioctl().

Martin
diff mbox series

Patch

diff --git a/block/scsi_ioctl.c b/block/scsi_ioctl.c
index 6599bac0a78c..bcc60552f7b1 100644
--- a/block/scsi_ioctl.c
+++ b/block/scsi_ioctl.c
@@ -279,8 +279,8 @@  static int blk_complete_sghdr_rq(struct request *rq, struct sg_io_hdr *hdr,
 	return ret;
 }
 
-static int sg_io(struct request_queue *q, struct gendisk *bd_disk,
-		struct sg_io_hdr *hdr, fmode_t mode)
+int sg_io(struct request_queue *q, struct gendisk *bd_disk,
+	  struct sg_io_hdr *hdr, fmode_t mode)
 {
 	unsigned long start_time;
 	ssize_t ret = 0;
@@ -369,6 +369,7 @@  static int sg_io(struct request_queue *q, struct gendisk *bd_disk,
 	blk_put_request(rq);
 	return ret;
 }
+EXPORT_SYMBOL_GPL(sg_io);
 
 /**
  * sg_scsi_ioctl  --  handle deprecated SCSI_IOCTL_SEND_COMMAND ioctl
diff --git a/drivers/md/dm.c b/drivers/md/dm.c
index 50b693d776d6..443ac5e5406c 100644
--- a/drivers/md/dm.c
+++ b/drivers/md/dm.c
@@ -29,6 +29,8 @@ 
 #include <linux/part_stat.h>
 #include <linux/blk-crypto.h>
 #include <linux/keyslot-manager.h>
+#include <scsi/sg.h>
+#include <scsi/scsi.h>
 
 #define DM_MSG_PREFIX "core"
 
@@ -522,8 +524,9 @@  static int dm_blk_report_zones(struct gendisk *disk, sector_t sector,
 #define dm_blk_report_zones		NULL
 #endif /* CONFIG_BLK_DEV_ZONED */
 
-static int dm_prepare_ioctl(struct mapped_device *md, int *srcu_idx,
-			    struct block_device **bdev)
+static int _dm_prepare_ioctl(struct mapped_device *md, int *srcu_idx,
+			     struct block_device **bdev,
+			     struct dm_target **target)
 {
 	struct dm_target *tgt;
 	struct dm_table *map;
@@ -553,20 +556,147 @@  static int dm_prepare_ioctl(struct mapped_device *md, int *srcu_idx,
 		goto retry;
 	}
 
+	if (r >= 0 && target)
+		*target = tgt;
+
 	return r;
 }
 
+static int dm_prepare_ioctl(struct mapped_device *md, int *srcu_idx,
+			    struct block_device **bdev)
+{
+	return _dm_prepare_ioctl(md, srcu_idx, bdev, NULL);
+}
+
 static void dm_unprepare_ioctl(struct mapped_device *md, int srcu_idx)
 {
 	dm_put_live_table(md, srcu_idx);
 }
 
+static int dm_sg_io_ioctl(struct block_device *bdev, fmode_t mode,
+			  void __user *arg)
+{
+	struct mapped_device *md = bdev->bd_disk->private_data;
+	struct sg_io_hdr hdr;
+	int rc, srcu_idx;
+	char path_name[BDEVNAME_SIZE];
+
+	if (copy_from_user(&hdr, arg, sizeof(hdr)))
+		return -EFAULT;
+
+	if (hdr.interface_id != 'S')
+		return -EINVAL;
+
+	if (hdr.dxfer_len > (queue_max_hw_sectors(bdev->bd_disk->queue) << 9))
+		return -EIO;
+
+	for (;;) {
+		struct dm_target *tgt;
+		struct sg_io_hdr rhdr;
+
+		rc = _dm_prepare_ioctl(md, &srcu_idx, &bdev, &tgt);
+		if (rc < 0) {
+			DMERR("%s: failed to get path: %d\n",
+			      __func__, rc);
+			goto out;
+		}
+
+		rhdr = hdr;
+
+		rc = sg_io(bdev->bd_disk->queue, bdev->bd_disk, &rhdr, mode);
+
+		DMDEBUG("SG_IO via %s: rc = %d D%02xH%02xM%02xS%02x\n",
+			bdevname(bdev, path_name), rc,
+			rhdr.driver_status, rhdr.host_status,
+			rhdr.msg_status, rhdr.status);
+
+		/*
+		 * Errors resulting from invalid parameters shouldn't be retried
+		 * on another path.
+		 */
+		switch (rc) {
+		case -ENOIOCTLCMD:
+		case -ENOTTY:
+		case -EFAULT:
+		case -EINVAL:
+		case -EPERM:
+			goto out;
+		default:
+			break;
+		}
+
+		if (rhdr.info & SG_INFO_CHECK) {
+			/*
+			 * See if this is a target or path error.
+			 * Compare blk_path_error(), scsi_result_to_blk_status(),
+			 * blk_errors[].
+			 */
+			switch (rhdr.host_status) {
+			case DID_OK:
+				if (scsi_status_is_good(rhdr.status))
+					rc = 0;
+				break;
+			case DID_TARGET_FAILURE:
+				rc = -EREMOTEIO;
+				goto out;
+			case DID_NEXUS_FAILURE:
+				rc = -EBADE;
+				goto out;
+			case DID_ALLOC_FAILURE:
+				rc = -ENOSPC;
+				goto out;
+			case DID_MEDIUM_ERROR:
+				rc = -ENODATA;
+				goto out;
+			default:
+				/* Everything else is a path error */
+				rc = -EIO;
+				break;
+			}
+		}
+
+		if (rc == 0) {
+			/* success */
+			if (copy_to_user(arg, &rhdr, sizeof(rhdr)))
+				rc = -EFAULT;
+			goto out;
+		}
+
+		/* Failure - fail path by sending a message to the target */
+		if (!tgt->type->message) {
+			DMWARN("invalid target!");
+			rc = -EIO;
+			goto out;
+		} else {
+			char bdbuf[BDEVT_SIZE];
+			char *argv[2] = { "fail_path", bdbuf };
+
+			scnprintf(bdbuf, sizeof(bdbuf), "%u:%u",
+				  MAJOR(bdev->bd_dev), MINOR(bdev->bd_dev));
+
+			DMDEBUG("sending \"%s %s\" to target\n", argv[0], argv[1]);
+			rc = tgt->type->message(tgt, 2, argv, NULL, 0);
+			if (rc < 0)
+				goto out;
+		}
+
+		dm_unprepare_ioctl(md, srcu_idx);
+	}
+out:
+	dm_unprepare_ioctl(md, srcu_idx);
+	return rc;
+}
+
 static int dm_blk_ioctl(struct block_device *bdev, fmode_t mode,
 			unsigned int cmd, unsigned long arg)
 {
 	struct mapped_device *md = bdev->bd_disk->private_data;
 	int r, srcu_idx;
 
+	if ((dm_get_md_type(md) == DM_TYPE_REQUEST_BASED) &&
+	    cmd == SG_IO)
+		return dm_sg_io_ioctl(bdev, mode, (void __user *)arg);
+
 	r = dm_prepare_ioctl(md, &srcu_idx, &bdev);
 	if (r < 0)
 		goto out;
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index c032cfe133c7..bded0e6546da 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -934,6 +934,8 @@  extern int scsi_cmd_ioctl(struct request_queue *, struct gendisk *, fmode_t,
 			  unsigned int, void __user *);
 extern int sg_scsi_ioctl(struct request_queue *, struct gendisk *, fmode_t,
 			 struct scsi_ioctl_command __user *);
+extern int sg_io(struct request_queue *, struct gendisk *,
+		 struct sg_io_hdr *, fmode_t);
 extern int get_sg_io_hdr(struct sg_io_hdr *hdr, const void __user *argp);
 extern int put_sg_io_hdr(const struct sg_io_hdr *hdr, void __user *argp);