
[RFC] xen-block: introduces extra request to pass-through SCSI commands

Message ID 1456717031-13423-1-git-send-email-bob.liu@oracle.com (mailing list archive)
State New, archived

Commit Message

Bob Liu Feb. 29, 2016, 3:37 a.m. UTC
1) What is this patch about?
This patch introduces a new block operation flag (BLKIF_OP_EXTRA_FLAG).
A request with BLKIF_OP_EXTRA_FLAG set indicates that the next request on the
ring is an extra request, used here to pass through SCSI commands.
This is like a simplified version of XEN_NETIF_EXTRA_* in netif.h.
It can easily be extended to transmit other per-request/bio data from frontend
to backend, e.g. a Data Integrity Field per bio.
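The pairing described above can be sketched from the backend side. The structure and helper below are illustrative only (not the patch's actual code): a simplified ring slot carrying just the operation byte, and a consumer that treats the slot following a flagged request as the extra request.

```c
#include <assert.h>
#include <stdint.h>
#include <stddef.h>

/* Simplified, hypothetical versions of the blkif definitions. */
#define BLKIF_OP_READ        0
#define BLKIF_OP_WRITE       1
#define BLKIF_OP_EXTRA_FLAG  0x80   /* proposed by this RFC */

struct ring_slot {
    uint8_t  operation;   /* BLKIF_OP_*, possibly ORed with the flag */
    uint64_t id;
};

/* Walk n produced slots; a flagged request consumes the next slot as
 * its extra request. Returns the number of normal requests seen, or
 * -1 if a flagged request is not followed by its extra slot. */
static int consume_ring(const struct ring_slot *ring, size_t n)
{
    int normal = 0;
    for (size_t i = 0; i < n; i++) {
        normal++;
        if (ring[i].operation & BLKIF_OP_EXTRA_FLAG) {
            if (i + 1 >= n)
                return -1;   /* extra request missing: protocol error */
            i++;             /* slot i+1 is the extra request */
        }
    }
    return normal;
}
```

A flagged request and its extra slot thus always travel as a unit, which is what lets the backend stay in lock-step with the frontend.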

2) Why do we need this?
Currently only raw data segments are transmitted from blkfront to blkback, which
means some advanced features are lost.
 * The guest knows nothing about the features of the real backend storage.
	For example, in a bare-metal environment the SCSI INQUIRY command can
	be used to query storage device information. If it is an SSD or flash
	device, we have the option of using the device as a fast cache.
	This can't happen in current domU guests, because blkfront only knows
	it's a normal virtual disk.

 * Failover Clusters in Windows
	Failover clusters require SCSI-3 persistent reservation target disks,
	but this currently can't work in domU.

3) Known issues:
 * Security: how to 'validate' this extra request payload.
   E.g. SCSI operates on a per-LUN basis (the whole disk), while we really
   only want to operate on partitions.

 * SCSI commands can't be passed through if the backend storage driver is
   bio-based instead of request-based.
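One conceivable mitigation for the validation issue is an opcode whitelist in the backend: only CDBs whose first byte names a harmless or explicitly wanted command get through. A minimal sketch (opcode values are the standard SPC ones; the policy itself is purely illustrative and does not solve the LUN-vs-partition problem):

```c
#include <assert.h>
#include <stdint.h>

/* Standard SCSI opcodes (first CDB byte), per SPC. */
#define SCSI_INQUIRY                0x12
#define SCSI_PERSISTENT_RESERVE_IN  0x5e
#define SCSI_PERSISTENT_RESERVE_OUT 0x5f

/* Hypothetical backend policy: accept only a small whitelist of
 * opcodes and reject everything else (WRITE, FORMAT UNIT, ...). */
static int cdb_allowed(const uint8_t *cdb)
{
    switch (cdb[0]) {
    case SCSI_INQUIRY:
    case SCSI_PERSISTENT_RESERVE_IN:
    case SCSI_PERSISTENT_RESERVE_OUT:
        return 1;
    default:
        return 0;
    }
}
```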

4) Alternative approach: using PVSCSI instead:
 * It is doubtful that PVSCSI can support as many types of backend storage
   devices as xen-block.

 * Much longer path:
   ioctl() -> SCSI upper layer -> Middle layer -> PVSCSI-frontend -> PVSCSI-backend -> Target framework(LIO?) ->

   With xen-block we only need:
   ioctl() -> blkfront -> blkback ->

 * xen-block has existed for many years; it is widely used and more stable.

Any input is welcome, thank you!

Signed-off-by: Bob Liu <bob.liu@oracle.com>
---
 xen/include/public/io/blkif.h |   73 +++++++++++++++++++++++++++++++++++++++++
 1 file changed, 73 insertions(+)

Comments

Jürgen Groß Feb. 29, 2016, 8:12 a.m. UTC | #1
On 29/02/16 04:37, Bob Liu wrote:
> 1) What is this patch about?
> This patch introduces an new block operation (BLKIF_OP_EXTRA_FLAG).
> A request with BLKIF_OP_EXTRA_FLAG set means the following request is an
> extra request which is used to pass through SCSI commands.
> This is like a simplified version of XEN_NETIF_EXTRA_* in netif.h.
> It can be extended easily to transmit other per-request/bio data from frontend
> to backend e.g Data Integrity Field per bio.
> 
> 2) Why we need this?
> Currently only raw data segments are transmitted from blkfront to blkback, which
> means some advanced features are lost.
>  * Guest knows nothing about features of the real backend storage.
> 	For example, on bare-metal environment INQUIRY SCSI command can be used
> 	to query storage device information. If it's a SSD or flash device we
> 	can have the option to use the device as a fast cache.
> 	But this can't happen in current domU guests, because blkfront only
> 	knows it's just a normal virtual disk
> 
>  * Failover Clusters in Windows
> 	Failover clusters require SCSI-3 persistent reservation target disks,
> 	but now this can't work in domU.
> 
> 3) Known issues:
>  * Security issues, how to 'validate' this extra request payload.
>    E.g SCSI operates on LUN bases (the whole disk) while we really just want to
>    operate on partitions

It's not only validation: some operations just affect the whole LUN
(e.g. Reserve/Release). And what about "multi-LUN" commands like
"report LUNs"?

>  * Can't pass SCSI commands through if the backend storage driver is bio-based
>    instead of request-based.
> 
> 4) Alternative approach: Using PVSCSI instead:
>  * Doubt PVSCSI can support as many type of backend storage devices as Xen-block.

pvSCSI won't need to support all types of backends. It's enough to
support those where passing through SCSI commands makes sense.

Seems to be a similar issue as the above mentioned problem with
bio-based backend storage drivers.

>  * Much longer path:
>    ioctl() -> SCSI upper layer -> Middle layer -> PVSCSI-frontend -> PVSCSI-backend -> Target framework(LIO?) ->
> 
>    With xen-block we only need:
>    ioctl() -> blkfront -> blkback ->

I'd like to see performance numbers before making a decision.

>  * xen-block has been existed for many years, widely used and more stable.

Adding another SCSI passthrough capability wasn't accepted for pvSCSI
(that's the reason I used the Target Framework). Why do you think it
will be accepted for pvblk?

This is not my personal opinion, just a heads up from someone who had a
try already. ;-)


Juergen

> Welcome any input, thank you!
> 
> Signed-off-by: Bob Liu <bob.liu@oracle.com>
> ---
>  xen/include/public/io/blkif.h |   73 +++++++++++++++++++++++++++++++++++++++++
>  1 file changed, 73 insertions(+)
> 
> diff --git a/xen/include/public/io/blkif.h b/xen/include/public/io/blkif.h
> index 99f0326..7c10bce 100644
> --- a/xen/include/public/io/blkif.h
> +++ b/xen/include/public/io/blkif.h
> @@ -635,6 +635,28 @@
>  #define BLKIF_OP_INDIRECT          6
>  
>  /*
> + * Recognised only if "feature-extra-request" is present in backend xenbus info.
> + * A request with BLKIF_OP_EXTRA_FLAG set indicates that an extra request
> + * follows it in the shared ring buffer.
> + *
> + * In this way, extra data such as a SCSI command, DIF/DIX, or other
> + * per-request/bio data can be transmitted from frontend to backend.
> + *
> + * The 'wire' format is like:
> + *  Request 1: xen_blkif_request
> + * [Request 2: xen_blkif_extra_request]    (only if request 1 has BLKIF_OP_EXTRA_FLAG)
> + *  Request 3: xen_blkif_request
> + *  Request 4: xen_blkif_request
> + * [Request 5: xen_blkif_extra_request]    (only if request 4 has BLKIF_OP_EXTRA_FLAG)
> + *  ...
> + *  Request N: xen_blkif_request
> + *
> + * If a backend does not recognize BLKIF_OP_EXTRA_FLAG, it should *not* create the
> + * "feature-extra-request" node!
> + */
> +#define BLKIF_OP_EXTRA_FLAG (0x80)
> +
> +/*
>   * Maximum scatter/gather segments per request.
>   * This is carefully chosen so that sizeof(blkif_ring_t) <= PAGE_SIZE.
>   * NB. This could be 12 if the ring indexes weren't stored in the same page.
> @@ -703,10 +725,61 @@ struct blkif_request_indirect {
>  };
>  typedef struct blkif_request_indirect blkif_request_indirect_t;
>  
> +enum blkif_extra_request_type {
> +	BLKIF_EXTRA_TYPE_SCSI_CMD = 1,		/* Transmit SCSI command.  */
> +};
> +
> +struct scsi_cmd_req {
> +	/*
> +	 * Grant mapping for transmitting the SCSI command to the backend,
> +	 * and for receiving sense data from the backend.
> +	 * One 4KB page is enough.
> +	 */
> +	grant_ref_t cmd_gref;
> +	/* Length of SCSI command in the grant mapped page. */
> +	unsigned int cmd_len;
> +
> +	/*
> +	 * A SCSI command may require transmitting a data segment shorter
> +	 * than a sector (512 bytes).
> +	 * Record num_sg and the last segment length in the extra request
> +	 * so that the backend knows about them.
> +	 */
> +	unsigned int num_sg;
> +	unsigned int last_sg_len;
> +};
> +
> +/*
> + * An extra request must follow a normal request, and a normal request
> + * can be followed by at most one extra request.
> + */
> +struct blkif_request_extra {
> +	uint8_t type;		/* BLKIF_EXTRA_TYPE_* */
> +	uint16_t _pad1;
> +#ifndef CONFIG_X86_32
> +	uint32_t _pad2;		/* offsetof(blkif_...,u.extra.id) == 8 */
> +#endif
> +	uint64_t id;
> +	struct scsi_cmd_req scsi_cmd;
> +} __attribute__((__packed__));
> +typedef struct blkif_request_extra blkif_request_extra_t;
> +
> +struct scsi_cmd_res {
> +	unsigned int resid_len;
> +	/* Length of sense data returned in grant mapped page. */
> +	unsigned int sense_len;
> +};
> +
> +struct blkif_response_extra {
> +	uint8_t type;  /* BLKIF_EXTRA_TYPE_* */
> +	struct scsi_cmd_res scsi_cmd;
> +} __attribute__((__packed__));
> +
>  struct blkif_response {
>      uint64_t        id;              /* copied from request */
>      uint8_t         operation;       /* copied from request */
>      int16_t         status;          /* BLKIF_RSP_???       */
> +    struct blkif_response_extra extra;
>  };
>  typedef struct blkif_response blkif_response_t;
>  
>
Paul Durrant Feb. 29, 2016, 9:13 a.m. UTC | #2
> -----Original Message-----
> From: Bob Liu [mailto:bob.liu@oracle.com]
> Sent: 29 February 2016 03:37
> To: xen-devel@lists.xen.org
> Cc: Ian Jackson; jbeulich@suse.com; Roger Pau Monne; jgross@suse.com;
> Paul Durrant; konrad.wilk@oracle.com; Bob Liu
> Subject: [RFC PATCH] xen-block: introduces extra request to pass-through
> SCSI commands
> 
> 1) What is this patch about?
> This patch introduces an new block operation (BLKIF_OP_EXTRA_FLAG).
> A request with BLKIF_OP_EXTRA_FLAG set means the following request is an
> extra request which is used to pass through SCSI commands.
> This is like a simplified version of XEN_NETIF_EXTRA_* in netif.h.
> It can be extended easily to transmit other per-request/bio data from
> frontend
> to backend e.g Data Integrity Field per bio.
> 
> 2) Why we need this?
> Currently only raw data segments are transmitted from blkfront to blkback,
> which
> means some advanced features are lost.
>  * Guest knows nothing about features of the real backend storage.
> 	For example, on bare-metal environment INQUIRY SCSI command
> can be used
> 	to query storage device information. If it's a SSD or flash device we
> 	can have the option to use the device as a fast cache.
> 	But this can't happen in current domU guests, because blkfront only
> 	knows it's just a normal virtual disk
> 

That's the sort of information that should be advertised via xenstore, then. There are already feature flags for specific things, but if some form of throughput/latency information is meaningful to a frontend stack then perhaps that can be advertised too.

>  * Failover Clusters in Windows
> 	Failover clusters require SCSI-3 persistent reservation target disks,
> 	but now this can't work in domU.
> 

That's true, but allowing arbitrary SCSI messages through is not the way forward IMO. Just because Windows thinks every HBA is SCSI doesn't mean other OSes do, so I think reservation/release should have dedicated messages in the blkif protocol if it's desirable to support clustering in the frontend.

> 3) Known issues:
>  * Security issues, how to 'validate' this extra request payload.
>    E.g SCSI operates on LUN bases (the whole disk) while we really just want
> to
>    operate on partitions
> 
>  * Can't pass SCSI commands through if the backend storage driver is bio-
> based
>    instead of request-based.
> 
> 4) Alternative approach: Using PVSCSI instead:
>  * Doubt PVSCSI can support as many type of backend storage devices as
> Xen-block.
> 

LIO can interface to any block device in much the same way blkback does IIRC.

>  * Much longer path:
>    ioctl() -> SCSI upper layer -> Middle layer -> PVSCSI-frontend -> PVSCSI-
> backend -> Target framework(LIO?) ->
> 
>    With xen-block we only need:
>    ioctl() -> blkfront -> blkback ->
> 

...and what happens if the block device that blkback is talking to is a SCSI LUN?

That latter path is also not true for Windows. You've got all the SCSI translation logic in the frontend when using blkif so that first path would collapse to:

Disk driver -> (SCSI) HBA Driver -> xen-scsiback -> LIO -> backstore -> XXX

>  * xen-block has been existed for many years, widely used and more stable.
> 

It's definitely widely used, but it has had stability issues in recent times.

  Paul

> Welcome any input, thank you!
> 
> Signed-off-by: Bob Liu <bob.liu@oracle.com>
> ---
>  xen/include/public/io/blkif.h |   73
> +++++++++++++++++++++++++++++++++++++++++
>  1 file changed, 73 insertions(+)
> 
> diff --git a/xen/include/public/io/blkif.h b/xen/include/public/io/blkif.h
> index 99f0326..7c10bce 100644
> --- a/xen/include/public/io/blkif.h
> +++ b/xen/include/public/io/blkif.h
> @@ -635,6 +635,28 @@
>  #define BLKIF_OP_INDIRECT          6
> 
>  /*
> + * Recognised only if "feature-extra-request" is present in backend xenbus
> info.
> + * A request with BLKIF_OP_EXTRA_FLAG indicates an extra request is
> followed
> + * in the shared ring buffer.
> + *
> + * By this way, extra data like SCSI command, DIF/DIX and other per-
> request/bio
> + * data can be transmitted from frontend to backend.
> + *
> + * The 'wire' format is like:
> + *  Request 1: xen_blkif_request
> + * [Request 2: xen_blkif_extra_request]    (only if request 1 has
> BLKIF_OP_EXTRA_FLAG)
> + *  Request 3: xen_blkif_request
> + *  Request 4: xen_blkif_request
> + * [Request 5: xen_blkif_extra_request]    (only if request 4 has
> BLKIF_OP_EXTRA_FLAG)
> + *  ...
> + *  Request N: xen_blkif_request
> + *
> + * If a backend does not recognize BLKIF_OP_EXTRA_FLAG, it should *not*
> create the
> + * "feature-extra-request" node!
> + */
> +#define BLKIF_OP_EXTRA_FLAG (0x80)
> +
> +/*
>   * Maximum scatter/gather segments per request.
>   * This is carefully chosen so that sizeof(blkif_ring_t) <= PAGE_SIZE.
>   * NB. This could be 12 if the ring indexes weren't stored in the same page.
> @@ -703,10 +725,61 @@ struct blkif_request_indirect {
>  };
>  typedef struct blkif_request_indirect blkif_request_indirect_t;
> 
> +enum blkif_extra_request_type {
> +	BLKIF_EXTRA_TYPE_SCSI_CMD = 1,		/* Transmit SCSI
> command.  */
> +};
> +
> +struct scsi_cmd_req {
> +	/*
> +	 * Grant mapping for transmiting SCSI command to backend, and
> +	 * also receive sense data from backend.
> +	 * One 4KB page is enough.
> +	 */
> +	grant_ref_t cmd_gref;
> +	/* Length of SCSI command in the grant mapped page. */
> +	unsigned int cmd_len;
> +
> +	/*
> +	 * SCSI command may require transmiting data segment length less
> +	 * than a sector(512 bytes).
> +	 * Record num_sg and last segment length in extra request so that
> +	 * backend can know about them.
> +	 */
> +	unsigned int num_sg;
> +	unsigned int last_sg_len;
> +};
> +
> +/*
> + * Extra request, must follow a normal-request and a normal-request can
> + * only be followed by one extra request.
> + */
> +struct blkif_request_extra {
> +	uint8_t type;		/* BLKIF_EXTRA_TYPE_* */
> +	uint16_t _pad1;
> +#ifndef CONFIG_X86_32
> +	uint32_t _pad2;		/* offsetof(blkif_...,u.extra.id) == 8
> */
> +#endif
> +	uint64_t id;
> +	struct scsi_cmd_req scsi_cmd;
> +} __attribute__((__packed__));
> +typedef struct blkif_request_extra blkif_request_extra_t;
> +
> +struct scsi_cmd_res {
> +	unsigned int resid_len;
> +	/* Length of sense data returned in grant mapped page. */
> +	unsigned int sense_len;
> +};
> +
> +struct blkif_response_extra {
> +	uint8_t type;  /* BLKIF_EXTRA_TYPE_* */
> +	struct scsi_cmd_res scsi_cmd;
> +} __attribute__((__packed__));
> +
>  struct blkif_response {
>      uint64_t        id;              /* copied from request */
>      uint8_t         operation;       /* copied from request */
>      int16_t         status;          /* BLKIF_RSP_???       */
> +    struct blkif_response_extra extra;
>  };
>  typedef struct blkif_response blkif_response_t;
> 
> --
> 1.7.10.4
Konrad Rzeszutek Wilk Feb. 29, 2016, 2:55 p.m. UTC | #3
On Mon, Feb 29, 2016 at 09:13:41AM +0000, Paul Durrant wrote:
> > -----Original Message-----
> > From: Bob Liu [mailto:bob.liu@oracle.com]
> > Sent: 29 February 2016 03:37
> > To: xen-devel@lists.xen.org
> > Cc: Ian Jackson; jbeulich@suse.com; Roger Pau Monne; jgross@suse.com;
> > Paul Durrant; konrad.wilk@oracle.com; Bob Liu
> > Subject: [RFC PATCH] xen-block: introduces extra request to pass-through
> > SCSI commands
> > 
> > 1) What is this patch about?
> > This patch introduces an new block operation (BLKIF_OP_EXTRA_FLAG).
> > A request with BLKIF_OP_EXTRA_FLAG set means the following request is an
> > extra request which is used to pass through SCSI commands.
> > This is like a simplified version of XEN_NETIF_EXTRA_* in netif.h.
> > It can be extended easily to transmit other per-request/bio data from
> > frontend
> > to backend e.g Data Integrity Field per bio.
> > 
> > 2) Why we need this?
> > Currently only raw data segments are transmitted from blkfront to blkback,
> > which
> > means some advanced features are lost.
> >  * Guest knows nothing about features of the real backend storage.
> > 	For example, on bare-metal environment INQUIRY SCSI command
> > can be used
> > 	to query storage device information. If it's a SSD or flash device we
> > 	can have the option to use the device as a fast cache.
> > 	But this can't happen in current domU guests, because blkfront only
> > 	knows it's just a normal virtual disk
> > 
> 
> That's the sort of information that should be advertised via xenstore then. There already feature flags for specific things but if some form of throughput/latency information is meaningful to a frontend stack then perhaps that can be advertised too.

This certainly could be put in XenStore. Do you envision this being done
pre guest creation (so the toolstack does it), or would the backend find
this out and populate the XenStore keys?

Or would the frontend write a XenStore key 'scsi-inq=vpd80' and the backend
respond by populating 'scsi-inq-vpd80=<binary blob>'? If so, can
XenStore accept binary payloads? Can they be more than 4K?


> 
> >  * Failover Clusters in Windows
> > 	Failover clusters require SCSI-3 persistent reservation target disks,
> > 	but now this can't work in domU.
> > 
> 
> That's true but allowing arbitrary SCSI messages through is not the way forward IMO. Just because Windows thinks every HBA is SCSI doesn't mean other OS do so I think reservation/release should have dedicated messages in the blkif protocol if it's desirable to support clustering in the frontend.

Could you expand a bit on the 'dedicated message' you have in mind please?

> 
> > 3) Known issues:
> >  * Security issues, how to 'validate' this extra request payload.
> >    E.g SCSI operates on LUN bases (the whole disk) while we really just want
> > to
> >    operate on partitions
> > 
> >  * Can't pass SCSI commands through if the backend storage driver is bio-
> > based
> >    instead of request-based.
> > 
> > 4) Alternative approach: Using PVSCSI instead:
> >  * Doubt PVSCSI can support as many type of backend storage devices as
> > Xen-block.
> > 
> 
> LIO can interface to any block device in much the same way blkback does IIRC.

But it can't do multipath or LVM, which are integral components.

Anyhow, that is more of an implementation-specific quirk.
> 
> >  * Much longer path:
> >    ioctl() -> SCSI upper layer -> Middle layer -> PVSCSI-frontend -> PVSCSI-
> > backend -> Target framework(LIO?) ->
> > 
> >    With xen-block we only need:
> >    ioctl() -> blkfront -> blkback ->
> > 
> 
> ...and what happens if the block device that blkback is talking to is a SCSI LUN?
> 
> That latter path is also not true for Windows. You've got all the SCSI translation logic in the frontend when using blkif so that first path would collapse to:
> 
> Disk driver -> (SCSI) HBA Driver -> xen-scsiback -> LIO -> backstore -> XXX

I don't know that the length of the path matters for, say, SCSI INQUIRY. It isn't
as if that is performance-critical. Neither are the clustering SCSI commands.

> 
> >  * xen-block has been existed for many years, widely used and more stable.
> > 
> 
> It's definitely widely used, but it has had stability issues in recent times.

Oh? Could you send the bug reports to me and Roger, CCing xen-devel and LKML, please?
Konrad Rzeszutek Wilk Feb. 29, 2016, 3:05 p.m. UTC | #4
On Mon, Feb 29, 2016 at 09:12:30AM +0100, Juergen Gross wrote:
> On 29/02/16 04:37, Bob Liu wrote:
> > 1) What is this patch about?
> > This patch introduces an new block operation (BLKIF_OP_EXTRA_FLAG).
> > A request with BLKIF_OP_EXTRA_FLAG set means the following request is an
> > extra request which is used to pass through SCSI commands.
> > This is like a simplified version of XEN_NETIF_EXTRA_* in netif.h.
> > It can be extended easily to transmit other per-request/bio data from frontend
> > to backend e.g Data Integrity Field per bio.
> > 
> > 2) Why we need this?
> > Currently only raw data segments are transmitted from blkfront to blkback, which
> > means some advanced features are lost.
> >  * Guest knows nothing about features of the real backend storage.
> > 	For example, on bare-metal environment INQUIRY SCSI command can be used
> > 	to query storage device information. If it's a SSD or flash device we
> > 	can have the option to use the device as a fast cache.
> > 	But this can't happen in current domU guests, because blkfront only
> > 	knows it's just a normal virtual disk
> > 
> >  * Failover Clusters in Windows
> > 	Failover clusters require SCSI-3 persistent reservation target disks,
> > 	but now this can't work in domU.
> > 
> > 3) Known issues:
> >  * Security issues, how to 'validate' this extra request payload.
> >    E.g SCSI operates on LUN bases (the whole disk) while we really just want to
> >    operate on partitions
> 
> It's not only validation: some operations just affect the whole LUN
> (e.g. Reserve/Release). And what about "multi-LUN" commands like
> "report LUNs"?

Don't expose them. Bob and I want to get an idea of what would be a good
compromise to allow some SCSI-specific (or perhaps ATA-specific or DIF/DIX?)
commands to go through the PV driver.

Would it be better if it went through XenBus? But that may not work for some
data that is tied closely to requests, such as DIF/DIX.

However, 'DISCARD' for example worked out: it is an umbrella for both the
SCSI UNMAP and ATA DISCARD operations and hides the complexity of the
low-level protocol. Could there be an 'INQ'? Since the SCSI VPD is the most
exhaustive in terms of details, it may make sense to base it on that.
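An 'INQ' umbrella in the spirit of DISCARD might look like the sketch below: a protocol-neutral reply structure that the backend fills from whatever it has (SCSI VPD pages, ATA IDENTIFY, or nothing). All names, fields and sizes here are hypothetical, invented only to make the idea concrete.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical abstract 'INQ' reply: protocol-neutral device facts. */
struct blkif_inq_info {
    uint8_t rotational;    /* 0 = SSD/flash, 1 = spinning disk */
    uint8_t pr_supported;  /* persistent reservations available */
    char    ident[64];     /* stable identifier, e.g. from VPD 0x83 */
};

/* Backend-side helper: derive the abstract reply from what the
 * backend knows about its storage, whatever the transport is. */
static void fill_inq(struct blkif_inq_info *info, int is_ssd,
                     const char *stable_id)
{
    memset(info, 0, sizeof(*info));
    info->rotational = is_ssd ? 0 : 1;
    info->pr_supported = 0;   /* conservative default */
    strncpy(info->ident, stable_id, sizeof(info->ident) - 1);
}
```

The frontend would then see "rotational" or "ident" rather than raw VPD bytes, just as it sees DISCARD rather than UNMAP vs. TRIM.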

> 
> >  * Can't pass SCSI commands through if the backend storage driver is bio-based
> >    instead of request-based.
> > 
> > 4) Alternative approach: Using PVSCSI instead:
> >  * Doubt PVSCSI can support as many type of backend storage devices as Xen-block.
> 
> pvSCSI won't need to support all types of backends. It's enough to
> support those where passing through SCSI commands makes sense.
> 
> Seems to be a similar issue as the above mentioned problem with
> bio-based backend storage drivers.

In particular, the ones we care about are:
 - Multipath over FibreChannel devices.
 - Linear mapping (LVM) over the multipath.
 - And then potentially a filesystem on top of that
 - .. and a raw file on the filesystem.

Having the SCSI VPD 0x83 page sent to the frontend for that would be good.

Not sure about SCSI reservations. Perhaps those are more unique,
in that the implementation would have to make sure that the guest
owns the whole LUN. But that is an implementation question.

This is about the design: how would you envision cramming
SCSI commands, DIF/DIX commands or ATA commands into the PV block layer?

> 
> >  * Much longer path:
> >    ioctl() -> SCSI upper layer -> Middle layer -> PVSCSI-frontend -> PVSCSI-backend -> Target framework(LIO?) ->
> > 
> >    With xen-block we only need:
> >    ioctl() -> blkfront -> blkback ->
> 
> I'd like to see performance numbers before making a decision.

For SCSI INQ? <laughs>

Or are you talking about raw READ/WRITE?
> 
> >  * xen-block has been existed for many years, widely used and more stable.
> 
> Adding another SCSI passthrough capability wasn't accepted for pvSCSI
> (that's the reason I used the Target Framework). Why do you think it
> will be accepted for pvblk?
> 
> This is not my personal opinion, just a heads up from someone who had a
> try already. ;-)

Right. So SCSI passthrough is out. What about mediated access for
specific SCSI, specific ATA, or DIF/DIX commands? And how would you do it,
knowing the SCSI maintainers much better than we do?
Paul Durrant Feb. 29, 2016, 3:28 p.m. UTC | #5
> -----Original Message-----
> From: Konrad Rzeszutek Wilk [mailto:konrad.wilk@oracle.com]
> Sent: 29 February 2016 14:56
> To: Paul Durrant
> Cc: Bob Liu; xen-devel@lists.xen.org; Ian Jackson; jbeulich@suse.com; Roger
> Pau Monne; jgross@suse.com
> Subject: Re: [RFC PATCH] xen-block: introduces extra request to pass-
> through SCSI commands
> 
> On Mon, Feb 29, 2016 at 09:13:41AM +0000, Paul Durrant wrote:
> > > -----Original Message-----
> > > From: Bob Liu [mailto:bob.liu@oracle.com]
> > > Sent: 29 February 2016 03:37
> > > To: xen-devel@lists.xen.org
> > > Cc: Ian Jackson; jbeulich@suse.com; Roger Pau Monne;
> jgross@suse.com;
> > > Paul Durrant; konrad.wilk@oracle.com; Bob Liu
> > > Subject: [RFC PATCH] xen-block: introduces extra request to pass-
> through
> > > SCSI commands
> > >
> > > 1) What is this patch about?
> > > This patch introduces an new block operation (BLKIF_OP_EXTRA_FLAG).
> > > A request with BLKIF_OP_EXTRA_FLAG set means the following request
> is an
> > > extra request which is used to pass through SCSI commands.
> > > This is like a simplified version of XEN_NETIF_EXTRA_* in netif.h.
> > > It can be extended easily to transmit other per-request/bio data from
> > > frontend
> > > to backend e.g Data Integrity Field per bio.
> > >
> > > 2) Why we need this?
> > > Currently only raw data segments are transmitted from blkfront to
> blkback,
> > > which
> > > means some advanced features are lost.
> > >  * Guest knows nothing about features of the real backend storage.
> > > 	For example, on bare-metal environment INQUIRY SCSI command
> > > can be used
> > > 	to query storage device information. If it's a SSD or flash device we
> > > 	can have the option to use the device as a fast cache.
> > > 	But this can't happen in current domU guests, because blkfront only
> > > 	knows it's just a normal virtual disk
> > >
> >
> > That's the sort of information that should be advertised via xenstore then.
> There already feature flags for specific things but if some form of
> throughput/latency information is meaningful to a frontend stack then
> perhaps that can be advertised too.
> 
> Certainly could be put on the XenStore. Do you envision this being done
> pre guest creation (so toolstack does it), or the backend finds this
> and populates the XenStore keys?
> 
> Or that the frontend writes an XenStore key 'scsi-inq=vpd80' and the
> backend
> responds by populating an 'scsi-inq-vpd80=' <binary blob>'? If so can
> the XenStore accept binary payloads? Can it be more than 4K?
> 

I was thinking more along the lines of blkback creating xenstore keys with any relevant information. We have sector size and number of sectors already but I don't see any harm in having values for other quantities that may be useful to a frontend. Bouncing SCSI inquiries via xenstore was certainly not what I was thinking.
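Paul's suggestion amounts to extending the set of backend nodes that blkback already writes next to the ring details. A hypothetical layout could look as follows; the first three node names exist in the blkif protocol today, while the last two are invented purely for illustration:

```text
# Existing backend nodes already carry device information:
.../backend/vbd/<domid>/<devid>/sector-size      = "512"
.../backend/vbd/<domid>/<devid>/sectors          = "204800"
.../backend/vbd/<domid>/<devid>/feature-discard  = "1"

# Hypothetical additions in the same style (names illustrative only):
.../backend/vbd/<domid>/<devid>/rotational       = "0"    # flash/SSD hint
.../backend/vbd/<domid>/<devid>/media-latency-us = "100"  # rough hint
```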

> 
> >
> > >  * Failover Clusters in Windows
> > > 	Failover clusters require SCSI-3 persistent reservation target disks,
> > > 	but now this can't work in domU.
> > >
> >
> > That's true but allowing arbitrary SCSI messages through is not the way
> forward IMO. Just because Windows thinks every HBA is SCSI doesn't mean
> other OS do so I think reservation/release should have dedicated messages
> in the blkif protocol if it's desirable to support clustering in the frontend.
> 
> Could you expand a bit on the 'dedicated message' you have in mind please?
> 

As in, we create message types to reserve/release whatever backend is being used, and the backend uses whatever mechanism is appropriate to deal with that. E.g. if it were qdisk talking to a raw file then that might just be a flock.
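The 'dedicated messages' idea can be sketched as two new blkif opcodes plus a per-backend translation. The opcode values, names and mechanism strings below are all hypothetical; the point is only that the frontend speaks one abstract reserve/release pair while each backend maps it onto what it actually has (SCSI persistent reservations for a LUN, a file lock for a raw file):

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical new blkif operations; values are illustrative only. */
#define BLKIF_OP_RESERVE  8
#define BLKIF_OP_RELEASE  9

enum backend_kind { BACKEND_SCSI_LUN, BACKEND_RAW_FILE };

/* Backend dispatch: pick the concrete mechanism for each abstract op. */
static const char *reserve_mechanism(uint8_t op, enum backend_kind k)
{
    if (op != BLKIF_OP_RESERVE && op != BLKIF_OP_RELEASE)
        return "unsupported";
    if (k == BACKEND_SCSI_LUN)
        return op == BLKIF_OP_RESERVE ? "scsi-pr-out-reserve"
                                      : "scsi-pr-out-release";
    return op == BLKIF_OP_RESERVE ? "flock-exclusive" : "flock-unlock";
}
```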

> >
> > > 3) Known issues:
> > >  * Security issues, how to 'validate' this extra request payload.
> > >    E.g SCSI operates on LUN bases (the whole disk) while we really just
> want
> > > to
> > >    operate on partitions
> > >
> > >  * Can't pass SCSI commands through if the backend storage driver is bio-
> > > based
> > >    instead of request-based.
> > >
> > > 4) Alternative approach: Using PVSCSI instead:
> > >  * Doubt PVSCSI can support as many type of backend storage devices as
> > > Xen-block.
> > >
> >
> > LIO can interface to any block device in much the same way blkback does
> IIRC.
> 
> But it can't do multipath or LVMs - which is an integral component.
> 

Really? I was not aware of that limitation, and it surprises me since AFAIK LIO can also use a raw file as a backstore, which seems like it would sit above either of those.

> Anyhow that is more of a implementation specific quirk.
> >
> > >  * Much longer path:
> > >    ioctl() -> SCSI upper layer -> Middle layer -> PVSCSI-frontend -> PVSCSI-
> > > backend -> Target framework(LIO?) ->
> > >
> > >    With xen-block we only need:
> > >    ioctl() -> blkfront -> blkback ->
> > >
> >
> > ...and what happens if the block device that blkback is talking to is a SCSI
> LUN?
> >
> > That latter path is also not true for Windows. You've got all the SCSI
> translation logic in the frontend when using blkif so that first path would
> collapse to:
> >
> > Disk driver -> (SCSI) HBA Driver -> xen-scsiback -> LIO -> backstore -> XXX
> 
> I don't know if it matters on the length of the path for say SCSI INQ. It isn't
> like
> that is performance specific. Neither are the clustering SCSI commands.
> 
> >
> > >  * xen-block has been existed for many years, widely used and more
> stable.
> > >
> >
> > It's definitely widely used, but it has had stability issues in recent times.
> 
> Oh? Could you send the bug-reports to me and Roger, CC xen-devel and
> LKML please ?

I was casting my mind back to incompatibilities that crept in with persistent grants. TBH I haven't used blkback much since then; I tend to use qemu qdisk as my backend these days.

  Paul
Jürgen Groß Feb. 29, 2016, 3:34 p.m. UTC | #6
On 29/02/16 16:05, Konrad Rzeszutek Wilk wrote:
> On Mon, Feb 29, 2016 at 09:12:30AM +0100, Juergen Gross wrote:
>> On 29/02/16 04:37, Bob Liu wrote:
>>> 1) What is this patch about?
>>> This patch introduces an new block operation (BLKIF_OP_EXTRA_FLAG).
>>> A request with BLKIF_OP_EXTRA_FLAG set means the following request is an
>>> extra request which is used to pass through SCSI commands.
>>> This is like a simplified version of XEN_NETIF_EXTRA_* in netif.h.
>>> It can be extended easily to transmit other per-request/bio data from frontend
>>> to backend e.g Data Integrity Field per bio.
>>>
>>> 2) Why we need this?
>>> Currently only raw data segments are transmitted from blkfront to blkback, which
>>> means some advanced features are lost.
>>>  * Guest knows nothing about features of the real backend storage.
>>> 	For example, on bare-metal environment INQUIRY SCSI command can be used
>>> 	to query storage device information. If it's a SSD or flash device we
>>> 	can have the option to use the device as a fast cache.
>>> 	But this can't happen in current domU guests, because blkfront only
>>> 	knows it's just a normal virtual disk
>>>
>>>  * Failover Clusters in Windows
>>> 	Failover clusters require SCSI-3 persistent reservation target disks,
>>> 	but now this can't work in domU.
>>>
>>> 3) Known issues:
>>>  * Security issues, how to 'validate' this extra request payload.
>>>    E.g SCSI operates on LUN bases (the whole disk) while we really just want to
>>>    operate on partitions
>>
>> It's not only validation: some operations just affect the whole LUN
>> (e.g. Reserve/Release). And what about "multi-LUN" commands like
>> "report LUNs"?
> 
> Don't expose them. Bob and I want to get an idea of what would be a good
> compromise to allow some SCSI specific (or perhaps ATA specific or DIF/DIX?) type of
> commands go through the PV driver.
> 
> Would it be better if it was through XenBus? But that may not work for some
> that are tied closely to requests, such as DIF/DIX.
> 
> However, the 'DISCARD' example worked out - it is an umbrella for both the
> SCSI UNMAP and ATA DISCARD operations and hides the complexity of the low-level
> protocol. Could there be an 'INQ'? Since the SCSI VPD is the most exhaustive
> in terms of detail it may make sense to base it on that...?
> 
>>
>>>  * Can't pass SCSI commands through if the backend storage driver is bio-based
>>>    instead of request-based.
>>>
>>> 4) Alternative approach: Using PVSCSI instead:
>>>  * Doubt PVSCSI can support as many types of backend storage devices as Xen-block.
>>
>> pvSCSI won't need to support all types of backends. It's enough to
>> support those where passing through SCSI commands makes sense.
>>
>> Seems to be a similar issue as the above mentioned problem with
>> bio-based backend storage drivers.
> 
> In particular the ones we care about are:
>  - Multipath over FibreChannel devices.
>  - Linear mapping (LVM) over the multipath.
>  - And then potentially a filesystem on top of that
>  - .. and a raw file on the filesystem.
> 
> Having SCSI VPD 0x83 page sent to frontend for that would be good.
> 
> Not sure about SCSI reservations. Perhaps those are more unique,
> in that the implementation would have to make sure that the guest
> owns the whole LUN. But that is an implementation question.
> 
> This is about the design - how would you envision cramming in
> SCSI commands or DIF/DIX commands or ATA commands via the PV block layer?

Have some kind of abstraction which can be mapped to SCSI commands
easily, but don't stick to the SCSI definitions. Use your own
structures, commands etc. and build SCSI commands from those in the
backend. This way you avoid having to emulate SCSI. Instead of
naming it "SCSI passthrough" call it "special commands". :-)
I would add only those operations you really need. Add an inquiry
operation which returns the supported "special commands". The
availability of the inquiry can be reported via Xenstore.

> 
>>
>>>  * Much longer path:
>>>    ioctl() -> SCSI upper layer -> Middle layer -> PVSCSI-frontend -> PVSCSI-backend -> Target framework(LIO?) ->
>>>
>>>    With xen-block we only need:
>>>    ioctl() -> blkfront -> blkback ->
>>
>> I'd like to see performance numbers before making a decision.
> 
> For SCSI INQ? <laughs>
> 
> Or are you talking about raw READ/WRITE?

READ/WRITE. I'm quite sure pvSCSI will be slower than pvblk, but I don't
know how much slower.

>>
>>>  * xen-block has existed for many years, and is widely used and more stable.
>>
>> Adding another SCSI passthrough capability wasn't accepted for pvSCSI
>> (that's the reason I used the Target Framework). Why do you think it
>> will be accepted for pvblk?
>>
>> This is not my personal opinion, just a heads up from someone who had a
>> try already. ;-)
> 
> Right. So SCSI passthrough out. What about mediated access for
> specific SCSI, or specific ATA, or DIF/DIX ones? And how would you do it
> knowing the SCSI maintainers much better than we do?

Don't call it SCSI command passthrough, or use the target framework.
As stated above: I think renaming the whole feature is better. You'll
avoid much trouble and weird configuration problems.

Juergen
Roger Pau Monné Feb. 29, 2016, 3:35 p.m. UTC | #7
El 29/2/16 a les 16:28, Paul Durrant ha escrit:
>> -----Original Message-----
>> From: Konrad Rzeszutek Wilk [mailto:konrad.wilk@oracle.com]
>> Sent: 29 February 2016 14:56
>> To: Paul Durrant
>> Cc: Bob Liu; xen-devel@lists.xen.org; Ian Jackson; jbeulich@suse.com; Roger
>> Pau Monne; jgross@suse.com
>> Subject: Re: [RFC PATCH] xen-block: introduces extra request to pass-
>> through SCSI commands
>>
>> On Mon, Feb 29, 2016 at 09:13:41AM +0000, Paul Durrant wrote:
>>>> -----Original Message-----
>>>> From: Bob Liu [mailto:bob.liu@oracle.com]
>>>> Sent: 29 February 2016 03:37
>>>> To: xen-devel@lists.xen.org
>>>> Cc: Ian Jackson; jbeulich@suse.com; Roger Pau Monne;
>> jgross@suse.com;
>>>> Paul Durrant; konrad.wilk@oracle.com; Bob Liu
>>>> Subject: [RFC PATCH] xen-block: introduces extra request to pass-
>>>>  * xen-block has been existed for many years, widely used and more
>> stable.
>>>>
>>>
>>> It's definitely widely used, but it has had stability issues in recent times.
>>
>> Oh? Could you send the bug-reports to me and Roger, CC xen-devel and
>> LKML please ?
> 
> I was casting my mind back to incompatibilities that crept in with persistent grants. TBH I haven't used blkback much since then; I tend to use qemu qdisk as my backend these days.

FWIW, QEMU Qdisk also has persistent grants support... and was
implemented more or less at the same time as blkback persistent grants.

Roger.
Ian Jackson Feb. 29, 2016, 4:14 p.m. UTC | #8
Bob Liu writes ("[RFC PATCH] xen-block: introduces extra request to pass-through SCSI commands"):
> 4) Alternative approach: Using PVSCSI instead:
>  * Doubt PVSCSI can support as many type of backend storage devices
>    as Xen-block.

I don't understand why this would be the case.  Your "extra requests"
are SCSI CDBs, so your proposal only works with SCSI targets.

>  * Much longer path:
>    ioctl() -> SCSI upper layer -> Middle layer -> PVSCSI-frontend -> PVSCSI-backend -> Target framework(LIO?) ->

The pvscsi pieces here are precisely the multiplexer that you really
want, for this all to work correctly.

If you want your guests to be able to drive this thing properly, you
need to make the thing look like a SCSI target anyway.  So effectively
you've reinvented PVSCSI.

When you say "longer path" do you really mean that the performance is
poor ?  I don't think this is due to the number of layers here.  The
layers are quite thin.

If there is a performance problem with PVSCSI then perhaps that should
be investigated.

Ian.
Ian Jackson Feb. 29, 2016, 4:29 p.m. UTC | #9
Ian Jackson writes ("Re: [RFC PATCH] xen-block: introduces extra request to pass-through SCSI commands"):
> [stuff suggesting use of PVSCSI instead]

For the avoidance of doubt:

1. Thanks very much for bringing this proposal to us at the concept
stage.  It is much easier to discuss these matters in a constructive
way before a lot of effort has been put into an implementation.

2. I should explain the downsides which I see in your proposal:

- Your suggestion has bad security properties: previously, the PV
  block protocol would present only a very simple and narrow
  interface.  Your SCSI CDB passthrough proposal means that guests
  would be able to activate features in SCSI targets which would be
  unexpected and unintended by the host administrator.  Such features
  would perhaps even be unknown to the host administrator.

  This could be mitigated by making this feature configurable, of
  course, defaulting to off, along with clear documentation.  But it's
  not a desirable property.

- For similar reasons it will often be difficult to use such a feature
  safely.  Guest software in particular might expect that it can
  safely use whatever features it can see, and do all sorts of
  exciting things.

- It involves duplicating multiplexing logic which already exists in
  PVSCSI.

Ian.
Konrad Rzeszutek Wilk Feb. 29, 2016, 4:48 p.m. UTC | #10
On Mon, Feb 29, 2016 at 04:35:36PM +0100, Roger Pau Monné wrote:
> El 29/2/16 a les 16:28, Paul Durrant ha escrit:
> >> -----Original Message-----
> >> From: Konrad Rzeszutek Wilk [mailto:konrad.wilk@oracle.com]
> >> Sent: 29 February 2016 14:56
> >> To: Paul Durrant
> >> Cc: Bob Liu; xen-devel@lists.xen.org; Ian Jackson; jbeulich@suse.com; Roger
> >> Pau Monne; jgross@suse.com
> >> Subject: Re: [RFC PATCH] xen-block: introduces extra request to pass-
> >> through SCSI commands
> >>
> >> On Mon, Feb 29, 2016 at 09:13:41AM +0000, Paul Durrant wrote:
> >>>> -----Original Message-----
> >>>> From: Bob Liu [mailto:bob.liu@oracle.com]
> >>>> Sent: 29 February 2016 03:37
> >>>> To: xen-devel@lists.xen.org
> >>>> Cc: Ian Jackson; jbeulich@suse.com; Roger Pau Monne;
> >> jgross@suse.com;
> >>>> Paul Durrant; konrad.wilk@oracle.com; Bob Liu
> >>>> Subject: [RFC PATCH] xen-block: introduces extra request to pass-
> >>>>  * xen-block has been existed for many years, widely used and more
> >> stable.
> >>>>
> >>>
> >>> It's definitely widely used, but it has had stability issues in recent times.
> >>
> >> Oh? Could you send the bug-reports to me and Roger, CC xen-devel and
> >> LKML please ?
> > 
> > I was casting my mind back to incompatibilities that crept in with persistent grants. TBH I haven't used blkback much since then; I tend to use qemu qdisk as my backend these days.
> 
> FWIW, QEMU Qdisk also has persistent grants support... and was
> implemented more or less at the same time as blkback persistent grants.

<nods>

Perhaps Paul's definition of 'recent times' means == two years or so :-)

:-)
> 
> Roger.
>
Paul Durrant Feb. 29, 2016, 4:56 p.m. UTC | #11
> -----Original Message-----
[snip]
> > > I was casting my mind back to incompatibilities that crept in with
> persistent grants. TBH I haven't used blkback much since then; I tend to use
> qemu qdisk as my backend these days.
> >
> > FWIW, QEMU Qdisk also has persistent grants support... and was
> > implemented more or less at the same time as blkback persistent grants.
> 
> <nods>
> 
> Perhaps Pauls' definition of  'recent times' means == two years or so :-)
> 
> :-)

Indeed. That's a blink of an eye for someone as ancient as me :-)

  Paul

> >
> > Roger.
> >
Bob Liu Feb. 29, 2016, 11:45 p.m. UTC | #13
On 03/01/2016 12:29 AM, Ian Jackson wrote:
> Ian Jackson writes ("Re: [RFC PATCH] xen-block: introduces extra request to pass-through SCSI commands"):
>> [stuff suggesting use of PVSCSI instead]
> 
> For the avoidance of doubt:
> 
> 1. Thanks very much for bringing this proposal to us at the concept
> stage.  It is much easier to discuss these matters in a constructive
> way before a lot of effort has been put into an implementation.
> 
> 2. I should explain the downsides which I see in your proposal:
> 
> - Your suggestion has bad security properties: previously, the PV
>   block protocol would present only a very simple and narrow
>   interface.  Your SCSI CDB passthrough proposal means that guests
>   would be able to activate features in SCSI targets which would be
>   unexpected and unintended by the host administrator.  Such features
>   would perhaps even be unknown to the host administrator.
> 
>   This could be mitigated by making this feature configurable, of
>   course, defaulting to off, along with clear documentation.  But it's
>   not a desirable property.
> 
> - For similar reasons it will often be difficult to use such a feature
>   safely.  Guest software in particular might expect that it can
>   safely use whatever features it can see, and do all sorts of
>   exciting things.
> 
> - It involves duplicating multiplexing logic which already exists in
>   PVSCSI.
> 

One thing I'm still not sure about with PVSCSI is whether we have the same security issue, since LIO can interface to any block device.
E.g. when using a partition such as /dev/sda1 as the PVSCSI backend, the PVSCSI frontend may still send SCSI operations on a LUN basis (the whole disk).

P.S. Thanks to all of you, it helps a lot!
Ian Jackson March 1, 2016, 6:08 p.m. UTC | #14
Bob Liu writes ("Re: [RFC PATCH] xen-block: introduces extra request to pass-through SCSI commands"):
> One thing I'm still not sure about PVSCSI is do we have the same security issue since LIO can interface to any block device.
> E.g when using a partition /dev/sda1 as the PVSCSI-backend, but the PVSCSI-frontend may still send SCSI operates on LUN bases (the whole disk).

I don't think you can use pvscsi to passthrough a partition such as
/dev/sda1.  Such a thing is not a SCSI command target.

Ian.
Jürgen Groß March 2, 2016, 7:39 a.m. UTC | #15
On 01/03/16 19:08, Ian Jackson wrote:
> Bob Liu writes ("Re: [RFC PATCH] xen-block: introduces extra request to pass-through SCSI commands"):
>> One thing I'm still not sure about PVSCSI is do we have the same security issue since LIO can interface to any block device.
>> E.g when using a partition /dev/sda1 as the PVSCSI-backend, but the PVSCSI-frontend may still send SCSI operates on LUN bases (the whole disk).
> 
> I don't think you can use pvscsi to passthrough a partition such as
> /dev/sda1.  Such a thing is not a SCSI command target.

It might be possible via the fileio target backend. In this case LUN
based SCSI operations are ignored/refused/emulated by LIO.


Juergen
Bob Liu March 2, 2016, 7:57 a.m. UTC | #16
Hi Juergen,

On 03/02/2016 03:39 PM, Juergen Gross wrote:
> On 01/03/16 19:08, Ian Jackson wrote:
>> Bob Liu writes ("Re: [RFC PATCH] xen-block: introduces extra request to pass-through SCSI commands"):
>>> One thing I'm still not sure about PVSCSI is do we have the same security issue since LIO can interface to any block device.
>>> E.g when using a partition /dev/sda1 as the PVSCSI-backend, but the PVSCSI-frontend may still send SCSI operates on LUN bases (the whole disk).
>>
>> I don't think you can use pvscsi to passthrough a partition such as
>> /dev/sda1.  Such a thing is not a SCSI command target.
> 
> It might be possible via the fileio target backend. In this case LUN
> based SCSI operations are ignored/refused/emulated by LIO.
> 

Do you know whether pvscsi can work on top of multipath (the device-mapper framework) or LVM?
Thank you!

Bob
Ian Jackson March 2, 2016, 11:40 a.m. UTC | #17
Bob Liu writes ("Re: [RFC PATCH] xen-block: introduces extra request to pass-through SCSI commands"):
> Do you know whether pvscsi can work on top of multipath(the device-mapper framework) or LVMs?

No, it can't.  devmapper and LVM work with the block device
abstraction.

Implicitly you seem to be suggesting that you want to use dm-multipath
and LVM, but also send other SCSI CDBs from the upper layers through
to the underlying SCSI storage target.

I can't see how that could cause anything but pain.  In many cases
"the underlying SCSI storage target" wouldn't be well defined.  Even
if it was, these side channel SCSI commands are likely to Go Wrong in
exciting ways.

What SCSI commands do you want to send ?

Ian.
Paul Durrant March 2, 2016, 11:46 a.m. UTC | #18
> -----Original Message-----
> From: Ian Jackson [mailto:Ian.Jackson@eu.citrix.com]
> Sent: 02 March 2016 11:41
> To: Bob Liu
> Cc: Juergen Gross; xen-devel@lists.xen.org; jbeulich@suse.com; Roger Pau
> Monne; Paul Durrant; konrad.wilk@oracle.com
> Subject: Re: [RFC PATCH] xen-block: introduces extra request to pass-
> through SCSI commands
> 
> Bob Liu writes ("Re: [RFC PATCH] xen-block: introduces extra request to
> pass-through SCSI commands"):
> > Do you know whether pvscsi can work on top of multipath(the device-
> mapper framework) or LVMs?
> 
> No, it can't.  devmapper and LVM work with the block device
> abstraction.
>

Google suggests you can use LIO to present an LV as a virtual SCSI LUN, in which case pvscsi should work.

  Paul
 
> Implicitly you seem to be suggesting that you want to use dm-multipath
> and LVM, but also send other SCSI CDBs from the upper layers through
> to the underlying SCSI storage target.
> 
> I can't see how that could cause anything but pain.  In many cases
> "the underlying SCSI storage target" wouldn't be well defined.  Even
> if it was, these side channel SCSI commands are likely to Go Wrong in
> exciting ways.
> 
> What SCSI commands do you want to send ?
> 
> Ian.
Jürgen Groß March 2, 2016, noon UTC | #19
On 02/03/16 12:46, Paul Durrant wrote:
>> -----Original Message-----
>> From: Ian Jackson [mailto:Ian.Jackson@eu.citrix.com]
>> Sent: 02 March 2016 11:41
>> To: Bob Liu
>> Cc: Juergen Gross; xen-devel@lists.xen.org; jbeulich@suse.com; Roger Pau
>> Monne; Paul Durrant; konrad.wilk@oracle.com
>> Subject: Re: [RFC PATCH] xen-block: introduces extra request to pass-
>> through SCSI commands
>>
>> Bob Liu writes ("Re: [RFC PATCH] xen-block: introduces extra request to
>> pass-through SCSI commands"):
>>> Do you know whether pvscsi can work on top of multipath(the device-
>> mapper framework) or LVMs?
>>
>> No, it can't.  devmapper and LVM work with the block device
>> abstraction.
>>
> 
> Google suggests you can use LIO to present an LV as a virtual SCSI LUN, in which case pvscsi should work.

Indeed: http://linux-iscsi.org/wiki/Targetcli#IBLOCK


Juergen
Bob Liu March 2, 2016, 12:28 p.m. UTC | #20
On 03/02/2016 07:40 PM, Ian Jackson wrote:
> Bob Liu writes ("Re: [RFC PATCH] xen-block: introduces extra request to pass-through SCSI commands"):
>> Do you know whether pvscsi can work on top of multipath(the device-mapper framework) or LVMs?
> 
> No, it can't.  devmapper and LVM work with the block device
> abstraction.
> 
> Implicitly you seem to be suggesting that you want to use dm-multipath
> and LVM, but also send other SCSI CDBs from the upper layers through
> to the underlying SCSI storage target.
> 

Exactly!

> I can't see how that could cause anything but pain.  In many cases
> "the underlying SCSI storage target" wouldn't be well defined.  Even
> if it was, these side channel SCSI commands are likely to Go Wrong in
> exciting ways.
> 
> What SCSI commands do you want to send ?
> 

* INQUIRY

* PERSISTENT RESERVE IN
* PERSISTENT RESERVE OUT
This is for Failover Clusters in Windows; I'm not sure whether more commands are required.
I didn't find a list of required SCSI commands in the failover document.
Ian Jackson March 2, 2016, 2:44 p.m. UTC | #21
Bob Liu writes ("Re: [RFC PATCH] xen-block: introduces extra request to pass-through SCSI commands"):
> On 03/02/2016 07:40 PM, Ian Jackson wrote:
> > I can't see how that could cause anything but pain.  In many cases
> > "the underlying SCSI storage target" wouldn't be well defined.  Even
> > if it was, these side channel SCSI commands are likely to Go Wrong in
> > exciting ways.
> > 
> > What SCSI commands do you want to send ?
> 
> * INQUIRY

... but why ?

> * PERSISTENT RESERVE IN
> * PERSISTENT RESERVE OUT
> 
> This is for Failover Clusters in Windows, not sure whether more
> commands are required.  I didn't get a required scsi commands list
> in the failover document.

So you want to be able to reserve the volume against concurrent
access ?  If you're using LVM, such a reservation should apply to the
LVM LV, not to the underlying physical storage device, clearly.  So I
think LIO [1] + PVSCSI might be what you want.

Ian.

[1] http://linux-iscsi.org/wiki/LIO
Paul Durrant March 3, 2016, 11:54 a.m. UTC | #22
> -----Original Message-----
> From: Konrad Rzeszutek Wilk [mailto:konrad.wilk@oracle.com]
> Sent: 02 March 2016 17:23
> To: Ian Jackson
> Cc: Bob Liu; Juergen Gross; xen-devel@lists.xen.org; jbeulich@suse.com;
> Roger Pau Monne; Paul Durrant
> Subject: Re: [RFC PATCH] xen-block: introduces extra request to pass-
> through SCSI commands
> 
> On Wed, Mar 02, 2016 at 02:44:12PM +0000, Ian Jackson wrote:
> > Bob Liu writes ("Re: [RFC PATCH] xen-block: introduces extra request to
> pass-through SCSI commands"):
> > > On 03/02/2016 07:40 PM, Ian Jackson wrote:
> > > > I can't see how that could cause anything but pain.  In many cases
> > > > "the underlying SCSI storage target" wouldn't be well defined.  Even
> > > > if it was, these side channel SCSI commands are likely to Go Wrong in
> > > > exciting ways.
> > > >
> > > > What SCSI commands do you want to send ?
> > >
> > > * INQUIRY
> >
> > ... but why ?
> 
> Need to expose VPD information to the guest for application usage.
> 
> >
> > > * PERSISTENT RESERVE IN
> > > * PERSISTENT RESERVE OUT
> > >
> > > This is for Failover Clusters in Windows, not sure whether more
> > > commands are required.  I didn't get a required scsi commands list
> > > in the failover document.
> >
> > So you want to be able to reserve the volume against concurrent
> > access ?  If you're using LVM, such a reservation should apply to the
> > LVM LV, not to the underlying physical storage device, clearly.  So I
> > think LIO [1] + PVSCSI might be what you want.
> >
> 
> Except it would mean writting a new Windows PV driver and replacing the
> existing block one.

Actually, that's quite desirable. The existing driver is actually a SCSI HBA driver (because that's all you can write if you use the Windows STORPORT wrapper... which you need to if you want to pass WHQL), so making it use a PV SCSI protocol would probably make it smaller, since all the scsi <-> blkif translation code could be ripped out and all the stuff to synthesize INQUIRY responses based on vbd data in xenstore could go away too.

  Paul

> 
> 
> > Ian.
> >
> > [1] http://linux-iscsi.org/wiki/LIO
Ian Jackson March 3, 2016, 12:03 p.m. UTC | #23
Paul Durrant writes ("RE: [RFC PATCH] xen-block: introduces extra request to pass-through SCSI commands"):
> Actually, that's quite desirable. The existing driver is actually a
> SCSI HBA driver (because that's all you can write is you use thr
> Windows STORPORT wrapper... which you need to if you want to pass
> WHQL) so making it use a PV scsi protocol would probably make it
> smaller since all the scsi <-> blkif translation code could be
> ripped out and all the stuff to synthesize INQUIRY responses based
> on vbd data in xenstore could go away too.

Do non-Linux hosts have something like LIO ?

Ian.
Jürgen Groß March 3, 2016, 12:25 p.m. UTC | #24
On 03/03/16 13:03, Ian Jackson wrote:
> Paul Durrant writes ("RE: [RFC PATCH] xen-block: introduces extra request to pass-through SCSI commands"):
>> Actually, that's quite desirable. The existing driver is actually a
>> SCSI HBA driver (because that's all you can write is you use thr
>> Windows STORPORT wrapper... which you need to if you want to pass
>> WHQL) so making it use a PV scsi protocol would probably make it
>> smaller since all the scsi <-> blkif translation code could be
>> ripped out and all the stuff to synthesize INQUIRY responses based
>> on vbd data in xenstore could go away too.
> 
> Do non-Linux hosts have something like LIO ?

Are there any pvSCSI backends other than the one in Linux?


Juergen
Konrad Rzeszutek Wilk March 3, 2016, 2:07 p.m. UTC | #25
On March 3, 2016 7:25:11 AM EST, Juergen Gross <jgross@suse.com> wrote:
>On 03/03/16 13:03, Ian Jackson wrote:
>> Paul Durrant writes ("RE: [RFC PATCH] xen-block: introduces extra
>request to pass-through SCSI commands"):
>>> Actually, that's quite desirable. The existing driver is actually a
>>> SCSI HBA driver (because that's all you can write is you use thr
>>> Windows STORPORT wrapper... which you need to if you want to pass
>>> WHQL) so making it use a PV scsi protocol would probably make it
>>> smaller since all the scsi <-> blkif translation code could be
>>> ripped out and all the stuff to synthesize INQUIRY responses based
>>> on vbd data in xenstore could go away too.
>> 

Wait a minute? You have that in the driver? Is this documented somewhere? Did I miss this in blkif.h?


>> Do non-Linux hosts have something like LIO ?
>
>Are there any pvSCSI backends other than the one in Linux?
>

Not that I know of.
Paul Durrant March 3, 2016, 2:19 p.m. UTC | #26
> -----Original Message-----
> From: Xen-devel [mailto:xen-devel-bounces@lists.xen.org] On Behalf Of
> Konrad Rzeszutek Wilk
> Sent: 03 March 2016 14:07
> To: Juergen Gross; Ian Jackson; Paul Durrant
> Cc: xen-devel@lists.xen.org; jbeulich@suse.com; Roger Pau Monne
> Subject: Re: [Xen-devel] [RFC PATCH] xen-block: introduces extra request to
> pass-through SCSI commands
> 
> On March 3, 2016 7:25:11 AM EST, Juergen Gross <jgross@suse.com> wrote:
> >On 03/03/16 13:03, Ian Jackson wrote:
> >> Paul Durrant writes ("RE: [RFC PATCH] xen-block: introduces extra
> >request to pass-through SCSI commands"):
> >>> Actually, that's quite desirable. The existing driver is actually a
> >>> SCSI HBA driver (because that's all you can write is you use thr
> >>> Windows STORPORT wrapper... which you need to if you want to pass
> >>> WHQL) so making it use a PV scsi protocol would probably make it
> >>> smaller since all the scsi <-> blkif translation code could be
> >>> ripped out and all the stuff to synthesize INQUIRY responses based
> >>> on vbd data in xenstore could go away too.
> >>
> 
> Wait a minute? You have that in the driver? Is this documented somewhere?
> Did I miss this in blkif.h?
> 

Yes, the Windows XENVBD driver has a whole load of code for synthesizing INQUIRY responses, since that's how Windows probes for disks. There's nothing in blkif.h for any of this though, since it’s an implementation detail of the Windows frontend. However, if you want to take a look, go to http://xenbits.xen.org/gitweb/?p=pvdrivers/win/xenvbd.git;a=blob;f=src/xenvbd/pdoinquiry.c. There's even code in there for base64 decoding VPD pages from xenstore! (That's a XenServer implementation artefact... I'm not recommending it as a way forward ;-))

  Paul

> >> Do non-Linux hosts have something like LIO ?
> >
> >Are there any pvSCSI backends other than the one in Linux?
> >
> 
> Not that I know off.
> 
> _______________________________________________
> Xen-devel mailing list
> Xen-devel@lists.xen.org
> http://lists.xen.org/xen-devel
Patch

diff --git a/xen/include/public/io/blkif.h b/xen/include/public/io/blkif.h
index 99f0326..7c10bce 100644
--- a/xen/include/public/io/blkif.h
+++ b/xen/include/public/io/blkif.h
@@ -635,6 +635,28 @@ 
 #define BLKIF_OP_INDIRECT          6
 
 /*
+ * Recognised only if "feature-extra-request" is present in backend xenbus info.
+ * A request with BLKIF_OP_EXTRA_FLAG set indicates that an extra request
+ * follows in the shared ring buffer.
+ *
+ * In this way, extra data like SCSI commands, DIF/DIX and other per-request/bio
+ * data can be transmitted from frontend to backend.
+ *
+ * The 'wire' format is like:
+ *  Request 1: xen_blkif_request
+ * [Request 2: xen_blkif_extra_request]    (only if request 1 has BLKIF_OP_EXTRA_FLAG)
+ *  Request 3: xen_blkif_request
+ *  Request 4: xen_blkif_request
+ * [Request 5: xen_blkif_extra_request]    (only if request 4 has BLKIF_OP_EXTRA_FLAG)
+ *  ...
+ *  Request N: xen_blkif_request
+ *
+ * If a backend does not recognize BLKIF_OP_EXTRA_FLAG, it should *not* create the
+ * "feature-extra-request" node!
+ */
+#define BLKIF_OP_EXTRA_FLAG (0x80)
+
+/*
  * Maximum scatter/gather segments per request.
  * This is carefully chosen so that sizeof(blkif_ring_t) <= PAGE_SIZE.
  * NB. This could be 12 if the ring indexes weren't stored in the same page.
@@ -703,10 +725,61 @@  struct blkif_request_indirect {
 };
 typedef struct blkif_request_indirect blkif_request_indirect_t;
 
+enum blkif_extra_request_type {
+	BLKIF_EXTRA_TYPE_SCSI_CMD = 1,		/* Transmit SCSI command.  */
+};
+
+struct scsi_cmd_req {
+	/*
+	 * Grant mapping for transmitting the SCSI command to the backend,
+	 * and also for receiving sense data from the backend.
+	 * One 4KB page is enough.
+	 */
+	grant_ref_t cmd_gref;
+	/* Length of SCSI command in the grant mapped page. */
+	unsigned int cmd_len;
+
+	/*
+	 * A SCSI command may require transmitting a data segment whose
+	 * length is less than a sector (512 bytes).
+	 * Record num_sg and the last segment length in the extra request
+	 * so that the backend knows about them.
+	 */
+	unsigned int num_sg;
+	unsigned int last_sg_len;
+};
+
+/*
+ * An extra request must follow a normal request, and a normal request
+ * can be followed by at most one extra request.
+ */
+struct blkif_request_extra {
+	uint8_t type;		/* BLKIF_EXTRA_TYPE_* */
+	uint16_t _pad1;
+#ifndef CONFIG_X86_32
+	uint32_t _pad2;		/* offsetof(blkif_...,u.extra.id) == 8 */
+#endif
+	uint64_t id;
+	struct scsi_cmd_req scsi_cmd;
+} __attribute__((__packed__));
+typedef struct blkif_request_extra blkif_request_extra_t;
+
+struct scsi_cmd_res {
+	unsigned int resid_len;
+	/* Length of sense data returned in grant mapped page. */
+	unsigned int sense_len;
+};
+
+struct blkif_response_extra {
+	uint8_t type;  /* BLKIF_EXTRA_TYPE_* */
+	struct scsi_cmd_res scsi_cmd;
+} __attribute__((__packed__));
+
 struct blkif_response {
     uint64_t        id;              /* copied from request */
     uint8_t         operation;       /* copied from request */
     int16_t         status;          /* BLKIF_RSP_???       */
+    struct blkif_response_extra extra;
 };
 typedef struct blkif_response blkif_response_t;