diff mbox

[RFC] Data integrity extension support for xen-block

Message ID 1460023200-20255-1-git-send-email-bob.liu@oracle.com (mailing list archive)
State New, archived
Headers show

Commit Message

Bob Liu April 7, 2016, 10 a.m. UTC
* What's data integrity extension and why?
Modern filesystems feature checksumming of data and metadata to protect against
data corruption.  However, the detection of the corruption is done at read time
which could potentially be months after the data was written.  At that point the
original data that the application tried to write is most likely lost.

The solution in Linux is the data integrity framework which enables protection
information to be pinned to I/Os and sent to/received from controllers that
support it. struct bio has been extended with a pointer to a struct bip which
in turn contains the integrity metadata. The bip is essentially a trimmed down
bio with a bio_vec and some housekeeping.

* Issues when xen-block get involved.
xen-blkfront only transmits the normal data of struct bio while the integrity
metadata buffer(struct bio_integrity_payload in each bio) is ignored.

* Proposal of transmitting bio integrity payload.
Adding an extra request following the normal data request, this extra request
contains the integrity payload.
The xen-blkback will reconstruct an new bio with both received normal data and
integrity metadata.

Welcome any better ideas, thank you!

[1] http://lwn.net/Articles/280023/
[2] https://www.kernel.org/doc/Documentation/block/data-integrity.txt

Signed-off-by: Bob Liu <bob.liu@oracle.com>
---
 xen/include/public/io/blkif.h |   50 +++++++++++++++++++++++++++++++++++++++++
 1 file changed, 50 insertions(+)

Comments

Ian Jackson April 7, 2016, 3:32 p.m. UTC | #1
Bob Liu writes ("[RFC PATCH] Data integrity extension support for xen-block"):
> +/*
> + * Extra request, must follow a normal-request and a normal-request can
> + * only be followed by one extra request.
> + */

I don't myself have (yet) an opinion about the syntax of this.  I'd
like to hear from others.

...
> +enum blkif_extra_request_type {
> +	BLKIF_EXTRA_TYPE_DIX = 1,		/* Data integrity extension payload.  */

This needs a definition of the semantics of the payload.  I suggest
this be done by references to appropriate external specifications.

> + * Recognized only if "feature-extra-request" is present in backend
> + * xenbus info.

(Wordwrapped for quoting: please wrap it yourself.)

I think the frontend needs to know whether the data integrity
extension is supported, not whether the extra request is supported.

If the supported length of the integrity data might vary (which I
think it might), it also needs to know that length.

Ian.
Jürgen Groß April 7, 2016, 3:55 p.m. UTC | #2
On 07/04/16 12:00, Bob Liu wrote:
> * What's data integrity extension and why?
> Modern filesystems feature checksumming of data and metadata to protect against
> data corruption.  However, the detection of the corruption is done at read time
> which could potentially be months after the data was written.  At that point the
> original data that the application tried to write is most likely lost.
> 
> The solution in Linux is the data integrity framework which enables protection
> information to be pinned to I/Os and sent to/received from controllers that
> support it. struct bio has been extended with a pointer to a struct bip which
> in turn contains the integrity metadata. The bip is essentially a trimmed down
> bio with a bio_vec and some housekeeping.
> 
> * Issues when xen-block get involved.
> xen-blkfront only transmits the normal data of struct bio while the integrity
> metadata buffer(struct bio_integrity_payload in each bio) is ignored.
> 
> * Proposal of transmitting bio integrity payload.
> Adding an extra request following the normal data request, this extra request
> contains the integrity payload.
> The xen-blkback will reconstruct an new bio with both received normal data and
> integrity metadata.
> 
> Welcome any better ideas, thank you!
> 
> [1] http://lwn.net/Articles/280023/
> [2] https://www.kernel.org/doc/Documentation/block/data-integrity.txt
> 
> Signed-off-by: Bob Liu <bob.liu@oracle.com>
> ---
>  xen/include/public/io/blkif.h |   50 +++++++++++++++++++++++++++++++++++++++++
>  1 file changed, 50 insertions(+)
> 
> diff --git a/xen/include/public/io/blkif.h b/xen/include/public/io/blkif.h
> index 99f0326..3d8d39f 100644
> --- a/xen/include/public/io/blkif.h
> +++ b/xen/include/public/io/blkif.h
> @@ -635,6 +635,28 @@
>  #define BLKIF_OP_INDIRECT          6
>  
>  /*
> + * Recognized only if "feature-extra-request" is present in backend xenbus info.
> + * A request with BLKIF_OP_EXTRA_FLAG indicates an extra request is followed
> + * in the shared ring buffer.
> + *
> + * By this way, extra data like bio integrity payload can be transmitted from
> + * frontend to backend.
> + *
> + * The 'wire' format is like:
> + *  Request 1: xen_blkif_request
> + * [Request 2: xen_blkif_extra_request]    (only if request 1 has BLKIF_OP_EXTRA_FLAG)
> + *  Request 3: xen_blkif_request
> + *  Request 4: xen_blkif_request
> + * [Request 5: xen_blkif_extra_request]    (only if request 4 has BLKIF_OP_EXTRA_FLAG)
> + *  ...
> + *  Request N: xen_blkif_request
> + *
> + * If a backend does not recognize BLKIF_OP_EXTRA_FLAG, it should *not* create the
> + * "feature-extra-request" node!
> + */
> +#define BLKIF_OP_EXTRA_FLAG (0x80)
> +
> +/*
>   * Maximum scatter/gather segments per request.
>   * This is carefully chosen so that sizeof(blkif_ring_t) <= PAGE_SIZE.
>   * NB. This could be 12 if the ring indexes weren't stored in the same page.
> @@ -703,6 +725,34 @@ struct blkif_request_indirect {
>  };
>  typedef struct blkif_request_indirect blkif_request_indirect_t;
>  
> +enum blkif_extra_request_type {
> +	BLKIF_EXTRA_TYPE_DIX = 1,		/* Data integrity extension payload.  */
> +};
> +
> +struct bio_integrity_req {
> +	/*
> +	 * Grant mapping for transmitting bio integrity payload to backend.
> +	 */
> +	grant_ref_t *gref;
> +	unsigned int nr_grefs;
> +	unsigned int len;
> +};

How does the payload look like? It's structure should be defined here
or a reference to it's definition in case it is a standard should be
given.

> +
> +/*
> + * Extra request, must follow a normal-request and a normal-request can
> + * only be followed by one extra request.
> + */

Wouldn't it be possible to include this in the original request (e.g.
it could be the first or last indirect segment) ?


Juergen

> +struct blkif_request_extra {
> +	uint8_t type;		/* BLKIF_EXTRA_TYPE_* */
> +	uint16_t _pad1;
> +#ifndef CONFIG_X86_32
> +	uint32_t _pad2;		/* offsetof(blkif_...,u.extra.id) == 8 */
> +#endif
> +	uint64_t id;
> +	struct bio_integrity_req bi_req;
> +} __attribute__((__packed__));
> +typedef struct blkif_request_extra blkif_request_extra_t;
> +
>  struct blkif_response {
>      uint64_t        id;              /* copied from request */
>      uint8_t         operation;       /* copied from request */
>
Bob Liu April 8, 2016, 1:24 a.m. UTC | #3
On 04/07/2016 11:55 PM, Juergen Gross wrote:
> On 07/04/16 12:00, Bob Liu wrote:
>> * What's data integrity extension and why?
>> Modern filesystems feature checksumming of data and metadata to protect against
>> data corruption.  However, the detection of the corruption is done at read time
>> which could potentially be months after the data was written.  At that point the
>> original data that the application tried to write is most likely lost.
>>
>> The solution in Linux is the data integrity framework which enables protection
>> information to be pinned to I/Os and sent to/received from controllers that
>> support it. struct bio has been extended with a pointer to a struct bip which
>> in turn contains the integrity metadata. The bip is essentially a trimmed down
>> bio with a bio_vec and some housekeeping.
>>
>> * Issues when xen-block get involved.
>> xen-blkfront only transmits the normal data of struct bio while the integrity
>> metadata buffer(struct bio_integrity_payload in each bio) is ignored.
>>
>> * Proposal of transmitting bio integrity payload.
>> Adding an extra request following the normal data request, this extra request
>> contains the integrity payload.
>> The xen-blkback will reconstruct an new bio with both received normal data and
>> integrity metadata.
>>
>> Welcome any better ideas, thank you!
>>
>> [1] http://lwn.net/Articles/280023/
>> [2] https://www.kernel.org/doc/Documentation/block/data-integrity.txt
>>
>> Signed-off-by: Bob Liu <bob.liu@oracle.com>
>> ---
>>  xen/include/public/io/blkif.h |   50 +++++++++++++++++++++++++++++++++++++++++
>>  1 file changed, 50 insertions(+)
>>
>> diff --git a/xen/include/public/io/blkif.h b/xen/include/public/io/blkif.h
>> index 99f0326..3d8d39f 100644
>> --- a/xen/include/public/io/blkif.h
>> +++ b/xen/include/public/io/blkif.h
>> @@ -635,6 +635,28 @@
>>  #define BLKIF_OP_INDIRECT          6
>>  
>>  /*
>> + * Recognized only if "feature-extra-request" is present in backend xenbus info.
>> + * A request with BLKIF_OP_EXTRA_FLAG indicates an extra request is followed
>> + * in the shared ring buffer.
>> + *
>> + * By this way, extra data like bio integrity payload can be transmitted from
>> + * frontend to backend.
>> + *
>> + * The 'wire' format is like:
>> + *  Request 1: xen_blkif_request
>> + * [Request 2: xen_blkif_extra_request]    (only if request 1 has BLKIF_OP_EXTRA_FLAG)
>> + *  Request 3: xen_blkif_request
>> + *  Request 4: xen_blkif_request
>> + * [Request 5: xen_blkif_extra_request]    (only if request 4 has BLKIF_OP_EXTRA_FLAG)
>> + *  ...
>> + *  Request N: xen_blkif_request
>> + *
>> + * If a backend does not recognize BLKIF_OP_EXTRA_FLAG, it should *not* create the
>> + * "feature-extra-request" node!
>> + */
>> +#define BLKIF_OP_EXTRA_FLAG (0x80)
>> +
>> +/*
>>   * Maximum scatter/gather segments per request.
>>   * This is carefully chosen so that sizeof(blkif_ring_t) <= PAGE_SIZE.
>>   * NB. This could be 12 if the ring indexes weren't stored in the same page.
>> @@ -703,6 +725,34 @@ struct blkif_request_indirect {
>>  };
>>  typedef struct blkif_request_indirect blkif_request_indirect_t;
>>  
>> +enum blkif_extra_request_type {
>> +	BLKIF_EXTRA_TYPE_DIX = 1,		/* Data integrity extension payload.  */
>> +};
>> +
>> +struct bio_integrity_req {
>> +	/*
>> +	 * Grant mapping for transmitting bio integrity payload to backend.
>> +	 */
>> +	grant_ref_t *gref;
>> +	unsigned int nr_grefs;
>> +	unsigned int len;
>> +};
> 
> How does the payload look like? It's structure should be defined here
> or a reference to it's definition in case it is a standard should be
> given.
> 

The payload is also described using struct bio_vec(the same as bio).

/*
 * bio integrity payload
 */
struct bio_integrity_payload {
	struct bio		*bip_bio;	/* parent bio */

	struct bvec_iter	bip_iter;

	bio_end_io_t		*bip_end_io;	/* saved I/O completion fn */

	unsigned short		bip_slab;	/* slab the bip came from */
	unsigned short		bip_vcnt;	/* # of integrity bio_vecs */
	unsigned short		bip_max_vcnt;	/* integrity bio_vec slots */
	unsigned short		bip_flags;	/* control flags */

	struct work_struct	bip_work;	/* I/O completion */

	struct bio_vec		*bip_vec;
	struct bio_vec		bip_inline_vecs[0];/* embedded bvec array */
};

>> +
>> +/*
>> + * Extra request, must follow a normal-request and a normal-request can
>> + * only be followed by one extra request.
>> + */
> 
> Wouldn't it be possible to include this in the original request (e.g.
> it could be the first or last indirect segment) ?
> 

Yes, that's also an option.

The common way for block layer/driver to handle integrity metadata is using two scatterlists.
One containing the data as usual, and one containing the integrity metadata.
The block driver should transmit both two scatterlists.

I'm worry about you may think embedding the metadata scatterlist in the original request is too hacky.

Roger, do you have any better idea?

Thanks,
Bob
Jürgen Groß April 8, 2016, 4:04 a.m. UTC | #4
On 08/04/16 03:24, Bob Liu wrote:
> 
> On 04/07/2016 11:55 PM, Juergen Gross wrote:
>> On 07/04/16 12:00, Bob Liu wrote:
>>> * What's data integrity extension and why?
>>> Modern filesystems feature checksumming of data and metadata to protect against
>>> data corruption.  However, the detection of the corruption is done at read time
>>> which could potentially be months after the data was written.  At that point the
>>> original data that the application tried to write is most likely lost.
>>>
>>> The solution in Linux is the data integrity framework which enables protection
>>> information to be pinned to I/Os and sent to/received from controllers that
>>> support it. struct bio has been extended with a pointer to a struct bip which
>>> in turn contains the integrity metadata. The bip is essentially a trimmed down
>>> bio with a bio_vec and some housekeeping.
>>>
>>> * Issues when xen-block get involved.
>>> xen-blkfront only transmits the normal data of struct bio while the integrity
>>> metadata buffer(struct bio_integrity_payload in each bio) is ignored.
>>>
>>> * Proposal of transmitting bio integrity payload.
>>> Adding an extra request following the normal data request, this extra request
>>> contains the integrity payload.
>>> The xen-blkback will reconstruct an new bio with both received normal data and
>>> integrity metadata.
>>>
>>> Welcome any better ideas, thank you!
>>>
>>> [1] http://lwn.net/Articles/280023/
>>> [2] https://www.kernel.org/doc/Documentation/block/data-integrity.txt
>>>
>>> Signed-off-by: Bob Liu <bob.liu@oracle.com>
>>> ---
>>>  xen/include/public/io/blkif.h |   50 +++++++++++++++++++++++++++++++++++++++++
>>>  1 file changed, 50 insertions(+)
>>>
>>> diff --git a/xen/include/public/io/blkif.h b/xen/include/public/io/blkif.h
>>> index 99f0326..3d8d39f 100644
>>> --- a/xen/include/public/io/blkif.h
>>> +++ b/xen/include/public/io/blkif.h
>>> @@ -635,6 +635,28 @@
>>>  #define BLKIF_OP_INDIRECT          6
>>>  
>>>  /*
>>> + * Recognized only if "feature-extra-request" is present in backend xenbus info.
>>> + * A request with BLKIF_OP_EXTRA_FLAG indicates an extra request is followed
>>> + * in the shared ring buffer.
>>> + *
>>> + * By this way, extra data like bio integrity payload can be transmitted from
>>> + * frontend to backend.
>>> + *
>>> + * The 'wire' format is like:
>>> + *  Request 1: xen_blkif_request
>>> + * [Request 2: xen_blkif_extra_request]    (only if request 1 has BLKIF_OP_EXTRA_FLAG)
>>> + *  Request 3: xen_blkif_request
>>> + *  Request 4: xen_blkif_request
>>> + * [Request 5: xen_blkif_extra_request]    (only if request 4 has BLKIF_OP_EXTRA_FLAG)
>>> + *  ...
>>> + *  Request N: xen_blkif_request
>>> + *
>>> + * If a backend does not recognize BLKIF_OP_EXTRA_FLAG, it should *not* create the
>>> + * "feature-extra-request" node!
>>> + */
>>> +#define BLKIF_OP_EXTRA_FLAG (0x80)
>>> +
>>> +/*
>>>   * Maximum scatter/gather segments per request.
>>>   * This is carefully chosen so that sizeof(blkif_ring_t) <= PAGE_SIZE.
>>>   * NB. This could be 12 if the ring indexes weren't stored in the same page.
>>> @@ -703,6 +725,34 @@ struct blkif_request_indirect {
>>>  };
>>>  typedef struct blkif_request_indirect blkif_request_indirect_t;
>>>  
>>> +enum blkif_extra_request_type {
>>> +	BLKIF_EXTRA_TYPE_DIX = 1,		/* Data integrity extension payload.  */
>>> +};
>>> +
>>> +struct bio_integrity_req {
>>> +	/*
>>> +	 * Grant mapping for transmitting bio integrity payload to backend.
>>> +	 */
>>> +	grant_ref_t *gref;
>>> +	unsigned int nr_grefs;
>>> +	unsigned int len;
>>> +};
>>
>> How does the payload look like? It's structure should be defined here
>> or a reference to it's definition in case it is a standard should be
>> given.
>>
> 
> The payload is also described using struct bio_vec(the same as bio).
> 
> /*
>  * bio integrity payload
>  */
> struct bio_integrity_payload {
> 	struct bio		*bip_bio;	/* parent bio */

And e.g. Windows guests know how those look like? And BSDs? And others?
All Linux versions with all possible kernel configurations have the
same layout?

I don't think so.

> 
> 	struct bvec_iter	bip_iter;
> 
> 	bio_end_io_t		*bip_end_io;	/* saved I/O completion fn */
> 
> 	unsigned short		bip_slab;	/* slab the bip came from */
> 	unsigned short		bip_vcnt;	/* # of integrity bio_vecs */
> 	unsigned short		bip_max_vcnt;	/* integrity bio_vec slots */
> 	unsigned short		bip_flags;	/* control flags */
> 
> 	struct work_struct	bip_work;	/* I/O completion */
> 
> 	struct bio_vec		*bip_vec;
> 	struct bio_vec		bip_inline_vecs[0];/* embedded bvec array */
> };
> 
>>> +
>>> +/*
>>> + * Extra request, must follow a normal-request and a normal-request can
>>> + * only be followed by one extra request.
>>> + */
>>
>> Wouldn't it be possible to include this in the original request (e.g.
>> it could be the first or last indirect segment) ?
>>
> 
> Yes, that's also an option.
> 
> The common way for block layer/driver to handle integrity metadata is using two scatterlists.
> One containing the data as usual, and one containing the integrity metadata.
> The block driver should transmit both two scatterlists.
> 
> I'm worry about you may think embedding the metadata scatterlist in the original request is too hacky.

I think you have to specify a clean OS agnostic interface for the
integrity data first. Then we can see how it fits into the request.


Juergen
Roger Pau Monné April 8, 2016, 9:44 a.m. UTC | #5
On Fri, 8 Apr 2016, Bob Liu wrote:
> 
> On 04/07/2016 11:55 PM, Juergen Gross wrote:
> > On 07/04/16 12:00, Bob Liu wrote:
> >> * What's data integrity extension and why?
> >> Modern filesystems feature checksumming of data and metadata to protect against
> >> data corruption.  However, the detection of the corruption is done at read time
> >> which could potentially be months after the data was written.  At that point the
> >> original data that the application tried to write is most likely lost.
> >>
> >> The solution in Linux is the data integrity framework which enables protection
> >> information to be pinned to I/Os and sent to/received from controllers that
> >> support it. struct bio has been extended with a pointer to a struct bip which
> >> in turn contains the integrity metadata. The bip is essentially a trimmed down
> >> bio with a bio_vec and some housekeeping.
> >>
> >> * Issues when xen-block get involved.
> >> xen-blkfront only transmits the normal data of struct bio while the integrity
> >> metadata buffer(struct bio_integrity_payload in each bio) is ignored.
> >>
> >> * Proposal of transmitting bio integrity payload.
> >> Adding an extra request following the normal data request, this extra request
> >> contains the integrity payload.
> >> The xen-blkback will reconstruct an new bio with both received normal data and
> >> integrity metadata.
> >>
> >> Welcome any better ideas, thank you!
> >>
> >> [1] http://lwn.net/Articles/280023/
> >> [2] https://www.kernel.org/doc/Documentation/block/data-integrity.txt
> >>
> >> Signed-off-by: Bob Liu <bob.liu@oracle.com>
> >> ---
> >>  xen/include/public/io/blkif.h |   50 +++++++++++++++++++++++++++++++++++++++++
> >>  1 file changed, 50 insertions(+)
> >>
> >> diff --git a/xen/include/public/io/blkif.h b/xen/include/public/io/blkif.h
> >> index 99f0326..3d8d39f 100644
> >> --- a/xen/include/public/io/blkif.h
> >> +++ b/xen/include/public/io/blkif.h
> >> @@ -635,6 +635,28 @@
> >>  #define BLKIF_OP_INDIRECT          6
> >>  
> >>  /*
> >> + * Recognized only if "feature-extra-request" is present in backend xenbus info.
> >> + * A request with BLKIF_OP_EXTRA_FLAG indicates an extra request is followed
> >> + * in the shared ring buffer.
> >> + *
> >> + * By this way, extra data like bio integrity payload can be transmitted from
> >> + * frontend to backend.
> >> + *
> >> + * The 'wire' format is like:
> >> + *  Request 1: xen_blkif_request
> >> + * [Request 2: xen_blkif_extra_request]    (only if request 1 has BLKIF_OP_EXTRA_FLAG)
> >> + *  Request 3: xen_blkif_request
> >> + *  Request 4: xen_blkif_request
> >> + * [Request 5: xen_blkif_extra_request]    (only if request 4 has BLKIF_OP_EXTRA_FLAG)
> >> + *  ...
> >> + *  Request N: xen_blkif_request
> >> + *
> >> + * If a backend does not recognize BLKIF_OP_EXTRA_FLAG, it should *not* create the
> >> + * "feature-extra-request" node!
> >> + */
> >> +#define BLKIF_OP_EXTRA_FLAG (0x80)
> >> +
> >> +/*
> >>   * Maximum scatter/gather segments per request.
> >>   * This is carefully chosen so that sizeof(blkif_ring_t) <= PAGE_SIZE.
> >>   * NB. This could be 12 if the ring indexes weren't stored in the same page.
> >> @@ -703,6 +725,34 @@ struct blkif_request_indirect {
> >>  };
> >>  typedef struct blkif_request_indirect blkif_request_indirect_t;
> >>  
> >> +enum blkif_extra_request_type {
> >> +	BLKIF_EXTRA_TYPE_DIX = 1,		/* Data integrity extension payload.  */
> >> +};
> >> +
> >> +struct bio_integrity_req {
> >> +	/*
> >> +	 * Grant mapping for transmitting bio integrity payload to backend.
> >> +	 */
> >> +	grant_ref_t *gref;
> >> +	unsigned int nr_grefs;
> >> +	unsigned int len;
> >> +};
> > 
> > How does the payload look like? It's structure should be defined here
> > or a reference to it's definition in case it is a standard should be
> > given.
> > 
> 
> The payload is also described using struct bio_vec(the same as bio).
> 
> /*
>  * bio integrity payload
>  */
> struct bio_integrity_payload {
> 	struct bio		*bip_bio;	/* parent bio */
> 
> 	struct bvec_iter	bip_iter;
> 
> 	bio_end_io_t		*bip_end_io;	/* saved I/O completion fn */
> 
> 	unsigned short		bip_slab;	/* slab the bip came from */
> 	unsigned short		bip_vcnt;	/* # of integrity bio_vecs */
> 	unsigned short		bip_max_vcnt;	/* integrity bio_vec slots */
> 	unsigned short		bip_flags;	/* control flags */
> 
> 	struct work_struct	bip_work;	/* I/O completion */
> 
> 	struct bio_vec		*bip_vec;
> 	struct bio_vec		bip_inline_vecs[0];/* embedded bvec array */
> };

There's no way we are going to embed such a Linux specific payload into 
the PV block protocol. Also, I have the feeling there are a lot of fields 
in this struct that make no sense to transmit on the ring (work_struct?).

TBH, I don't know much about this integrity thing. Why does the frontend 
needs to create and pass this bio_integrity_payload around? Couldn't this 
be created from blkback before sending the request down to the disk? Then 
blkback would check the result and either return BLKIF_RSP_OKAY or 
BLKIF_RSP_ERROR if the metadata doesn't match?
 
Roger.
Bob Liu April 8, 2016, 10:13 a.m. UTC | #6
On 04/08/2016 05:44 PM, Roger Pau Monné wrote:
> On Fri, 8 Apr 2016, Bob Liu wrote:
>>
>> On 04/07/2016 11:55 PM, Juergen Gross wrote:
>>> On 07/04/16 12:00, Bob Liu wrote:
>>>> * What's data integrity extension and why?
>>>> Modern filesystems feature checksumming of data and metadata to protect against
>>>> data corruption.  However, the detection of the corruption is done at read time
>>>> which could potentially be months after the data was written.  At that point the
>>>> original data that the application tried to write is most likely lost.
>>>>
>>>> The solution in Linux is the data integrity framework which enables protection
>>>> information to be pinned to I/Os and sent to/received from controllers that
>>>> support it. struct bio has been extended with a pointer to a struct bip which
>>>> in turn contains the integrity metadata. The bip is essentially a trimmed down
>>>> bio with a bio_vec and some housekeeping.
>>>>
>>>> * Issues when xen-block get involved.
>>>> xen-blkfront only transmits the normal data of struct bio while the integrity
>>>> metadata buffer(struct bio_integrity_payload in each bio) is ignored.
>>>>
>>>> * Proposal of transmitting bio integrity payload.
>>>> Adding an extra request following the normal data request, this extra request
>>>> contains the integrity payload.
>>>> The xen-blkback will reconstruct an new bio with both received normal data and
>>>> integrity metadata.
>>>>
>>>> Welcome any better ideas, thank you!
>>>>
>>>> [1] http://lwn.net/Articles/280023/
>>>> [2] https://www.kernel.org/doc/Documentation/block/data-integrity.txt
>>>>
>>>> Signed-off-by: Bob Liu <bob.liu@oracle.com>
>>>> ---
>>>>  xen/include/public/io/blkif.h |   50 +++++++++++++++++++++++++++++++++++++++++
>>>>  1 file changed, 50 insertions(+)
>>>>
>>>> diff --git a/xen/include/public/io/blkif.h b/xen/include/public/io/blkif.h
>>>> index 99f0326..3d8d39f 100644
>>>> --- a/xen/include/public/io/blkif.h
>>>> +++ b/xen/include/public/io/blkif.h
>>>> @@ -635,6 +635,28 @@
>>>>  #define BLKIF_OP_INDIRECT          6
>>>>  
>>>>  /*
>>>> + * Recognized only if "feature-extra-request" is present in backend xenbus info.
>>>> + * A request with BLKIF_OP_EXTRA_FLAG indicates an extra request is followed
>>>> + * in the shared ring buffer.
>>>> + *
>>>> + * By this way, extra data like bio integrity payload can be transmitted from
>>>> + * frontend to backend.
>>>> + *
>>>> + * The 'wire' format is like:
>>>> + *  Request 1: xen_blkif_request
>>>> + * [Request 2: xen_blkif_extra_request]    (only if request 1 has BLKIF_OP_EXTRA_FLAG)
>>>> + *  Request 3: xen_blkif_request
>>>> + *  Request 4: xen_blkif_request
>>>> + * [Request 5: xen_blkif_extra_request]    (only if request 4 has BLKIF_OP_EXTRA_FLAG)
>>>> + *  ...
>>>> + *  Request N: xen_blkif_request
>>>> + *
>>>> + * If a backend does not recognize BLKIF_OP_EXTRA_FLAG, it should *not* create the
>>>> + * "feature-extra-request" node!
>>>> + */
>>>> +#define BLKIF_OP_EXTRA_FLAG (0x80)
>>>> +
>>>> +/*
>>>>   * Maximum scatter/gather segments per request.
>>>>   * This is carefully chosen so that sizeof(blkif_ring_t) <= PAGE_SIZE.
>>>>   * NB. This could be 12 if the ring indexes weren't stored in the same page.
>>>> @@ -703,6 +725,34 @@ struct blkif_request_indirect {
>>>>  };
>>>>  typedef struct blkif_request_indirect blkif_request_indirect_t;
>>>>  
>>>> +enum blkif_extra_request_type {
>>>> +	BLKIF_EXTRA_TYPE_DIX = 1,		/* Data integrity extension payload.  */
>>>> +};
>>>> +
>>>> +struct bio_integrity_req {
>>>> +	/*
>>>> +	 * Grant mapping for transmitting bio integrity payload to backend.
>>>> +	 */
>>>> +	grant_ref_t *gref;
>>>> +	unsigned int nr_grefs;
>>>> +	unsigned int len;
>>>> +};
>>>
>>> How does the payload look like? It's structure should be defined here
>>> or a reference to it's definition in case it is a standard should be
>>> given.
>>>
>>
>> The payload is also described using struct bio_vec(the same as bio).
>>
>> /*
>>  * bio integrity payload
>>  */
>> struct bio_integrity_payload {
>> 	struct bio		*bip_bio;	/* parent bio */
>>
>> 	struct bvec_iter	bip_iter;
>>
>> 	bio_end_io_t		*bip_end_io;	/* saved I/O completion fn */
>>
>> 	unsigned short		bip_slab;	/* slab the bip came from */
>> 	unsigned short		bip_vcnt;	/* # of integrity bio_vecs */
>> 	unsigned short		bip_max_vcnt;	/* integrity bio_vec slots */
>> 	unsigned short		bip_flags;	/* control flags */
>>
>> 	struct work_struct	bip_work;	/* I/O completion */
>>
>> 	struct bio_vec		*bip_vec;
>> 	struct bio_vec		bip_inline_vecs[0];/* embedded bvec array */
>> };
> 
> There's no way we are going to embed such a Linux specific payload into 
> the PV block protocol. Also, I have the feeling there are a lot of fields 
> in this struct that make no sense to transmit on the ring (work_struct?).
> 

Only the bio_vec data bip_vec pointed is necessary to be transmitted.

> TBH, I don't know much about this integrity thing. Why does the frontend 
> needs to create and pass this bio_integrity_payload around? Couldn't this 
> be created from blkback before sending the request down to the disk? Then 
> blkback would check the result and either return BLKIF_RSP_OKAY or 
> BLKIF_RSP_ERROR if the metadata doesn't match?
>  

Yes, but that's only one use case.

The Linux data integrity framework also allows the user space application or filesystem generating the metadata.
* A filesystem in Guest that is integrity-aware can prepare I/Os with metadata attached.
* Filesystems in Guest are capable of transferring metadata from user space.
Those metadata get lost if we don't pass them through in blkfront.

You may have a look at:
[1] http://lwn.net/Articles/280023/
[2] https://www.kernel.org/doc/Documentation/block/data-integrity.txt

Thanks,
Bob
Ian Jackson April 8, 2016, 1:42 p.m. UTC | #7
FYI I have things I want to say in this conversation but today I am
concentrating on committing for the 4.7 freeze.

Ian.
David Vrabel April 8, 2016, 2:16 p.m. UTC | #8
On 07/04/16 11:00, Bob Liu wrote:
> * What's data integrity extension and why?
> Modern filesystems feature checksumming of data and metadata to protect against
> data corruption.  However, the detection of the corruption is done at read time
> which could potentially be months after the data was written.  At that point the
> original data that the application tried to write is most likely lost.
> 
> The solution in Linux is the data integrity framework which enables protection
> information to be pinned to I/Os and sent to/received from controllers that
> support it. struct bio has been extended with a pointer to a struct bip which
> in turn contains the integrity metadata. The bip is essentially a trimmed down
> bio with a bio_vec and some housekeeping.
> 
> * Issues when xen-block get involved.
> xen-blkfront only transmits the normal data of struct bio while the integrity
> metadata buffer(struct bio_integrity_payload in each bio) is ignored.
> 
> * Proposal of transmitting bio integrity payload.
> Adding an extra request following the normal data request, this extra request
> contains the integrity payload.
> The xen-blkback will reconstruct an new bio with both received normal data and
> integrity metadata.

You need to read the relevant SCSI specification and find out what
interfaces and behaviour the hardware has so you can specify compatible
interfaces in blkif.

My (brief) reading around this suggests that the integrity data has a
specific format (a CRC of some form) and the integrity data written for
sector S and retrieved verbatim when sector S is re-read.

David
Ian Jackson April 8, 2016, 2:20 p.m. UTC | #9
David Vrabel writes ("Re: [RFC PATCH] Data integrity extension support for xen-block"):
> You need to read the relevant SCSI specification and find out what
> interfaces and behaviour the hardware has so you can specify compatible
> interfaces in blkif.
> 
> My (brief) reading around this suggests that the integrity data has a
> specific format (a CRC of some form) and the integrity data written for
> sector S and retrieved verbatim when sector S is re-read.

I think it's this:

https://en.wikipedia.org/wiki/Data_Integrity_Field
https://www.kernel.org/doc/Documentation/block/data-integrity.txt

In which case AFAICT the format is up to the guest (ie the operating
system or file system) and it's opaque to the host (the storage) -
unless the guest consents, of course.

Ian.
David Vrabel April 8, 2016, 2:32 p.m. UTC | #10
On 08/04/16 15:20, Ian Jackson wrote:
> David Vrabel writes ("Re: [RFC PATCH] Data integrity extension support for xen-block"):
>> You need to read the relevant SCSI specification and find out what
>> interfaces and behaviour the hardware has so you can specify compatible
>> interfaces in blkif.
>>
>> My (brief) reading around this suggests that the integrity data has a
>> specific format (a CRC of some form) and the integrity data written for
>> sector S and retrieved verbatim when sector S is re-read.
> 
> I think it's this:
> 
> https://en.wikipedia.org/wiki/Data_Integrity_Field
> https://www.kernel.org/doc/Documentation/block/data-integrity.txt
> 
> In which case AFAICT the format is up to the guest (ie the operating
> system or file system) and it's opaque to the host (the storage) -
> unless the guest consents, of course.

I disagree, but I can't work out where to get the relevant T10 PI/DIF
spec from to provide an authoritative link[1].  The DI metadata has as a
set of well defined format, most of which include a 16-bit GUARD CRC, a
32 bit REFERENCE tag and 16 bit for user defined usage.

The application cannot use all the bits for its own use since the
hardware may check the GUARD and REFERENCE tags itself.

David

[0] Try: https://www.usenix.org/legacy/event/lsf07/tech/petersen.pdf
Bob Liu April 11, 2016, 12:32 p.m. UTC | #11
On 04/08/2016 10:32 PM, David Vrabel wrote:
> On 08/04/16 15:20, Ian Jackson wrote:
>> David Vrabel writes ("Re: [RFC PATCH] Data integrity extension support for xen-block"):
>>> You need to read the relevant SCSI specification and find out what
>>> interfaces and behaviour the hardware has so you can specify compatible
>>> interfaces in blkif.
>>>
>>> My (brief) reading around this suggests that the integrity data has a
>>> specific format (a CRC of some form) and the integrity data written for
>>> sector S and retrieved verbatim when sector S is re-read.
>>
>> I think it's this:
>>
>> https://en.wikipedia.org/wiki/Data_Integrity_Field
>> https://www.kernel.org/doc/Documentation/block/data-integrity.txt
>>
>> In which case AFAICT the format is up to the guest (ie the operating
>> system or file system) and it's opaque to the host (the storage) -
>> unless the guest consents, of course.
> 
> I disagree, but I can't work out where to get the relevant T10 PI/DIF
> spec from to provide an authoritative link[1].  The DI metadata has as a
> set of well defined format, most of which include a 16-bit GUARD CRC, a
> 32 bit REFERENCE tag and 16 bit for user defined usage.
> 

Yes.

> The application cannot use all the bits for its own use since the
> hardware may check the GUARD and REFERENCE tags itself.
> 
> David
> 
> [0] Try: https://www.usenix.org/legacy/event/lsf07/tech/petersen.pdf
> 

And https://oss.oracle.com/projects/data-integrity/dist/documentation/dix.pdf

No matter the actual format of the Integrity Meta Data looks like, it can be mapped to a scatter-list by using:
blk_rq_map_integrity_sg(struct request_queue *q, struct bio *bio, struct scatterlist *sglist)
just like blk_rq_map_sg(struct request_queue *q, struct request *rq, struct scatterlist *sglist) for normal data.

The extra scatter-list can be seen as the interface, we just need to find a good way transmitting this extra scatter-list between blkfront and blkback.
Konrad Rzeszutek Wilk April 11, 2016, 3:04 p.m. UTC | #12
> * A filesystem in Guest that is integrity-aware can prepare I/Os with metadata attached.
> * Filesystems in Guest are capable of transferring metadata from user space.
> Those metadata get lost if we don't pass them through in blkfront.
> 
> You may have a look at:
> [1] http://lwn.net/Articles/280023/
> [2] https://www.kernel.org/doc/Documentation/block/data-integrity.txt

And Google for SCSI DIF/DIX which can give you some ideas. There is also
this slide-deck: https://oss.oracle.com/~mkp/docs/ols2008-slides.pdf

Also note that some devices (like NVMe) also implement some of this:
http://www.flashmemorysummit.com/English/Collaterals/Proceedings/2014/20140804_Seminar_F_Busch.pdf
diff mbox

Patch

diff --git a/xen/include/public/io/blkif.h b/xen/include/public/io/blkif.h
index 99f0326..3d8d39f 100644
--- a/xen/include/public/io/blkif.h
+++ b/xen/include/public/io/blkif.h
@@ -635,6 +635,28 @@ 
 #define BLKIF_OP_INDIRECT          6
 
 /*
+ * Recognized only if "feature-extra-request" is present in backend xenbus info.
+ * A request with BLKIF_OP_EXTRA_FLAG indicates an extra request is followed
+ * in the shared ring buffer.
+ *
+ * By this way, extra data like bio integrity payload can be transmitted from
+ * frontend to backend.
+ *
+ * The 'wire' format is like:
+ *  Request 1: xen_blkif_request
+ * [Request 2: xen_blkif_extra_request]    (only if request 1 has BLKIF_OP_EXTRA_FLAG)
+ *  Request 3: xen_blkif_request
+ *  Request 4: xen_blkif_request
+ * [Request 5: xen_blkif_extra_request]    (only if request 4 has BLKIF_OP_EXTRA_FLAG)
+ *  ...
+ *  Request N: xen_blkif_request
+ *
+ * If a backend does not recognize BLKIF_OP_EXTRA_FLAG, it should *not* create the
+ * "feature-extra-request" node!
+ */
+#define BLKIF_OP_EXTRA_FLAG (0x80)
+
+/*
  * Maximum scatter/gather segments per request.
  * This is carefully chosen so that sizeof(blkif_ring_t) <= PAGE_SIZE.
  * NB. This could be 12 if the ring indexes weren't stored in the same page.
@@ -703,6 +725,34 @@  struct blkif_request_indirect {
 };
 typedef struct blkif_request_indirect blkif_request_indirect_t;
 
+enum blkif_extra_request_type {
+	BLKIF_EXTRA_TYPE_DIX = 1,		/* Data integrity extension payload.  */
+};
+
+struct bio_integrity_req {
+	/*
+	 * Grant mapping for transmitting bio integrity payload to backend.
+	 */
+	grant_ref_t *gref;
+	unsigned int nr_grefs;
+	unsigned int len;
+};
+
+/*
+ * Extra request, must follow a normal-request and a normal-request can
+ * only be followed by one extra request.
+ */
+struct blkif_request_extra {
+	uint8_t type;		/* BLKIF_EXTRA_TYPE_* */
+	uint16_t _pad1;
+#ifndef CONFIG_X86_32
+	uint32_t _pad2;		/* offsetof(blkif_...,u.extra.id) == 8 */
+#endif
+	uint64_t id;
+	struct bio_integrity_req bi_req;
+} __attribute__((__packed__));
+typedef struct blkif_request_extra blkif_request_extra_t;
+
 struct blkif_response {
     uint64_t        id;              /* copied from request */
     uint8_t         operation;       /* copied from request */