
[v2,2/3] block: support PI at non-zero offset within metadata

Message ID 20240201130126.211402-3-joshi.k@samsung.com (mailing list archive)
State New, archived
Series Block integrity with flexible-offset PI

Commit Message

Kanchan Joshi Feb. 1, 2024, 1:01 p.m. UTC
Block layer integrity processing assumes that protection information
(PI) is placed in the first bytes of each metadata block.

Remove this limitation and include the metadata before the PI in the
calculation of the guard tag.

Signed-off-by: Kanchan Joshi <joshi.k@samsung.com>
Signed-off-by: Chinmay Gameti <c.gameti@samsung.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
---
 block/bio-integrity.c         |  1 +
 block/blk-integrity.c         |  1 +
 block/t10-pi.c                | 52 +++++++++++++++++++++++++----------
 include/linux/blk-integrity.h |  1 +
 include/linux/blkdev.h        |  1 +
 5 files changed, 42 insertions(+), 14 deletions(-)
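
For orientation, a sketch of the metadata layout this series enables
(pi_offset and tuple_size are the fields introduced in the patch below;
the diagram itself is editorial, not part of the posted patch):

	|<----------------- metadata block ----------------->|
	|<-- pi_offset -->|<------------- PI tuple --------->|
	[ other metadata ][ guard tag | app tag | ref tag    ]

The guard tag is computed over the data interval first and then continued
over the pi_offset bytes of metadata that precede the PI tuple.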

Comments

Keith Busch Sept. 26, 2024, 3:07 p.m. UTC | #1
On Thu, Feb 01, 2024 at 06:31:25PM +0530, Kanchan Joshi wrote:
> Block layer integrity processing assumes that protection information
> (PI) is placed in the first bytes of each metadata block.
> 
> Remove this limitation and include the metadata before the PI in the
> calculation of the guard tag.

Very late reply, but I am just now discovering the consequences of this
patch.

We have drives with this format, 64b metadata with PI at the end. With
previous kernels, we had written data to these drives. Those kernel
versions disabled the GUARD generation, so the metadata was written
without it, and everything was fine.

Now we upgrade to 6.9+, and this kernel enables the GUARD check. All the
data previously written to this drive is unreadable because the GUARD is
invalid.

Not sure exactly what to do about this, but it is a broken kernel
upgrade path, so wanted to throw that information out there.
Kanchan Joshi Sept. 26, 2024, 4:38 p.m. UTC | #2
On 9/26/2024 8:37 PM, Keith Busch wrote:
> On Thu, Feb 01, 2024 at 06:31:25PM +0530, Kanchan Joshi wrote:
>> Block layer integrity processing assumes that protection information
>> (PI) is placed in the first bytes of each metadata block.
>>
>> Remove this limitation and include the metadata before the PI in the
>> calculation of the guard tag.
> 
> Very late reply, but I am just now discovering the consequences of this
> patch.
> 
> We have drives with this format, 64b metadata with PI at the end. With
> previous kernels, we had written data to these drives. Those kernel
> versions disabled the GUARD generation, so the metadata was written
> without it, and everything was fine.
> 
> Now we upgrade to 6.9+, and this kernel enables the GUARD check. All the
> data previously written to this drive is unreadable because the GUARD is
> invalid.
> 
> Not sure exactly what to do about this, but it is a broken kernel
> upgrade path, so wanted to throw that information out there.
> 

Ah, writing to the disk without any guard, but then reading it back 
with the guard!

I wish there were a way to format the NVMe namespace just to reset its 
PI type (from non-zero to 0).
But there are kernel knobs too. Hope you are able to get to the same 
state (as nop profile) by clearing write_generate and read_verify:
echo 0 > /sys/block/nvme0n1/integrity/read_verify
echo 0 > /sys/block/nvme0n1/integrity/write_generate
Keith Busch Sept. 26, 2024, 4:55 p.m. UTC | #3
On Thu, Sep 26, 2024 at 10:08:09PM +0530, Kanchan Joshi wrote:
> But there are kernel knobs too. Hope you are able to get to the same 
> state (as nop profile) by clearing write_generate and read_verify:
> echo 0 > /sys/block/nvme0n1/integrity/read_verify

It's not the kernel's verify causing the failure; it's the end device.
For nvme, it'll return status 0x282, End-to-end Guard Check Error, so
the kernel doesn't have a chance to check the data. We'd need to turn
off the command's NVME_RW_PRINFO_PRCHK_GUARD flag, but there's currently
no knob to toggle that.
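
For reference, the flag in question is set per-command in nvme_setup_rw()
in drivers/nvme/host/core.c. A simplified excerpt (paraphrased from the
driver of this era; field names vary slightly across kernel versions):

	/* Simplified: request device-side PI checks for this I/O */
	switch (ns->head->pi_type) {
	case NVME_NS_DPS_PI_TYPE3:
		control |= NVME_RW_PRINFO_PRCHK_GUARD;
		break;
	case NVME_NS_DPS_PI_TYPE1:
	case NVME_NS_DPS_PI_TYPE2:
		control |= NVME_RW_PRINFO_PRCHK_GUARD |
			   NVME_RW_PRINFO_PRCHK_REF;
		break;
	}

Nothing on this path is connected to a sysfs knob, which is why the
block-layer read_verify/write_generate attributes alone cannot suppress
the device-side guard check.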
Kanchan Joshi Sept. 27, 2024, 4:07 p.m. UTC | #4
On 9/26/2024 10:25 PM, Keith Busch wrote:
> On Thu, Sep 26, 2024 at 10:08:09PM +0530, Kanchan Joshi wrote:
>> But there are kernel knobs too. Hope you are able to get to the same
>> state (as nop profile) by clearing write_generate and read_verify:
>> echo 0 > /sys/block/nvme0n1/integrity/read_verify
> 
> It's not the kernel's verify causing the failure; it's the end device.
> For nvme, it'll return status 0x282, End-to-end Guard Check Error, so
> the kernel doesn't have a chance to check the data. We'd need to turn
> off the command's NVME_RW_PRINFO_PRCHK_GUARD flag, but there's currently
> no knob to toggle that.

Indeed.
I spent a good deal of time on this today. I was thinking of connecting 
the block read_verify/write_generate knobs so that they influence things 
at the nvme level (those PRCHK flags). But that will not be enough, 
because with those knobs the block layer will not attach the meta-buffer, 
which is still needed.

The data was written under conditions where the nvme driver had set the 
pi_type to 0 (even though at the device level it was non-zero) during 
integrity registration.

I am wondering whether it would make sense to have a knob at the 
block-layer level to do something like that, i.e., override the 
registered integrity profile with the nop profile.
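
A minimal sketch of what such an override could look like, assuming a
hypothetical helper (nothing below exists in the tree; it relies only on
the fact, visible in the patch, that blk_integrity_register() falls back
to the nop profile when the template has no profile set):

	/* Hypothetical: re-register with the nop profile but keep the
	 * tuple geometry, so a meta-buffer is still attached while the
	 * block layer neither generates nor verifies PI.
	 */
	static void blk_integrity_force_nop(struct gendisk *disk)
	{
		struct blk_integrity nop_template = {
			.tuple_size = disk->queue->integrity.tuple_size,
			.tag_size = disk->queue->integrity.tag_size,
		};

		blk_integrity_register(disk, &nop_template);
	}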
Martin K. Petersen Sept. 30, 2024, 5:57 p.m. UTC | #5
Kanchan,

> I spent a good deal of time on this today. I was thinking of connecting
> the block read_verify/write_generate knobs so that they influence things
> at the nvme level (those PRCHK flags). But that will not be enough,
> because with those knobs the block layer will not attach the meta-buffer,
> which is still needed.
>
> The data was written under conditions where the nvme driver had set the
> pi_type to 0 (even though at the device level it was non-zero) during
> integrity registration.
>
> I am wondering whether it would make sense to have a knob at the
> block-layer level to do something like that, i.e., override the
> registered integrity profile with the nop profile.

SCSI went to great lengths to ensure that invalid protection information
would never be written during normal operation, regardless of whether
the host sent PI or not. And thus the only time one would anticipate a
PI error was if the data had actually been corrupted.
Javier Gonzalez Oct. 1, 2024, 7:27 a.m. UTC | #6
On 30.09.2024 13:57, Martin K. Petersen wrote:
>
>Kanchan,
>
>> I spent a good deal of time on this today. I was thinking of connecting
>> the block read_verify/write_generate knobs so that they influence things
>> at the nvme level (those PRCHK flags). But that will not be enough,
>> because with those knobs the block layer will not attach the meta-buffer,
>> which is still needed.
>>
>> The data was written under conditions where the nvme driver had set the
>> pi_type to 0 (even though at the device level it was non-zero) during
>> integrity registration.
>>
>> I am wondering whether it would make sense to have a knob at the
>> block-layer level to do something like that, i.e., override the
>> registered integrity profile with the nop profile.
>
>SCSI went to great lengths to ensure that invalid protection information
>would never be written during normal operation, regardless of whether
>the host sent PI or not. And thus the only time one would anticipate a
>PI error was if the data had actually been corrupted.
>

Is this something we should work on bringing to the NVMe TWG?

Javier
Keith Busch Oct. 1, 2024, 3:37 p.m. UTC | #7
On Tue, Oct 01, 2024 at 09:27:08AM +0200, Javier González wrote:
> On 30.09.2024 13:57, Martin K. Petersen wrote:
> > 
> > Kanchan,
> > 
> > > I spent a good deal of time on this today. I was thinking of connecting
> > > the block read_verify/write_generate knobs so that they influence things
> > > at the nvme level (those PRCHK flags). But that will not be enough,
> > > because with those knobs the block layer will not attach the meta-buffer,
> > > which is still needed.
> > > 
> > > The data was written under conditions where the nvme driver had set the
> > > pi_type to 0 (even though at the device level it was non-zero) during
> > > integrity registration.
> > > 
> > > I am wondering whether it would make sense to have a knob at the
> > > block-layer level to do something like that, i.e., override the
> > > registered integrity profile with the nop profile.
> > 
> > SCSI went to great lengths to ensure that invalid protection information
> > would never be written during normal operation, regardless of whether
> > the host sent PI or not. And thus the only time one would anticipate a
> > PI error was if the data had actually been corrupted.
> > 
> 
> Is this something we should work on bringing to the NVMe TWG?

Maybe. It looks like they did the spec this way on purpose, with the
ability to toggle guard tags per IO.

Just some more background on this because it may sound odd to use a data
protection namespace format that the kernel didn't support:

In this use case, writes to the device primarily come from the
passthrough interface, which could always use the guard tags for
end-to-end protection. The kernel block IO was the only path that had
the limitation.

Besides the passthrough interface, though, the setup uses the kernel
block layer to write the partition tables. After upgrading from 6.8 to
6.9, the kernel won't be able to read the partition table on these
devices. I'm still not sure of the best way to handle this, though.
Javier Gonzalez Oct. 2, 2024, 10:29 a.m. UTC | #8
On 01.10.2024 09:37, Keith Busch wrote:
>On Tue, Oct 01, 2024 at 09:27:08AM +0200, Javier González wrote:
>> On 30.09.2024 13:57, Martin K. Petersen wrote:
>> >
>> > Kanchan,
>> >
>> > > I spent a good deal of time on this today. I was thinking of connecting
>> > > the block read_verify/write_generate knobs so that they influence things
>> > > at the nvme level (those PRCHK flags). But that will not be enough,
>> > > because with those knobs the block layer will not attach the meta-buffer,
>> > > which is still needed.
>> > >
>> > > The data was written under conditions where the nvme driver had set the
>> > > pi_type to 0 (even though at the device level it was non-zero) during
>> > > integrity registration.
>> > >
>> > > I am wondering whether it would make sense to have a knob at the
>> > > block-layer level to do something like that, i.e., override the
>> > > registered integrity profile with the nop profile.
>> >
>> > SCSI went to great lengths to ensure that invalid protection information
>> > would never be written during normal operation, regardless of whether
>> > the host sent PI or not. And thus the only time one would anticipate a
>> > PI error was if the data had actually been corrupted.
>> >
>>
>> Is this something we should work on bringing to the NVMe TWG?
>
>Maybe. It looks like they did the spec this way on purpose, with the
>ability to toggle guard tags per IO.
>
>Just some more background on this because it may sound odd to use a data
>protection namespace format that the kernel didn't support:
>
>In this use case, writes to the device primarily come from the
>passthrough interface, which could always use the guard tags for
>end-to-end protection. The kernel block IO was the only path that had
>the limitation.
>
>Besides the passthrough interface, though, the setup uses the kernel
>block layer to write the partition tables. After upgrading from 6.8 to
>6.9, the kernel won't be able to read the partition table on these
>devices. I'm still not sure of the best way to handle this, though.

Mmmm. Need to think more about this.

Seems this is by design in NVMe. Will try to follow up with Mike and
Judy to understand the background for this at the time...
Martin K. Petersen Oct. 2, 2024, 4:18 p.m. UTC | #9
Keith,

> Besides the passthrough interface, though, the setup uses kernel block
> layer to write the partition tables. Upgrading from 6.8 -> 6.9 won't
> be able to read the partition table on these devices. I'm still not
> sure the best way to handle this, though.

That's why, despite Type 2 supporting offset ref tags, the kernel always
uses the LBA. This is to ensure that any application which doesn't have
prior knowledge of the contents of each block's PI can perform a read
without failing, including the kernel itself when reading the partition
table. But it's not just a kernel problem; the BIOS also needs to be
able to read arbitrary blocks, at least at the beginning and end of the
device.

Many of the features of the original DIF spec were squarely aimed at use
inside storage arrays where there was a single entity, the array
firmware, performing all reads and writes.

With DIX, since the interface needed to be used by array firmware and
general purpose operating systems alike, we needed to add all the
required features to the interface to accommodate every use case. And
NVMe inherited this. But NVMe did not really capture which things are
suitable only in the "single application" scenario.

I'm not sure there's a good fix. Other than as a passthrough user, the
burden is on you to ensure that things can be read by somebody else.
Could be backup applications, virus scanners, system firmware, etc.
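
For reference, the LBA-derived ref tag Martin describes is what the
in-tree helper computes (from include/linux/t10-pi.h, quoted from memory,
so minor details may differ):

	static inline u32 t10_pi_ref_tag(struct request *rq)
	{
		unsigned int shift = ilog2(queue_logical_block_size(rq->q));

	#ifdef CONFIG_BLK_DEV_INTEGRITY
		if (rq->q->integrity.interval_exp)
			shift = rq->q->integrity.interval_exp;
	#endif
		return blk_rq_pos(rq) >> (shift - SECTOR_SHIFT) & 0xffffffff;
	}

Any reader that knows only the LBA can therefore compute the expected
ref tag, which is what keeps "blind" reads such as partition-table scans
working under Type 1.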
Keith Busch Oct. 2, 2024, 7:03 p.m. UTC | #10
On Wed, Oct 02, 2024 at 12:18:37PM -0400, Martin K. Petersen wrote:
> I'm not sure there's a good fix. Other than as a passthrough user, the
> burden is on you to ensure that things can be read by somebody else.
> Could be backup applications, virus scanners, system firmware, etc.

Maybe we should disable device-side guard checks on read commands. The
NVMe spec allows us to do that. This way the host would always be able
to read the data, and it can optionally validate the guard on its side
if it wants to.
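
A sketch of what that could look like in the driver, purely illustrative
and not a posted patch (the flag and the req_op() helper are real; the
policy is only the suggestion above):

	/* In nvme_setup_rw(): keep generating and checking PI on writes,
	 * but do not ask the device to check the guard on reads, so that
	 * blocks written without a valid guard tag remain readable.
	 */
	if (req_op(req) == REQ_OP_READ)
		control &= ~NVME_RW_PRINFO_PRCHK_GUARD;

The host-side verification could still be applied independently through
the integrity/read_verify sysfs attribute.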

Patch

diff --git a/block/bio-integrity.c b/block/bio-integrity.c
index c9a16fba58b9..2e3e8e04961e 100644
--- a/block/bio-integrity.c
+++ b/block/bio-integrity.c
@@ -395,6 +395,7 @@  static blk_status_t bio_integrity_process(struct bio *bio,
 	iter.tuple_size = bi->tuple_size;
 	iter.seed = proc_iter->bi_sector;
 	iter.prot_buf = bvec_virt(bip->bip_vec);
+	iter.pi_offset = bi->pi_offset;
 
 	__bio_for_each_segment(bv, bio, bviter, *proc_iter) {
 		void *kaddr = bvec_kmap_local(&bv);
diff --git a/block/blk-integrity.c b/block/blk-integrity.c
index d4e9b4556d14..ccbeb6dfa87a 100644
--- a/block/blk-integrity.c
+++ b/block/blk-integrity.c
@@ -370,6 +370,7 @@  void blk_integrity_register(struct gendisk *disk, struct blk_integrity *template
 	bi->profile = template->profile ? template->profile : &nop_profile;
 	bi->tuple_size = template->tuple_size;
 	bi->tag_size = template->tag_size;
+	bi->pi_offset = template->pi_offset;
 
 	blk_queue_flag_set(QUEUE_FLAG_STABLE_WRITES, disk->queue);
 
diff --git a/block/t10-pi.c b/block/t10-pi.c
index 251a7b188963..d90892fd6f2a 100644
--- a/block/t10-pi.c
+++ b/block/t10-pi.c
@@ -32,12 +32,16 @@  static __be16 t10_pi_ip_fn(__be16 csum, void *data, unsigned int len)
 static blk_status_t t10_pi_generate(struct blk_integrity_iter *iter,
 		csum_fn *fn, enum t10_dif_type type)
 {
+	u8 offset = iter->pi_offset;
 	unsigned int i;
 
 	for (i = 0 ; i < iter->data_size ; i += iter->interval) {
-		struct t10_pi_tuple *pi = iter->prot_buf;
+		struct t10_pi_tuple *pi = iter->prot_buf + offset;
 
 		pi->guard_tag = fn(0, iter->data_buf, iter->interval);
+		if (offset)
+			pi->guard_tag = fn(pi->guard_tag, iter->prot_buf,
+					   offset);
 		pi->app_tag = 0;
 
 		if (type == T10_PI_TYPE1_PROTECTION)
@@ -56,12 +60,13 @@  static blk_status_t t10_pi_generate(struct blk_integrity_iter *iter,
 static blk_status_t t10_pi_verify(struct blk_integrity_iter *iter,
 		csum_fn *fn, enum t10_dif_type type)
 {
+	u8 offset = iter->pi_offset;
 	unsigned int i;
 
 	BUG_ON(type == T10_PI_TYPE0_PROTECTION);
 
 	for (i = 0 ; i < iter->data_size ; i += iter->interval) {
-		struct t10_pi_tuple *pi = iter->prot_buf;
+		struct t10_pi_tuple *pi = iter->prot_buf + offset;
 		__be16 csum;
 
 		if (type == T10_PI_TYPE1_PROTECTION ||
@@ -84,6 +89,8 @@  static blk_status_t t10_pi_verify(struct blk_integrity_iter *iter,
 		}
 
 		csum = fn(0, iter->data_buf, iter->interval);
+		if (offset)
+			csum = fn(csum, iter->prot_buf, offset);
 
 		if (pi->guard_tag != csum) {
 			pr_err("%s: guard tag error at sector %llu " \
@@ -134,8 +141,10 @@  static blk_status_t t10_pi_type1_verify_ip(struct blk_integrity_iter *iter)
  */
 static void t10_pi_type1_prepare(struct request *rq)
 {
-	const int tuple_sz = rq->q->integrity.tuple_size;
+	struct blk_integrity *bi = &rq->q->integrity;
+	const int tuple_sz = bi->tuple_size;
 	u32 ref_tag = t10_pi_ref_tag(rq);
+	u8 offset = bi->pi_offset;
 	struct bio *bio;
 
 	__rq_for_each_bio(bio, rq) {
@@ -154,7 +163,7 @@  static void t10_pi_type1_prepare(struct request *rq)
 
 			p = bvec_kmap_local(&iv);
 			for (j = 0; j < iv.bv_len; j += tuple_sz) {
-				struct t10_pi_tuple *pi = p;
+				struct t10_pi_tuple *pi = p + offset;
 
 				if (be32_to_cpu(pi->ref_tag) == virt)
 					pi->ref_tag = cpu_to_be32(ref_tag);
@@ -183,9 +192,11 @@  static void t10_pi_type1_prepare(struct request *rq)
  */
 static void t10_pi_type1_complete(struct request *rq, unsigned int nr_bytes)
 {
-	unsigned intervals = nr_bytes >> rq->q->integrity.interval_exp;
-	const int tuple_sz = rq->q->integrity.tuple_size;
+	struct blk_integrity *bi = &rq->q->integrity;
+	unsigned intervals = nr_bytes >> bi->interval_exp;
+	const int tuple_sz = bi->tuple_size;
 	u32 ref_tag = t10_pi_ref_tag(rq);
+	u8 offset = bi->pi_offset;
 	struct bio *bio;
 
 	__rq_for_each_bio(bio, rq) {
@@ -200,7 +211,7 @@  static void t10_pi_type1_complete(struct request *rq, unsigned int nr_bytes)
 
 			p = bvec_kmap_local(&iv);
 			for (j = 0; j < iv.bv_len && intervals; j += tuple_sz) {
-				struct t10_pi_tuple *pi = p;
+				struct t10_pi_tuple *pi = p + offset;
 
 				if (be32_to_cpu(pi->ref_tag) == ref_tag)
 					pi->ref_tag = cpu_to_be32(virt);
@@ -288,12 +299,16 @@  static __be64 ext_pi_crc64(u64 crc, void *data, unsigned int len)
 static blk_status_t ext_pi_crc64_generate(struct blk_integrity_iter *iter,
 					enum t10_dif_type type)
 {
+	u8 offset = iter->pi_offset;
 	unsigned int i;
 
 	for (i = 0 ; i < iter->data_size ; i += iter->interval) {
-		struct crc64_pi_tuple *pi = iter->prot_buf;
+		struct crc64_pi_tuple *pi = iter->prot_buf + offset;
 
 		pi->guard_tag = ext_pi_crc64(0, iter->data_buf, iter->interval);
+		if (offset)
+			pi->guard_tag = ext_pi_crc64(be64_to_cpu(pi->guard_tag),
+					iter->prot_buf, offset);
 		pi->app_tag = 0;
 
 		if (type == T10_PI_TYPE1_PROTECTION)
@@ -319,10 +334,11 @@  static bool ext_pi_ref_escape(u8 *ref_tag)
 static blk_status_t ext_pi_crc64_verify(struct blk_integrity_iter *iter,
 				      enum t10_dif_type type)
 {
+	u8 offset = iter->pi_offset;
 	unsigned int i;
 
 	for (i = 0; i < iter->data_size; i += iter->interval) {
-		struct crc64_pi_tuple *pi = iter->prot_buf;
+		struct crc64_pi_tuple *pi = iter->prot_buf + offset;
 		u64 ref, seed;
 		__be64 csum;
 
@@ -344,6 +360,10 @@  static blk_status_t ext_pi_crc64_verify(struct blk_integrity_iter *iter,
 		}
 
 		csum = ext_pi_crc64(0, iter->data_buf, iter->interval);
+		if (offset)
+			csum = ext_pi_crc64(be64_to_cpu(csum), iter->prot_buf,
+					    offset);
+
 		if (pi->guard_tag != csum) {
 			pr_err("%s: guard tag error at sector %llu " \
 			       "(rcvd %016llx, want %016llx)\n",
@@ -373,8 +393,10 @@  static blk_status_t ext_pi_type1_generate_crc64(struct blk_integrity_iter *iter)
 
 static void ext_pi_type1_prepare(struct request *rq)
 {
-	const int tuple_sz = rq->q->integrity.tuple_size;
+	struct blk_integrity *bi = &rq->q->integrity;
+	const int tuple_sz = bi->tuple_size;
 	u64 ref_tag = ext_pi_ref_tag(rq);
+	u8 offset = bi->pi_offset;
 	struct bio *bio;
 
 	__rq_for_each_bio(bio, rq) {
@@ -393,7 +415,7 @@  static void ext_pi_type1_prepare(struct request *rq)
 
 			p = bvec_kmap_local(&iv);
 			for (j = 0; j < iv.bv_len; j += tuple_sz) {
-				struct crc64_pi_tuple *pi = p;
+				struct crc64_pi_tuple *pi = p + offset;
 				u64 ref = get_unaligned_be48(pi->ref_tag);
 
 				if (ref == virt)
@@ -411,9 +433,11 @@  static void ext_pi_type1_prepare(struct request *rq)
 
 static void ext_pi_type1_complete(struct request *rq, unsigned int nr_bytes)
 {
-	unsigned intervals = nr_bytes >> rq->q->integrity.interval_exp;
-	const int tuple_sz = rq->q->integrity.tuple_size;
+	struct blk_integrity *bi = &rq->q->integrity;
+	unsigned intervals = nr_bytes >> bi->interval_exp;
+	const int tuple_sz = bi->tuple_size;
 	u64 ref_tag = ext_pi_ref_tag(rq);
+	u8 offset = bi->pi_offset;
 	struct bio *bio;
 
 	__rq_for_each_bio(bio, rq) {
@@ -428,7 +452,7 @@  static void ext_pi_type1_complete(struct request *rq, unsigned int nr_bytes)
 
 			p = bvec_kmap_local(&iv);
 			for (j = 0; j < iv.bv_len && intervals; j += tuple_sz) {
-				struct crc64_pi_tuple *pi = p;
+				struct crc64_pi_tuple *pi = p + offset;
 				u64 ref = get_unaligned_be48(pi->ref_tag);
 
 				if (ref == ref_tag)
diff --git a/include/linux/blk-integrity.h b/include/linux/blk-integrity.h
index 378b2459efe2..e253e7bd0d17 100644
--- a/include/linux/blk-integrity.h
+++ b/include/linux/blk-integrity.h
@@ -20,6 +20,7 @@  struct blk_integrity_iter {
 	unsigned int		data_size;
 	unsigned short		interval;
 	unsigned char		tuple_size;
+	unsigned char		pi_offset;
 	const char		*disk_name;
 };
 
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index d7cac3de65b3..bb4d811fee46 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -108,6 +108,7 @@  struct blk_integrity {
 	const struct blk_integrity_profile	*profile;
 	unsigned char				flags;
 	unsigned char				tuple_size;
+	unsigned char				pi_offset;
 	unsigned char				interval_exp;
 	unsigned char				tag_size;
 };