Message ID | 20240201130126.211402-3-joshi.k@samsung.com (mailing list archive)
---|---
State | New, archived
Series | Block integrity with flexible-offset PI
On Thu, Feb 01, 2024 at 06:31:25PM +0530, Kanchan Joshi wrote:
> Block layer integrity processing assumes that protection information
> (PI) is placed in the first bytes of each metadata block.
>
> Remove this limitation and include the metadata before the PI in the
> calculation of the guard tag.

Very late reply, but I am just now discovering the consequences of this
patch.

We have drives with this format, 64b metadata with PI at the end. With
previous kernels, we had written data to these drives. Those kernel
versions disabled the GUARD generation, so the metadata was written
without it, and everything was fine.

Now we upgrade to 6.9+, and this kernel enables the GUARD check. All the
data previously written to this drive is unreadable because the GUARD is
invalid.

Not sure exactly what to do about this, but it is a broken kernel
upgrade path, so wanted to throw that information out there.
On 9/26/2024 8:37 PM, Keith Busch wrote:
> On Thu, Feb 01, 2024 at 06:31:25PM +0530, Kanchan Joshi wrote:
>> Block layer integrity processing assumes that protection information
>> (PI) is placed in the first bytes of each metadata block.
>>
>> Remove this limitation and include the metadata before the PI in the
>> calculation of the guard tag.
>
> Very late reply, but I am just now discovering the consequences of this
> patch.
>
> We have drives with this format, 64b metadata with PI at the end. With
> previous kernels, we had written data to these drives. Those kernel
> versions disabled the GUARD generation, so the metadata was written
> without it, and everything was fine.
>
> Now we upgrade to 6.9+, and this kernel enables the GUARD check. All the
> data previously written to this drive is unreadable because the GUARD is
> invalid.
>
> Not sure exactly what to do about this, but it is a broken kernel
> upgrade path, so wanted to throw that information out there.

Ah, writing to the disk without any guard, but after that reading with
the guard! I wish there was a way to format the NVMe namespace just to
reset its pi type (from non-zero to 0).

But there are kernel knobs too. Hope you are able to get to the same
state (as nop profile) by clearing write_generate and read_verify:

echo 0 > /sys/block/nvme0n1/integrity/read_verify
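For reference, read_verify and its sibling knob write_generate only flip
flags that gate the block layer's own generate/verify pass; they do not
change what the device itself checks. A paraphrased sketch of the flag
handling behind those sysfs files (loosely based on block/blk-integrity.c;
not verbatim kernel source, and details vary by kernel version):

	/*
	 * Paraphrased sketch, not verbatim kernel source: writing 0/1 to
	 * /sys/block/<disk>/integrity/{read_verify,write_generate} ends up
	 * clearing or setting BLK_INTEGRITY_VERIFY / BLK_INTEGRITY_GENERATE,
	 * which only gate the host-side PI processing in bio_integrity_prep().
	 */
	static ssize_t flag_store(struct device *dev, const char *page,
				  size_t count, unsigned char flag)
	{
		struct blk_integrity *bi = &dev_to_disk(dev)->queue->integrity;
		char *p = (char *) page;
		unsigned long val = simple_strtoul(p, &p, 10);

		if (val)
			bi->flags |= flag;
		else
			bi->flags &= ~flag;

		return count;
	}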
On Thu, Sep 26, 2024 at 10:08:09PM +0530, Kanchan Joshi wrote:
> But there are kernel knobs too. Hope you are able to get to the same
> state (as nop profile) by clearing write_generate and read_verify:
> echo 0 > /sys/block/nvme0n1/integrity/read_verify

It's not the kernel's verify causing the failure; it's the end device.
For nvme, it'll return status 0x282, End-to-end Guard Check Error, so
the kernel doesn't have a chance to check the data. We'd need to turn
off the command's NVME_RW_PRINFO_PRCHK_GUARD flag, but there's currently
no knob to toggle that.
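For context, those PRCHK bits are derived from the namespace's PI type
when the driver builds each read/write command, which is why no userspace
knob reaches them today. A condensed, paraphrased sketch of that selection
(modelled on nvme_setup_rw() in drivers/nvme/host/core.c; the helper name
here is made up for illustration):

	/*
	 * Condensed illustration, not verbatim driver code: the PRINFO check
	 * bits in the NVMe read/write command are chosen from the namespace
	 * PI type alone.  nvme_prinfo_bits() is a hypothetical helper name.
	 */
	static u16 nvme_prinfo_bits(u8 pi_type)
	{
		u16 control = 0;

		switch (pi_type) {
		case NVME_NS_DPS_PI_TYPE3:
			/* Type 3 carries no reference tag to check */
			control |= NVME_RW_PRINFO_PRCHK_GUARD;
			break;
		case NVME_NS_DPS_PI_TYPE1:
		case NVME_NS_DPS_PI_TYPE2:
			control |= NVME_RW_PRINFO_PRCHK_GUARD |
				   NVME_RW_PRINFO_PRCHK_REF;
			break;
		default:
			break;
		}

		return control;
	}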
On 9/26/2024 10:25 PM, Keith Busch wrote:
> On Thu, Sep 26, 2024 at 10:08:09PM +0530, Kanchan Joshi wrote:
>> But there are kernel knobs too. Hope you are able to get to the same
>> state (as nop profile) by clearing write_generate and read_verify:
>> echo 0 > /sys/block/nvme0n1/integrity/read_verify
>
> It's not the kernel's verify causing the failure; it's the end device.
> For nvme, it'll return status 0x282, End-to-end Guard Check Error, so
> the kernel doesn't have a chance to check the data. We'd need to turn
> off the command's NVME_RW_PRINFO_PRCHK_GUARD flag, but there's currently
> no knob to toggle that.

Indeed.

I spent a good deal of time on this today. I was thinking to connect
block read_verify/write_generate knobs to influence things at nvme level
(those PRCHK flags). But that will not be enough. Because with those
knobs block-layer will not attach meta-buffer, which is still needed.

The data was written under the condition when nvme driver set the
pi_type to 0 (even though at device level it was non-zero) during
integrity registration.

Thinking whether it will make sense to have a knob at the block-layer
level to do something like that i.e., override the set
integrity-profile with nop.
Kanchan,

> I spent a good deal of time on this today. I was thinking to connect
> block read_verify/write_generate knobs to influence things at nvme level
> (those PRCHK flags). But that will not be enough. Because with those
> knobs block-layer will not attach meta-buffer, which is still needed.
>
> The data was written under the condition when nvme driver set the
> pi_type to 0 (even though at device level it was non-zero) during
> integrity registration.
>
> Thinking whether it will make sense to have a knob at the block-layer
> level to do something like that i.e., override the set
> integrity-profile with nop.

SCSI went to great lengths to ensure that invalid protection information
would never be written during normal operation, regardless of whether
the host sent PI or not. And thus the only time one would anticipate a
PI error was if the data had actually been corrupted.
On 30.09.2024 13:57, Martin K. Petersen wrote:
>
> Kanchan,
>
>> I spent a good deal of time on this today. I was thinking to connect
>> block read_verify/write_generate knobs to influence things at nvme level
>> (those PRCHK flags). But that will not be enough. Because with those
>> knobs block-layer will not attach meta-buffer, which is still needed.
>>
>> The data was written under the condition when nvme driver set the
>> pi_type to 0 (even though at device level it was non-zero) during
>> integrity registration.
>>
>> Thinking whether it will make sense to have a knob at the block-layer
>> level to do something like that i.e., override the set
>> integrity-profile with nop.
>
> SCSI went to great lengths to ensure that invalid protection information
> would never be written during normal operation, regardless of whether
> the host sent PI or not. And thus the only time one would anticipate a
> PI error was if the data had actually been corrupted.
>

Is this something we should work on bringing to the NVMe TWG?

Javier
On Tue, Oct 01, 2024 at 09:27:08AM +0200, Javier González wrote:
> On 30.09.2024 13:57, Martin K. Petersen wrote:
> >
> > Kanchan,
> >
> > > I spent a good deal of time on this today. I was thinking to connect
> > > block read_verify/write_generate knobs to influence things at nvme level
> > > (those PRCHK flags). But that will not be enough. Because with those
> > > knobs block-layer will not attach meta-buffer, which is still needed.
> > >
> > > The data was written under the condition when nvme driver set the
> > > pi_type to 0 (even though at device level it was non-zero) during
> > > integrity registration.
> > >
> > > Thinking whether it will make sense to have a knob at the block-layer
> > > level to do something like that i.e., override the set
> > > integrity-profile with nop.
> >
> > SCSI went to great lengths to ensure that invalid protection information
> > would never be written during normal operation, regardless of whether
> > the host sent PI or not. And thus the only time one would anticipate a
> > PI error was if the data had actually been corrupted.
>
> Is this something we should work on bringing to the NVMe TWG?

Maybe. It looks like they did the spec this way on purpose with the
ability to toggle guard tags per IO.

Just some more background on this because it may sound odd to use a data
protection namespace format that the kernel didn't support:

In this use case, writes to the device primarily come from the
passthrough interface, which could always use the guard tags for
end-to-end protection. The kernel block IO was the only path that had
the limitation.

Besides the passthrough interface, though, the setup uses kernel block
layer to write the partition tables. Upgrading from 6.8 -> 6.9 won't be
able to read the partition table on these devices. I'm still not sure
the best way to handle this, though.
On 01.10.2024 09:37, Keith Busch wrote:
> On Tue, Oct 01, 2024 at 09:27:08AM +0200, Javier González wrote:
>> On 30.09.2024 13:57, Martin K. Petersen wrote:
>> >
>> > Kanchan,
>> >
>> > > I spent a good deal of time on this today. I was thinking to connect
>> > > block read_verify/write_generate knobs to influence things at nvme level
>> > > (those PRCHK flags). But that will not be enough. Because with those
>> > > knobs block-layer will not attach meta-buffer, which is still needed.
>> > >
>> > > The data was written under the condition when nvme driver set the
>> > > pi_type to 0 (even though at device level it was non-zero) during
>> > > integrity registration.
>> > >
>> > > Thinking whether it will make sense to have a knob at the block-layer
>> > > level to do something like that i.e., override the set
>> > > integrity-profile with nop.
>> >
>> > SCSI went to great lengths to ensure that invalid protection information
>> > would never be written during normal operation, regardless of whether
>> > the host sent PI or not. And thus the only time one would anticipate a
>> > PI error was if the data had actually been corrupted.
>> >
>>
>> Is this something we should work on bringing to the NVMe TWG?
>
> Maybe. It looks like they did the spec this way on purpose with the
> ability to toggle guard tags per IO.
>
> Just some more background on this because it may sound odd to use a data
> protection namespace format that the kernel didn't support:
>
> In this use case, writes to the device primarily come from the
> passthrough interface, which could always use the guard tags for
> end-to-end protection. The kernel block IO was the only path that had
> the limitation.
>
> Besides the passthrough interface, though, the setup uses kernel block
> layer to write the partition tables. Upgrading from 6.8 -> 6.9 won't be
> able to read the partition table on these devices. I'm still not sure
> the best way to handle this, though.

Mmmm. Need to think more about this. Seems this is by design in NVMe.

Will try to come back to Mike and Judy to understand the background for
this at the time...
Keith,

> Besides the passthrough interface, though, the setup uses kernel block
> layer to write the partition tables. Upgrading from 6.8 -> 6.9 won't
> be able to read the partition table on these devices. I'm still not
> sure the best way to handle this, though.

That's why, despite Type 2 supporting offset ref tags, the kernel always
uses the LBA. This is to ensure that any application which doesn't have
prior knowledge of the contents of each block's PI can perform a read
without failing. Including the kernel itself when reading the partition
table. But it's not just a kernel problem, the BIOS also needs to be
able to read arbitrary blocks, at least at the beginning and end of the
device.

Many of the features of the original DIF spec were squarely aimed at use
inside storage arrays where there was a single entity, the array
firmware, performing all reads and writes. With DIX, since the interface
needed to be used by array firmware and general purpose operating
systems alike, we needed to add all the required features to the
interface to accommodate every use case. And NVMe inherited this. But
NVMe did not really capture which things are suitable only in the
"single application" scenario.

I'm not sure there's a good fix. Other than as a passthrough user, the
burden is on you to ensure that things can be read by somebody else.
Could be backup applications, virus scanners, system firmware, etc.
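The "kernel always uses the LBA" point corresponds to the block layer's
expected-reference-tag helper; a close paraphrase of t10_pi_ref_tag()
from include/linux/t10-pi.h, with the CONFIG_BLK_DEV_INTEGRITY guards
omitted:

	/*
	 * Close paraphrase of t10_pi_ref_tag() (include/linux/t10-pi.h): the
	 * expected Type 1/2 reference tag is derived purely from the request's
	 * starting LBA, so any reader (kernel, BIOS, backup tool) can recompute
	 * it without knowing what was previously stored in the block's PI.
	 */
	static inline u32 t10_pi_ref_tag(struct request *rq)
	{
		unsigned int shift = ilog2(queue_logical_block_size(rq->q));

		if (rq->q->integrity.interval_exp)
			shift = rq->q->integrity.interval_exp;

		return blk_rq_pos(rq) >> (shift - SECTOR_SHIFT) & 0xffffffff;
	}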
On Wed, Oct 02, 2024 at 12:18:37PM -0400, Martin K. Petersen wrote:
> I'm not sure there's a good fix. Other than as a passthrough user, the
> burden is on you to ensure that things can be read by somebody else.
> Could be backup applications, virus scanners, system firmware, etc.

Maybe we should disable device side guard checks on read commands. NVMe
spec allows us to do that. This way the host would always be able to
read data, and the host can optionally validate the guard on its side
if it wants to.
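One possible shape of that idea, purely as an illustrative sketch rather
than an actual patch: clear the guard-check bit when the command is a
read and leave writes untouched. The helper name and placement here are
hypothetical.

	/*
	 * Hypothetical sketch only, not an actual kernel change: keep
	 * PRCHK_GUARD on writes, but clear it for reads so the device never
	 * fails a read over a stale or absent guard; the host can still
	 * verify on its side through the block layer's read_verify path.
	 */
	static u16 nvme_adjust_prinfo_for_reads(struct request *req, u16 control)
	{
		if (req_op(req) == REQ_OP_READ)
			control &= ~NVME_RW_PRINFO_PRCHK_GUARD;

		return control;
	}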
diff --git a/block/bio-integrity.c b/block/bio-integrity.c
index c9a16fba58b9..2e3e8e04961e 100644
--- a/block/bio-integrity.c
+++ b/block/bio-integrity.c
@@ -395,6 +395,7 @@ static blk_status_t bio_integrity_process(struct bio *bio,
 	iter.tuple_size = bi->tuple_size;
 	iter.seed = proc_iter->bi_sector;
 	iter.prot_buf = bvec_virt(bip->bip_vec);
+	iter.pi_offset = bi->pi_offset;
 
 	__bio_for_each_segment(bv, bio, bviter, *proc_iter) {
 		void *kaddr = bvec_kmap_local(&bv);
diff --git a/block/blk-integrity.c b/block/blk-integrity.c
index d4e9b4556d14..ccbeb6dfa87a 100644
--- a/block/blk-integrity.c
+++ b/block/blk-integrity.c
@@ -370,6 +370,7 @@ void blk_integrity_register(struct gendisk *disk, struct blk_integrity *template
 	bi->profile = template->profile ? template->profile : &nop_profile;
 	bi->tuple_size = template->tuple_size;
 	bi->tag_size = template->tag_size;
+	bi->pi_offset = template->pi_offset;
 
 	blk_queue_flag_set(QUEUE_FLAG_STABLE_WRITES, disk->queue);
 
diff --git a/block/t10-pi.c b/block/t10-pi.c
index 251a7b188963..d90892fd6f2a 100644
--- a/block/t10-pi.c
+++ b/block/t10-pi.c
@@ -32,12 +32,16 @@ static __be16 t10_pi_ip_fn(__be16 csum, void *data, unsigned int len)
 static blk_status_t t10_pi_generate(struct blk_integrity_iter *iter,
 		csum_fn *fn, enum t10_dif_type type)
 {
+	u8 offset = iter->pi_offset;
 	unsigned int i;
 
 	for (i = 0 ; i < iter->data_size ; i += iter->interval) {
-		struct t10_pi_tuple *pi = iter->prot_buf;
+		struct t10_pi_tuple *pi = iter->prot_buf + offset;
 
 		pi->guard_tag = fn(0, iter->data_buf, iter->interval);
+		if (offset)
+			pi->guard_tag = fn(pi->guard_tag, iter->prot_buf,
+					   offset);
 		pi->app_tag = 0;
 
 		if (type == T10_PI_TYPE1_PROTECTION)
@@ -56,12 +60,13 @@ static blk_status_t t10_pi_generate(struct blk_integrity_iter *iter,
 static blk_status_t t10_pi_verify(struct blk_integrity_iter *iter,
 		csum_fn *fn, enum t10_dif_type type)
 {
+	u8 offset = iter->pi_offset;
 	unsigned int i;
 
 	BUG_ON(type == T10_PI_TYPE0_PROTECTION);
 
 	for (i = 0 ; i < iter->data_size ; i += iter->interval) {
-		struct t10_pi_tuple *pi = iter->prot_buf;
+		struct t10_pi_tuple *pi = iter->prot_buf + offset;
 		__be16 csum;
 
 		if (type == T10_PI_TYPE1_PROTECTION ||
@@ -84,6 +89,8 @@ static blk_status_t t10_pi_verify(struct blk_integrity_iter *iter,
 		}
 
 		csum = fn(0, iter->data_buf, iter->interval);
+		if (offset)
+			csum = fn(csum, iter->prot_buf, offset);
 
 		if (pi->guard_tag != csum) {
 			pr_err("%s: guard tag error at sector %llu " \
@@ -134,8 +141,10 @@ static blk_status_t t10_pi_type1_verify_ip(struct blk_integrity_iter *iter)
  */
 static void t10_pi_type1_prepare(struct request *rq)
 {
-	const int tuple_sz = rq->q->integrity.tuple_size;
+	struct blk_integrity *bi = &rq->q->integrity;
+	const int tuple_sz = bi->tuple_size;
 	u32 ref_tag = t10_pi_ref_tag(rq);
+	u8 offset = bi->pi_offset;
 	struct bio *bio;
 
 	__rq_for_each_bio(bio, rq) {
@@ -154,7 +163,7 @@ static void t10_pi_type1_prepare(struct request *rq)
 
 			p = bvec_kmap_local(&iv);
 			for (j = 0; j < iv.bv_len; j += tuple_sz) {
-				struct t10_pi_tuple *pi = p;
+				struct t10_pi_tuple *pi = p + offset;
 
 				if (be32_to_cpu(pi->ref_tag) == virt)
 					pi->ref_tag = cpu_to_be32(ref_tag);
@@ -183,9 +192,11 @@ static void t10_pi_type1_prepare(struct request *rq)
  */
 static void t10_pi_type1_complete(struct request *rq, unsigned int nr_bytes)
 {
-	unsigned intervals = nr_bytes >> rq->q->integrity.interval_exp;
-	const int tuple_sz = rq->q->integrity.tuple_size;
+	struct blk_integrity *bi = &rq->q->integrity;
+	unsigned intervals = nr_bytes >> bi->interval_exp;
+	const int tuple_sz = bi->tuple_size;
 	u32 ref_tag = t10_pi_ref_tag(rq);
+	u8 offset = bi->pi_offset;
 	struct bio *bio;
 
 	__rq_for_each_bio(bio, rq) {
@@ -200,7 +211,7 @@ static void t10_pi_type1_complete(struct request *rq, unsigned int nr_bytes)
 
 			p = bvec_kmap_local(&iv);
 			for (j = 0; j < iv.bv_len && intervals; j += tuple_sz) {
-				struct t10_pi_tuple *pi = p;
+				struct t10_pi_tuple *pi = p + offset;
 
 				if (be32_to_cpu(pi->ref_tag) == ref_tag)
 					pi->ref_tag = cpu_to_be32(virt);
@@ -288,12 +299,16 @@ static __be64 ext_pi_crc64(u64 crc, void *data, unsigned int len)
 static blk_status_t ext_pi_crc64_generate(struct blk_integrity_iter *iter,
 					enum t10_dif_type type)
 {
+	u8 offset = iter->pi_offset;
 	unsigned int i;
 
 	for (i = 0 ; i < iter->data_size ; i += iter->interval) {
-		struct crc64_pi_tuple *pi = iter->prot_buf;
+		struct crc64_pi_tuple *pi = iter->prot_buf + offset;
 
 		pi->guard_tag = ext_pi_crc64(0, iter->data_buf, iter->interval);
+		if (offset)
+			pi->guard_tag = ext_pi_crc64(be64_to_cpu(pi->guard_tag),
+						     iter->prot_buf, offset);
 		pi->app_tag = 0;
 
 		if (type == T10_PI_TYPE1_PROTECTION)
@@ -319,10 +334,11 @@ static bool ext_pi_ref_escape(u8 *ref_tag)
 static blk_status_t ext_pi_crc64_verify(struct blk_integrity_iter *iter,
 					enum t10_dif_type type)
 {
+	u8 offset = iter->pi_offset;
 	unsigned int i;
 
 	for (i = 0; i < iter->data_size; i += iter->interval) {
-		struct crc64_pi_tuple *pi = iter->prot_buf;
+		struct crc64_pi_tuple *pi = iter->prot_buf + offset;
 		u64 ref, seed;
 		__be64 csum;
 
@@ -344,6 +360,10 @@ static blk_status_t ext_pi_crc64_verify(struct blk_integrity_iter *iter,
 		}
 
 		csum = ext_pi_crc64(0, iter->data_buf, iter->interval);
+		if (offset)
+			csum = ext_pi_crc64(be64_to_cpu(csum), iter->prot_buf,
+					    offset);
+
 		if (pi->guard_tag != csum) {
 			pr_err("%s: guard tag error at sector %llu " \
 			       "(rcvd %016llx, want %016llx)\n",
@@ -373,8 +393,10 @@ static blk_status_t ext_pi_type1_generate_crc64(struct blk_integrity_iter *iter)
 
 static void ext_pi_type1_prepare(struct request *rq)
 {
-	const int tuple_sz = rq->q->integrity.tuple_size;
+	struct blk_integrity *bi = &rq->q->integrity;
+	const int tuple_sz = bi->tuple_size;
 	u64 ref_tag = ext_pi_ref_tag(rq);
+	u8 offset = bi->pi_offset;
 	struct bio *bio;
 
 	__rq_for_each_bio(bio, rq) {
@@ -393,7 +415,7 @@ static void ext_pi_type1_prepare(struct request *rq)
 
 			p = bvec_kmap_local(&iv);
 			for (j = 0; j < iv.bv_len; j += tuple_sz) {
-				struct crc64_pi_tuple *pi = p;
+				struct crc64_pi_tuple *pi = p + offset;
 				u64 ref = get_unaligned_be48(pi->ref_tag);
 
 				if (ref == virt)
@@ -411,9 +433,11 @@ static void ext_pi_type1_prepare(struct request *rq)
 
 static void ext_pi_type1_complete(struct request *rq, unsigned int nr_bytes)
 {
-	unsigned intervals = nr_bytes >> rq->q->integrity.interval_exp;
-	const int tuple_sz = rq->q->integrity.tuple_size;
+	struct blk_integrity *bi = &rq->q->integrity;
+	unsigned intervals = nr_bytes >> bi->interval_exp;
+	const int tuple_sz = bi->tuple_size;
 	u64 ref_tag = ext_pi_ref_tag(rq);
+	u8 offset = bi->pi_offset;
 	struct bio *bio;
 
 	__rq_for_each_bio(bio, rq) {
@@ -428,7 +452,7 @@ static void ext_pi_type1_complete(struct request *rq, unsigned int nr_bytes)
 
 			p = bvec_kmap_local(&iv);
 			for (j = 0; j < iv.bv_len && intervals; j += tuple_sz) {
-				struct crc64_pi_tuple *pi = p;
+				struct crc64_pi_tuple *pi = p + offset;
 				u64 ref = get_unaligned_be48(pi->ref_tag);
 
 				if (ref == ref_tag)
diff --git a/include/linux/blk-integrity.h b/include/linux/blk-integrity.h
index 378b2459efe2..e253e7bd0d17 100644
--- a/include/linux/blk-integrity.h
+++ b/include/linux/blk-integrity.h
@@ -20,6 +20,7 @@ struct blk_integrity_iter {
 	unsigned int		data_size;
 	unsigned short		interval;
 	unsigned char		tuple_size;
+	unsigned char		pi_offset;
 	const char		*disk_name;
 };
 
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index d7cac3de65b3..bb4d811fee46 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -108,6 +108,7 @@ struct blk_integrity {
 	const struct blk_integrity_profile	*profile;
 	unsigned char				flags;
 	unsigned char				tuple_size;
+	unsigned char				pi_offset;
 	unsigned char				interval_exp;
 	unsigned char				tag_size;
 };