bio-integrity: revert "stop abusing bi_end_io"

Message ID alpine.LRH.2.02.1708030938110.21391@file01.intranet.prod.int.rdu2.redhat.com (mailing list archive)
State New, archived

Commit Message

Mikulas Patocka Aug. 3, 2017, 2:10 p.m. UTC
On Wed, 2 Aug 2017, Christoph Hellwig wrote:

> On Wed, Aug 02, 2017 at 02:27:50PM +0200, Milan Broz wrote:
> > In the dm-integrity target we register an integrity profile that has
> > both generate_fn and verify_fn callbacks set to NULL.
> > 
> > This is used if dm-integrity is stacked under a dm-crypt device
> > for authenticated encryption (the integrity payload contains the
> > authentication tag and IV seed).
> > 
> > In this case the verification is done through dm-crypt's own crypto
> > API processing; the integrity profile is only the holder of this
> > data. (And the memory is owned by dm-crypt as well.)
> 
> Maybe that's where the problem lies?  You're abusing the integrity
> payload for something that is not end-to-end data integrity at all
> and then wonder why it breaks?  Also the commit that introduced your
> code had absolutely no review by Martin or any of the core block
> folks.
> 
> The right fix is to revert the dm-crypt commit.

That dm-crypt commit that uses the bio integrity payload came 3 months 
before 7c20f11680a441df09de7235206f70115fbf6290 and it was already 
present in 4.12.

The patch 7c20f11680a441df09de7235206f70115fbf6290 not only breaks 
dm-crypt and dm-integrity. It also breaks regular integrity payload usage 
when stacking drivers are used.

Suppose that a bio goes through the device mapper stack. At each level 
a clone bio is allocated and forwarded to a lower level. It eventually 
reaches the request-based physical disk driver.
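
Roughly, each level does something like this (a simplified sketch, not 
the actual drivers/md/dm.c code; forward_one_level and remap_sector are 
illustrative names, clone_endio is the per-level completion that ends 
the parent bio):

        static void forward_one_level(struct bio *bio, struct dm_target *ti,
                                      struct bio_set *bs)
        {
                struct bio *clone = bio_clone_bioset(bio, GFP_NOIO, bs);

                /* the integrity payload is cloned along with the data */
                clone->bi_end_io = clone_endio;  /* completes the parent */
                clone->bi_iter.bi_sector =
                        remap_sector(ti, bio->bi_iter.bi_sector);
                generic_make_request(clone);     /* submit one level down */
        }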

In kernel 4.12, when the bio reaches the request-based driver, the 
function blk_queue_bio is called and it calls bio_integrity_prep. 
bio_integrity_prep saves bio->bi_end_io in the structure 
bio_integrity_payload and sets bio->bi_end_io = bio_integrity_endio. When 
the bio is finished, bio_integrity_endio is called; it performs the 
verification (a call to bio_integrity_verify_fn), restores bio->bi_end_io 
and calls it.
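
In outline, that mechanism (which the patch below restores) is:

        /* bio_integrity_prep(), reads only (simplified): */
        bip->bip_end_io = bio->bi_end_io;       /* save caller's completion */
        bio->bi_end_io = bio_integrity_endio;   /* interpose the verifier */

        /* on completion, bio_integrity_endio() queues
         * bio_integrity_verify_fn(), which checks the data against the
         * protection buffer and then does: */
        bio->bi_end_io = bip->bip_end_io;       /* restore */
        bio_endio(bio);                         /* run the original one */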

The patch 7c20f11680a441df09de7235206f70115fbf6290 changes this so that 
bio->bi_end_io is not replaced and bio_integrity_endio is called directly 
from bio_endio if the bio has an integrity payload. The unintended 
consequence of this change is that when a request with an integrity 
payload is finished and bio_endio is called for each cloned bio, the 
verification routine bio_integrity_verify_fn is called for every level 
in the stack (it used to be called only for the lowest level in 4.12). 
At different levels in the stack, the bio may have a different bi_sector 
value, so the repeated verification can't succeed.
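
For comparison, the completion path after that patch looks roughly like 
this (abbreviated from block/bio.c; see also the hunk removed below):

        void bio_endio(struct bio *bio)
        {
        again:
                if (!bio_remaining_done(bio))
                        return;
                if (!bio_integrity_endio(bio))  /* every bio that carries a
                                                 * bip - i.e. every level of
                                                 * the stack - queues a
                                                 * verification here */
                        return;
                ...
                if (bio->bi_end_io)
                        bio->bi_end_io(bio);    /* dm's clone_endio ends the
                                                 * parent bio and re-enters
                                                 * bio_endio one level up,
                                                 * with a different bi_sector */
        }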

I think the patch 7c20f11680a441df09de7235206f70115fbf6290 should be 
reverted and it should be reworked for the next merge window, so that it 
calls bio_integrity_verify_fn only for the lowest level.

Mikulas



From: Mikulas Patocka <mpatocka@redhat.com>

The patch 7c20f11680a441df09de7235206f70115fbf6290 ("bio-integrity: stop 
abusing bi_end_io") changes the code so that it doesn't replace 
bio->bi_end_io with bio_integrity_endio and instead calls 
bio_integrity_endio directly from bio_endio.

The unintended consequence of this change is that when a bio with an 
integrity payload is passed through the device stack, bio_integrity_endio 
is called for each level of the stack (it used to be called only for the 
lowest level before).

This breaks dm-integrity and dm-crypt, and it also causes verification 
errors when a bio with a regular integrity payload is passed through the 
device mapper stack.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>

Comments

Christoph Hellwig Aug. 5, 2017, 1:44 p.m. UTC | #1
On Thu, Aug 03, 2017 at 10:10:55AM -0400, Mikulas Patocka wrote:
> That dm-crypt commit that uses the bio integrity payload came 3 months 
> before 7c20f11680a441df09de7235206f70115fbf6290 and it was already 
> present in 4.12.

And on its own that isn't an argument if your usage is indeed wrong,
and that's why we are having this discussion.

> The patch 7c20f11680a441df09de7235206f70115fbf6290 changes this so that 
> bio->bi_end_io is not replaced and bio_integrity_endio is called directly 
> from bio_endio if the bio has an integrity payload. The unintended 
> consequence of this change is that when a request with an integrity 
> payload is finished and bio_endio is called for each cloned bio, the 
> verification routine bio_integrity_verify_fn is called for every level 
> in the stack (it used to be called only for the lowest level in 4.12). 
> At different levels in the stack, the bio may have a different bi_sector 
> value, so the repeated verification can't succeed.

We can simply add another bio flag to get back to the previous
behavior.  That being said, the thing to do in the end is to verify it
at the top of the stack, and not the bottom, eventually.  I can cook
up a patch for that.

> I think the patch 7c20f11680a441df09de7235206f70115fbf6290 should be 
> reverted and it should be reworked for the next merge window, so that it 
> calls bio_integrity_verify_fn only for the lowest level.

And with this you will just reintroduce the memory leak for
on-stack bios we've fixed.

I'll take a look at not calling the verify function for every level,
which is wrong, and in the meantime we can discuss the uses and abuses
of the bio integrity code by dm in the separate subthread.
Mikulas Patocka Aug. 5, 2017, 2:46 p.m. UTC | #2
On Sat, 5 Aug 2017, Christoph Hellwig wrote:

> 
> On Thu, Aug 03, 2017 at 10:10:55AM -0400, Mikulas Patocka wrote:
> > That dm-crypt commit that uses the bio integrity payload came 3 months 
> > before 7c20f11680a441df09de7235206f70115fbf6290 and it was already 
> > present in 4.12.
> 
> And on its own that isn't an argument if your usage is indeed wrong,
> and that's why we are having this discussion.
> 
> > The patch 7c20f11680a441df09de7235206f70115fbf6290 changes this so that 
> > bio->bi_end_io is not replaced and bio_integrity_endio is called directly 
> > from bio_endio if the bio has an integrity payload. The unintended 
> > consequence of this change is that when a request with an integrity 
> > payload is finished and bio_endio is called for each cloned bio, the 
> > verification routine bio_integrity_verify_fn is called for every level 
> > in the stack (it used to be called only for the lowest level in 4.12). 
> > At different levels in the stack, the bio may have a different bi_sector 
> > value, so the repeated verification can't succeed.
> 
> We can simply add another bio flag to get back to the previous
> behavior.  That being said, the thing to do in the end is to verify it
> at the top of the stack, and not the bottom, eventually.  I can cook
> up a patch for that.

The sector number in the integrity tag must match the physical sector 
number. So, it must be verified at the bottom.

If the sector number in the integrity tag matched the logical sector 
number in the logical volume, it would create a lot of problems - for 
example, it would be impossible to move the logical volume and it would be 
impossible to copy the full physical volume.

> > I think the patch 7c20f11680a441df09de7235206f70115fbf6290 should be 
> > reverted and it should be reworked for the next merge window, so that it 
> > calls bio_integrity_verify_fn only for the lowest level.
> 
> And with this you will just reintroduce the memory leak for
> on-stack bios we've fixed.
> 
> I'll take a look at not calling the verify function for every level,
> which is wrong, and in the meantime we can discuss the uses and abuses
> of the bio integrity code by dm in the separate subthread.

It would be useful if dm-crypt could put the authenticated tag directly 
into the integrity data. But for that usage, the lowest level physical 
driver must not generate or verify the integrity data - it must just read 
and write them.

Mikulas
Martin K. Petersen Aug. 5, 2017, 8:19 p.m. UTC | #3
Christoph,

> We can simply add another bio flag to get back to the previous
> behavior.  That being said, the thing to do in the end is to verify it
> at the top of the stack, and not the bottom, eventually.  I can cook
> up a patch for that.

Yeah, the original code was careful about only adding the verification
hook to the top bio.

A bio flag is probably the path of least resistance. It already exists,
actually: BIP_BLOCK_INTEGRITY. But we'll need to make sure it gets
masked out when a bio is cloned. And then we can key off of that and
REQ_OP_READ in the endio function.
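
Something along these lines, perhaps (an untested sketch of that
suggestion, not a finished patch):

        bool __bio_integrity_endio(struct bio *bio)
        {
                struct bio_integrity_payload *bip = bio_integrity(bio);

                /* Verify only the bio that bio_integrity_prep() set up;
                 * clones must not inherit BIP_BLOCK_INTEGRITY. */
                if (bio_op(bio) == REQ_OP_READ && !bio->bi_status &&
                    (bip->bip_flags & BIP_BLOCK_INTEGRITY)) {
                        INIT_WORK(&bip->bip_work, bio_integrity_verify_fn);
                        queue_work(kintegrityd_wq, &bip->bip_work);
                        return false;
                }

                bio_integrity_free(bio);
                return true;
        }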

I prefer that approach to reverting Christoph's commit.
Martin K. Petersen Aug. 5, 2017, 8:25 p.m. UTC | #4
Mikulas,

> The sector number in the integrity tag must match the physical sector 
> number. So, it must be verified at the bottom.

The ref tag seed matches the submitter block number (typically block
layer sector for the top device) and is remapped to and from the LBA by
the SCSI disk driver or HBA firmware.

So the verification is designed to be done by the top level entity that
attaches the protection information to the bio. In this case
bio_integrity_prep().
Mikulas Patocka Aug. 6, 2017, 6:49 p.m. UTC | #5
On Sat, 5 Aug 2017, Martin K. Petersen wrote:

> Mikulas,
> 
> > The sector number in the integrity tag must match the physical sector 
> > number. So, it must be verified at the bottom.
> 
> The ref tag seed matches the submitter block number (typically block
> layer sector for the top device) and is remapped to and from the LBA by
> the SCSI disk driver or HBA firmware.
> 
> So the verification is designed to be done by the top level entity that
> attaches the protection information to the bio. In this case
> bio_integrity_prep().

bio_integrity_prep is called by blk_queue_bio - that is, by the lowest 
level physical disk driver, not the top level. It is not called by 
device mapper.

Documentation/block/data-integrity.txt says that bio_integrity_prep() can 
be called by an integrity-aware filesystem, but it seems that no such 
filesystem exists.

If you create the integrity tag at or above the device mapper level, you 
will run into problems, because the same device can be accessed both 
through device mapper and through the physical volume /dev/sd*. If you 
create integrity tags at the device mapper level, they will contain the 
device mapper's logical sector number, and the sector numbers won't match 
if you access the device directly using /dev/sd*.


Mikulas
Martin K. Petersen Aug. 7, 2017, 3:48 p.m. UTC | #6
Mikulas,

> If you create the integrity tag at or above the device mapper level, you
> will run into problems, because the same device can be accessed both
> through device mapper and through the physical volume /dev/sd*. If you
> create integrity tags at the device mapper level, they will contain the
> device mapper's logical sector number, and the sector numbers won't
> match if you access the device directly using /dev/sd*.

For writes, the bip seed value is adjusted every time a bio is cloned,
sliced and diced as it traverses partitioning/MD/DM. And then at the
bottom of the stack, the ref tag values in the protection information
buffer are verified against the adjusted seed value and remapped
according to the starting logical block number. The reverse is taking
place for reads.
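
As a sketch of that bookkeeping (simplified, assuming 512-byte
protection intervals; not the literal kernel function): when a bio is
advanced or split, the seed moves along with the data:

        void bio_integrity_advance(struct bio *bio, unsigned int bytes_done)
        {
                struct bio_integrity_payload *bip = bio_integrity(bio);
                struct blk_integrity *bi = bdev_get_integrity(bio->bi_bdev);
                unsigned int bytes = bio_integrity_bytes(bi, bytes_done >> 9);

                /* keep the submitter-space seed in sync with the data */
                bip->bip_iter.bi_sector += bytes_done >> 9;
                bvec_iter_advance(bip->bip_vec, &bip->bip_iter, bytes);
        }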

This is much faster than doing a remapping of the actual protection
buffer values every time the I/O transitions from one address space to
the other. In addition, some HBA hardware allows us to program the PI
engine with the seed value. So the submitter value to LBA conversion can
be done on the fly in hardware.
Milan Broz Aug. 8, 2017, 6:47 a.m. UTC | #7
Hi,

On 08/07/2017 05:48 PM, Martin K. Petersen wrote:
> 
>> If you create the integrity tag at or above the device mapper level, you
>> will run into problems, because the same device can be accessed both
>> through device mapper and through the physical volume /dev/sd*. If you
>> create integrity tags at the device mapper level, they will contain the
>> device mapper's logical sector number, and the sector numbers won't
>> match if you access the device directly using /dev/sd*.
> 
> For writes, the bip seed value is adjusted every time a bio is cloned,
> sliced and diced as it traverses partitioning/MD/DM. And then at the
> bottom of the stack, the ref tag values in the protection information
> buffer are verified against the adjusted seed value and remapped
> according to the starting logical block number. The reverse is taking
> place for reads.
> 
> This is much faster than doing a remapping of the actual protection
> buffer values every time the I/O transitions from one address space to
> the other. In addition, some HBA hardware allows us to program the PI
> engine with the seed value. So the submitter value to LBA conversion can
> be done on the fly in hardware.

Yes, this is how enterprise storage works and how the PI was designed.

For authenticated encryption with additional data (AEAD) we have to 
authenticate the sector number as well (as part of the additional data - 
the authenticated but not encrypted part).

Unfortunately the whole design cannot work this way if the AEAD is 
implemented at some higher layer. (A higher layer has the big advantage 
that you do not need to trust the underlying storage stack, because you 
will detect even intentional tampering with the data.) So only the layer 
that owns the keys (here dm-crypt) can calculate or verify the auth tags 
(and then decrypt the data). Nobody else can "remap" the sector in the 
tag - without the key you cannot recalculate the auth tags.

In other words, we just treat the whole additional data (delivered in 
bio-integrity) as the "Application tag" (if using DIF terminology...).


Anyway, could we please fix the 4.13 kernel now so that it does not crash 
with this dm-integrity use? This is urgent for people who already use 
dm-integrity from 4.12.


If you want a follow-up discussion, I would really welcome seeing what 
exactly is wrong, because I think we used the bio-integrity interface 
through the existing API and according to the documentation.
We are just unfortunately the first user with our own profile (different 
from those defined in T10).
And I think it can also help people who will later try to implement a 
bio-integrity-aware filesystem.

Thanks,
Milan
Christoph Hellwig Aug. 9, 2017, 2:07 p.m. UTC | #8
On Sat, Aug 05, 2017 at 04:19:30PM -0400, Martin K. Petersen wrote:
> A bio flag is probably the path of least resistance. It already exists,
> actually: BIP_BLOCK_INTEGRITY. But we'll need to make sure it gets
> masked out when a bio is cloned. And then we can key off of that and
> REQ_OP_READ in the endio function.

bio_integrity_clone doesn't even clone it - it allocates a new
bip and then only copies over bip_vec, bip_vcnt and bip_iter.
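
For reference, in outline (abbreviated from block/bio-integrity.c):

        int bio_integrity_clone(struct bio *bio, struct bio *bio_src,
                                gfp_t gfp_mask)
        {
                struct bio_integrity_payload *bip_src = bio_integrity(bio_src);
                struct bio_integrity_payload *bip;

                bip = bio_integrity_alloc(bio, gfp_mask, bip_src->bip_vcnt);
                if (IS_ERR(bip))
                        return PTR_ERR(bip);

                memcpy(bip->bip_vec, bip_src->bip_vec,
                       bip_src->bip_vcnt * sizeof(struct bio_vec));

                bip->bip_vcnt = bip_src->bip_vcnt;
                bip->bip_iter = bip_src->bip_iter;
                /* note: bip_flags is not copied */

                return 0;
        }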

I'll prepare a patch for this and will send it out together with
the original hotfix for the verify_fn, which also needs a trivial
line-length fix anyway.

Patch

diff --git a/block/bio-integrity.c b/block/bio-integrity.c
index f733beab6ca2..8df4eb103ba9 100644
--- a/block/bio-integrity.c
+++ b/block/bio-integrity.c
@@ -102,7 +102,7 @@  EXPORT_SYMBOL(bio_integrity_alloc);
  * Description: Used to free the integrity portion of a bio. Usually
  * called from bio_free().
  */
-static void bio_integrity_free(struct bio *bio)
+void bio_integrity_free(struct bio *bio)
 {
 	struct bio_integrity_payload *bip = bio_integrity(bio);
 	struct bio_set *bs = bio->bi_pool;
@@ -120,8 +120,8 @@  static void bio_integrity_free(struct bio *bio)
 	}
 
 	bio->bi_integrity = NULL;
-	bio->bi_opf &= ~REQ_INTEGRITY;
 }
+EXPORT_SYMBOL(bio_integrity_free);
 
 /**
  * bio_integrity_add_page - Attach integrity metadata
@@ -326,6 +326,12 @@  bool bio_integrity_prep(struct bio *bio)
 		offset = 0;
 	}
 
+	/* Install custom I/O completion handler if read verify is enabled */
+	if (bio_data_dir(bio) == READ) {
+		bip->bip_end_io = bio->bi_end_io;
+		bio->bi_end_io = bio_integrity_endio;
+	}
+
 	/* Auto-generate integrity metadata if this is a write */
 	if (bio_data_dir(bio) == WRITE) {
 		bio_integrity_process(bio, &bio->bi_iter,
@@ -369,12 +375,13 @@  static void bio_integrity_verify_fn(struct work_struct *work)
 		bio->bi_status = BLK_STS_IOERR;
 	}
 
-	bio_integrity_free(bio);
+	/* Restore original bio completion handler */
+	bio->bi_end_io = bip->bip_end_io;
 	bio_endio(bio);
 }
 
 /**
- * __bio_integrity_endio - Integrity I/O completion function
+ * bio_integrity_endio - Integrity I/O completion function
  * @bio:	Protected bio
  * @error:	Pointer to errno
  *
@@ -385,19 +392,27 @@  static void bio_integrity_verify_fn(struct work_struct *work)
  * in process context.	This function postpones completion
  * accordingly.
  */
-bool __bio_integrity_endio(struct bio *bio)
+void bio_integrity_endio(struct bio *bio)
 {
-	if (bio_op(bio) == REQ_OP_READ && !bio->bi_status) {
-		struct bio_integrity_payload *bip = bio_integrity(bio);
+	struct bio_integrity_payload *bip = bio_integrity(bio);
 
-		INIT_WORK(&bip->bip_work, bio_integrity_verify_fn);
-		queue_work(kintegrityd_wq, &bip->bip_work);
-		return false;
+	BUG_ON(bip->bip_bio != bio);
+
+	/* In case of an I/O error there is no point in verifying the
+	 * integrity metadata.  Restore original bio end_io handler
+	 * and run it.
+	 */
+	if (bio->bi_status) {
+		bio->bi_end_io = bip->bip_end_io;
+		bio_endio(bio);
+
+		return;
 	}
 
-	bio_integrity_free(bio);
-	return true;
+	INIT_WORK(&bip->bip_work, bio_integrity_verify_fn);
+	queue_work(kintegrityd_wq, &bip->bip_work);
 }
+EXPORT_SYMBOL(bio_integrity_endio);
 
 /**
  * bio_integrity_advance - Advance integrity vector
diff --git a/block/bio.c b/block/bio.c
index 9cabf5d0be20..a6b225324a61 100644
--- a/block/bio.c
+++ b/block/bio.c
@@ -243,6 +243,9 @@  struct bio_vec *bvec_alloc(gfp_t gfp_mask, int nr, unsigned long *idx,
 void bio_uninit(struct bio *bio)
 {
 	bio_disassociate_task(bio);
+
+	if (bio_integrity(bio))
+		bio_integrity_free(bio);
 }
 EXPORT_SYMBOL(bio_uninit);
 
@@ -1810,8 +1813,6 @@  void bio_endio(struct bio *bio)
 again:
 	if (!bio_remaining_done(bio))
 		return;
-	if (!bio_integrity_endio(bio))
-		return;
 
 	/*
 	 * Need to have a real endio function for chained bios, otherwise
diff --git a/block/blk.h b/block/blk.h
index 3a3d715bd725..01ebb8185f6b 100644
--- a/block/blk.h
+++ b/block/blk.h
@@ -81,21 +81,10 @@  static inline void blk_queue_enter_live(struct request_queue *q)
 
 #ifdef CONFIG_BLK_DEV_INTEGRITY
 void blk_flush_integrity(void);
-bool __bio_integrity_endio(struct bio *);
-static inline bool bio_integrity_endio(struct bio *bio)
-{
-	if (bio_integrity(bio))
-		return __bio_integrity_endio(bio);
-	return true;
-}
 #else
 static inline void blk_flush_integrity(void)
 {
 }
-static inline bool bio_integrity_endio(struct bio *bio)
-{
-	return true;
-}
 #endif
 
 void blk_timeout_work(struct work_struct *work);
diff --git a/include/linux/bio.h b/include/linux/bio.h
index 7b1cf4ba0902..1eba19580185 100644
--- a/include/linux/bio.h
+++ b/include/linux/bio.h
@@ -320,6 +320,8 @@  struct bio_integrity_payload {
 
 	struct bvec_iter	bip_iter;
 
+	bio_end_io_t		*bip_end_io;	/* saved I/O completion fn */
+
 	unsigned short		bip_slab;	/* slab the bip came from */
 	unsigned short		bip_vcnt;	/* # of integrity bio_vecs */
 	unsigned short		bip_max_vcnt;	/* integrity bio_vec slots */
@@ -737,8 +739,10 @@  struct biovec_slab {
 		bip_for_each_vec(_bvl, _bio->bi_integrity, _iter)
 
 extern struct bio_integrity_payload *bio_integrity_alloc(struct bio *, gfp_t, unsigned int);
+extern void bio_integrity_free(struct bio *);
 extern int bio_integrity_add_page(struct bio *, struct page *, unsigned int, unsigned int);
 extern bool bio_integrity_prep(struct bio *);
+extern void bio_integrity_endio(struct bio *);
 extern void bio_integrity_advance(struct bio *, unsigned int);
 extern void bio_integrity_trim(struct bio *);
 extern int bio_integrity_clone(struct bio *, struct bio *, gfp_t);
@@ -768,6 +772,11 @@  static inline bool bio_integrity_prep(struct bio *bio)
 	return true;
 }
 
+static inline void bio_integrity_free(struct bio *bio)
+{
+	return;
+}
+
 static inline int bio_integrity_clone(struct bio *bio, struct bio *bio_src,
 				      gfp_t gfp_mask)
 {