[5/8] dm: always setup ->orig_bio in alloc_io

Message ID 20220412085616.1409626-6-ming.lei@redhat.com (mailing list archive)
State Accepted, archived
Delegated to: Mike Snitzer
Series: dm: io accounting & polling improvement

Commit Message

Ming Lei April 12, 2022, 8:56 a.m. UTC
The current DM code sets up ->orig_bio after __map_bio() returns. This
not only causes a kernel panic for dm-zone, but is also a bit ugly and
tricky, especially the waiting until ->orig_bio is set in
dm_submit_bio_remap().

The reason is that a new bio is cloned from the original FS bio to
represent the mapped part, and it serves io accounting only.

Now that we have switched to the bdev based io accounting interface, we
can easily retrieve sectors/bio_op from the real original bio plus the
newly added .sector_offset & .sectors fields, so the cloned bio is no
longer necessary.

This not only fixes dm-zone's kernel panic, but also cleans up dm io
accounting & splitting a bit.

Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
 drivers/md/dm-core.h |  8 ++++++-
 drivers/md/dm.c      | 51 ++++++++++++++++++++++----------------------
 2 files changed, 32 insertions(+), 27 deletions(-)

Comments

Mike Snitzer April 12, 2022, 8:52 p.m. UTC | #1
On Tue, Apr 12 2022 at  4:56P -0400,
Ming Lei <ming.lei@redhat.com> wrote:

> The current DM code sets up ->orig_bio after __map_bio() returns. This
> not only causes a kernel panic for dm-zone, but is also a bit ugly and
> tricky, especially the waiting until ->orig_bio is set in
> dm_submit_bio_remap().
> 
> The reason is that a new bio is cloned from the original FS bio to
> represent the mapped part, and it serves io accounting only.
> 
> Now that we have switched to the bdev based io accounting interface, we
> can easily retrieve sectors/bio_op from the real original bio plus the
> newly added .sector_offset & .sectors fields, so the cloned bio is no
> longer necessary.
> 
> This not only fixes dm-zone's kernel panic, but also cleans up dm io
> accounting & splitting a bit.

You're conflating quite a few things here.  DM zone really has no
business accessing io->orig_bio (dm-zone.c can just as easily inspect
the tio->clone: because it hasn't been remapped yet, it still reflects
the original bio, so there is no need to look at io->orig_bio) -- but
yes, I clearly broke things during the 5.18 merge and it needs fixing
ASAP.

But I'm (ab)using io->orig_bio assignment to indicate to completion
that it may proceed.  See these dm-5.19 commits to see it imposed even
further:
https://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm.git/commit/?h=dm-5.19&id=311a8e6650601a79079000466db77386c5ec2abb
https://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm.git/commit/?h=dm-5.19&id=56219ebb5f5c84785aa821f755d545eae41bdb1a

And then leveraged here:
https://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm.git/commit/?h=dm-5.19&id=4aa7a368370c2a172d5a0b8927c6332c4b6a3514

It could be that all these dm-5.19 changes suck... but I do know
dm-zone.c is too tightly coupled to DM core.  So I'll focus on that
first, fix 5.18, and then circle back to "what's next?".

Mike

Damien Le Moal April 12, 2022, 10:38 p.m. UTC | #2
On 4/13/22 05:52, Mike Snitzer wrote:
> On Tue, Apr 12 2022 at  4:56P -0400,
> Ming Lei <ming.lei@redhat.com> wrote:
> 
>> [...]
> 
> You're conflating quite a few things here.  DM zone really has no
> business accessing io->orig_bio (dm-zone.c can just as easily inspect
> the tio->clone: because it hasn't been remapped yet, it still reflects
> the original bio, so there is no need to look at io->orig_bio) -- but
> yes, I clearly broke things during the 5.18 merge and it needs fixing
> ASAP.

The problem is that we need to look at the BIO op in both the submission
AND completion paths to handle zone append requests. Looking at the
clone on submission is OK since its op is still the original one. But on
the completion path, the clone may by then be a regular write emulating
a zone append op, and looking at the clone alone does not allow
detecting that zone append.

We could save the orig_bio op in dm_io to avoid looking at the orig_bio
at all when detecting zone append.
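
Something like this, as a rough sketch (the two field names are invented
here just to illustrate the idea; they are not merged code):

	/* stash the FS bio's op/size in struct dm_io at submission time,
	 * before any zone append -> write remapping, so the completion
	 * path never needs io->orig_bio */
	struct dm_io {
		/* ... existing fields ... */
		unsigned int orig_bio_op;	/* bio_op() of the FS bio */
		unsigned int orig_bio_sectors;	/* bio_sectors() of the FS bio */
	};

	/* on submission, while the clone still matches the original: */
	io->orig_bio_op = bio_op(clone);
	io->orig_bio_sectors = bio_sectors(clone);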

> 
> But I'm (ab)using io->orig_bio assignment to indicate to completion
> that it may proceed.  See these dm-5.19 commits to see it imposed even
> further:
> https://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm.git/commit/?h=dm-5.19&id=311a8e6650601a79079000466db77386c5ec2abb
> https://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm.git/commit/?h=dm-5.19&id=56219ebb5f5c84785aa821f755d545eae41bdb1a
> 
> And then leveraged here:
> https://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm.git/commit/?h=dm-5.19&id=4aa7a368370c2a172d5a0b8927c6332c4b6a3514
> 
> Could be all these dm-5.19 changes suck.. but I do know dm-zone.c is
> too tightly coupled to DM core.  So I'll focus on that first, fix
> 5.18, and then circle back to "what's next?".
> 
> Mike
>
Mike Snitzer April 12, 2022, 11 p.m. UTC | #3
On Tue, Apr 12 2022 at  6:38P -0400,
Damien Le Moal <damien.lemoal@opensource.wdc.com> wrote:

> On 4/13/22 05:52, Mike Snitzer wrote:
> > On Tue, Apr 12 2022 at  4:56P -0400,
> > Ming Lei <ming.lei@redhat.com> wrote:
> > 
> >> [...]
> > 
> > You're conflating quite a few things here.  DM zone really has no
> > business accessing io->orig_bio (dm-zone.c can just as easily inspect
> > the tio->clone: because it hasn't been remapped yet, it still reflects
> > the original bio, so there is no need to look at io->orig_bio) -- but
> > yes, I clearly broke things during the 5.18 merge and it needs fixing
> > ASAP.
> 
> The problem is that we need to look at the BIO op in both the submission
> AND completion paths to handle zone append requests. Looking at the
> clone on submission is OK since its op is still the original one. But on
> the completion path, the clone may by then be a regular write emulating
> a zone append op, and looking at the clone alone does not allow
> detecting that zone append.
> 
> We could save the orig_bio op in dm_io to avoid looking at the orig_bio
> at all when detecting zone append.

Can you please try the following patch?

Really sorry for breaking dm-zone.c; please teach this man how to test
the basics of all things dm-zoned (is there a testsuite in the tools
or something?).

Thanks,
Mike

diff --git a/drivers/md/dm-zone.c b/drivers/md/dm-zone.c
index c1ca9be4b79e..896127366000 100644
--- a/drivers/md/dm-zone.c
+++ b/drivers/md/dm-zone.c
@@ -360,16 +360,20 @@ static int dm_update_zone_wp_offset(struct mapped_device *md, unsigned int zno,
 	return 0;
 }
 
+struct orig_bio_details {
+	unsigned int op;
+	unsigned int nr_sectors;
+};
+
 /*
  * First phase of BIO mapping for targets with zone append emulation:
  * check all BIO that change a zone writer pointer and change zone
  * append operations into regular write operations.
  */
-static bool dm_zone_map_bio_begin(struct mapped_device *md,
-				  struct bio *orig_bio, struct bio *clone)
+static bool dm_zone_map_bio_begin(struct mapped_device *md, unsigned int zno,
+				  struct bio *clone)
 {
 	sector_t zsectors = blk_queue_zone_sectors(md->queue);
-	unsigned int zno = bio_zone_no(orig_bio);
 	unsigned int zwp_offset = READ_ONCE(md->zwp_offset[zno]);
 
 	/*
@@ -384,7 +388,7 @@ static bool dm_zone_map_bio_begin(struct mapped_device *md,
 		WRITE_ONCE(md->zwp_offset[zno], zwp_offset);
 	}
 
-	switch (bio_op(orig_bio)) {
+	switch (bio_op(clone)) {
 	case REQ_OP_ZONE_RESET:
 	case REQ_OP_ZONE_FINISH:
 		return true;
@@ -401,9 +405,8 @@ static bool dm_zone_map_bio_begin(struct mapped_device *md,
 		 * target zone.
 		 */
 		clone->bi_opf = REQ_OP_WRITE | REQ_NOMERGE |
-			(orig_bio->bi_opf & (~REQ_OP_MASK));
-		clone->bi_iter.bi_sector =
-			orig_bio->bi_iter.bi_sector + zwp_offset;
+			(clone->bi_opf & (~REQ_OP_MASK));
+		clone->bi_iter.bi_sector += zwp_offset;
 		break;
 	default:
 		DMWARN_LIMIT("Invalid BIO operation");
@@ -423,11 +426,10 @@ static bool dm_zone_map_bio_begin(struct mapped_device *md,
  * data written to a zone. Note that at this point, the remapped clone BIO
  * may already have completed, so we do not touch it.
  */
-static blk_status_t dm_zone_map_bio_end(struct mapped_device *md,
-					struct bio *orig_bio,
+static blk_status_t dm_zone_map_bio_end(struct mapped_device *md, unsigned int zno,
+					struct orig_bio_details *orig_bio_details,
 					unsigned int nr_sectors)
 {
-	unsigned int zno = bio_zone_no(orig_bio);
 	unsigned int zwp_offset = READ_ONCE(md->zwp_offset[zno]);
 
 	/* The clone BIO may already have been completed and failed */
@@ -435,7 +437,7 @@ static blk_status_t dm_zone_map_bio_end(struct mapped_device *md,
 		return BLK_STS_IOERR;
 
 	/* Update the zone wp offset */
-	switch (bio_op(orig_bio)) {
+	switch (orig_bio_details->op) {
 	case REQ_OP_ZONE_RESET:
 		WRITE_ONCE(md->zwp_offset[zno], 0);
 		return BLK_STS_OK;
@@ -452,7 +454,7 @@ static blk_status_t dm_zone_map_bio_end(struct mapped_device *md,
 		 * Check that the target did not truncate the write operation
 		 * emulating a zone append.
 		 */
-		if (nr_sectors != bio_sectors(orig_bio)) {
+		if (nr_sectors != orig_bio_details->nr_sectors) {
 			DMWARN_LIMIT("Truncated write for zone append");
 			return BLK_STS_IOERR;
 		}
@@ -488,7 +490,7 @@ static inline void dm_zone_unlock(struct request_queue *q,
 	bio_clear_flag(clone, BIO_ZONE_WRITE_LOCKED);
 }
 
-static bool dm_need_zone_wp_tracking(struct bio *orig_bio)
+static bool dm_need_zone_wp_tracking(struct bio *clone)
 {
 	/*
 	 * Special processing is not needed for operations that do not need the
@@ -496,15 +498,15 @@ static bool dm_need_zone_wp_tracking(struct bio *orig_bio)
 	 * zones and all operations that do not modify directly a sequential
 	 * zone write pointer.
 	 */
-	if (op_is_flush(orig_bio->bi_opf) && !bio_sectors(orig_bio))
+	if (op_is_flush(clone->bi_opf) && !bio_sectors(clone))
 		return false;
-	switch (bio_op(orig_bio)) {
+	switch (bio_op(clone)) {
 	case REQ_OP_WRITE_ZEROES:
 	case REQ_OP_WRITE:
 	case REQ_OP_ZONE_RESET:
 	case REQ_OP_ZONE_FINISH:
 	case REQ_OP_ZONE_APPEND:
-		return bio_zone_is_seq(orig_bio);
+		return bio_zone_is_seq(clone);
 	default:
 		return false;
 	}
@@ -519,8 +521,8 @@ int dm_zone_map_bio(struct dm_target_io *tio)
 	struct dm_target *ti = tio->ti;
 	struct mapped_device *md = io->md;
 	struct request_queue *q = md->queue;
-	struct bio *orig_bio = io->orig_bio;
 	struct bio *clone = &tio->clone;
+	struct orig_bio_details orig_bio_details;
 	unsigned int zno;
 	blk_status_t sts;
 	int r;
@@ -529,18 +531,21 @@ int dm_zone_map_bio(struct dm_target_io *tio)
 	 * IOs that do not change a zone write pointer do not need
 	 * any additional special processing.
 	 */
-	if (!dm_need_zone_wp_tracking(orig_bio))
+	if (!dm_need_zone_wp_tracking(clone))
 		return ti->type->map(ti, clone);
 
 	/* Lock the target zone */
-	zno = bio_zone_no(orig_bio);
+	zno = bio_zone_no(clone);
 	dm_zone_lock(q, zno, clone);
 
+	orig_bio_details.nr_sectors = bio_sectors(clone);
+	orig_bio_details.op = bio_op(clone);
+
 	/*
 	 * Check that the bio and the target zone write pointer offset are
 	 * both valid, and if the bio is a zone append, remap it to a write.
 	 */
-	if (!dm_zone_map_bio_begin(md, orig_bio, clone)) {
+	if (!dm_zone_map_bio_begin(md, zno, clone)) {
 		dm_zone_unlock(q, zno, clone);
 		return DM_MAPIO_KILL;
 	}
@@ -560,7 +565,8 @@ int dm_zone_map_bio(struct dm_target_io *tio)
 		 * The target submitted the clone BIO. The target zone will
 		 * be unlocked on completion of the clone.
 		 */
-		sts = dm_zone_map_bio_end(md, orig_bio, *tio->len_ptr);
+		sts = dm_zone_map_bio_end(md, zno, &orig_bio_details,
+					  *tio->len_ptr);
 		break;
 	case DM_MAPIO_REMAPPED:
 		/*
@@ -568,7 +574,8 @@ int dm_zone_map_bio(struct dm_target_io *tio)
 		 * unlock the target zone here as the clone will not be
 		 * submitted.
 		 */
-		sts = dm_zone_map_bio_end(md, orig_bio, *tio->len_ptr);
+		sts = dm_zone_map_bio_end(md, zno, &orig_bio_details,
+					  *tio->len_ptr);
 		if (sts != BLK_STS_OK)
 			dm_zone_unlock(q, zno, clone);
 		break;

--
dm-devel mailing list
dm-devel@redhat.com
https://listman.redhat.com/mailman/listinfo/dm-devel
Damien Le Moal April 12, 2022, 11:31 p.m. UTC | #4
On 4/13/22 08:00, Mike Snitzer wrote:
> On Tue, Apr 12 2022 at  6:38P -0400,
> Damien Le Moal <damien.lemoal@opensource.wdc.com> wrote:
> 
>> On 4/13/22 05:52, Mike Snitzer wrote:
>>> [...]
>>
>> The problem is that we need to look at the BIO op in both the submission
>> AND completion paths to handle zone append requests. Looking at the
>> clone on submission is OK since its op is still the original one. But on
>> the completion path, the clone may by then be a regular write emulating
>> a zone append op, and looking at the clone alone does not allow
>> detecting that zone append.
>>
>> We could save the orig_bio op in dm_io to avoid looking at the orig_bio
>> at all when detecting zone append.
> 
> Can you please try the following patch?

OK. Will do right away.

> 
> Really sorry for breaking dm-zone.c; please teach this man how to test
> the basics of all things dm-zoned (is there a testsuite in the tools
> or something?).

We have an internal test suite to check all things related to zones. We
run it weekly on all RC releases. We did not catch the problem earlier
because we do not run against for-next trees in previous cycles. We
could add such runs :)

We would be happy to contribute stuff for testing. Ideally, integrating
that into blktests, with a new DM group, would be nice. That was
discussed at past LSFs. Maybe a topic again for this year? Besides zone
stuff, I am sure we can add more DM tests (I am sure you also have a
test suite?).

For quick tests, I generally use a zoned nullblk device. I am attaching
2 scripts which allow creating and deleting nullblk devices easily.

> 
> Thanks,
> Mike
> 
> [...]
>
Damien Le Moal April 13, 2022, midnight UTC | #5
On 4/13/22 08:00, Mike Snitzer wrote:
> On Tue, Apr 12 2022 at  6:38P -0400,
> Damien Le Moal <damien.lemoal@opensource.wdc.com> wrote:
> 
>> On 4/13/22 05:52, Mike Snitzer wrote:
>>> [...]
>>
>> The problem is that we need to look at the BIO op in both the submission
>> AND completion paths to handle zone append requests. Looking at the
>> clone on submission is OK since its op is still the original one. But on
>> the completion path, the clone may by then be a regular write emulating
>> a zone append op, and looking at the clone alone does not allow
>> detecting that zone append.
>>
>> We could save the orig_bio op in dm_io to avoid looking at the orig_bio
>> at all when detecting zone append.
> 
> Can you please try the following patch?

This works. I tested with a zoned nullblk + dm-crypt, forcing the zone
append emulation code to be used. I ran zonefs tests on top of that with
no issues. I will run btrfs tests too later today to exercise things a
little more.

> 
> Really sorry for breaking dm-zone.c; please teach this man how to test
> the basics of all things dm-zoned (is there a testsuite in the tools
> or something?).
> 
> Thanks,
> Mike
> 
> [...]
>
Ming Lei April 13, 2022, 1:56 a.m. UTC | #6
On Tue, Apr 12, 2022 at 04:52:40PM -0400, Mike Snitzer wrote:
> [...]
> 
> You're conflating quite a few things here.  DM zone really has no
> business accessing io->orig_bio (dm-zone.c can just as easily inspect
> the tio->clone: because it hasn't been remapped yet, it still reflects
> the original bio, so there is no need to look at io->orig_bio) -- but
> yes, I clearly broke things during the 5.18 merge and it needs fixing
> ASAP.

You can just consider the cleanup part of this patchset, :-)

1) no late assignment of ->orig_bio; always set it in alloc_io()
(sketched below)

2) no waiting on ->orig_bio, especially since the waiting is done in
the fast path of dm_submit_bio_remap()

3) no split for io accounting
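
A minimal sketch of (1), written against this series' alloc_io() (not
the literal patch hunk):

	static struct dm_io *alloc_io(struct mapped_device *md, struct bio *bio)
	{
		struct dm_io *io;

		/* ... clone/tio allocation elided ... */
		io->orig_bio = bio;	/* always the real FS bio, set up front */
		io->sectors = 0;	/* only meaningful once DM_IO_SPLITTED is set */
		io->sector_offset = 0;
		/* ... */
		return io;
	}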


Thanks,
Ming
Mike Snitzer April 13, 2022, 6:12 a.m. UTC | #7
On Tue, Apr 12 2022 at  9:56P -0400,
Ming Lei <ming.lei@redhat.com> wrote:

> On Tue, Apr 12, 2022 at 04:52:40PM -0400, Mike Snitzer wrote:
> > [...]
> > 
> > You're conflating quite a few things here.  DM zone really has no
> > business accessing io->orig_bio (dm-zone.c can just as easily inspect
> > the tio->clone: because it hasn't been remapped yet, it still reflects
> > the original bio, so there is no need to look at io->orig_bio) -- but
> > yes, I clearly broke things during the 5.18 merge and it needs fixing
> > ASAP.
> 
> You can just consider the cleanup part of this patchset, :-)

I will.  But your following list doesn't reflect any "cleanup" that I
saw in your patchset.  These are pretty fundamental changes, similar
to, but different from, the dm-5.19 changes I've staged.

> 1) no late assignment of ->orig_bio; always set it in alloc_io()
>
> 2) no waiting on ->orig_bio, especially since the waiting is done in
> the fast path of dm_submit_bio_remap()

For 5.18 waiting on io->orig_bio just enables a signal that the IO was
split and can be accounted.

For 5.19 I also plan on using late io->orig_bio assignment as an
alternative to the full-blown refcounting currently done with
io->io_count.  I've yet to quantify the gains with focused testing but
in theory this approach should scale better on large systems with many
concurrent IO threads to the same device (RCU is primary constraint
now).
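
(For context, the refcounting in question looks roughly like this,
simplified from dm.c's dm_io_dec_pending(); locking and push-back
details elided:)

	void dm_io_dec_pending(struct dm_io *io, blk_status_t error)
	{
		if (unlikely(error))
			io->status = error;	/* error recording simplified here */

		/* drop one reference; the last one completes the original bio */
		if (atomic_dec_and_test(&io->io_count))
			dm_io_complete(io);
	}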

I'll try to write a bpftrace script to measure how frequently "waiting on
io->orig_bio" occurs for dm_submit_bio_remap() heavy usage (like
dm-crypt). But I think we'll find it is very rarely, if ever, waited
on in the fast path.

> 3) no split for io accounting

DM's more recent approach to splitting has never been done for benefit
or use of IO accounting, see this commit for its origin:
18a25da84354c6b ("dm: ensure bio submission follows a depth-first tree walk")

Not sure why you keep poking fun at DM only doing a single split when
that is the actual design.  DM splits off orig_bio then recurses to
handle the remainder of the bio that wasn't issued.  Storing it in
io->orig_bio (previously io->bio) was always a means of reflecting
things properly. And yes IO accounting is one use, the other is IO
completion. But unfortunately DM's IO accounting has always been a
mess ever since the above commit. Changes in 5.18 fixed that.
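
For reference, the split-then-recurse idiom described above looks
roughly like this (the generic block layer pattern, not DM's exact
code):

	struct bio *split = bio_split(bio, len, GFP_NOIO, &fs_bio_set);

	bio_chain(split, bio);		/* parent completes only after split does */
	submit_bio_noacct(bio);		/* recurse on the remainder */
	/* 'split' is then mapped and issued as the current piece */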

But again, DM's splitting has _nothing_ to do with IO accounting.
Splitting only happens when needed for IO submission given constraints
of DM target(s) or underlying layers.

All said, I will look closer at your entire set and see if it's better
to go with your approach.  This patch in particular is interesting
(avoids cloning and other complexity of bio_split + bio_chain):
https://patchwork.kernel.org/project/dm-devel/patch/20220412085616.1409626-6-ming.lei@redhat.com/

Ming Lei April 13, 2022, 12:26 p.m. UTC | #8
On Wed, Apr 13, 2022 at 02:12:47AM -0400, Mike Snitzer wrote:
> On Tue, Apr 12 2022 at  9:56P -0400,
> Ming Lei <ming.lei@redhat.com> wrote:
> 
> > On Tue, Apr 12, 2022 at 04:52:40PM -0400, Mike Snitzer wrote:
> > > [...]
> > 
> > You can just consider the cleanup part of this patchset, :-)
> 
> I will.  But your following list doesn't reflect any "cleanup" that I
> saw in your patchset.  These are pretty fundamental changes, similar
> to, but different from, the dm-5.19 changes I've staged.
> 
> > 1) no late assignment of ->orig_bio; always set it in alloc_io()
> >
> > 2) no waiting on ->orig_bio, especially since the waiting is done in
> > the fast path of dm_submit_bio_remap()
> 
> For 5.18 waiting on io->orig_bio just enables a signal that the IO was
> split and can be accounted.
> 
> For 5.19 I also plan on using late io->orig_bio assignment as an
> alternative to the full-blown refcounting currently done with
> io->io_count.  I've yet to quantify the gains with focused testing but
> in theory this approach should scale better on large systems with many
> concurrent IO threads to the same device (RCU is primary constraint
> now).
> 
> I'll try to write a bpftrace script to measure how frequently "waiting on
> io->orig_bio" occurs for dm_submit_bio_remap() heavy usage (like
> dm-crypt). But I think we'll find it is very rarely, if ever, waited
> on in the fast path.

The waiting depends on the CPU's and the device's speed; if the device
is quicker than the CPU, the wait will be longer. Testing in one
environment is usually not enough.

> 
> > 3) no split for io accounting
> 
> DM's more recent approach to splitting has never been done for benefit
> or use of IO accounting, see this commit for its origin:
> 18a25da84354c6b ("dm: ensure bio submission follows a depth-first tree walk")
> 
> Not sure why you keep poking fun at DM only doing a single split when
> that is the actual design.  DM splits off orig_bio then recurses to
> handle the remainder of the bio that wasn't issued.  Storing it in
> io->orig_bio (previously io->bio) was always a means of reflecting
> things properly. And yes IO accounting is one use, the other is IO
> completion. But unfortunately DM's IO accounting has always been a
> mess ever since the above commit. Changes in 5.18 fixed that.
> 
> But again, DM's splitting has _nothing_ to do with IO accounting.
> Splitting only happens when needed for IO submission given constraints
> of DM target(s) or underlying layers.

What I meant is that the bio returned from bio_split() is only for
io accounting. Yeah, the comment says it can be for io completion too,
but that is easily done without the split bio.
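
As a sketch, completion/accounting can key off the saved fields instead
of a cloned bio (the helper name is made up here; dm_io_flagged() is
assumed from dm-core.h):

	static unsigned int dm_io_sectors_to_account(struct dm_io *io)
	{
		/* account only the mapped part of the original FS bio */
		if (dm_io_flagged(io, DM_IO_SPLITTED))
			return io->sectors;
		return bio_sectors(io->orig_bio);
	}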

> 
> All said, I will look closer at your entire set and see if it's better
> to go with your approach.  This patch in particular is interesting
> (avoids cloning and other complexity of bio_split + bio_chain):
> https://patchwork.kernel.org/project/dm-devel/patch/20220412085616.1409626-6-ming.lei@redhat.com/

That patch shows we can avoid the extra split, and also shows that the
split bio from bio_split() is for io accounting only.


thanks,
Ming
Mike Snitzer April 13, 2022, 5:58 p.m. UTC | #9
On Wed, Apr 13 2022 at  8:26P -0400,
Ming Lei <ming.lei@redhat.com> wrote:

> On Wed, Apr 13, 2022 at 02:12:47AM -0400, Mike Snitzer wrote:
> > On Tue, Apr 12 2022 at  9:56P -0400,
> > Ming Lei <ming.lei@redhat.com> wrote:
> > 
> > > [...]
> > > 
> > > You can just consider the cleanup part of this patchset, :-)
> > 
> > I will.  But your following list doesn't reflect any "cleanup" that I
> > saw in your patchset.  These are pretty fundamental changes, similar
> > to, but different from, the dm-5.19 changes I've staged.
> > 
> > > 1) no late assignment of ->orig_bio; always set it in alloc_io()
> > >
> > > 2) no waiting on ->orig_bio, especially since the waiting is done in
> > > the fast path of dm_submit_bio_remap()
> > 
> > For 5.18 waiting on io->orig_bio just enables a signal that the IO was
> > split and can be accounted.
> > 
> > For 5.19 I also plan on using late io->orig_bio assignment as an
> > alternative to the full-blown refcounting currently done with
> > io->io_count.  I've yet to quantify the gains with focused testing but
> > in theory this approach should scale better on large systems with many
> > concurrent IO threads to the same device (RCU is primary constraint
> > now).
> > 
> > I'll try to write a bpftrace script to measure how frequently "waiting on
> > io->orig_bio" occurs for dm_submit_bio_remap() heavy usage (like
> > dm-crypt). But I think we'll find it is very rarely, if ever, waited
> > on in the fast path.
> 
> The waiting depends on the CPU's and the device's speed; if the device
> is quicker than the CPU, the wait will be longer. Testing in one
> environment is usually not enough.
> 
> > 
> > > 3) no split for io accounting
> > 
> > DM's more recent approach to splitting has never been done for benefit
> > or use of IO accounting, see this commit for its origin:
> > 18a25da84354c6b ("dm: ensure bio submission follows a depth-first tree walk")
> > 
> > Not sure why you keep poking fun at DM only doing a single split when
> > that is the actual design.  DM splits off orig_bio then recurses to
> > handle the remainder of the bio that wasn't issued.  Storing it in
> > io->orig_bio (previously io->bio) was always a means of reflecting
> > things properly. And yes IO accounting is one use, the other is IO
> > completion. But unfortunately DM's IO accounting has always been a
> > mess ever since the above commit. Changes in 5.18 fixed that.
> > 
> > But again, DM's splitting has _nothing_ to do with IO accounting.
> > Splitting only happens when needed for IO submission given constraints
> > of DM target(s) or underlying layers.
> 
> What I meant is that the bio returned from bio_split() is only for
> io accounting. Yeah, the comment says it can be for io completion too,
> but that is easily done without the split bio.
>
> > All said, I will look closer at your entire set and see if it's better
> > to go with your approach.  This patch in particular is interesting
> > (avoids cloning and other complexity of bio_split + bio_chain):
> > https://patchwork.kernel.org/project/dm-devel/patch/20220412085616.1409626-6-ming.lei@redhat.com/
> 
> That patch shows we can avoid the extra split, and also shows that the
> split bio from bio_split() is for io accounting only.

Yes I see that now.  But it also served to preserve the original bio
for use in completion.  Not a big deal, but it did track the head of
the bio_chain.

The bigger issue with this patch is that you've caused
dm_submit_bio_remap() to go back to accounting the entire original bio
before any split occurs.  That is a problem because you'll end up
accounting that bio for every split, so in split heavy workloads the
IO accounting won't reflect when the IO is actually issued and we'll
regress back to having very inaccurate and incorrect IO accounting for
dm_submit_bio_remap() heavy targets (e.g. dm-crypt).

Mike

Ming Lei April 14, 2022, 12:36 a.m. UTC | #10
On Wed, Apr 13, 2022 at 01:58:54PM -0400, Mike Snitzer wrote:
> On Wed, Apr 13 2022 at  8:26P -0400,
> Ming Lei <ming.lei@redhat.com> wrote:
> 
> > [...]
> 
> Yes I see that now.  But it also served to preserve the original bio
> for use in completion.  Not a big deal, but it did track the head of
> the bio_chain.
> 
> The bigger issue with this patch is that you've caused
> dm_submit_bio_remap() to go back to accounting the entire original bio
> before any split occurs.  That is a problem because you'll end up
> accounting that bio for every split, so in split heavy workloads the
> IO accounting won't reflect when the IO is actually issued and we'll
> regress back to having very inaccurate and incorrect IO accounting for
> dm_submit_bio_remap() heavy targets (e.g. dm-crypt).

Good catch, but we know the length of the mapped part of the original
bio before calling __map_bio(), so io->sectors/io->sector_offset can be
set up there; something like the following delta change should address it:

diff --git a/drivers/md/dm.c b/drivers/md/dm.c
index db23efd6bbf6..06b554f3104b 100644
--- a/drivers/md/dm.c
+++ b/drivers/md/dm.c
@@ -1558,6 +1558,13 @@ static int __split_and_process_bio(struct clone_info *ci)
 
 	len = min_t(sector_t, max_io_len(ti, ci->sector), ci->sector_count);
 	clone = alloc_tio(ci, ti, 0, &len, GFP_NOIO);
+
+	if (ci->sector_count > len) {
+		/* setup the mapped part for accounting */
+		dm_io_set_flag(ci->io, DM_IO_SPLITTED);
+		ci->io->sectors = len;
+		ci->io->sector_offset = bio_end_sector(ci->bio) - ci->sector;
+	}
 	__map_bio(clone);
 
 	ci->sector += len;
@@ -1603,11 +1610,6 @@ static void dm_split_and_process_bio(struct mapped_device *md,
 	if (error || !ci.sector_count)
 		goto out;
 
-	/* setup the mapped part for accounting */
-	dm_io_set_flag(ci.io, DM_IO_SPLITTED);
-	ci.io->sectors = bio_sectors(bio) - ci.sector_count;
-	ci.io->sector_offset = bio_end_sector(bio) - bio->bi_iter.bi_sector;
-
 	bio_trim(bio, ci.io->sectors, ci.sector_count);
 	trace_block_split(bio, bio->bi_iter.bi_sector);
 	bio_inc_remaining(bio);
Mike Snitzer April 14, 2022, 2:25 a.m. UTC | #11
On Wed, Apr 13 2022 at  8:36P -0400,
Ming Lei <ming.lei@redhat.com> wrote:

> On Wed, Apr 13, 2022 at 01:58:54PM -0400, Mike Snitzer wrote:
> > 
> > The bigger issue with this patch is that you've caused
> > dm_submit_bio_remap() to go back to accounting the entire original bio
> > before any split occurs.  That is a problem because you'll end up
> > accounting that bio for every split, so in split heavy workloads the
> > IO accounting won't reflect when the IO is actually issued and we'll
> > regress back to having very inaccurate and incorrect IO accounting for
> > dm_submit_bio_remap() heavy targets (e.g. dm-crypt).
> 
> Good catch, but we know the length of the mapped part of the original
> bio before calling __map_bio(), so io->sectors/io->sector_offset can be
> set up there; something like the following delta change should address it:
> 
> diff --git a/drivers/md/dm.c b/drivers/md/dm.c
> index db23efd6bbf6..06b554f3104b 100644
> --- a/drivers/md/dm.c
> +++ b/drivers/md/dm.c
> @@ -1558,6 +1558,13 @@ static int __split_and_process_bio(struct clone_info *ci)
>  
>  	len = min_t(sector_t, max_io_len(ti, ci->sector), ci->sector_count);
>  	clone = alloc_tio(ci, ti, 0, &len, GFP_NOIO);
> +
> +	if (ci->sector_count > len) {
> +		/* setup the mapped part for accounting */
> +		dm_io_set_flag(ci->io, DM_IO_SPLITTED);
> +		ci->io->sectors = len;
> +		ci->io->sector_offset = bio_end_sector(ci->bio) - ci->sector;
> +	}
>  	__map_bio(clone);
>  
>  	ci->sector += len;
> @@ -1603,11 +1610,6 @@ static void dm_split_and_process_bio(struct mapped_device *md,
>  	if (error || !ci.sector_count)
>  		goto out;
>  
> -	/* setup the mapped part for accounting */
> -	dm_io_set_flag(ci.io, DM_IO_SPLITTED);
> -	ci.io->sectors = bio_sectors(bio) - ci.sector_count;
> -	ci.io->sector_offset = bio_end_sector(bio) - bio->bi_iter.bi_sector;
> -
>  	bio_trim(bio, ci.io->sectors, ci.sector_count);
>  	trace_block_split(bio, bio->bi_iter.bi_sector);
>  	bio_inc_remaining(bio);
> 
> -- 
> Ming
> 

Unfortunately we do need splitting after __map_bio() because a dm
target's ->map can use dm_accept_partial_bio() to further reduce a
bio's mapped part.

But I think dm_accept_partial_bio() could be trained to update
tio->io->sectors?

dm_accept_partial_bio() has been around for a long time; it keeps
growing BUG_ONs that are actually helpful to narrow its use to "normal
IO", so it should be OK.

Running 'make check' in a built cryptsetup source tree should be a
good test for DM target interface functionality.

But there aren't automated tests for IO accounting correctness yet.

Mike

Ming Lei April 14, 2022, 3:57 a.m. UTC | #12
On Wed, Apr 13, 2022 at 10:25:45PM -0400, Mike Snitzer wrote:
> On Wed, Apr 13 2022 at  8:36P -0400,
> Ming Lei <ming.lei@redhat.com> wrote:
> 
> > On Wed, Apr 13, 2022 at 01:58:54PM -0400, Mike Snitzer wrote:
> > > 
> > > [...]
> > 
> > Good catch, but we know the length of the mapped part of the original
> > bio before calling __map_bio(), so io->sectors/io->sector_offset can be
> > set up there; something like the following delta change should address it:
> > 
> > diff --git a/drivers/md/dm.c b/drivers/md/dm.c
> > index db23efd6bbf6..06b554f3104b 100644
> > --- a/drivers/md/dm.c
> > +++ b/drivers/md/dm.c
> > @@ -1558,6 +1558,13 @@ static int __split_and_process_bio(struct clone_info *ci)
> >  
> >  	len = min_t(sector_t, max_io_len(ti, ci->sector), ci->sector_count);
> >  	clone = alloc_tio(ci, ti, 0, &len, GFP_NOIO);
> > +
> > +	if (ci->sector_count > len) {
> > +		/* setup the mapped part for accounting */
> > +		dm_io_set_flag(ci->io, DM_IO_SPLITTED);
> > +		ci->io->sectors = len;
> > +		ci->io->sector_offset = bio_end_sector(ci->bio) - ci->sector;
> > +	}
> >  	__map_bio(clone);
> >  
> >  	ci->sector += len;
> > @@ -1603,11 +1610,6 @@ static void dm_split_and_process_bio(struct mapped_device *md,
> >  	if (error || !ci.sector_count)
> >  		goto out;
> >  
> > -	/* setup the mapped part for accounting */
> > -	dm_io_set_flag(ci.io, DM_IO_SPLITTED);
> > -	ci.io->sectors = bio_sectors(bio) - ci.sector_count;
> > -	ci.io->sector_offset = bio_end_sector(bio) - bio->bi_iter.bi_sector;
> > -
> >  	bio_trim(bio, ci.io->sectors, ci.sector_count);
> >  	trace_block_split(bio, bio->bi_iter.bi_sector);
> >  	bio_inc_remaining(bio);
> > 
> > -- 
> > Ming
> > 
> 
> Unfortunately we do need splitting after __map_bio() because a dm
> target's ->map can use dm_accept_partial_bio() to further reduce a
> bio's mapped part.
> 
> But I think dm_accept_partial_bio() could be trained to update
> tio->io->sectors?

->orig_bio just serves io accounting, but it isn't passed to
dm_accept_partial_bio(), and it doesn't get updated after
dm_accept_partial_bio() is called.

If that is an issue, it must be a pre-existing issue in dm io accounting,
since ->orig_bio isn't updated when dm_accept_partial_bio() is called.

So do we have to update it?

> 
> dm_accept_partial_bio() has been around for a long time; it keeps
> growing BUG_ONs that are actually helpful in narrowing its use to "normal
> IO", so it should be OK.
> 
> Running 'make check' in a built cryptsetup source tree should be a
> good test for DM target interface functionality.

Care to share the test tree?

> 
> But there aren't automated tests for IO accounting correctness yet.

I did verify io accounting by running dm-thin with blk-throttle, and the
observed throughput matched the configured throttle settings. I ran both
small bs and large bs, so both the non-split and split code paths are
covered.

Maybe you can add this kind of test to the automated dm io accounting
tests.


Thanks,
Ming
Mike Snitzer April 14, 2022, 5:45 p.m. UTC | #13
On Wed, Apr 13 2022 at 11:57P -0400,
Ming Lei <ming.lei@redhat.com> wrote:

> On Wed, Apr 13, 2022 at 10:25:45PM -0400, Mike Snitzer wrote:
> > On Wed, Apr 13 2022 at  8:36P -0400,
> > Ming Lei <ming.lei@redhat.com> wrote:
> > 
> > > On Wed, Apr 13, 2022 at 01:58:54PM -0400, Mike Snitzer wrote:
> > > > 
> > > > The bigger issue with this patch is that you've caused
> > > > dm_submit_bio_remap() to go back to accounting the entire original bio
> > > > before any split occurs.  That is a problem because you'll end up
> > > > accounting that bio for every split, so in split heavy workloads the
> > > > IO accounting won't reflect when the IO is actually issued and we'll
> > > > regress back to having very inaccurate and incorrect IO accounting for
> > > > dm_submit_bio_remap() heavy targets (e.g. dm-crypt).
> > > 
> > > Good catch, but we know the length of the mapped part in the original bio
> > > before calling __map_bio(), so io->sectors/io->sector_offset can be set up
> > > here; something like the following delta change should address it:
> > > 
> > > diff --git a/drivers/md/dm.c b/drivers/md/dm.c
> > > index db23efd6bbf6..06b554f3104b 100644
> > > --- a/drivers/md/dm.c
> > > +++ b/drivers/md/dm.c
> > > @@ -1558,6 +1558,13 @@ static int __split_and_process_bio(struct clone_info *ci)
> > >  
> > >  	len = min_t(sector_t, max_io_len(ti, ci->sector), ci->sector_count);
> > >  	clone = alloc_tio(ci, ti, 0, &len, GFP_NOIO);
> > > +
> > > +	if (ci->sector_count > len) {
> > > +		/* setup the mapped part for accounting */
> > > +		dm_io_set_flag(ci->io, DM_IO_SPLITTED);
> > > +		ci->io->sectors = len;
> > > +		ci->io->sector_offset = bio_end_sector(ci->bio) - ci->sector;
> > > +	}
> > >  	__map_bio(clone);
> > >  
> > >  	ci->sector += len;
> > > @@ -1603,11 +1610,6 @@ static void dm_split_and_process_bio(struct mapped_device *md,
> > >  	if (error || !ci.sector_count)
> > >  		goto out;
> > >  
> > > -	/* setup the mapped part for accounting */
> > > -	dm_io_set_flag(ci.io, DM_IO_SPLITTED);
> > > -	ci.io->sectors = bio_sectors(bio) - ci.sector_count;
> > > -	ci.io->sector_offset = bio_end_sector(bio) - bio->bi_iter.bi_sector;
> > > -
> > >  	bio_trim(bio, ci.io->sectors, ci.sector_count);
> > >  	trace_block_split(bio, bio->bi_iter.bi_sector);
> > >  	bio_inc_remaining(bio);
> > > 
> > > -- 
> > > Ming
> > > 
> > 
> > Unfortunately we do need splitting after __map_bio() because a dm
> > target's ->map can use dm_accept_partial_bio() to further reduce a
> > bio's mapped part.
> > 
> > But I think dm_accept_partial_bio() could be trained to update
> > tio->io->sectors?
> 
> ->orig_bio just serves io accounting, but it isn't passed to
> dm_accept_partial_bio(), and it doesn't get updated after
> dm_accept_partial_bio() is called.
> 
> If that is an issue, it must be a pre-existing issue in dm io accounting,
> since ->orig_bio isn't updated when dm_accept_partial_bio() is called.

Recall that ->orig_bio is updated after the bio_split() at the bottom of
dm_split_and_process_bio().

That bio_split() is based on ci->sector_count, which is reduced as a
side-effect of dm_accept_partial_bio() reducing tio->len_ptr.  It is
pretty circuitous so I can absolutely understand why you didn't
immediately appreciate the interface.  The block comment above
dm_accept_partial_bio() does a pretty comprehensive job of explaining.
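
To make that flow concrete, here is a condensed, non-authoritative view of
__split_and_process_bio() (error handling and the full-bio fast path
elided): the key point is that alloc_tio() stores &len in tio->len_ptr, so
a partial accept in ->map shrinks the same 'len' that is then subtracted
from ci->sector_count:

static int __split_and_process_bio(struct clone_info *ci)
{
	struct bio *clone;
	struct dm_target *ti;
	unsigned int len;

	ti = dm_table_find_target(ci->map, ci->sector);
	if (unlikely(!ti))
		return -EIO;

	len = min_t(sector_t, max_io_len(ti, ci->sector), ci->sector_count);
	/* alloc_tio() makes tio->len_ptr point at this local 'len' */
	clone = alloc_tio(ci, ti, 0, &len, GFP_NOIO);
	/* the target's ->map may call dm_accept_partial_bio() here */
	__map_bio(clone);

	/* a reduced 'len' leaves a larger remainder for the later split */
	ci->sector += len;
	ci->sector_count -= len;

	return 0;
}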

But basically dm_accept_partial_bio() gives DM targets a way to control
DM core's splitting when they find that they cannot accommodate the
entirety of the clone bio that is sent to their ->map.
dm_accept_partial_bio() may only ever be called from a target's ->map
function.
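
For illustration only, a made-up target ->map showing the call pattern
(every example_* name here is hypothetical):

static int example_map(struct dm_target *ti, struct bio *bio)
{
	struct example_ctx *ec = ti->private;
	/* hypothetical boundary the clone must not cross */
	sector_t boundary = example_next_boundary(ec, bio->bi_iter.bi_sector);
	unsigned max_len = boundary - bio->bi_iter.bi_sector;

	/*
	 * Accept only the part of the clone that fits; DM core splits
	 * off the remainder and resubmits it to ->map later.
	 */
	if (bio_sectors(bio) > max_len)
		dm_accept_partial_bio(bio, max_len);

	bio_set_dev(bio, ec->dev->bdev);
	return DM_MAPIO_REMAPPED;
}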

> So do we have to update it?
> 
> > 
> > dm_accept_partial_bio() has been around for a long time; it keeps
> > growing BUG_ONs that are actually helpful in narrowing its use to "normal
> > IO", so it should be OK.
> > 
> > Running 'make check' in a built cryptsetup source tree should be a
> > good test for DM target interface functionality.
> 
> Care to share the test tree?

https://gitlab.com/cryptsetup/cryptsetup.git

> 
> > 
> > But there aren't automated tests for IO accounting correctness yet.
> 
> I did verify io accounting by running dm-thin with blk-throttle, and the
> observed throughput matched the configured throttle settings. I ran both
> small bs and large bs, so both the non-split and split code paths are
> covered.
> 
> Maybe you can add this kind of test to the automated dm io accounting
> tests.

Yeah, something like that would be good.

Mike

Ming Lei April 15, 2022, 12:14 a.m. UTC | #14
On Thu, Apr 14, 2022 at 01:45:33PM -0400, Mike Snitzer wrote:
> On Wed, Apr 13 2022 at 11:57P -0400,
> Ming Lei <ming.lei@redhat.com> wrote:
> 
> > On Wed, Apr 13, 2022 at 10:25:45PM -0400, Mike Snitzer wrote:
> > > On Wed, Apr 13 2022 at  8:36P -0400,
> > > Ming Lei <ming.lei@redhat.com> wrote:
> > > 
> > > > On Wed, Apr 13, 2022 at 01:58:54PM -0400, Mike Snitzer wrote:
> > > > > 
> > > > > The bigger issue with this patch is that you've caused
> > > > > dm_submit_bio_remap() to go back to accounting the entire original bio
> > > > > before any split occurs.  That is a problem because you'll end up
> > > > > accounting that bio for every split, so in split heavy workloads the
> > > > > IO accounting won't reflect when the IO is actually issued and we'll
> > > > > regress back to having very inaccurate and incorrect IO accounting for
> > > > > dm_submit_bio_remap() heavy targets (e.g. dm-crypt).
> > > > 
> > > > Good catch, but we know the length of the mapped part in the original bio
> > > > before calling __map_bio(), so io->sectors/io->sector_offset can be set up
> > > > here; something like the following delta change should address it:
> > > > 
> > > > diff --git a/drivers/md/dm.c b/drivers/md/dm.c
> > > > index db23efd6bbf6..06b554f3104b 100644
> > > > --- a/drivers/md/dm.c
> > > > +++ b/drivers/md/dm.c
> > > > @@ -1558,6 +1558,13 @@ static int __split_and_process_bio(struct clone_info *ci)
> > > >  
> > > >  	len = min_t(sector_t, max_io_len(ti, ci->sector), ci->sector_count);
> > > >  	clone = alloc_tio(ci, ti, 0, &len, GFP_NOIO);
> > > > +
> > > > +	if (ci->sector_count > len) {
> > > > +		/* setup the mapped part for accounting */
> > > > +		dm_io_set_flag(ci->io, DM_IO_SPLITTED);
> > > > +		ci->io->sectors = len;
> > > > +		ci->io->sector_offset = bio_end_sector(ci->bio) - ci->sector;
> > > > +	}
> > > >  	__map_bio(clone);
> > > >  
> > > >  	ci->sector += len;
> > > > @@ -1603,11 +1610,6 @@ static void dm_split_and_process_bio(struct mapped_device *md,
> > > >  	if (error || !ci.sector_count)
> > > >  		goto out;
> > > >  
> > > > -	/* setup the mapped part for accounting */
> > > > -	dm_io_set_flag(ci.io, DM_IO_SPLITTED);
> > > > -	ci.io->sectors = bio_sectors(bio) - ci.sector_count;
> > > > -	ci.io->sector_offset = bio_end_sector(bio) - bio->bi_iter.bi_sector;
> > > > -
> > > >  	bio_trim(bio, ci.io->sectors, ci.sector_count);
> > > >  	trace_block_split(bio, bio->bi_iter.bi_sector);
> > > >  	bio_inc_remaining(bio);
> > > > 
> > > > -- 
> > > > Ming
> > > > 
> > > 
> > > Unfortunately we do need splitting after __map_bio() because a dm
> > > target's ->map can use dm_accept_partial_bio() to further reduce a
> > > bio's mapped part.
> > > 
> > > But I think dm_accept_partial_bio() could be trained to update
> > > tio->io->sectors?
> > 
> > ->orig_bio just serves io accounting, but it isn't passed to
> > dm_accept_partial_bio(), and it doesn't get updated after
> > dm_accept_partial_bio() is called.
> > 
> > If that is an issue, it must be a pre-existing issue in dm io accounting,
> > since ->orig_bio isn't updated when dm_accept_partial_bio() is called.
> 
> Recall that ->orig_bio is updated after the bio_split() at the bottom of
> dm_split_and_process_bio().
> 
> That bio_split() is based on ci->sector_count, which is reduced as a
> side-effect of dm_accept_partial_bio() reducing tio->len_ptr.  It is
> pretty circuitous so I can absolutely understand why you didn't
> immediately appreciate the interface.  The block comment above
> dm_accept_partial_bio() does a pretty comprehensive job of explaining.

Got it now, thanks for the explanation.

As you mentioned, it can be addressed in dm_accept_partial_bio()
by updating tio->io->sectors.


Thanks,
Ming
Mike Snitzer April 15, 2022, 9:06 p.m. UTC | #15
On Thu, Apr 14 2022 at  8:14P -0400,
Ming Lei <ming.lei@redhat.com> wrote:

> On Thu, Apr 14, 2022 at 01:45:33PM -0400, Mike Snitzer wrote:
> > On Wed, Apr 13 2022 at 11:57P -0400,
> > Ming Lei <ming.lei@redhat.com> wrote:
> > 
> > > On Wed, Apr 13, 2022 at 10:25:45PM -0400, Mike Snitzer wrote:
> > > > On Wed, Apr 13 2022 at  8:36P -0400,
> > > > Ming Lei <ming.lei@redhat.com> wrote:
> > > > 
> > > > > On Wed, Apr 13, 2022 at 01:58:54PM -0400, Mike Snitzer wrote:
> > > > > > 
> > > > > > The bigger issue with this patch is that you've caused
> > > > > > dm_submit_bio_remap() to go back to accounting the entire original bio
> > > > > > before any split occurs.  That is a problem because you'll end up
> > > > > > accounting that bio for every split, so in split heavy workloads the
> > > > > > IO accounting won't reflect when the IO is actually issued and we'll
> > > > > > regress back to having very inaccurate and incorrect IO accounting for
> > > > > > dm_submit_bio_remap() heavy targets (e.g. dm-crypt).
> > > > > 
> > > > > Good catch, but we know the length of the mapped part in the original bio
> > > > > before calling __map_bio(), so io->sectors/io->sector_offset can be set up
> > > > > here; something like the following delta change should address it:
> > > > > 
> > > > > diff --git a/drivers/md/dm.c b/drivers/md/dm.c
> > > > > index db23efd6bbf6..06b554f3104b 100644
> > > > > --- a/drivers/md/dm.c
> > > > > +++ b/drivers/md/dm.c
> > > > > @@ -1558,6 +1558,13 @@ static int __split_and_process_bio(struct clone_info *ci)
> > > > >  
> > > > >  	len = min_t(sector_t, max_io_len(ti, ci->sector), ci->sector_count);
> > > > >  	clone = alloc_tio(ci, ti, 0, &len, GFP_NOIO);
> > > > > +
> > > > > +	if (ci->sector_count > len) {
> > > > > +		/* setup the mapped part for accounting */
> > > > > +		dm_io_set_flag(ci->io, DM_IO_SPLITTED);
> > > > > +		ci->io->sectors = len;
> > > > > +		ci->io->sector_offset = bio_end_sector(ci->bio) - ci->sector;
> > > > > +	}
> > > > >  	__map_bio(clone);
> > > > >  
> > > > >  	ci->sector += len;
> > > > > @@ -1603,11 +1610,6 @@ static void dm_split_and_process_bio(struct mapped_device *md,
> > > > >  	if (error || !ci.sector_count)
> > > > >  		goto out;
> > > > >  
> > > > > -	/* setup the mapped part for accounting */
> > > > > -	dm_io_set_flag(ci.io, DM_IO_SPLITTED);
> > > > > -	ci.io->sectors = bio_sectors(bio) - ci.sector_count;
> > > > > -	ci.io->sector_offset = bio_end_sector(bio) - bio->bi_iter.bi_sector;
> > > > > -
> > > > >  	bio_trim(bio, ci.io->sectors, ci.sector_count);
> > > > >  	trace_block_split(bio, bio->bi_iter.bi_sector);
> > > > >  	bio_inc_remaining(bio);
> > > > > 
> > > > > -- 
> > > > > Ming
> > > > > 
> > > > 
> > > > Unfortunately we do need splitting after __map_bio() because a dm
> > > > target's ->map can use dm_accept_partial_bio() to further reduce a
> > > > bio's mapped part.
> > > > 
> > > > But I think dm_accept_partial_bio() could be trained to update
> > > > tio->io->sectors?
> > > 
> > > ->orig_bio just serves io accounting, but it isn't passed to
> > > dm_accept_partial_bio(), and it doesn't get updated after
> > > dm_accept_partial_bio() is called.
> > > 
> > > If that is an issue, it must be a pre-existing issue in dm io accounting,
> > > since ->orig_bio isn't updated when dm_accept_partial_bio() is called.
> > 
> > Recall that ->orig_bio is updated after the bio_split() at the bottom of
> > dm_split_and_process_bio().
> > 
> > That bio_split() is based on ci->sector_count, which is reduced as a
> > side-effect of dm_accept_partial_bio() reducing tio->len_ptr.  It is
> > pretty circuitous so I can absolutely understand why you didn't
> > immediately appreciate the interface.  The block comment above
> > dm_accept_partial_bio() does a pretty comprehensive job of explaining.
> 
> Got it now, thanks for the explanation.
> 
> As you mentioned, it can be addressed in dm_accept_partial_bio()
> by updating tio->io->sectors.

Yes, I rebased your patchset on top of dm-5.19 and fixed up your
splitting like we discussed.  I'll be rebasing on top of v5.18-rc3 once
it is released, but please have a look at this 'dm-5.19-v2' branch:
https://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm.git/log/?h=dm-5.19-v2

And this commit in particular:
https://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm.git/commit/?h=dm-5.19-v2&id=fe5a99da8b0d0518342f5cdb522a06b0f123ca09

Once I've verified with you that it looks OK, I'll fold it into your
commit (at the same time I rebase on v5.18-rc3 early next week).

In general this rebase sucked... but I wanted to layer your changes
on top of earlier changes I had already staged for 5.19. I've at least
verified that DM's bio polling seems to work as usual and also
verified that cryptsetup's 'make check' passes.

No urgency on you reviewing before early next week.  Have a great
weekend.

Mike

Ming Lei April 17, 2022, 2:22 a.m. UTC | #16
On Fri, Apr 15, 2022 at 05:06:55PM -0400, Mike Snitzer wrote:
> On Thu, Apr 14 2022 at  8:14P -0400,
> Ming Lei <ming.lei@redhat.com> wrote:
> 
> > On Thu, Apr 14, 2022 at 01:45:33PM -0400, Mike Snitzer wrote:
> > > On Wed, Apr 13 2022 at 11:57P -0400,
> > > Ming Lei <ming.lei@redhat.com> wrote:
> > > 
> > > > On Wed, Apr 13, 2022 at 10:25:45PM -0400, Mike Snitzer wrote:
> > > > > On Wed, Apr 13 2022 at  8:36P -0400,
> > > > > Ming Lei <ming.lei@redhat.com> wrote:
> > > > > 
> > > > > > On Wed, Apr 13, 2022 at 01:58:54PM -0400, Mike Snitzer wrote:
> > > > > > > 
> > > > > > > The bigger issue with this patch is that you've caused
> > > > > > > dm_submit_bio_remap() to go back to accounting the entire original bio
> > > > > > > before any split occurs.  That is a problem because you'll end up
> > > > > > > accounting that bio for every split, so in split heavy workloads the
> > > > > > > IO accounting won't reflect when the IO is actually issued and we'll
> > > > > > > regress back to having very inaccurate and incorrect IO accounting for
> > > > > > > dm_submit_bio_remap() heavy targets (e.g. dm-crypt).
> > > > > > 
> > > > > > Good catch, but we know the length of the mapped part in the original bio
> > > > > > before calling __map_bio(), so io->sectors/io->sector_offset can be set up
> > > > > > here; something like the following delta change should address it:
> > > > > > 
> > > > > > diff --git a/drivers/md/dm.c b/drivers/md/dm.c
> > > > > > index db23efd6bbf6..06b554f3104b 100644
> > > > > > --- a/drivers/md/dm.c
> > > > > > +++ b/drivers/md/dm.c
> > > > > > @@ -1558,6 +1558,13 @@ static int __split_and_process_bio(struct clone_info *ci)
> > > > > >  
> > > > > >  	len = min_t(sector_t, max_io_len(ti, ci->sector), ci->sector_count);
> > > > > >  	clone = alloc_tio(ci, ti, 0, &len, GFP_NOIO);
> > > > > > +
> > > > > > +	if (ci->sector_count > len) {
> > > > > > +		/* setup the mapped part for accounting */
> > > > > > +		dm_io_set_flag(ci->io, DM_IO_SPLITTED);
> > > > > > +		ci->io->sectors = len;
> > > > > > +		ci->io->sector_offset = bio_end_sector(ci->bio) - ci->sector;
> > > > > > +	}
> > > > > >  	__map_bio(clone);
> > > > > >  
> > > > > >  	ci->sector += len;
> > > > > > @@ -1603,11 +1610,6 @@ static void dm_split_and_process_bio(struct mapped_device *md,
> > > > > >  	if (error || !ci.sector_count)
> > > > > >  		goto out;
> > > > > >  
> > > > > > -	/* setup the mapped part for accounting */
> > > > > > -	dm_io_set_flag(ci.io, DM_IO_SPLITTED);
> > > > > > -	ci.io->sectors = bio_sectors(bio) - ci.sector_count;
> > > > > > -	ci.io->sector_offset = bio_end_sector(bio) - bio->bi_iter.bi_sector;
> > > > > > -
> > > > > >  	bio_trim(bio, ci.io->sectors, ci.sector_count);
> > > > > >  	trace_block_split(bio, bio->bi_iter.bi_sector);
> > > > > >  	bio_inc_remaining(bio);
> > > > > > 
> > > > > > -- 
> > > > > > Ming
> > > > > > 
> > > > > 
> > > > > Unfortunately we do need splitting after __map_bio() because a dm
> > > > > target's ->map can use dm_accept_partial_bio() to further reduce a
> > > > > bio's mapped part.
> > > > > 
> > > > > But I think dm_accept_partial_bio() could be trained to update
> > > > > tio->io->sectors?
> > > > 
> > > > ->orig_bio just serves io accounting, but it isn't passed to
> > > > dm_accept_partial_bio(), and it doesn't get updated after
> > > > dm_accept_partial_bio() is called.
> > > > 
> > > > If that is an issue, it must be a pre-existing issue in dm io accounting,
> > > > since ->orig_bio isn't updated when dm_accept_partial_bio() is called.
> > > 
> > > Recall that ->orig_bio is updated after the bio_split() at the bottom of
> > > dm_split_and_process_bio().
> > > 
> > > That bio_split() is based on ci->sector_count, which is reduced as a
> > > side-effect of dm_accept_partial_bio() reducing tio->len_ptr.  It is
> > > pretty circuitous so I can absolutely understand why you didn't
> > > immediately appreciate the interface.  The block comment above
> > > dm_accept_partial_bio() does a pretty comprehensive job of explaining.
> > 
> > Got it now, thanks for the explanation.
> > 
> > As you mentioned, it can be addressed in dm_accept_partial_bio()
> > by updating tio->io->sectors.
> 
> Yes, I rebased your patchset on top of dm-5.19 and fixed up your
> splitting like we discussed.  I'll be rebasing on top of v5.18-rc3 once
> it is released, but please have a look at this 'dm-5.19-v2' branch:
> https://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm.git/log/?h=dm-5.19-v2
> 
> And this commit in particular:
> https://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm.git/commit/?h=dm-5.19-v2&id=fe5a99da8b0d0518342f5cdb522a06b0f123ca09
> 
> Once I've verified with you that it looks OK, I'll fold it into your
> commit (at the same time I rebase on v5.18-rc3 early next week).

Hi Mike,

Your delta change looks good, thanks for fixing it!

Thanks,
Ming

Patch

diff --git a/drivers/md/dm-core.h b/drivers/md/dm-core.h
index 4277853c7535..aefb080c230d 100644
--- a/drivers/md/dm-core.h
+++ b/drivers/md/dm-core.h
@@ -247,7 +247,12 @@  struct dm_io {
 	blk_short_t flags;
 	atomic_t io_count;
 	struct mapped_device *md;
+
+	/* The three fields represent mapped part of original bio */
 	struct bio *orig_bio;
+	unsigned int sector_offset; /* offset to end of orig_bio */
+	unsigned int sectors;
+
 	blk_status_t status;
 	spinlock_t lock;
 	unsigned long start_time;
@@ -264,7 +269,8 @@  struct dm_io {
  */
 enum {
 	DM_IO_START_ACCT,
-	DM_IO_ACCOUNTED
+	DM_IO_ACCOUNTED,
+	DM_IO_SPLITTED
 };
 
 static inline bool dm_io_flagged(struct dm_io *io, unsigned int bit)
diff --git a/drivers/md/dm.c b/drivers/md/dm.c
index 31eacc0e93ed..df1d013fb793 100644
--- a/drivers/md/dm.c
+++ b/drivers/md/dm.c
@@ -509,8 +509,10 @@  static void dm_io_acct(struct dm_io *io, bool end)
 	/* If REQ_PREFLUSH set save any payload but do not account it */
 	if (bio_is_flush_with_data(bio))
 		sectors = 0;
-	else
+	else if (likely(!(dm_io_flagged(io, DM_IO_SPLITTED))))
 		sectors = bio_sectors(bio);
+	else
+		sectors = io->sectors;
 
 	if (!end)
 		bdev_start_io_acct(bio->bi_bdev, sectors, bio_op(bio),
@@ -518,10 +520,21 @@  static void dm_io_acct(struct dm_io *io, bool end)
 	else
 		bdev_end_io_acct(bio->bi_bdev, bio_op(bio), start_time);
 
-	if (unlikely(dm_stats_used(&md->stats)))
+	if (unlikely(dm_stats_used(&md->stats))) {
+		sector_t sector;
+
+		if (likely(!dm_io_flagged(io, DM_IO_SPLITTED))) {
+			sector = bio->bi_iter.bi_sector;
+			sectors = bio_sectors(bio);
+		} else {
+			sector = bio_end_sector(bio) - io->sector_offset;
+			sectors = io->sectors;
+		}
+
 		dm_stats_account_io(&md->stats, bio_data_dir(bio),
-				    bio->bi_iter.bi_sector, bio_sectors(bio),
+				    sector, sectors,
 				    end, start_time, stats_aux);
+	}
 }
 
 static void __dm_start_io_acct(struct dm_io *io)
@@ -576,7 +589,7 @@  static struct dm_io *alloc_io(struct mapped_device *md, struct bio *bio)
 	io->status = 0;
 	atomic_set(&io->io_count, 1);
 	this_cpu_inc(*md->pending_io);
-	io->orig_bio = NULL;
+	io->orig_bio = bio;
 	io->md = md;
 	io->map_task = current;
 	spin_lock_init(&io->lock);
@@ -1222,13 +1235,6 @@  void dm_submit_bio_remap(struct bio *clone, struct bio *tgt_clone)
 		/* Still in target's map function */
 		dm_io_set_flag(io, DM_IO_START_ACCT);
 	} else {
-		/*
-		 * Called by another thread, managed by DM target,
-		 * wait for dm_split_and_process_bio() to store
-		 * io->orig_bio
-		 */
-		while (unlikely(!smp_load_acquire(&io->orig_bio)))
-			msleep(1);
 		dm_start_io_acct(io, clone);
 	}
 
@@ -1557,7 +1563,6 @@  static void dm_split_and_process_bio(struct mapped_device *md,
 				     struct dm_table *map, struct bio *bio)
 {
 	struct clone_info ci;
-	struct bio *orig_bio = NULL;
 	int error = 0;
 
 	init_clone_info(&ci, md, map, bio);
@@ -1573,22 +1578,16 @@  static void dm_split_and_process_bio(struct mapped_device *md,
 	if (error || !ci.sector_count)
 		goto out;
 
-	/*
-	 * Remainder must be passed to submit_bio_noacct() so it gets handled
-	 * *after* bios already submitted have been completely processed.
-	 * We take a clone of the original to store in ci.io->orig_bio to be
-	 * used by dm_end_io_acct() and for dm_io_complete() to use for
-	 * completion handling.
-	 */
-	orig_bio = bio_split(bio, bio_sectors(bio) - ci.sector_count,
-			     GFP_NOIO, &md->queue->bio_split);
-	bio_chain(orig_bio, bio);
-	trace_block_split(orig_bio, bio->bi_iter.bi_sector);
+	/* setup the mapped part for accounting */
+	dm_io_set_flag(ci.io, DM_IO_SPLITTED);
+	ci.io->sectors = bio_sectors(bio) - ci.sector_count;
+	ci.io->sector_offset = bio_end_sector(bio) - bio->bi_iter.bi_sector;
+
+	bio_trim(bio, ci.io->sectors, ci.sector_count);
+	trace_block_split(bio, bio->bi_iter.bi_sector);
+	bio_inc_remaining(bio);
 	submit_bio_noacct(bio);
 out:
-	if (!orig_bio)
-		orig_bio = bio;
-	smp_store_release(&ci.io->orig_bio, orig_bio);
 	if (dm_io_flagged(ci.io, DM_IO_START_ACCT))
 		dm_start_io_acct(ci.io, NULL);