mbox series

[RFC,0/3] Add support of iopoll for dm device

Message ID 20201020065420.124885-1-jefflexu@linux.alibaba.com (mailing list archive)
Headers show
Series Add support of iopoll for dm device | expand

Message

Jingbo Xu Oct. 20, 2020, 6:54 a.m. UTC
This patch set adds support of iopoll for dm device.

This is only an RFC patch. I'm really looking forward getting your
feedbacks on if you're interested in supporting iopoll for dm device,
or if there's a better design to implement that.

Thanks.


[Purpose]
IO polling is an important mode of io_uring. Currently only mq devices
support iopoll. As for dm devices, only dm-multipath is request-base,
while others are all bio-based and have no support for iopoll.
Supporting iopoll for dm devices can be of great meaning when the
device seen by application is dm device such as dm-linear/dm-stripe,
in which case iopoll is not usable for io_uring.


[Design Notes]

cookie
------
Let's start from cookie. Cookie is one important concept in iopoll. It
is used to identify one specific request in one specific hardware queue.
The concept of cookie is initially designed as a per-bio concept, and
thus it doesn't work well when bio-split involved. When bio is split,
the returned cookie is indeed the cookie of one of the split bio, and
the following polling on this returned cookie can only guarantee the
completion of this specific split bio, while the other split bios may
be still uncompleted. Bio-split is also resolved for dm device, though
in another form, in which case the original bio submitted to the dm
device may be split into multiple bios submitted to the underlying
devices.

In previous discussion, Lei Ming has suggested that iopoll should be
disabled for bio-split. This works for the normal bio-split (done in
blk_mq_submit_bio->__blk_queue_split), while iopoll will never be
usable for dm devices if this also applies for dm device.

So come back to the design of the cookie for dm devices. At the very
beginning I want to refactor the design of cookie, from 'unsigned int'
type to the pointer type for dm device, so that cookie can point to
something, e.g. a list containing all cookies of one original bio,
something like this:

struct dm_io { // the returned cookie points to dm_io
	...
	struct list_head cookies; 
};

In this design, we can fetch all cookies of one original bio, but the
implementation can be difficult and buggy. For example, the
'struct dm_io' structure may be already freed when the returned cookie
is used in blk_poll(). Then what if maintain a refcount in struct dm_io
so that 'struct dm_io' structure can not be freed until blk_poll()
called? Then the 'struct dm_io' structure will never be freed if the
IO polling is not used at all.

So finally I gived up refactoring the design of cookie and maintain all
cookies of one original bio. The initial design of cookie gets retained.
The returned cookie is still the cookie of one of the split bio, and thus
it is not used at all when polling dm devices. The polling of dm device,
is indeed iterating and polling on all hardware queues (in polling mode)
of all underlying target devices.

This design is simple and easy to implement. But I'm not sure if the
performance can be an issue if one dm device mapped to too many target
devices or the dm stack depth grows.

REQ_NOWAIT
----------
The polling bio will set REQ_NOWAIT in bio->bi_flags, and the device
need to be capable of handling REQ_NOWAIT. Commit 021a24460dc2
("block: add QUEUE_FLAG_NOWAIT") and commit 6abc49468eea ("dm: add
support for REQ_NOWAIT and enable it for linear target") add this
support for dm device, and currently only dm-linear supports that.

hybrid polling
--------------
When execute hybrid polling, cookie is used to fetch the request,
and check if the request has completed or not. In the design for
dm device described above, the returned cookie is still the cookie
of one of the split bios, and thus we have no way checking if all
the split bios have completed or not. Thus in the current
implementation, hybrid polling is not supported for dm device.


[Performance]
I test 4k-randread on a dm-linear mapped to only one nvme device.

SSD: INTEL SSDPEDME800G4
dm-linear:dmsetup create testdev --table "0 209715200 linear /dev/nvme0n1 0"

fio configs:
```
ioengine=io_uring
iodepth=128
numjobs=1
thread
rw=randread
direct=1
registerfiles=1
#hipri=1 with iopoll enabled, hipri=0 with iopoll disabled
bs=4k
size=100G
runtime=60
time_based
group_reporting
randrepeat=0

[device]
filename=/dev/mapper/testdev
```

The test result shows that there's ~40% performance growth when iopoll
enabled.

test | iops | bw | avg lat
---- | ---- | ---- | ----
iopoll-disabled | 244k | 953MiB/s  | 524us
iopoll-enabled  | 345k | 1349MiB/s | 370us

[TODO]
The store method of "io_poll"/"io_poll_delay" has not been adapted
for dm device.


Jeffle Xu (3):
  block/mq: add iterator for polling hw queues
  block: add back ->poll_fn in request queue
  dm: add support for IO polling

 block/blk-mq.c         | 30 ++++++++++++++++++++++++------
 drivers/md/dm-core.h   |  1 +
 drivers/md/dm-table.c  | 30 ++++++++++++++++++++++++++++++
 drivers/md/dm.c        | 40 ++++++++++++++++++++++++++++++++++++++++
 include/linux/blk-mq.h |  6 ++++++
 include/linux/blkdev.h |  3 +++
 6 files changed, 104 insertions(+), 6 deletions(-)

Comments

Mike Snitzer Oct. 21, 2020, 8:39 p.m. UTC | #1
On Tue, Oct 20 2020 at  2:54am -0400,
Jeffle Xu <jefflexu@linux.alibaba.com> wrote:

> This patch set adds support of iopoll for dm device.
> 
> This is only an RFC patch. I'm really looking forward getting your
> feedbacks on if you're interested in supporting iopoll for dm device,
> or if there's a better design to implement that.
> 
> Thanks.
> 
> 
> [Purpose]
> IO polling is an important mode of io_uring. Currently only mq devices
> support iopoll. As for dm devices, only dm-multipath is request-base,
> while others are all bio-based and have no support for iopoll.
> Supporting iopoll for dm devices can be of great meaning when the
> device seen by application is dm device such as dm-linear/dm-stripe,
> in which case iopoll is not usable for io_uring.

I appreciate you taking the initiative on this; polling support is on my
TODO so your work serves as a nice reminder to pursue this more
urgently.

> [Design Notes]
> 
> cookie
> ------
> Let's start from cookie. Cookie is one important concept in iopoll. It
> is used to identify one specific request in one specific hardware queue.
> The concept of cookie is initially designed as a per-bio concept, and
> thus it doesn't work well when bio-split involved. When bio is split,
> the returned cookie is indeed the cookie of one of the split bio, and
> the following polling on this returned cookie can only guarantee the
> completion of this specific split bio, while the other split bios may
> be still uncompleted. Bio-split is also resolved for dm device, though
> in another form, in which case the original bio submitted to the dm
> device may be split into multiple bios submitted to the underlying
> devices.
> 
> In previous discussion, Lei Ming has suggested that iopoll should be
> disabled for bio-split. This works for the normal bio-split (done in
> blk_mq_submit_bio->__blk_queue_split), while iopoll will never be
> usable for dm devices if this also applies for dm device.
> 
> So come back to the design of the cookie for dm devices. At the very
> beginning I want to refactor the design of cookie, from 'unsigned int'
> type to the pointer type for dm device, so that cookie can point to
> something, e.g. a list containing all cookies of one original bio,
> something like this:
> 
> struct dm_io { // the returned cookie points to dm_io
> 	...
> 	struct list_head cookies; 
> };
> 
> In this design, we can fetch all cookies of one original bio, but the
> implementation can be difficult and buggy. For example, the
> 'struct dm_io' structure may be already freed when the returned cookie
> is used in blk_poll(). Then what if maintain a refcount in struct dm_io
> so that 'struct dm_io' structure can not be freed until blk_poll()
> called? Then the 'struct dm_io' structure will never be freed if the
> IO polling is not used at all.

I'd have to look closer at the race in the code you wrote (though you
didn't share it); but we cannot avoid properly mapping a cookie to each
split bio.  Otherwise you resort to inefficiently polling everything.

Seems your attempt to have the cookie point to a dm_io object was likely
too coarse-grained (when bios are split they get their own dm_io on
recursive re-entry to DM core from ->submit_bio); but isn't having a
list of cookies still too imprecise for polling purposes?  You could
easily have a list that translates to many blk-mq queues.  Possibly
better than your current approach of polling everything -- but not
much.

> So finally I gived up refactoring the design of cookie and maintain all
> cookies of one original bio. The initial design of cookie gets retained.
> The returned cookie is still the cookie of one of the split bio, and thus
> it is not used at all when polling dm devices. The polling of dm device,
> is indeed iterating and polling on all hardware queues (in polling mode)
> of all underlying target devices.
>
> This design is simple and easy to implement. But I'm not sure if the
> performance can be an issue if one dm device mapped to too many target
> devices or the dm stack depth grows.

You showed 40% performance improvement but it really just does the bare
minimum of implementing polling.  It is so simplistic that I really
don't think the design even serves as a starting point for what will be
needed.  So this needs further design time.

What you've _done_ could serve as a stop-gap but I'd really rather we
get it properly designed from the start.

> REQ_NOWAIT
> ----------
> The polling bio will set REQ_NOWAIT in bio->bi_flags, and the device
> need to be capable of handling REQ_NOWAIT. Commit 021a24460dc2
> ("block: add QUEUE_FLAG_NOWAIT") and commit 6abc49468eea ("dm: add
> support for REQ_NOWAIT and enable it for linear target") add this
> support for dm device, and currently only dm-linear supports that.
> 
> hybrid polling
> --------------
> When execute hybrid polling, cookie is used to fetch the request,
> and check if the request has completed or not. In the design for
> dm device described above, the returned cookie is still the cookie
> of one of the split bios, and thus we have no way checking if all
> the split bios have completed or not. Thus in the current
> implementation, hybrid polling is not supported for dm device.

I'll look closer at all this.  But DM targets do allow for additional
target specific per-bio-data to be added to the overall per-bio-data
size (via ti->per_io_data_size).  DM core _could_ internalize adding
additional space to per-bio-data for storing a unique cookie per
split/clone bio.

Conversely, block-core's bio cloning could manage this so long as it
knew how to access the offset into the established per-bio-data.  But
that might be more challenging to do without impacting use-cases outside
of DM.

Thanks,
Mike


> [Performance]
> I test 4k-randread on a dm-linear mapped to only one nvme device.
> 
> SSD: INTEL SSDPEDME800G4
> dm-linear:dmsetup create testdev --table "0 209715200 linear /dev/nvme0n1 0"
> 
> fio configs:
> ```
> ioengine=io_uring
> iodepth=128
> numjobs=1
> thread
> rw=randread
> direct=1
> registerfiles=1
> #hipri=1 with iopoll enabled, hipri=0 with iopoll disabled
> bs=4k
> size=100G
> runtime=60
> time_based
> group_reporting
> randrepeat=0
> 
> [device]
> filename=/dev/mapper/testdev
> ```
> 
> The test result shows that there's ~40% performance growth when iopoll
> enabled.
> 
> test | iops | bw | avg lat
> ---- | ---- | ---- | ----
> iopoll-disabled | 244k | 953MiB/s  | 524us
> iopoll-enabled  | 345k | 1349MiB/s | 370us
> 
> [TODO]
> The store method of "io_poll"/"io_poll_delay" has not been adapted
> for dm device.
> 
> 
> Jeffle Xu (3):
>   block/mq: add iterator for polling hw queues
>   block: add back ->poll_fn in request queue
>   dm: add support for IO polling
> 
>  block/blk-mq.c         | 30 ++++++++++++++++++++++++------
>  drivers/md/dm-core.h   |  1 +
>  drivers/md/dm-table.c  | 30 ++++++++++++++++++++++++++++++
>  drivers/md/dm.c        | 40 ++++++++++++++++++++++++++++++++++++++++
>  include/linux/blk-mq.h |  6 ++++++
>  include/linux/blkdev.h |  3 +++
>  6 files changed, 104 insertions(+), 6 deletions(-)
> 
> -- 
> 2.27.0
>
Jingbo Xu Oct. 22, 2020, 5:28 a.m. UTC | #2
On 10/22/20 4:39 AM, Mike Snitzer wrote:

> What you've _done_ could serve as a stop-gap but I'd really rather we
> get it properly designed from the start.

Indeed I totally agree with you that the design should be done nicely at 
the very beginning. And this

is indeed the purpose of this RFC patch.


>> This patch set adds support of iopoll for dm device.
>>
>> This is only an RFC patch. I'm really looking forward getting your
>> feedbacks on if you're interested in supporting iopoll for dm device,
>> or if there's a better design to implement that.
>>
>> Thanks.
>>
>>
>> [Purpose]
>> IO polling is an important mode of io_uring. Currently only mq devices
>> support iopoll. As for dm devices, only dm-multipath is request-base,
>> while others are all bio-based and have no support for iopoll.
>> Supporting iopoll for dm devices can be of great meaning when the
>> device seen by application is dm device such as dm-linear/dm-stripe,
>> in which case iopoll is not usable for io_uring.
> I appreciate you taking the initiative on this; polling support is on my
> TODO so your work serves as a nice reminder to pursue this more
> urgently.

It's a good news that iopoll for DM is meaningful.


> but we cannot avoid properly mapping a cookie to each
> split bio.  Otherwise you resort to inefficiently polling everything.

Yes. At the very beginning  I tried to build the mapping a cookie to 
each bio, but I failed with several

blocking design issues. By the way maybe we could clarify these design 
issues here, if you'd like.


>
> Seems your attempt to have the cookie point to a dm_io object was likely
> too coarse-grained (when bios are split they get their own dm_io on
> recursive re-entry to DM core from ->submit_bio); but isn't having a
> list of cookies still too imprecise for polling purposes?  You could
> easily have a list that translates to many blk-mq queues.  Possibly
> better than your current approach of polling everything -- but not
> much.

To make the discussion more specific, assume that dm0 is mapped to 
dm1/2/3, while dm1 mapped to

nvme1, dm2 mapped to dm2, etc..

                     dm0

dm1             dm2            dm3

nvme1        nvme2        nvme3


Then the returned cookie of dm0 could be pointer pointing to dm_io 
object of dm0.

struct dm_io {  // the returned cookie points to dm_io object
	...
+	struct list_head cookies;
};

struct dm_target_io {
	...
	/*
	 * The corresponding polling hw queue if submitted to mq device (such as nvme1/2/3),
	 * NULL if submitted to dm device (such as dm1/2/3)
	 */
+	struct blk_mq_hw_ctx *hctx;
+	struct list_head      node;  // add to @cookies list
};

The @cookies list of dm_io object could maintain all dm_target_io objects
of all **none-dm** devices, that is, all hw queues that we should poll on.


returned  ->  @cookies list	
cookie	      of dm_io object of dm0
		   |
		   +--> dm_target_io	 ->  dm_target_io     ->  dm_target_io
			object of nvme1      object of nvme2	  object of nvme3

When polling returned cookie of dm0, actually we're polling @cookies 
list. Once one of the dm_target_io

completed (e.g. nvme2), it should be removed from the @cookies list., 
and thus we should only focus on

hw queues that have not completed.



>> [Design Notes]
>>
>> cookie
>> ------
>> Let's start from cookie. Cookie is one important concept in iopoll. It
>> is used to identify one specific request in one specific hardware queue.
>> The concept of cookie is initially designed as a per-bio concept, and
>> thus it doesn't work well when bio-split involved. When bio is split,
>> the returned cookie is indeed the cookie of one of the split bio, and
>> the following polling on this returned cookie can only guarantee the
>> completion of this specific split bio, while the other split bios may
>> be still uncompleted. Bio-split is also resolved for dm device, though
>> in another form, in which case the original bio submitted to the dm
>> device may be split into multiple bios submitted to the underlying
>> devices.
>>
>> In previous discussion, Lei Ming has suggested that iopoll should be
>> disabled for bio-split. This works for the normal bio-split (done in
>> blk_mq_submit_bio->__blk_queue_split), while iopoll will never be
>> usable for dm devices if this also applies for dm device.
>>
>> So come back to the design of the cookie for dm devices. At the very
>> beginning I want to refactor the design of cookie, from 'unsigned int'
>> type to the pointer type for dm device, so that cookie can point to
>> something, e.g. a list containing all cookies of one original bio,
>> something like this:
>>
>> struct dm_io { // the returned cookie points to dm_io
>> 	...
>> 	struct list_head cookies;
>> };
>>
>> In this design, we can fetch all cookies of one original bio, but the
>> implementation can be difficult and buggy. For example, the
>> 'struct dm_io' structure may be already freed when the returned cookie
>> is used in blk_poll(). Then what if maintain a refcount in struct dm_io
>> so that 'struct dm_io' structure can not be freed until blk_poll()
>> called? Then the 'struct dm_io' structure will never be freed if the
>> IO polling is not used at all.
> I'd have to look closer at the race in the code you wrote (though you
> didn't share it);

I worried that dm_target_io/dm_io objects could have been freed 
before/when we are polling on them,

and thus could cause use-after-free when accessing @cookies list in 
dm_target_io. It could happen

when there are multiple polling instance. io_uring has implemented 
per-instance polling thread. If

there are two bios submitted to dm0, please consider the following race 
sequence:


1. race 1: dm_target_io/dm_io objects could have been freed ****when**** 
we are polling on them

```*
*

*thread1 polling on bio1                    thread2 polling on bio2*

     fetch dm_io object by the cookie

reaps completions in nvme 1/2/3,

completes bio1 and frees dm_io object of bio1 by the way

     use-after-free when accessing dm_io object

```

Maybe we should get a refcount of dm_io of bio1 when start polling, but 
I'm not sure if the design is

elegant or not.



2. race 2: dm_target_io/dm_io objects could have been freed 
****before**** we are polling on them

```

*thread1 polling on bio1                    thread2 polling on bio2*

reaps completions in nvme 1/2/3,

clone_endio

     dec_pending

         free_io(md, io);  // free dm_io object

     __blkdev_direct_IO

         READ_ONCE(dio->waiter)  // dio->waiter is still none-NULL

         bio_endio(io->orig_bio) //call bi_end_io() of original bio, 
that is blkdev_bio_end_io

             WRITE_ONCE(dio->waiter, NULL);

         blk_poll

             fetch dm_io object by the cookie  // use-after-free!

```


Thanks.

Jeffle.
Mike Snitzer Oct. 26, 2020, 6:53 p.m. UTC | #3
On Thu, Oct 22 2020 at  1:28am -0400,
JeffleXu <jefflexu@linux.alibaba.com> wrote:

> 
> On 10/22/20 4:39 AM, Mike Snitzer wrote:
> 
> >What you've _done_ could serve as a stop-gap but I'd really rather we
> >get it properly designed from the start.
> 
> Indeed I totally agree with you that the design should be done
> nicely at the very beginning. And this
> 
> is indeed the purpose of this RFC patch.
> 
> 
> >>This patch set adds support of iopoll for dm device.
> >>
> >>This is only an RFC patch. I'm really looking forward getting your
> >>feedbacks on if you're interested in supporting iopoll for dm device,
> >>or if there's a better design to implement that.
> >>
> >>Thanks.
> >>
> >>
> >>[Purpose]
> >>IO polling is an important mode of io_uring. Currently only mq devices
> >>support iopoll. As for dm devices, only dm-multipath is request-base,
> >>while others are all bio-based and have no support for iopoll.
> >>Supporting iopoll for dm devices can be of great meaning when the
> >>device seen by application is dm device such as dm-linear/dm-stripe,
> >>in which case iopoll is not usable for io_uring.
> >I appreciate you taking the initiative on this; polling support is on my
> >TODO so your work serves as a nice reminder to pursue this more
> >urgently.
> 
> It's a good news that iopoll for DM is meaningful.
> 
> 
> >but we cannot avoid properly mapping a cookie to each
> >split bio.  Otherwise you resort to inefficiently polling everything.
> 
> Yes. At the very beginning  I tried to build the mapping a cookie to
> each bio, but I failed with several
> 
> blocking design issues. By the way maybe we could clarify these
> design issues here, if you'd like.

Biggest issue I'm seeing is that block core's bio-based IO submission
implementation really never seriously carried the blk_qc_t changes
through. The cookie return from __submit_bio is thrown away when
recursion occurs in __submit_bio_noacct -- last bio submission's cookie
is what is returned back to caller.  That cookie could be very far
removed from all the others returned earlier in the recursion.

Fixing this would require quite some design and cleanup.

> >Seems your attempt to have the cookie point to a dm_io object was likely
> >too coarse-grained (when bios are split they get their own dm_io on
> >recursive re-entry to DM core from ->submit_bio); but isn't having a
> >list of cookies still too imprecise for polling purposes?  You could
> >easily have a list that translates to many blk-mq queues.  Possibly
> >better than your current approach of polling everything -- but not
> >much.
> 
> To make the discussion more specific, assume that dm0 is mapped to
> dm1/2/3, while dm1 mapped to
> 
> nvme1, dm2 mapped to dm2, etc..
> 
>                     dm0
> 
> dm1             dm2            dm3
> 
> nvme1        nvme2        nvme3
> 
> 
> Then the returned cookie of dm0 could be pointer pointing to dm_io
> object of dm0.
> 
> struct dm_io {  // the returned cookie points to dm_io object
> 	...
> +	struct list_head cookies;
> };
> 
> struct dm_target_io {
> 	...
> 	/*
> 	 * The corresponding polling hw queue if submitted to mq device (such as nvme1/2/3),
> 	 * NULL if submitted to dm device (such as dm1/2/3)
> 	 */
> +	struct blk_mq_hw_ctx *hctx;
> +	struct list_head      node;  // add to @cookies list
> };
> 
> The @cookies list of dm_io object could maintain all dm_target_io objects
> of all **none-dm** devices, that is, all hw queues that we should poll on.
> 
> 
> returned  ->  @cookies list	
> cookie	      of dm_io object of dm0
> 		   |
> 		   +--> dm_target_io	 ->  dm_target_io     ->  dm_target_io
> 			object of nvme1      object of nvme2	  object of nvme3
> 
> When polling returned cookie of dm0, actually we're polling @cookies
> list. Once one of the dm_target_io
> 
> completed (e.g. nvme2), it should be removed from the @cookies
> list., and thus we should only focus on
> 
> hw queues that have not completed.

What you detailed there isn't properly modeling what it needs to.
A given dm_target_io could result in quite a few bios (e.g. for
dm-striped we clone each bio for each of N stripes).  So the fan-out,
especially if then stacked on N layers of stacked devices, to all the
various hctx at the lowest layers is like herding cats.

But the recursion in block core's submit_bio path makes that challenging
to say the least.  So much so that any solution related to enabling
proper bio-based IO polling is going to need a pretty significant
investment in fixing block core (storing __submit_bio()'s cookie during
recursion, possibly storing to driver provided memory location,
e.g. DM initialized bio->submit_cookie pointer to a blk_qc_t within a DM
clone bio's per-bio-data).

SO __submit_bio_noacct would become:

   retp = &ret; 
   if (bio->submit_cookie)
          retp = bio->submit_cookie;
   *retp = __submit_bio(bio);

> >>[Design Notes]
> >>
> >>cookie
> >>------
> >>Let's start from cookie. Cookie is one important concept in iopoll. It
> >>is used to identify one specific request in one specific hardware queue.
> >>The concept of cookie is initially designed as a per-bio concept, and
> >>thus it doesn't work well when bio-split involved. When bio is split,
> >>the returned cookie is indeed the cookie of one of the split bio, and
> >>the following polling on this returned cookie can only guarantee the
> >>completion of this specific split bio, while the other split bios may
> >>be still uncompleted. Bio-split is also resolved for dm device, though
> >>in another form, in which case the original bio submitted to the dm
> >>device may be split into multiple bios submitted to the underlying
> >>devices.
> >>
> >>In previous discussion, Lei Ming has suggested that iopoll should be
> >>disabled for bio-split. This works for the normal bio-split (done in
> >>blk_mq_submit_bio->__blk_queue_split), while iopoll will never be
> >>usable for dm devices if this also applies for dm device.
> >>
> >>So come back to the design of the cookie for dm devices. At the very
> >>beginning I want to refactor the design of cookie, from 'unsigned int'
> >>type to the pointer type for dm device, so that cookie can point to
> >>something, e.g. a list containing all cookies of one original bio,
> >>something like this:
> >>
> >>struct dm_io { // the returned cookie points to dm_io
> >>	...
> >>	struct list_head cookies;
> >>};
> >>
> >>In this design, we can fetch all cookies of one original bio, but the
> >>implementation can be difficult and buggy. For example, the
> >>'struct dm_io' structure may be already freed when the returned cookie
> >>is used in blk_poll(). Then what if maintain a refcount in struct dm_io
> >>so that 'struct dm_io' structure can not be freed until blk_poll()
> >>called? Then the 'struct dm_io' structure will never be freed if the
> >>IO polling is not used at all.
> >I'd have to look closer at the race in the code you wrote (though you
> >didn't share it);
> 
> I worried that dm_target_io/dm_io objects could have been freed
> before/when we are polling on them,
> 
> and thus could cause use-after-free when accessing @cookies list in
> dm_target_io. It could happen
> 
> when there are multiple polling instance. io_uring has implemented
> per-instance polling thread. If
> 
> there are two bios submitted to dm0, please consider the following
> race sequence:

The lifetime of the bios should be fine given that the cloning nature of
DM requires that all clones complete before the origin may complete.

I think you probably just got caught out by the recursive nature of the bio
submission path -- makes creating a graph of submitted bios and their
associated per-bio-data and generated cookies a mess to track (again,
like herding cats).

Could also be you didn't quite understand the DM code's various
structures.

In any case, the block core changes needed to make bio-based IO polling
work is the main limiting factor right now.

Not sure it is worth the investment... but I could be persuaded to try
harder! ;)

But then once block core is fixed to allow this, we _still_ need to link
all the various 'struct dm_poll_data' structures to allow blk_poll()
to call DM specific method to walk all in the list to calling blk_poll()
for stored cookie and request_queue*, e.g.:

struct dm_poll_data {
       blk_qc_t cookie;
       struct request_queue *queue;
       struct list_head list;
};

Again, it is the recursive nature of submit_bio() that makes this
challenging.

Mike
Jingbo Xu Nov. 2, 2020, 3:14 a.m. UTC | #4
On 10/27/20 2:53 AM, Mike Snitzer wrote:
> What you detailed there isn't properly modeling what it needs to.
> A given dm_target_io could result in quite a few bios (e.g. for
> dm-striped we clone each bio for each of N stripes).  So the fan-out,
> especially if then stacked on N layers of stacked devices, to all the
> various hctx at the lowest layers is like herding cats.
>
> But the recursion in block core's submit_bio path makes that challenging
> to say the least.  So much so that any solution related to enabling
> proper bio-based IO polling is going to need a pretty significant
> investment in fixing block core (storing __submit_bio()'s cookie during
> recursion, possibly storing to driver provided memory location,
> e.g. DM initialized bio->submit_cookie pointer to a blk_qc_t within a DM
> clone bio's per-bio-data).
>
> SO __submit_bio_noacct would become:
>
>     retp = &ret;
>     if (bio->submit_cookie)
>            retp = bio->submit_cookie;
>     *retp = __submit_bio(bio);

Sorry for the late reply. Exactly I missed this point before. IF you 
have not started working on this, I'd

like to try to implement this as an RFC.


> I think you probably just got caught out by the recursive nature of the bio
> submission path -- makes creating a graph of submitted bios and their
> associated per-bio-data and generated cookies a mess to track (again,
> like herding cats).
>
> Could also be you didn't quite understand the DM code's various
> structures.
>
> In any case, the block core changes needed to make bio-based IO polling
> work is the main limiting factor right now.
Yes the logic is kind of subtle and maybe what I'm concerned here is 
really should be concerned

at the coding phase.


Thanks.

Jeffle Xu
Mike Snitzer Nov. 2, 2020, 3:28 p.m. UTC | #5
On Sun, Nov 01 2020 at 10:14pm -0500,
JeffleXu <jefflexu@linux.alibaba.com> wrote:

> 
> On 10/27/20 2:53 AM, Mike Snitzer wrote:
> >What you detailed there isn't properly modeling what it needs to.
> >A given dm_target_io could result in quite a few bios (e.g. for
> >dm-striped we clone each bio for each of N stripes).  So the fan-out,
> >especially if then stacked on N layers of stacked devices, to all the
> >various hctx at the lowest layers is like herding cats.
> >
> >But the recursion in block core's submit_bio path makes that challenging
> >to say the least.  So much so that any solution related to enabling
> >proper bio-based IO polling is going to need a pretty significant
> >investment in fixing block core (storing __submit_bio()'s cookie during
> >recursion, possibly storing to driver provided memory location,
> >e.g. DM initialized bio->submit_cookie pointer to a blk_qc_t within a DM
> >clone bio's per-bio-data).
> >
> >SO __submit_bio_noacct would become:
> >
> >    retp = &ret;
> >    if (bio->submit_cookie)
> >           retp = bio->submit_cookie;
> >    *retp = __submit_bio(bio);
> 
> Sorry for the late reply. Exactly I missed this point before. IF you
> have not started working on this, I'd like to try to implement this as
> an RFC.

I did start on this line of development but it needs quite a bit more
work.  Even the pseudo code I provided above isn't useful in the context
of DM clone bios that have their own per-bio-data to assist with this
implementation.  Because the __submit_bio_noacct() recursive call
drivers/md/dm.c:__split_and_process_bio() makes is supplying the
original bio (modified to only point to remaining work).

But sure, I'm not opposed to you carrying this line of work forward.  I
can always lend a hand if you need help later or if you need to hand it
off to me.

> >I think you probably just got caught out by the recursive nature of the bio
> >submission path -- makes creating a graph of submitted bios and their
> >associated per-bio-data and generated cookies a mess to track (again,
> >like herding cats).
> >
> >Could also be you didn't quite understand the DM code's various
> >structures.
> >
> >In any case, the block core changes needed to make bio-based IO polling
> >work is the main limiting factor right now.
>
> Yes the logic is kind of subtle and maybe what I'm concerned here is
> really should be concerned at the coding phase.

Definitely, lots of little details and associations.

Mike
Jingbo Xu Nov. 3, 2020, 8:59 a.m. UTC | #6
On 11/2/20 11:28 PM, Mike Snitzer wrote:
> On Sun, Nov 01 2020 at 10:14pm -0500,
> JeffleXu <jefflexu@linux.alibaba.com> wrote:
>
>> On 10/27/20 2:53 AM, Mike Snitzer wrote:
>>> What you detailed there isn't properly modeling what it needs to.
>>> A given dm_target_io could result in quite a few bios (e.g. for
>>> dm-striped we clone each bio for each of N stripes).  So the fan-out,
>>> especially if then stacked on N layers of stacked devices, to all the
>>> various hctx at the lowest layers is like herding cats.
>>>
>>> But the recursion in block core's submit_bio path makes that challenging
>>> to say the least.  So much so that any solution related to enabling
>>> proper bio-based IO polling is going to need a pretty significant
>>> investment in fixing block core (storing __submit_bio()'s cookie during
>>> recursion, possibly storing to driver provided memory location,
>>> e.g. DM initialized bio->submit_cookie pointer to a blk_qc_t within a DM
>>> clone bio's per-bio-data).
>>>
>>> SO __submit_bio_noacct would become:
>>>
>>>     retp = &ret;
>>>     if (bio->submit_cookie)
>>>            retp = bio->submit_cookie;
>>>     *retp = __submit_bio(bio);
>> Sorry for the late reply. Exactly I missed this point before. IF you
>> have not started working on this, I'd like to try to implement this as
>> an RFC.
> I did start on this line of development but it needs quite a bit more
> work.  Even the pseudo code I provided above isn't useful in the context
> of DM clone bios that have their own per-bio-data to assist with this
> implementation.  Because the __submit_bio_noacct() recursive call
> drivers/md/dm.c:__split_and_process_bio() makes is supplying the
> original bio (modified to only point to remaining work).
>
> But sure, I'm not opposed to you carrying this line of work forward.  I
> can always lend a hand if you need help later or if you need to hand it
> off to me.

Thanks. I will continue working on this.

Jeffle
Jingbo Xu Nov. 4, 2020, 6:47 a.m. UTC | #7
On 11/2/20 11:28 PM, Mike Snitzer wrote:
> On Sun, Nov 01 2020 at 10:14pm -0500,
> JeffleXu <jefflexu@linux.alibaba.com> wrote:
>
>> On 10/27/20 2:53 AM, Mike Snitzer wrote:
>>> What you detailed there isn't properly modeling what it needs to.
>>> A given dm_target_io could result in quite a few bios (e.g. for
>>> dm-striped we clone each bio for each of N stripes).  So the fan-out,
>>> especially if then stacked on N layers of stacked devices, to all the
>>> various hctx at the lowest layers is like herding cats.
>>>
>>> But the recursion in block core's submit_bio path makes that challenging
>>> to say the least.  So much so that any solution related to enabling
>>> proper bio-based IO polling is going to need a pretty significant
>>> investment in fixing block core (storing __submit_bio()'s cookie during
>>> recursion, possibly storing to driver provided memory location,
>>> e.g. DM initialized bio->submit_cookie pointer to a blk_qc_t within a DM
>>> clone bio's per-bio-data).
>>>
>>> SO __submit_bio_noacct would become:
>>>
>>>     retp = &ret;
>>>     if (bio->submit_cookie)
>>>            retp = bio->submit_cookie;
>>>     *retp = __submit_bio(bio);
>> Sorry for the late reply. Exactly I missed this point before. IF you
>> have not started working on this, I'd like to try to implement this as
>> an RFC.
> I did start on this line of development but it needs quite a bit more
> work.  Even the pseudo code I provided above isn't useful in the context
> of DM clone bios that have their own per-bio-data to assist with this
> implementation.  Because the __submit_bio_noacct() recursive call
> drivers/md/dm.c:__split_and_process_bio() makes is supplying the
> original bio (modified to only point to remaining work).

Yes I noticed this recently. Since the depth-first splitting introduced 
in commit 18a25da84354

("dm: ensure bio submission follows a depth-first tree walk"), one bio 
to dm device can be

split into multiple bios to this dm device.

```

one bio to dm device (dm0) = one dm_io (to nvme0) + one bio to this same 
dm device (dm0)

```


In this case we need a mechanism to track all split sub-bios of the very 
beginning original bio.

I'm doubted if this should be implemented in block layer like:

```

struct bio {

     ...

     struct list_head  cookies;

};

```

After all it's only used by bio-based queue, or more specifically only 
dm device currently.


Another design I can come up with is to maintain a global data structure 
for the very beginning

original bio. Currently the blocking point is that now one original bio 
to the dm device (@bio of

dm_submit()) can correspond to multiple dm_io and thus we have nowhere 
to place the

@cookies list.


Now we have to maintain one data structure for every original bio, 
something like

```

struct dm_poll_instance {

     ...

     struct list_head cookies;

};

```


We can transfer this dm_poll_instance between split bios by 
bio->bi_private, like

```

dm_submit_bio(...) {

     struct dm_poll_instance *ins;

     if (bio->bi_private)

         ins = bio->bi_private;

     else {

         ins = alloc_poll_instance();

         bio->bi_private = ins;

     }

     ...

}

```
Mike Snitzer Nov. 4, 2020, 3:08 p.m. UTC | #8
On Wed, Nov 04 2020 at  1:47am -0500,
JeffleXu <jefflexu@linux.alibaba.com> wrote:

> 
> On 11/2/20 11:28 PM, Mike Snitzer wrote:
> >On Sun, Nov 01 2020 at 10:14pm -0500,
> >JeffleXu <jefflexu@linux.alibaba.com> wrote:
> >
> >>On 10/27/20 2:53 AM, Mike Snitzer wrote:
> >>>What you detailed there isn't properly modeling what it needs to.
> >>>A given dm_target_io could result in quite a few bios (e.g. for
> >>>dm-striped we clone each bio for each of N stripes).  So the fan-out,
> >>>especially if then stacked on N layers of stacked devices, to all the
> >>>various hctx at the lowest layers is like herding cats.
> >>>
> >>>But the recursion in block core's submit_bio path makes that challenging
> >>>to say the least.  So much so that any solution related to enabling
> >>>proper bio-based IO polling is going to need a pretty significant
> >>>investment in fixing block core (storing __submit_bio()'s cookie during
> >>>recursion, possibly storing to driver provided memory location,
> >>>e.g. DM initialized bio->submit_cookie pointer to a blk_qc_t within a DM
> >>>clone bio's per-bio-data).
> >>>
> >>>SO __submit_bio_noacct would become:
> >>>
> >>>    retp = &ret;
> >>>    if (bio->submit_cookie)
> >>>           retp = bio->submit_cookie;
> >>>    *retp = __submit_bio(bio);
> >>Sorry for the late reply. Exactly I missed this point before. IF you
> >>have not started working on this, I'd like to try to implement this as
> >>an RFC.
> >I did start on this line of development but it needs quite a bit more
> >work.  Even the pseudo code I provided above isn't useful in the context
> >of DM clone bios that have their own per-bio-data to assist with this
> >implementation.  Because the __submit_bio_noacct() recursive call
> >drivers/md/dm.c:__split_and_process_bio() makes is supplying the
> >original bio (modified to only point to remaining work).
> 
> Yes I noticed this recently. Since the depth-first splitting
> introduced in commit 18a25da84354
> 
> ("dm: ensure bio submission follows a depth-first tree walk"), one
> bio to dm device can be
> 
> split into multiple bios to this dm device.
> 
> ```
> 
> one bio to dm device (dm0) = one dm_io (to nvme0) + one bio to this
> same dm device (dm0)
> 
> ```
> 
> 
> In this case we need a mechanism to track all split sub-bios of the
> very beginning original bio.

Yes, splitting causes additional potential for sub-bios.  There are
other cases that cause a 1-to-many bio generation (e.g. dm-striped) or
splitting cases where a DM target makes use of dm_accept_partial_bio
(e.g. dm-snapshot, dm-integrity, dm-writecache, etc).


> I'm doubted if this should be implemented in block layer like:
> 
> ```
> 
> struct bio {
> 
>     ...
> 
>     struct list_head  cookies;
> 
> };
> 
> ```
> 
> After all it's only used by bio-based queue, or more specifically
> only dm device currently.

I do think this line of work really should be handled in block core
because I cannot see any reason why MD or bcache or whatever bio-based
device wouldn't want the ability to better support io_uring (with IO
poll).

> Another design I can come up with is to maintain a global data
> structure for the very beginning
> original bio. Currently the blocking point is that now one original
> bio to the dm device (@bio of dm_submit()) can correspond to multiple
> dm_io and thus we have nowhere to place the @cookies list.

Yes, and that will always be the case.  We need the design to handle an
arbitrary sprawl of splitting from a given bio.  The graph of bios
resulting from that fan-out needs to be walked at various levels -- be
it the top-level original bio's submit_bio() returned cookie or some
intermediate point in the chain of bios.

The problem is the lifetime of the data structure created for a given
split bio versus layering boundaries (that come from block core's
simplistic recursion via bio using submit_bio).

> Now we have to maintain one data structure for every original bio,
> something like
> 
> ```
> 
> struct dm_poll_instance {
> 
>     ...
> 
>     struct list_head cookies;
> 
> };
> 
> ```

I do think we need a hybrid where at the point of recursion we're able
to make the associated data structure available across the recursion
boundary so that modeling the association in a chain of split bios is
possible. (e.g. struct dm_poll_data or dm_poll_instance as you named it,
_but_ that struct definition would live in block core, but would be part
of per-bio-data; so 'struct blk_poll_data' is more logical name when
elevated to block core).

It _might_ be worthwhile to see if a new BIO_ flag could be added to
allow augmenting the bio_split + bio_chain pattern to also track this
additional case of carrying additional data per-bio while creating
bio-chains.  I may not be clear yet, said differently: augmenting
bio_chain to not only chain bios, but to _also_ thread/chain together
per-bio-data that lives within those chained bios.  SO you have the
chain of bios _and_ the chain of potentially opaque void * that happens
to point to a list head for a list of 'struct blk_poll_data'.

Does that make sense?

> We can transfer this dm_poll_instance between split bios by
> bio->bi_private, like
> 
> ```
> 
> dm_submit_bio(...) {
> 
>     struct dm_poll_instance *ins;
> 
>     if (bio->bi_private)
> 
>         ins = bio->bi_private;
> 
>     else {
> 
>         ins = alloc_poll_instance();
> 
>         bio->bi_private = ins;
> 
>     }
> 
>     ...
> 
> }
> 
> ```

Sadly, we cannot (ab)use bi_private for this given its (ab)used via the
bio_chain() interface.  It's almost like we need to add a new pointer in
the bio that isn't left for block core to hijack.

There is the well-worn pattern of saving off the original bi_private,
hooking a new endio method and then when that endio is called restoring
bi_private but we really want to avoid excessive indirect function calls
for this usecase.  The entire point of implementing blk_poll support is
for performance after all.

Mike
Jingbo Xu Nov. 6, 2020, 2:51 a.m. UTC | #9
On 11/4/20 11:08 PM, Mike Snitzer wrote:
>> I'm doubted if this should be implemented in block layer like:
>>
>> ```
>>
>> struct bio {
>>
>>      ...
>>
>>      struct list_head  cookies;
>>
>> };
>>
>> ```
>>
>> After all it's only used by bio-based queue, or more specifically
>> only dm device currently.
> I do think this line of work really should be handled in block core
> because I cannot see any reason why MD or bcache or whatever bio-based
> device wouldn't want the ability to better support io_uring (with IO
> poll).
>
>> Another design I can come up with is to maintain a global data
>> structure for the very beginning
>> original bio. Currently the blocking point is that now one original
>> bio to the dm device (@bio of dm_submit()) can correspond to multiple
>> dm_io and thus we have nowhere to place the @cookies list.
> Yes, and that will always be the case.  We need the design to handle an
> arbitrary sprawl of splitting from a given bio.  The graph of bios
> resulting from that fan-out needs to be walked at various levels -- be
> it the top-level original bio's submit_bio() returned cookie or some
> intermediate point in the chain of bios.
>
> The problem is the lifetime of the data structure created for a given
> split bio versus layering boundaries (that come from block core's
> simplistic recursion via bio using submit_bio).
>
>> Now we have to maintain one data structure for every original bio,
>> something like
>>
>> ```
>>
>> struct dm_poll_instance {
>>
>>      ...
>>
>>      struct list_head cookies;
>>
>> };
>>
>> ```
> I do think we need a hybrid where at the point of recursion we're able
> to make the associated data structure available across the recursion
> boundary so that modeling the association in a chain of split bios is
> possible. (e.g. struct dm_poll_data or dm_poll_instance as you named it,
> _but_ that struct definition would live in block core, but would be part
> of per-bio-data; so 'struct blk_poll_data' is more logical name when
> elevated to block core).
>
> It _might_ be worthwhile to see if a new BIO_ flag could be added to
> allow augmenting the bio_split + bio_chain pattern to also track this
> additional case of carrying additional data per-bio while creating
> bio-chains.  I may not be clear yet, said differently: augmenting
> bio_chain to not only chain bios, but to _also_ thread/chain together
> per-bio-data that lives within those chained bios.  SO you have the
> chain of bios _and_ the chain of potentially opaque void * that happens
> to point to a list head for a list of 'struct blk_poll_data'.
>
> Does that make sense?


I'm doubted if it really makes sense to maintain a *complete* bio chain 
across the recursive

call boundary.


Considering the following device stack:

```

                                   dm0

         dm1                   dm2           dm3

nvme0  nvme1          ....               ....

```


We have the following bio graph (please let me know if it's wrong or the 
image can't display)


For example, for dm1 there are three bios containing valid cookie in the 
end, that is

bio 9/10/11. We only need to insert the corresponding blk_poll_data 
(containing

request_queue, cookie, etc) of these three bios into the very beginning 
original

bio (that is bio0). Of course we can track all the sub-bios down through 
the device

stack, layer by layer, e.g.,

- get bio 1/2/3 from bio 0

- get bio 4 from bio 3

- finally get bio 9 from bio 4

But I'm doubted if it's overkill to just implement the iopoll.


Another issue still unclear is that, if we should implement the iopoll 
in a recursive way.

In a recursive way, to poll dm 0, we should poll all its sub-devices, 
that is, bio 4/5/7/8.

Oppositely we can insert only the bottom bio (bio 9/10/11 which have 
valid cookie) at

the very beginning (at submit_bio() phase), and to poll dm 0, we only 
need to poll bio

9/10/11.


To implement this non-recursive design, we can add a field in struct bio

```

struct bio {

     ...

     struct bio *orig;

}

```

@orig points to the original bio inputted into submit_bio(), that is, bio 0.


@orig field is transmitted through bio splitting.

```

bio_split()

     split->orig = bio->orig ? : bio


dm.c: __clone_and_map_data_bio

     clone->orig = bio->orig ? : bio

```


Finally bio 9/10/11 can be inserted into bio 0.

```

blk-mq.c: blk_mq_submit_bio

     if (bio->orig)

         init blk_poll_data and insert it into bio->orig's @cookies list

```

>
>> We can transfer this dm_poll_instance between split bios by
>> bio->bi_private, like
>>
>> ```
>>
>> dm_submit_bio(...) {
>>
>>      struct dm_poll_instance *ins;
>>
>>      if (bio->bi_private)
>>
>>          ins = bio->bi_private;
>>
>>      else {
>>
>>          ins = alloc_poll_instance();
>>
>>          bio->bi_private = ins;
>>
>>      }
>>
>>      ...
>>
>> }
>>
>> ```
> Sadly, we cannot (ab)use bi_private for this given its (ab)used via the
> bio_chain() interface.  It's almost like we need to add a new pointer in
> the bio that isn't left for block core to hijack.
>
> There is the well-worn pattern of saving off the original bi_private,
> hooking a new endio method and then when that endio is called restoring
> bi_private but we really want to avoid excessive indirect function calls
> for this usecase.  The entire point of implementing blk_poll support is
> for performance after all.
>
> Mike
>
> --
> dm-devel mailing list
> dm-devel@redhat.com
> https://www.redhat.com/mailman/listinfo/dm-devel
Mike Snitzer Nov. 6, 2020, 5:45 p.m. UTC | #10
On Thu, Nov 05 2020 at  9:51pm -0500,
JeffleXu <jefflexu@linux.alibaba.com> wrote:

> 
> On 11/4/20 11:08 PM, Mike Snitzer wrote:
> >>I'm doubted if this should be implemented in block layer like:
> >>
> >>```
> >>
> >>struct bio {
> >>
> >>     ...
> >>
> >>     struct list_head  cookies;
> >>
> >>};
> >>
> >>```
> >>
> >>After all it's only used by bio-based queue, or more specifically
> >>only dm device currently.
> >I do think this line of work really should be handled in block core
> >because I cannot see any reason why MD or bcache or whatever bio-based
> >device wouldn't want the ability to better support io_uring (with IO
> >poll).
> >
> >>Another design I can come up with is to maintain a global data
> >>structure for the very beginning
> >>original bio. Currently the blocking point is that now one original
> >>bio to the dm device (@bio of dm_submit()) can correspond to multiple
> >>dm_io and thus we have nowhere to place the @cookies list.
> >Yes, and that will always be the case.  We need the design to handle an
> >arbitrary sprawl of splitting from a given bio.  The graph of bios
> >resulting from that fan-out needs to be walked at various levels -- be
> >it the top-level original bio's submit_bio() returned cookie or some
> >intermediate point in the chain of bios.
> >
> >The problem is the lifetime of the data structure created for a given
> >split bio versus layering boundaries (that come from block core's
> >simplistic recursion via bio using submit_bio).
> >
> >>Now we have to maintain one data structure for every original bio,
> >>something like
> >>
> >>```
> >>
> >>struct dm_poll_instance {
> >>
> >>     ...
> >>
> >>     struct list_head cookies;
> >>
> >>};
> >>
> >>```
> >I do think we need a hybrid where at the point of recursion we're able
> >to make the associated data structure available across the recursion
> >boundary so that modeling the association in a chain of split bios is
> >possible. (e.g. struct dm_poll_data or dm_poll_instance as you named it,
> >_but_ that struct definition would live in block core, but would be part
> >of per-bio-data; so 'struct blk_poll_data' is more logical name when
> >elevated to block core).
> >
> >It _might_ be worthwhile to see if a new BIO_ flag could be added to
> >allow augmenting the bio_split + bio_chain pattern to also track this
> >additional case of carrying additional data per-bio while creating
> >bio-chains.  I may not be clear yet, said differently: augmenting
> >bio_chain to not only chain bios, but to _also_ thread/chain together
> >per-bio-data that lives within those chained bios.  SO you have the
> >chain of bios _and_ the chain of potentially opaque void * that happens
> >to point to a list head for a list of 'struct blk_poll_data'.
> >
> >Does that make sense?
> 
> 
> I'm doubted if it really makes sense to maintain a *complete* bio
> chain across the recursive
> 
> call boundary.
> 
> 
> Considering the following device stack:
> 
> ```
> 
>                                   dm0
> 
>         dm1                   dm2           dm3
> 
> nvme0  nvme1          ....               ....
> 
> ```
> 
> 
> We have the following bio graph (please let me know if it's wrong or
> the image can't display)
> 
> 
> For example, for dm1 there are three bios containing valid cookie in
> the end, that is
> 
> bio 9/10/11. We only need to insert the corresponding blk_poll_data
> (containing
> 
> request_queue, cookie, etc) of these three bios into the very
> beginning original
> 
> bio (that is bio0). Of course we can track all the sub-bios down
> through the device
> 
> stack, layer by layer, e.g.,
> 
> - get bio 1/2/3 from bio 0
> 
> - get bio 4 from bio 3
> 
> - finally get bio 9 from bio 4
> 
> But I'm doubted if it's overkill to just implement the iopoll.
> 
> 
> Another issue still unclear is that, if we should implement the
> iopoll in a recursive way.
> 
> In a recursive way, to poll dm 0, we should poll all its
> sub-devices, that is, bio 4/5/7/8.
> 
> Oppositely we can insert only the bottom bio (bio 9/10/11 which have
> valid cookie) at
> 
> the very beginning (at submit_bio() phase), and to poll dm 0, we
> only need to poll bio 9/10/11.

I feel we need the ability to walk the entire graph and call down to
next level.  The lowest level would be what you call a "valid cookie"
that blk-mq returned.  But the preceding cookies need to be made valid
and used when walking the graph (from highest to lowest) and calling
down to the underlying layers.

> 
> 
> To implement this non-recursive design, we can add a field in struct bio
> 
> ```
> 
> struct bio {
> 
>     ...
> 
>     struct bio *orig;
> 
> }
> 
> ```
> 
> @orig points to the original bio inputted into submit_bio(), that is, bio 0.
> 
> 
> @orig field is transmitted through bio splitting.
> 
> ```
> 
> bio_split()
> 
>     split->orig = bio->orig ? : bio
> 
> 
> dm.c: __clone_and_map_data_bio
> 
>     clone->orig = bio->orig ? : bio
> 
> ```
> 
> 
> Finally bio 9/10/11 can be inserted into bio 0.
> 
> ```
> 
> blk-mq.c: blk_mq_submit_bio
> 
>     if (bio->orig)
> 
>         init blk_poll_data and insert it into bio->orig's @cookies list
> 
> ```

If you feel that is doable: certainly give it a shot.

But it is not clear to me how you intend to translate from cookie passed
in to ->blk_poll hook (returned from submit_bio) to the saved off
bio->orig.

What is your cookie strategy in this non-recursive implementation?  What
will you be returning?  Where will you be storing it?

Mike
Jingbo Xu Nov. 8, 2020, 1:09 a.m. UTC | #11
On 11/7/20 1:45 AM, Mike Snitzer wrote:
> On Thu, Nov 05 2020 at  9:51pm -0500,
> JeffleXu <jefflexu@linux.alibaba.com> wrote:
>
>> On 11/4/20 11:08 PM, Mike Snitzer wrote:
>>>> I'm doubted if this should be implemented in block layer like:
>>>>
>>>> ```
>>>>
>>>> struct bio {
>>>>
>>>>      ...
>>>>
>>>>      struct list_head  cookies;
>>>>
>>>> };
>>>>
>>>> ```
>>>>
>>>> After all it's only used by bio-based queue, or more specifically
>>>> only dm device currently.
>>> I do think this line of work really should be handled in block core
>>> because I cannot see any reason why MD or bcache or whatever bio-based
>>> device wouldn't want the ability to better support io_uring (with IO
>>> poll).
>>>
>>>> Another design I can come up with is to maintain a global data
>>>> structure for the very beginning
>>>> original bio. Currently the blocking point is that now one original
>>>> bio to the dm device (@bio of dm_submit()) can correspond to multiple
>>>> dm_io and thus we have nowhere to place the @cookies list.
>>> Yes, and that will always be the case.  We need the design to handle an
>>> arbitrary sprawl of splitting from a given bio.  The graph of bios
>>> resulting from that fan-out needs to be walked at various levels -- be
>>> it the top-level original bio's submit_bio() returned cookie or some
>>> intermediate point in the chain of bios.
>>>
>>> The problem is the lifetime of the data structure created for a given
>>> split bio versus layering boundaries (that come from block core's
>>> simplistic recursion via bio using submit_bio).
>>>
>>>> Now we have to maintain one data structure for every original bio,
>>>> something like
>>>>
>>>> ```
>>>>
>>>> struct dm_poll_instance {
>>>>
>>>>      ...
>>>>
>>>>      struct list_head cookies;
>>>>
>>>> };
>>>>
>>>> ```
>>> I do think we need a hybrid where at the point of recursion we're able
>>> to make the associated data structure available across the recursion
>>> boundary so that modeling the association in a chain of split bios is
>>> possible. (e.g. struct dm_poll_data or dm_poll_instance as you named it,
>>> _but_ that struct definition would live in block core, but would be part
>>> of per-bio-data; so 'struct blk_poll_data' is more logical name when
>>> elevated to block core).
>>>
>>> It _might_ be worthwhile to see if a new BIO_ flag could be added to
>>> allow augmenting the bio_split + bio_chain pattern to also track this
>>> additional case of carrying additional data per-bio while creating
>>> bio-chains.  I may not be clear yet, said differently: augmenting
>>> bio_chain to not only chain bios, but to _also_ thread/chain together
>>> per-bio-data that lives within those chained bios.  SO you have the
>>> chain of bios _and_ the chain of potentially opaque void * that happens
>>> to point to a list head for a list of 'struct blk_poll_data'.
>>>
>>> Does that make sense?
>>
>> I'm doubted if it really makes sense to maintain a *complete* bio
>> chain across the recursive
>>
>> call boundary.
>>
>>
>> Considering the following device stack:
>>
>> ```
>>
>>                                    dm0
>>
>>          dm1                   dm2           dm3
>>
>> nvme0  nvme1          ....               ....
>>
>> ```
>>
>>
>> We have the following bio graph (please let me know if it's wrong or
>> the image can't display)
>>
>>
>> For example, for dm1 there are three bios containing valid cookie in
>> the end, that is
>>
>> bio 9/10/11. We only need to insert the corresponding blk_poll_data
>> (containing
>>
>> request_queue, cookie, etc) of these three bios into the very
>> beginning original
>>
>> bio (that is bio0). Of course we can track all the sub-bios down
>> through the device
>>
>> stack, layer by layer, e.g.,
>>
>> - get bio 1/2/3 from bio 0
>>
>> - get bio 4 from bio 3
>>
>> - finally get bio 9 from bio 4
>>
>> But I'm doubted if it's overkill to just implement the iopoll.
>>
>>
>> Another issue still unclear is that, if we should implement the
>> iopoll in a recursive way.
>>
>> In a recursive way, to poll dm 0, we should poll all its
>> sub-devices, that is, bio 4/5/7/8.
>>
>> Oppositely we can insert only the bottom bio (bio 9/10/11 which have
>> valid cookie) at
>>
>> the very beginning (at submit_bio() phase), and to poll dm 0, we
>> only need to poll bio 9/10/11.
> I feel we need the ability to walk the entire graph and call down to
> next level.  The lowest level would be what you call a "valid cookie"
> that blk-mq returned.  But the preceding cookies need to be made valid
> and used when walking the graph (from highest to lowest) and calling
> down to the underlying layers.
>
>>
>> To implement this non-recursive design, we can add a field in struct bio
>>
>> ```
>>
>> struct bio {
>>
>>      ...
>>
>>      struct bio *orig;
>>
>> }
>>
>> ```
>>
>> @orig points to the original bio inputted into submit_bio(), that is, bio 0.
>>
>>
>> @orig field is transmitted through bio splitting.
>>
>> ```
>>
>> bio_split()
>>
>>      split->orig = bio->orig ? : bio
>>
>>
>> dm.c: __clone_and_map_data_bio
>>
>>      clone->orig = bio->orig ? : bio
>>
>> ```
>>
>>
>> Finally bio 9/10/11 can be inserted into bio 0.
>>
>> ```
>>
>> blk-mq.c: blk_mq_submit_bio
>>
>>      if (bio->orig)
>>
>>          init blk_poll_data and insert it into bio->orig's @cookies list
>>
>> ```
> If you feel that is doable: certainly give it a shot.

Make sense.

> But it is not clear to me how you intend to translate from cookie passed
> in to ->blk_poll hook (returned from submit_bio) to the saved off
> bio->orig.
>
> What is your cookie strategy in this non-recursive implementation?  What
> will you be returning?  Where will you be storing it?

Actually I think it's a common issue to design the cookie returned by 
submit_bio() whenever

it's implemented in a recursive or non-recursive way. After all you need 
to translate the

returned cookie to the original bio even if it's implemented in a 
recursive way as you

described. Or how could you walk through the bio graph when the returned 
cookie is

only 'unsigned int' type?


How about this:


```

typedef uintptr_t blk_qc_t;

```


or something like union

```

typedef union {

     unsigned int cookie;

     struct bio *orig; // the original bio of submit_bio()

} blk_qc_t;

```


When serving for blk-mq, the integer part of blk_qc_t is used and it 
stores the valid cookie,

while it stores a pointer to the original bio when serving bio-based device.


By the way, would you mind sharing your plan and progress on this work, 
I mean, supporting

iopoll for dm device. To be honest, I don't want to re-invent the wheel 
as you have started on

this work, but I do want to participate in somehow. Please let me know 
if there's something

I could do here.
Mike Snitzer Nov. 9, 2020, 6:15 p.m. UTC | #12
On Sat, Nov 07 2020 at  8:09pm -0500,
JeffleXu <jefflexu@linux.alibaba.com> wrote:

> 
> On 11/7/20 1:45 AM, Mike Snitzer wrote:
> >On Thu, Nov 05 2020 at  9:51pm -0500,
> >JeffleXu <jefflexu@linux.alibaba.com> wrote:
> >
> >>blk-mq.c: blk_mq_submit_bio
> >>
> >>     if (bio->orig)
> >>
> >>         init blk_poll_data and insert it into bio->orig's @cookies list
> >>
> >>```
> >If you feel that is doable: certainly give it a shot.
> 
> Make sense.
> 
> >But it is not clear to me how you intend to translate from cookie passed
> >in to ->blk_poll hook (returned from submit_bio) to the saved off
> >bio->orig.
> >
> >What is your cookie strategy in this non-recursive implementation?  What
> >will you be returning?  Where will you be storing it?
> 
> Actually I think it's a common issue to design the cookie returned
> by submit_bio() whenever it's implemented in a recursive or
> non-recursive way. After all you need to translate the returned cookie
> to the original bio even if it's implemented in a recursive way as you
> described.

Yes.

> Or how could you walk through the bio graph when the returned cookie
> is only 'unsigned int' type?

You could store a pointer (to orig bio, or per-bio-data stored in split
clone, or whatever makes sense for how you're associating poll data with
split bios) in 'unsigned int' type but that's very clumsy -- as I
discovered when trying to do that ;)

> How about this:
> 
> 
> ```
> 
> typedef uintptr_t blk_qc_t;
> 
> ```
> 
> 
> or something like union
> 
> ```
> 
> typedef union {
> 
>     unsigned int cookie;
> 
>     struct bio *orig; // the original bio of submit_bio()
> 
> } blk_qc_t;
> 
> ```
> When serving for blk-mq, the integer part of blk_qc_t is used and it
> stores the valid cookie, while it stores a pointer to the original bio
> when serving bio-based device.

Union looks ideal, but maybe make it a void *?  Initial implementation
may store bio pointer but it _could_ evolve to be 'struct blk_poll_data
*' or whatever.  But not a big deal either way.

> By the way, would you mind sharing your plan and progress on this
> work, I mean, supporting iopoll for dm device. To be honest, I don't
> want to re-invent the wheel as you have started on this work, but I do
> want to participate in somehow. Please let me know if there's something
> I could do here.

I thought I said as much before but: I really don't have anything
meaningful to share.  My early exploration was more rough pseudo code
that served to try to get my arms around the scope of the design
problem.

Please feel free to own all aspects of this work.

I will gladly help analyze/test/refine your approach once you reach the
point of sharing RFC patches.

Thanks,
Mike
Jingbo Xu Nov. 10, 2020, 1:43 a.m. UTC | #13
On 11/10/20 2:15 AM, Mike Snitzer wrote:
> On Sat, Nov 07 2020 at  8:09pm -0500,
> JeffleXu <jefflexu@linux.alibaba.com> wrote:
>
>> On 11/7/20 1:45 AM, Mike Snitzer wrote:
>>> On Thu, Nov 05 2020 at  9:51pm -0500,
>>> JeffleXu <jefflexu@linux.alibaba.com> wrote:
>>>
>>>> blk-mq.c: blk_mq_submit_bio
>>>>
>>>>      if (bio->orig)
>>>>
>>>>          init blk_poll_data and insert it into bio->orig's @cookies list
>>>>
>>>> ```
>>> If you feel that is doable: certainly give it a shot.
>> Make sense.
>>
>>> But it is not clear to me how you intend to translate from cookie passed
>>> in to ->blk_poll hook (returned from submit_bio) to the saved off
>>> bio->orig.
>>>
>>> What is your cookie strategy in this non-recursive implementation?  What
>>> will you be returning?  Where will you be storing it?
>> Actually I think it's a common issue to design the cookie returned
>> by submit_bio() whenever it's implemented in a recursive or
>> non-recursive way. After all you need to translate the returned cookie
>> to the original bio even if it's implemented in a recursive way as you
>> described.
> Yes.
>
>> Or how could you walk through the bio graph when the returned cookie
>> is only 'unsigned int' type?
> You could store a pointer (to orig bio, or per-bio-data stored in split
> clone, or whatever makes sense for how you're associating poll data with
> split bios) in 'unsigned int' type but that's very clumsy -- as I
> discovered when trying to do that ;)

Fine, that's also what I thought.


>
>> How about this:
>>
>>
>> ```
>>
>> typedef uintptr_t blk_qc_t;
>>
>> ```
>>
>>
>> or something like union
>>
>> ```
>>
>> typedef union {
>>
>>      unsigned int cookie;
>>
>>      struct bio *orig; // the original bio of submit_bio()
>>
>> } blk_qc_t;
>>
>> ```
>> When serving for blk-mq, the integer part of blk_qc_t is used and it
>> stores the valid cookie, while it stores a pointer to the original bio
>> when serving bio-based device.
> Union looks ideal, but maybe make it a void *?  Initial implementation
> may store bio pointer but it _could_ evolve to be 'struct blk_poll_data
> *' or whatever.  But not a big deal either way.

Of course you could define blk_qc_t as a pointer type (e.g. void *), or 
integer type (e.g. unsigned int),

but you will get a gcc warning in each case. For example, if it's 
defined as "void *", then gcc will warn

'return makes pointer from integer without a cast' in request_to_qc_t() 
as cookie returned by mq

device is actually integer type. Vice versa. So we need a type cast in 
request_to_qc_t().


The union is also not perfect though, as we also need type cast.


So both these two designs are quite equal to me, though 'void *' may be 
more concise though.

But one annoying issue is that the request_to_qc_t() and blk_poll() 
should be revised somehow

if it's defined as a union or 'void *'. For example if it's defined as 
'void *', then in request_to_qc_t()

integer need to be cast to pointer and in blk_poll() pointer need to be 
cast to integer.


The benefit of uintptr_t is that, it's still integer type which means 
the original request_to_qc_t()/

blk_poll() routine for blk-mq can retain unchanged, while the size of 
the data type can be large

enough to contain a pointer type. So we only need  type cast in 
bio-based routine, while keeping

the request-based routine unchanged.


And yes it's a trivial issue though.


>> By the way, would you mind sharing your plan and progress on this
>> work, I mean, supporting iopoll for dm device. To be honest, I don't
>> want to re-invent the wheel as you have started on this work, but I do
>> want to participate in somehow. Please let me know if there's something
>> I could do here.
> I thought I said as much before but: I really don't have anything
> meaningful to share.  My early exploration was more rough pseudo code
> that served to try to get my arms around the scope of the design
> problem.
>
> Please feel free to own all aspects of this work.
>
> I will gladly help analyze/test/refine your approach once you reach the
> point of sharing RFC patches.

Got it. Thanks for that. Really. I will continue working on this.