mbox series

[v2,0/6] dm: support IO polling for bio-based dm device

Message ID 20210125121340.70459-1-jefflexu@linux.alibaba.com (mailing list archive)
Headers show
Series dm: support IO polling for bio-based dm device | expand

Message

Jingbo Xu Jan. 25, 2021, 12:13 p.m. UTC
Since currently we have no simple but efficient way to implement the
bio-based IO polling in the split-bio tracking style, this patch set
turns to the original implementation mechanism that iterates and
polls all underlying hw queues in polling mode. One optimization is
introduced to mitigate the race of one hw queue among multiple polling
instances.

I'm still open to the split bio tracking mechanism, if there's
reasonable way to implement it.


[Performance Test]
The performance is tested by fio (engine=io_uring) 4k randread on
dm-linear device. The dm-linear device is built upon nvme devices,
and every nvme device has one polling hw queue (nvme.poll_queues=1).

Test Case		    | IOPS in IRQ mode | IOPS in polling mode | Diff
			    | (hipri=0)	       | (hipri=1)	      |
--------------------------- | ---------------- | -------------------- | ----
3 target nvme, num_jobs = 1 | 198k 	       | 276k		      | ~40%
3 target nvme, num_jobs = 3 | 608k 	       | 705k		      | ~16%
6 target nvme, num_jobs = 6 | 1197k 	       | 1347k		      | ~13%
3 target nvme, num_jobs = 6 | 1285k 	       | 1293k		      | ~0%

As the number of polling instances (num_jobs) increases, the
performance improvement decreases, though it's still positive
compared to the IRQ mode.

[Optimization]
To mitigate the race when iterating all the underlying hw queues, one
flag is maintained on a per-hw-queue basis. This flag is used to
indicate whether this polling hw queue currently being polled on or
not. Every polling hw queue is exclusive to one polling instance, i.e.,
the polling instance will skip this polling hw queue if this hw queue
currently is being polled by another polling instance, and start
polling on the next hw queue.

This per-hw-queue flag map is currently maintained in dm layer. In
the table load phase, a table describing all underlying polling hw
queues is built and stored in 'struct dm_table'. It is safe when
reloading the mapping table.


changes since v1:
- patch 1,2,4 is the same as v1 and have already been reviewed
- patch 3 is refactored a bit on the basis of suggestions from
Mike Snitzer.
- patch 5 is newly added and introduces one new queue flag
representing if the queue is capable of IO polling. This mainly
simplifies the logic in queue_poll_store().
- patch 6 implements the core mechanism supporting IO polling.
The sanity check checking if the dm device supports IO polling is
also folded into this patch, and the queue flag will be cleared if
it doesn't support, in case of table reloading.


Jeffle Xu (6):
  block: move definition of blk_qc_t to types.h
  block: add queue_to_disk() to get gendisk from request_queue
  block: add iopoll method to support bio-based IO polling
  dm: always return BLK_QC_T_NONE for bio-based device
  block: add QUEUE_FLAG_POLL_CAP flag
  dm: support IO polling for bio-based dm device

 block/blk-core.c             |  76 +++++++++++++++++++++
 block/blk-mq.c               |  76 +++------------------
 block/blk-sysfs.c            |   3 +-
 drivers/md/dm-core.h         |  21 ++++++
 drivers/md/dm-table.c        | 127 +++++++++++++++++++++++++++++++++++
 drivers/md/dm.c              |  61 ++++++++++++-----
 include/linux/blk-mq.h       |   3 +
 include/linux/blk_types.h    |   2 +-
 include/linux/blkdev.h       |   9 +++
 include/linux/fs.h           |   2 +-
 include/linux/types.h        |   3 +
 include/trace/events/kyber.h |   6 +-
 12 files changed, 302 insertions(+), 87 deletions(-)

Comments

Mike Snitzer Jan. 27, 2021, 5:19 p.m. UTC | #1
On Mon, Jan 25 2021 at  7:13am -0500,
Jeffle Xu <jefflexu@linux.alibaba.com> wrote:

> Since currently we have no simple but efficient way to implement the
> bio-based IO polling in the split-bio tracking style, this patch set
> turns to the original implementation mechanism that iterates and
> polls all underlying hw queues in polling mode. One optimization is
> introduced to mitigate the race of one hw queue among multiple polling
> instances.
> 
> I'm still open to the split bio tracking mechanism, if there's
> reasonable way to implement it.
> 
> 
> [Performance Test]
> The performance is tested by fio (engine=io_uring) 4k randread on
> dm-linear device. The dm-linear device is built upon nvme devices,
> and every nvme device has one polling hw queue (nvme.poll_queues=1).
> 
> Test Case		    | IOPS in IRQ mode | IOPS in polling mode | Diff
> 			    | (hipri=0)	       | (hipri=1)	      |
> --------------------------- | ---------------- | -------------------- | ----
> 3 target nvme, num_jobs = 1 | 198k 	       | 276k		      | ~40%
> 3 target nvme, num_jobs = 3 | 608k 	       | 705k		      | ~16%
> 6 target nvme, num_jobs = 6 | 1197k 	       | 1347k		      | ~13%
> 3 target nvme, num_jobs = 6 | 1285k 	       | 1293k		      | ~0%
> 
> As the number of polling instances (num_jobs) increases, the
> performance improvement decreases, though it's still positive
> compared to the IRQ mode.

I think there is serious room for improvement for DM's implementation;
but the block changes for this are all we'd need for DM in the longrun
anyway (famous last words). So on a block interface level I'm OK with
block patches 1-3.

I don't see why patch 5 is needed (said the same in reply to it; but I
just saw your reason below..).

Anyway, I can pick up DM patches 4 and 6 via linux-dm.git if Jens picks
up patches 1-3. Jens, what do you think?

> [Optimization]
> To mitigate the race when iterating all the underlying hw queues, one
> flag is maintained on a per-hw-queue basis. This flag is used to
> indicate whether this polling hw queue currently being polled on or
> not. Every polling hw queue is exclusive to one polling instance, i.e.,
> the polling instance will skip this polling hw queue if this hw queue
> currently is being polled by another polling instance, and start
> polling on the next hw queue.
> 
> This per-hw-queue flag map is currently maintained in dm layer. In
> the table load phase, a table describing all underlying polling hw
> queues is built and stored in 'struct dm_table'. It is safe when
> reloading the mapping table.
> 
> 
> changes since v1:
> - patch 1,2,4 is the same as v1 and have already been reviewed
> - patch 3 is refactored a bit on the basis of suggestions from
> Mike Snitzer.
> - patch 5 is newly added and introduces one new queue flag
> representing if the queue is capable of IO polling. This mainly
> simplifies the logic in queue_poll_store().

Ah OK, don't see why we want to eat a queue flag for that though!

> - patch 6 implements the core mechanism supporting IO polling.
> The sanity check checking if the dm device supports IO polling is
> also folded into this patch, and the queue flag will be cleared if
> it doesn't support, in case of table reloading.

Thanks,
Mike

--
dm-devel mailing list
dm-devel@redhat.com
https://www.redhat.com/mailman/listinfo/dm-devel
Jingbo Xu Jan. 28, 2021, 3:06 a.m. UTC | #2
On 1/28/21 1:19 AM, Mike Snitzer wrote:
> On Mon, Jan 25 2021 at  7:13am -0500,
> Jeffle Xu <jefflexu@linux.alibaba.com> wrote:
> 
>> Since currently we have no simple but efficient way to implement the
>> bio-based IO polling in the split-bio tracking style, this patch set
>> turns to the original implementation mechanism that iterates and
>> polls all underlying hw queues in polling mode. One optimization is
>> introduced to mitigate the race of one hw queue among multiple polling
>> instances.
>>
>> I'm still open to the split bio tracking mechanism, if there's
>> reasonable way to implement it.
>>
>>
>> [Performance Test]
>> The performance is tested by fio (engine=io_uring) 4k randread on
>> dm-linear device. The dm-linear device is built upon nvme devices,
>> and every nvme device has one polling hw queue (nvme.poll_queues=1).
>>
>> Test Case		    | IOPS in IRQ mode | IOPS in polling mode | Diff
>> 			    | (hipri=0)	       | (hipri=1)	      |
>> --------------------------- | ---------------- | -------------------- | ----
>> 3 target nvme, num_jobs = 1 | 198k 	       | 276k		      | ~40%
>> 3 target nvme, num_jobs = 3 | 608k 	       | 705k		      | ~16%
>> 6 target nvme, num_jobs = 6 | 1197k 	       | 1347k		      | ~13%
>> 3 target nvme, num_jobs = 6 | 1285k 	       | 1293k		      | ~0%
>>
>> As the number of polling instances (num_jobs) increases, the
>> performance improvement decreases, though it's still positive
>> compared to the IRQ mode.
> 
> I think there is serious room for improvement for DM's implementation;
> but the block changes for this are all we'd need for DM in the longrun
> anyway (famous last words).

Agreed.


> So on a block interface level I'm OK with
> block patches 1-3.
> 
> I don't see why patch 5 is needed (said the same in reply to it; but I
> just saw your reason below..).
> 
> Anyway, I can pick up DM patches 4 and 6 via linux-dm.git if Jens picks
> up patches 1-3. Jens, what do you think?

cc Jens.

Also I will send a new version later, maybe some refactor on patch5 and
some typo modifications.

> 
>> [Optimization]
>> To mitigate the race when iterating all the underlying hw queues, one
>> flag is maintained on a per-hw-queue basis. This flag is used to
>> indicate whether this polling hw queue currently being polled on or
>> not. Every polling hw queue is exclusive to one polling instance, i.e.,
>> the polling instance will skip this polling hw queue if this hw queue
>> currently is being polled by another polling instance, and start
>> polling on the next hw queue.
>>
>> This per-hw-queue flag map is currently maintained in dm layer. In
>> the table load phase, a table describing all underlying polling hw
>> queues is built and stored in 'struct dm_table'. It is safe when
>> reloading the mapping table.
>>
>>
>> changes since v1:
>> - patch 1,2,4 is the same as v1 and have already been reviewed
>> - patch 3 is refactored a bit on the basis of suggestions from
>> Mike Snitzer.
>> - patch 5 is newly added and introduces one new queue flag
>> representing if the queue is capable of IO polling. This mainly
>> simplifies the logic in queue_poll_store().
> 
> Ah OK, don't see why we want to eat a queue flag for that though!
> 
>> - patch 6 implements the core mechanism supporting IO polling.
>> The sanity check checking if the dm device supports IO polling is
>> also folded into this patch, and the queue flag will be cleared if
>> it doesn't support, in case of table reloading.
> 
> Thanks,
> Mike
>
Mike Snitzer Jan. 31, 2021, 4:26 p.m. UTC | #3
On Wed, Jan 27 2021 at 10:06pm -0500,
JeffleXu <jefflexu@linux.alibaba.com> wrote:

> 
> 
> On 1/28/21 1:19 AM, Mike Snitzer wrote:
> > On Mon, Jan 25 2021 at  7:13am -0500,
> > Jeffle Xu <jefflexu@linux.alibaba.com> wrote:
> > 
> >> Since currently we have no simple but efficient way to implement the
> >> bio-based IO polling in the split-bio tracking style, this patch set
> >> turns to the original implementation mechanism that iterates and
> >> polls all underlying hw queues in polling mode. One optimization is
> >> introduced to mitigate the race of one hw queue among multiple polling
> >> instances.
> >>
> >> I'm still open to the split bio tracking mechanism, if there's
> >> reasonable way to implement it.
> >>
> >>
> >> [Performance Test]
> >> The performance is tested by fio (engine=io_uring) 4k randread on
> >> dm-linear device. The dm-linear device is built upon nvme devices,
> >> and every nvme device has one polling hw queue (nvme.poll_queues=1).
> >>
> >> Test Case		    | IOPS in IRQ mode | IOPS in polling mode | Diff
> >> 			    | (hipri=0)	       | (hipri=1)	      |
> >> --------------------------- | ---------------- | -------------------- | ----
> >> 3 target nvme, num_jobs = 1 | 198k 	       | 276k		      | ~40%
> >> 3 target nvme, num_jobs = 3 | 608k 	       | 705k		      | ~16%
> >> 6 target nvme, num_jobs = 6 | 1197k 	       | 1347k		      | ~13%
> >> 3 target nvme, num_jobs = 6 | 1285k 	       | 1293k		      | ~0%
> >>
> >> As the number of polling instances (num_jobs) increases, the
> >> performance improvement decreases, though it's still positive
> >> compared to the IRQ mode.
> > 
> > I think there is serious room for improvement for DM's implementation;
> > but the block changes for this are all we'd need for DM in the longrun
> > anyway (famous last words).
> 
> Agreed.
> 
> 
> > So on a block interface level I'm OK with
> > block patches 1-3.
> > 
> > I don't see why patch 5 is needed (said the same in reply to it; but I
> > just saw your reason below..).
> > 
> > Anyway, I can pick up DM patches 4 and 6 via linux-dm.git if Jens picks
> > up patches 1-3. Jens, what do you think?
> 
> cc Jens.
> 
> Also I will send a new version later, maybe some refactor on patch5 and
> some typo modifications.

Thinking further, there is no benefit to Jens picking up the block core
changes until the DM changes are ready.  While I think the refactoring
to expose the blk_poll (in patch 3) that supports blk-mq and bio-based
is reasonable -- Christoph correctly points out there is extra branching
that blk-mq must tolerate as implemented.  So even that needs followup
work as suggested here:
https://www.redhat.com/archives/dm-devel/2021-January/msg00397.html

Also, your followup about oversights in the the latest bio-based DM io
polling implementation speaks to all of this needing more time:
https://www.redhat.com/archives/dm-devel/2021-January/msg00436.html

You advocating going back to what is effectively the first RFC patchset
you proposed (with its underwhelming bio-based polling performance)
isn't a strong indication these changes are ready, or that we even have
a patch forward for how to make bio-based IO polling be worthwhile.

So: I retract my question to Jens about whether he'd pick up the block
core changes (while I think those are close, the corresponding DM
changes aren't).

Mike

--
dm-devel mailing list
dm-devel@redhat.com
https://www.redhat.com/mailman/listinfo/dm-devel