
block: set NOWAIT for sync polling

Message ID 20201013084051.27255-1-jefflexu@linux.alibaba.com (mailing list archive)
State New, archived
Series: block: set NOWAIT for sync polling

Commit Message

Jingbo Xu Oct. 13, 2020, 8:40 a.m. UTC
Sync polling also needs the REQ_NOWAIT flag. One sync read/write may be
split into several bios (and thus several requests), which can sometimes
use up the queue depth. Without REQ_NOWAIT, the next bio of the same
sync read/write then blocks waiting for a free request, and the
subsequent sync polling deadlocks.

One case (maybe the only case) for the above situation is preadv2/pwritev2
+ direct + highpri. Two conditions need to be satisfied to trigger the
deadlock.

1. HIPRI IO in the sync routine. Normal read(2)/pread(2)/readv(2)/preadv(2)
and the corresponding write family of syscalls don't support high-priority
IO and thus won't trigger the polling routine. Only preadv2(2)/pwritev2(2)
support high-priority IO, via the RWF_HIPRI flag of the @flags parameter.

2. Polling support in the sync routine. Currently both the blkdev and
iomap-based filesystems (ext4/xfs, etc.) support polling in the direct IO
routine. The general routine is as follows.

submit_bio
  waits in blk_mq_get_tag() for a request to complete, but completions
  are only reaped by the polling below, thus causing a deadlock.
blk_poll
  polls in the current process context

This issue can be reproduced by the following steps:
1. Set max_sectors_kb to '4' and nr_requests to '4'.
This deliberate setting is imposed to stress the issue.

2. fio test with the following configuration:
ioengine=pvsync2
iodepth=128
numjobs=1
thread
rw=randread
direct=1
hipri=1
bs=4M
filename=/dev/nvme0n1
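Concretely, the two steps above translate to the following commands (a sketch; assumes the test device is /dev/nvme0n1 as in the fio job, and requires root since it writes queue sysfs attributes):

```shell
# Shrink the maximum IO size so a 4M read splits into 1024 bios,
# and shrink the queue depth so those bios exhaust it quickly.
echo 4 > /sys/block/nvme0n1/queue/max_sectors_kb
echo 4 > /sys/block/nvme0n1/queue/nr_requests

# Run the fio job above (same options, passed on the command line):
fio --name=hipri-repro --ioengine=pvsync2 --iodepth=128 --numjobs=1 \
    --thread --rw=randread --direct=1 --hipri=1 --bs=4M \
    --filename=/dev/nvme0n1
```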

In this case one sync read is split into 1024 bios, which rapidly exhaust
the queue depth, leaving the next bio waiting forever in the following
stack.

[<0>] blk_mq_get_tag+0x19d/0x290
[<0>] __blk_mq_alloc_request+0x5c/0x130
[<0>] blk_mq_submit_bio+0x103/0x5d0
[<0>] submit_bio_noacct+0x309/0x400
[<0>] submit_bio+0x81/0x190
[<0>] blkdev_direct_IO+0x44c/0x4a0
[<0>] generic_file_read_iter+0x80/0x160
[<0>] do_iter_readv_writev+0x186/0x1c0
[<0>] do_iter_read+0xc6/0x1c0
[<0>] vfs_readv+0x7e/0xc0
[<0>] do_preadv+0xad/0xc0
[<0>] do_syscall_64+0x2d/0x40
[<0>] entry_SYSCALL_64_after_hwframe+0x44/0xa9

Fixes: 675d9e95538e5 ("block: add bio_set_polled() helper")
Signed-off-by: Jeffle Xu <jefflexu@linux.alibaba.com>
---
 include/linux/bio.h | 8 +++-----
 1 file changed, 3 insertions(+), 5 deletions(-)

Comments

Ming Lei Oct. 13, 2020, 12:09 p.m. UTC | #1
On Tue, Oct 13, 2020 at 04:40:51PM +0800, Jeffle Xu wrote:
> Sync polling also needs REQ_NOWAIT flag. One sync read/write may be
> split into several bios (and thus several requests), and can used up the
> queue depth sometimes. Thus the following bio in the same sync
> read/write will wait for usable request if REQ_NOWAIT flag not set, in
> which case the following sync polling will cause a deadlock.
> 
> One case (maybe the only case) for above situation is preadv2/pwritev2
> + direct + highpri. Two conditions need to be satisfied to trigger the
> deadlock.
> 
> 1. HIPRI IO in sync routine. Normal read(2)/pread(2)/readv(2)/preadv(2)
> and corresponding write family syscalls don't support high-priority IO and
> thus won't trigger polling routine. Only preadv2(2)/pwritev2(2) supports
> high-priority IO by RWF_HIPRI flag of @flags parameter.
> 
> 2. Polling support in sync routine. Currently both the blkdev and
> iomap-based fs (ext4/xfs, etc) support polling in direct IO routine. The
> general routine is described as follows.
> 
> submit_bio
>   wait for blk_mq_get_tag(), waiting for requests completion, which
>   should be done by the following polling, thus causing a deadlock.

Another blocking point is rq_qos_throttle(), so I guess falling back to
REQ_NOWAIT may not fix the issue completely.

Given iopoll isn't supposed to be used for big IO, another solution
may be to disable iopoll when bio splitting is needed, something
like the following change:

diff --git a/block/blk-merge.c b/block/blk-merge.c
index bcf5e4580603..8e762215660b 100644
--- a/block/blk-merge.c
+++ b/block/blk-merge.c
@@ -279,6 +279,12 @@ static struct bio *blk_bio_segment_split(struct request_queue *q,
 	return NULL;
 split:
 	*segs = nsegs;
+
+	/*
+	 * bio splitting may cause more trouble for iopoll which isn't supposed
+	 * to be used in case of big IO
+	 */
+	bio->bi_opf &= ~REQ_HIPRI;
 	return bio_split(bio, sectors, GFP_NOIO, bs);
 }
 


Thanks, 
Ming
Jens Axboe Oct. 13, 2020, 7:28 p.m. UTC | #2
On 10/13/20 6:09 AM, Ming Lei wrote:
> On Tue, Oct 13, 2020 at 04:40:51PM +0800, Jeffle Xu wrote:
>> Sync polling also needs REQ_NOWAIT flag. One sync read/write may be
>> split into several bios (and thus several requests), and can used up the
>> queue depth sometimes. Thus the following bio in the same sync
>> read/write will wait for usable request if REQ_NOWAIT flag not set, in
>> which case the following sync polling will cause a deadlock.
>>
>> One case (maybe the only case) for above situation is preadv2/pwritev2
>> + direct + highpri. Two conditions need to be satisfied to trigger the
>> deadlock.
>>
>> 1. HIPRI IO in sync routine. Normal read(2)/pread(2)/readv(2)/preadv(2)
>> and corresponding write family syscalls don't support high-priority IO and
>> thus won't trigger polling routine. Only preadv2(2)/pwritev2(2) supports
>> high-priority IO by RWF_HIPRI flag of @flags parameter.
>>
>> 2. Polling support in sync routine. Currently both the blkdev and
>> iomap-based fs (ext4/xfs, etc) support polling in direct IO routine. The
>> general routine is described as follows.
>>
>> submit_bio
>>   wait for blk_mq_get_tag(), waiting for requests completion, which
>>   should be done by the following polling, thus causing a deadlock.
> 
> Another blocking point is rq_qos_throttle(), so I guess falling back to
> REQ_NOWAIT may not fix the issue completely.
> 
> Given iopoll isn't supposed to in case of big IO, another solution
> may be to disable iopoll when bio splitting is needed, something
> like the following change:

I kind of like that better, especially since polling for split bios is
somewhat of a weird thing. It needs a better comment though, not just on
size, but also on why polling multiple bios isn't really something that
works.
Jingbo Xu Oct. 14, 2020, 8:31 a.m. UTC | #3
What about just disabling HIPRI in preadv2(2)/pwritev2(2)? Christoph
Hellwig disabled HIPRI for libaio in
commit 154989e45fd8de9bfb52bbd6e5ea763e437e54c5 ("aio: clear
IOCB_HIPRI"). What do you think, @Christoph?

(cc Christoph Hellwig)


>   static inline void bio_set_polled(struct bio *bio, struct kiocb *kiocb)
>   {
> -	bio->bi_opf |= REQ_HIPRI;
> -	if (!is_sync_kiocb(kiocb))
> -		bio->bi_opf |= REQ_NOWAIT;
> +	bio->bi_opf |= REQ_HIPRI | REQ_NOWAIT;
>   }

The original patch indeed could not fix the problem completely. Though it
would fix the potential deadlock, the VFS read(2)/write(2) code is not
ready to handle the returned -EAGAIN gracefully: currently
read(2)/write(2) will just return -EAGAIN to user space.



On 10/13/20 8:09 PM, Ming Lei wrote:
> On Tue, Oct 13, 2020 at 04:40:51PM +0800, Jeffle Xu wrote:
>> Sync polling also needs REQ_NOWAIT flag. One sync read/write may be
>> split into several bios (and thus several requests), and can used up the
>> queue depth sometimes. Thus the following bio in the same sync
>> read/write will wait for usable request if REQ_NOWAIT flag not set, in
>> which case the following sync polling will cause a deadlock.
>>
>> One case (maybe the only case) for above situation is preadv2/pwritev2
>> + direct + highpri. Two conditions need to be satisfied to trigger the
>> deadlock.
>>
>> 1. HIPRI IO in sync routine. Normal read(2)/pread(2)/readv(2)/preadv(2)
>> and corresponding write family syscalls don't support high-priority IO and
>> thus won't trigger polling routine. Only preadv2(2)/pwritev2(2) supports
>> high-priority IO by RWF_HIPRI flag of @flags parameter.
>>
>> 2. Polling support in sync routine. Currently both the blkdev and
>> iomap-based fs (ext4/xfs, etc) support polling in direct IO routine. The
>> general routine is described as follows.
>>
>> submit_bio
>>    wait for blk_mq_get_tag(), waiting for requests completion, which
>>    should be done by the following polling, thus causing a deadlock.
> Another blocking point is rq_qos_throttle(),

What is the issue here in rq_qos_throttle()? More details?


> so I guess falling back to
> REQ_NOWAIT may not fix the issue completely.



> Given iopoll isn't supposed to in case of big IO, another solution
> may be to disable iopoll when bio splitting is needed, something
> like the following change:
>
> diff --git a/block/blk-merge.c b/block/blk-merge.c
> index bcf5e4580603..8e762215660b 100644
> --- a/block/blk-merge.c
> +++ b/block/blk-merge.c
> @@ -279,6 +279,12 @@ static struct bio *blk_bio_segment_split(struct request_queue *q,
>   	return NULL;
>   split:
>   	*segs = nsegs;
> +
> +	/*
> +	 * bio splitting may cause more trouble for iopoll which isn't supposed
> +	 * to be used in case of big IO
> +	 */
> +	bio->bi_opf &= ~REQ_HIPRI;
>   	return bio_split(bio, sectors, GFP_NOIO, bs);
>   }

Actually splitting happens not only in blk_mq_submit_bio->__blk_queue_split.
In __blkdev_direct_IO, one input iov_iter can be split into several bios.

```

__blkdev_direct_IO:

for (;;) {
        ret = bio_iov_iter_get_pages(bio, iter);
        submit_bio(bio);
}

for (;;) {
        blk_poll();
        ...
}

```

Since one single bio can contain at most BIO_MAX_PAGES, i.e. 256 bio_vecs
in @bio->bi_io_vec, if the @iovcnt parameter of preadv2(2)/pwritev2(2) is
larger than 256, then one call of preadv2(2)/pwritev2(2) can be split into
several bios. These bios are all submitted first, and then sync polling
starts in the process context.


If the number of bios split from one preadv2(2)/pwritev2(2) call is
larger than the queue depth, bios from a single call can exhaust the
queue depth and thus cause a deadlock.

Fortunately the maximum of the @iovcnt parameter of preadv2(2)/pwritev2(2)
is UIO_MAXIOV, i.e. 1024, and the minimum queue depth is BLKDEV_MIN_RQ,
i.e. 4. That means one preadv2(2)/pwritev2(2) call can submit at most 4
bios, which fills up the minimum queue depth exactly, so there's no
deadlock in this case. I'm not sure whether the relation between
UIO_MAXIOV/BIO_MAX_PAGES/BLKDEV_MIN_RQ is coincidental or deliberately
tuned. At least it will not cause a deadlock currently, though the
constraint may be a little fragile.


By the way, this patch could fix the potential hang I mentioned in
https://patchwork.kernel.org/project/linux-block/patch/20200911032958.125068-1-jefflexu@linux.alibaba.com/


> Thanks,
> Ming
Ming Lei Oct. 14, 2020, 9:49 a.m. UTC | #4
On Wed, Oct 14, 2020 at 04:31:49PM +0800, JeffleXu wrote:
> What about just disabling HIPRI in preadv2(2)/pwritev2(2)? Christoph Hellwig
> disabled HIPRI for libaio in

Then people will complain that poll can't be used for sync dio, and
that is a regression.

> 
> commit 154989e45fd8de9bfb52bbd6e5ea763e437e54c5 ("aio: clear IOCB_HIPRI").
> What do you think, @Christoph?
> 
> (cc Christoph Hellwig)
> 
> 
> >   static inline void bio_set_polled(struct bio *bio, struct kiocb *kiocb)
> >   {
> > -	bio->bi_opf |= REQ_HIPRI;
> > -	if (!is_sync_kiocb(kiocb))
> > -		bio->bi_opf |= REQ_NOWAIT;
> > +	bio->bi_opf |= REQ_HIPRI | REQ_NOWAIT;
> >   }
> 
> The original patch indeed could not fix the problem. Though it could fix the
> potential deadlock,
> 
> the VFS code read(2)/write(2) is not ready by handling the returned -EAGAIN
> gracefully. Currently
> 
> read(2)/write(2) will just return -EAGAIN to user space.
> 
> 
> 
> On 10/13/20 8:09 PM, Ming Lei wrote:
> > On Tue, Oct 13, 2020 at 04:40:51PM +0800, Jeffle Xu wrote:
> > > Sync polling also needs REQ_NOWAIT flag. One sync read/write may be
> > > split into several bios (and thus several requests), and can used up the
> > > queue depth sometimes. Thus the following bio in the same sync
> > > read/write will wait for usable request if REQ_NOWAIT flag not set, in
> > > which case the following sync polling will cause a deadlock.
> > > 
> > > One case (maybe the only case) for above situation is preadv2/pwritev2
> > > + direct + highpri. Two conditions need to be satisfied to trigger the
> > > deadlock.
> > > 
> > > 1. HIPRI IO in sync routine. Normal read(2)/pread(2)/readv(2)/preadv(2)
> > > and corresponding write family syscalls don't support high-priority IO and
> > > thus won't trigger polling routine. Only preadv2(2)/pwritev2(2) supports
> > > high-priority IO by RWF_HIPRI flag of @flags parameter.
> > > 
> > > 2. Polling support in sync routine. Currently both the blkdev and
> > > iomap-based fs (ext4/xfs, etc) support polling in direct IO routine. The
> > > general routine is described as follows.
> > > 
> > > submit_bio
> > >    wait for blk_mq_get_tag(), waiting for requests completion, which
> > >    should be done by the following polling, thus causing a deadlock.
> > Another blocking point is rq_qos_throttle(),
> 
> What is the issue here in rq_qos_throttle()? More details?

Like request allocation, rq_qos_throttle() may wait until a request
completes; see wbt_wait() and wbt_done().

> 
> 
> > so I guess falling back to
> > REQ_NOWAIT may not fix the issue completely.
> 
> 
> 
> > Given iopoll isn't supposed to in case of big IO, another solution
> > may be to disable iopoll when bio splitting is needed, something
> > like the following change:
> > 
> > diff --git a/block/blk-merge.c b/block/blk-merge.c
> > index bcf5e4580603..8e762215660b 100644
> > --- a/block/blk-merge.c
> > +++ b/block/blk-merge.c
> > @@ -279,6 +279,12 @@ static struct bio *blk_bio_segment_split(struct request_queue *q,
> >   	return NULL;
> >   split:
> >   	*segs = nsegs;
> > +
> > +	/*
> > +	 * bio splitting may cause more trouble for iopoll which isn't supposed
> > +	 * to be used in case of big IO
> > +	 */
> > +	bio->bi_opf &= ~REQ_HIPRI;
> >   	return bio_split(bio, sectors, GFP_NOIO, bs);
> >   }
> 
> Actually split is not only from blk_mq_submit_bio->__blk_queue_split. In
> __blkdev_direct_IO,
> 
> one input iov_iter could be split to several bios.
> 
> ```
> 
> __blkdev_direct_IO:
> 
> for (;;) {
>         ret = bio_iov_iter_get_pages(bio, iter);
>         submit_bio(bio);
> }
> 
> for (;;) {
>         blk_poll()
> 
>         ...
> 
> }
> 
> ```
> 
> Since  one single bio can contain at most BIO_MAX_PAGES, i.e. 256 bio_vec in
> @bio->bi_io_vec,

As Jens mentioned, it is weird to poll on multiple bios, so we can
simply disable io poll in __blkdev_direct_IO(). And it is reasonable
from the user's viewpoint not to poll on big IO, because io poll is
often used in latency-sensitive cases.

> 
> if the @iovcnt parameter of preadv2(2)/pwritev2(2) is larger than 256, then
> one call of
> 
> preadv2(2)/pwritev2(2) can be split into several bios. These bios are
> submitted at once, and then
> 
> start sync polling in the process context.
> 
> 
> If the number of bios split from one call of preadv2(2)/pwritev2(2) is
> larger than the queue depth,
> 
> bios from single preadv2(2)/pwritev2(2) call can exhaust the queue depth and
> thus cause deadlock.
> 
> Fortunately the maximum of @iovcnt parameter of preadv2(2)/pwritev2(2) is
> UIO_MAXIOV, i.e. 1024,
> 
> and the minimum of queue depth is BLKDEV_MIN_RQ i.e. 4. That means one
> preadv2(2)/pwritev2(2)
> 
> call can submit at most 4 bios, which will fill up the queue depth exactly
> and there's no deadlock in this
> 
> case. I'm not sure if the setting of UIO_MAXIOV/BIO_MAX_PAGES/BLKDEV_MIN_RQ
> is coincident or
> 
> deliberately tuned. At least it will not cause deadlock currently , though
> the constraint may be a little fragile.
> 
> 
> By the way, this patch could fix the potential hang I mentioned in
> 
> https://patchwork.kernel.org/project/linux-block/patch/20200911032958.125068-1-jefflexu@linux.alibaba.com/

Right, I remembered that race.


Thanks,
Ming
Christoph Hellwig Oct. 15, 2020, 7:43 a.m. UTC | #5
On Wed, Oct 14, 2020 at 04:31:49PM +0800, JeffleXu wrote:
> What about just disabling HIPRI in preadv2(2)/pwritev2(2)? Christoph 
> Hellwig disabled HIPRI for libaio in
>
> commit 154989e45fd8de9bfb52bbd6e5ea763e437e54c5 ("aio: clear IOCB_HIPRI"). 
> What do you think, @Christoph?
>
> (cc Christoph Hellwig)

aio clears the flag because it can't work.  For preadv2/pwritev2 it
works perfectly fine.

Patch

diff --git a/include/linux/bio.h b/include/linux/bio.h
index c6d765382926..884ee5aa3df4 100644
--- a/include/linux/bio.h
+++ b/include/linux/bio.h
@@ -800,17 +800,15 @@  static inline int bio_integrity_add_page(struct bio *bio, struct page *page,
 #endif /* CONFIG_BLK_DEV_INTEGRITY */
 
 /*
- * Mark a bio as polled. Note that for async polled IO, the caller must
- * expect -EWOULDBLOCK if we cannot allocate a request (or other resources).
+ * Mark a bio as polled. Note that for polled IO, the caller must expect
+ * -EWOULDBLOCK if we cannot allocate a request (or other resources).
  * We cannot block waiting for requests on polled IO, as those completions
  * must be found by the caller. This is different than IRQ driven IO, where
  * it's safe to wait for IO to complete.
  */
 static inline void bio_set_polled(struct bio *bio, struct kiocb *kiocb)
 {
-	bio->bi_opf |= REQ_HIPRI;
-	if (!is_sync_kiocb(kiocb))
-		bio->bi_opf |= REQ_NOWAIT;
+	bio->bi_opf |= REQ_HIPRI | REQ_NOWAIT;
 }
 
 #endif /* __LINUX_BIO_H */