
iomap: Fix the write_count in iomap_add_to_ioend().

Message ID 20200819102841.481461-1-anju@linux.vnet.ibm.com (mailing list archive)
State New, archived
Series iomap: Fix the write_count in iomap_add_to_ioend().

Commit Message

Anju T Sudhakar Aug. 19, 2020, 10:28 a.m. UTC
From: Ritesh Harjani <riteshh@linux.ibm.com>

__bio_try_merge_page() may return same_page = 1 and merged = 0.
This could happen when bio->bi_iter.bi_size + len > UINT_MAX.
Handle this case in iomap_add_to_ioend() by incrementing write_count.
This scenario mostly happens when we have too much dirty data accumulated.

Without the patch we hit the below kernel warning:
 
 WARNING: CPU: 18 PID: 5130 at fs/iomap/buffered-io.c:74 iomap_page_release+0x120/0x150
 CPU: 18 PID: 5130 Comm: fio Kdump: loaded Tainted: G        W         5.8.0-rc3 #6
 Call Trace:
  __remove_mapping+0x154/0x320 (unreliable)
  iomap_releasepage+0x80/0x180
  try_to_release_page+0x94/0xe0
  invalidate_inode_page+0xc8/0x110
  invalidate_mapping_pages+0x1dc/0x540
  generic_fadvise+0x3c8/0x450
  xfs_file_fadvise+0x2c/0xe0 [xfs]
  vfs_fadvise+0x3c/0x60
  ksys_fadvise64_64+0x68/0xe0
  sys_fadvise64+0x28/0x40
  system_call_exception+0xf8/0x1c0
  system_call_common+0xf0/0x278

Reported-by: Shivaprasad G Bhat <sbhat@linux.ibm.com>
Signed-off-by: Ritesh Harjani <riteshh@linux.ibm.com>
Signed-off-by: Anju T Sudhakar <anju@linux.vnet.ibm.com>
---
 fs/iomap/buffered-io.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)
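
The body of the diff is not reproduced above. As a rough sketch only - based
on the 5.8-era iomap_add_to_ioend() and on the description in the commit
message, not on the actual posted hunk - the caller-side change amounts to
taking the write_count reference even when __bio_try_merge_page() reports
same_page = true but refuses to merge:

	merged = __bio_try_merge_page(wpc->ioend->io_bio, page, len, poff,
			&same_page);
	/*
	 * Sketch, not the posted patch: if the page was recognised as the
	 * same page but could not be merged because bi_size would exceed
	 * UINT_MAX, it will be added to a fresh bio below, so it still
	 * needs its own write_count reference.
	 */
	if (iop && (!same_page || !merged))
		atomic_inc(&iop->write_count);

	if (!merged) {
		if (bio_full(wpc->ioend->io_bio, len)) {
			wpc->ioend->io_bio =
				iomap_chain_bio(wpc->ioend->io_bio);
		}
		bio_add_page(wpc->ioend->io_bio, page, len, poff);
	}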

Comments

Dave Chinner Aug. 20, 2020, 11:11 p.m. UTC | #1
On Wed, Aug 19, 2020 at 03:58:41PM +0530, Anju T Sudhakar wrote:
> From: Ritesh Harjani <riteshh@linux.ibm.com>
> 
> __bio_try_merge_page() may return same_page = 1 and merged = 0. 
> This could happen when bio->bi_iter.bi_size + len > UINT_MAX. 

Ummm, silly question, but exactly how is a bio that large getting
built in ->writepages? Even with 64kB pages, that's a
bio with 2^16 pages attached to it. We shouldn't be building single
bios in writeback that large - what storage hardware is allowing
such huge bios to be built? (i.e. can you dump all the values in
/sys/block/<dev>/queue/* for that device for us?)

Cheers,

Dave.
Ritesh Harjani Aug. 21, 2020, 4:45 a.m. UTC | #2
Hello Dave,

Thanks for reviewing this.

On 8/21/20 4:41 AM, Dave Chinner wrote:
> On Wed, Aug 19, 2020 at 03:58:41PM +0530, Anju T Sudhakar wrote:
>> From: Ritesh Harjani <riteshh@linux.ibm.com>
>>
>> __bio_try_merge_page() may return same_page = 1 and merged = 0.
>> This could happen when bio->bi_iter.bi_size + len > UINT_MAX.
> 
> Ummm, silly question, but exactly how are we getting a bio that
> large in ->writepages getting built? Even with 64kB pages, that's a
> bio with 2^16 pages attached to it. We shouldn't be building single
> bios in writeback that large - what storage hardware is allowing
> such huge bios to be built? (i.e. can you dump all the values in
> /sys/block/<dev>/queue/* for that device for us?)

Please correct me here, but as far as I can see, the bio has only these
two limits which it checks when adding a page to the bio. It doesn't
check for the limits in /sys/block/<dev>/queue/*, no? I guess those
could then be checked by the block layer below before submitting the bio?

static inline bool bio_full(struct bio *bio, unsigned len)
{
	if (bio->bi_vcnt >= bio->bi_max_vecs)
		return true;

	if (bio->bi_iter.bi_size > UINT_MAX - len)
		return true;

	return false;
}


This issue was first observed while running fio on a system with
huge memory. But then here is an easy way we figured out to trigger the
issue almost every time with a loop device on my VM setup. I have provided
all the details on this below.

<cmds to trigger it fairly quickly>
===================================
echo 99999999 > /proc/sys/vm/dirtytime_expire_seconds
echo 99999999 > /proc/sys/vm/dirty_expire_centisecs
echo 90  > /proc/sys/vm/dirty_ratio
echo 90  > /proc/sys/vm/dirty_background_ratio
echo 0  > /proc/sys/vm/dirty_writeback_centisecs

sudo perf probe -s ~/host_shared/src/linux/ -a '__bio_try_merge_page:10 bio page page->index bio->bi_iter.bi_size len same_page[0]'

sudo perf record -e probe:__bio_try_merge_page_L10 -a --filter 'bi_size > 0xff000000' sudo fio --rw=write --bs=1M --numjobs=1 --name=/mnt/testfile --size=24G --ioengine=libaio


# on running this a 2nd time it gets hit every time on my setup

sudo perf record -e probe:__bio_try_merge_page_L10 -a --filter 'bi_size > 0xff000000' sudo fio --rw=write --bs=1M --numjobs=1 --name=/mnt/testfile --size=24G --ioengine=libaio


Perf o/p from above filter causing overflow
===========================================
<...>
   fio 25194 [029] 70471.559084: probe:__bio_try_merge_page_L10: (c000000000aa054c) bio=0xc0000013d49a4b80 page=0xc00c000004029d80 index=0x10a9d bi_size=0xffff8000 len=0x1000 same_page=0x1
   fio 25194 [029] 70471.559087: probe:__bio_try_merge_page_L10: (c000000000aa054c) bio=0xc0000013d49a4b80 page=0xc00c000004029d80 index=0x10a9d bi_size=0xffff9000 len=0x1000 same_page=0x1
   fio 25194 [029] 70471.559090: probe:__bio_try_merge_page_L10: (c000000000aa054c) bio=0xc0000013d49a4b80 page=0xc00c000004029d80 index=0x10a9d bi_size=0xffffa000 len=0x1000 same_page=0x1
   fio 25194 [029] 70471.559093: probe:__bio_try_merge_page_L10: (c000000000aa054c) bio=0xc0000013d49a4b80 page=0xc00c000004029d80 index=0x10a9d bi_size=0xffffb000 len=0x1000 same_page=0x1
   fio 25194 [029] 70471.559095: probe:__bio_try_merge_page_L10: (c000000000aa054c) bio=0xc0000013d49a4b80 page=0xc00c000004029d80 index=0x10a9d bi_size=0xffffc000 len=0x1000 same_page=0x1
   fio 25194 [029] 70471.559098: probe:__bio_try_merge_page_L10: (c000000000aa054c) bio=0xc0000013d49a4b80 page=0xc00c000004029d80 index=0x10a9d bi_size=0xffffd000 len=0x1000 same_page=0x1
   fio 25194 [029] 70471.559101: probe:__bio_try_merge_page_L10: (c000000000aa054c) bio=0xc0000013d49a4b80 page=0xc00c000004029d80 index=0x10a9d bi_size=0xffffe000 len=0x1000 same_page=0x1
   fio 25194 [029] 70471.559104: probe:__bio_try_merge_page_L10: (c000000000aa054c) bio=0xc0000013d49a4b80 page=0xc00c000004029d80 index=0x10a9d bi_size=0xfffff000 len=0x1000 same_page=0x1

^^^^^^ (this could cause an overflow)

loop dev
=========
NAME       SIZELIMIT OFFSET AUTOCLEAR RO BACK-FILE    DIO LOG-SEC
/dev/loop1         0      0         0  0 /mnt1/filefs   0     512


mount o/p
=========
/dev/loop1 on /mnt type xfs (rw,relatime,attr2,inode64,logbufs=8,logbsize=32k,noquota)


/sys/block/<dev>/queue/*
========================

setup:/run/perf$ cat /sys/block/loop1/queue/max_segments
128
setup:/run/perf$ cat /sys/block/loop1/queue/max_segment_size
65536
setup:/run/perf$ cat /sys/block/loop1/queue/max_hw_sectors_kb
1280
setup:/run/perf$ cat /sys/block/loop1/queue/logical_block_size
512
setup:/run/perf$ cat /sys/block/loop1/queue/max_sectors_kb
1280
setup:/run/perf$ cat /sys/block/loop1/queue/hw_sector_size
512
setup:/run/perf$ cat /sys/block/loop1/queue/discard_max_bytes
4294966784
setup:/run/perf$ cat /sys/block/loop1/queue/discard_max_hw_bytes
4294966784
setup:/run/perf$ cat /sys/block/loop1/queue/discard_zeroes_data
0
setup:/run/perf$ cat /sys/block/loop1/queue/discard_granularity
4096
setup:/run/perf$ cat /sys/block/loop1/queue/chunk_sectors
0
setup:/run/perf$ cat /sys/block/loop1/queue/max_discard_segments
1
setup:/run/perf$ cat /sys/block/loop1/queue/read_ahead_kb
128
setup:/run/perf$ cat /sys/block/loop1/queue/rotational
1
setup:/run/perf$ cat /sys/block/loop1/queue/physical_block_size
512
setup:/run/perf$ cat /sys/block/loop1/queue/write_same_max_bytes
0
setup:/run/perf$ cat /sys/block/loop1/queue/write_zeroes_max_bytes
4294966784
Christoph Hellwig Aug. 21, 2020, 6 a.m. UTC | #3
On Fri, Aug 21, 2020 at 10:15:33AM +0530, Ritesh Harjani wrote:
> Please correct me here, but as I see, bio has only these two limits
> which it checks for adding page to bio. It doesn't check for limits
> of /sys/block/<dev>/queue/* no? I guess then it could be checked
> by block layer below b4 submitting the bio?

The bio does not, but the blk-mq code will split the bios when mapping
them to requests; take a look at blk_mq_submit_bio and __blk_queue_split.

But while the default limits are quite low, they can be increased
significantly, which tends to help with performance and is often
also done by scripts shipped by the distributions.

> This issue was first observed while running a fio run on a system with
> huge memory. But then here is an easy way we figured out to trigger the
> issue almost everytime with loop device on my VM setup. I have provided
> all the details on this below.

Can you wire this up for xfstests?
Christoph Hellwig Aug. 21, 2020, 6:01 a.m. UTC | #4
On Fri, Aug 21, 2020 at 09:11:40AM +1000, Dave Chinner wrote:
> On Wed, Aug 19, 2020 at 03:58:41PM +0530, Anju T Sudhakar wrote:
> > From: Ritesh Harjani <riteshh@linux.ibm.com>
> > 
> > __bio_try_merge_page() may return same_page = 1 and merged = 0. 
> > This could happen when bio->bi_iter.bi_size + len > UINT_MAX. 
> 
> Ummm, silly question, but exactly how are we getting a bio that
> large in ->writepages getting built? Even with 64kB pages, that's a
> bio with 2^16 pages attached to it. We shouldn't be building single
> bios in writeback that large - what storage hardware is allowing
> such huge bios to be built? (i.e. can you dump all the values in
> /sys/block/<dev>/queue/* for that device for us?)

NVMe controllers should not have a problem with such huge I/O,
especially if they support SGLs (extent based I/O :)) instead of the
default dumb PRP scheme.  Higher end SCSI controllers should also have
huge limits.
Christoph Hellwig Aug. 21, 2020, 6:07 a.m. UTC | #5
On Wed, Aug 19, 2020 at 03:58:41PM +0530, Anju T Sudhakar wrote:
> From: Ritesh Harjani <riteshh@linux.ibm.com>
> 
> __bio_try_merge_page() may return same_page = 1 and merged = 0. 
> This could happen when bio->bi_iter.bi_size + len > UINT_MAX. 
> Handle this case in iomap_add_to_ioend() by incrementing write_count.
> This scenario mostly happens where we have too much dirty data accumulated. 
> 
> w/o the patch we hit below kernel warning,

I think this is better fixed in the block layer rather than working
around the problem in the callers.  Something like this:

diff --git a/block/bio.c b/block/bio.c
index c63ba04bd62967..ef321cd1072e4e 100644
--- a/block/bio.c
+++ b/block/bio.c
@@ -879,8 +879,10 @@ bool __bio_try_merge_page(struct bio *bio, struct page *page,
 		struct bio_vec *bv = &bio->bi_io_vec[bio->bi_vcnt - 1];
 
 		if (page_is_mergeable(bv, page, len, off, same_page)) {
-			if (bio->bi_iter.bi_size > UINT_MAX - len)
+			if (bio->bi_iter.bi_size > UINT_MAX - len) {
+				*same_page = false;
 				return false;
+			}
 			bv->bv_len += len;
 			bio->bi_iter.bi_size += len;
 			return true;
Ritesh Harjani Aug. 21, 2020, 8:53 a.m. UTC | #6
On 8/21/20 11:37 AM, Christoph Hellwig wrote:
> On Wed, Aug 19, 2020 at 03:58:41PM +0530, Anju T Sudhakar wrote:
>> From: Ritesh Harjani <riteshh@linux.ibm.com>
>>
>> __bio_try_merge_page() may return same_page = 1 and merged = 0.
>> This could happen when bio->bi_iter.bi_size + len > UINT_MAX.
>> Handle this case in iomap_add_to_ioend() by incrementing write_count.
>> This scenario mostly happens where we have too much dirty data accumulated.
>>
>> w/o the patch we hit below kernel warning,
> 
> I think this is better fixed in the block layer rather than working
> around the problem in the callers.  Something like this:
> 
> diff --git a/block/bio.c b/block/bio.c
> index c63ba04bd62967..ef321cd1072e4e 100644
> --- a/block/bio.c
> +++ b/block/bio.c
> @@ -879,8 +879,10 @@ bool __bio_try_merge_page(struct bio *bio, struct page *page,
>   		struct bio_vec *bv = &bio->bi_io_vec[bio->bi_vcnt - 1];
>   
>   		if (page_is_mergeable(bv, page, len, off, same_page)) {
> -			if (bio->bi_iter.bi_size > UINT_MAX - len)
> +			if (bio->bi_iter.bi_size > UINT_MAX - len) {
> +				*same_page = false;
>   				return false;
> +			}
>   			bv->bv_len += len;
>   			bio->bi_iter.bi_size += len;
>   			return true;
> 

Yeah, we had thought of that. But what we then thought was that maybe the
API does return the right thing: what the API says is that same_page is
true, but the page couldn't be merged, hence it returned ret = false.
With that thought, we fixed this in the caller.

But I agree with you that with ret = false there is no meaning in
same_page being true. Ok, so let linux-block comment on whether the
above also looks good. If yes, I can spin out an official patch with
all details.

-ritesh
Ritesh Harjani Aug. 21, 2020, 9:09 a.m. UTC | #7
On 8/21/20 11:30 AM, Christoph Hellwig wrote:
> On Fri, Aug 21, 2020 at 10:15:33AM +0530, Ritesh Harjani wrote:
>> Please correct me here, but as I see, bio has only these two limits
>> which it checks for adding page to bio. It doesn't check for limits
>> of /sys/block/<dev>/queue/* no? I guess then it could be checked
>> by block layer below b4 submitting the bio?
> 
> The bio does not, but the blk-mq code will split the bios when mapping
> it to requests, take a look at blk_mq_submit_bio and __blk_queue_split.

Thanks :)

> 
> But while the default limits are quite low, they can be increased
> siginificantly, which tends to help with performance and is often
> also done by scripts shipped by the distributions.
> 
>> This issue was first observed while running a fio run on a system with
>> huge memory. But then here is an easy way we figured out to trigger the
>> issue almost everytime with loop device on my VM setup. I have provided
>> all the details on this below.
> 
> Can you wire this up for xfstests?
> 

Sure, will do that.
I do see that generic/460 plays with such vm dirty_ratio params, which
this test would also need to manipulate to trigger this issue.


Thanks for the suggestion!
-ritesh
Matthew Wilcox Aug. 21, 2020, 1:31 p.m. UTC | #8
On Wed, Aug 19, 2020 at 03:58:41PM +0530, Anju T Sudhakar wrote:
> __bio_try_merge_page() may return same_page = 1 and merged = 0. 
> This could happen when bio->bi_iter.bi_size + len > UINT_MAX. 
> Handle this case in iomap_add_to_ioend() by incrementing write_count.

One of the patches I have pending ignores same_page by just using the
write_count as a byte count instead of a segment count.  It's currently
mixed into this patch but needs to be separated.

http://git.infradead.org/users/willy/pagecache.git/commitdiff/0186d1dde949a458584c56b706fa8dfd252466ff

(another patch does the same thing to the read count).
Jens Axboe Aug. 21, 2020, 2:49 p.m. UTC | #9
On 8/21/20 12:07 AM, Christoph Hellwig wrote:
> On Wed, Aug 19, 2020 at 03:58:41PM +0530, Anju T Sudhakar wrote:
>> From: Ritesh Harjani <riteshh@linux.ibm.com>
>>
>> __bio_try_merge_page() may return same_page = 1 and merged = 0. 
>> This could happen when bio->bi_iter.bi_size + len > UINT_MAX. 
>> Handle this case in iomap_add_to_ioend() by incrementing write_count.
>> This scenario mostly happens where we have too much dirty data accumulated. 
>>
>> w/o the patch we hit below kernel warning,
> 
> I think this is better fixed in the block layer rather than working
> around the problem in the callers.  Something like this:
> 
> diff --git a/block/bio.c b/block/bio.c
> index c63ba04bd62967..ef321cd1072e4e 100644
> --- a/block/bio.c
> +++ b/block/bio.c
> @@ -879,8 +879,10 @@ bool __bio_try_merge_page(struct bio *bio, struct page *page,
>  		struct bio_vec *bv = &bio->bi_io_vec[bio->bi_vcnt - 1];
>  
>  		if (page_is_mergeable(bv, page, len, off, same_page)) {
> -			if (bio->bi_iter.bi_size > UINT_MAX - len)
> +			if (bio->bi_iter.bi_size > UINT_MAX - len) {
> +				*same_page = false;
>  				return false;
> +			}
>  			bv->bv_len += len;
>  			bio->bi_iter.bi_size += len;
>  			return true;
> 

This looks good to me.
Dave Chinner Aug. 21, 2020, 9:53 p.m. UTC | #10
On Fri, Aug 21, 2020 at 10:15:33AM +0530, Ritesh Harjani wrote:
> Hello Dave,
> 
> Thanks for reviewing this.
> 
> On 8/21/20 4:41 AM, Dave Chinner wrote:
> > On Wed, Aug 19, 2020 at 03:58:41PM +0530, Anju T Sudhakar wrote:
> > > From: Ritesh Harjani <riteshh@linux.ibm.com>
> > > 
> > > __bio_try_merge_page() may return same_page = 1 and merged = 0.
> > > This could happen when bio->bi_iter.bi_size + len > UINT_MAX.
> > 
> > Ummm, silly question, but exactly how are we getting a bio that
> > large in ->writepages getting built? Even with 64kB pages, that's a
> > bio with 2^16 pages attached to it. We shouldn't be building single
> > bios in writeback that large - what storage hardware is allowing
> > such huge bios to be built? (i.e. can you dump all the values in
> > /sys/block/<dev>/queue/* for that device for us?)
> 
> Please correct me here, but as I see, bio has only these two limits
> which it checks for adding page to bio. It doesn't check for limits
> of /sys/block/<dev>/queue/* no? I guess then it could be checked
> by block layer below b4 submitting the bio?
> 
> 113 static inline bool bio_full(struct bio *bio, unsigned len)
> 114 {
> 115         if (bio->bi_vcnt >= bio->bi_max_vecs)
> 116                 return true;
> 117
> 118         if (bio->bi_iter.bi_size > UINT_MAX - len)
> 119                 return true;
> 120
> 121         return false;
> 122 }

but iomap only allows BIO_MAX_PAGES when creating the bio. And:

#define BIO_MAX_PAGES 256

So even on a 64k page machine, we should not be building a bio with
more than 16MB of data in it. So how are we getting 4GB of data into
it?

Further, the writeback code is designed around the bios having a
bound size that is relatively small to keep IO submission occurring
as we pack pages into bios. This keeps IO latency down and minimises
the individual IO completion overhead of each IO. This is especially
important as the writeback path is critical for memory reclaim to
make progress because we do not want to trap gigabytes of dirty
memory in the writeback IO path.

IOWs, seeing huge bios being built by writeback is indicative of
design assumptions and constraints being violated - huge bios on the
buffered writeback path like this are not a good thing to see.

FWIW, We've also recently got reports of hard lockups in IO
completion of overwrites because our ioend bio chains have grown to
almost 3 million pages and all the writeback pages get processed as
a single completion. This is a similar situation to this bug report
in that the bio chains are unbound in length, and I'm betting the
cause is the same: overwrite a 10GB file in memory (with dirty
limits turned up), then run fsync so we do a single writepages call
that tries to write 10GB of dirty pages....

The only reason we don't normally see this is that background
writeback caps the number of pages written per writepages call to
1024. i.e.  it caps writeback IO sizes to a small amount so that IO
latency, writeback fairness across dirty inodes, etc can be
maintained for background writeback - no one dirty file can
monopolise the available writeback bandwidth and starve writeback
to other dirty inodes.

So combine the two, and we've got a problem that the writeback IO
sizes are not being bound to sane IO sizes. I have no problems with
building individual bios that are 4MB or even 16MB in size - that
allows the block layer to work efficiently. Problems at a system level
start to occur, however, when individual bios or bio chains built
by writeback end up being orders of magnitude larger than this....

i.e. I'm not looking at this as a "bio overflow bug" - I'm
commenting on what this overflow implies from an architectural point
of view. i.e. that uncapped bio sizes and bio chain lengths in
writeback are actually a bad thing and something we've always
tried to avoid doing....

.....

> /sys/block/<dev>/queue/*
> ========================
> 
> setup:/run/perf$ cat /sys/block/loop1/queue/max_segments
> 128
> setup:/run/perf$ cat /sys/block/loop1/queue/max_segment_size
> 65536

A maximally sized bio (16MB) will get split into two bios for this
hardware based on this (8MB max size).

> setup:/run/perf$ cat /sys/block/loop1/queue/max_hw_sectors_kb
> 1280

Except this says 1280kB is the max size, so it will actually get
split into 14 bios.

So a stream of 16MB bios from writeback will be more than large
enough to keep this hardware's pipeline full....

Cheers,

Dave.
Christoph Hellwig Aug. 22, 2020, 1:13 p.m. UTC | #11
On Sat, Aug 22, 2020 at 07:53:58AM +1000, Dave Chinner wrote:
> but iomap only allows BIO_MAX_PAGES when creating the bio. And:
> 
> #define BIO_MAX_PAGES 256
> 
> So even on a 64k page machine, we should not be building a bio with
> more than 16MB of data in it. So how are we getting 4GB of data into
> it?

BIO_MAX_PAGES is the number of bio_vecs in the bio; it has no
direct implication on the size, as each entry can fit up to UINT_MAX
bytes.
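
In other words, as a back-of-the-envelope illustration (assuming the
multipage-bvec code is in play):

	BIO_MAX_PAGES        = 256 bio_vecs per bio
	bytes per bio_vec    = up to UINT_MAX in principle (bv_len is 32-bit)
	bio->bi_iter.bi_size = also a 32-bit byte count

so a writeback bio does not top out at 256 * PAGE_SIZE; the ~4GB bi_size
counter is the limit that is hit first.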
Brian Foster Aug. 24, 2020, 2:28 p.m. UTC | #12
On Sat, Aug 22, 2020 at 02:13:12PM +0100, Christoph Hellwig wrote:
> On Sat, Aug 22, 2020 at 07:53:58AM +1000, Dave Chinner wrote:
> > but iomap only allows BIO_MAX_PAGES when creating the bio. And:
> > 
> > #define BIO_MAX_PAGES 256
> > 
> > So even on a 64k page machine, we should not be building a bio with
> > more than 16MB of data in it. So how are we getting 4GB of data into
> > it?
> 
> BIO_MAX_PAGES is the number of bio_vecs in the bio, it has no
> direct implication on the size, as each entry can fit up to UINT_MAX
> bytes.
> 

Do I understand the current code (__bio_try_merge_page() ->
page_is_mergeable()) correctly in that we're checking for physical page
contiguity and not necessarily requiring a new bio_vec per physical
page?

With regard to Dave's earlier point around seeing excessively sized bio
chains.. If I set up a large memory box with high dirty mem ratios and
do contiguous buffered overwrites over a 32GB range followed by fsync, I
can see upwards of 1GB per bio and thus chains on the order of 32+ bios
for the entire write. If I play games with how the buffered overwrite is
submitted (i.e., in reverse) however, then I can occasionally reproduce
a ~32GB chain of ~32k bios, which I think is what leads to problems in
I/O completion on some systems. Granted, I don't reproduce soft lockup
issues on my system with that behavior, so perhaps there's more to that
particular issue.

Regardless, it seems reasonable to me to at least have a conservative
limit on the length of an ioend bio chain. Would anybody object to
iomap_ioend growing a chain counter and perhaps forcing into a new ioend
if we chain something like more than 1k bios at once?
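
For concreteness, one rough shape of that idea (hypothetical field name and
threshold, not a tested patch) would be something like:

	/* in struct iomap_ioend */
	unsigned int		io_bio_count;	/* bios chained so far */

	/*
	 * Bumped wherever iomap_chain_bio() is called, and checked in
	 * iomap_can_add_to_ioend() alongside the existing tests:
	 */
	if (wpc->ioend->io_bio_count > IOEND_MAX_BIOS)	/* e.g. 1024 */
		return false;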

Brian
Christoph Hellwig Aug. 24, 2020, 3:04 p.m. UTC | #13
On Mon, Aug 24, 2020 at 10:28:23AM -0400, Brian Foster wrote:
> Do I understand the current code (__bio_try_merge_page() ->
> page_is_mergeable()) correctly in that we're checking for physical page
> contiguity and not necessarily requiring a new bio_vec per physical
> page?


Yes.

> With regard to Dave's earlier point around seeing excessively sized bio
> chains.. If I set up a large memory box with high dirty mem ratios and
> do contiguous buffered overwrites over a 32GB range followed by fsync, I
> can see upwards of 1GB per bio and thus chains on the order of 32+ bios
> for the entire write. If I play games with how the buffered overwrite is
> submitted (i.e., in reverse) however, then I can occasionally reproduce
> a ~32GB chain of ~32k bios, which I think is what leads to problems in
> I/O completion on some systems. Granted, I don't reproduce soft lockup
> issues on my system with that behavior, so perhaps there's more to that
> particular issue.
> 
> Regardless, it seems reasonable to me to at least have a conservative
> limit on the length of an ioend bio chain. Would anybody object to
> iomap_ioend growing a chain counter and perhaps forcing into a new ioend
> if we chain something like more than 1k bios at once?

So what exactly is the problem of processing a long chain in the
workqueue vs multiple small chains?  Maybe we need a cond_resched()
here and there, but I don't see how we'd substantially change behavior.
Brian Foster Aug. 24, 2020, 3:48 p.m. UTC | #14
On Mon, Aug 24, 2020 at 04:04:17PM +0100, Christoph Hellwig wrote:
> On Mon, Aug 24, 2020 at 10:28:23AM -0400, Brian Foster wrote:
> > Do I understand the current code (__bio_try_merge_page() ->
> > page_is_mergeable()) correctly in that we're checking for physical page
> > contiguity and not necessarily requiring a new bio_vec per physical
> > page?
> 
> 
> Yes.
> 

Ok. I also realize now that this occurs on a kernel without commit
07173c3ec276 ("block: enable multipage bvecs"). That is probably a
contributing factor, but it's not clear to me whether it's feasible to
backport whatever supporting infrastructure is required for that
mechanism to work (I suspect not).

> > With regard to Dave's earlier point around seeing excessively sized bio
> > chains.. If I set up a large memory box with high dirty mem ratios and
> > do contiguous buffered overwrites over a 32GB range followed by fsync, I
> > can see upwards of 1GB per bio and thus chains on the order of 32+ bios
> > for the entire write. If I play games with how the buffered overwrite is
> > submitted (i.e., in reverse) however, then I can occasionally reproduce
> > a ~32GB chain of ~32k bios, which I think is what leads to problems in
> > I/O completion on some systems. Granted, I don't reproduce soft lockup
> > issues on my system with that behavior, so perhaps there's more to that
> > particular issue.
> > 
> > Regardless, it seems reasonable to me to at least have a conservative
> > limit on the length of an ioend bio chain. Would anybody object to
> > iomap_ioend growing a chain counter and perhaps forcing into a new ioend
> > if we chain something like more than 1k bios at once?
> 
> So what exactly is the problem of processing a long chain in the
> workqueue vs multiple small chains?  Maybe we need a cond_resched()
> here and there, but I don't see how we'd substantially change behavior.
> 

The immediate problem is a watchdog lockup detection in bio completion:

  NMI watchdog: Watchdog detected hard LOCKUP on cpu 25

This effectively lands at the following segment of iomap_finish_ioend():

		...
               /* walk each page on bio, ending page IO on them */
                bio_for_each_segment_all(bv, bio, iter_all)
                        iomap_finish_page_writeback(inode, bv->bv_page, error);

I suppose we could add a cond_resched(), but is that safe directly
inside of a ->bi_end_io() handler? Another option could be to dump large
chains into the completion workqueue, but we may still need to track the
length to do that. Thoughts?

Brian
Dave Chinner Aug. 25, 2020, 12:42 a.m. UTC | #15
On Mon, Aug 24, 2020 at 11:48:41AM -0400, Brian Foster wrote:
> On Mon, Aug 24, 2020 at 04:04:17PM +0100, Christoph Hellwig wrote:
> > On Mon, Aug 24, 2020 at 10:28:23AM -0400, Brian Foster wrote:
> > > Do I understand the current code (__bio_try_merge_page() ->
> > > page_is_mergeable()) correctly in that we're checking for physical page
> > > contiguity and not necessarily requiring a new bio_vec per physical
> > > page?
> > 
> > 
> > Yes.
> > 
> 
> Ok. I also realize now that this occurs on a kernel without commit
> 07173c3ec276 ("block: enable multipage bvecs"). That is probably a
> contributing factor, but it's not clear to me whether it's feasible to
> backport whatever supporting infrastructure is required for that
> mechanism to work (I suspect not).
> 
> > > With regard to Dave's earlier point around seeing excessively sized bio
> > > chains.. If I set up a large memory box with high dirty mem ratios and
> > > do contiguous buffered overwrites over a 32GB range followed by fsync, I
> > > can see upwards of 1GB per bio and thus chains on the order of 32+ bios
> > > for the entire write. If I play games with how the buffered overwrite is
> > > submitted (i.e., in reverse) however, then I can occasionally reproduce
> > > a ~32GB chain of ~32k bios, which I think is what leads to problems in
> > > I/O completion on some systems. Granted, I don't reproduce soft lockup
> > > issues on my system with that behavior, so perhaps there's more to that
> > > particular issue.
> > > 
> > > Regardless, it seems reasonable to me to at least have a conservative
> > > limit on the length of an ioend bio chain. Would anybody object to
> > > iomap_ioend growing a chain counter and perhaps forcing into a new ioend
> > > if we chain something like more than 1k bios at once?
> > 
> > So what exactly is the problem of processing a long chain in the
> > workqueue vs multiple small chains?  Maybe we need a cond_resched()
> > here and there, but I don't see how we'd substantially change behavior.
> > 
> 
> The immediate problem is a watchdog lockup detection in bio completion:
> 
>   NMI watchdog: Watchdog detected hard LOCKUP on cpu 25
> 
> This effectively lands at the following segment of iomap_finish_ioend():
> 
> 		...
>                /* walk each page on bio, ending page IO on them */
>                 bio_for_each_segment_all(bv, bio, iter_all)
>                         iomap_finish_page_writeback(inode, bv->bv_page, error);
> 
> I suppose we could add a cond_resched(), but is that safe directly
> inside of a ->bi_end_io() handler? Another option could be to dump large
> chains into the completion workqueue, but we may still need to track the
> length to do that. Thoughts?

We have ioend completion merging that will run the completion once
for all the pending ioend completions on that inode. IOWs, we do not
need to build huge chains at submission time to batch up completions
efficiently. However, huge bio chains at submission time do cause
issues with writeback fairness, pinning GBs of ram as unreclaimable
for seconds because they are queued for completion while we are
still submitting the bio chain and submission is being throttled by
the block layer writeback throttle, etc. Not to mention the latency
of stable pages in a situation like this - a mmap() write fault
could stall for many seconds waiting for a huge bio chain to finish
submission and run completion processing even when the IO for the
given page we faulted on was completed before the page fault
occurred...

Hence I think we really do need to cap the length of the bio
chains here so that we start completing and ending page writeback on
large writeback ranges long before the writeback code finishes
submitting the range it was asked to write back.

Cheers,

Dave.
Brian Foster Aug. 25, 2020, 2:49 p.m. UTC | #16
cc Ming

On Tue, Aug 25, 2020 at 10:42:03AM +1000, Dave Chinner wrote:
> On Mon, Aug 24, 2020 at 11:48:41AM -0400, Brian Foster wrote:
> > On Mon, Aug 24, 2020 at 04:04:17PM +0100, Christoph Hellwig wrote:
> > > On Mon, Aug 24, 2020 at 10:28:23AM -0400, Brian Foster wrote:
> > > > Do I understand the current code (__bio_try_merge_page() ->
> > > > page_is_mergeable()) correctly in that we're checking for physical page
> > > > contiguity and not necessarily requiring a new bio_vec per physical
> > > > page?
> > > 
> > > 
> > > Yes.
> > > 
> > 
> > Ok. I also realize now that this occurs on a kernel without commit
> > 07173c3ec276 ("block: enable multipage bvecs"). That is probably a
> > contributing factor, but it's not clear to me whether it's feasible to
> > backport whatever supporting infrastructure is required for that
> > mechanism to work (I suspect not).
> > 
> > > > With regard to Dave's earlier point around seeing excessively sized bio
> > > > chains.. If I set up a large memory box with high dirty mem ratios and
> > > > do contiguous buffered overwrites over a 32GB range followed by fsync, I
> > > > can see upwards of 1GB per bio and thus chains on the order of 32+ bios
> > > > for the entire write. If I play games with how the buffered overwrite is
> > > > submitted (i.e., in reverse) however, then I can occasionally reproduce
> > > > a ~32GB chain of ~32k bios, which I think is what leads to problems in
> > > > I/O completion on some systems. Granted, I don't reproduce soft lockup
> > > > issues on my system with that behavior, so perhaps there's more to that
> > > > particular issue.
> > > > 
> > > > Regardless, it seems reasonable to me to at least have a conservative
> > > > limit on the length of an ioend bio chain. Would anybody object to
> > > > iomap_ioend growing a chain counter and perhaps forcing into a new ioend
> > > > if we chain something like more than 1k bios at once?
> > > 
> > > So what exactly is the problem of processing a long chain in the
> > > workqueue vs multiple small chains?  Maybe we need a cond_resched()
> > > here and there, but I don't see how we'd substantially change behavior.
> > > 
> > 
> > The immediate problem is a watchdog lockup detection in bio completion:
> > 
> >   NMI watchdog: Watchdog detected hard LOCKUP on cpu 25
> > 
> > This effectively lands at the following segment of iomap_finish_ioend():
> > 
> > 		...
> >                /* walk each page on bio, ending page IO on them */
> >                 bio_for_each_segment_all(bv, bio, iter_all)
> >                         iomap_finish_page_writeback(inode, bv->bv_page, error);
> > 
> > I suppose we could add a cond_resched(), but is that safe directly
> > inside of a ->bi_end_io() handler? Another option could be to dump large
> > chains into the completion workqueue, but we may still need to track the
> > length to do that. Thoughts?
> 
> We have ioend completion merging that will run the compeltion once
> for all the pending ioend completions on that inode. IOWs, we do not
> need to build huge chains at submission time to batch up completions
> efficiently. However, huge bio chains at submission time do cause
> issues with writeback fairness, pinning GBs of ram as unreclaimable
> for seconds because they are queued for completion while we are
> still submitting the bio chain and submission is being throttled by
> the block layer writeback throttle, etc. Not to mention the latency
> of stable pages in a situation like this - a mmap() write fault
> could stall for many seconds waiting for a huge bio chain to finish
> submission and run completion processing even when the IO for the
> given page we faulted on was completed before the page fault
> occurred...
> 
> Hence I think we really do need to cap the length of the bio
> chains here so that we start completing and ending page writeback on
> large writeback ranges long before the writeback code finishes
> submitting the range it was asked to write back.
> 

Ming pointed out separately that limiting the bio chain itself might not
be enough because with multipage bvecs, we can effectively capture the
same number of pages in much fewer bios. Given that, what do you think
about something like the patch below to limit ioend size? This
effectively limits the number of pages per ioend regardless of whether
in-core state results in a small chain of dense bios or a large chain of
smaller bios, without requiring any new explicit page count tracking.
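
For scale, the cap used in the patch below works out to (simple arithmetic,
one assumed page size per line):

	262144 << PAGE_SHIFT = 262144 * 4096  bytes = 1GB   (4k pages)
	                     = 262144 * 65536 bytes = 16GB  (64k pages)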

Brian

--- 8< ---

diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
index 6ae98d3cb157..4aa96705ffd7 100644
--- a/fs/iomap/buffered-io.c
+++ b/fs/iomap/buffered-io.c
@@ -1301,7 +1301,7 @@ iomap_chain_bio(struct bio *prev)
 
 static bool
 iomap_can_add_to_ioend(struct iomap_writepage_ctx *wpc, loff_t offset,
-		sector_t sector)
+		unsigned len, sector_t sector)
 {
 	if ((wpc->iomap.flags & IOMAP_F_SHARED) !=
 	    (wpc->ioend->io_flags & IOMAP_F_SHARED))
@@ -1312,6 +1312,8 @@ iomap_can_add_to_ioend(struct iomap_writepage_ctx *wpc, loff_t offset,
 		return false;
 	if (sector != bio_end_sector(wpc->ioend->io_bio))
 		return false;
+	if (wpc->ioend->io_size + len > IOEND_MAX_IOSIZE)
+		return false;
 	return true;
 }
 
@@ -1329,7 +1331,7 @@ iomap_add_to_ioend(struct inode *inode, loff_t offset, struct page *page,
 	unsigned poff = offset & (PAGE_SIZE - 1);
 	bool merged, same_page = false;
 
-	if (!wpc->ioend || !iomap_can_add_to_ioend(wpc, offset, sector)) {
+	if (!wpc->ioend || !iomap_can_add_to_ioend(wpc, offset, len, sector)) {
 		if (wpc->ioend)
 			list_add(&wpc->ioend->io_list, iolist);
 		wpc->ioend = iomap_alloc_ioend(inode, wpc, offset, sector, wbc);
diff --git a/include/linux/iomap.h b/include/linux/iomap.h
index 4d1d3c3469e9..5d1b1a08ec96 100644
--- a/include/linux/iomap.h
+++ b/include/linux/iomap.h
@@ -200,6 +200,8 @@ struct iomap_ioend {
 	struct bio		io_inline_bio;	/* MUST BE LAST! */
 };
 
+#define IOEND_MAX_IOSIZE	(262144 << PAGE_SHIFT)
+
 struct iomap_writeback_ops {
 	/*
 	 * Required, maps the blocks so that writeback can be performed on
Ming Lei Aug. 31, 2020, 4:01 a.m. UTC | #17
On Tue, Aug 25, 2020 at 10:49:17AM -0400, Brian Foster wrote:
> cc Ming
> 
> On Tue, Aug 25, 2020 at 10:42:03AM +1000, Dave Chinner wrote:
> > On Mon, Aug 24, 2020 at 11:48:41AM -0400, Brian Foster wrote:
> > > On Mon, Aug 24, 2020 at 04:04:17PM +0100, Christoph Hellwig wrote:
> > > > On Mon, Aug 24, 2020 at 10:28:23AM -0400, Brian Foster wrote:
> > > > > Do I understand the current code (__bio_try_merge_page() ->
> > > > > page_is_mergeable()) correctly in that we're checking for physical page
> > > > > contiguity and not necessarily requiring a new bio_vec per physical
> > > > > page?
> > > > 
> > > > 
> > > > Yes.
> > > > 
> > > 
> > > Ok. I also realize now that this occurs on a kernel without commit
> > > 07173c3ec276 ("block: enable multipage bvecs"). That is probably a
> > > contributing factor, but it's not clear to me whether it's feasible to
> > > backport whatever supporting infrastructure is required for that
> > > mechanism to work (I suspect not).
> > > 
> > > > > With regard to Dave's earlier point around seeing excessively sized bio
> > > > > chains.. If I set up a large memory box with high dirty mem ratios and
> > > > > do contiguous buffered overwrites over a 32GB range followed by fsync, I
> > > > > can see upwards of 1GB per bio and thus chains on the order of 32+ bios
> > > > > for the entire write. If I play games with how the buffered overwrite is
> > > > > submitted (i.e., in reverse) however, then I can occasionally reproduce
> > > > > a ~32GB chain of ~32k bios, which I think is what leads to problems in
> > > > > I/O completion on some systems. Granted, I don't reproduce soft lockup
> > > > > issues on my system with that behavior, so perhaps there's more to that
> > > > > particular issue.
> > > > > 
> > > > > Regardless, it seems reasonable to me to at least have a conservative
> > > > > limit on the length of an ioend bio chain. Would anybody object to
> > > > > iomap_ioend growing a chain counter and perhaps forcing into a new ioend
> > > > > if we chain something like more than 1k bios at once?
> > > > 
> > > > So what exactly is the problem of processing a long chain in the
> > > > workqueue vs multiple small chains?  Maybe we need a cond_resched()
> > > > here and there, but I don't see how we'd substantially change behavior.
> > > > 
> > > 
> > > The immediate problem is a watchdog lockup detection in bio completion:
> > > 
> > >   NMI watchdog: Watchdog detected hard LOCKUP on cpu 25
> > > 
> > > This effectively lands at the following segment of iomap_finish_ioend():
> > > 
> > > 		...
> > >                /* walk each page on bio, ending page IO on them */
> > >                 bio_for_each_segment_all(bv, bio, iter_all)
> > >                         iomap_finish_page_writeback(inode, bv->bv_page, error);
> > > 
> > > I suppose we could add a cond_resched(), but is that safe directly
> > > inside of a ->bi_end_io() handler? Another option could be to dump large
> > > chains into the completion workqueue, but we may still need to track the
> > > length to do that. Thoughts?
> > 
> > We have ioend completion merging that will run the compeltion once
> > for all the pending ioend completions on that inode. IOWs, we do not
> > need to build huge chains at submission time to batch up completions
> > efficiently. However, huge bio chains at submission time do cause
> > issues with writeback fairness, pinning GBs of ram as unreclaimable
> > for seconds because they are queued for completion while we are
> > still submitting the bio chain and submission is being throttled by
> > the block layer writeback throttle, etc. Not to mention the latency
> > of stable pages in a situation like this - a mmap() write fault
> > could stall for many seconds waiting for a huge bio chain to finish
> > submission and run completion processing even when the IO for the
> > given page we faulted on was completed before the page fault
> > occurred...
> > 
> > Hence I think we really do need to cap the length of the bio
> > chains here so that we start completing and ending page writeback on
> > large writeback ranges long before the writeback code finishes
> > submitting the range it was asked to write back.
> > 
> 
> Ming pointed out separately that limiting the bio chain itself might not
> be enough because with multipage bvecs, we can effectively capture the
> same number of pages in much fewer bios. Given that, what do you think
> about something like the patch below to limit ioend size? This
> effectively limits the number of pages per ioend regardless of whether
> in-core state results in a small chain of dense bios or a large chain of
> smaller bios, without requiring any new explicit page count tracking.

Hello Brian,

This patch looks fine.

However, I am wondering why iomap has to chain bios in one ioend, and why not
submit each bio in the usual way, just like fs/direct-io.c does? Then each bio
can complete the pages in its own .bi_end_io().


thanks,
Ming
Brian Foster Aug. 31, 2020, 2:35 p.m. UTC | #18
On Mon, Aug 31, 2020 at 12:01:07PM +0800, Ming Lei wrote:
> On Tue, Aug 25, 2020 at 10:49:17AM -0400, Brian Foster wrote:
> > cc Ming
> > 
> > On Tue, Aug 25, 2020 at 10:42:03AM +1000, Dave Chinner wrote:
> > > On Mon, Aug 24, 2020 at 11:48:41AM -0400, Brian Foster wrote:
> > > > On Mon, Aug 24, 2020 at 04:04:17PM +0100, Christoph Hellwig wrote:
> > > > > On Mon, Aug 24, 2020 at 10:28:23AM -0400, Brian Foster wrote:
> > > > > > Do I understand the current code (__bio_try_merge_page() ->
> > > > > > page_is_mergeable()) correctly in that we're checking for physical page
> > > > > > contiguity and not necessarily requiring a new bio_vec per physical
> > > > > > page?
> > > > > 
> > > > > 
> > > > > Yes.
> > > > > 
> > > > 
> > > > Ok. I also realize now that this occurs on a kernel without commit
> > > > 07173c3ec276 ("block: enable multipage bvecs"). That is probably a
> > > > contributing factor, but it's not clear to me whether it's feasible to
> > > > backport whatever supporting infrastructure is required for that
> > > > mechanism to work (I suspect not).
> > > > 
> > > > > > With regard to Dave's earlier point around seeing excessively sized bio
> > > > > > chains.. If I set up a large memory box with high dirty mem ratios and
> > > > > > do contiguous buffered overwrites over a 32GB range followed by fsync, I
> > > > > > can see upwards of 1GB per bio and thus chains on the order of 32+ bios
> > > > > > for the entire write. If I play games with how the buffered overwrite is
> > > > > > submitted (i.e., in reverse) however, then I can occasionally reproduce
> > > > > > a ~32GB chain of ~32k bios, which I think is what leads to problems in
> > > > > > I/O completion on some systems. Granted, I don't reproduce soft lockup
> > > > > > issues on my system with that behavior, so perhaps there's more to that
> > > > > > particular issue.
> > > > > > 
> > > > > > Regardless, it seems reasonable to me to at least have a conservative
> > > > > > limit on the length of an ioend bio chain. Would anybody object to
> > > > > > iomap_ioend growing a chain counter and perhaps forcing into a new ioend
> > > > > > if we chain something like more than 1k bios at once?
> > > > > 
> > > > > So what exactly is the problem of processing a long chain in the
> > > > > workqueue vs multiple small chains?  Maybe we need a cond_resched()
> > > > > here and there, but I don't see how we'd substantially change behavior.
> > > > > 
> > > > 
> > > > The immediate problem is a watchdog lockup detection in bio completion:
> > > > 
> > > >   NMI watchdog: Watchdog detected hard LOCKUP on cpu 25
> > > > 
> > > > This effectively lands at the following segment of iomap_finish_ioend():
> > > > 
> > > > 		...
> > > >                /* walk each page on bio, ending page IO on them */
> > > >                 bio_for_each_segment_all(bv, bio, iter_all)
> > > >                         iomap_finish_page_writeback(inode, bv->bv_page, error);
> > > > 
> > > > I suppose we could add a cond_resched(), but is that safe directly
> > > > inside of a ->bi_end_io() handler? Another option could be to dump large
> > > > chains into the completion workqueue, but we may still need to track the
> > > > length to do that. Thoughts?
> > > 
> > > We have ioend completion merging that will run the compeltion once
> > > for all the pending ioend completions on that inode. IOWs, we do not
> > > need to build huge chains at submission time to batch up completions
> > > efficiently. However, huge bio chains at submission time do cause
> > > issues with writeback fairness, pinning GBs of ram as unreclaimable
> > > for seconds because they are queued for completion while we are
> > > still submitting the bio chain and submission is being throttled by
> > > the block layer writeback throttle, etc. Not to mention the latency
> > > of stable pages in a situation like this - a mmap() write fault
> > > could stall for many seconds waiting for a huge bio chain to finish
> > > submission and run completion processing even when the IO for the
> > > given page we faulted on was completed before the page fault
> > > occurred...
> > > 
> > > Hence I think we really do need to cap the length of the bio
> > > chains here so that we start completing and ending page writeback on
> > > large writeback ranges long before the writeback code finishes
> > > submitting the range it was asked to write back.
> > > 
> > 
> > Ming pointed out separately that limiting the bio chain itself might not
> > be enough because with multipage bvecs, we can effectively capture the
> > same number of pages in much fewer bios. Given that, what do you think
> > about something like the patch below to limit ioend size? This
> > effectively limits the number of pages per ioend regardless of whether
> > in-core state results in a small chain of dense bios or a large chain of
> > smaller bios, without requiring any new explicit page count tracking.
> 
> Hello Brian,
> 
> This patch looks fine.
> 
> However, I am wondering why iomap has to chain bios in one ioend, and why not
> submit each bio in usual way just like what fs/direct-io.c does? Then each bio
> can complete the pages in its own .bi_end_io().
> 

I think it's mainly for efficiency and code simplicity reasons. The
ioend describes a contiguous range of blocks with the same io type
(written, unwritten, append, etc.), so whatever post-completion action
might be required for a particular ioend (i.e. unwritten conversion)
shouldn't execute until I/O completes on the entire range. I believe
this goes back to XFS commit 0e51a8e191db ("xfs: optimize bio handling
in the buffer writeback path"), which basically reimplemented similar,
custom ioend behavior to rely on bio chains and was eventually lifted
from XFS into iomap.

Brian

> 
> thanks,
> Ming
>
Darrick J. Wong Sept. 16, 2020, 12:12 a.m. UTC | #19
On Tue, Aug 25, 2020 at 10:49:17AM -0400, Brian Foster wrote:
> cc Ming
> 
> On Tue, Aug 25, 2020 at 10:42:03AM +1000, Dave Chinner wrote:
> > On Mon, Aug 24, 2020 at 11:48:41AM -0400, Brian Foster wrote:
> > > On Mon, Aug 24, 2020 at 04:04:17PM +0100, Christoph Hellwig wrote:
> > > > On Mon, Aug 24, 2020 at 10:28:23AM -0400, Brian Foster wrote:
> > > > > Do I understand the current code (__bio_try_merge_page() ->
> > > > > page_is_mergeable()) correctly in that we're checking for physical page
> > > > > contiguity and not necessarily requiring a new bio_vec per physical
> > > > > page?
> > > > 
> > > > 
> > > > Yes.
> > > > 
> > > 
> > > Ok. I also realize now that this occurs on a kernel without commit
> > > 07173c3ec276 ("block: enable multipage bvecs"). That is probably a
> > > contributing factor, but it's not clear to me whether it's feasible to
> > > backport whatever supporting infrastructure is required for that
> > > mechanism to work (I suspect not).
> > > 
> > > > > With regard to Dave's earlier point around seeing excessively sized bio
> > > > > chains.. If I set up a large memory box with high dirty mem ratios and
> > > > > do contiguous buffered overwrites over a 32GB range followed by fsync, I
> > > > > can see upwards of 1GB per bio and thus chains on the order of 32+ bios
> > > > > for the entire write. If I play games with how the buffered overwrite is
> > > > > submitted (i.e., in reverse) however, then I can occasionally reproduce
> > > > > a ~32GB chain of ~32k bios, which I think is what leads to problems in
> > > > > I/O completion on some systems. Granted, I don't reproduce soft lockup
> > > > > issues on my system with that behavior, so perhaps there's more to that
> > > > > particular issue.
> > > > > 
> > > > > Regardless, it seems reasonable to me to at least have a conservative
> > > > > limit on the length of an ioend bio chain. Would anybody object to
> > > > > iomap_ioend growing a chain counter and perhaps forcing into a new ioend
> > > > > if we chain something like more than 1k bios at once?
> > > > 
> > > > So what exactly is the problem of processing a long chain in the
> > > > workqueue vs multiple small chains?  Maybe we need a cond_resched()
> > > > here and there, but I don't see how we'd substantially change behavior.
> > > > 
> > > 
> > > The immediate problem is a watchdog lockup detection in bio completion:
> > > 
> > >   NMI watchdog: Watchdog detected hard LOCKUP on cpu 25
> > > 
> > > This effectively lands at the following segment of iomap_finish_ioend():
> > > 
> > > 		...
> > >                /* walk each page on bio, ending page IO on them */
> > >                 bio_for_each_segment_all(bv, bio, iter_all)
> > >                         iomap_finish_page_writeback(inode, bv->bv_page, error);
> > > 
> > > I suppose we could add a cond_resched(), but is that safe directly
> > > inside of a ->bi_end_io() handler? Another option could be to dump large
> > > chains into the completion workqueue, but we may still need to track the
> > > length to do that. Thoughts?
> > 
> > We have ioend completion merging that will run the compeltion once
> > for all the pending ioend completions on that inode. IOWs, we do not
> > need to build huge chains at submission time to batch up completions
> > efficiently. However, huge bio chains at submission time do cause
> > issues with writeback fairness, pinning GBs of ram as unreclaimable
> > for seconds because they are queued for completion while we are
> > still submitting the bio chain and submission is being throttled by
> > the block layer writeback throttle, etc. Not to mention the latency
> > of stable pages in a situation like this - a mmap() write fault
> > could stall for many seconds waiting for a huge bio chain to finish
> > submission and run completion processing even when the IO for the
> > given page we faulted on was completed before the page fault
> > occurred...
> > 
> > Hence I think we really do need to cap the length of the bio
> > chains here so that we start completing and ending page writeback on
> > large writeback ranges long before the writeback code finishes
> > submitting the range it was asked to write back.
> > 
> 
> Ming pointed out separately that limiting the bio chain itself might not
> be enough because with multipage bvecs, we can effectively capture the
> same number of pages in much fewer bios. Given that, what do you think
> about something like the patch below to limit ioend size? This
> effectively limits the number of pages per ioend regardless of whether
> in-core state results in a small chain of dense bios or a large chain of
> smaller bios, without requiring any new explicit page count tracking.
> 
> Brian

Dave was asking on IRC if I was going to pull this patch in.  I'm unsure
of its status (other than it hasn't been sent as a proper [PATCH]) so I
wonder, is this necessary, and if so, can it be cleaned up and
submitted?

--D

> --- 8< ---
> 
> diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
> index 6ae98d3cb157..4aa96705ffd7 100644
> --- a/fs/iomap/buffered-io.c
> +++ b/fs/iomap/buffered-io.c
> @@ -1301,7 +1301,7 @@ iomap_chain_bio(struct bio *prev)
>  
>  static bool
>  iomap_can_add_to_ioend(struct iomap_writepage_ctx *wpc, loff_t offset,
> -		sector_t sector)
> +		unsigned len, sector_t sector)
>  {
>  	if ((wpc->iomap.flags & IOMAP_F_SHARED) !=
>  	    (wpc->ioend->io_flags & IOMAP_F_SHARED))
> @@ -1312,6 +1312,8 @@ iomap_can_add_to_ioend(struct iomap_writepage_ctx *wpc, loff_t offset,
>  		return false;
>  	if (sector != bio_end_sector(wpc->ioend->io_bio))
>  		return false;
> +	if (wpc->ioend->io_size + len > IOEND_MAX_IOSIZE)
> +		return false;
>  	return true;
>  }
>  
> @@ -1329,7 +1331,7 @@ iomap_add_to_ioend(struct inode *inode, loff_t offset, struct page *page,
>  	unsigned poff = offset & (PAGE_SIZE - 1);
>  	bool merged, same_page = false;
>  
> -	if (!wpc->ioend || !iomap_can_add_to_ioend(wpc, offset, sector)) {
> +	if (!wpc->ioend || !iomap_can_add_to_ioend(wpc, offset, len, sector)) {
>  		if (wpc->ioend)
>  			list_add(&wpc->ioend->io_list, iolist);
>  		wpc->ioend = iomap_alloc_ioend(inode, wpc, offset, sector, wbc);
> diff --git a/include/linux/iomap.h b/include/linux/iomap.h
> index 4d1d3c3469e9..5d1b1a08ec96 100644
> --- a/include/linux/iomap.h
> +++ b/include/linux/iomap.h
> @@ -200,6 +200,8 @@ struct iomap_ioend {
>  	struct bio		io_inline_bio;	/* MUST BE LAST! */
>  };
>  
> +#define IOEND_MAX_IOSIZE	(262144 << PAGE_SHIFT)
> +
>  struct iomap_writeback_ops {
>  	/*
>  	 * Required, maps the blocks so that writeback can be performed on
>
Christoph Hellwig Sept. 16, 2020, 8:45 a.m. UTC | #20
On Tue, Sep 15, 2020 at 05:12:42PM -0700, Darrick J. Wong wrote:
> On Tue, Aug 25, 2020 at 10:49:17AM -0400, Brian Foster wrote:
> > cc Ming
> > 
> > On Tue, Aug 25, 2020 at 10:42:03AM +1000, Dave Chinner wrote:
> > > On Mon, Aug 24, 2020 at 11:48:41AM -0400, Brian Foster wrote:
> > > > On Mon, Aug 24, 2020 at 04:04:17PM +0100, Christoph Hellwig wrote:
> > > > > On Mon, Aug 24, 2020 at 10:28:23AM -0400, Brian Foster wrote:
> > > > > > Do I understand the current code (__bio_try_merge_page() ->
> > > > > > page_is_mergeable()) correctly in that we're checking for physical page
> > > > > > contiguity and not necessarily requiring a new bio_vec per physical
> > > > > > page?
> > > > > 
> > > > > 
> > > > > Yes.
> > > > > 
> > > > 
> > > > Ok. I also realize now that this occurs on a kernel without commit
> > > > 07173c3ec276 ("block: enable multipage bvecs"). That is probably a
> > > > contributing factor, but it's not clear to me whether it's feasible to
> > > > backport whatever supporting infrastructure is required for that
> > > > mechanism to work (I suspect not).
> > > > 
> > > > > > With regard to Dave's earlier point around seeing excessively sized bio
> > > > > > chains.. If I set up a large memory box with high dirty mem ratios and
> > > > > > do contiguous buffered overwrites over a 32GB range followed by fsync, I
> > > > > > can see upwards of 1GB per bio and thus chains on the order of 32+ bios
> > > > > > for the entire write. If I play games with how the buffered overwrite is
> > > > > > submitted (i.e., in reverse) however, then I can occasionally reproduce
> > > > > > a ~32GB chain of ~32k bios, which I think is what leads to problems in
> > > > > > I/O completion on some systems. Granted, I don't reproduce soft lockup
> > > > > > issues on my system with that behavior, so perhaps there's more to that
> > > > > > particular issue.
> > > > > > 
> > > > > > Regardless, it seems reasonable to me to at least have a conservative
> > > > > > limit on the length of an ioend bio chain. Would anybody object to
> > > > > > iomap_ioend growing a chain counter and perhaps forcing into a new ioend
> > > > > > if we chain something like more than 1k bios at once?
> > > > > 
> > > > > So what exactly is the problem of processing a long chain in the
> > > > > workqueue vs multiple small chains?  Maybe we need a cond_resched()
> > > > > here and there, but I don't see how we'd substantially change behavior.
> > > > > 
> > > > 
> > > > The immediate problem is a watchdog lockup detection in bio completion:
> > > > 
> > > >   NMI watchdog: Watchdog detected hard LOCKUP on cpu 25
> > > > 
> > > > This effectively lands at the following segment of iomap_finish_ioend():
> > > > 
> > > > 		...
> > > >                /* walk each page on bio, ending page IO on them */
> > > >                 bio_for_each_segment_all(bv, bio, iter_all)
> > > >                         iomap_finish_page_writeback(inode, bv->bv_page, error);
> > > > 
> > > > I suppose we could add a cond_resched(), but is that safe directly
> > > > inside of a ->bi_end_io() handler? Another option could be to dump large
> > > > chains into the completion workqueue, but we may still need to track the
> > > > length to do that. Thoughts?
> > > 
> > > We have ioend completion merging that will run the completion once
> > > for all the pending ioend completions on that inode. IOWs, we do not
> > > need to build huge chains at submission time to batch up completions
> > > efficiently. However, huge bio chains at submission time do cause
> > > issues with writeback fairness, pinning GBs of ram as unreclaimable
> > > for seconds because they are queued for completion while we are
> > > still submitting the bio chain and submission is being throttled by
> > > the block layer writeback throttle, etc. Not to mention the latency
> > > of stable pages in a situation like this - a mmap() write fault
> > > could stall for many seconds waiting for a huge bio chain to finish
> > > submission and run completion processing even when the IO for the
> > > given page we faulted on was completed before the page fault
> > > occurred...
> > > 
> > > Hence I think we really do need to cap the length of the bio
> > > chains here so that we start completing and ending page writeback on
> > > large writeback ranges long before the writeback code finishes
> > > submitting the range it was asked to write back.
> > > 
> > 
> > Ming pointed out separately that limiting the bio chain itself might not
> > be enough because with multipage bvecs, we can effectively capture the
> > same number of pages in much fewer bios. Given that, what do you think
> > about something like the patch below to limit ioend size? This
> > effectively limits the number of pages per ioend regardless of whether
> > in-core state results in a small chain of dense bios or a large chain of
> > smaller bios, without requiring any new explicit page count tracking.
> > 
> > Brian
> 
> Dave was asking on IRC if I was going to pull this patch in.  I'm unsure
> of its status (other than it hasn't been sent as a proper [PATCH]) so I
> wonder, is this necessary, and if so, can it be cleaned up and
> submitted?

Maybe it is lost somewhere, but what is the point of this patch?
What does the magic number try to represent?
Brian Foster Sept. 16, 2020, 1:07 p.m. UTC | #21
On Wed, Sep 16, 2020 at 09:45:10AM +0100, Christoph Hellwig wrote:
> On Tue, Sep 15, 2020 at 05:12:42PM -0700, Darrick J. Wong wrote:
> > On Tue, Aug 25, 2020 at 10:49:17AM -0400, Brian Foster wrote:
> > > cc Ming
> > > 
> > > On Tue, Aug 25, 2020 at 10:42:03AM +1000, Dave Chinner wrote:
> > > > On Mon, Aug 24, 2020 at 11:48:41AM -0400, Brian Foster wrote:
> > > > > On Mon, Aug 24, 2020 at 04:04:17PM +0100, Christoph Hellwig wrote:
> > > > > > On Mon, Aug 24, 2020 at 10:28:23AM -0400, Brian Foster wrote:
> > > > > > > Do I understand the current code (__bio_try_merge_page() ->
> > > > > > > page_is_mergeable()) correctly in that we're checking for physical page
> > > > > > > contiguity and not necessarily requiring a new bio_vec per physical
> > > > > > > page?
> > > > > > 
> > > > > > 
> > > > > > Yes.
> > > > > > 
> > > > > 
> > > > > Ok. I also realize now that this occurs on a kernel without commit
> > > > > 07173c3ec276 ("block: enable multipage bvecs"). That is probably a
> > > > > contributing factor, but it's not clear to me whether it's feasible to
> > > > > backport whatever supporting infrastructure is required for that
> > > > > mechanism to work (I suspect not).
> > > > > 
> > > > > > > With regard to Dave's earlier point around seeing excessively sized bio
> > > > > > > chains.. If I set up a large memory box with high dirty mem ratios and
> > > > > > > do contiguous buffered overwrites over a 32GB range followed by fsync, I
> > > > > > > can see upwards of 1GB per bio and thus chains on the order of 32+ bios
> > > > > > > for the entire write. If I play games with how the buffered overwrite is
> > > > > > > submitted (i.e., in reverse) however, then I can occasionally reproduce
> > > > > > > a ~32GB chain of ~32k bios, which I think is what leads to problems in
> > > > > > > I/O completion on some systems. Granted, I don't reproduce soft lockup
> > > > > > > issues on my system with that behavior, so perhaps there's more to that
> > > > > > > particular issue.
> > > > > > > 
> > > > > > > Regardless, it seems reasonable to me to at least have a conservative
> > > > > > > limit on the length of an ioend bio chain. Would anybody object to
> > > > > > > iomap_ioend growing a chain counter and perhaps forcing into a new ioend
> > > > > > > if we chain something like more than 1k bios at once?
> > > > > > 
> > > > > > So what exactly is the problem of processing a long chain in the
> > > > > > workqueue vs multiple small chains?  Maybe we need a cond_resched()
> > > > > > here and there, but I don't see how we'd substantially change behavior.
> > > > > > 
> > > > > 
> > > > > The immediate problem is a watchdog lockup detection in bio completion:
> > > > > 
> > > > >   NMI watchdog: Watchdog detected hard LOCKUP on cpu 25
> > > > > 
> > > > > This effectively lands at the following segment of iomap_finish_ioend():
> > > > > 
> > > > > 		...
> > > > >                /* walk each page on bio, ending page IO on them */
> > > > >                 bio_for_each_segment_all(bv, bio, iter_all)
> > > > >                         iomap_finish_page_writeback(inode, bv->bv_page, error);
> > > > > 
> > > > > I suppose we could add a cond_resched(), but is that safe directly
> > > > > inside of a ->bi_end_io() handler? Another option could be to dump large
> > > > > chains into the completion workqueue, but we may still need to track the
> > > > > length to do that. Thoughts?
> > > > 
> > > > We have ioend completion merging that will run the completion once
> > > > for all the pending ioend completions on that inode. IOWs, we do not
> > > > need to build huge chains at submission time to batch up completions
> > > > efficiently. However, huge bio chains at submission time do cause
> > > > issues with writeback fairness, pinning GBs of ram as unreclaimable
> > > > for seconds because they are queued for completion while we are
> > > > still submitting the bio chain and submission is being throttled by
> > > > the block layer writeback throttle, etc. Not to mention the latency
> > > > of stable pages in a situation like this - a mmap() write fault
> > > > could stall for many seconds waiting for a huge bio chain to finish
> > > > submission and run completion processing even when the IO for the
> > > > given page we faulted on was completed before the page fault
> > > > occurred...
> > > > 
> > > > Hence I think we really do need to cap the length of the bio
> > > > chains here so that we start completing and ending page writeback on
> > > > large writeback ranges long before the writeback code finishes
> > > > submitting the range it was asked to write back.
> > > > 
> > > 
> > > Ming pointed out separately that limiting the bio chain itself might not
> > > be enough because with multipage bvecs, we can effectively capture the
> > > same number of pages in much fewer bios. Given that, what do you think
> > > about something like the patch below to limit ioend size? This
> > > effectively limits the number of pages per ioend regardless of whether
> > > in-core state results in a small chain of dense bios or a large chain of
> > > smaller bios, without requiring any new explicit page count tracking.
> > > 
> > > Brian
> > 
> > Dave was asking on IRC if I was going to pull this patch in.  I'm unsure
> > of its status (other than it hasn't been sent as a proper [PATCH]) so I
> > wonder, is this necessary, and if so, can it be cleaned up and
> > submitted?
> 

I was waiting on some feedback from a few different angles before
posting a proper patch...

> Maybe it is lost somewhere, but what is the point of this patch?
> What does the magic number try to represent?
> 

Dave described the main purpose earlier in this thread [1]. The initial
motivation is that we've had downstream reports of soft lockup problems
in writeback bio completion down in the bio -> bvec loop of
iomap_finish_ioend() that has to finish writeback on each individual
page of insanely large bios and/or chains. We've also had an upstream
report of a similar problem on linux-xfs [2].

The magic number itself was just pulled out of a hat. I picked it
because it seemed conservative enough to still allow large contiguous
bios (1GB w/ 4k pages) while hopefully preventing I/O completion
problems, but was hoping for some feedback on that bit if the general
approach was acceptable. I was also waiting for some feedback from either
of the two users who reported the problem, but I don't think I've heard
back on that yet...
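
For reference, the IOEND_MAX_IOSIZE definition quoted above works out to
262144 pages, i.e. 1GB worth of pages with a 4k PAGE_SIZE. The enforcement
point is iomap_can_add_to_ioend(); roughly like the sketch below (sketch
only; the exact hunk isn't quoted in this thread, and the field names follow
the 5.8-era struct iomap_ioend):

static bool
iomap_can_add_to_ioend(struct iomap_writepage_ctx *wpc, loff_t offset,
		unsigned len, sector_t sector)
{
	/* ... existing type/contiguity checks elided ... */
	if (offset != wpc->ioend->io_offset + wpc->ioend->io_size)
		return false;
	if (sector != bio_end_sector(wpc->ioend->io_bio))
		return false;
	/* new: bound the ioend so per-ioend completion work stays cheap */
	if (wpc->ioend->io_size + len > IOEND_MAX_IOSIZE)
		return false;
	return true;
}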

Brian

[1] https://lore.kernel.org/linux-fsdevel/20200821215358.GG7941@dread.disaster.area/
[2] https://lore.kernel.org/linux-xfs/alpine.LRH.2.02.2008311513150.7870@file01.intranet.prod.int.rdu2.redhat.com/
Christoph Hellwig Sept. 17, 2020, 8:04 a.m. UTC | #22
On Wed, Sep 16, 2020 at 09:07:14AM -0400, Brian Foster wrote:
> Dave described the main purpose earlier in this thread [1]. The initial
> motivation is that we've had downstream reports of soft lockup problems
> in writeback bio completion down in the bio -> bvec loop of
> iomap_finish_ioend() that has to finish writeback on each individual
> page of insanely large bios and/or chains. We've also had an upstream
> report of a similar problem on linux-xfs [2].
> 
> The magic number itself was just pulled out of a hat. I picked it
> because it seemed conservative enough to still allow large contiguous
> bios (1GB w/ 4k pages) while hopefully preventing I/O completion
> problems, but was hoping for some feedback on that bit if the general
> approach was acceptable. I was also waiting for some feedback from either
> of the two users who reported the problem, but I don't think I've heard
> back on that yet...

I think the saner answer is to always run large completions in the
workqueue, and add a bunch of cond_resched() calls, rather than
arbitrarily breaking up the I/O size.
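
Something like the sketch below, presumably (not a tested patch; it assumes
iomap_finish_ioends() only ever runs from the completion workqueue, i.e.
process context where cond_resched() is legal):

void
iomap_finish_ioends(struct iomap_ioend *ioend, int error)
{
	struct list_head tmp;

	list_replace_init(&ioend->io_list, &tmp);
	iomap_finish_ioend(ioend, error);

	while (!list_empty(&tmp)) {
		ioend = list_first_entry(&tmp, struct iomap_ioend, io_list);
		list_del_init(&ioend->io_list);
		iomap_finish_ioend(ioend, error);
		/* yield between ioends so long chains cannot hog the CPU */
		cond_resched();
	}
}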
Brian Foster Sept. 17, 2020, 10:42 a.m. UTC | #23
On Thu, Sep 17, 2020 at 09:04:55AM +0100, Christoph Hellwig wrote:
> On Wed, Sep 16, 2020 at 09:07:14AM -0400, Brian Foster wrote:
> > Dave described the main purpose earlier in this thread [1]. The initial
> > motivation is that we've had downstream reports of soft lockup problems
> > in writeback bio completion down in the bio -> bvec loop of
> > iomap_finish_ioend() that has to finish writeback on each individual
> > page of insanely large bios and/or chains. We've also had an upstream
> > reports of a similar problem on linux-xfs [2].
> > 
> > The magic number itself was just pulled out of a hat. I picked it
> > because it seemed conservative enough to still allow large contiguous
> > bios (1GB w/ 4k pages) while hopefully preventing I/O completion
> > problems, but was hoping for some feedback on that bit if the general
> > approach was acceptable. I was also waiting for some feedback on either
> > of the two users who reported the problem but I don't think I've heard
> > back on that yet...
> 
> I think the saner answer is to always run large completions in the
> workqueue, and add a bunch of cond_resched() calls, rather than
> arbitrarily breaking up the I/O size.
> 

That wouldn't address the latency concern Dave brought up. That said, I
have no issue with this as a targeted solution for the softlockup issue.
iomap_finish_ioend[s]() is common code for both the workqueue and
->bi_end_io() contexts so that would require either some kind of context
detection (and my understanding is in_atomic() is unreliable/frowned
upon) or a new "atomic" parameter through iomap_finish_ioend[s]() to
indicate whether it's safe to reschedule. Preference?
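
(For concreteness, the second option might look roughly like the sketch
below; the extra "atomic" argument is hypothetical, not an existing
interface, and only the page-writeback loop of iomap_finish_ioend() is
shown.)

static void
iomap_finish_ioend(struct iomap_ioend *ioend, int error, bool atomic)
{
	struct inode *inode = ioend->io_inode;
	struct bio *bio, *next, *last = ioend->io_bio;

	for (bio = &ioend->io_inline_bio; bio; bio = next) {
		struct bio_vec *bv;
		struct bvec_iter_all iter_all;

		/* the last bio's bi_private does not point at another bio */
		next = (bio == last) ? NULL : bio->bi_private;

		/* walk each page on bio, ending page IO on them */
		bio_for_each_segment_all(bv, bio, iter_all)
			iomap_finish_page_writeback(inode, bv->bv_page, error);
		bio_put(bio);

		/* only reschedule when the caller says this is process context */
		if (!atomic)
			cond_resched();
	}
}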

Brian
Christoph Hellwig Sept. 17, 2020, 2:48 p.m. UTC | #24
On Thu, Sep 17, 2020 at 06:42:19AM -0400, Brian Foster wrote:
> That wouldn't address the latency concern Dave brought up. That said, I
> have no issue with this as a targeted solution for the softlockup issue.
> iomap_finish_ioend[s]() is common code for both the workqueue and
> ->bi_end_io() contexts so that would require either some kind of context
> detection (and my understanding is in_atomic() is unreliable/frowned
> upon) or a new "atomic" parameter through iomap_finish_ioend[s]() to
> indicate whether it's safe to reschedule. Preference?

True, it would not help with latency.  But then again the latency
should be controlled by the writeback code not doing giant writebacks
to start with, shouldn't it?

Any XFS/iomap specific limit also would not help with the block layer
merging bios.
Darrick J. Wong Sept. 17, 2020, 9:33 p.m. UTC | #25
On Thu, Sep 17, 2020 at 03:48:04PM +0100, Christoph Hellwig wrote:
> On Thu, Sep 17, 2020 at 06:42:19AM -0400, Brian Foster wrote:
> > That wouldn't address the latency concern Dave brought up. That said, I
> > have no issue with this as a targeted solution for the softlockup issue.
> > iomap_finish_ioend[s]() is common code for both the workqueue and
> > ->bi_end_io() contexts so that would require either some kind of context
> > detection (and my understanding is in_atomic() is unreliable/frowned
> > upon) or a new "atomic" parameter through iomap_finish_ioend[s]() to
> > indicate whether it's safe to reschedule. Preference?
> 
> True, it would not help with latency.  But then again the latency
> should be controlled by the writeback code not doing giant writebacks
> to start with, shouldn't it?
> 
> Any XFS/iomap specific limit also would not help with the block layer
> merging bios.

/me hasn't totally been following this thread, but iomap will also
aggregate the ioend completions; do we need to cap that to keep
latencies down?  I was assuming that amortization was always favorable,
but maybe not?
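
(For context, the aggregation referred to here is the completion-side ioend
merging; the XFS end_io worker of this era does roughly the following.
Sketch from memory; details such as the fs-private merge callback are
elided.)

static void
xfs_end_io(struct work_struct *work)
{
	struct xfs_inode	*ip =
		container_of(work, struct xfs_inode, i_ioend_work);
	struct iomap_ioend	*ioend;
	struct list_head	tmp;
	unsigned long		flags;

	/* grab everything queued for completion on this inode */
	spin_lock_irqsave(&ip->i_ioend_lock, flags);
	list_replace_init(&ip->i_ioend_list, &tmp);
	spin_unlock_irqrestore(&ip->i_ioend_lock, flags);

	iomap_sort_ioends(&tmp);
	while ((ioend = list_first_entry_or_null(&tmp, struct iomap_ioend,
			io_list))) {
		list_del_init(&ioend->io_list);
		/* merge contiguous ioends; fs-private merge callback elided */
		iomap_ioend_try_merge(ioend, &tmp, NULL);
		xfs_end_ioend(ioend);
	}
}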

--D
Ming Lei Sept. 17, 2020, 11:13 p.m. UTC | #26
On Thu, Sep 17, 2020 at 09:04:55AM +0100, Christoph Hellwig wrote:
> On Wed, Sep 16, 2020 at 09:07:14AM -0400, Brian Foster wrote:
> > Dave described the main purpose earlier in this thread [1]. The initial
> > motivation is that we've had downstream reports of soft lockup problems
> > in writeback bio completion down in the bio -> bvec loop of
> > iomap_finish_ioend() that has to finish writeback on each individual
> > page of insanely large bios and/or chains. We've also had an upstream
> > report of a similar problem on linux-xfs [2].
> > 
> > The magic number itself was just pulled out of a hat. I picked it
> > because it seemed conservative enough to still allow large contiguous
> > bios (1GB w/ 4k pages) while hopefully preventing I/O completion
> > problems, but was hoping for some feedback on that bit if the general
> > approach was acceptable. I was also waiting for some feedback from either
> > of the two users who reported the problem, but I don't think I've heard
> > back on that yet...
> 
> I think the saner answer is to always run large completions in the
> workqueue, and add a bunch of cond_resched() calls, rather than
> arbitrarily breaking up the I/O size.

Completing all ioend pages in a single ->bi_end_io() call may pin too many pages
for an unnecessarily long time, and adding cond_resched() can make the issue worse.


thanks,
Ming

Patch

diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
index bcfc288dba3f..4e8062279e66 100644
--- a/fs/iomap/buffered-io.c
+++ b/fs/iomap/buffered-io.c
@@ -1332,10 +1332,12 @@  iomap_add_to_ioend(struct inode *inode, loff_t offset, struct page *page,
 
 	merged = __bio_try_merge_page(wpc->ioend->io_bio, page, len, poff,
 			&same_page);
-	if (iop && !same_page)
+	if (iop && merged && !same_page)
 		atomic_inc(&iop->write_count);
 
 	if (!merged) {
+		if (iop)
+			atomic_inc(&iop->write_count);
 		if (bio_full(wpc->ioend->io_bio, len)) {
 			wpc->ioend->io_bio =
 				iomap_chain_bio(wpc->ioend->io_bio);