diff mbox series

[v2,2/2] xfs: kick extra large ioends to completion workqueue

Message ID 20201005152102.15797-1-bfoster@redhat.com (mailing list archive)
State New, archived
Headers show
Series None | expand

Commit Message

Brian Foster Oct. 5, 2020, 3:21 p.m. UTC
We've had reports of soft lockup warnings in the iomap ioend
completion path due to very large bios and/or bio chains. Divert any
ioends with 256k or more pages to process to the workqueue so
completion occurs in non-atomic context and can reschedule to avoid
soft lockup warnings.

Signed-off-by: Brian Foster <bfoster@redhat.com>
---

v2:
- Fix type in macro.

 fs/xfs/xfs_aops.c | 10 +++++++++-
 1 file changed, 9 insertions(+), 1 deletion(-)

Comments

Darrick J. Wong Oct. 6, 2020, 3:55 a.m. UTC | #1
On Mon, Oct 05, 2020 at 11:21:02AM -0400, Brian Foster wrote:
> We've had reports of soft lockup warnings in the iomap ioend
> completion path due to very large bios and/or bio chains. Divert any
> ioends with 256k or more pages to process to the workqueue so
> completion occurs in non-atomic context and can reschedule to avoid
> soft lockup warnings.

Hmmmm... is there any way we can just make end_page_writeback faster?

TBH it still strikes me as odd that we'd cap ioends this way just to
cover for the fact that we have to poke each and every page.

(Also, those 'bool atomic' in the other patch make me kind of nervous --
how do we make sure (from a QA perspective) that nobody gets that wrong?)

--D

> Signed-off-by: Brian Foster <bfoster@redhat.com>
> ---
> 
> v2:
> - Fix type in macro.
> 
>  fs/xfs/xfs_aops.c | 10 +++++++++-
>  1 file changed, 9 insertions(+), 1 deletion(-)
> 
> diff --git a/fs/xfs/xfs_aops.c b/fs/xfs/xfs_aops.c
> index 3e061ea99922..c00cc0624986 100644
> --- a/fs/xfs/xfs_aops.c
> +++ b/fs/xfs/xfs_aops.c
> @@ -30,6 +30,13 @@ XFS_WPC(struct iomap_writepage_ctx *ctx)
>  	return container_of(ctx, struct xfs_writepage_ctx, ctx);
>  }
>  
> +/*
> + * Kick extra large ioends off to the workqueue. Completion will process a lot
> + * of pages for a large bio or bio chain and a non-atomic context is required to
> + * reschedule and avoid soft lockup warnings.
> + */
> +#define XFS_LARGE_IOEND	(262144ULL << PAGE_SHIFT)
> +
>  /*
>   * Fast and loose check if this write could update the on-disk inode size.
>   */
> @@ -239,7 +246,8 @@ static inline bool xfs_ioend_needs_workqueue(struct iomap_ioend *ioend)
>  {
>  	return ioend->io_private ||
>  		ioend->io_type == IOMAP_UNWRITTEN ||
> -		(ioend->io_flags & IOMAP_F_SHARED);
> +		(ioend->io_flags & IOMAP_F_SHARED) ||
> +		(ioend->io_size >= XFS_LARGE_IOEND);
>  }
>  
>  STATIC void
> -- 
> 2.25.4
>
Brian Foster Oct. 6, 2020, 12:44 p.m. UTC | #2
On Mon, Oct 05, 2020 at 08:55:37PM -0700, Darrick J. Wong wrote:
> On Mon, Oct 05, 2020 at 11:21:02AM -0400, Brian Foster wrote:
> > We've had reports of soft lockup warnings in the iomap ioend
> > completion path due to very large bios and/or bio chains. Divert any
> > ioends with 256k or more pages to process to the workqueue so
> > completion occurs in non-atomic context and can reschedule to avoid
> > soft lockup warnings.
> 
> Hmmmm... is there any way we can just make end_page_writeback faster?
> 

I'm not sure that would help us. It's not doing much work as it is. The
issue is just that we effectively queue so many of them to a single bio
completion due to either bio chaining or the physical page merging
implemented by multipage bvecs.

> TBH it still strikes me as odd that we'd cap ioends this way just to
> cover for the fact that we have to poke each and every page.
> 

I suppose, but it's not like we don't already account for constructing
bios that must be handed off to a workqueue for completion processing.
Also FWIW this doesn't cap ioend size like my original patch does. It
just updates XFS to send them to the completion workqueue.

> (Also, those 'bool atomic' in the other patch make me kind of nervous --
> how do we make sure (from a QA perspective) that nobody gets that wrong?)
> 

Yeah, that's a bit ugly. If somebody has a better idea on the factoring
I'm interested in hearing about it. My understanding is that in_atomic()
is not reliable and/or generally frowned upon, hence the explicit
context parameter.

Also, I don't have the error handy but my development kernel complains
quite clearly if we make a call that can potentially sleep in atomic
context. I believe this is the purpose of the __might_sleep()
(CONFIG_DEBUG_ATOMIC_SLEEP) annotation.

Brian

> --D
> 
> > Signed-off-by: Brian Foster <bfoster@redhat.com>
> > ---
> > 
> > v2:
> > - Fix type in macro.
> > 
> >  fs/xfs/xfs_aops.c | 10 +++++++++-
> >  1 file changed, 9 insertions(+), 1 deletion(-)
> > 
> > diff --git a/fs/xfs/xfs_aops.c b/fs/xfs/xfs_aops.c
> > index 3e061ea99922..c00cc0624986 100644
> > --- a/fs/xfs/xfs_aops.c
> > +++ b/fs/xfs/xfs_aops.c
> > @@ -30,6 +30,13 @@ XFS_WPC(struct iomap_writepage_ctx *ctx)
> >  	return container_of(ctx, struct xfs_writepage_ctx, ctx);
> >  }
> >  
> > +/*
> > + * Kick extra large ioends off to the workqueue. Completion will process a lot
> > + * of pages for a large bio or bio chain and a non-atomic context is required to
> > + * reschedule and avoid soft lockup warnings.
> > + */
> > +#define XFS_LARGE_IOEND	(262144ULL << PAGE_SHIFT)
> > +
> >  /*
> >   * Fast and loose check if this write could update the on-disk inode size.
> >   */
> > @@ -239,7 +246,8 @@ static inline bool xfs_ioend_needs_workqueue(struct iomap_ioend *ioend)
> >  {
> >  	return ioend->io_private ||
> >  		ioend->io_type == IOMAP_UNWRITTEN ||
> > -		(ioend->io_flags & IOMAP_F_SHARED);
> > +		(ioend->io_flags & IOMAP_F_SHARED) ||
> > +		(ioend->io_size >= XFS_LARGE_IOEND);
> >  }
> >  
> >  STATIC void
> > -- 
> > 2.25.4
> > 
>
Matthew Wilcox (Oracle) Oct. 6, 2020, 2:07 p.m. UTC | #3
On Mon, Oct 05, 2020 at 08:55:37PM -0700, Darrick J. Wong wrote:
> On Mon, Oct 05, 2020 at 11:21:02AM -0400, Brian Foster wrote:
> > We've had reports of soft lockup warnings in the iomap ioend
> > completion path due to very large bios and/or bio chains. Divert any
> > ioends with 256k or more pages to process to the workqueue so
> > completion occurs in non-atomic context and can reschedule to avoid
> > soft lockup warnings.
> 
> Hmmmm... is there any way we can just make end_page_writeback faster?

There are ways to make it faster.  I don't know if they're a "just"
solution ...

1. We can use THPs.  That will reduce the number of pages being operated
on.  I hear somebody might have a patch set for that.  Incidentally,
this patch set will clash with the THP patchset, so one of us is going to
have to rebase on the other's work.  Not a complaint, just acknowledging
that some coordination will be needed for the 5.11 merge window.

2. We could create end_writeback_pages(struct pagevec *pvec) which
calls a new test_clear_writeback_pages(pvec).  That could amortise
taking the memcg lock and finding the lruvec and taking the mapping
lock -- assuming these pages are sufficiently virtually contiguous.
It can definitely amortise all the statistics updates.

3. We can make wake_up_page(page, PG_writeback); more efficient.  If
you can produce this situation on demand, I had a patch for that which
languished due to lack of interest.

https://lore.kernel.org/linux-fsdevel/20200416220130.13343-1-willy@infradead.org/
Darrick J. Wong May 6, 2021, 7:31 p.m. UTC | #4
On Tue, Oct 06, 2020 at 08:44:40AM -0400, Brian Foster wrote:
> On Mon, Oct 05, 2020 at 08:55:37PM -0700, Darrick J. Wong wrote:
> > On Mon, Oct 05, 2020 at 11:21:02AM -0400, Brian Foster wrote:
> > > We've had reports of soft lockup warnings in the iomap ioend
> > > completion path due to very large bios and/or bio chains. Divert any
> > > ioends with 256k or more pages to process to the workqueue so

Heh, lol, so now we're hitting this internally too.  Certain customers
are sending 580,000-page bios, which then trip the hangcheck timer when
we stall the interrupt handler while clearing all the page bits.

> > > completion occurs in non-atomic context and can reschedule to avoid
> > > soft lockup warnings.
> > 
> > Hmmmm... is there any way we can just make end_page_writeback faster?
> > 
> 
> I'm not sure that would help us. It's not doing much work as it is. The
> issue is just that we effectively queue so many of them to a single bio
> completion due to either bio chaining or the physical page merging
> implemented by multipage bvecs.
> 
> > TBH it still strikes me as odd that we'd cap ioends this way just to
> > cover for the fact that we have to poke each and every page.
> > 
> 
> I suppose, but it's not like we don't already account for constructing
> bios that must be handed off to a workqueue for completion processing.
> Also FWIW this doesn't cap ioend size like my original patch does. It
> just updates XFS to send them to the completion workqueue.

<nod> So I guess I'm saying that my resistance to /this/ part of the
changes is melting away.  For a 2GB+ write IO, I guess the extra overhead
of poking a workqueue can be amortized over the sheer number of pages.

> > (Also, those 'bool atomic' in the other patch make me kind of nervous --
> > how do we make sure (from a QA perspective) that nobody gets that wrong?)
> > 
> 
> Yeah, that's a bit ugly. If somebody has a better idea on the factoring
> I'm interested in hearing about it. My understanding is that in_atomic()
> is not reliable and/or generally frowned upon, hence the explicit
> context parameter.
> 
> Also, I don't have the error handy but my development kernel complains
> quite clearly if we make a call that can potentially sleep in atomic
> context. I believe this is the purpose of the __might_sleep()
> (CONFIG_DEBUG_ATOMIC_SLEEP) annotation.

I wonder if it's not too late to establish a new iomap rule?

All clients whose ->prepare_ioend handler overrides the default
ioend->io_bio->bi_end_io handler must call iomap_finish_ioends from
process context, because the "only" reason why a filesystem would need
to do that is because some post-write metadata update is necessary, and
those really shouldn't be running from interrupt context.

With such a rule (no idea how we'd enforce that) we could at least
constrain that in_atomic variable to buffered-io.c, since the only time
it would be unsafe to call cond_resched() is if iomap_writepage_end_bio
is in use, and it decides to call iomap_finish_ioend directly.

Right now XFS is the only filesystem that overrides the bio endio
handler, and the only time it does that is for writes that need extra
metadata updates (unwritten conversion, setfilesize, cow).

--D

> Brian
> 
> > --D
> > 
> > > Signed-off-by: Brian Foster <bfoster@redhat.com>
> > > ---
> > > 
> > > v2:
> > > - Fix type in macro.
> > > 
> > >  fs/xfs/xfs_aops.c | 10 +++++++++-
> > >  1 file changed, 9 insertions(+), 1 deletion(-)
> > > 
> > > diff --git a/fs/xfs/xfs_aops.c b/fs/xfs/xfs_aops.c
> > > index 3e061ea99922..c00cc0624986 100644
> > > --- a/fs/xfs/xfs_aops.c
> > > +++ b/fs/xfs/xfs_aops.c
> > > @@ -30,6 +30,13 @@ XFS_WPC(struct iomap_writepage_ctx *ctx)
> > >  	return container_of(ctx, struct xfs_writepage_ctx, ctx);
> > >  }
> > >  
> > > +/*
> > > + * Kick extra large ioends off to the workqueue. Completion will process a lot
> > > + * of pages for a large bio or bio chain and a non-atomic context is required to
> > > + * reschedule and avoid soft lockup warnings.
> > > + */
> > > +#define XFS_LARGE_IOEND	(262144ULL << PAGE_SHIFT)
> > > +
> > >  /*
> > >   * Fast and loose check if this write could update the on-disk inode size.
> > >   */
> > > @@ -239,7 +246,8 @@ static inline bool xfs_ioend_needs_workqueue(struct iomap_ioend *ioend)
> > >  {
> > >  	return ioend->io_private ||
> > >  		ioend->io_type == IOMAP_UNWRITTEN ||
> > > -		(ioend->io_flags & IOMAP_F_SHARED);
> > > +		(ioend->io_flags & IOMAP_F_SHARED) ||
> > > +		(ioend->io_size >= XFS_LARGE_IOEND);
> > >  }
> > >  
> > >  STATIC void
> > > -- 
> > > 2.25.4
> > > 
> > 
>
Darrick J. Wong May 6, 2021, 7:34 p.m. UTC | #5
On Tue, Oct 06, 2020 at 03:07:20PM +0100, Matthew Wilcox wrote:
> On Mon, Oct 05, 2020 at 08:55:37PM -0700, Darrick J. Wong wrote:
> > On Mon, Oct 05, 2020 at 11:21:02AM -0400, Brian Foster wrote:
> > > We've had reports of soft lockup warnings in the iomap ioend
> > > completion path due to very large bios and/or bio chains. Divert any
> > > ioends with 256k or more pages to process to the workqueue so
> > > completion occurs in non-atomic context and can reschedule to avoid
> > > soft lockup warnings.
> > 
> > Hmmmm... is there any way we can just make end_page_writeback faster?
> 
> There are ways to make it faster.  I don't know if they're a "just"
> solution ...
> 
> 1. We can use THPs.  That will reduce the number of pages being operated
> on.  I hear somebody might have a patch set for that.  Incidentally,
> this patch set will clash with the THP patchset, so one of us is going to
> have to rebase on the other's work.  Not a complaint, just acknowledging
> that some coordination will be needed for the 5.11 merge window.

How far off is this, anyway?  I assume it's in line behind the folio
series?

> 2. We could create end_writeback_pages(struct pagevec *pvec) which
> calls a new test_clear_writeback_pages(pvec).  That could amortise
> taking the memcg lock and finding the lruvec and taking the mapping
> lock -- assuming these pages are sufficiently virtually contiguous.
> It can definitely amortise all the statistics updates.

/me kinda wonders if THPs arent the better solution for people who want
to run large ios.

> 3. We can make wake_up_page(page, PG_writeback); more efficient.  If
> you can produce this situation on demand, I had a patch for that which
> languished due to lack of interest.

I can (well, someone can) so I'll talk to you internally about their
seeekret reproducer.

> https://lore.kernel.org/linux-fsdevel/20200416220130.13343-1-willy@infradead.org/

--D
Matthew Wilcox (Oracle) May 6, 2021, 7:45 p.m. UTC | #6
On Thu, May 06, 2021 at 12:34:06PM -0700, Darrick J. Wong wrote:
> On Tue, Oct 06, 2020 at 03:07:20PM +0100, Matthew Wilcox wrote:
> > On Mon, Oct 05, 2020 at 08:55:37PM -0700, Darrick J. Wong wrote:
> > > On Mon, Oct 05, 2020 at 11:21:02AM -0400, Brian Foster wrote:
> > > > We've had reports of soft lockup warnings in the iomap ioend
> > > > completion path due to very large bios and/or bio chains. Divert any
> > > > ioends with 256k or more pages to process to the workqueue so
> > > > completion occurs in non-atomic context and can reschedule to avoid
> > > > soft lockup warnings.
> > > 
> > > Hmmmm... is there any way we can just make end_page_writeback faster?
> > 
> > There are ways to make it faster.  I don't know if they're a "just"
> > solution ...
> > 
> > 1. We can use THPs.  That will reduce the number of pages being operated
> > on.  I hear somebody might have a patch set for that.  Incidentally,
> > this patch set will clash with the THP patchset, so one of us is going to
> > have to rebase on the other's work.  Not a complaint, just acknowledging
> > that some coordination will be needed for the 5.11 merge window.
> 
> How far off is this, anyway?  I assume it's in line behind the folio
> series?

Right.  The folio series found all kinds of fun places where the
accounting was wrong (eg accounting for an N-page I/O as a single page),
so the THP work is all renamed folio now.  The folio patchset I posted
yesterday [1] is _most_ of what is necessary from an XFS point of view.
There's probably another three dozen mm patches to actually enable
multi-page folios after that, and a lot more patches to optimise the
mm/vfs for multi-page folios, but your side of things is almost all
published and reviewable.

[1] https://lore.kernel.org/linux-fsdevel/20210505150628.111735-1-willy@infradead.org/

> > 2. We could create end_writeback_pages(struct pagevec *pvec) which
> > calls a new test_clear_writeback_pages(pvec).  That could amortise
> > taking the memcg lock and finding the lruvec and taking the mapping
> > lock -- assuming these pages are sufficiently virtually contiguous.
> > It can definitely amortise all the statistics updates.
> 
> /me kinda wonders if THPs arent the better solution for people who want
> to run large ios.

Yes, definitely.  It does rather depend on their usage patterns,
but if they're working on a file-contiguous chunk of memory, this
can help a lot.

> > 3. We can make wake_up_page(page, PG_writeback); more efficient.  If
> > you can produce this situation on demand, I had a patch for that which
> > languished due to lack of interest.
> 
> I can (well, someone can) so I'll talk to you internally about their
> seeekret reproducer.

Fantastic.  If it's something that needs to get backported to a
stable-ABI kernel ... this isn't going to be a viable solution.
Brian Foster May 7, 2021, 2:06 p.m. UTC | #7
On Thu, May 06, 2021 at 12:31:58PM -0700, Darrick J. Wong wrote:
> On Tue, Oct 06, 2020 at 08:44:40AM -0400, Brian Foster wrote:
> > On Mon, Oct 05, 2020 at 08:55:37PM -0700, Darrick J. Wong wrote:
> > > On Mon, Oct 05, 2020 at 11:21:02AM -0400, Brian Foster wrote:
> > > > We've had reports of soft lockup warnings in the iomap ioend
> > > > completion path due to very large bios and/or bio chains. Divert any
> > > > ioends with 256k or more pages to process to the workqueue so
> 
> Heh, lol, so now we're hitting this internally too.  Certain customers
> are sending 580,000-page bios, which then trip the hangcheck timer when
> we stall the interrupt handler while clearing all the page bits.
> 

Yep, sounds about right. :P

> > > > completion occurs in non-atomic context and can reschedule to avoid
> > > > soft lockup warnings.
> > > 
> > > Hmmmm... is there any way we can just make end_page_writeback faster?
> > > 
> > 
> > I'm not sure that would help us. It's not doing much work as it is. The
> > issue is just that we effectively queue so many of them to a single bio
> > completion due to either bio chaining or the physical page merging
> > implemented by multipage bvecs.
> > 
> > > TBH it still strikes me as odd that we'd cap ioends this way just to
> > > cover for the fact that we have to poke each and every page.
> > > 
> > 
> > I suppose, but it's not like we don't already account for constructing
> > bios that must be handed off to a workqueue for completion processing.
> > Also FWIW this doesn't cap ioend size like my original patch does. It
> > just updates XFS to send them to the completion workqueue.
> 
> <nod> So I guess I'm saying that my resistance to /this/ part of the
> changes is melting away.  For a 2GB+ write IO, I guess the extra overhead
> of poking a workqueue can be amortized over the sheer number of pages.
> 

I think the main question is what is a suitable size threshold to kick
an ioend over to the workqueue? Looking back, I think this patch just
picked 256k randomly to propose the idea. ISTM there could be a
potentially large window from the point where I/O latency starts to
dominate (over the extra context switch for wq processing) and where the
softlockup warning thing will eventually trigger due to having too many
pages. I think that means we could probably use a more conservative
value, I'm just not sure what value should be (10MB, 100MB, 1GB?). If
you have a reproducer it might be interesting to experiment with that.

> > > (Also, those 'bool atomic' in the other patch make me kind of nervous --
> > > how do we make sure (from a QA perspective) that nobody gets that wrong?)
> > > 
> > 
> > Yeah, that's a bit ugly. If somebody has a better idea on the factoring
> > I'm interested in hearing about it. My understanding is that in_atomic()
> > is not reliable and/or generally frowned upon, hence the explicit
> > context parameter.
> > 
> > Also, I don't have the error handy but my development kernel complains
> > quite clearly if we make a call that can potentially sleep in atomic
> > context. I believe this is the purpose of the __might_sleep()
> > (CONFIG_DEBUG_ATOMIC_SLEEP) annotation.
> 
> I wonder if it's not too late to establish a new iomap rule?
> 
> All clients whose ->prepare_ioend handler overrides the default
> ioend->io_bio->bi_end_io handler must call iomap_finish_ioends from
> process context, because the "only" reason why a filesystem would need
> to do that is because some post-write metadata update is necessary, and
> those really shouldn't be running from interrupt context.
> 
> With such a rule (no idea how we'd enforce that) we could at least
> constrain that in_atomic variable to buffered-io.c, since the only time
> it would be unsafe to call cond_resched() is if iomap_writepage_end_bio
> is in use, and it decides to call iomap_finish_ioend directly.
> 

I'm not following if you mean to suggest to change what patch 1 does
somehow or another (it seems similar to what you're describing here) or
something else..?

Brian

> Right now XFS is the only filesystem that overrides the bio endio
> handler, and the only time it does that is for writes that need extra
> metadata updates (unwritten conversion, setfilesize, cow).
> 
> --D
> 
> > Brian
> > 
> > > --D
> > > 
> > > > Signed-off-by: Brian Foster <bfoster@redhat.com>
> > > > ---
> > > > 
> > > > v2:
> > > > - Fix type in macro.
> > > > 
> > > >  fs/xfs/xfs_aops.c | 10 +++++++++-
> > > >  1 file changed, 9 insertions(+), 1 deletion(-)
> > > > 
> > > > diff --git a/fs/xfs/xfs_aops.c b/fs/xfs/xfs_aops.c
> > > > index 3e061ea99922..c00cc0624986 100644
> > > > --- a/fs/xfs/xfs_aops.c
> > > > +++ b/fs/xfs/xfs_aops.c
> > > > @@ -30,6 +30,13 @@ XFS_WPC(struct iomap_writepage_ctx *ctx)
> > > >  	return container_of(ctx, struct xfs_writepage_ctx, ctx);
> > > >  }
> > > >  
> > > > +/*
> > > > + * Kick extra large ioends off to the workqueue. Completion will process a lot
> > > > + * of pages for a large bio or bio chain and a non-atomic context is required to
> > > > + * reschedule and avoid soft lockup warnings.
> > > > + */
> > > > +#define XFS_LARGE_IOEND	(262144ULL << PAGE_SHIFT)
> > > > +
> > > >  /*
> > > >   * Fast and loose check if this write could update the on-disk inode size.
> > > >   */
> > > > @@ -239,7 +246,8 @@ static inline bool xfs_ioend_needs_workqueue(struct iomap_ioend *ioend)
> > > >  {
> > > >  	return ioend->io_private ||
> > > >  		ioend->io_type == IOMAP_UNWRITTEN ||
> > > > -		(ioend->io_flags & IOMAP_F_SHARED);
> > > > +		(ioend->io_flags & IOMAP_F_SHARED) ||
> > > > +		(ioend->io_size >= XFS_LARGE_IOEND);
> > > >  }
> > > >  
> > > >  STATIC void
> > > > -- 
> > > > 2.25.4
> > > > 
> > > 
> > 
>
Matthew Wilcox (Oracle) May 7, 2021, 2:40 p.m. UTC | #8
On Fri, May 07, 2021 at 10:06:31AM -0400, Brian Foster wrote:
> > <nod> So I guess I'm saying that my resistance to /this/ part of the
> > changes is melting away.  For a 2GB+ write IO, I guess the extra overhead
> > of poking a workqueue can be amortized over the sheer number of pages.
> 
> I think the main question is what is a suitable size threshold to kick
> an ioend over to the workqueue? Looking back, I think this patch just
> picked 256k randomly to propose the idea. ISTM there could be a
> potentially large window from the point where I/O latency starts to
> dominate (over the extra context switch for wq processing) and where the
> softlockup warning thing will eventually trigger due to having too many
> pages. I think that means we could probably use a more conservative
> value, I'm just not sure what value should be (10MB, 100MB, 1GB?). If
> you have a reproducer it might be interesting to experiment with that.

To my mind, there are four main types of I/Os.

1. Small, dependent reads -- maybe reading a B-tree block so we can get
the next pointer.  Those need latency above all.

2. Readahead.  Tend to be large I/Os and latency is not a concern

3. Page writeback which tend to be large and can afford the extra latency.

4. Log writes.  These tend to be small, and I'm not sure what increasing
their latency would do.  Probably bad.

I like 256kB as a threshold.  I think I could get behind anything from
128kB to 1MB.  I don't think playing with it is going to be really
interesting because most IOs are going to be far below or far above
that threshold.
Dave Chinner May 10, 2021, 2:45 a.m. UTC | #9
On Fri, May 07, 2021 at 03:40:39PM +0100, Matthew Wilcox wrote:
> On Fri, May 07, 2021 at 10:06:31AM -0400, Brian Foster wrote:
> > > <nod> So I guess I'm saying that my resistance to /this/ part of the
> > > changes is melting away.  For a 2GB+ write IO, I guess the extra overhead
> > > of poking a workqueue can be amortized over the sheer number of pages.
> > 
> > I think the main question is what is a suitable size threshold to kick
> > an ioend over to the workqueue? Looking back, I think this patch just
> > picked 256k randomly to propose the idea. ISTM there could be a
> > potentially large window from the point where I/O latency starts to
> > dominate (over the extra context switch for wq processing) and where the
> > softlockup warning thing will eventually trigger due to having too many
> > pages. I think that means we could probably use a more conservative
> > value, I'm just not sure what value should be (10MB, 100MB, 1GB?). If
> > you have a reproducer it might be interesting to experiment with that.
> 
> To my mind, there are four main types of I/Os.
> 
> 1. Small, dependent reads -- maybe reading a B-tree block so we can get
> the next pointer.  Those need latency above all.
> 
> 2. Readahead.  Tend to be large I/Os and latency is not a concern
> 
> 3. Page writeback which tend to be large and can afford the extra latency.
> 
> 4. Log writes.  These tend to be small, and I'm not sure what increasing
> their latency would do.  Probably bad.
> 
> I like 256kB as a threshold.  I think I could get behind anything from
> 128kB to 1MB.  I don't think playing with it is going to be really
> interesting because most IOs are going to be far below or far above
> that threshold.

256kB is waaaaay too small for writeback IO. Brian actually proposed
256k *pages*, not bytes. 256kB will take us back to the bad old days
where we are dependent on bio merging to get decent IO patterns down
to the hardware, and that's a bad place to be from both an IO and
CPU efficiency POV.

IOWs, the IO that is built by the filesystem needs to be large
w.r.t. the underlying hardware so that the block layer can slice and
dice it efficiently so we don't end up doing lots of small partial
stripe writes to RAID devices.

ISTR proposing - in response to Brian's patch - a limit of 16MB or
so - large enough that for most storage stacks we'll keep it's
queues full with well formed IO, but also small enough that we don't
end up with long completion latencies due to walking hundreds of
thousands of pages completing them in a tight loop...

Yup, here's the relevent chunk of that patch:

+/*
+ * Maximum ioend IO size is used to prevent ioends from becoming unbound in
+ * size. Bios can read 4GB in size is pages are contiguous, and bio chains are
+ * effectively unbound in length. Hence the only limits on the size of the bio
+ * chain is the contiguity of the extent on disk and the length of the run of
+ * sequential dirty pages in the page cache. This can be tens of GBs of physical
+ * extents and if memory is large enough, tens of millions of dirty pages.
+ * Locking them all under writeback until the final bio in the chain is
+ * submitted and completed locks all those pages for the legnth of time it takes
+ * to write those many, many GBs of data to storage.
+ *
+ * Background writeback caps any single writepages call to half the device
+ * bandwidth to ensure fairness and prevent any one dirty inode causing
+ * writeback starvation.  fsync() and other WB_SYNC_ALL writebacks have no such
+ * cap on wbc->nr_pages, and that's where the above massive bio chain lengths
+ * come from. We want large IOs to reach the storage, but we need to limit
+ * completion latencies, hence we need to control the maximum IO size we
+ * dispatch to the storage stack.
+ *
+ * We don't really have to care about the extra IO completion overhead here -
+ * iomap has contiguous IO completion merging, so if the device can sustain a
+ * very high throughput and large bios, the ioends will be merged on completion
+ * and processed in large, efficient chunks with no additional IO latency. IOWs,
+ * we really don't need the huge bio chains to be built in the first place for
+ * sequential writeback...
+ *
+ * Note that page size has an effect here - 64k pages really needs lower
+ * page count thresholds because they have the same IO capabilities. We can do
+ * larger IOs because of the lower completion overhead, so cap it at 1024
+ * pages. For everything else, use 16MB as the ioend size.
+ */
+#if (PAGE_SIZE == 65536)
+#define IOMAP_MAX_IOEND_PAGES  1024
+#else
+#define IOMAP_MAX_IOEND_PAGES  4096
+#endif
+#define IOMAP_MAX_IOEND_SIZE   (IOMAP_MAX_IOEND_PAGES * PAGE_SIZE)


Cheers,

Dave.
diff mbox series

Patch

diff --git a/fs/xfs/xfs_aops.c b/fs/xfs/xfs_aops.c
index 3e061ea99922..c00cc0624986 100644
--- a/fs/xfs/xfs_aops.c
+++ b/fs/xfs/xfs_aops.c
@@ -30,6 +30,13 @@  XFS_WPC(struct iomap_writepage_ctx *ctx)
 	return container_of(ctx, struct xfs_writepage_ctx, ctx);
 }
 
+/*
+ * Kick extra large ioends off to the workqueue. Completion will process a lot
+ * of pages for a large bio or bio chain and a non-atomic context is required to
+ * reschedule and avoid soft lockup warnings.
+ */
+#define XFS_LARGE_IOEND	(262144ULL << PAGE_SHIFT)
+
 /*
  * Fast and loose check if this write could update the on-disk inode size.
  */
@@ -239,7 +246,8 @@  static inline bool xfs_ioend_needs_workqueue(struct iomap_ioend *ioend)
 {
 	return ioend->io_private ||
 		ioend->io_type == IOMAP_UNWRITTEN ||
-		(ioend->io_flags & IOMAP_F_SHARED);
+		(ioend->io_flags & IOMAP_F_SHARED) ||
+		(ioend->io_size >= XFS_LARGE_IOEND);
 }
 
 STATIC void