
gfs2 iomap deadlock, IOMAP_F_UNBALANCED

Message ID 20190321131304.21618-1-agruenba@redhat.com (mailing list archive)
State New, archived
Series gfs2 iomap deadlock, IOMAP_F_UNBALANCED

Commit Message

Andreas Gruenbacher March 21, 2019, 1:13 p.m. UTC
Hi Christoph,

we need your help fixing a gfs2 deadlock involving iomap.  What's going
on is the following:

* During iomap_file_buffered_write, gfs2_iomap_begin grabs the log flush
  lock and keeps it until gfs2_iomap_end.  It currently always does that
  even though there is no point other than for journaled data writes.

* iomap_file_buffered_write then calls balance_dirty_pages_ratelimited.
  If that ends up calling gfs2_write_inode, gfs2 will try to grab the
  log flush lock again and deadlock.

We can stop gfs2_iomap_begin from keeping the log flush lock held for
non-journaled data writes, but that still leaves us with the deadlock
for journaled data writes: for them, we need to add the data pages to
the running transaction, so dropping the log flush lock wouldn't work.

I've tried to work around the unwanted balance_dirty_pages_ratelimited
in gfs2_write_inode as originally suggested by Ross.  That works to a
certain degree, but only if we always skip inodes in the WB_SYNC_NONE
case, and that means that gfs2_write_inode becomes quite ineffective.

So it seems that it would be more reasonable to avoid the dirty page
balancing under the log flush lock in the first place.

The attached patch changes gfs2_iomap_begin to only hold on to the log
flush lock for journaled data writes.  In that case, we also make sure
to limit the write size to not overrun dirty limits using a similar
logic as in balance_dirty_pages_ratelimited; there is precedent for that
approach in btrfs_buffered_write.  Then, we prevent iomap from balancing
dirty pages via the new IOMAP_F_UNBALANCED flag.

Does that seem like an acceptable approach?

Thanks,
Andreas

---
 fs/gfs2/bmap.c        | 62 +++++++++++++++++++++++++++++++++----------
 fs/gfs2/file.c        |  2 ++
 fs/iomap.c            |  3 ++-
 include/linux/iomap.h |  1 +
 4 files changed, 53 insertions(+), 15 deletions(-)

Comments

Dave Chinner March 21, 2019, 9:43 p.m. UTC | #1
On Thu, Mar 21, 2019 at 02:13:04PM +0100, Andreas Gruenbacher wrote:
> Hi Christoph,
> 
> we need your help fixing a gfs2 deadlock involving iomap.  What's going
> on is the following:
> 
> * During iomap_file_buffered_write, gfs2_iomap_begin grabs the log flush
>   lock and keeps it until gfs2_iomap_end.  It currently always does that
>   even though there is no point other than for journaled data writes.
> 
> * iomap_file_buffered_write then calls balance_dirty_pages_ratelimited.
>   If that ends up calling gfs2_write_inode, gfs2 will try to grab the
>   log flush lock again and deadlock.
> 
> We can stop gfs2_iomap_begin from keeping the log flush lock held for
> non-journaled data writes, but that still leaves us with the deadlock
> for journaled data writes: for them, we need to add the data pages to
> the running transaction, so dropping the log flush lock wouldn't work.
> 
> I've tried to work around the unwanted balance_dirty_pages_ratelimited
> in gfs2_write_inode as originally suggested by Ross.  That works to a
> certain degree, but only if we always skip inodes in the WB_SYNC_NONE
> case, and that means that gfs2_write_inode becomes quite ineffective.
> 
> So it seems that it would be more reasonable to avoid the dirty page
> balancing under the log flush lock in the first place.
> 
> The attached patch changes gfs2_iomap_begin to only hold on to the log
> flush lock for journaled data writes.  In that case, we also make sure
> to limit the write size to not overrun dirty limits using a similar
> logic as in balance_dirty_pages_ratelimited; there is precedent for that
> approach in btrfs_buffered_write.  Then, we prevent iomap from balancing
> dirty pages via the new IOMAP_F_UNBALANCED flag.

....
> diff --git a/fs/gfs2/file.c b/fs/gfs2/file.c
> index 58a768e59712e..542bd149c4aa3 100644
> --- a/fs/gfs2/file.c
> +++ b/fs/gfs2/file.c
> @@ -867,6 +867,8 @@ static ssize_t gfs2_file_write_iter(struct kiocb *iocb, struct iov_iter *from)
>  			iocb->ki_pos += ret;
>  	}
>  
> +	balance_dirty_pages_ratelimited(file->f_mapping);
> +
>  out2:
>  	current->backing_dev_info = NULL;
>  out:

The problem is calling balance_dirty_pages() inside the
->iomap_begin/->iomap_end calls and not that it is called by the
iomap infrastructure itself, right?

If so, I'd prefer to see this in iomap_apply() after the call to
ops->iomap_end because iomap_file_buffered_write() can iterate and
call iomap_apply() multiple times. This would keep the balancing to
a per-iomap granularity, rather than a per-syscall granularity.

i.e. if we do write(2GB), we want more than one balancing call
during that syscall, so it would be up to the filesystem to a) limit
the size of write mappings to something smaller (e.g. 1024 pages)
so that there are still frequent balancing calls for large writes.

Cheers,

Dave.
Andreas Gruenbacher March 21, 2019, 11:01 p.m. UTC | #2
On Thu, 21 Mar 2019 at 22:43, Dave Chinner <david@fromorbit.com> wrote:
> The problem is calling balance_dirty_pages() inside the
> ->iomap_begin/->iomap_end calls and not that it is called by the
> iomap infrastructure itself, right?
>
> If so, I'd prefer to see this in iomap_apply() after the call to
> ops->iomap_end because iomap_file_buffered_write() can iterate and
> call iomap_apply() multiple times. This would keep the balancing to
> a per-iomap granularity, rather than a per-syscall granularity.
>
> i.e. if we do write(2GB), we want more than one balancing call
> during that syscall, so it would be up to the filesystem to a) limit
> the size of write mappings to something smaller (e.g. 1024 pages)
> so that there are still frequent balancing calls for large writes.

Hmm. The looping across multiple mappings isn't done in iomap_apply
but in iomap_file_buffered_write, so the balancing could go into
iomap_apply or iomap_file_buffered_write, but can't go further up the
stack. Given that, iomap_file_buffered_write seems the better place,
but this is still quite horrible.

Thanks,
Andreas
Andreas Gruenbacher March 22, 2019, 12:21 a.m. UTC | #3
On Fri, 22 Mar 2019 at 00:01, Andreas Gruenbacher <agruenba@redhat.com> wrote:
> On Thu, 21 Mar 2019 at 22:43, Dave Chinner <david@fromorbit.com> wrote:
> > The problem is calling balance_dirty_pages() inside the
> > ->iomap_begin/->iomap_end calls and not that it is called by the
> > iomap infrastructure itself, right?
> >
> > If so, I'd prefer to see this in iomap_apply() after the call to
> > ops->iomap_end because iomap_file_buffered_write() can iterate and
> > call iomap_apply() multiple times. This would keep the balancing to
> > a per-iomap granularity, rather than a per-syscall granularity.
> >
> > i.e. if we do write(2GB), we want more than one balancing call
> > during that syscall, so it would be up to the filesystem to a) limit
> > the size of write mappings to something smaller (e.g. 1024 pages)
> > so that there are still frequent balancing calls for large writes.
>
> Hmm. The looping across multiple mappings isn't done in iomap_apply
> but in iomap_file_buffered_write, so the balancing could go into
> iomap_apply or iomap_file_buffered_write, but can't go further up the
> stack. Given that, iomap_file_buffered_write seems the better place,
> but this is still quite horrible.

Here's a more reasonable version of my first patch, with a cleaned up
and hopefully fixed gfs2 part.

In addition, this checks for IOMAP_F_UNBALANCED in iomap_dirty_actor,
the actor for iomap_file_dirty.  We don't use iomap_file_dirty in gfs2,
but we should probably allow skipping the dirty page balancing there
as well.

Thanks,
Andreas
---
 fs/gfs2/bmap.c        | 64 +++++++++++++++++++++++++++++++++----------
 fs/iomap.c            |  6 ++--
 include/linux/iomap.h |  1 +
 3 files changed, 55 insertions(+), 16 deletions(-)

diff --git a/fs/gfs2/bmap.c b/fs/gfs2/bmap.c
index 02b2646d84b3a..628d66d07fc6c 100644
--- a/fs/gfs2/bmap.c
+++ b/fs/gfs2/bmap.c
@@ -974,6 +974,19 @@ static void gfs2_iomap_journaled_page_done(struct inode *inode, loff_t pos,
 	gfs2_page_add_databufs(ip, page, offset_in_page(pos), copied);
 }
 
+static void gfs2_write_end(struct inode *inode, struct buffer_head *dibh)
+{
+	struct gfs2_inode *ip = GFS2_I(inode);
+	struct gfs2_trans *tr = current->journal_info;
+
+	gfs2_ordered_add_inode(ip);
+
+	if (tr->tr_num_buf_new)
+		__mark_inode_dirty(inode, I_DIRTY_DATASYNC);
+	else
+		gfs2_trans_add_meta(ip->i_gl, dibh);
+}
+
 static int gfs2_iomap_begin_write(struct inode *inode, loff_t pos,
 				  loff_t length, unsigned flags,
 				  struct iomap *iomap,
@@ -996,6 +1009,25 @@ static int gfs2_iomap_begin_write(struct inode *inode, loff_t pos,
 	if (ret)
 		goto out_unlock;
 
+	if (gfs2_is_stuffed(ip) || gfs2_is_jdata(ip) || inode == sdp->sd_rindex) {
+		int max_pages;
+		u64 max_length;
+
+		iomap->flags |= IOMAP_F_UNBALANCED;
+
+		/*
+		 * Limit the write size: this ensures that write throttling
+		 * will kick in fast enough even when we don't call
+		 * balance_dirty_pages_ratelimited for each page written.
+		 */
+		max_pages = current->nr_dirtied_pause - current->nr_dirtied;
+		if (max_pages < 8)
+			max_pages = 8;
+		max_length = (u64)max_pages << PAGE_SHIFT;
+		if (iomap->length > max_length)
+			iomap->length = max_length;
+	}
+
 	alloc_required = unstuff || iomap->type == IOMAP_HOLE;
 
 	if (alloc_required || gfs2_is_jdata(ip))
@@ -1052,6 +1084,11 @@ static int gfs2_iomap_begin_write(struct inode *inode, loff_t pos,
 	}
 	if (!gfs2_is_stuffed(ip) && gfs2_is_jdata(ip))
 		iomap->page_done = gfs2_iomap_journaled_page_done;
+
+	if (!(iomap->flags & IOMAP_F_UNBALANCED)) {
+		gfs2_write_end(inode, mp->mp_bh[0]);
+		gfs2_trans_end(sdp);
+	}
 	return 0;
 
 out_trans_end:
@@ -1103,30 +1140,29 @@ static int gfs2_iomap_end(struct inode *inode, loff_t pos, loff_t length,
 			  ssize_t written, unsigned flags, struct iomap *iomap)
 {
 	struct gfs2_inode *ip = GFS2_I(inode);
-	struct gfs2_sbd *sdp = GFS2_SB(inode);
-	struct gfs2_trans *tr = current->journal_info;
 	struct buffer_head *dibh = iomap->private;
 
 	if ((flags & (IOMAP_WRITE | IOMAP_DIRECT)) != IOMAP_WRITE)
 		goto out;
 
-	if (iomap->type != IOMAP_INLINE) {
-		gfs2_ordered_add_inode(ip);
+	if (current->journal_info) {
+		struct gfs2_sbd *sdp = GFS2_SB(inode);
 
-		if (tr->tr_num_buf_new)
-			__mark_inode_dirty(inode, I_DIRTY_DATASYNC);
-		else
-			gfs2_trans_add_meta(ip->i_gl, dibh);
-	}
+		if (iomap->type != IOMAP_INLINE)
+			gfs2_write_end(inode, dibh);
 
-	if (inode == sdp->sd_rindex) {
-		adjust_fs_space(inode);
-		sdp->sd_rindex_uptodate = 0;
-	}
+		if (inode == sdp->sd_rindex) {
+			adjust_fs_space(inode);
+			sdp->sd_rindex_uptodate = 0;
+		}
 
-	gfs2_trans_end(sdp);
+		gfs2_trans_end(sdp);
+	}
 	gfs2_inplace_release(ip);
 
+	if (iomap->flags & IOMAP_F_UNBALANCED)
+		balance_dirty_pages_ratelimited(inode->i_mapping);
+
 	if (length != written && (iomap->flags & IOMAP_F_NEW)) {
 		/* Deallocate blocks that were just allocated. */
 		loff_t blockmask = i_blocksize(inode) - 1;
diff --git a/fs/iomap.c b/fs/iomap.c
index 97cb9d486a7da..5f950fee0834f 100644
--- a/fs/iomap.c
+++ b/fs/iomap.c
@@ -863,7 +863,8 @@ iomap_write_actor(struct inode *inode, loff_t pos, loff_t length, void *data,
 		written += copied;
 		length -= copied;
 
-		balance_dirty_pages_ratelimited(inode->i_mapping);
+		if (!(iomap->flags & IOMAP_F_UNBALANCED))
+			balance_dirty_pages_ratelimited(inode->i_mapping);
 	} while (iov_iter_count(i) && length);
 
 	return written ? written : status;
@@ -945,7 +946,8 @@ iomap_dirty_actor(struct inode *inode, loff_t pos, loff_t length, void *data,
 		written += status;
 		length -= status;
 
-		balance_dirty_pages_ratelimited(inode->i_mapping);
+		if (!(iomap->flags & IOMAP_F_UNBALANCED))
+			balance_dirty_pages_ratelimited(inode->i_mapping);
 	} while (length);
 
 	return written;
diff --git a/include/linux/iomap.h b/include/linux/iomap.h
index 0fefb5455bdaf..e9a04e76a3217 100644
--- a/include/linux/iomap.h
+++ b/include/linux/iomap.h
@@ -35,6 +35,7 @@ struct vm_fault;
 #define IOMAP_F_NEW		0x01	/* blocks have been newly allocated */
 #define IOMAP_F_DIRTY		0x02	/* uncommitted metadata */
 #define IOMAP_F_BUFFER_HEAD	0x04	/* file system requires buffer heads */
+#define IOMAP_F_UNBALANCED	0x08	/* don't balance dirty pages */
 
 /*
  * Flags that only need to be reported for IOMAP_REPORT requests:
Ross Lagerwall March 27, 2019, 4:49 p.m. UTC | #4
On 3/22/19 12:21 AM, Andreas Gruenbacher wrote:
> On Fri, 22 Mar 2019 at 00:01, Andreas Gruenbacher <agruenba@redhat.com> wrote:
>> On Thu, 21 Mar 2019 at 22:43, Dave Chinner <david@fromorbit.com> wrote:
>>> The problem is calling balance_dirty_pages() inside the
>>> ->iomap_begin/->iomap_end calls and not that it is called by the
>>> iomap infrastructure itself, right?
>>>
>>> If so, I'd prefer to see this in iomap_apply() after the call to
>>> ops->iomap_end because iomap_file_buffered_write() can iterate and
>>> call iomap_apply() multiple times. This would keep the balancing to
>>> a per-iomap granularity, rather than a per-syscall granularity.
>>>
>>> i.e. if we do write(2GB), we want more than one balancing call
>>> during that syscall, so it would be up to the filesystem to a) limit
>>> the size of write mappings to something smaller (e.g. 1024 pages)
>>> so that there are still frequent balancing calls for large writes.
>>
>> Hmm. The looping across multiple mappings isn't done in iomap_apply
>> but in iomap_file_buffered_write, so the balancing could go into
>> iomap_apply or iomap_file_buffered_write, but can't go further up the
>> stack. Given that, iomap_file_buffered_write seems the better place,
>> but this is still quite horrible.
> 
> Here's a more reasonable version of my first patch, with a cleaned up
> and hopefully fixed gfs2 part.
> 
> In addition, this checks for IOMAP_F_UNBALANCED in iomap_dirty_actor,
> the actor for iomap_file_dirty.  We don't use iomap_file_dirty in gfs2,
> but we should probably allow skipping the dirty page balancing there
> as well.
> 
> Thanks,
> Andreas
> ---
>   fs/gfs2/bmap.c        | 64 +++++++++++++++++++++++++++++++++----------
>   fs/iomap.c            |  6 ++--
>   include/linux/iomap.h |  1 +
>   3 files changed, 55 insertions(+), 16 deletions(-)
> 
Thanks, this fixes the reported deadlock. I haven't yet checked whether 
it has any performance impact.

Tested-by: Ross Lagerwall <ross.lagerwall@citrix.com>
Christoph Hellwig March 28, 2019, 4:51 p.m. UTC | #5
On Thu, Mar 21, 2019 at 02:13:04PM +0100, Andreas Gruenbacher wrote:
> Hi Christoph,
> 
> we need your help fixing a gfs2 deadlock involving iomap.  What's going
> on is the following:
> 
> * During iomap_file_buffered_write, gfs2_iomap_begin grabs the log flush
>   lock and keeps it until gfs2_iomap_end.  It currently always does that
>   even though there is no point other than for journaled data writes.
> 
> * iomap_file_buffered_write then calls balance_dirty_pages_ratelimited.
>   If that ends up calling gfs2_write_inode, gfs2 will try to grab the
>   log flush lock again and deadlock.

What is the exact call chain?  balance_dirty_pages_ratelimited these
days doesn't start I/O, but just wakes up the flusher threads.  Or
do we have an issue where it is blocking on those threads?

Also why do you need to flush the log for background writeback in
->write_inode?

balance_dirty_pages_ratelimited is per definition not a data integrity
writeback, so there shouldn't be a good reason to flush the log
(which I assume the log flush lock is for).  If we look at gfs2_write_inode,
this seems to be the code:

	bool flush_all = (wbc->sync_mode == WB_SYNC_ALL || gfs2_is_jdata(ip));

        if (flush_all)
		gfs2_log_flush(GFS2_SB(inode), ip->i_gl,
			       GFS2_LOG_HEAD_FLUSH_NORMAL |
			       GFS2_LFC_WRITE_INODE);

But what is the requirement to do this in writeback context?  Can't
we move it out into another context instead?
Andreas Gruenbacher March 29, 2019, 10:13 p.m. UTC | #6
On Thu, 28 Mar 2019 at 17:51, Christoph Hellwig <hch@lst.de> wrote:
> On Thu, Mar 21, 2019 at 02:13:04PM +0100, Andreas Gruenbacher wrote:
> > Hi Christoph,
> >
> > we need your help fixing a gfs2 deadlock involving iomap.  What's going
> > on is the following:
> >
> > * During iomap_file_buffered_write, gfs2_iomap_begin grabs the log flush
> >   lock and keeps it until gfs2_iomap_end.  It currently always does that
> >   even though there is no point other than for journaled data writes.
> >
> > * iomap_file_buffered_write then calls balance_dirty_pages_ratelimited.
> >   If that ends up calling gfs2_write_inode, gfs2 will try to grab the
> >   log flush lock again and deadlock.
>
> What is the exact call chain?

It's laid out here:
https://www.redhat.com/archives/cluster-devel/2019-March/msg00000.html

> balance_dirty_pages_ratelimited these days doesn't start I/O, but just
> wakes up the flusher threads.  Or do we have an issue where it is blocking
> on those threads?

Yes, the writer is holding sd_log_flush_lock at the point where it
ends up kicking the flusher thread and waiting for writeback to
happen. The flusher thread calls gfs2_write_inode, and that tries to
grab sd_log_flush_lock again.

> Also why do you need to flush the log for background writeback in
> ->write_inode?

If we stop doing that in the (wbc->sync_mode == WB_SYNC_NONE) case,
then inodes will remain dirty until the journal is flushed for some
other reason (or a write_inode with WB_SYNC_ALL). That doesn't seem
right. We could perhaps trigger a background journal flush in the
WB_SYNC_NONE case, but that would remove the back pressure on
balance_dirty_pages. Not sure this is a good idea, either.

> balance_dirty_pages_ratelimited is per definition not a data integrity
> writeback, so there shouldn't be a good reason to flush the log
> (which I assume the log flush lock is for).
>
> If we look at gfs2_write_inode, this seems to be the code:
>
>         bool flush_all = (wbc->sync_mode == WB_SYNC_ALL || gfs2_is_jdata(ip));
>
>         if (flush_all)
>                 gfs2_log_flush(GFS2_SB(inode), ip->i_gl,
>                                GFS2_LOG_HEAD_FLUSH_NORMAL |
>                                GFS2_LFC_WRITE_INODE);
>
> But what is the requirement to do this in writeback context?  Can't
> we move it out into another context instead?

Indeed, this isn't for data integrity in this case but because the
dirty limit is exceeded. What other context would you suggest to move
this to?

(The iomap flag I've proposed would save us from getting into this
situation in the first place.)

Thanks,
Andreas
Christoph Hellwig April 7, 2019, 7:32 a.m. UTC | #7
[adding Jan and linux-mm]

On Fri, Mar 29, 2019 at 11:13:00PM +0100, Andreas Gruenbacher wrote:
> > But what is the requirement to do this in writeback context?  Can't
> > we move it out into another context instead?
> 
> Indeed, this isn't for data integrity in this case but because the
> dirty limit is exceeded. What other context would you suggest to move
> this to?
> 
> (The iomap flag I've proposed would save us from getting into this
> situation in the first place.)

Your patch does two things:

 - it only calls balance_dirty_pages_ratelimited once per write
   operation instead of once per page.  In the past btrfs did
   hacks like that, but IIRC they caused VM balancing issues.
   That is why everyone now calls balance_dirty_pages_ratelimited
   once per page.  If calling it at a coarse granularity would
   be fine we should do it everywhere instead of just in gfs2
   in journaled mode
 - it artificially reduces the size of writes to a low value,
   which I suspect is going to break real-life applications

So I really think we need to fix this properly.  And if that means
that you can't make use of the iomap batching for gfs2 in journaled
mode that is still a better option.  But I really think you need
to look into the scope of your flush_log and figure out a good way
to reduce that and solve the root cause.
Andreas Gruenbacher April 8, 2019, 8:53 a.m. UTC | #8
On Sun, 7 Apr 2019 at 09:32, Christoph Hellwig <hch@lst.de> wrote:
>
> [adding Jan and linux-mm]
>
> On Fri, Mar 29, 2019 at 11:13:00PM +0100, Andreas Gruenbacher wrote:
> > > But what is the requirement to do this in writeback context?  Can't
> > > we move it out into another context instead?
> >
> > Indeed, this isn't for data integrity in this case but because the
> > dirty limit is exceeded. What other context would you suggest to move
> > this to?
> >
> > (The iomap flag I've proposed would save us from getting into this
> > situation in the first place.)
>
> Your patch does two things:
>
>  - it only calls balance_dirty_pages_ratelimited once per write
>    operation instead of once per page.  In the past btrfs did
>    hacks like that, but IIRC they caused VM balancing issues.
>    That is why everyone now calls balance_dirty_pages_ratelimited
>    once per page.  If calling it at a coarse granularity would
>    be fine we should do it everywhere instead of just in gfs2
>    in journaled mode
>  - it artificially reduces the size of writes to a low value,
>    which I suspect is going to break real-life applications

Not quite, balance_dirty_pages_ratelimited is called from iomap_end,
so once per iomap mapping returned, not per write. (The first version
of this patch got that wrong by accident, but not the second.)

We can limit the size of the mappings returned just in that case. I'm
aware that there is a risk of balancing problems; I just don't have
any better ideas.

This is a problem all filesystems with data-journaling will have with
iomap; it's not that gfs2 is doing anything particularly stupid.

> So I really think we need to fix this properly.  And if that means
> that you can't make use of the iomap batching for gfs2 in journaled
> mode that is still a better option.

That would mean using the old-style, page-size allocations, and a
completely separate write path in that case. That would be quite a
nightmare.

> But I really think you need
> to look into the scope of your flush_log and figure out a good way
> to reduce that and solve the root cause.

We won't be able to do a log flush while another transaction is
active, but that's what's needed to clean dirty pages. iomap doesn't
allow us to put the block allocation into a separate transaction from
the page writes; for that, the opposite to the page_done hook would
probably be needed.

Thanks,
Andreas
Jan Kara April 8, 2019, 1:44 p.m. UTC | #9
On Mon 08-04-19 10:53:34, Andreas Gruenbacher wrote:
> On Sun, 7 Apr 2019 at 09:32, Christoph Hellwig <hch@lst.de> wrote:
> >
> > [adding Jan and linux-mm]
> >
> > On Fri, Mar 29, 2019 at 11:13:00PM +0100, Andreas Gruenbacher wrote:
> > > > But what is the requirement to do this in writeback context?  Can't
> > > > we move it out into another context instead?
> > >
> > > Indeed, this isn't for data integrity in this case but because the
> > > dirty limit is exceeded. What other context would you suggest to move
> > > this to?
> > >
> > > (The iomap flag I've proposed would save us from getting into this
> > > situation in the first place.)
> >
> > Your patch does two things:
> >
> >  - it only calls balance_dirty_pages_ratelimited once per write
> >    operation instead of once per page.  In the past btrfs did
> >    hacks like that, but IIRC they caused VM balancing issues.
> >    That is why everyone now calls balance_dirty_pages_ratelimited
> >    once per page.  If calling it at a coarse granularity would
> >    be fine we should do it everywhere instead of just in gfs2
> >    in journaled mode
> >  - it artificially reduces the size of writes to a low value,
> >    which I suspect is going to break real-life applications
> 
> Not quite, balance_dirty_pages_ratelimited is called from iomap_end,
> so once per iomap mapping returned, not per write. (The first version
> of this patch got that wrong by accident, but not the second.)
> 
> We can limit the size of the mappings returned just in that case. I'm
> aware that there is a risk of balancing problems, I just don't have
> any better ideas.
> 
> This is a problem all filesystems with data-journaling will have with
> iomap; it's not that gfs2 is doing anything particularly stupid.

I agree that if ext4 would be using iomap, it would have similar issues.

> > So I really think we need to fix this properly.  And if that means
> > that you can't make use of the iomap batching for gfs2 in journaled
> > mode that is still a better option.
> 
> That would mean using the old-style, page-size allocations, and a
> completely separate write path in that case. That would be quite a
> nightmare.
> 
> > But I really think you need
> > to look into the scope of your flush_log and figure out a good way
> > to reduce that and solve the root cause.
> 
> We won't be able to do a log flush while another transaction is
> active, but that's what's needed to clean dirty pages. iomap doesn't
> allow us to put the block allocation into a separate transaction from
> the page writes; for that, the opposite to the page_done hook would
> probably be needed.

I agree that a ->page_prepare() hook would probably be the cleanest
solution for this.

								Honza
Christoph Hellwig April 9, 2019, 12:15 p.m. UTC | #10
On Mon, Apr 08, 2019 at 03:44:05PM +0200, Jan Kara wrote:
> > We won't be able to do a log flush while another transaction is
> > active, but that's what's needed to clean dirty pages. iomap doesn't
> > allow us to put the block allocation into a separate transaction from
> > the page writes; for that, the opposite to the page_done hook would
> > probably be needed.
> 
> I agree that a ->page_prepare() hook would probably be the cleanest
> solution for this.

That doesn't sound too bad to me.
Andreas Gruenbacher April 9, 2019, 12:27 p.m. UTC | #11
On Tue, 9 Apr 2019 at 14:15, Christoph Hellwig <hch@lst.de> wrote:
> On Mon, Apr 08, 2019 at 03:44:05PM +0200, Jan Kara wrote:
> > > We won't be able to do a log flush while another transaction is
> > > active, but that's what's needed to clean dirty pages. iomap doesn't
> > > allow us to put the block allocation into a separate transaction from
> > > the page writes; for that, the opposite to the page_done hook would
> > > probably be needed.
> >
> > I agree that a ->page_prepare() hook would probably be the cleanest
> > solution for this.
>
> That doesn't sound too bad to me.

Okay, I'll see how the code for that will turn out.

Thanks,
Andreas

Patch

diff --git a/fs/gfs2/bmap.c b/fs/gfs2/bmap.c
index 02b2646d84b3a..dad7d3810533e 100644
--- a/fs/gfs2/bmap.c
+++ b/fs/gfs2/bmap.c
@@ -862,6 +862,24 @@  static int gfs2_iomap_get(struct inode *inode, loff_t pos, loff_t length,
 	iomap->offset = lblock << inode->i_blkbits;
 	lblock_stop = (pos + length - 1) >> inode->i_blkbits;
 	len = lblock_stop - lblock + 1;
+
+	if ((flags & IOMAP_WRITE) && gfs2_is_jdata(ip)) {
+		int max_pages;
+		u64 max_blocks;
+
+		/*
+		 * Limit the write size: this ensures that write throttling
+		 * will kick in fast enough even when we don't call
+		 * balance_dirty_pages_ratelimited for each page written.
+		 */
+		max_pages = min((int)DIV_ROUND_UP(length, PAGE_SIZE),
+				current->nr_dirtied_pause - current->nr_dirtied);
+		max_pages = max(max_pages, 8);
+		max_blocks = (u64)max_pages << (PAGE_SHIFT - inode->i_blkbits);
+		if (len > max_blocks)
+			len = max_blocks;
+	}
+
 	iomap->length = len << inode->i_blkbits;
 
 	height = ip->i_height;
@@ -974,6 +992,19 @@  static void gfs2_iomap_journaled_page_done(struct inode *inode, loff_t pos,
 	gfs2_page_add_databufs(ip, page, offset_in_page(pos), copied);
 }
 
+static void gfs2_write_end(struct inode *inode, struct buffer_head *dibh)
+{
+	struct gfs2_inode *ip = GFS2_I(inode);
+	struct gfs2_trans *tr = current->journal_info;
+
+	gfs2_ordered_add_inode(ip);
+
+	if (tr->tr_num_buf_new)
+		__mark_inode_dirty(inode, I_DIRTY_DATASYNC);
+	else
+		gfs2_trans_add_meta(ip->i_gl, dibh);
+}
+
 static int gfs2_iomap_begin_write(struct inode *inode, loff_t pos,
 				  loff_t length, unsigned flags,
 				  struct iomap *iomap,
@@ -1052,6 +1083,13 @@  static int gfs2_iomap_begin_write(struct inode *inode, loff_t pos,
 	}
 	if (!gfs2_is_stuffed(ip) && gfs2_is_jdata(ip))
 		iomap->page_done = gfs2_iomap_journaled_page_done;
+
+	if (gfs2_is_stuffed(ip) || gfs2_is_jdata(ip) || inode == sdp->sd_rindex) {
+		iomap->flags |= IOMAP_F_UNBALANCED;
+	} else {
+		gfs2_write_end(inode, mp->mp_bh[0]);
+		gfs2_trans_end(sdp);
+	}
 	return 0;
 
 out_trans_end:
@@ -1103,28 +1141,24 @@  static int gfs2_iomap_end(struct inode *inode, loff_t pos, loff_t length,
 			  ssize_t written, unsigned flags, struct iomap *iomap)
 {
 	struct gfs2_inode *ip = GFS2_I(inode);
-	struct gfs2_sbd *sdp = GFS2_SB(inode);
-	struct gfs2_trans *tr = current->journal_info;
 	struct buffer_head *dibh = iomap->private;
 
 	if ((flags & (IOMAP_WRITE | IOMAP_DIRECT)) != IOMAP_WRITE)
 		goto out;
 
-	if (iomap->type != IOMAP_INLINE) {
-		gfs2_ordered_add_inode(ip);
+	if (current->journal_info) {
+		struct gfs2_sbd *sdp = GFS2_SB(inode);
 
-		if (tr->tr_num_buf_new)
-			__mark_inode_dirty(inode, I_DIRTY_DATASYNC);
-		else
-			gfs2_trans_add_meta(ip->i_gl, dibh);
-	}
+		if (iomap->type != IOMAP_INLINE)
+			gfs2_write_end(inode, dibh);
 
-	if (inode == sdp->sd_rindex) {
-		adjust_fs_space(inode);
-		sdp->sd_rindex_uptodate = 0;
-	}
+		if (inode == sdp->sd_rindex) {
+			adjust_fs_space(inode);
+			sdp->sd_rindex_uptodate = 0;
+		}
 
-	gfs2_trans_end(sdp);
+		gfs2_trans_end(sdp);
+	}
 	gfs2_inplace_release(ip);
 
 	if (length != written && (iomap->flags & IOMAP_F_NEW)) {
diff --git a/fs/gfs2/file.c b/fs/gfs2/file.c
index 58a768e59712e..542bd149c4aa3 100644
--- a/fs/gfs2/file.c
+++ b/fs/gfs2/file.c
@@ -867,6 +867,8 @@  static ssize_t gfs2_file_write_iter(struct kiocb *iocb, struct iov_iter *from)
 			iocb->ki_pos += ret;
 	}
 
+	balance_dirty_pages_ratelimited(file->f_mapping);
+
 out2:
 	current->backing_dev_info = NULL;
 out:
diff --git a/fs/iomap.c b/fs/iomap.c
index 97cb9d486a7da..0d8ea3f29b2ef 100644
--- a/fs/iomap.c
+++ b/fs/iomap.c
@@ -863,7 +863,8 @@  iomap_write_actor(struct inode *inode, loff_t pos, loff_t length, void *data,
 		written += copied;
 		length -= copied;
 
-		balance_dirty_pages_ratelimited(inode->i_mapping);
+		if (!(iomap->flags & IOMAP_F_UNBALANCED))
+			balance_dirty_pages_ratelimited(inode->i_mapping);
 	} while (iov_iter_count(i) && length);
 
 	return written ? written : status;
diff --git a/include/linux/iomap.h b/include/linux/iomap.h
index 0fefb5455bdaf..e9a04e76a3217 100644
--- a/include/linux/iomap.h
+++ b/include/linux/iomap.h
@@ -35,6 +35,7 @@  struct vm_fault;
 #define IOMAP_F_NEW		0x01	/* blocks have been newly allocated */
 #define IOMAP_F_DIRTY		0x02	/* uncommitted metadata */
 #define IOMAP_F_BUFFER_HEAD	0x04	/* file system requires buffer heads */
+#define IOMAP_F_UNBALANCED	0x08	/* don't balance dirty pages */
 
 /*
  * Flags that only need to be reported for IOMAP_REPORT requests: