diff mbox

iomap infrastructure and multipage writes V5

Message ID 20160630172239.GA23082@lst.de (mailing list archive)
State New, archived
Headers show

Commit Message

Christoph Hellwig June 30, 2016, 5:22 p.m. UTC
On Tue, Jun 28, 2016 at 10:26:49AM +1000, Dave Chinner wrote:
> Christoph, it look slike there's an ENOSPC+ENOMEM behavioural regression here.
> generic/224 on my 1p/1GB RAM VM using a 1k lock size filesystem has
> significantly different behaviour once ENOSPC is hit withi this patchset.
> 
> It ends up with an endless stream of errors like this:

I've spent some time trying to reproduce this.  I'm actually getting
the OOM killer almost reproducible for for-next without the iomap
patches as well when just using 1GB of mem.  1400 MB is the minimum
I can reproducibly finish the test with either code base.

But with the 1400 MB setup I see a few interesting things.  Even
with the baseline, no-iomap case I see a few errors in the log:

[   70.407465] Filesystem "vdc": reserve blocks depleted! Consider increasing
reserve pool
size.
[   70.195645] XFS (vdc): page discard on page ffff88005682a988, inode 0xd3, offset 761856.
[   70.408079] Buffer I/O error on dev vdc, logical block 1048513, lost async
page write
[   70.408598] Buffer I/O error on dev vdc, logical block 1048514, lost async
page write
 27s

With iomap I also see the spew of page discard errors your see, but while
I see a lot of them, the rest still finishes after a reasonable time,
just a few seconds more than the pre-iomap baseline.  I also see the
reserve block depleted message in this case.

Digging into the reserve block depleted message - it seems we have
too many parallel iomap_allocate transactions going on.  I suspect
this might be because the writeback code will not finish a writeback
context if we have multiple blocks inside a page, which can
happen easily for this 1k ENOSPC setup.  I've not had time to fully
check if this is what really happens, but I did a quick hack (see below)
to only allocate 1k at a time in iomap_begin, and with that generic/224
finishes without the warning spew.  Of course this isn't a real fix,
and I need to fully understand what's going on in writeback due to
different allocation / dirtying patterns from the iomap change.


--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Comments

Dave Chinner June 30, 2016, 11:16 p.m. UTC | #1
On Thu, Jun 30, 2016 at 07:22:39PM +0200, Christoph Hellwig wrote:
> On Tue, Jun 28, 2016 at 10:26:49AM +1000, Dave Chinner wrote:
> > Christoph, it look slike there's an ENOSPC+ENOMEM behavioural regression here.
> > generic/224 on my 1p/1GB RAM VM using a 1k lock size filesystem has
> > significantly different behaviour once ENOSPC is hit withi this patchset.
> > 
> > It ends up with an endless stream of errors like this:
> 
> I've spent some time trying to reproduce this.  I'm actually getting
> the OOM killer almost reproducible for for-next without the iomap
> patches as well when just using 1GB of mem.  1400 MB is the minimum
> I can reproducibly finish the test with either code base.
> 
> But with the 1400 MB setup I see a few interesting things.  Even
> with the baseline, no-iomap case I see a few errors in the log:
> 
> [   70.407465] Filesystem "vdc": reserve blocks depleted! Consider increasing
> reserve pool
> size.
> [   70.195645] XFS (vdc): page discard on page ffff88005682a988, inode 0xd3, offset 761856.
> [   70.408079] Buffer I/O error on dev vdc, logical block 1048513, lost async
> page write
> [   70.408598] Buffer I/O error on dev vdc, logical block 1048514, lost async
> page write
>  27s
> 
> With iomap I also see the spew of page discard errors your see, but while
> I see a lot of them, the rest still finishes after a reasonable time,
> just a few seconds more than the pre-iomap baseline.  I also see the
> reserve block depleted message in this case.

The reserve block pool depleted message is normal for me in this
test.  We're throwing a thousand concurrent processes at the
filesystem at ENOSPC, and so the metadata reservation for the
delayed allocation totals quite a lot. We only reserve 8192 blocks
for the reserve pool, so a delalloc reservation for one page on each
file (4 blocks per page, which means a couple of blocks for the
metadata reservation via the indlen calculation) is going to consume
the reserve pool quite quickly if the up front reservation
overshoots the XFS_ALLOC_SET_ASIDE() ENOSPC threshold.

> Digging into the reserve block depleted message - it seems we have
> too many parallel iomap_allocate transactions going on.  I suspect
> this might be because the writeback code will not finish a writeback
> context if we have multiple blocks inside a page, which can
> happen easily for this 1k ENOSPC setup.

Right - this test has regularly triggered that warning on this
particular test setup for me - it's not something new to the iomap
patchset.

> I've not had time to fully
> check if this is what really happens, but I did a quick hack (see below)
> to only allocate 1k at a time in iomap_begin, and with that generic/224
> finishes without the warning spew.  Of course this isn't a real fix,
> and I need to fully understand what's going on in writeback due to
> different allocation / dirtying patterns from the iomap change.

Which tends to indicate that the multi-block allocation has a larger
indlen reservation, and that's what is causing the code to hit
whatever edge-case that is leading to it not recovering. However,
I'm still wondering how we are not throwing ENOSPC back to userspace
at XFS_ALLOC_SET_ASIDE limits.

Cheers,

Dave.
Dave Chinner July 18, 2016, 11:14 a.m. UTC | #2
On Thu, Jun 30, 2016 at 07:22:39PM +0200, Christoph Hellwig wrote:
> On Tue, Jun 28, 2016 at 10:26:49AM +1000, Dave Chinner wrote:
> > Christoph, it look slike there's an ENOSPC+ENOMEM behavioural regression here.
> > generic/224 on my 1p/1GB RAM VM using a 1k lock size filesystem has
> > significantly different behaviour once ENOSPC is hit withi this patchset.
> > 
> > It ends up with an endless stream of errors like this:
> 
> I've spent some time trying to reproduce this.  I'm actually getting
> the OOM killer almost reproducible for for-next without the iomap
> patches as well when just using 1GB of mem.  1400 MB is the minimum
> I can reproducibly finish the test with either code base.
> 
> But with the 1400 MB setup I see a few interesting things.  Even
> with the baseline, no-iomap case I see a few errors in the log:
> 
> [   70.407465] Filesystem "vdc": reserve blocks depleted! Consider increasing
> reserve pool
> size.
> [   70.195645] XFS (vdc): page discard on page ffff88005682a988, inode 0xd3, offset 761856.
> [   70.408079] Buffer I/O error on dev vdc, logical block 1048513, lost async
> page write
> [   70.408598] Buffer I/O error on dev vdc, logical block 1048514, lost async
> page write
>  27s
> 
> With iomap I also see the spew of page discard errors your see, but while
> I see a lot of them, the rest still finishes after a reasonable time,
> just a few seconds more than the pre-iomap baseline.  I also see the
> reserve block depleted message in this case.
> 
> Digging into the reserve block depleted message - it seems we have
> too many parallel iomap_allocate transactions going on.  I suspect
> this might be because the writeback code will not finish a writeback
> context if we have multiple blocks inside a page, which can
> happen easily for this 1k ENOSPC setup.  I've not had time to fully
> check if this is what really happens, but I did a quick hack (see below)
> to only allocate 1k at a time in iomap_begin, and with that generic/224
> finishes without the warning spew.  Of course this isn't a real fix,
> and I need to fully understand what's going on in writeback due to
> different allocation / dirtying patterns from the iomap change.

Any progress here, Christoph? The current test run has been running
generic/224 on the 1GB mem test Vm for almost 6 hours now, and it's
still discarding pages. This doesn't always happen - sometimes it
takes the normal amount of time to run, but every so often it falls
into this "discard every page" loop and it takes hours to
complete...

Cheers,

Dave.
Dave Chinner July 18, 2016, 11:18 a.m. UTC | #3
On Mon, Jul 18, 2016 at 09:14:00PM +1000, Dave Chinner wrote:
> On Thu, Jun 30, 2016 at 07:22:39PM +0200, Christoph Hellwig wrote:
> > On Tue, Jun 28, 2016 at 10:26:49AM +1000, Dave Chinner wrote:
> > > Christoph, it look slike there's an ENOSPC+ENOMEM behavioural regression here.
> > > generic/224 on my 1p/1GB RAM VM using a 1k lock size filesystem has
> > > significantly different behaviour once ENOSPC is hit withi this patchset.
> > > 
> > > It ends up with an endless stream of errors like this:
> > 
> > I've spent some time trying to reproduce this.  I'm actually getting
> > the OOM killer almost reproducible for for-next without the iomap
> > patches as well when just using 1GB of mem.  1400 MB is the minimum
> > I can reproducibly finish the test with either code base.
> > 
> > But with the 1400 MB setup I see a few interesting things.  Even
> > with the baseline, no-iomap case I see a few errors in the log:
> > 
> > [   70.407465] Filesystem "vdc": reserve blocks depleted! Consider increasing
> > reserve pool
> > size.
> > [   70.195645] XFS (vdc): page discard on page ffff88005682a988, inode 0xd3, offset 761856.
> > [   70.408079] Buffer I/O error on dev vdc, logical block 1048513, lost async
> > page write
> > [   70.408598] Buffer I/O error on dev vdc, logical block 1048514, lost async
> > page write
> >  27s
> > 
> > With iomap I also see the spew of page discard errors your see, but while
> > I see a lot of them, the rest still finishes after a reasonable time,
> > just a few seconds more than the pre-iomap baseline.  I also see the
> > reserve block depleted message in this case.
> > 
> > Digging into the reserve block depleted message - it seems we have
> > too many parallel iomap_allocate transactions going on.  I suspect
> > this might be because the writeback code will not finish a writeback
> > context if we have multiple blocks inside a page, which can
> > happen easily for this 1k ENOSPC setup.  I've not had time to fully
> > check if this is what really happens, but I did a quick hack (see below)
> > to only allocate 1k at a time in iomap_begin, and with that generic/224
> > finishes without the warning spew.  Of course this isn't a real fix,
> > and I need to fully understand what's going on in writeback due to
> > different allocation / dirtying patterns from the iomap change.
> 
> Any progress here, Christoph? The current test run has been running
> generic/224 on the 1GB mem test Vm for almost 6 hours now, and it's
> still discarding pages. This doesn't always happen - sometimes it
> takes the normal amount of time to run, but every so often it falls
> into this "discard every page" loop and it takes hours to
> complete...

.... and I've now got a 16p/16GB RAM VM stuck in this loop in
generic/224, so it's not limited to low memory machines....

Cheers,

Dave.
Christoph Hellwig July 19, 2016, 3:50 a.m. UTC | #4
On Mon, Jul 18, 2016 at 09:14:00PM +1000, Dave Chinner wrote:
> Any progress here, Christoph?

Nothing after the last posting yet.  Shortly after that I left to Japan
for about two weeks of hiking and a conference, which doesn't help my
ability to spend some quiet hours with the code.   It's on top of
my priority list for non-trivial things and I should be able to get
back it tomorrow after finishing the next conference today..
--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
diff mbox

Patch

diff --git a/fs/xfs/xfs_iomap.c b/fs/xfs/xfs_iomap.c
index 620fc91..d9afba2 100644
--- a/fs/xfs/xfs_iomap.c
+++ b/fs/xfs/xfs_iomap.c
@@ -1018,7 +1018,7 @@  xfs_file_iomap_begin(
 		 * Note that the values needs to be less than 32-bits wide until
 		 * the lower level functions are updated.
 		 */
-		length = min_t(loff_t, length, 1024 * PAGE_SIZE);
+		length = min_t(loff_t, length, 1024);
 		if (xfs_get_extsz_hint(ip)) {
 			/*
 			 * xfs_iomap_write_direct() expects the shared lock. It