Message ID | CAKhLTr1UL3ePTpYjXOx2AJfNk8Ku2EdcEfu+CH1sf3Asr=B-Dw@mail.gmail.com (mailing list archive) |
---|---|
State | New |
Headers | show |
Series | Possible regression with buffered writes + NOWAIT behavior, under memory pressure | expand |
On Mon, Feb 10, 2025 at 3:12 PM Raphael S. Carvalho <raphaelsc@scylladb.com> wrote: > > While running scylladb test suite, which uses io_uring + buffered > writes + XFS, the system was spuriously returning ENOMEM, despite > there being plenty of available memory to be reclaimed from the page > cache. FWIW, I am running: 6.12.9-100.fc40.x86_64 > > Tracing showed io_uring_complete failing the request with ENOMEM: > # cat /sys/kernel/debug/tracing/trace | grep "result -12" -B 100 | > grep "0000000065b91cd1" > reactor-1-707139 [000] ..... 46737.358518: > io_uring_submit_req: ring 00000000e52339b8, req 0000000065b91cd1, > user_data 0x50f0001e4000, opcode WRITE, flags 0x200000, sq_thread 0 > reactor-1-707139 [000] ..... 46737.358526: io_uring_file_get: > ring 00000000e52339b8, req 0000000065b91cd1, user_data 0x50f0001e4000, > fd 45 > reactor-1-707139 [000] ...1. 46737.358560: io_uring_complete: > ring 00000000e52339b8, req 0000000065b91cd1, user_data 0x50f0001e4000, > result -12, cflags 0x0 extra1 0 extra2 0 > > That puzzled me. > > Using retsnoop, it pointed to iomap_get_folio: > > 00:34:16.180612 -> 00:34:16.180651 TID/PID 253786/253721 > (reactor-1/combined_tests): > > entry_SYSCALL_64_after_hwframe+0x76 > do_syscall_64+0x82 > __do_sys_io_uring_enter+0x265 > io_submit_sqes+0x209 > io_issue_sqe+0x5b > io_write+0xdd > xfs_file_buffered_write+0x84 > iomap_file_buffered_write+0x1a6 > 32us [-ENOMEM] iomap_write_begin+0x408 > iter=&{.inode=0xffff8c67aa031138,.len=4096,.flags=33,.iomap={.addr=0xffffffffffffffff,.length=4096,.type=1,.flags=3,.bdev=0x… > pos=0 len=4096 foliop=0xffffb32c296b7b80 > ! 4us [-ENOMEM] iomap_get_folio > iter=&{.inode=0xffff8c67aa031138,.len=4096,.flags=33,.iomap={.addr=0xffffffffffffffff,.length=4096,.type=1,.flags=3,.bdev=0x… > pos=0 len=4096 > > Another trace shows iomap_file_buffered_write with ki_flags 2359304, > which translate into (IOCB_WRITE & IOCB_ALLOC_CACHE & IOCB_NOWAIT) > And flags 33 in iomap_get_folio means IOMAP_NOWAIT, which makes sense > since XFS translates IOCB_NOWAIT into IOMAP_NOWAIT for performing the > buffered write through iomap subsystem: > > fs/iomap/buffered-io.c- if (iocb->ki_flags & IOCB_NOWAIT) > fs/iomap/buffered-io.c: iter.flags |= IOMAP_NOWAIT; > > > We know io_uring works by first attempting to write with IOCB_NOWAIT, > and if it fails with EAGAIN, it falls back to worker thread without > the NOWAIT semantics. > > iomap_get_folio(), once called with IOMAP_NOWAIT, will request the > allocation to follow GFP_NOWAIT behavior, so allocation can > potentially fail under pressure. > > Coming across 'iomap: Add async buffered write support', I see Darrick wrote: > > "FGP_NOWAIT can cause __filemap_get_folio to return a NULL folio, which > makes iomap_write_begin return -ENOMEM. If nothing has been written > yet, won't that cause the ENOMEM to escape to userspace? Why do we want > that instead of EAGAIN?" > > In the patch ''mm: return an ERR_PTR from __filemap_get_folio', I see > the following changes: > --- a/fs/iomap/buffered-io.c > +++ b/fs/iomap/buffered-io.c > @@ -468,19 +468,12 @@ EXPORT_SYMBOL_GPL(iomap_is_partially_uptodate); > struct folio *iomap_get_folio(struct iomap_iter *iter, loff_t pos) > { > unsigned fgp = FGP_LOCK | FGP_WRITE | FGP_CREAT | FGP_STABLE | FGP_NOFS; > - struct folio *folio; > > if (iter->flags & IOMAP_NOWAIT) > fgp |= FGP_NOWAIT; > > - folio = __filemap_get_folio(iter->inode->i_mapping, pos >> PAGE_SHIFT, > + return __filemap_get_folio(iter->inode->i_mapping, pos >> PAGE_SHIFT, > fgp, mapping_gfp_mask(iter->inode->i_mapping)); > - if (folio) > - return folio; > - > - if (iter->flags & IOMAP_NOWAIT) > - return ERR_PTR(-EAGAIN); > - return ERR_PTR(-ENOMEM); > } > > This leads to me believe we have a regression in this area, after that > patch, since iomap_get_folio() is no longer returning EAGAIN with > IOMAP_NOWAIT, if __filemap_get_folio() failed to get a folio. Now it > returns ENOMEM unconditionally. > > Since we pushed the error picking decision to __filemap_get_folio, I > think it makes sense for us to patch it such that it returns EAGAIN if > allocation failed (under pressure) because IOMAP_NOWAIT was requested > by its caller and allocation is not allowed to block waiting for > reclaimer to do its thing. > > A possible way to fix it is this one-liner, but I am not well versed > in this area, so someone may end up suggesting a better fix: > diff --git a/mm/filemap.c b/mm/filemap.c > index 804d7365680c..9e698a619545 100644 > --- a/mm/filemap.c > +++ b/mm/filemap.c > @@ -1964,7 +1964,7 @@ struct folio *__filemap_get_folio(struct > address_space *mapping, pgoff_t index, > do { > gfp_t alloc_gfp = gfp; > > - err = -ENOMEM; > + err = (fgp_flags & FGP_NOWAIT) ? -ENOMEM : -EAGAIN; Sorry, I actually meant this: + err = (fgp_flags & FGP_NOWAIT) ? -EAGAIN : -ENOMEM;
> > A possible way to fix it is this one-liner, but I am not well versed > > in this area, so someone may end up suggesting a better fix: > > diff --git a/mm/filemap.c b/mm/filemap.c > > index 804d7365680c..9e698a619545 100644 > > --- a/mm/filemap.c > > +++ b/mm/filemap.c > > @@ -1964,7 +1964,7 @@ struct folio *__filemap_get_folio(struct > > address_space *mapping, pgoff_t index, > > do { > > gfp_t alloc_gfp = gfp; > > > > - err = -ENOMEM; > > + err = (fgp_flags & FGP_NOWAIT) ? -ENOMEM : -EAGAIN; > > Sorry, I actually meant this: > + err = (fgp_flags & FGP_NOWAIT) ? -EAGAIN : -ENOMEM; Digging a bit more, I realized a better patch (assuming regression indeed exists) is this one, since it accounts for ENOMEM coming from filemap_add_folio, which might allocate in xas_split_alloc() under same fgp flags: diff --git a/mm/filemap.c b/mm/filemap.c index 804d7365680c..dcf1f57e0a9a 100644 --- a/mm/filemap.c +++ b/mm/filemap.c @@ -1984,6 +1984,8 @@ struct folio *__filemap_get_folio(struct address_space *mapping, pgoff_t index, folio = NULL; } while (order-- > min_order); + if ((fgp_flags & FGP_NOWAIT) && err == -ENOMEM) + return ERR_PTR(-EAGAIN); if (err == -EEXIST) goto repeat; if (err)
On Mon, Feb 10, 2025 at 03:12:24PM -0300, Raphael S. Carvalho wrote: > While running scylladb test suite, which uses io_uring + buffered > writes + XFS, the system was spuriously returning ENOMEM, despite > there being plenty of available memory to be reclaimed from the page > cache. FWIW, I am running: 6.12.9-100.fc40.x86_64 ..... > In the patch ''mm: return an ERR_PTR from __filemap_get_folio', I see > the following changes: > --- a/fs/iomap/buffered-io.c > +++ b/fs/iomap/buffered-io.c > @@ -468,19 +468,12 @@ EXPORT_SYMBOL_GPL(iomap_is_partially_uptodate); > struct folio *iomap_get_folio(struct iomap_iter *iter, loff_t pos) > { > unsigned fgp = FGP_LOCK | FGP_WRITE | FGP_CREAT | FGP_STABLE | FGP_NOFS; > - struct folio *folio; > > if (iter->flags & IOMAP_NOWAIT) > fgp |= FGP_NOWAIT; > > - folio = __filemap_get_folio(iter->inode->i_mapping, pos >> PAGE_SHIFT, > + return __filemap_get_folio(iter->inode->i_mapping, pos >> PAGE_SHIFT, > fgp, mapping_gfp_mask(iter->inode->i_mapping)); > - if (folio) > - return folio; > - > - if (iter->flags & IOMAP_NOWAIT) > - return ERR_PTR(-EAGAIN); > - return ERR_PTR(-ENOMEM); > } > > This leads to me believe we have a regression in this area, after that > patch, since iomap_get_folio() is no longer returning EAGAIN with > IOMAP_NOWAIT, if __filemap_get_folio() failed to get a folio. Now it > returns ENOMEM unconditionally. Yes, I think you are right - FGP_NOWAIT error returns are not handled correctly by __filemap_get_folio(). > Since we pushed the error picking decision to __filemap_get_folio, I > think it makes sense for us to patch it such that it returns EAGAIN if > allocation failed (under pressure) because IOMAP_NOWAIT was requested > by its caller and allocation is not allowed to block waiting for > reclaimer to do its thing. > > A possible way to fix it is this one-liner, but I am not well versed > in this area, so someone may end up suggesting a better fix: > diff --git a/mm/filemap.c b/mm/filemap.c > index 804d7365680c..9e698a619545 100644 > --- a/mm/filemap.c > +++ b/mm/filemap.c > @@ -1964,7 +1964,7 @@ struct folio *__filemap_get_folio(struct > address_space *mapping, pgoff_t index, > do { > gfp_t alloc_gfp = gfp; > > - err = -ENOMEM; > + err = (fgp_flags & FGP_NOWAIT) ? -ENOMEM : -EAGAIN; > if (order > min_order) > alloc_gfp |= __GFP_NORETRY | __GFP_NOWARN; > folio = filemap_alloc_folio(alloc_gfp, order); Better to only do the FGP_NOWAIT check when a failure occurs; that puts it in the slow path rather than having to evaluate it unnecessarily every time through the function/loop. i.e. folio = filemap_alloc_folio(gfp, order); - if (!folio) - return ERR_PTR(-ENOMEM); + if (!folio) { + if (fgp_flags & FGP_NOWAIT) + err = -EAGAIN; + else + err = -ENOMEM; + continue; + } -Dave.
On Tue, Feb 11, 2025 at 08:09:31AM +1100, Dave Chinner wrote: > Better to only do the FGP_NOWAIT check when a failure occurs; that > puts it in the slow path rather than having to evaluate it > unnecessarily every time through the function/loop. i.e. > > folio = filemap_alloc_folio(gfp, order); > - if (!folio) > - return ERR_PTR(-ENOMEM); > + if (!folio) { > + if (fgp_flags & FGP_NOWAIT) > + err = -EAGAIN; > + else > + err = -ENOMEM; > + continue; > + } Or would we be better off handling ENOMEM the same way we handle EAGAIN? eg something like: +++ b/io_uring/io_uring.c @@ -1842,7 +1842,7 @@ void io_wq_submit_work(struct io_wq_work *work) do { ret = io_issue_sqe(req, issue_flags); - if (ret != -EAGAIN) + if (ret != -EAGAIN || ret != -ENOMEM) break; /*
On Mon, Feb 10, 2025 at 09:18:04PM +0000, Matthew Wilcox wrote: > On Tue, Feb 11, 2025 at 08:09:31AM +1100, Dave Chinner wrote: > > Better to only do the FGP_NOWAIT check when a failure occurs; that > > puts it in the slow path rather than having to evaluate it > > unnecessarily every time through the function/loop. i.e. > > > > folio = filemap_alloc_folio(gfp, order); > > - if (!folio) > > - return ERR_PTR(-ENOMEM); > > + if (!folio) { > > + if (fgp_flags & FGP_NOWAIT) > > + err = -EAGAIN; > > + else > > + err = -ENOMEM; > > + continue; > > + } > > Or would we be better off handling ENOMEM the same way we handle EAGAIN? > eg something like: > > +++ b/io_uring/io_uring.c > @@ -1842,7 +1842,7 @@ void io_wq_submit_work(struct io_wq_work *work) > > do { > ret = io_issue_sqe(req, issue_flags); > - if (ret != -EAGAIN) > + if (ret != -EAGAIN || ret != -ENOMEM) > break; This still allows -ENOMEM to escape to userspace instead of -EAGAIN via pwritev2(RWF_NOWAIT) and AIO write interfaces. -Dave.
--- a/fs/iomap/buffered-io.c +++ b/fs/iomap/buffered-io.c @@ -468,19 +468,12 @@ EXPORT_SYMBOL_GPL(iomap_is_partially_uptodate); struct folio *iomap_get_folio(struct iomap_iter *iter, loff_t pos) { unsigned fgp = FGP_LOCK | FGP_WRITE | FGP_CREAT | FGP_STABLE | FGP_NOFS; - struct folio *folio; if (iter->flags & IOMAP_NOWAIT) fgp |= FGP_NOWAIT; - folio = __filemap_get_folio(iter->inode->i_mapping, pos >> PAGE_SHIFT, + return __filemap_get_folio(iter->inode->i_mapping, pos >> PAGE_SHIFT, fgp, mapping_gfp_mask(iter->inode->i_mapping)); - if (folio) - return folio; - - if (iter->flags & IOMAP_NOWAIT) - return ERR_PTR(-EAGAIN); - return ERR_PTR(-ENOMEM); } This leads to me believe we have a regression in this area, after that patch, since iomap_get_folio() is no longer returning EAGAIN with IOMAP_NOWAIT, if __filemap_get_folio() failed to get a folio. Now it returns ENOMEM unconditionally. Since we pushed the error picking decision to __filemap_get_folio, I think it makes sense for us to patch it such that it returns EAGAIN if allocation failed (under pressure) because IOMAP_NOWAIT was requested by its caller and allocation is not allowed to block waiting for reclaimer to do its thing. A possible way to fix it is this one-liner, but I am not well versed in this area, so someone may end up suggesting a better fix: diff --git a/mm/filemap.c b/mm/filemap.c index 804d7365680c..9e698a619545 100644 --- a/mm/filemap.c +++ b/mm/filemap.c @@ -1964,7 +1964,7 @@ struct folio *__filemap_get_folio(struct address_space *mapping, pgoff_t index, do { gfp_t alloc_gfp = gfp; - err = -ENOMEM; + err = (fgp_flags & FGP_NOWAIT) ? -ENOMEM : -EAGAIN; if (order > min_order) alloc_gfp |= __GFP_NORETRY | __GFP_NOWARN;