Message ID | 20240131161006.1475094-1-dwysocha@redhat.com (mailing list archive)
---|---
State | New, archived
Series | [v2] NFS: Fix nfs_netfs_issue_read() xarray locking for writeback interrupt
On Wed, 2024-01-31 at 11:10 -0500, Dave Wysochanski wrote:
> The loop inside nfs_netfs_issue_read() currently does not disable
> interrupts while iterating through pages in the xarray to submit
> for NFS read. This is not safe though since after taking xa_lock,
> another page in the mapping could be processed for writeback inside
> an interrupt, and deadlock can occur. The fix is simple and clean
> if we use xa_for_each_range(), which handles the iteration with RCU
> while reducing code complexity.
>
> [...]

Reviewed-by: Jeff Layton <jlayton@kernel.org>
Dave Wysochanski <dwysocha@redhat.com> wrote:

> The loop inside nfs_netfs_issue_read() currently does not disable
> interrupts while iterating through pages in the xarray to submit
> for NFS read. This is not safe though since after taking xa_lock,
> another page in the mapping could be processed for writeback inside
> an interrupt, and deadlock can occur. The fix is simple and clean
> if we use xa_for_each_range(), which handles the iteration with RCU
> while reducing code complexity.
>
> [...]

Reviewed-by: David Howells <dhowells@redhat.com>
diff --git a/fs/nfs/fscache.c b/fs/nfs/fscache.c
index b05717fe0d4e..60a3c28784e0 100644
--- a/fs/nfs/fscache.c
+++ b/fs/nfs/fscache.c
@@ -307,11 +307,11 @@ static void nfs_netfs_issue_read(struct netfs_io_subrequest *sreq)
 	struct inode *inode = sreq->rreq->inode;
 	struct nfs_open_context *ctx = sreq->rreq->netfs_priv;
 	struct page *page;
+	unsigned long idx;
 	int err;
 	pgoff_t start = (sreq->start + sreq->transferred) >> PAGE_SHIFT;
 	pgoff_t last = ((sreq->start + sreq->len -
 			 sreq->transferred - 1) >> PAGE_SHIFT);
-	XA_STATE(xas, &sreq->rreq->mapping->i_pages, start);
 
 	nfs_pageio_init_read(&pgio, inode, false,
 			     &nfs_async_read_completion_ops);
@@ -322,19 +322,14 @@ static void nfs_netfs_issue_read(struct netfs_io_subrequest *sreq)
 
 	pgio.pg_netfs = netfs; /* used in completion */
 
-	xas_lock(&xas);
-	xas_for_each(&xas, page, last) {
+	xa_for_each_range(&sreq->rreq->mapping->i_pages, idx, page, start, last) {
 		/* nfs_read_add_folio() may schedule() due to pNFS layout and other RPCs */
-		xas_pause(&xas);
-		xas_unlock(&xas);
 		err = nfs_read_add_folio(&pgio, ctx, page_folio(page));
 		if (err < 0) {
 			netfs->error = err;
 			goto out;
 		}
-		xas_lock(&xas);
 	}
-	xas_unlock(&xas);
 out:
 	nfs_pageio_complete_read(&pgio);
 	nfs_netfs_put(netfs);
The loop inside nfs_netfs_issue_read() currently does not disable
interrupts while iterating through pages in the xarray to submit
for NFS read. This is not safe though since after taking xa_lock,
another page in the mapping could be processed for writeback inside
an interrupt, and deadlock can occur. The fix is simple and clean
if we use xa_for_each_range(), which handles the iteration with RCU
while reducing code complexity.

The problem is easily reproduced with the following test:
 mount -o vers=3,fsc 127.0.0.1:/export /mnt/nfs
 dd if=/dev/zero of=/mnt/nfs/file1.bin bs=4096 count=1
 echo 3 > /proc/sys/vm/drop_caches
 dd if=/mnt/nfs/file1.bin of=/dev/null
 umount /mnt/nfs

On the console with a lockdep-enabled kernel a message similar to
the following will be seen:

 ================================
 WARNING: inconsistent lock state
 6.7.0-lockdbg+ #10 Not tainted
 --------------------------------
 inconsistent {IN-SOFTIRQ-W} -> {SOFTIRQ-ON-W} usage.
 test5/1708 [HC0[0]:SC0[0]:HE1:SE1] takes:
 ffff888127baa598 (&xa->xa_lock#4){+.?.}-{3:3}, at: nfs_netfs_issue_read+0x1b2/0x4b0 [nfs]
 {IN-SOFTIRQ-W} state was registered at:
   lock_acquire+0x144/0x380
   _raw_spin_lock_irqsave+0x4e/0xa0
   __folio_end_writeback+0x17e/0x5c0
   folio_end_writeback+0x93/0x1b0
   iomap_finish_ioend+0xeb/0x6a0
   blk_update_request+0x204/0x7f0
   blk_mq_end_request+0x30/0x1c0
   blk_complete_reqs+0x7e/0xa0
   __do_softirq+0x113/0x544
   __irq_exit_rcu+0xfe/0x120
   irq_exit_rcu+0xe/0x20
   sysvec_call_function_single+0x6f/0x90
   asm_sysvec_call_function_single+0x1a/0x20
   pv_native_safe_halt+0xf/0x20
   default_idle+0x9/0x20
   default_idle_call+0x67/0xa0
   do_idle+0x2b5/0x300
   cpu_startup_entry+0x34/0x40
   start_secondary+0x19d/0x1c0
   secondary_startup_64_no_verify+0x18f/0x19b
 irq event stamp: 176891
 hardirqs last enabled at (176891): [<ffffffffa67a0be4>] _raw_spin_unlock_irqrestore+0x44/0x60
 hardirqs last disabled at (176890): [<ffffffffa67a0899>] _raw_spin_lock_irqsave+0x79/0xa0
 softirqs last enabled at (176646): [<ffffffffa515d91e>] __irq_exit_rcu+0xfe/0x120
 softirqs last disabled at (176633): [<ffffffffa515d91e>] __irq_exit_rcu+0xfe/0x120

 other info that might help us debug this:
  Possible unsafe locking scenario:

        CPU0
        ----
   lock(&xa->xa_lock#4);
   <Interrupt>
     lock(&xa->xa_lock#4);

  *** DEADLOCK ***

 2 locks held by test5/1708:
  #0: ffff888127baa498 (&sb->s_type->i_mutex_key#22){++++}-{4:4}, at: nfs_start_io_read+0x28/0x90 [nfs]
  #1: ffff888127baa650 (mapping.invalidate_lock#3){.+.+}-{4:4}, at: page_cache_ra_unbounded+0xa4/0x280

 stack backtrace:
 CPU: 6 PID: 1708 Comm: test5 Kdump: loaded Not tainted 6.7.0-lockdbg+
 Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.3-1.fc39 04/01/2014
 Call Trace:
  dump_stack_lvl+0x5b/0x90
  mark_lock+0xb3f/0xd20
  __lock_acquire+0x77b/0x3360
  _raw_spin_lock+0x34/0x80
  nfs_netfs_issue_read+0x1b2/0x4b0 [nfs]
  netfs_begin_read+0x77f/0x980 [netfs]
  nfs_netfs_readahead+0x45/0x60 [nfs]
  nfs_readahead+0x323/0x5a0 [nfs]
  read_pages+0xf3/0x5c0
  page_cache_ra_unbounded+0x1c8/0x280
  filemap_get_pages+0x38c/0xae0
  filemap_read+0x206/0x5e0
  nfs_file_read+0xb7/0x140 [nfs]
  vfs_read+0x2a9/0x460
  ksys_read+0xb7/0x140

Fixes: 000dbe0bec05 ("NFS: Convert buffered read paths to use netfs when fscache is enabled")
Suggested-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Dave Wysochanski <dwysocha@redhat.com>
---
 fs/nfs/fscache.c | 9 ++-------
 1 file changed, 2 insertions(+), 7 deletions(-)
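
For readers who want the iteration pattern in isolation: xa_for_each_range()
takes and drops the RCU read lock around each lookup internally, so no
spinlock is held across the loop body and the body is free to sleep. The old
code instead held xa_lock, which writeback completion also takes from softirq
context, and so had to xas_pause() and unlock around every potentially
sleeping call. A minimal illustrative sketch follows; iterate_range() and
process_entry() are hypothetical names, not code from this patch or from the
NFS client:

	#include <linux/xarray.h>

	static int process_entry(void *entry);	/* hypothetical helper; may sleep */

	static int iterate_range(struct xarray *xa, unsigned long first,
				 unsigned long last)
	{
		unsigned long index;
		void *entry;
		int err;

		/*
		 * xa_for_each_range() looks up one entry at a time under
		 * rcu_read_lock() and releases it before running the loop
		 * body, so sleeping in process_entry() is safe and xa_lock
		 * is never held where an interrupt could retake it.
		 */
		xa_for_each_range(xa, index, entry, first, last) {
			err = process_entry(entry);
			if (err < 0)
				return err;
		}
		return 0;
	}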